Wednesday, September 11, 2024

Information, Data, and Knowledge

(This is part 10 in a series on the scientific method.)

In 1966, well within living memory as I write this in 2024, Digital Equipment Corporation released the PDP-10, later rebranded as the DECsystem-10 or, more colloquially, the DEC-10.  The base model cost over $100,000 in 1966 dollars, well over a million dollars today.  For that price you got 8,192 words of memory, each being 36 bits wide, or about 36 kilobytes by modern reckoning.  A top-of-the-line model gave you about 144 kilobytes (32,768 36-bit words) of memory and cost three times the base model price.  At the time, the DEC-10 was considered a "low cost" computer.  Nowadays a computer with vastly superior specifications fits on a board the size of your little finger and can be purchased at retail for less than one 1966 dollar.  If you buy just the chips in modest volume, you can get them for less than one 2024 dollar each.
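
The conversion from 36-bit words to modern 8-bit bytes is easy to check.  Here is the arithmetic as a quick Python sketch, using the word counts from the paragraph above:

    # Converting 36-bit words to modern 8-bit bytes.
    base_model = 8_192 * 36 // 8      # 36,864 bytes, about 36 kilobytes
    top_model = 32_768 * 36 // 8      # 147,456 bytes, about 144 kilobytes
    print(base_model, top_model)      # 36864 147456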

Notice that despite decades of technological improvement, a computer's capabilities were specified in essentially the same terms in 1966 as they are today, namely, by the amount of memory it contains, and this will always be the case.  Memory has a utility that transcends technology and culture and the whims of fashion: it can store information.

Information is both ubiquitous and ineffable.  We live in an "information age", inundated by information around the clock, but almost no one understands what it actually is.  It's challenging to find a definition that is accessible and not circular.  Wikipedia says, "Information is an abstract concept that refers to something which has the power to inform," which doesn't seem particularly helpful or enlightening to me.

So let's set this as our Problem: there is this thing, information, that everyone talks about, that permeates our modern existence, but that no one seems to actually understand.  Why?  Is information real, an essential part of objective reality, or is it just a social construct like the law or the value of fiat currency?  What is information made of?  Can it be created and destroyed like computers and chairs, or is it conserved like mass and energy?

Let's start by taking stock of some of the (apparent) properties of this thing we're trying to explain.

Information does not exist by itself.  It is always contained in or bound to something, like a sheet of paper or a computer memory or a DNA molecule or a brain.  This container doesn't necessarily need to be a material object made of atoms; it can be an electromagnetic wave.  But it has to be something.  There is no such thing as information in isolation, unbound to any physical thing.

Despite the fact that this containment is essential, the identity of a piece of information is independent of its binding.  The document you are reading right now is in some sense "made of" information.  Most likely, you are reading it on a computer, so that information is contained in a computer memory.  It is transferred into your brain -- i.e. the information is copied -- through the act of reading.  You could likewise copy the information in this document onto a sheet of paper by printing it, or into sound waves in the atmosphere by reading it aloud.  But regardless of whether the information in this document is contained in a computer memory or a sheet of paper or your brain or the air around you, it is, at least in some sense, the same information in all four cases.

How much can information be changed and still be "the same information"?  If this document is translated into a different language, is it still "the same information"?  What about if it is compressed or encrypted?  When information is compressed or encrypted, the result bears virtually no resemblance to the original.  (In the case of encryption, that is the whole point!)  But in each case, there is still some sense in which the result is "the same" as the original.  A compressed or encrypted version of this document is still unambiguously (it would seem) a version of this document and not a version of any other document.
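
To see this concretely, here is a small Python sketch.  It uses the standard zlib compressor, and a toy XOR cipher standing in for real encryption (a real system would use an actual cryptographic cipher, but the principle is the same).  In both cases the output looks nothing like the input, yet the input is fully recoverable given the algorithm and, for encryption, the key:

    import zlib
    from itertools import cycle

    original = b"Information is a correspondence between two physical things."

    # Compression: the output bears no resemblance to the input...
    compressed = zlib.compress(original)
    assert zlib.decompress(compressed) == original   # ...but nothing is lost.

    # A toy XOR stream cipher (not secure!) standing in for real encryption.
    # Recovering the original requires the key as well as the algorithm.
    key = b"secret"
    encrypted = bytes(b ^ k for b, k in zip(original, cycle(key)))
    assert bytes(b ^ k for b, k in zip(encrypted, cycle(key))) == original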

I'm going to put a pin in that for now and invite you to consider a different question, which will eventually lead us to the answers to the preceding ones: what is the difference between information and data?  To persuade you that there is indeed a difference, here is an example of some data:

77 72 17 56 75 22 50 76 20 49 29 16 4 61 33 71 87 65 56 92

The difference between that and all of the surrounding text should be obvious: the text conveys some sort of "meaning" while the data is "just a bunch of numbers".  There might be some meaning there, but it's not readily apparent the way it is with the surrounding text.

So the difference between "information" and "data" is that information has "meaning" and data doesn't.  But what is "meaning"?  That word seems at least as troublesome to define as "information", and so reducing "information" to "meaning" doesn't represent a lot of progress.  But in this case looks are deceiving because the meaning of "meaning" can actually be made precise.  To see how, consider another example:

55 83 56 74 55 70 55 73 56 75 54 77 54 72 55 68 54 66 52 70

At first glance, this too looks like "just a bunch of numbers", but if you look closer you might start to notice some patterns in this second list that weren't there in the first.  For example, if you consider the list as pairs of numbers, the first number in each pair is always smaller than the second.  The first numbers all fall in a fairly narrow range: 52-56.  The second numbers also fall in a narrow range: 66-83.  That might be enough for you to figure out what these numbers actually are without my telling you: they are the low and high temperature forecasts for a particular location (Redwood City, California) on a particular date (September 8, 2024) according to the Apple Weather app on my iPhone.  So those numbers contain information.
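
If you would rather let a program do the squinting, here is a quick Python sketch that checks those patterns (the list is copied from the example above):

    data = [55, 83, 56, 74, 55, 70, 55, 73, 56, 75,
            54, 77, 54, 72, 55, 68, 54, 66, 52, 70]

    pairs = list(zip(data[0::2], data[1::2]))
    assert all(lo < hi for lo, hi in pairs)   # first of each pair is smaller
    lows, highs = zip(*pairs)
    print(min(lows), max(lows))               # 52 56
    print(min(highs), max(highs))             # 66 83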

The difference between data and information is that information has a referent -- it is about something.  In the case of the second example, the referent is a weather forecast.  Notice that in order to make sense of that information, to extract its meaning, you need to know something about what numbers are, how they are represented as marks on a page (or a computer screen), and how those numbers relate to temperatures and geography and time.  But the essential feature, the thing that confers information-ness on those numbers, is that they have a referent, that they correspond to something, in this case, temperatures.

It is this correspondence that is the defining characteristic of information.  Information is a correspondence between two physical things, like marks on a page and current events, or the reading on a thermometer and the temperature of the surrounding material, or the order of base pairs in a DNA molecule and the shape of a protein.

Notice that recognizing this correspondence can require quite a bit of effort.  There is nothing at all obvious about the relationship between the shapes of numerals and temperatures, nothing about the visual appearance of "23" that corresponds to "cold", or the visual appearance of "98" that corresponds to "hot".  To see the correspondence requires going through a lot of intermediate steps, like learning what numbers are, understanding what "hot" and "cold" are, what a temperature is, and so on.  There is likewise nothing at all obvious about the relationship between the base pairs in a DNA molecule and the shapes of proteins.  That correspondence goes through a relatively straightforward mapping of base pairs to amino acids, and then a much more complicated mapping of sequences of amino acids to shapes in three-dimensional space.  There are similar processes at work in compressed and encrypted data.  To see the correspondence between compressed and uncompressed data you have to know the compression algorithm.  To see the correspondence between encrypted and unencrypted data you have to know both the algorithm and the encryption key.
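
The first of those mappings in the DNA example is literally a lookup table.  Here is a Python sketch using a tiny fragment of the standard genetic code (real translation works on RNA and covers all 64 codons; this sketch reads the DNA coding strand directly and shows only five of them):

    # A small fragment of the standard genetic code: codon -> amino acid.
    CODON_TABLE = {
        "ATG": "Met",   # methionine (also the "start" signal)
        "TTT": "Phe",   # phenylalanine
        "AAA": "Lys",   # lysine
        "TGG": "Trp",   # tryptophan
        "TAA": None,    # "stop" -- end of the protein
    }

    def translate(dna):
        # Read the sequence three bases at a time until a stop codon.
        protein = []
        for i in range(0, len(dna) - 2, 3):
            amino_acid = CODON_TABLE[dna[i:i + 3]]
            if amino_acid is None:
                break
            protein.append(amino_acid)
        return protein

    print(translate("ATGTTTAAATAA"))   # ['Met', 'Phe', 'Lys']

The second mapping, from amino-acid sequences to three-dimensional shapes, is the protein folding problem, and no lookup table will fit it.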

There is one additional requirement for data to qualify as information: the correspondence between the information and its referent must be causal.  It can't be simply due to chance.  I could produce a plausible-looking weather forecast simply by picking numbers at random.  That forecast might even be correct if I got lucky, but it would not contain any information.  The correspondence between the forecast and the actual future temperatures has to be the result of some repeatable process, and the correlation between the information and its referent has to be better than what could be produced by picking numbers at random.

How can you be sure that some candidate information is actually information and not just a coincidence?  You can't!  The best you can do is calculate the probability that a correlation is a coincidence and decide if those odds are small enough that you are willing to ignore them.  Calculating those odds can be tricky, and it's actually quite easy to fool yourself into thinking that something is information when it's not.  For example, there is a classic scam where a stock broker sends out market predictions to (say) 1024 people.  512 of the predictions say a stock will go up, and 512 say it will go down.  To the group that got the correct prediction he sends out a second set of predictions: 256 people get an "up" prediction, and 256 get "down".  This process gets repeated eight more times.  At the end, there will be one person who has seen the broker make correct predictions ten times in a row.  The odds of that happening by chance are 1 in 1024.  Of course it did not happen by chance, but neither did it happen because the broker actually has a way of predicting which way a stock will move.  The existence of information depends on the assumption that the universe is not trying to scam us, which seems like a reasonable assumption, but we can't actually know it for certain.
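
The scam is easy to simulate.  Here is a Python sketch (the coin flip stands in for the stock's actual movement):

    import random

    # 1024 marks; each round, half are told "up" and half "down", and the
    # broker keeps only the ones whose prediction matched the market.
    marks = list(range(1024))
    for _ in range(10):
        half = len(marks) // 2
        told_up, told_down = marks[:half], marks[half:]
        market_went_up = random.random() < 0.5   # the stock's actual move
        marks = told_up if market_went_up else told_down
    print(len(marks))   # 1 -- someone just saw ten correct calls in a row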

The existence of information depends on another assumption as well.  Information is a correlation of the states of two (or more) systems, but what counts as a "system"?  There are some choices that seem "natural", but only because of our particular human perspective.  A bit in a computer memory, for example, only looks like a binary system because that is how it is intended to be viewed.  In actual fact, computer memories are analog systems.  When a 0 changes to a 1, that change does not happen instantaneously.  It takes time for the electrons to move around, and while that is happening the state of that bit is neither 0 nor 1 but somewhere in between.  We generally sweep that detail under the rug when we think about computer memory, but that doesn't change the fact that the underlying reality is actually much more complicated.
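
In Python, the simplifying assumption looks something like this (the voltages and the 0.5-volt threshold are invented for illustration):

    def read_bit(voltage, threshold=0.5):
        # The digital abstraction: collapse a continuous voltage to 0 or 1.
        return 1 if voltage >= threshold else 0

    # Mid-transition the physical state is neither 0 nor 1, but the
    # abstraction forces a verdict anyway.
    for v in [0.02, 0.48, 0.51, 0.97]:
        print(v, "->", read_bit(v))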

Some aspects of the apparent behavior of information actually depend on this.  It appears that information can be created and destroyed.  Human creativity appears to create information, and human carelessness or malice appears to destroy it.  There's a reason that backups are a thing.  But if we view the universe as a whole, and if our current understanding of the laws of physics is correct, then information can neither be created nor destroyed.  The currently known laws of physics are deterministic and time-symmetric.  If you knew the current state of the universe, you could in theory project that forward and backward in time with perfect fidelity.  When you "destroy" information, that information isn't actually destroyed, it is merely "swept under the rug" by some simplifying assumption or other.  When you "create" new information, it is not really new, it is pre-existing information being changed into a form that we can recognize under a certain set of simplifying assumptions.
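
Here is a toy illustration in Python (the four-state "universe" and its update rule are invented for the example): if the dynamics are a deterministic, reversible map, running it backward recovers the past exactly, so nothing is ever truly lost.

    # A toy universe with four states and a deterministic, reversible law.
    forward = {0: 2, 1: 0, 2: 3, 3: 1}                     # the dynamics
    backward = {new: old for old, new in forward.items()}  # time reversal

    state = 0
    for _ in range(5):
        state = forward[state]    # evolve the universe forward...
    for _ in range(5):
        state = backward[state]   # ...then run the law in reverse
    assert state == 0             # the initial state is recovered exactly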

It is tempting then to say that information is in some sense not real but rather a figment of our imagination, an artifact of some arbitrary set of simplifying assumptions that we have chosen for purely economic reasons, or maybe for political or social reasons.  Choosing a different set of simplifying assumptions would lead us to completely different conclusions about what constitutes information.  And that is true.

But our simplifying assumptions are not arbitrary.  The reality of information is grounded in the fact that everything you know is bound to one particular physical system: your brain.  You can imagine taking a "God's-eye view" of the universe, but you cannot actually do it!  In order to take the God's-eye view you would need a brain capable of containing all of the information in the universe, including the information contained in your own brain.  God might be able to pull that trick off, but you can't.

Could we perhaps build a machine that can take the God's-eye view?  At first glance that would seem impossible too for the same reason: that machine would have to contain all of the information in the universe.  But the machine would itself be part of the universe, and so it would have to contain a copy of all of the information contained within the machine, including a copy of that copy, and a copy of the copy of the copy, and so on forever.

There is a clever trick we can do to get around this problem.  See if you can figure out what it is.  Here is a hint: suppose you wanted to make a backup of your hard drive, but you didn't have any extra storage.  If your drive is less than 50% full this is easy: just copy the data on the used part of the drive onto the unused portion.  But can you think of a way to do it if your drive is full?

Notwithstanding that it might be possible in principle to build a God's-eye-view machine, there are two problems that make it impossible in practice.  First, the engineering challenges of actually collecting all of the information in the universe are pretty daunting.  You would need to know at the very least the position of every subatomic particle, including those inside planets and stars in galaxies billions of light years away.  We can't even access that information for our own planet, let alone the countless trillions of others in the universe.  And second, throughout this entire discussion I've assumed that the universe is classical, that is, that it is made of particles that exist in specific locations at specific times.  It isn't.  At root, our universe is quantum, and that changes everything in profound ways.  Quantum information has a fundamentally different character from classical information, and the biggest difference is that quantum information is impossible to copy.  If you could do it, you could build a time machine.  But that's a story for another day.

The takeaway for today is that ignorance is a fundamental part of the human condition.  Indeed, as we will see when we finally get around to talking about quantum mechanics, ignorance is actually the mechanism by which the classical world emerges from the underlying quantum reality.  (Note that this is a controversial statement, so take it with a big grain of salt for now.)  But even in a classical world, taking the God's-eye view, while possible in principle, is impossible in practice, at least for us mortals.

Can we mere mortals ever actually know anything?  No, we can't, not with absolute certainty.  We are finite beings with finite life spans and finite brains.  We can only ever be in possession of a finite amount of data, and that data will always be consistent with an infinite number of potential theories.   We can never be 100% certain of anything.  It is always possible that tomorrow we will discover that our entire existence is some kind of long con -- a simulation, maybe, running in a tiny corner of a vast data center built by some unfathomably advanced alien civilization, and the only reason that the laws of physics appear to be the same from day to day is that no one has bothered to update our software.  But who knows what tomorrow may bring?  Maybe you or I will live to see the upgrade to laws-of-physics 2.0.

But I'll give long odds against.

6 comments:

  1. Information?

    @Ron:
    >Is information real, an essential part of objective reality, or is it just a social construct like the law or the value of fiat currency?

    A shorthand way of asking this question is: is information conventional?

    > Human creativity appears to create information, and human carelessness or malice appears to destroy it.

    It can be a virtue to destroy information. When I destroy my grocery shopping list, that's good for the world. Not all information is beneficial to keep around (I would benefit if I finally got around to deleting a lot of information off my computer storage).

    Now a question for you: if someone writes, "Ron Garret was born on Mars," is that information? If you were, in fact, born on Mars, substitute the statement "Squares have 5 sides" instead.

    You didn't mention computation on information. I believe the conventional view is that computation doesn't create new information, it just processes existing information.

    Replies
    1. > if someone writes, "Ron Garret was born on Mars," is that information?

      Well, we don't have to consider this as a hypothetical, because someone actually *did* (almost certainly) write "Ron Garret was born on Mars". So yes, that text contains information, it's just not information about anyone's birth place.

      > computation doesn't create new information, it just processes existing information

      Yes, that's a good point. I should have mentioned that. Thanks for reminding me.

    2. Puzzling

      @Ron:
      >So yes, that text contains information, it's just not information about anyone's birth place.

      Read it again -- it's specifically about your birth place.

      The larger question is: to be information, does it have to be true?

      >> computation doesn't create new information, it just processes existing information

      >Yes, that's a good point.

      Which is puzzling. If I take your first data set:

      77 72 17 56 75 22 50 76 20 49 29 16 4 61 33 71 87 65 56 92

      . . . and I compute the mean to be 51.4, it seems like I have created new information -- namely, that the mean of those values is 51.4.

    3. > it's specifically about your birth place.

      What difference does that make?

      > to be information, does it have to be true?

      That question is a category error. Information is just a causal correlation between or among states of physical systems. Truth is a property of *propositions*. Information doesn't need semantics, only a referent.

      The referent of the information contained in your statement is *you*.  That information includes things like the fact that you speak English, that you have access to a computer, that you know my name, etc.  It's chock-full of information, but none of that information has anyone's birth place as its referent.

      > P: computation doesn't create new information, it just processes existing information

      > R: Yes, that's a good point.

      > P: Which is puzzling...

      Yeah, it's kind of unintuitive. Remember, information is just a causal correlation between two systems. That correlation is allowed to go through a transformation, like when DNA is translated into proteins, or data is compressed or encrypted. Taking the average of a set of numbers is one of those transformations. Because it is deterministic, it cannot possibly introduce correlations that were not in the original data, and so it cannot introduce new information. In fact, in the case of taking the average, it's the exact opposite: averaging actually *destroys* information because it's irreversible -- you can't recover the original data from the average.
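
      A quick sketch makes the irreversibility concrete (the two lists are invented for the example):

          # Two very different data sets with exactly the same mean:
          a = [50, 50, 50, 50]
          b = [0, 0, 100, 100]
          assert sum(a) / len(a) == sum(b) / len(b) == 50.0
          # Knowing only the mean, you cannot tell which list you started
          # with: averaging is a many-to-one (hence irreversible) map.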

  2. information ≠ knowledge

    @Ron:
    >That question is a category error. Information is just a causal correlation between or among states of physical systems. Truth is a property of *propositions*. Information doesn't need semantics, only a referent.

    I'm just trying to flesh out your theory of information. It only appears as a category error to you because you're not well versed in the philosophy of information. Some theories of information have a truth condition, such as the Theory of Strongly Semantic Information.

    Replies

    1. Ah. I'm using Shannon's information theory and specifically adopting joint information entropy as my measure of information content. I'm doing this to set up the Big Reveal in quantum mechanics, which is that entangled particles can have a correlation coefficient greater than 1.
