The Wikipedia, Knowledge Preservation and DNA

I had an interesting thought today about the long-term preservation and transmission of human knowledge.

The Wikipedia may be on its way to becoming one of the best places in which to preserve knowledge for future generations. But this is just the beginning. What if we could encode the Wikipedia into the Junk DNA portion of our own genome? It appears that something like this may actually be possible, at least according to some recent studies of the non-coding regions of the human genome.

If we could actually encode knowledge, like the Wikipedia for example, into our genome, the next logical step would be to find a way to access it directly.

At first we might only be able to read the knowledge stored in our DNA through a computationally intensive genetic analysis of an individual’s DNA. To correct any errors introduced by mutation, we would also need to cross-reference this individual data with similar analyses of the DNA of other people who carry the same data. There are, however, ways to store data with enough redundancy to protect against degradation. Assuming we could do this, we might be able to eliminate the need for cross-referencing as a form of error correction: the data itself would be self-correcting, so to speak. If we could accomplish this, the next step would be to find a way for an individual to access the knowledge stored in their DNA directly, in real time. That’s a long way off, but there may be a way to do this using some future nano-scale genomic-brain interface. This opens up some fascinating areas of speculation, to say the least.

Why The Wikipedia?

The Wikipedia has certain qualities that make it better than other forms of knowledge preservation and transmission:

  • The Wikipedia exists primarily in electronic form. It is not subject to age or decay like a physical encyclopedia or document. This means it can persist forever, and will not be lost to time, if it continues to be maintained electronically in the future.
  • The Wikipedia is replicated in multiple locations around the world. The fact that it is so easy to replicate, and is so widely replicated, means that it is less at risk of being lost due to a local disaster at any given storage location. It also means it is more likely to continue, somewhere, as a living document that goes on to reflect majority consensus reality into the distant future. It is highly improbable that it will ever suffer the same fate as certain ancient documents which only existed in one place and were subsequently lost in floods, fires, or wars. At this point only a planet-wide extinction-level event could erase the Wikipedia or prevent future generations from finding it.
  • The Wikipedia is highly viral: its content is increasingly cited, and it is far ahead of any competing system in terms of coverage and brand recognition. Because so many other pieces of content on the Web and in other media refer to the Wikipedia as the world’s authority for knowledge, it is considered increasingly authoritative and is increasingly visible and cited. The Law of Increasing Returns indicates that this will continue to self-amplify, making the Wikipedia the best candidate for an authoritative global repository of knowledge.

What this means is that if you have any knowledge that you want to preserve for future generations, a good place to put it is in the Wikipedia. Putting it there almost guarantees that it will propagate around the world and throughout the human-explored universe (in the future, if we become a spacefaring civilization), and into the distant future of human civilizations.

The Potential For Storing Knowledge in DNA

Is it possible to store knowledge, such as the Wikipedia, in human DNA? It would certainly be useful if we could do this. Knowledge stored in the DNA of living humans, or of common bacteria for that matter, could potentially be passed down and spread through generations into the far future. However, the mutability of DNA might gradually introduce errors that would degrade the information within particular lines of descent over long periods of time.

Perhaps this could be mitigated by comparing DNA samples from a large cross-section of the descendants of the original holders of the DNA knowledge archives, which would effectively enable statistical error cancellation. The farther in the future from the date at which the knowledge is “written” to the DNA of some number of humans, the more people’s DNA would be needed to eliminate the errors statistically. In principle, however, this would counteract mutations and enable the reliable recovery of messages in DNA even very far in the future.
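
The statistical error cancellation described above can be sketched as a per-position majority vote across many copies of the same sequence. This is only an illustration under simplifying assumptions: the copies are already aligned, mutations are independent single-letter substitutions (no insertions or deletions), and the function name `consensus` and the sample sequences are made up for the example.

```python
from collections import Counter

def consensus(copies):
    """Return the per-position majority letter across aligned, equal-length copies."""
    if not copies:
        raise ValueError("need at least one copy")
    length = len(copies[0])
    if any(len(c) != length for c in copies):
        raise ValueError("copies must be aligned to the same length")
    # zip(*copies) yields one tuple per position, holding that position's
    # letter from every copy; the most common letter wins the vote.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*copies))

# Hypothetical data: one original message and three descendants,
# each carrying a different single-letter substitution.
original = "ACGTTGCAAC"
descendants = ["TCGTTGCAAC", "ACGTAGCAAC", "ACGTTGCAAG"]

recovered = consensus(descendants)
print(recovered == original)
```

The vote only works while, at each position, most copies still agree with the original, which is why more samples would be needed as mutations accumulate over time.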

The fact that it is in principle possible to encode knowledge into human (or other) DNA raises the question of whether there is already knowledge stored there. It’s certainly worth a look! Maybe there is already a message there for us? One can only wonder whether an ancient “Wikipedia” of sorts is already written there.

Interestingly enough, when certain statistical tests are run against human DNA, it does seem to have properties that are indicative of written language, but only in the “junk” regions of the genome. Maybe it’s not “junk” after all. Below is an article that discusses a recent discovery related to this:

Language in junk DNA

You’ve probably heard of a molecule called DNA, otherwise known as “The Blueprint Of Life”. Molecular biologists have been examining and mapping the DNA for a few decades now. But as they’ve looked more closely at the DNA, they’ve been getting increasingly bothered by one inconvenient little fact – the fact that 97% of the DNA is junk, and it has no known use or function! But an unusual collaboration between molecular biologists, cryptanalysts (people who break secret codes), linguists (people who study languages) and physicists has found strange hints of a hidden language in this so-called “junk DNA”.

Only about 3% of the DNA actually codes for amino acids, which in turn make proteins, and eventually, little babies. The remaining 97% of the DNA is, according to conventional wisdom, not gems, but junk.

The molecular biologists call this junk DNA, introns. Introns are like enormous commercial breaks or advertisements that interrupt the real program – except in the DNA, they take up 97% of the broadcast time. Introns are so important, that Richard Roberts and Phillip Sharp, who did much of the early work on introns back in 1977, won a Nobel Prize for their work in 1993. But even today, we still don’t know what introns are really for.

Simon Shepherd, who lectures in cryptography and computer security at the University of Bradford in the United Kingdom, took an approach that was based on his line of work. He looked on the junk DNA as just another secret code to be broken. He analysed it, and he now reckons that one probable function of introns is that they are some sort of error correction code – to fix up the occasional mistakes that happen as the DNA replicates itself. But even if he’s right, introns could have lots of other uses.

The next big breakthrough came from a really unusual collaboration between medical doctors, physicists and linguists. They found even more evidence that there was a sort-of language buried in the introns.

According to the linguists, all human languages obey Zipf’s Law. It’s a really weird law, but it’s not that hard to understand. Start off by getting a big fat book. Then, count the number of times each word appears in that book. You might find that the number one most popular word is “the” (which appears 2,000 times), followed by the second most popular word “a” (which appears 1,800 times), and so on. Right down at the bottom of the list, you have the least popular word, which might be “elephant”, and which appears just once.

Set up two columns of numbers. One column is the order of popularity of the words, running from “1” for “the”, and “2” for “a”, right down to “1,000” for “elephant”. The other column counts how many times each word appeared, starting off with 2,000 appearances of “the”, then 1,800 appearances of “a”, down to one appearance of “elephant”.

If you then plot on the right kind of graph paper, the order of popularity of the words, against the number of times each word appears you get a straight line! Even more amazingly, this straight line appears for every human language – whether it’s English or Egyptian, Eskimo or Chinese! Now the DNA is just one continuous ladder of squillions of rungs, and is not neatly broken up into individual words (like a book).

So the scientists looked at a very long bit of DNA, and made artificial words by breaking up the DNA into “words” each 3 rungs long. And then they tried it again for “words” 4 rungs long, 5 rungs long, and so on up to 8 rungs long. They then analysed all these words, and to their surprise, they got the same sort of Zipf Law/straight-line-graph for the human DNA (which is mostly introns), as they did for the human languages!
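
The procedure described in the article can be sketched roughly as follows: cut a sequence into “words” of a fixed length, count them, rank them by frequency, and estimate the slope of log(count) against log(rank). A straight line on log-log axes is what Zipf’s Law predicts. The article does not say whether the researchers used non-overlapping windows or how they fitted the line, so the non-overlapping split and the least-squares fit below are assumptions, as are the function names and the toy sequence.

```python
import math
from collections import Counter

def split_words(sequence, k):
    """Cut the sequence into non-overlapping words of length k, dropping any short tail."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]

def rank_frequency(words):
    """Return (rank, count) pairs, most frequent word first (rank 1)."""
    counts = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(counts, start=1))

def loglog_slope(pairs):
    """Least-squares slope of log(count) against log(rank); needs at least two ranks."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov_xy / var_x

# Made-up toy sequence; it is far too short to show anything about real DNA
# and is here only to exercise the code.
toy = "ACATATAGACAT" * 3 + "GGGTTTAAACCC"
for k in range(3, 9):
    pairs = rank_frequency(split_words(toy, k))
    print(k, pairs[:3], round(loglog_slope(pairs), 2))
```

A real test would need a very long sequence and a comparison against shuffled or random sequences of the same letter composition.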

There seems to be some sort of language buried in the so-called junk DNA! Certainly, the next few years will be a very good time to make a career change into the field of genetics.

So now, around the edge of the new millennium, we have a reasonable understanding of the 3% of the DNA that makes amino acids, proteins and babies. And the remaining 97% – well, we’re pretty sure that there is some language buried there, even if we don’t yet know what it says. It might say “It’s all a joke”, or it might say “Don’t worry, be happy”, or it might say “Have a nice day, lots of love, from your friendly local DNA”.

Now to complete this thought: what if the information-carrying capacity of the so-called Junk DNA of the human genome is sufficient to hold the content of the Wikipedia? Then all we would need is some way of writing to it, perhaps through gene therapy that uses infection by a virus carrying a copy of the Wikipedia.
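
A rough capacity estimate follows from the fact that each DNA base is one of four letters and so carries at most 2 bits. The sketch below assumes a haploid human genome of about 3.1 billion base pairs and takes the 97% non-coding figure from the article quoted above. The size of the Wikipedia is left as a parameter because no figure is given here, and the function names are hypothetical.

```python
BITS_PER_BASE = 2  # four possible letters (A, C, G, T) -> log2(4) = 2 bits

def capacity_bytes(base_pairs, usable_fraction=1.0):
    """Upper bound on raw storage in bytes, ignoring any redundancy for error correction."""
    return base_pairs * BITS_PER_BASE * usable_fraction / 8

def fits(payload_bytes, base_pairs, usable_fraction=1.0):
    """True if a payload of the given size fits within the raw capacity."""
    return payload_bytes <= capacity_bytes(base_pairs, usable_fraction)

GENOME_BASE_PAIRS = 3.1e9  # assumption: approximate haploid human genome length
NON_CODING = 0.97          # fraction quoted in the article above

total_mb = capacity_bytes(GENOME_BASE_PAIRS) / 1e6
junk_mb = capacity_bytes(GENOME_BASE_PAIRS, NON_CODING) / 1e6
print(f"whole genome: about {total_mb:.0f} MB, non-coding share: about {junk_mb:.0f} MB")
```

Under these assumptions the raw ceiling is in the high hundreds of megabytes, and any redundancy added to guard against mutation would reduce the usable share further.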

This would enable volunteers to accept copies of the Wikipedia into their DNA and become vectors for the Wikipedia. They and their descendants would become walking encyclopedias and would preserve human knowledge for future generations. If only some people had this done then they and their lineages would be a sort of priesthood with particular importance for the future of humanity. It sounds like the basis for a really great science-fiction thriller!

By copying the Wikipedia into our own DNA we might be able to ensure that wherever human beings end up in the universe, the Wikipedia will go with them. Even if in some distant world humans destroy their civilization in a nuclear holocaust or are almost wiped out by an asteroid and have to rebuild from the stone-age again, they will eventually rediscover genomics and soon after that they will find the Wikipedia in their genome.

This is a kind of “backup strategy” for our civilization and all the knowledge we consider to be most important. Of course it is not clear yet whether the Junk DNA could carry enough information to encode the entire Wikipedia, nor is it clear that the Junk DNA is actually “junk”. Perhaps there is already something there that should not be overwritten? Or perhaps it serves some other purpose in human development and evolution that we shouldn’t mess around with. It remains to be seen.

6 thoughts on “The Wikipedia, Knowledge Preservation and DNA”

  1. I think there is a big problem with this idea conceptually. First off, the entire genome is only about 750MB, so not a whole lot of storage to begin with. Second, we already know that the 98% non-coding regions are not at all junk. Recent papers have demonstrated that important transcription factor binding sites and other regulatory elements lie in the non-coding region. Furthermore, non-coding microRNA is now known to play a large role in gene regulation. Non-coding regions of the genome also become methylated in important epigenetic events. I don’t think there is much, if any, “free space” in the human genome. You would get much more mileage with this idea if you applied it to bacteria.

  2. At first blush, the most obvious way a sequence can survive being broken into words of 3, 4 and 5 and have the same distribution of word frequency is if the most common subsequences are the same letter repeated. If you have AAAAAAAAAAA twice as often as CCCCCCCCC then it’s stable; if you have ACATATAGACAT then the 3 groups are all 1, but the 4 groups are 2,1 and the 5 groups are all 1.
    I’m not good enough at math to prove that after a Sunday dinner and a couple of glasses of wine, but it would be interesting to see what other sequences have a stable Zipf distribution under different subdivisions. You’d expect the most common case in the length 3 split to cause there to be 3 equally common cases in the length 4 split.

  3. Good comments. But would the non-coding portion of bacterial DNA be any more unused in those organisms than the non-coding areas of our human DNA are by us (if that is the case)?
    It would be very cool to encode the Wikipedia, or a good portion of it, into bacterial DNA (perhaps the DNA of a common human stomach bacterium, for example) and set it loose to preserve this information for the long-term future. Of course this would only be worthwhile if introducing this data into the bacterial genome did not harm the bacterium or negatively impact its environmental fitness or ability to replicate without errors.

  4. I couldn’t help but think of the film, Johnny Mnemonic, when reading your last blog post regarding the transfer/preservation of data through DNA; I question the implications that it would have on the human psyche given that internal, all-encompassing information storage would inevitably develop into the ability to access and process it. As we live today, my generation’s unbridled access to knowledge, specifically the depth of human suffering and disaster, has crippled the ability for emotion.

  5. This all sounds well and good, a very noble attempt. But consider a Global Catastrophe where possibly only 3, 2, or only 1% of the population survives. Of this percentage that survives, who will they be? Will they be endowed and capable?
    What about a “Norad Concept”? Strategically placed, nuclear-powered Mega Computers buried deep in the Earth, beginning with simple interpretations that could lead to eventual determinations. A sort of “Noah’s Ark” in multiples. I have actually heard of an actual “Noah’s Ark” concept concerning every seed of every plant of our Planet being constructed somewhere in Iceland or Greenland or some place thereabouts.
