Tuesday 21 October 2008

There's a monkey sitting at a typewriter

Mathematics is the language of science, so having a grasp of the mathematical concepts used is vital to understanding much of what is written. In particular, understanding the difference between coincidental and causal events is vital to understanding statistics. Our brains work on linking causal events: if I press the 'x' key on my keyboard, I expect an 'x' to appear in my text editor. Likewise, all the keys on the keyboard have a function, and using that same causal relationship I can write entire blog posts. I am conveying information in a causal manner. Now what if it were random? Could a random generator recreate a post of mine, or even a single sentence?


Generating improbability
There are a few ways of going about this. Take the sentence "mathematics is the language of science": using the lower-case alphabet plus the space character, there are 38 positions to fill and 27 possible characters for each. Note that Dawkins did a similar computational experiment in The Blind Watchmaker.
  • random chance - suppose you tried to generate every position and every character at once. Getting the first character right would be a 1 in 27 chance, as would getting the second. To generate both characters correctly would be (1/27)², or 1 in 729. To generate three characters correctly would be (1/27)³, or 1 in 19,683. To generate all 38 characters correctly would be 1 in 27³⁸, or about 1 in 2.5×10⁵⁴. So for this one-off event to happen, it would take an extraordinary amount of time to generate it by chance with purely random input.

  • cumulative chance - instead of trying to generate it all in one go, generating each character individually could work better. Start with the first character: there's a 1 in 27 chance of generating 'm' on any given attempt, so on average it takes 27 attempts. When 'm' is finally generated, move on to trying to generate the 'a'. Each subsequent character is now built on the earlier chance successes. To generate 'math' therefore takes on average 27 + 27 + 27 + 27, or 108 attempts. For all 38 characters, it's 27 × 38, or about 1,026 attempts. So by progressively locking in each step along the way, an enormous amount of improbability is taken away (a code sketch of this follows the list).

  • evolving chance - back to generating in one go. Start at an arbitrary string and keep tweaking it, and eventually it will arrive at the right answer. The mathematics for this is not easily expressed, though it can be expressed in code (a sketch appears after the example run further down).
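To make the first two ideas concrete, here is a minimal sketch of how they might be simulated. It is not the program described in the next section; the class and variable names are illustrative only. It prints the one-shot odds of 27³⁸ and then runs the cumulative-chance process, locking in each character once it comes up.

import java.math.BigInteger;
import java.util.Random;

// A sketch only: prints the one-shot odds (27^38), then simulates cumulative
// chance by locking in each character of the target once it is guessed.
public class MonkeySketch {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz ";
    private static final String TARGET = "mathematics is the language of science";

    public static void main(String[] args) {
        // Random chance: the odds of generating all 38 characters in one go.
        BigInteger oneShotOdds = BigInteger.valueOf(ALPHABET.length()).pow(TARGET.length());
        System.out.println("One-shot odds: 1 in " + oneShotOdds);

        // Cumulative chance: guess each position independently, keeping a
        // character as soon as it matches and moving on to the next one.
        Random random = new Random();
        int attempts = 0;
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < TARGET.length(); i++) {
            char guess;
            do {
                guess = ALPHABET.charAt(random.nextInt(ALPHABET.length()));
                attempts++;
            } while (guess != TARGET.charAt(i));
            result.append(guess);
        }
        System.out.println("Generated: " + result);
        System.out.println("Total attempts: " + attempts);
        // On average: 27 attempts per position, so roughly 27 * 38 = 1,026 in total.
    }
}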


The results
I wrote a Java program to simulate these different processes. The source code is available here for anyone wishing to run it themselves. Feel free to modify it and push it to its limits. It's very much a cobbled-together hack-job; there wasn't much focus on having a clean interface. It's there to show I didn't make the results up, and while the randomness of the PC means the numbers will not turn out exactly the same as mine, they are a good approximation of the procedure.
  • random chance will not generate an answer in any practical amount of time. Computer randomness is pseudorandom, based on seeds, and in practice the program just runs on without ever finding the answer; even if it were truly random, it would still take an almost unimaginably long time on average. I ran the program quite a few times and there wasn't even a fragment of any word that could be considered English.

  • cumulative chance yielded results close to the mathematical prediction of an average of 1,026 iterations. Running it 10 times, I had the following results: 1058, 867, 1077, 1403, 776, 943, 1081, 893, 945, 880, an average of 992 iterations. Another 10 runs yielded an average of 1028. A third batch of 100 runs yielded an average of 1008, and after a few more batches of 100 I did get 1026 as the average. The practical results correlated with the theoretical prediction.

  • evolving chance produced a result even more quickly. By adjusting the amount of mutation (a mutation rate of 1 would mean 'd' could only change to either 'e' or 'c', while 2 would cover the spread from 'b' to 'f'), the number of iterations needed could be decreased. I ran it 100 times on different mutation rates; an example run-through is shown below, with a code sketch of the process after it.

A run-through with a mutation variance of 5:
Iteration: 1 - tptvelwhiqt irsirfpokxpub skrjcgrhcslt
Iteration: 2 - qrtyekzgimy iunhtirkibrubetjuhelueerot
Iteration: 3 - nqt efxfiru irmkojmmkdquddqesjeoxieoov
Iteration: 4 - rmtzedthiov ivmoohqrkdsufencsfhquiesly
Iteration: 5 - whteedtiity iyktnkpulfnudjmcrfgnuiewpz
Iteration: 10 - tftjeclaiww ivdtocswahguajjo fodwieurb
Iteration: 20 - hvtheixaifs in tfwlladguarexffqjpievcr
Iteration: 30 - natheoetias is thyhlamguadeohfinsieucx
Iteration: 40 - mathematigs is th tlacguaoepcflnniescg
Iteration: 50 - mathematics is th uladguauewdformiemce
Iteration: 60 - mathematics is thdilajguabe lfdskiebce
Iteration: 70 - mathematics is thwtlacguaje ofksniefce
Iteration: 80 - mathematics is thkjlauguape of syievce
Iteration: 90 - mathematics is the language of snience
Found iteration 97 - mathematics is the language of science
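Below is a minimal sketch of the evolving-chance process, assuming the scheme suggested by the run above: start from a random string, keep any character that already matches, and shift every other character by at most the mutation variance each iteration. It is an illustration of the idea rather than the actual program, and may differ from how the original applies mutation; the names in it are illustrative only.

import java.util.Random;

// A sketch only: evolve a random string towards the target by nudging each
// mismatched character up to `variance` places per iteration (wrapping around
// the 27-character alphabet), and keeping characters that already match.
public class EvolvingSketch {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz ";
    private static final String TARGET = "mathematics is the language of science";

    public static void main(String[] args) {
        Random random = new Random();
        int variance = 5; // the mutation variance used in the run above

        // Start from an arbitrary random string of the right length.
        char[] current = new char[TARGET.length()];
        for (int i = 0; i < current.length; i++) {
            current[i] = ALPHABET.charAt(random.nextInt(ALPHABET.length()));
        }

        int iteration = 0;
        while (!TARGET.equals(new String(current))) {
            iteration++;
            for (int i = 0; i < current.length; i++) {
                if (current[i] == TARGET.charAt(i)) {
                    continue; // this position is locked in
                }
                int index = ALPHABET.indexOf(current[i]);
                int shift = random.nextInt(2 * variance + 1) - variance;
                int length = ALPHABET.length();
                current[i] = ALPHABET.charAt(((index + shift) % length + length) % length);
            }
            // Print a selection of iterations, as in the run above.
            if (iteration <= 5 || iteration % 10 == 0) {
                System.out.println("Iteration: " + iteration + " - " + new String(current));
            }
        }
        System.out.println("Found iteration " + iteration + " - " + new String(current));
    }
}

With a variance of 5 this tends to finish within a comparable number of iterations to the run above, though the exact figures depend on how the mutation is applied.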


And what of those monkeys?
The first two methods would be simulations of how a monkey would type: the first is the equivalent of letting a monkey type the entire post and making it start from scratch over and over if it made any errors. The second method is like the monkey using the backspace key each time it got something wrong. The evolving algorithm is unlike random chance; it's there to illustrate that information of almost infinite improbability can emerge over a far shorter space of time.

Including white space, my posts average around 7,000 characters. By the time a post is finished, the number of keys I press is probably a lot higher when taking into account typos, spelling and grammatical errors, deletion of poor sentences, and proofreading. All up, it wouldn't be entirely unfair to say I do about 10,000 keystrokes per post. Now if I do 200 keystrokes a minute, it would still take me 50 minutes or so to get the post to where it is now. To evolve or generate something like this by chance is theoretically possible, but practically impossible.

This is why, when we see a code or information, we know there's intelligence behind it. Look at the great pyramids: the symbols contained therein are the product of intelligence and intent. Anyone who has had previous dealings with the written word would recognise the symbols as the same kind of construct. Going back further into human history, the same could be said of cave paintings. They lack the refined symbolism of the written word, but they are evidently communication. Other forms of communication exist that are not so obvious, smoke signals for example. There are also times when we mistake natural order and randomness for communication: take the stars, say; astrology is a joke, and our futures are not etched into the palms of our hands.

We can infer meaning from non-meaningful processes, and this is the problem with the statement "all codes are a product of intelligence". We see patterns all through nature: improbable shapes, repetition, improbable assortments, all of which come from natural processes. One of these patterns is DNA, the double-helix structure that holds the instructions for the building blocks of life. How could this occur naturally? Well, that's for another entry. The importance of the tests above was to highlight the difference between random chance and a cumulative or evolutionary process. It was to show that improbable events can be created quite quickly using selective processes.

10 comments:

Anonymous said...

But...

Selection in biological organisms doesn't work very much like the example (I've always wanted to ask Dawkins about this).

Rather, often several changes have to be made in order for the change to increase fitness. Thus, keeping every letter that matches the target creates a fitness landscape where, say, only 1e-9 and 1 exist for each site, and in which the total fitness is the sum of the 38 fitness components. Real landscapes just don't look like that.

While it's not a totally bad example when used to get people who don't believe in evolution onto the right track, to describe evolutionary processes even remotely realistically the example must be changed significantly.

K said...

Yes, selection doesn't work like this. But this wasn't about evolution, it was about information theory and how we can determine what is and what isn't a product of an intelligent agent - something important to distinguish when tackling the subject of DNA. There are some who say "DNA is a code, and all codes are a product of intelligence", and this along with a future post are to address that issue.

As for how to simulate natural selection and evolution, I'm still trying to figure out how to adequately represent the process in a computational sense. But I'll save that for another time.

Randy Stimpson said...

One of the problems I see with your approach is that you lock in small changes that match part of the pattern you are trying to match.

A much larger pattern would have to emerge before you could lock it in. Consider that the average length of a gene is 2510 bp. Also consider that multiple genes need to be working together to create a trait that could be selected for. If a significant change to a life form was made that could cause it to be selected for, it wouldn't be the result of a short sequence change.

K said...

This method does not work in the way evolution does, that was the point of the entire post. Language and DNA are simply not comparable. The whole point was to show how we can recognise specified information because it can't be generated any other way.

Randy Stimpson said...

Kel,

Your last two algorithms are pointless. They both rely on specified information. Your last two algorithms and Dawkins's Weasel program suffer from the same fallacies.

K said...

Exactly, Dawkins even points that out in The Blind Watchmaker.

The last two algorithms do have a point: they are there to show that cumulative processes are to be distinguished from blind chance. It's an analogy, not a descriptor.

Randy Stimpson said...

We agree on one thing:

"... when we see a code or information we know there's intelligence behind it."

DNA is a code and it is a type of information. You can hardly equate it with rings on a tree or the pattern of a snowflake. The sequence below is a pattern.

abcabcabcabcabcabcabcabcabcabcabcabc

This sentence is information. Note the difference.

K said...

DNA is more a cypher than a code; it's not specified information in the way that language is. Remember that the A, C, T and G that are the 4 characters in DNA are us imposing language onto it. It's four nucleotide substances that evidently can occur naturally. Adenine was synthesised in an experiment in 1961.

Language as information is a product of the mind; the symbols used are obviously products of intelligence. The building blocks of life are naturally occurring.

Anonymous said...

Randy, you twit, you know nothing about information. Information is anything and everything. It does not rely on intelligence! Obviously, because if it did, you wouldn't generate ANY!

Quantum mechanics deals explicitly with information at ALL TIMES, and yet, none of it is intelligently designed. The information merely exists.

Second, you meant message. Sentences have messages that are carried in a medium, language. Information is generated by synapses randomly as we sleep, and the attempt at trying to make sense of this information is what causes dreams.

Furthermore, that meaning is stored on chemicals, not even DNA, so your appeal to there being a correlation between constructions of chemicals and information is just stupid. There was always information in them, otherwise no meaning could be given to them. Originally, none of these chemicals had the meanings that our brains give them, and it was the same for proto-genes on a pre-biotic Earth. Information exists in all forms, in every substance. It is simply that information becomes meaningful in any sufficiently powerful system.

But regardless of these simple things, you continue to prattle on about your pseudoscientific arguments, with not one iota of substance behind any of it. That is why YOU ALWAYS make claims, and NOTHING ELSE.

Anonymous said...

After reading some of Richard Dawkins's books, I was concerned about that same problem. I wrote some MATLAB code that tries to evolve an eye by generating random variants, starting from a flat plate and evaluating how well they would focus light rays from different directions. I posted my code and started posting results on my blog, but haven't looked back at it in a while.