Showing posts with label statistics. Show all posts

Tuesday, 21 October 2008

There's a monkey sitting at a typewriter

Mathematics is the language of science, so a grasp of the mathematical concepts involved is vital for understanding much of what is written. Likewise, understanding the difference between coincidental and causal events is vital for understanding statistics. Our brains work by linking causal events: if I press the 'x' key on my keyboard, I expect an 'x' to appear in my text editor. All the keys on the keyboard have a function, and using that same causal relationship I can write entire blog posts. I am conveying information in a causal manner. Now what if it were random? Could a random generator recreate a post of mine, or even a single sentence?


Generating improbability
Now there are a few ways of going about this. Take the sentence "mathematics is the language of science": using the lower-case alphabet and the space character, there are 38 positions and 27 different characters that could go in each. Note that Dawkins did a similar computational experiment in The Blind Watchmaker.
  • random chance - suppose you tried to generate every character in every position at once. Getting the first character correct would be a 1 in 27 chance, as would getting the second. Getting both correct would be (1/27)^2, or 1 in 729. Getting three characters correct would be (1/27)^3, or 1 in 19,683. Getting all 38 characters correct would be 1 in 27^38, or about 1 in 2.5*10^54. So for this one-off event to happen, it would take an extraordinary amount of time to generate it by chance with purely random input.

  • cumulative chance - instead of trying to generate it all in one go, generating each character individually could work better. Start with the first character, and there's a 1 in 27 chance of generating 'm', so it takes 27 attempts on average. When 'm' is finally generated, move on to trying to generate the 'a'. Now each subsequent character generation is dependent on earlier chance encounters. The expected number of attempts to generate 'math' becomes 27 + 27 + 27 + 27, or 108. For all 38 characters, it's 27 * 38, or 1026 attempts on average. So by progressively doing each step along the way, it takes away an extreme amount of improbability.

  • evolving chance - back to generating in one go. By starting at an arbitrary point and tweaking the string, eventually it will come up with the right answer. The mathematics for this is not easily expressed, though it can be expressed in code.
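
The arithmetic behind the first two approaches can be checked in a few lines. This is a minimal sketch in Python, not the program used for the results; the names `TARGET` and `ALPHABET_SIZE` are illustrative labels only.

```python
TARGET = "mathematics is the language of science"
ALPHABET_SIZE = 27  # 26 lower-case letters plus the space character

length = len(TARGET)  # number of positions to fill

# Random chance: every position has to be right in the same attempt.
all_at_once = ALPHABET_SIZE ** length

# Cumulative chance: each position takes 27 tries on average, one after another.
one_at_a_time = ALPHABET_SIZE * length

print(f"positions: {length}")
print(f"all at once: 1 in {all_at_once:.2e}")
print(f"one at a time: about {one_at_a_time} tries on average")
```

The gap between the two figures is the whole point: multiplying the odds together explodes, while adding the expected tries barely grows.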


The results
I wrote a Java program to simulate these different processes. The source code is available here for anyone wishing to run it themselves. Feel free to modify it and push it to its limits. It's very much a cobbled-together hack job; there wasn't much focus on having a clean interface. It's there to show I didn't make the results up, and while the randomness of the PC means the numbers will not turn out exactly the same as mine, they should be a good approximation.
  • random chance did not generate an answer. Computer randomness is pseudo-random and based on seeds, so the program may well run indefinitely without ever finding the answer. Even if it were truly random, it would still take an impractically long time. I ran the program quite a few times and there wasn't even a fragment of any word that could be considered English.

  • cumulative chance yielded results similar to the mathematical prediction of an average of 1026 iterations. Running it 10 times, I had the following results: 1058, 867, 1077, 1403, 776, 943, 1081, 893, 945, 880. This is an average of 992 iterations. A second set of 10 runs yielded an average of 1028. A third set of 100 runs yielded an average of 1008. A few more times and I did get 1026 as the average over 100 runs. The practical results matched the theoretical prediction.

  • evolutionary chance brought an even quicker result. By adjusting the amount of mutation (a mutation rate of 1 would mean 'd' could only change to either 'e' or 'c', 2 would cover the spread from 'b' to 'f'), the length of time could decrease. The results for running it 100 times on different mutation rates were as follows:

A run through with a mutation variance of 5
Iteration: 1 - tptvelwhiqt irsirfpokxpub skrjcgrhcslt
Iteration: 2 - qrtyekzgimy iunhtirkibrubetjuhelueerot
Iteration: 3 - nqt efxfiru irmkojmmkdquddqesjeoxieoov
Iteration: 4 - rmtzedthiov ivmoohqrkdsufencsfhquiesly
Iteration: 5 - whteedtiity iyktnkpulfnudjmcrfgnuiewpz
Iteration: 10 - tftjeclaiww ivdtocswahguajjo fodwieurb
Iteration: 20 - hvtheixaifs in tfwlladguarexffqjpievcr
Iteration: 30 - natheoetias is thyhlamguadeohfinsieucx
Iteration: 40 - mathematigs is th tlacguaoepcflnniescg
Iteration: 50 - mathematics is th uladguauewdformiemce
Iteration: 60 - mathematics is thdilajguabe lfdskiebce
Iteration: 70 - mathematics is thwtlacguaje ofksniefce
Iteration: 80 - mathematics is thkjlauguape of syievce
Iteration: 90 - mathematics is the language of snience
Found iteration 97 - mathematics is the language of science
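
A run like the one above can be imitated with a short sketch. The post does not spell out how the original program selects or mutates, so the rules here are assumptions: characters that already match are kept, every other character is shifted by a random non-zero step of at most `variance` places, and the 27-character alphabet wraps around (so 'a' can step back to the space). The function names `mutate` and `evolve` are hypothetical.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 27 characters, space included


def mutate(ch, variance, rng):
    """Shift a character by a random non-zero step of at most `variance` places."""
    step = rng.choice([s for s in range(-variance, variance + 1) if s != 0])
    return ALPHABET[(ALPHABET.index(ch) + step) % len(ALPHABET)]


def evolve(target, variance, rng):
    """Return how many rounds of mutation it takes to reach `target`."""
    current = [rng.choice(ALPHABET) for _ in target]
    iterations = 0
    while "".join(current) != target:
        iterations += 1
        for i, wanted in enumerate(target):
            if current[i] != wanted:  # assumed selection rule: matches are kept
                current[i] = mutate(current[i], variance, rng)
    return iterations


rng = random.Random(2008)  # fixed seed so a run can be repeated
runs = [evolve("mathematics is the language of science", 5, rng) for _ in range(100)]
print(sum(runs) / len(runs))  # average rounds over 100 runs
```

Changing the `5` to another mutation variance (it must be at least 1) gives a feel for how the spread of each mutation affects the number of rounds.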


And what of those monkeys?
The first two methods are simulations of how a monkey would type. The first is the equivalent of letting a monkey type the entire post and having it start from scratch every time there is an error. The second is like the monkey using the backspace key each time it gets something wrong. The evolving algorithm is unlike random chance; it's there to illustrate that information of almost infinite improbability can emerge over a far shorter space of time.
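
The backspace-using monkey is simple enough to simulate directly. The following is a minimal sketch under the assumption that every key press is an independent, uniform pick from the 27 characters; `backspace_monkey` is a made-up name and this is not the original program.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 27 characters, space included


def backspace_monkey(target, rng):
    """Count key presses when each wrong key is deleted and retried."""
    presses = 0
    for wanted in target:
        while True:
            presses += 1
            if rng.choice(ALPHABET) == wanted:
                break  # right key: move on to the next character
    return presses


rng = random.Random(42)  # fixed seed so a run can be repeated
sentence = "mathematics is the language of science"
runs = [backspace_monkey(sentence, rng) for _ in range(1000)]
print(sum(runs) / len(runs))  # should land near 27 * 38 = 1026
```

Individual runs vary by a few hundred presses either way, which is why a handful of runs can average well above or below 1026 while a large batch settles close to it.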

Including white space, my posts average around 7,000 characters. By the time a post is finished, the number of keys I have pressed is probably a lot higher when taking into account typos, spelling and grammatical errors, deletion of poor sentences, and proofreading. All up, it wouldn't be entirely unfair to say I do about 10,000 keystrokes per post. Even at 200 keystrokes a minute, it would still take me 50 minutes or so to get the post to where it is now. To evolve or generate something like this by chance is theoretically possible, but practically impossible.

This is why when we see a code or information, we know there's intelligence behind it. Look at the great pyramids: the symbols contained therein are the product of intelligence and intent. Anyone who has had previous dealings with the written word would be able to understand that the symbols are the same kind of construct. To go back further into human history, the same could be said of cave paintings. They do not have the refined symbolism of the written word, but they are evidently communication. Other forms of communication exist that are not so obvious, smoke signals for example. There are times too when we mistake natural order and randomness for communication: the stars say astrology is a joke, and our futures are not etched into the palms of our hands.

We can infer meaning from non-meaningful processes; this is the problem with the statement "all codes are a product of intelligence". We see patterns all through nature: improbable shapes, repetition, improbable assortment, all of which come from natural processes. One of these patterns is DNA, the double-helix structure that holds the instructions for the building blocks of life. How could this occur naturally? Well, that's for another entry. The point of the tests above was to highlight the difference between random chance and a cumulative or an evolutionary process. It was to show that improbable events can be created quite quickly using select processes.

Sunday, 21 September 2008

Miracles and Statistics

"There are three kinds of lies: lies, damned lies, and statistics." - Benjamin Disraeli
Highly improbable events can often seem miraculous; our minds are not equipped for intense statistical analysis, and certainly not for dealing with large numbers. Couple this with the very limited scope we have in which to view the world, and suddenly blind chance can seem like fate and divine intervention can look like the only explanation. Certainly some events are too statistically improbable to happen by chance, and in those cases intelligence must be sought as an explanation. But given enough time and a large enough sample, miraculous events can and do happen in every walk of life.


Winning the Lottery
Now this is a statistically improbable event. In a system where 6 of 44 numbers are drawn, there are 7,059,052 possible combinations (it would be about 5 billion if the order of the draw mattered, but it doesn't), so the chance of a single game winning is roughly 1 in 7 million. Given that in the Australian lottery system each ticket has 12 different combinations printed on it, that chance improves to about 1 in 590,000 per ticket. So if there were a million tickets bought each week, there would be more than one winning ticket a week on average. Run the lottery enough times and there's bound to be a winner somewhere.

Now what would it look like from the winner's perspective? They have won something that is incredibly unlikely, a statistical fluke. The pay-off for that statistical fluke is a huge financial gain, and the cost was very little. Buying a ticket each week certainly improves the odds, but not by enough for a win to be considered anything other than a statistical fluke. If someone bought a ticket every week for two decades, the chance of winning at least once is approximately 1 in 570. So while that's a far better chance than buying just one ticket, the likely outcome is practically the same: no win. The person who wins had an extreme bit of luck, but it's nothing more than the inevitability that someone has to win some time. For the extremely fortunate individual this could be taken as a significant event, as something more than blind chance.
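
Lottery odds like these are easy to check, and the figure depends heavily on whether the order of the draw is counted, so both counts are shown. This is a sketch under stated assumptions: a win means matching all 6 numbers regardless of order, the 12 games on a ticket are all different, and one ticket is bought every week for 20 years of 52 weeks.

```python
import math

NUMBERS, DRAWN = 44, 6
GAMES_PER_TICKET = 12  # assumed to be 12 different combinations
WEEKS = 20 * 52        # one ticket a week for two decades

ordered = math.perm(NUMBERS, DRAWN)    # 5,082,517,440 if draw order mattered
unordered = math.comb(NUMBERS, DRAWN)  # 7,059,052 when it does not

p_ticket = GAMES_PER_TICKET / unordered
p_two_decades = 1 - (1 - p_ticket) ** WEEKS  # chance of at least one win

print(f"per game:    1 in {unordered:,}")
print(f"per ticket:  1 in {1 / p_ticket:,.0f}")
print(f"two decades: 1 in {1 / p_two_decades:,.0f}")
```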

This is how, in the ordinary course of events, people can lead extraordinary lives. The luck that a few have is inevitable in a large enough population; at that scale, statistical improbability is wiped out. And with the global media, improbable events are broadcast into our daily lives. Thanks to the news we are exposed on a daily basis to things that almost never happen, portrayed to us as regular occurrences. So by extension, winning the lottery sounds a lot more frequent than it is, purely because the sample size is extraordinary. This can be applied further.

On a simplistic level, take rolling a die. The chance of guessing one roll is 1 in 6. Guessing two rolls is 1 in 36, three rolls is 1 in 216, and 10 rolls is 1 in 60,466,176. Now if everyone in the world were to guess the sequence, it's inevitable that some people would guess it exactly. It would be unlikely that no-one was able to guess it; statistically it should happen about 100 times over. This is why picking a single individual event is not a good indication of psychic power. Statistically the improbable does occasionally happen; it's only with consistency that meaning can be derived.
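
The "100 times over" estimate can be reproduced with a quick calculation. The world population figure below is an assumption (a rough value for 2008), and every person is assumed to make one independent guess.

```python
ROLLS = 10
WORLD_POPULATION = 6_700_000_000  # assumption: rough 2008 figure

sequences = 6 ** ROLLS  # number of equally likely 10-roll sequences
expected_correct = WORLD_POPULATION / sequences

print(f"chance of one correct guess: 1 in {sequences:,}")
print(f"expected correct guessers:   {expected_correct:.0f}")
```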


Meaning from consistency
Now given the premise that an individual can experience improbable occurrences given a large enough sample size and time, the role of causality needs to be explored. While there certainly are events that befall one for nothing more than being in the right place at the right time, there is also the possibility that the individual is rigging the results. This is the train of thought leading to psychic powers: that there are those who can mentally affect the selected outcomes through divine insights. It can also lead to the belief that a personal God (or guardian angel) is watching over the affected individual.

Having a one-off event of improbable chance is statistically explainable; having repeated bouts of chance would indicate there has to be some underlying cause. Say, for instance, someone could correctly predict the order of a shuffled deck of cards with no foreknowledge (1 in 8.0658*10^67): there must be something to explain it. The obvious answer is that the deck was rigged, that the person knew in advance what order the cards were in. Without that, it calls for further study to find out why.

As noted previously, appealing to the paranormal is not an explanation; it explains nothing and just raises further questions. Rather, a mechanism for how someone could see into the future needs to be explored. But in order to get to that state, there needs to be some consistent statistical significance. Without the ability to reproduce improbable events, chance is the only viable explanation. We don't ever get something like all 52 cards being picked; instead, events with higher probability are sought. It might be something like picking the value on the card (1 in 13) or the suit on the card (1 in 4). This makes the probability of having hits higher, but it also takes away from the significance of picking an incredibly improbable event.

Take picking the suit on the card. For a full deck, the average person should pick 13 of the 52 cards correctly by guessing before any card is dealt. If the predictions are made after each card is revealed, the odds change (i.e. once 51 cards have been dealt, the observer should be able to deduce the final card's suit). So scores should be around 13, but not everyone will fall exactly on 13. There should be people who score 15 or 16, some who score 8 or 9, and there may be one who scores 24. If that individual consistently scores in the 20s while others average out over a long period of time to around chance, then there is something significant to derive from the event. If someone scores 22 one turn then 11 the next, then it falls within the normal course of events.
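
How unusual a given score is can be estimated with the binomial distribution. This is a sketch, assuming all 52 guesses are made up front and each is an independent, uniformly random pick of suit, so each has a 1 in 4 chance of being right; the helper names are illustrative.

```python
import math

CARDS, P_HIT = 52, 0.25


def prob_exactly(hits):
    """Binomial probability of exactly `hits` correct suits out of 52."""
    return math.comb(CARDS, hits) * P_HIT**hits * (1 - P_HIT) ** (CARDS - hits)


def prob_at_least(hits):
    """Probability of scoring `hits` or more in a single run through the deck."""
    return sum(prob_exactly(k) for k in range(hits, CARDS + 1))


mean = CARDS * P_HIT
spread = math.sqrt(CARDS * P_HIT * (1 - P_HIT))
print(f"expected hits: {mean:.0f}, standard deviation: {spread:.2f}")
for score in (16, 20, 24):
    once = prob_at_least(score)
    print(f"{score}+ hits: once {once:.2e}, twice in a row {once**2:.2e}")
```

The "twice in a row" column is the consistency argument in numbers: a score that is merely uncommon once becomes far rarer when it has to be repeated.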

Consistency is the only way to gain meaning from statistics; it's the foundation of statistical analysis. The problem when looking at the paranormal is the tautological nature of the phenomena: it's never that the phenomenon doesn't work, it's that the user wasn't in the right frame of mind. As the saying goes: if someone wants to use statistics to lend credence to a phenomenon, then all the statistics and their context must be shown. If someone can consistently pick the suit of a card 1 in 2 times as opposed to 1 in 4, then there is good cause to trumpet that stat. But if it happened once and the rest averages out to 1 in 4, then there is nothing special to report. The improbable does happen, and it can adequately be explained by statistics. The sooner people learn to use stats properly the better.