Poisson Example.

Suppose on planet Gnome all genes are coded by sequences of length 20 using a 10 letter alphabet. Sequence comparisons are done by a straightforward 20 on 20 alignment, with a score of +1 being given to a match, and 0 to a mismatch. So the score = # of matches. Given two sequences,
P(S=k) = (the probability the alignment score = k)
is a simple binomial probability. I'll let the reader figure out the formula. Having done so we determine that

P(S=0) + P(S=1) +...+ P(S=6) = 0.997614 = P(S<7).

Therefore, the probability of a score greater than 6 is

P(S>6) = 1 - P(S<7) = 0.002386.

We were able to use the binomial distribution in this case because in aligning two sequences we are asking 20 questions each of which has 2 possible answers: yes a match; or no a mismatch. ----------->

Ok, now let's do this 1000 times. That is, suppose we have a 20 codon Gnomic gene, the query sequence, and we throw it into a database of 1000 other Gnomic genes, and score it against each of them. Now we can define the notion of an expectation value. For example, for each query/database pair there is a probability of 0.002386 of having 7 or more matches. Therefore, if we compare the query to all 1000 database sequences, we expect on average to find about

E = 1000x0.002386 = 2.386

out of the thousand will have 7 or more matches with the query. So, what if we find there are 10 out of 1000 matching the query in 7 or more residues? Significant?

Let N = # of database sequences matching our query in 7 or more residues. And let P(N=k) be the probability that N = k. Well, P is again a binomial probability, since again we are asking a yes/no question: yes = there are 7 or more matches; no = there are 6 or fewer matches. Since the probability of a yes is 0.002386, and we are asking the question 1000 times,

The problem is, a calculator, even many computers, will be unable to easily handle this computation (try taking 1000! (factorial) on your calculator or computer). This is a job for the Poisson distribution, which will give a good approximation in this case. Using np = 2.386 (the mean of the binomial distribution), we get

P(N=k) = [e-2.386(2.386)k] / k!

Let's look at some values:
  • P(N=0) = 0.091997
  • P(N=1) = 0.219505
  • P(N=2) = 0.261869
  • P(N=3) = 0.208273
  • P(N=4) = 0.124235
  • P(N=5) = 0.059285
  • P(N=6) = 0.023576
  • P(N=7) = 0.008036
  • P(N=8) = 0.002397
  • P(N=9) = 0.000635

Therefore,

P(N=0) + P(N=1) +...+ P(N=9) = 0.999808 = P(N<10).

So P(N>9) = 0.000192. That is, there is less than 1 chance out of 5000 that 10 or more of the 1000 database sequences will match our query sequence in 7 or more residues (out of 20). Got it?

So, yes, finding our query matching 10 or more sequences in 7 or more residues is rather significant, which in this context gives the Gnomics reason to believe that the query sequence has a relationship to the database that is not random, and indeed it is unlikely the database itself is random.

And that completes this portion fo the statistics chapter. The remaining pages will look at the messier statistical context of earth-based genome databases.