Suppose on planet Gnome all genes are coded by sequences of length 20 using
a 10 letter alphabet. Sequence comparisons are done by a straightforward
20 on 20 alignment, with a score of +1 being given to a match, and 0 to a
mismatch. So the score = # of matches. Given two sequences,
P(S=k) = (the probability the alignment score = k)
is a simple binomial probability. I'll let the reader figure out the
formula. Having done so we determine that
P(S=0) + P(S=1) +...+ P(S=6) = 0.997614 = P(S<7).
Therefore, the probability of a score greater than 6 is
P(S>6) = 1 - P(S<7) = 0.002386.
We were able to use the binomial distribution in this case because in aligning
two sequences we are asking 20 questions each of which has 2 possible
answers: yes a match; or no a mismatch. ----------->
Ok, now let's do this 1000 times. That is,
suppose we have a 20 codon Gnomic gene, the query sequence, and we throw it
into a database of 1000 other Gnomic genes, and score it against each of them.
Now we can define the notion of an expectation value. For example, for each
query/database pair there is a probability of 0.002386 of having 7 or more matches.
Therefore, if we compare the query to all 1000 database sequences, we expect on
average to find about
E = 1000x0.002386 = 2.386
out of the thousand will have 7 or more matches with the query. So, what if we find
there are 10 out of 1000 matching the query in 7 or more residues? Significant?
Let N = # of database sequences matching our query in 7 or more residues. And let
P(N=k) be the probability that N = k. Well, P is again a binomial probability,
since again we are asking a yes/no question: yes = there are 7 or more matches;
no = there are 6 or fewer matches. Since the probability of a yes is 0.002386, and
we are asking the question 1000 times,
The problem is, a calculator, even many computers, will be unable to easily handle
this computation (try taking 1000! (factorial) on your calculator or computer).
This is a job for the Poisson distribution, which will give a good approximation
in this case. Using np = 2.386 (the mean of the binomial distribution), we get
P(N=k) = [e-2.386(2.386)k] / k!
Let's look at some values:
- P(N=0) = 0.091997
- P(N=1) = 0.219505
- P(N=2) = 0.261869
- P(N=3) = 0.208273
- P(N=4) = 0.124235
- P(N=5) = 0.059285
- P(N=6) = 0.023576
- P(N=7) = 0.008036
- P(N=8) = 0.002397
- P(N=9) = 0.000635
P(N=0) + P(N=1) +...+ P(N=9) = 0.999808 = P(N<10).
So P(N>9) = 0.000192. That is, there is less than 1 chance out of 5000 that 10
or more of the 1000 database sequences will match our query sequence in 7 or more
residues (out of 20). Got it?
So, yes, finding our query matching 10 or more sequences in 7 or more residues is
rather significant, which in this context gives the Gnomics reason to believe that
the query sequence has a relationship to the database that is not random, and indeed
it is unlikely the database itself is random.
And that completes this portion fo the statistics chapter. The remaining pages will look
at the messier statistical context of earth-based genome databases.