In the real world the statistical analysis of alignment scores is seldom as
clean as has been outlined on the previous pages. As an example, let's suppose
you're a biologist working in a secret laboratory 37 stories beneath the Nevada
desert. You've sequenced a gene from an extraterrestrial sold to the lab by a group
of children who wouldn't let it phone home. The gene consists of 300 amino acid
codons (why the alien's amino acids should be the same as ours was explained on
Star Trek). In order to test its relatedness to earthbound genes you type it into
a computer and let the BLAST program run it up against each of 250,000 sequences
in a genome database. The BLAST program finds local alignment scores, but to save
time it doesn't use a straightforward dynamic programming algorithm. In particular,
BLAST local alignments are gapless.
Anyway, each query/database comparison may result in several local alignments. The
maximal segment pair (MSP) is the one with the highest score. Because it is the
maximum of many scores, its score is an extreme value, and the statistical
distribution that best approximates the distribution of MSP scores is the extreme
value distribution. Even so, it does not give us exact probabilities, only lower
bounds. Let P(s>S) be the probability of randomly encountering a score s
greater than some cut-off S. Then
\[ P(s > S) \ge 1 - \exp\left(-K m n e^{-vS}\right) \]
K and v are constants dependent upon the make-up of the database, m is the
length of the query, and n depends on the context (for a single query/database
comparison, n is the length of the database sequence).
We're interested in high scores S. Note that the bigger S gets, the smaller
$e^{-vS}$ gets, and the smaller that gets, the closer $\exp(-Kmne^{-vS})$
gets to 1, and the closer the lower bound for P(s>S) gets to zero. That is,
a big S yields a small P.
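To see that shrinkage in numbers, here is a minimal Python sketch of the lower bound 1 - exp(-Kmn e^(-vS)). The function name `msp_lower_bound` is made up for this example, and the values of K, v and n are illustrative assumptions, not constants fitted to any real database; only m = 300 comes from the alien gene above.

```python
import math

def msp_lower_bound(S, m, n, K, v):
    """Lower bound 1 - exp(-K*m*n*exp(-v*S)) on P(s > S)."""
    x = K * m * n * math.exp(-v * S)
    # -expm1(-x) equals 1 - exp(-x), but stays accurate when x is tiny.
    return -math.expm1(-x)

# Illustrative assumptions only: K and v are NOT taken from a real database.
K, v = 0.1, 0.3
m = 300   # query length (the 300-residue alien gene)
n = 400   # assumed length of one database sequence

for S in (30, 50, 70, 90):
    print(S, msp_lower_bound(S, m, n, K, v))
```

Each printed bound is smaller than the one before it, which is the "big S, small P" behaviour described above.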
In addition, if $Kmne^{-vS}$ is close to zero, then $\exp(-Kmne^{-vS})$ is
well approximated by $1 - Kmne^{-vS}$. In that case the lower bound above can
be well approximated by $Kmne^{-vS}$. This value is called the expect.
According to Setubal and Meidanis in their book "Introduction to Computational
Molecular Biology", it is interpreted as the expected number of distinct segment
pairs between two random sequences with score above S. If this is not all clear,
don't fret, it will either be clarified in another section (if it's funded), or will
be clarified in class.
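In the meantime, a small Python sketch may help show when the expect is a good stand-in for the lower bound. The function names `expect` and `lower_bound` are hypothetical, and K, v and n are again illustrative assumptions rather than values from a real database.

```python
import math

def expect(S, m, n, K, v):
    """The expect: K*m*n*exp(-v*S)."""
    return K * m * n * math.exp(-v * S)

def lower_bound(S, m, n, K, v):
    """Lower bound 1 - exp(-expect) on P(s > S)."""
    # -expm1(-x) is 1 - exp(-x), computed accurately for small x.
    return -math.expm1(-expect(S, m, n, K, v))

# Illustrative assumptions only: not fitted to any real database.
K, v = 0.1, 0.3
m, n = 300, 400

for S in (20, 40, 60):
    E = expect(S, m, n, K, v)
    P = lower_bound(S, m, n, K, v)
    print(f"S={S}: expect={E:.3g}, bound={P:.3g}, relative gap={(E - P) / E:.3g}")
```

For low cut-offs the expect can exceed 1, so it is clearly not a probability there; the two numbers only agree closely once the expect itself is small.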