Catching some Z's.
There's a lot of statistical theory we won't touch on,
but a great deal of what happens in statistics is testing
a hypothesis by showing how unlikely the observed result would be
if the opposite hypothesis were true. There are few absolutes in statistics,
but if 99.99% of all sequence alignment scores fall below a
certain level, and you happen to have a score above that level,
then there's reason to believe that the source of the high
score may be biological and not the result of randomness.
Quite often, instead of dealing with alignment scores we
deal with z-scores, which as we've seen measure how many
standard deviations we are from the mean. Since we'll be
interested exclusively in high scores, all of our interesting
z-scores will be positive.
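As a reminder of how a z-score is obtained, here is a minimal sketch. The class name, the background scores, and the observed score of 25 are all made up for illustration, and the standard deviation is assumed to be the population form (dividing by n).

```java
public class ZScoreSketch {
    // Arithmetic mean of the values.
    public static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Population standard deviation (divides by n, an assumption of this sketch).
    public static double standardDeviation(double[] values) {
        double m = mean(values);
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumOfSquares / values.length);
    }

    // Number of standard deviations that score lies above (or below) the mean.
    public static double zScore(double score, double mean, double sd) {
        return (score - mean) / sd;
    }

    public static void main(String[] args) {
        // Hypothetical background alignment scores, not real data.
        double[] background = {12, 15, 11, 14, 13, 16, 12, 15, 14, 13};
        double m = mean(background);
        double sd = standardDeviation(background);
        double observed = 25; // hypothetical high score
        System.out.printf("mean = %.2f, sd = %.2f, z = %.2f%n",
                m, sd, zScore(observed, m, sd));
    }
}
```

A score below the mean would give a negative z, which is why only the positive z-scores matter when hunting for unusually high alignment scores.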
As we saw on the previous page, when dealing with the Normal
distribution about 99.9% of all values lie below z = 3. So
being above z = 3 is pretty unlikely (p is roughly 0.001, or one chance
out of a thousand). But while the Normal distribution is quite
common, it doesn't do us much good. The three distributions
we will use most frequently are the Binomial distribution, which
we've already looked at, the Poisson distribution, which we're
about to look at, and the Extreme Value distribution, the most
arcane of the lot, and the most important.
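The "one chance out of a thousand" figure for the Normal distribution can be checked by brute force. The sketch below draws standard Normal values with `java.util.Random.nextGaussian()` and counts how many exceed z = 3; the class name, sample count, and seed are arbitrary choices for illustration. The exact tail probability is nearer 0.00135, so expect a fraction a little above 0.001.

```java
import java.util.Random;

public class NormalTailSketch {
    // Fraction of `samples` standard Normal draws that exceed `threshold`.
    public static double fractionAbove(double threshold, int samples, long seed) {
        Random rng = new Random(seed);
        int count = 0;
        for (int i = 0; i < samples; i++) {
            if (rng.nextGaussian() > threshold) {
                count++;
            }
        }
        return (double) count / samples;
    }

    public static void main(String[] args) {
        // One million draws and seed 42 are arbitrary illustrative choices.
        double fraction = fractionAbove(3.0, 1_000_000, 42L);
        System.out.printf("fraction of draws above z = 3: %.5f%n", fraction);
    }
}
```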
One thing those three distributions have in common is that
they're skewed to the right. This asymmetry makes it
more likely to find high z-scores that occur randomly
than is the case with the Normal distribution. In a Normal
distribution a score z = 10 would be in a range of such
extreme improbability that it should never occur randomly. But
as we will see, z = 10 is not as significant in the Extreme Value distribution.
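To get a rough feel for the difference, the sketch below compares the chance of exceeding a given z-score under the Normal distribution and under a right-skewed Extreme Value distribution. It assumes the standard Gumbel form for maxima, with cumulative distribution exp(-exp(-y)), mean equal to the Euler-Mascheroni constant, and standard deviation pi/sqrt(6); that form is an assumption of this example and has not been defined in the text so far. The Normal tail is found by Simpson's rule because Java's standard library has no error function. All names are illustrative.

```java
public class TailComparisonSketch {
    static final double EULER_GAMMA = 0.5772156649015329;

    // Standard Normal density.
    static double normalDensity(double t) {
        return Math.exp(-0.5 * t * t) / Math.sqrt(2.0 * Math.PI);
    }

    // P(Z > z) for the standard Normal, by Simpson's rule on [z, z + 10].
    // The density beyond z + 10 is negligible for z >= 0.
    public static double normalUpperTail(double z) {
        int n = 2000;            // number of intervals, must be even
        double width = 10.0;
        double h = width / n;
        double sum = normalDensity(z) + normalDensity(z + width);
        for (int i = 1; i < n; i++) {
            sum += normalDensity(z + i * h) * (i % 2 == 1 ? 4.0 : 2.0);
        }
        return sum * h / 3.0;
    }

    // P(Z > z) when Z is a standard Gumbel variable rescaled to mean 0, sd 1.
    public static double gumbelUpperTail(double z) {
        double y = EULER_GAMMA + z * Math.PI / Math.sqrt(6.0);
        // 1 - exp(-exp(-y)), written with expm1 to keep precision for tiny tails.
        return -Math.expm1(-Math.exp(-y));
    }

    public static void main(String[] args) {
        for (double z : new double[] {3.0, 5.0, 10.0}) {
            System.out.printf("z = %4.1f  Normal tail = %.3e  Gumbel tail = %.3e%n",
                    z, normalUpperTail(z), gumbelUpperTail(z));
        }
    }
}
```

Under these assumptions the Gumbel tail at z = 10 is on the order of one in a million, while the Normal tail is many orders of magnitude smaller.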
Before we get to that, let's take a look at the Poisson distribution.
Recall that the binomial distribution depends on two numbers:
- n, the number of times an experiment is being done (like comparing residues);
- p, the probability that any given experiment will give a yes (or match) answer.
In those cases where n is very large, calculating the binomial distribution
can be quite problematic (in particular, the binomial coefficient that is
part of that calculation is difficult to calculate for large n).
However, if n is large and p is small, then the Poisson distribution is
a good approximation to the binomial distribution and may be used in
its place:
P(X = x) = e^(-np) (np)^x / x!
Here np is the mean of the binomial distribution, X is the number of yesses you'll have
out of n experiments, and P(X = x) is the probability that X = some value x.
For reasonably small values of np this distribution dies off pretty quickly,
and only small values of x have sizable probabilities. Let's take a look.
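Here is a small sketch that tabulates the Poisson probabilities next to the binomial ones they approximate. The values n = 1000 and p = 0.002 (so np = 2) are made up for illustration, as are the class and method names. Both probabilities are built up one factor at a time, which sidesteps forming x! or a large binomial coefficient directly.

```java
public class PoissonSketch {
    // Poisson probability P(X = x) for the given mean: e^(-mean) * mean^x / x!,
    // accumulated as a running product so x! is never formed on its own.
    public static double poisson(int x, double mean) {
        double probability = Math.exp(-mean);
        for (int k = 1; k <= x; k++) {
            probability *= mean / k;
        }
        return probability;
    }

    // Binomial probability P(X = x) = C(n, x) p^x (1 - p)^(n - x),
    // with C(n, x) folded in factor by factor as (n - x + k) / k.
    public static double binomial(int x, int n, double p) {
        double probability = Math.pow(1.0 - p, n - x);
        for (int k = 1; k <= x; k++) {
            probability *= (double) (n - x + k) / k * p;
        }
        return probability;
    }

    public static void main(String[] args) {
        int n = 1000;       // hypothetical number of experiments
        double p = 0.002;   // hypothetical chance of a yes on each one
        double mean = n * p;
        System.out.println(" x   binomial   Poisson");
        for (int x = 0; x <= 8; x++) {
            System.out.printf("%2d   %.6f   %.6f%n",
                    x, binomial(x, n, p), poisson(x, mean));
        }
    }
}
```

With a mean of 2 the two columns should agree to about three decimal places, and the probabilities shrink rapidly once x gets past 5 or so.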
More Java on the next page.