Catching some Z's.

There's a lot of statistical theory we won't touch on, but a great deal of what happens in statistics is testing hypotheses by showing how unlikely the observed result would be if the opposite were true. There are few absolutes in statistics, but if 99.99% of all sequence alignment scores fall below a certain level, and you happen to have a score above that level, then there's reason to believe that the source of the high score may be biological and not the result of randomness.

Quite often, instead of dealing with raw alignment scores we deal with z-scores, which, as we've seen, measure how many standard deviations a score lies from the mean. Since we'll be interested exclusively in high scores, all of our interesting z-scores will be positive.
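Since we'll be writing Java shortly anyway, here's a minimal sketch of that calculation. The score, mean, and standard deviation below are made-up numbers, used only to show the arithmetic.

    // Minimal z-score sketch: how many standard deviations a score sits above the mean.
    // The score, mean, and standard deviation are made-up numbers.
    public class ZScore {
        static double zScore(double score, double mean, double stdDev) {
            return (score - mean) / stdDev;
        }

        public static void main(String[] args) {
            double score = 112.0;  // made-up alignment score
            double mean  = 40.0;   // made-up mean of random-alignment scores
            double sd    = 12.0;   // made-up standard deviation
            System.out.println("z = " + zScore(score, mean, sd));  // prints z = 6.0
        }
    }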

As we saw on the previous page, when dealing with the Normal distribution about 99.9% of all values lie below z = 3. So being above z = 3 is pretty unlikely (p ≈ 0.001, about one chance in a thousand). But while the Normal distribution is quite common, it doesn't do us much good here. The three distributions we will use most frequently are the Binomial distribution, which we've already looked at, the Poisson distribution, which we're about to look at, and the Extreme Value distribution, the most arcane of the lot and the most important.
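If you want to convince yourself of that 99.9% figure, here's a rough Monte Carlo sketch; the sample count and seed are arbitrary choices.

    import java.util.Random;

    // Rough Monte Carlo check that only about 0.1% of standard Normal draws exceed z = 3.
    public class NormalTail {
        public static void main(String[] args) {
            Random rng = new Random(42);   // arbitrary seed
            int trials = 1_000_000;        // arbitrary sample count
            int above = 0;
            for (int i = 0; i < trials; i++) {
                if (rng.nextGaussian() > 3.0) {
                    above++;
                }
            }
            // Expect something near 0.0013; the true one-sided tail is about 0.13%.
            System.out.println("Fraction above z = 3: " + (double) above / trials);
        }
    }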

One thing those three distributions have in common is that they're skewed to the right. This asymmetry makes randomly occurring high z-scores more likely than they would be under the Normal distribution. Under a Normal distribution a score of z = 10 is so wildly improbable that it should never occur by chance. But as we will see, z = 10 is not nearly as significant in the Extreme Value distribution.
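To make that concrete before we formally meet the Extreme Value distribution, here's a hedged sketch comparing the tails at z = 10. It uses the standard Gumbel form, P(X > x) = 1 - exp(-e^(-x)), which is an assumption for illustration only; a real alignment-score EVD has location and scale parameters that this sketch ignores. The Normal tail is estimated with the standard phi(x)/x upper bound.

    public class TailComparison {
        // Survival function of the standard Gumbel distribution (assumed form, for illustration).
        static double gumbelTail(double x) {
            return 1.0 - Math.exp(-Math.exp(-x));
        }

        // Standard upper bound on the Normal tail: phi(x)/x, accurate for large x.
        static double normalTailBound(double x) {
            return Math.exp(-x * x / 2.0) / (x * Math.sqrt(2.0 * Math.PI));
        }

        public static void main(String[] args) {
            System.out.println("Gumbel P(X > 10) ~ " + gumbelTail(10.0));      // about 4.5e-5
            System.out.println("Normal P(Z > 10) ~ " + normalTailBound(10.0)); // about 7.7e-24
        }
    }

The two numbers differ by nearly twenty orders of magnitude, which is the whole point: a z of 10 that would be a statistical miracle under the Normal distribution is merely rare under the Extreme Value distribution.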

Before we get to that, let's take a look at the Poisson distribution. Recall that the binomial distribution depends on two numbers:
  • n, the number of times an experiment is being done (like comparing residues);
  • p, the probability that any given experiment will give a yes (or match) answer.
In cases where n is very large, calculating the binomial distribution can be quite problematic (in particular, the binomial coefficient that is part of that calculation is difficult to compute for large n). However, if n is large and p is small, then the Poisson distribution is a good approximation to the binomial distribution and may be used in its place.
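Here's a hedged sketch of that approximation in action. The values of n, p, and x are made up, and the exact binomial probability is computed in log space to sidestep the huge binomial coefficient mentioned above.

    public class PoissonApprox {
        // Exact Binomial P(X = x), done in log space so the huge binomial
        // coefficient never has to be computed directly.
        static double binomial(int n, int x, double p) {
            double logProb = 0.0;
            for (int i = 1; i <= x; i++) {
                logProb += Math.log(n - x + i) - Math.log(i);  // builds log C(n, x)
            }
            logProb += x * Math.log(p) + (n - x) * Math.log(1.0 - p);
            return Math.exp(logProb);
        }

        // Poisson P(X = x) with mean lambda = np.
        static double poisson(double lambda, int x) {
            double logProb = -lambda + x * Math.log(lambda);
            for (int i = 2; i <= x; i++) {
                logProb -= Math.log(i);  // subtracts log(x!) term by term
            }
            return Math.exp(logProb);
        }

        public static void main(String[] args) {
            int n = 10_000;     // made-up large n
            double p = 0.0003;  // made-up small p
            int x = 5;          // np = 3, so x = 5 is in the interesting range
            System.out.println("Binomial P(X = 5): " + binomial(n, x, p));
            System.out.println("Poisson  P(X = 5): " + poisson(n * p, x));
        }
    }

Run it and the two numbers agree to several decimal places (both come out around 0.10).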
The Poisson distribution takes a single parameter: np, the mean of the binomial distribution it approximates. If X is the number of yesses you'll have out of n experiments, then P(X = x), the probability that X equals some value x, is

    P(X = x) = e^(-np) (np)^x / x!

For reasonably small values of np this distribution dies off pretty quickly, and only small values of x have sizable probabilities. Let's take a look. More Java on the next page.
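In the meantime, here's a rough preview in the same spirit; the mean np = 2 is a made-up value.

    // Rough preview of how the Poisson probabilities die off for a small mean.
    // The mean np = 2 is a made-up value for illustration.
    public class PoissonDecay {
        static double poisson(double lambda, int x) {
            double logProb = -lambda + x * Math.log(lambda);
            for (int i = 2; i <= x; i++) {
                logProb -= Math.log(i);  // subtracts log(x!) term by term
            }
            return Math.exp(logProb);
        }

        public static void main(String[] args) {
            double lambda = 2.0;  // np, the mean
            for (int x = 0; x <= 10; x++) {
                System.out.printf("P(X = %2d) = %.6f%n", x, poisson(lambda, x));
            }
            // By x = 10 the probability is already down around 0.00004.
        }
    }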