# Populations and Samples

The section on probability has prepared us well for the start of this section. Consider the following problem: given a 6-residue DNA code sequence, CATTAG, what are the probabilities that a random 6-residue sequence using the same alphabet will match the given sequence in 0, 1, 2, 3, 4, 5, or 6 residues? The probability of matching in a single residue is p = 1/4 = 0.25. The probability of not matching is 1 - p = 0.75. So the probability of exactly k matches is given by the binomial formula (the equation on the right above): $p_k = \binom{6}{k}\, p^k (1-p)^{6-k}$, where $\binom{6}{k}$ counts the ways of choosing which k of the 6 positions match.
(Note: there's a simple proof that these probabilities sum to 1. The seven terms in the binomial expansion of $(0.25+0.75)^6$ are just the $p_k$ above. That is,
$1 = 1^6 = (0.25+0.75)^6 = p_0 + p_1 + \dots + p_6$.)
The seven probabilities are:
• $p_0 = 0.1780$,
• $p_1 = 0.3560$,
• $p_2 = 0.2966$,
• $p_3 = 0.1318$,
• $p_4 = 0.0330$,
• $p_5 = 0.0044$,
• $p_6 = 0.0002$.
These probabilities are plotted at right (the vertical grooves; scroll the right-hand frame to the bottom to see it better).
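
The seven values can also be checked numerically. The following is a minimal Python sketch, not part of the panel; the function name `match_probabilities` and its parameters are made up for illustration. It evaluates the binomial probability of exactly k matches for k = 0 to 6 and confirms that they sum to 1.

```python
from math import comb

def match_probabilities(length=6, p=0.25):
    """Return [p_0, ..., p_length]: the probability of exactly k matching residues."""
    return [comb(length, k) * p**k * (1 - p)**(length - k)
            for k in range(length + 1)]

probs = match_probabilities()
for k, pk in enumerate(probs):
    print(f"p{k} = {pk:.4f}")
print(f"sum = {sum(probs):.4f}")
```

Rounded to four places, the printed values should agree with the list above.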
Suppose we take a random collection of sequences of the same length and compare each to our query sequence (CATTAG), and then collect and plot the scores (i.e., plot how frequently each score occurs, the result being a frequency distribution). For example, CATCAT would score 4, and GGGGGA would score 0. The relative heights of the grooves represent the way we expect our plotted frequencies to be distributed. For example, if we compare the query sequence, CATTAG, to 225 random 6-residue sequences, then we expect 17.8%, or about 40, of those 225 sequences to have no matching residues; 35.6%, or about 80, to have one matching residue; 29.66%, or about 67, to have two matching residues; and so on. Plotting the frequencies of a real trial should give us columns with roughly the same relative heights as the probability grooves shown at right. Ok, so push the 225 button, and let's see what happens. Come back here when it finishes.
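
While the panel runs, the scoring rule and the expected counts quoted above can be checked with a small Python sketch. The helper names `score` and `expected_counts` are hypothetical, chosen only for this illustration; the score is simply the number of positions at which two equal-length sequences agree.

```python
from math import comb

QUERY = "CATTAG"

def score(query, candidate):
    """Count the positions where two equal-length sequences carry the same residue."""
    return sum(q == c for q, c in zip(query, candidate))

def expected_counts(n, length=6, p=0.25):
    """Expected number of sequences, out of n, scoring 0, 1, ..., length."""
    return [n * comb(length, k) * p**k * (1 - p)**(length - k)
            for k in range(length + 1)]

print(score(QUERY, "CATCAT"))  # 4
print(score(QUERY, "GGGGGA"))  # 0
for k, e in enumerate(expected_counts(225)):
    print(f"score {k}: about {e:.1f} of 225")
```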

What you just saw, if everything worked correctly, was 225 random sequences being generated on the right. As each was generated it was compared to CATTAG. The resulting score was then used to push up the gold rod by one increment over the relevant score. After all 225 sequences have been created and compared, a new histogram of gold rods overlays the grooves. The increments are adjusted so that if the experimental frequencies occur with exactly the theoretical percentages, then each gold rod will exactly fill the corresponding groove.
But that is very unlikely. I cannot say precisely what you're seeing, as each time you reset the panel and push another button the results will be different. But it is highly likely that some of your gold rods go over the top of their corresponding grooves, and some fall short. The basic shape of the experimental distribution (gold rods) should look similar to the theoretical distribution (grooves), but it won't be exact.
The shapes are similar because mathematics forces them to be; the theoretical distribution is an ideal - not unattainable, but unlikely. That's because the sample size of 225 is just too small. Each gold rod that goes one increment over the theoretical distribution groove means there's another rod that must be one increment under. With bigger increments, you get greater variation from the theoretical distribution. And the increment size increases as the sample size decreases.
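
The experiment can be imitated in a few lines of Python. This is only a sketch of the idea, not the panel's actual code: it assumes each residue is drawn uniformly from A, C, G, T, and the names `run_trial`, `ALPHABET`, and the fixed seed are choices made for this illustration.

```python
import random
from collections import Counter
from math import comb

QUERY = "CATTAG"
ALPHABET = "ACGT"  # assumption: the four DNA residues, equally likely

def run_trial(n, rng):
    """Generate n random sequences and tally how many residues each shares with QUERY."""
    tally = Counter()
    for _ in range(n):
        candidate = "".join(rng.choice(ALPHABET) for _ in QUERY)
        tally[sum(q == c for q, c in zip(QUERY, candidate))] += 1
    return [tally[k] for k in range(len(QUERY) + 1)]

rng = random.Random(0)  # fixed seed so the run can be repeated
n = 225
observed = run_trial(n, rng)
for k, count in enumerate(observed):
    expected = n * comb(6, k) * 0.25**k * 0.75**(6 - k)
    print(f"score {k}: observed {count:3d}, expected {expected:6.1f}")
```

The observed column plays the role of the gold rods and the expected column the grooves. Changing the seed gives a different trial, just as resetting the panel does.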

Let's test this. Reset the panel, then push the button for a sample size of 15 (choices are 15, 75, 135, 225, and 675). Do it a few times. What you should be observing is that the experimental distribution is now having much more trouble trying to fit the theoretical distribution. Ok, try the 75 button a couple of times, then work through all the higher buttons. You should observe that as the sample size increases, the experimental distribution has less trouble conforming to our theoretical expectations. If we were to use an infinite sample size - the entire population of every such random sequence that could ever be generated (whatever that means) - then the experimental and theoretical percentages would be identical. That's the difference between populations and samples, and that's part of the reason we need statistics. Very often an entire population of experimental results is unattainable - or even lacking in meaning. But we can approximate a population with a sample of experimental results, and if the sample is big enough, and chosen carefully enough, it can tell us something about the theoretical population, although with some uncertainty. As we have seen, this uncertainty decreases as the sample size increases. Ah, the wonders of statistics. Anyway, play around a bit with the panel at right. I'm going to go get a cup of coffee. See you on the next page.
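
If the panel is not at hand, the effect of sample size can be sketched in Python instead. The measure used here, the summed absolute difference between observed fractions and theoretical probabilities, averaged over repeated trials, is just one convenient choice for this illustration and is not something the panel reports; the name `mean_deviation`, the number of repeats, and the seed are likewise assumptions.

```python
import random
from math import comb

QUERY = "CATTAG"
PROBS = [comb(6, k) * 0.25**k * 0.75**(6 - k) for k in range(7)]

def mean_deviation(n, repeats, rng):
    """Average, over several trials of size n, of the summed |observed fraction - p_k|."""
    total = 0.0
    for _trial in range(repeats):
        counts = [0] * 7
        for _seq in range(n):
            # Draw one random residue per query position and count the matches.
            matches = sum(rng.choice("ACGT") == q for q in QUERY)
            counts[matches] += 1
        total += sum(abs(c / n - p) for c, p in zip(counts, PROBS))
    return total / repeats

rng = random.Random(1)  # fixed seed so the run can be repeated
deviations = {n: mean_deviation(n, 50, rng) for n in (15, 75, 135, 225, 675)}
for n, d in deviations.items():
    print(f"n = {n:3d}: mean total deviation {d:.3f}")
```

The printed deviation should tend to shrink as n grows, which is the same pattern the buttons show.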