Histograms; Means; Skewed Data
The plots produced by the application on the previous page are called histograms. A histogram is a very useful way of organizing a finite amount of data into a graphical presentation. It highlights the fact that while the sequences we compared to the query sequence were random, the resulting data values (number of matches) are less so. Of course we know the values are constrained to fall from 0 to 6, but even within that range one will far more often encounter 1 or 2 matches than 4 or 5. The result is a distribution of values that typically looks like a mountain tailing off at the extremes, which are low probability regions.
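To make the idea concrete, here is a minimal sketch of how such a histogram might be tallied. It is not the application from the previous page: the query string "GATTAC", the four-letter alphabet, the number of trials, and all function names are assumptions chosen for illustration. The only detail taken from the text is that match counts run from 0 to 6.

```python
import random
from collections import Counter


def count_matches(query, candidate):
    """Count the positions at which two equal-length sequences agree."""
    return sum(q == c for q, c in zip(query, candidate))


def match_histogram(query, trials, rng, alphabet="ACGT"):
    """Tally match counts between the query and `trials` random sequences."""
    counts = Counter()
    for _ in range(trials):
        # Build a random sequence of the same length as the query.
        candidate = "".join(rng.choice(alphabet) for _ in query)
        counts[count_matches(query, candidate)] += 1
    return counts


def print_histogram(counts, max_value, width=50):
    """Print one text bar per possible value, scaled to `width` characters."""
    total = sum(counts.values())
    for value in range(max_value + 1):
        bar = "#" * round(width * counts[value] / total)
        print(f"{value} | {bar} {counts[value]}")


rng = random.Random(0)   # fixed seed so the run is repeatable
query = "GATTAC"         # hypothetical 6-letter query sequence
hist = match_histogram(query, 10000, rng)
print_histogram(hist, len(query))
```

Under these assumptions the bars for 1 and 2 matches should dominate, with the counts thinning out quickly toward 5 and 6.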
One of the first things we'd like to determine about such a distribution is its location, and the most common way of doing that is to compute the mean, or average value. The mean of n data values x1, x2, x3,..., xn, is
$$\bar{X} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$
This value will in general lie roughly underneath the peak of the distribution. It is a good measure of central tendency.
Another measure of central tendency is the median, which I will use here only to make a point. First, the median of the values x1, x2, x3,..., xn is the value xk picked out from the middle of the list arranged in ascending order, if n is odd; if n is even then it's the average of the two middle values. So the median of the data values {1,2,2,5,6} is 2, and of the values {1,2,5,6} is 3.5.
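The two definitions translate almost directly into code. The following is a small sketch; the function names are chosen here for illustration, and Python's standard `statistics` module provides equivalent `mean` and `median` functions.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted list, or the average of the two middle
    values when the list has an even number of entries."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 2, 2, 5, 6]))  # 2
print(median([1, 2, 5, 6]))     # 3.5
print(mean([1, 2, 2, 5, 6]))    # 3.2
```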
At the bottom of the page is a simple application for building histograms. You're to imagine that you're counting gallstones (and if there's nothing on TV, why not?), and that if a patient has 5 gallstones, then you click the "5" button at the bottom of the screen, etc. So let's say you click 5 twice, 6 four times, 7 three times, 8 once and 9 once. That gives us a distribution that looks vaguely binomial. Click the "When done" button when you're done and the mean and median pop onto the screen. Note that the mean is a bit to the right of the median. This typically happens when the data distribution is skewed to the right, which means the right side fades off more gradually than the left. We can emphasize that effect by clicking the buttons for 10 through 14, once each. We see that the mean drifts further off from the defining peak of the distribution.
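The same experiment can be reproduced away from the application. This sketch uses the click counts described above and Python's standard `statistics` module; the variable names and the helper `expand` are illustrative only.

```python
from statistics import mean, median


def expand(clicks):
    """Turn {value: number_of_clicks} into a flat list of data values."""
    return [value for value, times in clicks.items() for _ in range(times)]


# 5 clicked twice, 6 four times, 7 three times, 8 once, 9 once.
clicks = {5: 2, 6: 4, 7: 3, 8: 1, 9: 1}
data = expand(clicks)
print(f"mean {mean(data):.2f}, median {median(data):.2f}")
# mean 6.55, median 6.00

# Lengthen the right tail: 10 through 14, once each.
skewed = data + list(range(10, 15))
print(f"mean {mean(skewed):.2f}, median {median(skewed):.2f}")
# mean 8.25, median 7.00
```

The five extra values in the tail move the median by only 1, while the mean moves by about 1.7, which is the drift away from the peak described above.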
Anyway, play around with it a while. The distributions we encounter in the study of genetic alignments are almost always skewed right, with long tails that are very often the regions of greatest interest to us. They are the regions of low probability, and things that happen despite having a low probability of happening are quite often the things of greatest interest to us. Such happenings are significant.