The big picture...
- Given a pair of genes we pass them through a computer.
One may already reside on a computer in a genome database. The other would be the query sequence.
- In the computer they encounter a dynamic programming algorithm that
- optimally aligns the two sequences using a predetermined scheme for
scoring codon pairs, and
- spits out on request the optimal alignment(s) and optimal score.
- This score is a measure of the similarity of the sequences, but
- the significance of that similarity can not be determined
without a statistical context.
- It must be placed in a statistical distribution of scores
- from which it may be determined if there is a reasonable expectation
for such a score to occur randomly, ie., by chance,
- and if it is very unreasonable, then we might conclude that the sequences are homologous,
- at which point white-coated biologists take over and do their version of mysterious stuff.
|
But first... we have to find the optimal alignments and corresponding
optimal score. The brute force approach to this problem is to find all possible alignments
and choose those with the best scores. Since we allow gaps to be aligned with codons (but
not with other gaps), this yields lots and lots of alignments (for two single residue sequences
there are 3 alignments ... think about it). I did a calculation, and allowing for the
introduction of all possible gaps, the number of ways of aligning two
sequences of equal length k grow faster than 22k+1. For k=70, with a computer
capable of computing a trillion
(1012) alignments per second, the entire age of the universe wouldn't be enough
time.
Those white-coated biologists would be white-haired by then, and probably have got bored
and tootled off somewhere.
Dynamic programming algorithms cut the age of the universe down to mere minutes. They do
this by doing the alignments step by step, and with each step throwing out vast numbers
of alignments unchecked, because the algorithm assures us that even without checking them,
the discarded alignments can't be optimal. In the end, for two sequences of length 100,
it reduces the number of alignments we need to consider from more than 1030
down to something of the order of 1002=104. Even a desktop can handle that.
|