The big picture...
  • Given a pair of genes we pass them through a computer. One may already reside on a computer in a genome database. The other would be the query sequence.
  • In the computer they encounter a dynamic programming algorithm that
  • optimally aligns the two sequences using a predetermined scheme for scoring codon pairs, and
  • spits out on request the optimal alignment(s) and optimal score.
  • This score is a measure of the similarity of the sequences, but
  • the significance of that similarity cannot be determined without a statistical context.
  • It must be placed in a statistical distribution of scores
  • from which it may be determined if there is a reasonable expectation for such a score to occur randomly, i.e., by chance,
  • and if it is very unreasonable, then we might conclude that the sequences are homologous,
  • at which point white-coated biologists take over and do their version of mysterious stuff.
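
The statistical step in the list above can be sketched as a shuffle test: score the real pair, then score many scrambled versions of the query and see how often chance does as well. This is only an illustration under loud assumptions. The names `identity_score` and `shuffle_p_value` are made up for this sketch, and `identity_score` is a crude stand-in (it just counts matching positions, with no gaps) for the optimal alignment score the dynamic programming algorithm would supply. Real tools use more refined statistics than this.

```python
import random

def identity_score(a, b):
    """Stand-in score (an assumption): count positions where the letters agree."""
    return sum(x == y for x, y in zip(a, b))

def shuffle_p_value(query, target, score_fn, trials=1000, seed=0):
    """Estimate how often a shuffled query scores at least as well as the real one."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = score_fn(query, target)
    letters = list(query)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)           # same letters, random order
        if score_fn("".join(letters), target) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (trials + 1)

observed, p = shuffle_p_value("acgtacgtacgt", "acgtacgtacgt", identity_score)
print(observed, p)
```

A small estimated probability says a score this good would be a surprise for random sequences of the same composition, which is the point at which homology becomes a reasonable guess.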

But first... we have to find the optimal alignments and corresponding optimal score. The brute force approach to this problem is to find all possible alignments and choose those with the best scores. Since we allow gaps to be aligned with codons (but not with other gaps), this yields lots and lots of alignments (for two single-residue sequences there are 3 alignments ... think about it). I did a calculation, and allowing for the introduction of all possible gaps, the number of ways of aligning two sequences of equal length k grows faster than 2^(2k+1). For k=70, with a computer capable of computing a trillion (10^12) alignments per second, the entire age of the universe wouldn't be enough time. Those white-coated biologists would be white-haired by then, and probably have got bored and tootled off somewhere.
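
If you'd like to check the counting yourself, here is a minimal sketch. It assumes that every distinct arrangement of columns counts as a separate alignment and that a gap is never paired with a gap, which is the reading that gives 3 alignments for two single-residue sequences. The function name `count_alignments` and the figure used for the age of the universe (about 13.8 billion years) are my additions, not part of the original calculation.

```python
def count_alignments(m, n):
    """Count gapped alignments of two sequences with lengths m and n."""
    # ways[i][j] = number of alignments of the first i and first j letters.
    # With one sequence empty there is exactly one alignment (all gaps).
    ways = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # The last column is letter/letter, letter/gap, or gap/letter.
            ways[i][j] = ways[i - 1][j - 1] + ways[i - 1][j] + ways[i][j - 1]
    return ways[m][n]

AGE_OF_UNIVERSE_SECONDS = 4.35e17   # assumed: roughly 13.8 billion years
ALIGNMENTS_PER_SECOND = 1e12        # the trillion-per-second computer

for k in (1, 2, 6, 70):
    total = count_alignments(k, k)
    seconds = total / ALIGNMENTS_PER_SECOND
    print(k, total, total > 2 ** (2 * k + 1), seconds > AGE_OF_UNIVERSE_SECONDS)
```

For small k the count sits below 2^(2k+1), but it multiplies by a bigger factor at each step and soon overtakes it, which is all "grows faster" is claiming.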

Dynamic programming algorithms cut the age of the universe down to mere minutes. They do this by doing the alignments step by step, and with each step throwing out vast numbers of alignments unchecked, because the algorithm assures us that even without checking them, the discarded alignments can't be optimal. In the end, for two sequences of length 100, it reduces the number of alignments we need to consider from more than 10^30 down to something of the order of 100^2 = 10^4. Even a desktop can handle that.
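
To make the step-by-step idea concrete, here is a minimal sketch of a global alignment by dynamic programming. Treat it as one possible version under assumptions: the scoring scheme (match +1, mismatch -1, gap -1) and the function name `align` are mine, and the scheme used on the following pages may well differ. Each cell of the table keeps only the best score for one pair of prefixes, and that is where all the discarded alignments go.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the score table for a and b, then trace back one optimal alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap              # prefix of a against nothing but gaps
    for j in range(1, cols):
        score[0][j] = j * gap              # prefix of b against nothing but gaps
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # letter with letter
                              score[i - 1][j] + gap,        # letter of a with gap
                              score[i][j - 1] + gap)        # gap with letter of b
    # Walk back from the bottom-right cell, re-deriving which choice was made.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))

table, top, bottom = align("dixon", "ions")
print(top)
print(bottom)
print("optimal score:", table[-1][-1])
print("cells filled:", len(table) * len(table[0]))
```

The work is one table cell per pair of prefixes, so two sequences of lengths 5 and 4 need a 6 by 5 table, and two sequences of length 100 need about 100^2 cells. The traceback here reports a single optimal alignment even when there are ties.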


Ok, let's jump into the deep end of the pool. When you click the "Next" button you're going to be taken to a page that will first give you a scary window that looks like an error message. Actually it's just me horsing around. It's an explanation of what is about to happen, which is, you will be given two more little blue windows into each of which you're to type a sequence of letters. After the second sequence has been entered, and you've clicked ok, your computer will do a calculation and produce a table. As a start, let sequence 1 be "dixon", and sequence 2 be "ions". Then we'll talk again later.