Scoring.
Certain substitutions may have a limited effect on gene function, while others may prove so deleterious that the gene is lost entirely. Consequently, a more elaborate scoring scheme is needed when aligning amino acid codons: one that allows positive scores for those mismatches that Nature has little trouble accepting, and that in general provides scores reflecting the level of that acceptance.

Determining that level of acceptance is the problem. In particular, given a substitution like (A-->C), we'd like a measure of its acceptance after one evolutionary step. That is, we'd like to differentiate between the one-step change (A-->C) and multistep changes like (A-->?...?-->C), where none of the intermediate codons is A or C.

We do this by narrowing our attention to pairs of sequences in our database that satisfy a stringent alignment criterion, like 80% identity. The assumption is that with each step more mutations occur and identity percentages decline, so the lower the identity, the more likely it becomes that any particular codon has mutated more than once. Pairs that are still highly identical are therefore the ones most likely to differ by one-step changes.
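To make the filtering step concrete, here is a minimal sketch of how pairs might be selected by percent identity. It assumes the sequences are already aligned to the same length and contain no gaps. The function names and the three toy sequences are made up for illustration and do not come from any real database.

```python
from itertools import combinations


def percent_identity(seq1, seq2):
    """Percentage of aligned positions at which the two sequences agree."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)


def close_pairs(sequences, threshold=80.0):
    """Return (i, j, identity) for every pair meeting the identity threshold."""
    result = []
    for (i, s), (j, t) in combinations(enumerate(sequences), 2):
        identity = percent_identity(s, t)
        if identity >= threshold:
            result.append((i, j, identity))
    return result


# Hypothetical toy sequences, ten letters each.
toy = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEYGHIRV"]
for i, j, identity in close_pairs(toy):
    print(i, j, identity)
```

With these toy sequences the first and third differ at three of ten positions (70% identity), so that pair is dropped, while the other two pairs are kept.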

For example, let's suppose we have a two-letter alphabet, {A,B}, and after each step 20% of the A's change to B's, and 20% of the B's change to A's. Suppose we start with a gene containing 100 of each letter. What happens after two steps?

Start.         One step.      Two steps.
100 A's  -->   80 A's  -->    64 A's
                              16 B's
               20 B's  -->     4 A's
                              16 B's
100 B's  -->   20 A's  -->    16 A's
                               4 B's
               80 B's  -->    16 A's
                              64 B's

We still have 100 A's and 100 B's at the end, but if we align our first sequence with the final sequence we'll find that 68 of our original A's align with A's, and 32 are now B's. That's OK, but 4 of those 68 A-to-A matches result from an (A-->B-->A) mutation, so they don't represent stability. And the more steps we take, the more problems of this sort we encounter, i.e., not knowing how we got from the initial codon to the final codon with which it is aligned.
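The bookkeeping in this example can be checked with a short sketch that tracks every letter history explicitly. It assumes the same 20% change rate at every step. The function names are made up for illustration, and exact fractions are used so that the counts come out without rounding error.

```python
from fractions import Fraction

RATE = Fraction(1, 5)  # 20% of each letter changes per step, as in the example


def histories(start, count, steps):
    """Map each letter history (e.g. 'ABA') to how many letters followed it."""
    paths = {start: Fraction(count)}
    for _ in range(steps):
        new_paths = {}
        for history, n in paths.items():
            last = history[-1]
            other = "B" if last == "A" else "A"
            new_paths[history + last] = n * (1 - RATE)   # letter stays put
            new_paths[history + other] = n * RATE        # letter flips
        paths = new_paths
    return paths


def hidden_round_trips(start, count, steps):
    """Return (matches, hidden): letters that end where they started, and
    how many of those changed at least once along the way."""
    paths = histories(start, count, steps)
    matches = sum(n for h, n in paths.items() if h[-1] == start)
    untouched = paths[start * (steps + 1)]
    return matches, matches - untouched


# The four two-step histories of the original 100 A's.
for history, n in sorted(histories("A", 100, 2).items()):
    print(" --> ".join(history), float(n))

# Matches and hidden round trips as the number of steps grows.
for steps in range(1, 6):
    matches, hidden = hidden_round_trips("A", 100, steps)
    print(steps, float(matches), float(hidden))
```

For two steps this reproduces the 68 A-to-A matches, 4 of which hide an (A-->B-->A) round trip. After a single step there are no hidden changes at all, which is the reason for wanting sequence pairs that are only one step apart.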

Let's carry on...