Certain substitutions may have a limited effect on gene function, while others
may prove so deleterious that the gene is lost entirely. Consequently a more
elaborate scoring scheme is needed when aligning amino acid codons, one that allows
positive scores for those mismatches that Nature has little trouble accepting and that, in general,
provides scores that reflect the level of that acceptance.
Determining that level of acceptance is the problem, and in particular, given a substitution
like (A>C), we'd like a measure of its acceptance after one evolutionary step.
That is, we'd like to differentiate between the one-step change (A>C) and multistep
changes like (A>?...?>C), where none of the intermediate codons is A or C.
We do this by narrowing our attention to pairs of sequences in our database that satisfy a
certain stringent alignment criterion, like 80% identity. The assumption is that with each step
more mutations occur and identity percentages decline, making it more likely that
any particular codon has mutated more than once. Differences between highly identical pairs are
therefore mostly one-step changes.
For example, let's suppose we have a two-letter alphabet, {A,B}, and after each step 20% of the
A's change to B's, and 20% of the B's change to A's. Suppose we start with a gene containing
100 of each letter. What happens after two steps?
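
One way to work through this example is to track the expected number of positions
for every possible history. The sketch below is only an illustration under the
assumptions stated in the text (20% of each letter changes per step, 100 of each
letter at the start); the names P, STEP, evolve and paths are made up for this
example, and the counts are expected values rather than the outcome of any real
mutation process.

```python
from fractions import Fraction

# Assumed per-step substitution rate from the example: 20%.
P = Fraction(1, 5)

# STEP[x][y] is the fraction of x's that become y in one step.
STEP = {
    "A": {"A": 1 - P, "B": P},
    "B": {"A": P, "B": 1 - P},
}


def evolve(paths, step=STEP):
    """Advance every tracked history by one step.

    paths maps a history string such as "AB" (started as A, now B)
    to the expected number of positions with that history.
    """
    out = {}
    for history, count in paths.items():
        for nxt, prob in step[history[-1]].items():
            key = history + nxt
            out[key] = out.get(key, 0) + count * prob
    return out


# Start with a gene containing 100 of each letter.
paths = {"A": Fraction(100), "B": Fraction(100)}
for _ in range(2):
    paths = evolve(paths)

for history in sorted(paths):
    print(history, paths[history])

# Positions whose final letter matches the starting letter.
identical = sum(n for h, n in paths.items() if h[0] == h[-1])
# Positions that look unchanged but have actually mutated twice.
hidden = paths["ABA"] + paths["BAB"]
print("identical:", identical, "of 200")
print("hidden double mutations:", hidden)
```

After two steps the expected counts are AAA 64, AAB 16, ABA 4, ABB 16 (and the
mirror image for positions that started as B). So 136 of the 200 positions, or 68%,
still match the original, down from 80% after one step, and 8 of those matching
positions have in fact mutated twice (A>B>A or B>A>B).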
