Scoring More.
Ok, so anyway, we rummage in our database and choose all pairs of sequences that, once aligned, match at 80% or more of their positions. We assume that the majority of the mismatches result from single substitutions (and likewise that the vast majority of the matches occur because the residue stayed stable, i.e. did not change). We use all these aligned pairs to build a one-step mutation probability matrix (recall: we saw an example of this on page 6 of the probability chapter).

Let's investigate the process by example. Say we have a 3-letter codon alphabet, {A,B,C}, and two 30-residue sequences aligned as follows:

Sequence Q: ABCCCABAAC ACCCBACBCC CAABBBAABB
Sequence L: CBCCCABCBC BBCABCBBCC ACABBAACBC
Let kA = 16, kB = 20 and kC = 24 be the total numbers of A's, B's and C's, respectively, among the 60 residues of the two sequences. Therefore,
kA+kB+kC = 60.

Let's suppose these sequences are representative of the entire population of genes arising from the 3-letter alphabet. In that case the three fractions,

pA = kA/60 = 16/60, pB = kB/60 = 20/60, pC = kC/60 = 24/60,

are the occurrence probabilities for A, B and C in the total population. Note that pA+pB+pC = 1. That is, there is a probability of 1 (certainty) that any given residue will be filled with an A, B or C.
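The letter counts and occurrence probabilities above can be checked with a short script. This is only a minimal sketch: the variable names (seq_q, seq_l, counts, probs) are made up for illustration, and Fraction is used so the results stay exact rather than rounded.

```python
from collections import Counter
from fractions import Fraction

# The two aligned sequences from the example (spaces are only for readability).
seq_q = "ABCCCABAAC ACCCBACBCC CAABBBAABB".replace(" ", "")
seq_l = "CBCCCABCBC BBCABCBBCC ACABBAACBC".replace(" ", "")

# kA, kB, kC: how often each letter occurs among all 60 residues.
counts = Counter(seq_q + seq_l)
total = sum(counts.values())

# pA, pB, pC: occurrence probabilities, kept as exact fractions.
probs = {letter: Fraction(counts[letter], total) for letter in "ABC"}

for letter in "ABC":
    print(f"k{letter} = {counts[letter]}, p{letter} = {counts[letter]}/{total}")
print("pA + pB + pC =", sum(probs.values()))
```

Note that Fraction stores values in lowest terms, so 16/60 is held as 4/15; the two are equal.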
Construction of the Probability Matrix.
The probability matrix for this case has the following nine elements:

    MAA  MAB  MAC
    MBA  MBB  MBC
    MCA  MCB  MCC

where, for example, MBA = probability that if we start with A, we'll end up with B after one step (NOT vice versa!).
So, for example, the three probabilities, MAA, MBA, MCA, of the first column represent the probabilities of everything that can happen to A:

MAA - (A-->A); and MBA - (A-->B); and MCA - (A-->C).

Since one of these must happen, MAA + MBA + MCA = 1. By a similar argument the sum down each of the three columns must be 1.

We calculate MBA - the probability that A will mutate to B - as follows:
MBA = (# times A aligned with B) / (# times A occurs in total) = 3/kA = 3/16.
Note that (# times A aligned with B) = (# times B aligned with A) = 3; these are the aligned pairs at positions 9, 11 and 26 of the sequences above. But kB does not equal kA, so MAB = 3/20 does not equal MBA = 3/16. The matrix is not symmetric.
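To see where the shared count of 3 and the two different denominators come from, here is a small sketch that counts the A/B columns of the alignment and divides by kA and kB in turn. The helper aligned_count and the other names are hypothetical, chosen only for this illustration.

```python
from collections import Counter
from fractions import Fraction

seq_q = "ABCCCABAAC ACCCBACBCC CAABBBAABB".replace(" ", "")
seq_l = "CBCCCABCBC BBCABCBBCC ACABBAACBC".replace(" ", "")

# Total occurrences of each letter across both sequences (kA, kB, kC).
k = Counter(seq_q + seq_l)

def aligned_count(x, y):
    """Number of alignment columns where x faces y, in either order."""
    return sum(1 for q, l in zip(seq_q, seq_l) if {q, l} == {x, y})

n_ab = aligned_count("A", "B")      # same count whichever letter we start from
m_ba = Fraction(n_ab, k["A"])       # A --> B: divide by the number of A's
m_ab = Fraction(n_ab, k["B"])       # B --> A: divide by the number of B's

# 1-based positions of the A/B columns in the alignment.
positions = [i + 1 for i, (q, l) in enumerate(zip(seq_q, seq_l))
             if {q, l} == {"A", "B"}]

print("A/B columns at positions:", positions)
print("MBA =", m_ba, " MAB =", m_ab)
```

The numerator is symmetric in A and B; only the denominator changes, which is what makes the matrix asymmetric.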

Finally, having determined MBA = 3/16 and, by the same counting, MCA = 7/16 (A is aligned with C seven times), we set MAA = 1 - MBA - MCA = 6/16. (In reality when dealing with genome databases, diagonal probabilities like MAA, which measure the likelihood a codon will NOT change, will be much closer to 1, and all off-diagonal probabilities, like MBA, much closer to zero.)
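The recipe for the first column (count the mismatched pairs, divide by kA, then fix the diagonal so the column sums to 1) can be sketched as a small function. The function name column and the dictionary layout are assumptions made for this illustration, not notation from the text.

```python
from collections import Counter
from fractions import Fraction

seq_q = "ABCCCABAAC ACCCBACBCC CAABBBAABB".replace(" ", "")
seq_l = "CBCCCABCBC BBCABCBBCC ACABBAACBC".replace(" ", "")
letters = "ABC"

# Total occurrences of each letter across both sequences (kA, kB, kC).
k = Counter(seq_q + seq_l)

def column(start):
    """One-step probabilities for everything that can happen to `start`."""
    col = {}
    for end in letters:
        if end != start:
            # Columns of the alignment where start faces end, in either order.
            pairs = sum(1 for q, l in zip(seq_q, seq_l) if {q, l} == {start, end})
            col[end] = Fraction(pairs, k[start])
    # The diagonal entry is whatever is left so that the column sums to 1.
    col[start] = 1 - sum(col.values())
    return col

col_a = column("A")
for end in letters:
    print(f"M{end}A = {col_a[end]}")
```

Fraction prints values in lowest terms, so MAA = 6/16 appears as 3/8. Calling column("B") or column("C") applies the same rule to the other two columns.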

On the next page we'll write out the full one-step mutation probability matrix for this case and show how to use it to get multistep matrices.