Sequence Q: |
A | B | C | C | C | A | B | A | A | C |
A | C | C | C | B | A | C | B | C | C |
C | A | A | B | B | B | A | A | B | B |
Sequence L: |
C | B | C | C | C | A | B | C | B | C |
B | B | C | A | B | C | B | B | C | C |
A | C | A | B | B | A | A | C | B | C |
Random Alignment Probabilities.
The two sequences above are assumed to be biological, and to be subject to biological constraints.
One place we see this is in the observed codon probabilities pA=16/60, pB=20/60 and pC=24/60,
counted over the 60 codons of Q and L combined. In a perfectly random world we might expect the
A, B and C codons to occur with equal probability, 20/60 = 1/3 each. But in this imaginary world
with its imaginary genes, the C codon is more useful biologically than the A, so the A's occur
less frequently than the C's.
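The codon probabilities quoted above can be checked by counting. The following is a minimal sketch in Python; the variable names are illustrative, and the two strings are simply the Q and L tables transcribed row by row.

```python
from collections import Counter
from fractions import Fraction

# Sequences Q and L, transcribed row by row from the tables above.
Q = "ABCCCABAAC" "ACCCBACBCC" "CAABBBAABB"
L = "CBCCCABCBC" "BBCABCBBCC" "ACABBAACBC"

# Count every codon in both genes together (60 codons in all).
counts = Counter(Q + L)
total = sum(counts.values())

# Observed probability of each codon, kept as an exact fraction.
p = {codon: Fraction(counts[codon], total) for codon in "ABC"}

for codon in "ABC":
    print(f"p{codon} = {counts[codon]}/{total} = {float(p[codon]):.5f}")
```

Fractions are used rather than floats so that later sums can be compared with 1 exactly.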
There is yet another level at which biology reduces randomness: a preference for some
mutations over others. Suppose we have a large pool of A, B and C codons from which we want to
build a 30-residue gene, and each time we draw a codon to fill a residue the chance is 16/60 that
it is an A, 20/60 that it is a B, and 24/60 that it is a C. Then on average such genes will have the same
fractions of A's, B's and C's as the genes Q and L above taken together. Now let's make another random
30-residue gene with the same codon probabilities, and align it with the first. Take a look at position 1.
What is the probability that it's an A aligned with an A? This probability, denoted PAA, is
just the probability that the first sequence has an A in that position times the probability that the
second sequence has an A in that position, because the sequences are independent. The same
argument can be made for all the pairwise random alignment probabilities. This gives us a matrix of
random pairwise alignment probabilities:
PAA=pApA=0.07111 | PAB=pApB=0.08889 | PAC=pApC=0.10667 |
PBA=pBpA=0.08889 | PBB=pBpB=0.11111 | PBC=pBpC=0.13333 |
PCA=pCpA=0.10667 | PCB=pCpB=0.13333 | PCC=pCpC=0.16 |
Note that this P-matrix is symmetric.
Since these nine alignments are everything that can happen, the sum of these nine probabilities must
be 1 (NOTE: this is not a probability matrix of the kind developed on the previous pages; in that case
the sum of the probabilities down each column was 1).
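As a quick check on the random alignment probabilities, the P-matrix can be built as the product of the codon probabilities for each pair. This is a minimal sketch; the dictionary layout keyed by (row, column) codon pairs is just one convenient choice, not something taken from the original pages.

```python
from fractions import Fraction

# Codon probabilities quoted in the text.
p = {"A": Fraction(16, 60), "B": Fraction(20, 60), "C": Fraction(24, 60)}

# Random alignment probability of codon x opposite codon y:
# the two sequences are independent, so the probabilities multiply.
P = {(x, y): p[x] * p[y] for x in "ABC" for y in "ABC"}

for x in "ABC":
    print(" | ".join(f"P{x}{y}={float(P[(x, y)]):.5f}" for y in "ABC"))

# The nine cases are exhaustive, so the entries must sum to exactly 1.
assert sum(P.values()) == 1
# Symmetry: PXY equals PYX for every pair.
assert all(P[(x, y)] == P[(y, x)] for x in "ABC" for y in "ABC")
```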
NonRandom Alignment Probabilities.
This is the mutation probability matrix from the previous page:
6/16 | 3/20 | 7/24 |
3/16 | 14/20 | 3/24 |
7/16 | 3/20 | 14/24 |
From this we will form a new matrix the components of which are defined below:
qAA=MAApA=(6/60)=0.1 | qAB=MABpB=(3/60)=0.05 | qAC=MACpC=(7/60)=0.11667 |
qBA=MBApA=(3/60)=0.05 | qBB=MBBpB=(14/60)=0.23333 | qBC=MBCpC=(3/60)=0.05 |
qCA=MCApA=(7/60)=0.11667 | qCB=MCBpB=(3/60)=0.05 | qCC=MCCpC=(14/60)=0.23333 |
This matrix, like the P-matrix, is symmetric, and also like the P-matrix the sum of all nine components is 1.
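The q-matrix can also be recomputed from the two sequences. The sketch below assumes that the mutation matrix was tallied by counting each aligned pair in both directions (Q against L and L against Q). That tallying rule is an assumption here, since the previous page is not reproduced, but it does give the M values listed above.

```python
from collections import Counter
from fractions import Fraction

# Sequences Q and L, transcribed row by row from the tables above.
Q = "ABCCCABAAC" "ACCCBACBCC" "CAABBBAABB"
L = "CBCCCABCBC" "BBCABCBBCC" "ACABBAACBC"

# Assumed tallying rule: each aligned pair is counted in both directions,
# which makes the pair counts symmetric.
pair_counts = Counter()
for a, b in zip(Q, L):
    pair_counts[(a, b)] += 1
    pair_counts[(b, a)] += 1

codon_counts = Counter(Q + L)
total = sum(codon_counts.values())
p = {y: Fraction(codon_counts[y], total) for y in "ABC"}

# MXY = (# times Y aligned with X) / (# times Y occurs); columns sum to 1.
M = {(x, y): Fraction(pair_counts[(x, y)], codon_counts[y])
     for x in "ABC" for y in "ABC"}

# qXY = MXY * pY = (# times Y aligned with X) / (# of total codons).
q = {(x, y): M[(x, y)] * p[y] for x in "ABC" for y in "ABC"}

for x in "ABC":
    print(" | ".join(f"q{x}{y}={float(q[(x, y)]):.5f}" for y in "ABC"))

assert sum(q.values()) == 1                                        # sums to 1
assert all(q[(x, y)] == q[(y, x)] for x in "ABC" for y in "ABC")   # symmetric
```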
Let's take a look at one of the components and make sense of it. For example,
qBA=MBApA
= [(# times A aligned with B)/(# times A occurs)] x [(# times A occurs)/(# of total codons)]
= (# times A aligned with B)/(# of total codons) = (# times B aligned with A)/(# of total codons)
= qAB.
This would seem to be exactly the same as PBA, that is, the probability of finding an
A aligned with a B. The difference is that PBA has no connection to the idea that some
alignments are biologically preferred; it's a random alignment probability. qBA, on the
other hand, is defined in terms of MBA, the value of which is determined by studying
a biological genome database (which in this case has two genes).
The Point.
Suppose qBA > PBA. That means an A aligned with a B, as the result
of a mutation, occurs more frequently than we would expect from purely random alignment.
That is, Nature likes this idea, and if Nature likes it, then we had better score it positively,
despite the fact that it's a mismatch.
Likewise, if qBA < PBA, then Nature disapproves. To keep her mollified we
choose to score this mismatch negatively. The way we achieve reasonable scores is to produce
from P and q a log-odds matrix, which we'll do on the next page.
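Before any logarithms are taken, the comparison itself can be carried out directly. The sketch below only reports which way each q entry falls relative to the corresponding P entry; it does not compute log-odds scores, and the "positive"/"negative" labels are just shorthand for the sign each score should have under the rule above.

```python
from fractions import Fraction

# Codon probabilities and q entries (in 60ths) as given in the text.
p = {"A": Fraction(16, 60), "B": Fraction(20, 60), "C": Fraction(24, 60)}
q_sixtieths = {
    ("A", "A"): 6, ("A", "B"): 3,  ("A", "C"): 7,
    ("B", "A"): 3, ("B", "B"): 14, ("B", "C"): 3,
    ("C", "A"): 7, ("C", "B"): 3,  ("C", "C"): 14,
}
q = {pair: Fraction(n, 60) for pair, n in q_sixtieths.items()}

# Compare each observed probability q with the random expectation P.
sign = {}
for (x, y), q_xy in q.items():
    P_xy = p[x] * p[y]
    if q_xy > P_xy:
        sign[(x, y)] = "positive"
    elif q_xy < P_xy:
        sign[(x, y)] = "negative"
    else:
        sign[(x, y)] = "zero"
    print(f"q{x}{y}={float(q_xy):.5f}  P{x}{y}={float(P_xy):.5f}  -> {sign[(x, y)]}")
```

With these numbers the three matches come out positive, the A-B and B-C mismatches come out negative, and the A-C mismatch comes out positive, since qAC = 0.11667 exceeds PAC = 0.10667.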