End Gaps

Consider the following alignment:

A C G G A C C - A G G T G A
- - - - A C C T A - - - - -

There are four matches, yielding +4 points, and 10 columns that contain a gap, which, at a gap penalty of -2 each, yield -20 points, for a total score of -16.

Now realign the sequences as follows:

A C G G A C C A G G T G A
A C - - - C - - - - T - A

Now we have five matches, yielding +5 points, and 8 columns that contain a gap, yielding -16 points, for a total of -11 points. Is this better?
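The arithmetic for both alignments can be checked with a short Python sketch. It assumes the scoring described above (+1 per match, -2 per column containing a gap). The mismatch score is an assumption, since the text gives none and neither alignment contains a mismatch. The function name score_alignment is made up for this illustration.

```python
MATCH = 1      # from the text: each match scores +1
GAP = -2       # from the text: each column containing a gap scores -2
MISMATCH = 0   # assumption: not stated in the text, and unused by these examples


def score_alignment(top: str, bottom: str) -> int:
    """Sum the column scores of two equal-length aligned rows ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += GAP
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total


# The two alignments shown above, written without spaces.
first = ("ACGGACC-AGGTGA", "----ACCTA-----")
second = ("ACGGACCAGGTGA", "AC---C----T-A")

print(score_alignment(*first))   # -16
print(score_alignment(*second))  # -11
```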

Well, -11 is a better score than -16 mathematically, but the context of our comparison is biology. We're interested in the idea that the similarity of the two sequences may be due to a series of changes over time: insertions, deletions, etc. In that light the first alignment is more interesting than the second despite its lower score. It is more interesting to view the string ACCTA as serving some biological purpose that, in being inserted in the first string almost intact, improved that sequence. The letters of a short sequence can always be spread out along a longer one, as was done in the second alignment, but there's almost no way to determine whether such a match is significant.

The problem lies with the end gaps, the gaps that lie entirely to the left or right of one of the two sequences. We shouldn't be punishing them if we're not interested in the global alignment, i.e., the alignment of the full sequences, and most of the time we're not. As a first step toward fully local alignments, we'll look at partially local alignments, which don't punish end gaps, and we'll do that on the next page. That page will be much like page 3 of this section: a JavaScript program that will ask you to input two sequences.
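To see what dropping the end-gap penalty does to the two alignments above, here is a minimal Python sketch. It only rescores an alignment that has already been written out; it does not search for the best one. The function name score_ignoring_end_gaps and the mismatch score of 0 are assumptions made for this illustration, and the match and gap values are the ones used earlier on this page.

```python
def score_ignoring_end_gaps(top: str, bottom: str,
                            match: int = 1, mismatch: int = 0, gap: int = -2) -> int:
    """Score an alignment, skipping columns left or right of either sequence.

    mismatch=0 is an assumption; the examples here contain no mismatches.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")

    def residue_span(row: str) -> tuple[int, int]:
        # Positions of the first and last real letters in a row.
        positions = [i for i, ch in enumerate(row) if ch != "-"]
        if not positions:
            raise ValueError("a row must contain at least one letter")
        return positions[0], positions[-1]

    top_first, top_last = residue_span(top)
    bottom_first, bottom_last = residue_span(bottom)

    # Columns outside this range are end gaps for one of the rows.
    start = max(top_first, bottom_first)
    end = min(top_last, bottom_last)

    total = 0
    for a, b in zip(top[start:end + 1], bottom[start:end + 1]):
        if a == "-" or b == "-":
            total += gap        # interior gaps are still penalized
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(score_ignoring_end_gaps("ACGGACC-AGGTGA", "----ACCTA-----"))  # 2
print(score_ignoring_end_gaps("ACGGACCAGGTGA", "AC---C----T-A"))    # -11
```

With end gaps free, the first alignment keeps its four matches and pays only for the single interior gap, while the second alignment has no end gaps and is unchanged. Under these assumptions the biologically more interesting alignment is now also the higher-scoring one.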