Assistant Professor of Biology
Procrustes Meets Theseus: Maximum Likelihood Superpositions
The Protein Data Bank is a collection of protein structures against which labs can compare newly solved structures. The comparison is done by “superpositioning,” fitting the new structure onto an existing one so that corresponding atoms lie as close together as possible. But structural models are imprecise, and the variance can be very high in some regions. To find the best fit, scientists conventionally use a statistical approach called the least-squares method, which takes the best superposition to be the one with the minimum sum of squared differences between the structures. Least squares, however, assumes that every atom has equal variance. Because that assumption fails where the variance is high, scientists often just trim off the regions that cannot be superimposed, losing information that could be important.
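As a rough illustration of the least-squares criterion, the sketch below superimposes two small sets of points in two dimensions, where the best rotation angle has a simple closed form. The function names and the toy coordinates are invented for this example; real protein structures are three-dimensional and are usually fitted with a matrix-based method instead.

```python
import math

def centroid(points):
    """Mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def superpose_least_squares(mobile, target):
    """Rotate and translate `mobile` onto `target` (2D), minimizing the
    sum of squared distances between corresponding points."""
    cm, ct = centroid(mobile), centroid(target)
    # Center both structures; the best translation aligns the centroids.
    m = [(x - cm[0], y - cm[1]) for x, y in mobile]
    t = [(x - ct[0], y - ct[1]) for x, y in target]
    # In 2D the optimal rotation angle is atan2(sum of cross, sum of dot).
    dot = sum(mx * tx + my * ty for (mx, my), (tx, ty) in zip(m, t))
    cross = sum(mx * ty - my * tx for (mx, my), (tx, ty) in zip(m, t))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * mx - s * my + ct[0], s * mx + c * my + ct[1]) for mx, my in m]

def sum_squared_distance(a, b):
    """The quantity that least squares minimizes."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b))

# Hypothetical toy "structure" and a rotated, shifted copy of it.
target = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
angle = math.radians(40)
mobile = [(math.cos(angle) * x - math.sin(angle) * y + 5.0,
           math.sin(angle) * x + math.cos(angle) * y - 2.0) for x, y in target]

fitted = superpose_least_squares(mobile, target)
print(sum_squared_distance(mobile, target), sum_squared_distance(fitted, target))
```

Every point contributes equally to the sum being minimized, which is the equal-variance assumption in practice.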
Dr. Theobald discussed a new method of analyzing the similarity of structures based on the maximum likelihood principle. Maximum likelihood downweights the variable regions instead of trimming them off, which gives a better statistical fit. To use this method, a statistical model (Gaussian, for example) is first chosen to best represent the data. The model's parameters, such as the mean structure and a covariance matrix, are then set to the values that give the observed data the highest probability. In the covariance matrix, each atom in the structure has its own variance and can co-vary with other atoms. Maximum likelihood uses the covariance matrix to downweight the variable regions.
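The effect of downweighting can be sketched with a simplified two-dimensional example. It assumes that each atom's variance is already known and that atoms do not co-vary, in which case a Gaussian maximum likelihood fit reduces to least squares with each atom weighted by 1 / variance. This is not the Theseus algorithm, which would also have to estimate the covariance matrix from the data; the function names and coordinates are invented for illustration.

```python
import math

def superpose_weighted(mobile, target, variances):
    """Fit `mobile` onto `target` in 2D, weighting each atom by 1 / variance.
    Assumes known, independent per-atom variances (a simplification)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)

    def weighted_centroid(points):
        return (sum(w * x for w, (x, _) in zip(weights, points)) / total,
                sum(w * y for w, (_, y) in zip(weights, points)) / total)

    cm, ct = weighted_centroid(mobile), weighted_centroid(target)
    m = [(x - cm[0], y - cm[1]) for x, y in mobile]
    t = [(x - ct[0], y - ct[1]) for x, y in target]
    # Weighted versions of the dot and cross sums give the optimal angle.
    dot = sum(w * (mx * tx + my * ty) for w, (mx, my), (tx, ty) in zip(weights, m, t))
    cross = sum(w * (mx * ty - my * tx) for w, (mx, my), (tx, ty) in zip(weights, m, t))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * mx - s * my + ct[0], s * mx + c * my + ct[1]) for mx, my in m]

def core_error(fitted, target, n_core):
    """Sum of squared distances over the first `n_core` (well-ordered) atoms."""
    return sum((fx - tx) ** 2 + (fy - ty) ** 2
               for (fx, fy), (tx, ty) in zip(fitted[:n_core], target[:n_core]))

# Hypothetical data: four rigid "core" atoms plus one atom in a variable region.
target = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (3.0, 3.0)]
angle = math.radians(25)
mobile = [(math.cos(angle) * x - math.sin(angle) * y + 4.0,
           math.sin(angle) * x + math.cos(angle) * y - 1.0) for x, y in target]
mobile[4] = (mobile[4][0] + 2.0, mobile[4][1] - 1.5)  # displace the variable atom

equal = superpose_weighted(mobile, target, [1.0] * 5)  # ordinary least squares
down = superpose_weighted(mobile, target, [0.01, 0.01, 0.01, 0.01, 100.0])
print(core_error(equal, target, 4), core_error(down, target, 4))
```

With equal variances the displaced atom drags the whole fit away from the core. Giving it a large variance keeps it in the calculation while letting the well-determined atoms dominate, so nothing has to be trimmed.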
Dr. Theobald has written a program, Theseus, which uses maximum likelihood for superpositioning. In tests on simulated structures, a comparison between the two statistical methods showed maximum likelihood to be much more accurate than least squares, with less information lost due to variance. The results suggest that this method will prove very helpful in structural biology.