Jun Zhu, Roland Luethy and Charles E. Lawrence
Department of Computational Biology,
MS-29-2-A, Amgen, Inc.
One Amgen Center Drive, Thousand Oaks, CA 91320
Wadsworth Center for Laboratories & Research, NYSDOH
P.O. Box 509, Empire State Plaza, Albany, NY 12201
Abstract
Protein sequence databases grow larger every day. A common
challenge is to predict the structures or functions of the sequences
they contain. This is straightforward when a sequence shares direct
similarity with a well-characterized protein. When there is no direct
similarity, we must rely on a third sequence or a model as an
intermediate to link the two proteins. We developed a new model-based
method, called Bayesian search, as a means to connect two distantly
related proteins.
We compared the Bayesian search model with pairwise and multiple
sequence comparison methods on structural databases, using structural
similarity as the criterion for relationship. The results show
that Bayesian search links more distantly related sequence
pairs than the other methods, collectively and consistently over large
protein families. At an average of one error per query against the
SCOP database PDB40D-B, Bayesian search found 36.5% of related
pairs, PSI-BLAST found 32.6%, and the Smith-Waterman method found
25%. Examples are presented to show that the alignments predicted
by Bayesian search agree well with structural alignments.
False positives found by Bayesian search at low cutoff values
are also analyzed.
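
The coverage figures quoted above (the fraction of related pairs found when each query makes one error on average) can be illustrated with a minimal sketch. The abstract does not spell out the evaluation procedure, so the following assumes one plausible reading: hits from all queries are pooled, ranked by score with higher scores taken as more significant, and accepted until the number of false positives would exceed the allowed errors per query times the number of queries. The function name coverage_at_epq, its parameters, and the toy hit list are hypothetical and are not taken from the paper.

```python
def coverage_at_epq(hits, n_queries, n_true_pairs, errors_per_query=1.0):
    """Fraction of truly related pairs recovered at a given error rate.

    hits: iterable of (score, is_related) tuples pooled over all queries.
          Assumption: a higher score means a more significant hit.
    n_queries: number of query sequences searched.
    n_true_pairs: total number of related pairs defined by the structural
          classification (the denominator of coverage).
    errors_per_query: allowed average number of false positives per query.
    """
    max_errors = errors_per_query * n_queries
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)

    true_found = 0
    false_found = 0
    for _score, is_related in ranked:
        if is_related:
            true_found += 1
        else:
            # Stop before accepting a false positive that breaks the budget.
            if false_found + 1 > max_errors:
                break
            false_found += 1
    return true_found / n_true_pairs


# Toy data, invented purely for illustration (not results from the paper).
toy_hits = [
    (9.0, True), (8.0, True), (7.0, False), (6.0, True),
    (5.0, False), (4.0, False), (3.0, True),
]
toy_coverage = coverage_at_epq(toy_hits, n_queries=2, n_true_pairs=5)
print(f"coverage at 1 error/query: {toy_coverage:.2f}")
```

Pooling the hits before ranking means one cutoff is applied to every query, which is what lets a single coverage number be compared across search methods. A per-query cutoff would be a different design and would generally give different numbers.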