Rolf Olsen, Ralf Bundschuh, Terence Hwa
Abstract
The statistical significance
of gapped local alignments is characterized by analyzing the extremal
statistics of the scores obtained from the alignment of random
amino acid sequences. By identifying a complete set of linked
clusters, which we call "islands," we devise a method that accurately
predicts the extremal score statistics from only one to a
few pairwise alignments. The success of our method relies crucially
on the link between the statistics of island scores and extremal
score statistics. This link is motivated by heuristic arguments
and firmly established by extensive numerical simulations for
a variety of scoring parameter settings and sequence lengths.
Our approach is several orders of magnitude faster than the widely
used shuffling method, since island counting is trivially incorporated
into the basic Smith-Waterman alignment algorithm with minimal
computational cost, and all islands are counted in a single alignment.
The availability of a rapid and accurate significance estimation
method gives one the flexibility to fine-tune scoring parameters
to