Walter L. Ruzzo and Martin Tompa
Abstract
Given a sequence of real numbers
("scores"), we present a practical linear time algorithm
to find those nonoverlapping, contiguous subsequences having greatest
total scores. This improves on the best previously known algorithm,
which requires quadratic time in the worst case. The problem arises
in biological sequence analysis, where the high-scoring subsequences
correspond to regions of unusual composition in a nucleic acid
or protein sequence. For instance, Altschul, Karlin, and others
have used this approach to identify transmembrane regions, DNA
binding domains, and regions of high charge in proteins.