Jan Gorodkin, Ole Lund, Claus A. Andersen, and Soren Brunak
Abstract
Correlations between sequence
separation (in residues) and distance (in ångströms) of any pair
of amino acids in polypeptide chains are investigated. For each
sequence separation we define a distance threshold. For pairs
of amino acids where the distance between C-alpha atoms is smaller
than the threshold, a characteristic sequence (logo) motif is
found. The motifs change as the sequence separation increases:
for small separations they consist of one peak located in between
the two residues, then additional peaks at these residues appear,
and finally the center peak smears out for very large separations.
We also find correlations between the residues in the center of
the motif. This and other statistical analyses are used to design
neural networks with enhanced performance compared to earlier
work. Importantly, the statistical analysis explains why neural
networks perform better than simple statistical data-driven approaches
such as pair probability density functions. The statistical results
also explain characteristics of the network performance for increasing
sequence separation. The improvement of the new network design
is significant in the sequence separation range 10--30 residues.
Finally, we find that the performance curve for increasing sequence
separation is directly correlated with the corresponding information
content. A WWW server, distanceP, is available at
http://www.cbs.dtu.dk/services/distanceP/.
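
As an illustration of the pair selection described above, the following minimal Python sketch collects all residue pairs at a given sequence separation whose C-alpha atoms lie closer than a threshold. The function name `pairs_below_threshold`, the toy coordinates, and the threshold of 5.0 Angstrom are hypothetical choices made for this example only; they are not the thresholds or the code used in this work.

```python
import math


def pairs_below_threshold(ca_coords, separation, threshold):
    """Return index pairs (i, i + separation) whose C-alpha atoms are
    closer than `threshold` (same length unit as the coordinates).

    ca_coords  : list of (x, y, z) C-alpha coordinates, one per residue
    separation : sequence separation in residues (positive integer)
    threshold  : distance cutoff for this separation (an assumed value)
    """
    if separation < 1:
        raise ValueError("separation must be a positive integer")
    pairs = []
    for i in range(len(ca_coords) - separation):
        j = i + separation
        # Euclidean distance between the two C-alpha atoms
        if math.dist(ca_coords[i], ca_coords[j]) < threshold:
            pairs.append((i, j))
    return pairs


# Toy chain of four residues (coordinates are made up for illustration).
toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.8, 0.0)]

# Hypothetical threshold of 5.0 for sequence separation 2.
close_pairs = pairs_below_threshold(toy_chain, separation=2, threshold=5.0)
```

In this sketch each sequence separation would be given its own threshold, and the sequence windows around the selected pairs could then be aligned to build a logo for that separation.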