Isidore Rigoutsos, Yuan Gao, Aris Floratos, Laxmi Parida
Abstract
We have used the Teiresias algorithm to carry out unsupervised
pattern discovery in a database containing the unaligned ORFs
from the 17 publicly available complete archaeal and bacterial
genomes and build a 1D dictionary of motifs. These motifs, which
we refer to as seqlets, cover 97.88% of this genomic input
at the level of amino acid positions. Each of the seqlets
in this 1D dictionary was located among the sequences in Release
38.0 of the Protein Data Bank and the structural fragments corresponding
to each seqlet's instances were identified and aligned in three
dimensions; the seqlets whose aligned fragments gave RMSD errors
below a pre-selected threshold of 2.5 Angstroms were entered in
a 3D dictionary of structurally conserved seqlets. These two dictionaries
can be thought of as cross-indices that facilitate the tackling
of tasks such as automated functional annotation of genomic sequences,
local homology identification, local structure characterization,
comparative genomics, etc.
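
As an illustration of what coverage "at the level of amino acid positions" could mean, the following is a minimal sketch and not the Teiresias implementation. It assumes that each seqlet is written as a pattern in which `.` is a wildcard position and `[ST]`-style brackets denote alternative residues, and that every position spanned by a match, wildcards included, counts as covered. The function name `coverage` and the toy sequences are hypothetical.

```python
import re


def coverage(sequences, patterns):
    """Return the fraction of residue positions inside at least one match.

    Assumption: patterns contain only uppercase residue letters, '.'
    wildcards, and bracketed residue sets, so they are valid regular
    expressions as written.
    """
    total_positions = 0
    covered_positions = 0
    for seq in sequences:
        covered = [False] * len(seq)
        for pattern in patterns:
            # A lookahead with a capture group finds overlapping matches too.
            regex = re.compile("(?=(" + pattern + "))")
            for match in regex.finditer(seq):
                for i in range(match.start(1), match.end(1)):
                    covered[i] = True
        total_positions += len(seq)
        covered_positions += sum(covered)
    # An empty input has no positions to cover.
    return covered_positions / total_positions if total_positions else 0.0


if __name__ == "__main__":
    toy_sequences = ["MKTAYGLLKV", "AAGQQKAA"]  # hypothetical ORF translations
    toy_seqlets = ["A.G..K"]                    # hypothetical seqlet
    print(f"coverage = {coverage(toy_sequences, toy_seqlets):.4f}")
```

Whether wildcard positions should count toward coverage is a design choice made here for simplicity; counting only the literal positions would give a lower figure.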