Christian Iseli, C. Victor Jongeneel, Philipp Bucher
Abstract
One of the problems associated
with the large-scale analysis of unannotated, low-quality EST
sequences is the detection of coding regions and the correction
of frameshift errors that they often contain. We introduce a new
type of hidden Markov model that explicitly deals with the possibility
of errors in the sequence being analyzed and incorporates a method
for correcting these errors. This model was implemented in an
efficient and robust program, ESTScan. We show that ESTScan can
detect and extract coding regions from low-quality sequences with
high selectivity and sensitivity, and is able to accurately correct
frameshift errors. In the framework of genome sequencing projects,
ESTScan could become a useful tool for gene discovery, for
quality control, and for the assembly of contigs representing
the coding regions of genes.