ISMB99 - Tutorial 6

 


EST clustering

Robert Miller, Alan Christoffels, Winston Hide

This tutorial is based on what we have learned over the past three years about clustering technologies and the issues surrounding the field. We will teach users how to perform clustering with the STACK_PACK series of tools, which employ widely known programs such as PHRAP and Cross_match as well as less well-known tools such as CONTIGPROC and CRAW. We will also describe the major problems involved in the clustering process, suggest appropriate strategies, and provide a framework upon which a clustering strategy can be built. The tools will be made available to all registered users.

With easy access to the technology for generating expressed sequence tags (ESTs), several groups have each sequenced anywhere from thousands to several hundred thousand ESTs. Currently the majority of the sequenced coding portion of the human genome is in the form of ESTs, and the need to discover the full-length cDNA of each human gene is frustrated by the partial nature of this data. There is significant value in attempting to consolidate gene sequences as they are produced, in lieu of a yet-to-be-completed reference sequence. ESTs offer a rapid and inexpensive route to gene discovery, reveal expression and regulation data (Vasmatzis et al., 1998), highlight gene sequence diversity and splicing (Wolfsberg and Landsman, 1997), and may identify more than half of known human genes (Hillier et al., 1996). The price of the high-volume, high-throughput nature of the data, however, is that ESTs have high error rates (Aaronson et al., 1996), do not have a defined protein product, are not curated in a highly annotated form, and present only a raw substrate for sequence matching. Unfortunately, most EST data remain unprocessed and thus do not yield the high-value consensus sequence information they contain. The low-quality sequence data can be much improved, but achieving quality information requires pre-processing, clustering, and post-processing of the sequences. In addition, the resulting sequences need a great deal of alignment post-processing.

References:

Vasmatzis, G., M. Essand, U. Brinkmann, B. Lee, and I. Pastan.
Discovery of three genes specifically expressed in human prostate by expressed sequence tag database analysis.
Proc. Natl. Acad. Sci. USA 95(1):300-304, 1998

Wolfsberg, T.G. and D. Landsman.
A comparison of expressed sequence tags (ESTs) to human genomic sequences.
Nucleic Acids Research 25(8):1626-1632, 1997

Hillier, L., N. Clark, T. Dubuque, K. Elliston, M. Hawkins, M. Holman, M. Hultman, T. Kucaba, M. Le, G. Lennon, M. Marra, J. Parsons, L. Rifkin, T. Rohlfing, M. Soares, F. Tan, E. Trevaskis, R. Waterston, A. Williamson, P. Wohldmann, and R. Wilson.
Generation and Analysis of 280,000 Human Expressed Sequence Tags.
Genome Research 6:807-828, 1996

Aaronson, J.S., B. Eckman, R.A. Blevins, J.A. Borkowski, J. Myerson, S. Imran, and K.O. Elliston.
Toward the Development of a Gene Index to the Human Genome: An Assessment of the Nature of High-throughput EST Sequence Data.
Genome Research 6:829-845, 1996
