ISMB99 - Tutorial 4

 


Computational and statistical challenges involved in reconstructing evolutionary trees


Tandy Warnow, Junhyong Kim

All biological disciplines are united by the idea that species share a common history. The genealogical history of life - also called an evolutionary tree - is usually represented by a bifurcating, leaf-labelled tree. The use of evolutionary trees is a fundamental step in many biological problems, such as multiple sequence alignments, protein structure and function prediction, and drug design. Unfortunately, inferring evolutionary trees is an enormously difficult problem for several reasons. For one, the phylogeny problem is a difficult statistical problem because its parameter space has a complicated structure, and there is no 'off the shelf' solution to the phylogeny problem that can be applied. The phylogeny problem also presents a considerable computational challenge. Typical data sets now consist of several hundred species, and presently availably tree reconstruction methods are inadequate to the task of analyzing such datasets. For example, an rbcL DNA sequence data set of 500 plants has been analyzed for several years now, without solution. The explanation for why these analyses are so difficult is simple: the optimization problems are NP-hard, and the heuristics used in an attempt to solve these optimization problems use hill-climbing techniques to search through an exponentially large space of phylogenetic trees.

Over the last decade or so, computer scientists have become increasingly fascinated by the algorithmic challenges involved in phylogeny reconstruction, and the major conferences (STOC, FOCS, ESA, SODA, etc.) and journals have had many papers addressing phylogeny estimation. Statisticians and biologists have also recently made significant progress towards understanding the evolutionary process that operates on biomolecular sequences, and analyzing the performance of phylogenetic methods under these models. One of the exciting outcomes of just the last few years is the use of statistical models to guide algorithm development, and the validation of new methods through experimental performance studies. At the same time, it is clear that even the best of the newly developed methods are unlikely in themselves to solve the problem of large scale tree reconstruction, and that further algorithm development is needed before we can have accurate phylogenetic reconstruction of very large trees within reach.

The goal of this tutorial is to make such progress possible. We will teach the basic issues involved in phylogenetic analysis, survey standard methods, and describe the basic proof techniques that have been developed in the last few years to analyze these methods for their expected performance under stochastic models of evolution. We will then teach the new methods that have been developed by the algorithms community, and present the theoretical performance guarantees they provide. Experimental performance studies (some based upon simulating evolution on large trees, and others based upon real data sets) will play a primary role in this tutorial. Open problems and critical application areas will be discussed in depth in the last portion of the tutorial.

No background in probability, statistics, or phylogenetics will be assumed.

-> Tutorial Handouts
-> Tutorial Program
-> ISMB 99