ISMB99 - Tutorial 1
ABSTRACT Large databases of biological information create both challenging data-mining problems and opportunities. Probabilistic Graphical Models (PGMs) emerged as one of the most practical and unifying concepts in statistics in the 1990s, and provide a general foundation for addressing biological modeling problems. Biological data are best modeled within a Bayesian probabilistic framework, which provides a solid mathematical foundation as well as axiomatic unicity. More importantly, probabilistic modeling offers a consistent and systematic way of dealing with the pervasive noise and variability that evolutionary tinkering has etched into biological systems and data. Across a variety of bioinformatics problems--from gene finding to protein structure and function prediction--the idea is to construct parameterized probabilistic models and fit them to the data using the tools of probability theory. This machine-learning approach is ideally suited to domains characterized by large amounts of data, ``noisy'' patterns, and the absence of general theories. The fundamental idea behind these approaches is to learn the theory automatically from the data, through a process of inference, model fitting, or learning from examples. By necessity, such probabilistic models are often complex and lead to the manipulation of high-dimensional probability distributions over the space of data, model parameters, and hidden variables. In general, such distributions are intractable. PGMs provide a systematic set of techniques for approximating such distributions by factorizing them: independence assumptions between subsets of the variables are introduced and reflected in the sparse connectivity of the resulting graphs. Many of the most useful statistical models routinely used in computational molecular biology--neural networks, Kalman filters, hidden Markov models (HMMs), Bayesian networks, and stochastic grammars--are special cases of PGMs.
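The factorization idea above can be made concrete with a minimal sketch (an illustrative toy, not part of the tutorial materials): a first-order Markov chain over a two-letter alphabet, where the independence assumption "each symbol depends only on its predecessor" reduces a full joint table of size 2^n to an initial distribution plus one small transition table. All probability values below are made-up assumptions.

```python
import itertools

# Toy first-order Markov chain: the joint p(x1..xn) factorizes as
# p(x1) * prod_t p(x_t | x_{t-1}) under the independence assumption.
states = ("A", "B")
p_init = {"A": 0.6, "B": 0.4}                   # p(x1)
p_trans = {("A", "A"): 0.7, ("A", "B"): 0.3,    # p(x_t | x_{t-1})
           ("B", "A"): 0.2, ("B", "B"): 0.8}

def joint(seq):
    """Probability of a whole sequence from the factorized model."""
    p = p_init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= p_trans[(prev, cur)]
    return p

# Sanity check: the factorized probabilities still define a proper
# distribution, summing to 1 over all sequences of a given length.
total = sum(joint(s) for s in itertools.product(states, repeat=3))
```

The same sparse parameterization (2 + 4 numbers here, instead of 2^3 - 1 free entries in the full joint) is what makes HMMs, Bayesian networks, and stochastic grammars tractable at biological scales.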
The PGM framework allows for a better understanding of these models and of what the models and their associated fitting or inference algorithms have in common. Many of the dynamic programming algorithms routinely used in sequence analysis can be viewed, in a unified way, as special cases of propagation of information in graphical models. The PGM framework is also very useful in the search for new models, such as Input-Output HMMs, interpolated HMMs, bidirectional HMMs, factorial HMMs, HMM/NN hybrids, etc. PGMs and the related machine-learning methods are computationally intensive and benefit greatly from progress in computer speed. Remarkably, both computer speed and sequence volume have been growing at roughly the same rate since the late 1980s, and continue to do so, currently doubling about every 16 months. In this tutorial, the theory and application of PGMs will be presented and illustrated with several practical problems, including multiple sequence alignment, database searches, gene finding, protein secondary structure prediction, protein functional site prediction, and RNA structure modeling. Hands-on applications will be demonstrated using a PGM package with a graphical interface.
Reference: P. Baldi and S. Brunak (1998) "Bioinformatics: The Machine Learning Approach", MIT Press. This book is the handout for this tutorial. It is not included in the tutorial fee but can be obtained at the strongly reduced price of DM 40,00 in connection with a registration for Tutorial 1 (see conference registration form).
DETAILED OUTLINE:
-> Tutorial Handouts (HTML, PS, HHM)
-> Tutorial Program
-> ISMB 99