ISMB99 - Tutorial 1

Probabilistic Graphical Models

Pierre Baldi

ABSTRACT Large databases of biological information create both challenging data-mining problems and opportunities. Probabilistic Graphical Models (PGMs) have emerged as one of the most practical and unifying concepts in statistics in the 1990s, and provide a general foundation for addressing biological modeling problems. Biological data are best modeled within a Bayesian probabilistic framework, which provides a solid mathematical foundation as well as axiomatic unicity. More importantly, probabilistic modeling offers a consistent and systematic way of dealing with the pervasive noise and variability that evolutionary tinkering has etched into biological systems and data.

Across a variety of bioinformatics problems--from gene finding to protein structure and function prediction--the idea is to construct parameterized probabilistic models and fit them to the data using the tools of probability theory. This machine learning approach is ideally suited for domains characterized by the presence of large amounts of data, "noisy" patterns, and the absence of general theories: the fundamental idea is to learn the theory automatically from the data, through a process of inference, model fitting, or learning from examples. By necessity, such probabilistic models are often complex and lead to the manipulation of high-dimensional probability distributions over the space of data, model parameters, and hidden variables. In general, such distributions are intractable. PGMs provide a systematic set of techniques for approximating them by factorizing them: independence assumptions between subsets of the variables are reflected in the sparse connectivity of the resulting graphs.

Many of the most useful statistical models routinely used in computational molecular biology, such as neural networks, Kalman filters, hidden Markov models (HMMs), Bayesian networks, and stochastic grammars, are special cases of PGMs. The PGM framework allows for a better understanding of these models and of what the models and their associated fitting or inference algorithms have in common. Many forms of dynamic programming routinely used in sequence analysis can be viewed, in a unified way, as special cases of propagation of information in graphical models. The framework is also very useful in the search for new models, such as input-output HMMs, interpolated HMMs, bidirectional HMMs, factorial HMMs, and HMM/NN hybrids. PGMs and the related machine-learning methods are computationally intensive and benefit greatly from progress in computer speed. Remarkably, both computer speed and sequence volume have been growing at roughly the same rate since the late 1980s, and continue to do so, currently doubling about every 16 months.

In this tutorial, the theory and application of PGMs will be presented and illustrated on several practical problems, including multiple sequence alignment, database searches, gene finding, protein secondary structure prediction, protein functional site prediction, and RNA structure modeling. Hands-on applications will be demonstrated using a PGM package with a graphical interface.
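To make the factorization idea concrete, here is a minimal Python sketch (illustrative only, not part of the tutorial materials; all probability tables are invented): a joint distribution over three binary variables is represented through local conditional tables, p(a, b, c) = p(a) p(b | a) p(c | a), so the full eight-entry joint table is never stored explicitly.

    # Invented conditional probability tables for a three-node
    # Bayesian network with edges a -> b and a -> c (illustration only).
    p_a = {0: 0.6, 1: 0.4}                         # p(a)
    p_b_given_a = {0: {0: 0.9, 1: 0.1},            # p(b | a)
                   1: {0: 0.3, 1: 0.7}}
    p_c_given_a = {0: {0: 0.8, 1: 0.2},            # p(c | a)
                   1: {0: 0.5, 1: 0.5}}

    def joint(a, b, c):
        # The sparse graph structure gives the factorization directly.
        return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]

    def p_bc(b, c):
        # Marginalizing the hidden variable a is a sum over its values.
        return sum(joint(a, b, c) for a in (0, 1))

    # Sanity check: the factorized joint sums to 1 over all states.
    total = sum(joint(a, b, c)
                for a in (0, 1) for b in (0, 1) for c in (0, 1))
    print("total =", total)             # 1.0
    print("p(b=1, c=0) =", p_bc(1, 0))  # a marginal query

With three variables the saving is negligible, but the same factorization tames distributions over hundreds of variables, where the full joint table would be astronomically large.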


Reference: P. Baldi and S. Brunak (1998). "Bioinformatics: The Machine Learning Approach." MIT Press.

This book is the handout for this tutorial. It is not included in the tutorial fee but can be obtained at the strongly reduced price of DM 40.00 in connection with a registration for Tutorial 1 (see the conference registration form).


DETAILED OUTLINE:

  1. INTRODUCTION: BIOLOGICAL PROBLEMS AND DATA
    Biological Data in Digital Symbol Sequences; Database Annotation Quality; Database Redundancy; Genomes---Diversity, Size, and Structure; Proteins and Proteomes; From Genome to Proteome
  2. MACHINE LEARNING FOUNDATIONS: THE PROBABILISTIC FRAMEWORK
    Introduction: Bayesian Modeling; Bayesian Inference and Induction; Priors (Maximum Entropy, Group-Theoretic Considerations, Useful Priors); Data Likelihood; Parameter Estimation and Model Selection; Prediction, Marginalization of Nuisance Parameters, and Class Comparison; Model Structures: Graphical Models and Other Tricks
  3. SIMPLE EXAMPLES
    The Simplest Sequence Models; The Single Die Model with Sequence Data; The Single Die Model with Counts Data; The Multiple Dice Model with Sequence Data; Mixtures of Distributions; Markov Models of Time Series (a toy die-model sketch follows this outline)
  4. THE UNDIRECTED CASE: MARKOV RANDOM FIELDS
    Markov Properties (Pairwise, Local, and Global); Factorization Properties; Examples; Statistical Mechanics and Boltzmann Machines
  5. THE DIRECTED CASE: BAYESIAN NETWORKS
    Markov Properties (Pairwise, Local, and Global); Factorization Properties; Belief Propagation Algorithms; Learning; Examples; Helmholtz Machines
  6. CLASSICAL MODELS REVISITED
    Neural Networks: Prediction of Signal Peptides and Protein Secondary Structure; Markov Models and Forward-Backward Algorithms; Kalman Filters; HMMs: Modeling Protein Families, Promoter Regions, Gene Finding; Stochastic Grammars and RNA Modeling (a toy forward-algorithm sketch follows this outline)
  7. NEW MODELS
    Bidirectional HMMs; Factorial HMMs; Input-Output HMMs; HMM/NN Hybrids; Interpolated Markov Models; Factorial Trees; Applications to Protein Secondary Structure Prediction; Parsing; Graphical Models Beyond SCFGs: RNA Structure Prediction and Pseudo-Knots
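
The single die model of item 3 is simple enough to sketch in a few lines of Python (illustrative only, not the tutorial's hands-on package): each position of a DNA sequence is treated as an independent roll of a four-sided die, so fitting the model amounts to normalizing symbol counts, and adding pseudocounts corresponds to the symmetric Dirichlet priors discussed under item 2.

    from collections import Counter

    ALPHABET = "ACGT"

    def die_estimate(seq, pseudocount=1.0):
        # pseudocount = 0 gives the maximum-likelihood estimate;
        # pseudocount > 0 gives the MAP estimate under a symmetric
        # Dirichlet prior (one imagined extra observation per symbol).
        counts = Counter(seq)
        total = len(seq) + pseudocount * len(ALPHABET)
        return {s: (counts[s] + pseudocount) / total for s in ALPHABET}

    print(die_estimate("ACGTACGGGT"))
    # {'A': 0.214..., 'C': 0.214..., 'G': 0.357..., 'T': 0.214...}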

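The forward algorithm of item 6 can likewise be sketched for a toy two-state HMM (all parameter values invented). The recursion alpha_t(j) = e_j(x_t) * sum_i alpha_{t-1}(i) * t_{ij} is exactly the left-to-right propagation of probability mass through a chain-structured graphical model, in the sense described in the abstract.

    # Toy HMM: a "motif" state with skewed emissions embedded in a
    # uniform "background" (invented parameters, for illustration).
    STATES = ("background", "motif")
    init  = {"background": 0.9, "motif": 0.1}
    trans = {"background": {"background": 0.95, "motif": 0.05},
             "motif":      {"background": 0.10, "motif": 0.90}}
    emit  = {"background": dict.fromkeys("ACGT", 0.25),
             "motif":      {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

    def forward(seq):
        # Dynamic programming: alpha[j] = p(x_1..x_t, state_t = j).
        alpha = {s: init[s] * emit[s][seq[0]] for s in STATES}
        for x in seq[1:]:
            alpha = {j: emit[j][x] * sum(alpha[i] * trans[i][j]
                                         for i in STATES)
                     for j in STATES}
        # Summing out the last hidden state gives the likelihood p(seq).
        # (For long sequences one would rescale or work in log space.)
        return sum(alpha.values())

    print(forward("ATTATA"))  # likelihood of a short AT-rich sequence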