Gene Myers
Celera Genomics
Rockville, MD 20850
E-mail: MyersGW@celera.com
Abstract
Simulated data sets are useful in developing software systems
because (1) they allow one to study the effect of a particular
phenomenon in isolation, and (2) they provide complete information
about the true solution, against which the results of the software
can be measured.
While building a software suite for assembling a whole-genome
shotgun data set of the human genome, we developed a simulator,
celsim, that permits one to describe and stochastically generate a
target DNA sequence with a variety of repeat structures, to
generate polymorphic variants of it if desired, and to generate a
shotgun data set as it might be sampled from the target
sequence(s). We have
found the tool invaluable and quite powerful, yet the design is
extremely simple, employing a special type of stochastic grammar.
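
To make the three stages named above concrete, the following Python sketch generates a random target with copies of one repeat element, derives a substitution-only variant, and samples fixed-length reads. It is only an illustration under simplifying assumptions: it does not use celsim's stochastic grammar or input syntax, and every name in it (random_dna, build_target, mutate, shotgun) and every parameter is hypothetical. Each read is stored with its true start position, which is the kind of ground truth that makes simulated data useful for checking an assembler.

```python
import random

BASES = "ACGT"

def random_dna(length, rng):
    """Return a uniformly random DNA string of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))

def build_target(length, repeat_len, copies, rng):
    """Random backbone with `copies` exact copies of one repeat inserted.

    Hypothetical helper: a single repeat family with identical copies,
    far simpler than the repeat structures a real simulator would model.
    """
    repeat = random_dna(repeat_len, rng)
    backbone = random_dna(length - repeat_len * copies, rng)
    # Pick insertion points along the backbone, left to right.
    cuts = sorted(rng.randrange(len(backbone) + 1) for _ in range(copies))
    pieces, prev = [], 0
    for cut in cuts:
        pieces.append(backbone[prev:cut])
        pieces.append(repeat)
        prev = cut
    pieces.append(backbone[prev:])
    return "".join(pieces), repeat

def mutate(seq, rate, rng):
    """Polymorphic variant: substitute each base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def shotgun(target, n_reads, read_len, rng):
    """Sample reads uniformly; keep each read's true start position."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(target) - read_len + 1)
        reads.append((start, target[start:start + read_len]))
    return reads

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run can be reproduced
    target, repeat = build_target(1000, 50, 3, rng)
    variant = mutate(target, 0.001, rng)
    for start, read in shotgun(target, 5, 100, rng):
        print(start, read[:20] + "...")
```

Seeding the generator keeps a simulated data set reproducible, so the same target and reads can be regenerated when comparing versions of the software under test.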