The Sequence of the Human Genome
A consensus sequence of the euchromatic portion of the human genome has been
generated by the whole-genome shotgun sequencing method that was developed
while sequencing the genomes of Haemophilus influenzae and Drosophila
melanogaster. The 2.9 billion bp sequence was generated over nine months
from 27,271,853 high-quality sequence reads (~5X coverage of the genome)
from both ends of plasmid clones made from the DNA of five individuals:
three females and two males of African-American, Asian-Chinese, Hispanic and
Caucasian ethnicity. The coverage of the genome in cloned DNA represented
by paired end-sequences exceeds 37X. Two assembly methods, a whole-genome
assembly and a regional hybrid assembly, were used, combining BAC data
from GenBank with Celera data. Over 90% of the genome is in scaffold
assemblies of 500,000 bp or greater and 25% of the genome is in scaffolds of
10 million bp or larger. Analysis of the genome sequence reveals 26,178
protein-encoding genes for which there is strong corroborating evidence and
an additional 12,000 computationally derived genes with mouse homologues or
other weak supporting evidence. Comparative genomic analysis indicates
vertebrate expansions of genes associated with neuronal function,
tissue-specific developmental regulation, and the hemostasis and immune
systems. DNA sequence comparisons among the five individuals provided
locations of 2.6 million single nucleotide polymorphisms (SNPs). The
haploid genomes of a randomly drawn pair of humans differ at a rate of one
bp per 1,250 bp on average, but there is marked heterogeneity in the level of
polymorphism across the genome. Only 0.75% of the SNPs led to possibly
dysfunctional proteins.
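
The headline figures above can be cross-checked with a few lines of
arithmetic. The sketch below is illustrative only: the function names are
hypothetical, the constants are copied from the text, and it assumes that
the ~5X coverage is measured against the 2.9 billion bp consensus length,
which the text does not state explicitly.

```python
# Back-of-the-envelope checks on the figures quoted in the abstract.
# All constants come from the text; function names are hypothetical.

GENOME_BP = 2.9e9              # consensus length (euchromatic portion)
READ_COUNT = 27_271_853        # high-quality sequence reads
READ_COVERAGE = 5.0            # "~5X coverage of the genome"
BP_PER_DIFFERENCE = 1250       # one difference per 1,250 bp between haploid genomes
SNP_COUNT = 2.6e6              # SNP locations reported
DYSFUNCTIONAL_FRACTION = 0.0075  # 0.75% of SNPs


def implied_mean_read_length(genome_bp, read_count, coverage):
    """Mean read length implied by coverage = reads * length / genome size.

    Assumption: coverage is relative to `genome_bp`.
    """
    return coverage * genome_bp / read_count


def expected_pairwise_differences(genome_bp, bp_per_difference):
    """Expected number of differing sites between two haploid genomes."""
    return genome_bp / bp_per_difference


def snps_with_possible_protein_effect(snp_count, fraction):
    """Number of SNPs falling in the 'possibly dysfunctional' class."""
    return snp_count * fraction


if __name__ == "__main__":
    print(f"implied mean read length: "
          f"{implied_mean_read_length(GENOME_BP, READ_COUNT, READ_COVERAGE):.0f} bp")
    print(f"expected pairwise differences: "
          f"{expected_pairwise_differences(GENOME_BP, BP_PER_DIFFERENCE):,.0f}")
    print(f"SNPs with possible protein effect: "
          f"{snps_with_possible_protein_effect(SNP_COUNT, DYSFUNCTIONAL_FRACTION):,.0f}")
```

Under these assumptions the implied mean read length is a little over
530 bp, two haploid genomes are expected to differ at about 2.3 million
sites, and roughly 19,500 of the reported SNPs fall in the possibly
dysfunctional class. The pairwise figure is an average over a genome with
heterogeneous polymorphism levels, and it is a different quantity from the
2.6 million SNP locations found by comparing five individuals.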