Traditionally, the science of genetics concerned itself
with the study of one gene at a time, or on occasion,
the interaction of a very few genes. However, we increasingly
understand that the regulation of gene function and expression
is governed through the complex interaction of the entire
set of proteins expressed in any given cell. Hence a single
gene view necessarily limits our ability to understand
complex biological systems. Despite a century's worth
of research in genetics, it was only with the advent of
high-throughput DNA sequencing that we were able to generate
a complete list of the genes present in any organism.
Our ability to sequence entire genomes thus represents
a dramatic new turn in biology. Both the technical advances
that make this possible and the implications of this
wealth of data bear some reflection.
The history of research involving the nematode C.
elegans shows three distinct phases of work, each
representing increasing power to understand basic biological
processes. In the first, the genetic era, we identified
mutants with readily observed phenotypes, including the
large class with uncoordinated movement. In the molecular
era, individual genes could be cloned and sequenced and
transformation allowed the direct study of gene function.
However, in the genomic era, when the entire genome sequence
is known, we suddenly recognize that the complete cast
of genetic players contains previously unimagined elements.
Similarly, despite the intense study of the fruit fly
Drosophila over the past 70 years, by last year,
when the genome sequence became available, only 2,500
genes had been identified via traditional means. With
the completion of the sequence we not only knew that there
were 13,600 genes, but we could classify them into functional
categories and immediately recognize genes capable of
driving biological processes for which no evidence had
ever been seen. In addition, from the sequence we also
recognize that many functions that were thought to be
performed by a single protein are carried out by a family
of related proteins, which has obvious implications for
our attempts to understand any one of them.
One of the most challenging technical aspects of the
Human Genome Project has been to scale up what had originally
been a complex series of laboratory steps performed by
highly trained scientists. The first need was to deeply
understand the laboratory processes, the causes of variation
in the process, and the main drivers of cost and data
quality. From this analysis we were able to design simple,
robust procedures that could be performed in an automated
fashion. In this way we were able to generate large amounts
of high-quality data in a highly predictable manner. In
a large-scale project that takes place over a sustained
period, a great deal of work occurs once the learning
curve has been climbed. Thus the energy invested in thoroughly
understanding the process is paid back in increased efficiency
and economies of scale. Automated data capture and analysis
become equally essential to the success of a large project.
A central goal of the Human Genome Project has been to
not only obtain the human genome sequence but in so doing
develop a process for sequencing that will allow us to
efficiently obtain sequence for any large genome. Increasing
the ease with which we obtain further genome sequence
will fundamentally change the way we approach biological
research. For example, having the complete gene list of
an organism allows us to study gene expression by simultaneously
examining all genes, rather than a single gene at a time.
The DNA sequence of each new organism is beginning
to reveal the full depth of the diversity of life on earth,
which has been difficult to study given that only a tiny
minority of species can be cultured in a lab. Comparative genomics,
in which sequence from one organism is compared to another,
not only allows us to recognize the evolutionary history
of species and their component biochemical pathways but
provides an important new opportunity to identify the
regulatory signals embedded in DNA sequence. Non-coding
sequences that have preserved similarity over evolutionary
time can imply conserved function associated with gene
regulation. We continue to develop faster and cheaper
ways to sequence DNA with the understanding that these
data and associated technologies will form the foundation
of biological research in the next few decades.
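The idea that preserved similarity in non-coding sequence can point to regulatory function can be illustrated with a minimal sketch. It assumes two sequences that have already been aligned to the same length (with "-" marking gaps) and slides a fixed window along them, reporting windows whose fraction of identical bases meets a threshold. The function name, window size, and threshold are hypothetical choices for illustration, not a method described above.

```python
def conserved_windows(seq_a, seq_b, window=10, min_identity=0.8):
    """Return (start, identity) for each window meeting the identity threshold.

    Assumes seq_a and seq_b are pre-aligned strings of equal length,
    with '-' marking alignment gaps. Gap positions never count as matches.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")

    hits = []
    for start in range(len(seq_a) - window + 1):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        # Count positions where both sequences carry the same base.
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


# Toy example: the first ten bases are identical, the rest diverge.
organism_1 = "ACGTACGTAC" + "TTTTTTTTTT"
organism_2 = "ACGTACGTAC" + "GGGGGGGGGG"
for start, identity in conserved_windows(organism_1, organism_2):
    print(start, identity)
```

In practice the window size and threshold would be chosen relative to the evolutionary distance between the two organisms, since closely related species show high similarity everywhere and distant ones show little outside coding regions.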