Shaping Biomedicine as an Information Science
in Proceedings of the 1998 Conference on the History and Heritage of Science Information Systems, edited by Mary Ellen Bowden, Trudi Bellardo Hahn, and Robert V. Williams. ASIS Monograph Series. (Medford, NJ: Information Today, Inc., 1999), pp. 27-45.

by
Timothy Lenoir
Program in History & Philosophy of Science
Stanford University


A NEW BIOLOGY FOR THE INFORMATION AGE

Sometime in the mid-1960s biology became an information science. While Jacob and Monod’s work on the genetic code is usually credited with propelling biology into the Information Age, in this essay I explore the transformation of biology by what have become essential tools of the practicing biochemist and molecular biologist; namely, the contributions of information technology. At about the same time as Jacob and Monod’s work, developments in computer architectures and algorithms for generating models of chemical structure and simulating chemical interactions allowed computational experiments to interact with and draw together theory and laboratory experiment in completely novel ways. The new computational science [1] linked with visualization [2] has had a major impact in the fields of biochemistry, molecular dynamics, and molecular pharmacology. [3] Computational approaches have substantially transformed and extended the domain of theorizing in these areas in ways unavailable to older, non-computer-based forms of theorizing. But other information technologies have also proven crucial to bringing about this change. From the 1970s through the 1990s, armed with the new tools of molecular biology, such as cloning, restriction enzymes, protein sequencing, and gene product amplification, biologists were awash in a sea of new data. They deposited these data in large and growing electronic databases of genetic maps, atomic coordinates for chemical and protein structures, and protein sequences. These developments in technique and instrumentation launched biology onto the path of becoming a data-bound science, a “science” in which all the data of a domain—such as a genome—are available before the laws of the domain are understood. Biologists have coped with this data explosion by turning to information science: applying artificial intelligence and expert systems and developing search tools to identify structures and patterns in their data.



The aim of this paper is to explore early developments in the introduction of computer modeling, tools from artificial intelligence (AI), and expert systems into biochemistry in the 1960s and 1970s, and the introduction of informatics techniques for searching databases and extracting biological function and structure in the emerging field of genomics during the 1980s and 1990s. I have two purposes in this line of inquiry. First, I want to suggest that by introducing the tools of information science, biologists have sought to make biology a unified theoretical science with predictive powers analogous to those of other theoretical disciplines. But I also want to suggest that along with this highly heterogeneous and hybrid form of computer-based experimentation and theorizing has come a different conception of theorizing itself: one based on models of information processing and best captured by the phrase “knowledge engineering” developed within the AI community. My second concern is to contribute to recent discussions on the transformation of biology into an information science. Lily Kay, Evelyn Fox Keller, Donna Haraway, and Rich Doyle have explored the role of metaphor, disciplinary politics, economics, and culture in shaping the context in which the language of "DNA code," "genetic information," "text," and "transcription" has been inserted into biological discourse, often in the face of resistance from some of the principal actors themselves. [4] I am more interested in hardware and the computational regimes it enables. Elaborating the recent science and technology studies theme of "tools to theory" in light of the linkage of communications and computer technology in biology, I am interested in exploring the role of the computational medium itself in shaping biology as an information science.



COMPUTERS AND BIOCHEMISTRY: MOLECULAR MODELING

The National Institutes of Health have been active at every stage in making biology an information science. NIH support was crucial to the explosive take-off of computational chemistry and the promotion of computer-based visualization technologies in the mid-1960s. The agency sponsored a conference at UCLA in 1966 on “Image Processing in Biological Science.” The NIH’s Bruce Waxman, co-chair of the meeting, set out the NIH agenda for computer visualization by sharply criticizing the notion of mere “image processing” as the direction that should be pursued in computer-enhanced vision research. The goal of computer-assisted “vision,” he asserted, was not to replicate relatively low-order motor and perceptual capabilities, even at rapid speeds. “I have wondered whether the notion of image processing is itself restrictive; it may connote the reduction of and analysis of ‘natural’ observations but exclude from consideration two- or three-dimensional data which are abstractions of phenomena rather than the phenomena themselves.” [5] Waxman suggested “pattern recognition” as the subject they should really pursue—in particular where the object was what he termed “non-natural.” In general, Waxman asserted, by its capacity to quantize massive data sets automatically, the computer, linked with pattern-recognition methods of imaging the non-natural, would permit the development of stochastically based biological theory.

Waxman’s comments point to one of the important and explicit goals of the NIH and other funding agencies: to mathematize biology. That biology should follow in the footsteps of physics had been the centerpiece of a reductionist program since at least the middle of the nineteenth century. But the development of molecular biology in the 1950s and 1960s encouraged the notion that a fully quantitative theoretical biology was on the horizon. The computer was to be the motor for this change. Analogies were drawn between highly mathematized and experimentally based Big Physics and a hoped-for “Big Biology”: as Lee B. Lusted, the Chairman of the Advisory Committee to the National Research Council on Electronic Computers in Biology and Medicine, argued, because of the high cost of the computer facilities required for conducting biological research, such facilities would be to biology what SLAC and the Brookhaven National Laboratory were to physics. [6] Robert Ledley, author of the volume expressing the Committee’s interest in fostering computing, insisted that biology was on the threshold of a new era. A new emphasis on quantitative work and experiment was changing the conception of the biologist: the view of the biologist as an individual scientist, personally carrying through each step of his investigation and his data-reduction processes, was rapidly broadened to include the biologist as part of an intricate organizational chart that partitions scientific, technical, and administrative responsibilities. In the new organization of biological work, modeled on physicists working at large national labs, the talents and knowledge of the biologist “must be augmented by those of the engineer, the mathematical analyst, and the professional computer programmer.” [7]

At the UCLA meeting, Bruce Waxman held up as a model work on three-dimensional representations of protein molecules done by Cyrus Levinthal. Levinthal worked with the facilities of Project MAC at MIT, one of the first centers in the development of graphics. Levinthal’s project was an experiment in computer time-sharing linking biologists, engineers, and mathematicians in the construction of big biology. Levinthal’s work at MIT illustrates the role of computer visualization as a condition for theory development in molecular biology and biochemistry.

Since the work of Linus Pauling and Robert Corey on the α-helical structure of protein molecules (1953), models have played a substantial role in biochemistry. Watson and Crick’s double helix model for DNA depended crucially on the construction of a physical model. Subsequently, work in the field of protein biology has demonstrated that the functional properties of a molecule depend not only on the interlinkage of its chemical constituents but on the way in which the molecule is configured and folded in three dimensions. Much of biochemistry has focused on understanding the relationship between biological function and molecular conformational structure.

A milestone in the making of physical 3D models of molecules was Kendrew’s construction of his model of myoglobin. The power of models in investigations of biomolecular structure was evident from work such as this, but such tools had limitations as well. Kendrew’s model, for instance, was the first successful attempt to build a physical model into a Fourier map of the electron densities derived from x-ray crystallographic sources. As a code for the electron density, clips of different colors were placed at the proper vertical positions on a forest of steel rods. A brass wire model of the different sheets of the molecule was then built in among the rods. Mechanical interference made it difficult to adjust the structure, and the model was hard to see because of the large number of supporting rods. The model incorporated both too little and too much: too little in that the surface of the molecule was not represented; too much in that, while bond connectivity was represented, the forest of rods made it difficult to see the three-dimensional folding of the molecule. [8] Perhaps the greatest drawback was the model’s size: it filled a large room. It was obvious that such 3D representations would become really useful only when it was possible to manipulate them at will—to size and scale them arbitrarily from actual x-ray crystallographic and electron density map data. Proponents of computer graphics argued that this is exactly what computer representations of molecular structure would allow. Cyrus Levinthal first illustrated these methods in 1965.

Levinthal reasoned that since protein chains are formed by linking molecules of a single class, amino acids, it should be relatively easy to specify the linkage process in a form mathematically suitable for a digital computer. [9] Initially the computer model considers the molecule as a set of rigid groups of constant geometry linked by single bonds around which rotation is possible. [10] Program input consists of a set of coordinates consistent with the molecular stereochemistry as given in data from x-ray crystallographic studies. Several constraints delimit the stable configurations among the numerous possibilities resulting from combinations of linkages among the 20 different amino acids: bond angles, bond lengths, van der Waals radii for each species of atom, and the planar configuration of the peptide bond.
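A minimal sketch of this kind of representation is given below. It is only an illustration of the rigid-group-plus-rotatable-bond idea described above, not Levinthal's actual data structures; the residue and atom names and the two-angles-per-unit simplification are placeholders.

```python
# Illustrative sketch (not Levinthal's code): a polypeptide chain modeled as
# rigid peptide units of fixed geometry whose only free parameters are the
# rotation angles about single bonds.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RigidUnit:
    name: str            # residue type, e.g. "ALA"
    atoms: List[str]     # atom names within the rigid group
    phi: float = 0.0     # rotatable dihedral preceding the unit (degrees)
    psi: float = 0.0     # rotatable dihedral following the unit (degrees)

@dataclass
class ChainModel:
    units: List[RigidUnit] = field(default_factory=list)

    def degrees_of_freedom(self) -> int:
        # Bond lengths, bond angles, and the planar peptide bond are held
        # fixed; only the single-bond rotations remain as free variables.
        return 2 * len(self.units)

chain = ChainModel([RigidUnit("ALA", ["N", "CA", "C", "O", "CB"]),
                    RigidUnit("GLY", ["N", "CA", "C", "O"])])
print(chain.degrees_of_freedom())   # 4 rotatable angles for this toy dipeptide
```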

Molecular biologists, particularly the biophysicists among them, were motivated to build a unified theory, and the process of writing a computer program that could simulate protein structure would assist in this goal by providing a framework of mental and physical discipline from which would emerge a fully mathematized theoretical biology. In non-mathematized disciplines such as biology the language of the computer program would serve as the language of science. [11] But there was a hitch: in an ideal world dominated by a powerful central theory one would like, for example, to use the inputs of xyz coordinates of the atoms, types of bond, and so forth, to calculate the pairwise interactions of atoms in the amino acid chain, predict the conformation of the protein molecule, and check this against its corresponding x-ray crystallographic image. As described, however, the parameters used as input to the computer program do not place much limitation on the number of molecular conformations. Other sorts of input are needed to filter among the myriad possible structures. Perhaps the most important of these is energy minimization. In explaining how the thousands of atoms in a large protein molecule interact with one another to produce the stable conformation, the basic physical hypothesis is that, like water running downhill, the molecular string will fold to reach its lowest energy level. To carry out this sort of minimization would entail calculating the interactions of all pairs of active structures in the chain, minimizing the energy corresponding to these interactions over all possible configurations, and then displaying the resulting molecular picture. Unfortunately, this could not be done: as Levinthal noted, a formula describing such interactions could not, given the state of molecular biological theory in 1965, be written down and manipulated with a finite amount of labor. In Levinthal’s words:

The principal problem, therefore, is precisely how to provide correct values for the variable angles. ...I should emphasize the magnitude of the problem that remains even after one has gone as far as possible in using chemical constraints to reduce the number of variables from several thousand to a few hundred. ...I therefore decided to develop programs that would make use of a man-computer combination to do a kind of model-building that neither a man nor a computer could accomplish alone. This approach implies that one must be able to obtain information from the computer and introduce changes in the way the program is running in a span of time that is appropriate to human operation. This in turn suggests that the output of the computer must be presented not in numbers but in visual form. [12]
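In schematic terms, the minimization at issue can be stated as follows; this is a modern restatement under the rigid-geometry assumptions described above, not notation drawn from Levinthal's paper:

```latex
E(\theta) = \sum_{i<j} V_{ij}\bigl(r_{ij}(\theta)\bigr),
\qquad
\theta^{*} = \arg\min_{\theta} E(\theta)
```

Here \(\theta\) is the vector of rotatable bond angles, \(r_{ij}(\theta)\) the resulting distance between atoms \(i\) and \(j\), and \(V_{ij}\) their pairwise (bonded and non-bonded) interaction energy. The difficulty Levinthal points to is that neither the \(V_{ij}\) nor a tractable way of searching over \(\theta\) was then available in closed form.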

In Levinthal’s view, visualization generated in real-time interaction between human and machine can assist theory construction. The computer becomes in effect both a microscope for examining molecules and a laboratory for quantitative experiment. Levinthal's program, CHEMGRAF, could be programmed with sufficient structural information as input from physical and chemical theory to produce a trial molecular configuration as graphical output. A subsystem called SOLVE then packed the trial structure by determining the local minimum energy configuration due to non-bonded interactive forces. A subroutine of this program, called ENERGY, calculated the torque vector due to the atomic pair interactions on rotatable bond angles. An additional procedure for determining the conformation of the model structure was "cubing." This procedure searched for the nearest neighbors of an atom at the center of a 3x3x3 block of cubes and reported whether any atoms were in the 26 adjacent cubes. The program checked for atom pairs in the same or adjacent cubes and for atoms within a specified distance. It maintained a list of pairs that were, for instance, in contact violation, while another routine calculated the energy contribution of each pair to the molecule. This made it possible to reject, as early as possible, atom pairs whose interatomic distance was too great to make more than a negligible contribution, and it made more efficient use of computer time.
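The cubing procedure is, in modern terms, a cell-list neighbor search. The sketch below is an assumption about the general shape of such an algorithm, not CHEMGRAF's actual code; the function name and parameters are hypothetical.

```python
# Cell-list ("cubing") sketch: atoms are hashed into cubes, and candidate
# interaction pairs are drawn only from the same cube or one of its 26 neighbors.
from collections import defaultdict
from itertools import product
from math import dist

def cubing_pairs(coords, cube_size, cutoff):
    """Return index pairs of atoms closer than `cutoff`. For the adjacency
    shortcut to be exact, cube_size should be >= cutoff."""
    cubes = defaultdict(list)
    for i, (x, y, z) in enumerate(coords):
        cubes[(int(x // cube_size), int(y // cube_size), int(z // cube_size))].append(i)

    pairs = []
    for (cx, cy, cz), members in cubes.items():
        # gather atoms in this cube and its 26 neighbors
        neighborhood = []
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighborhood.extend(cubes.get((cx + dx, cy + dy, cz + dz), ()))
        for i in members:
            for j in neighborhood:
                if j > i and dist(coords[i], coords[j]) <= cutoff:
                    pairs.append((i, j))   # close enough to contribute
    return pairs

# toy usage: three "atoms", 6 Å cubes, 6 Å cutoff -> only the first two interact
print(cubing_pairs([(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)], 6.0, 6.0))
```

Pairs whose distance exceeds the cutoff never reach the energy routine at all, which is the source of the time savings described above.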

Levinthal emphasized that interactivity was a crucial component of CHEMGRAF. A criterion built into his system was that it was necessary to observe the result of the calculations interactively and to be able to halt the minimization process at any step, either to terminate it completely or to alter the conformation and then resume it. [13] Levinthal noted that often, as the analytical procedures were grinding on, a molecule would be trapped in an unfavorable conformation or in a local minimum, the nature of which would be obscure until the conformation could be viewed three-dimensionally. CHEMGRAF enabled the investigator to assist in generating the local minimization of energy for a subsection of the molecule through three different types of user-guided empirical manipulation of structure: "close," "glide," and "revolve." These manipulations in effect introduced external “pseudo-energy” terms into the computation which pulled the structure in various ways. [14] Atoms could be rotated out of the way by direct command and a new starting conformation chosen from which to continue the minimization procedure. By pulling individual atoms to specific locations indicated by experimental data from x-ray diffraction studies, a fit between x-ray crystallography data and the computer model of a specific protein, such as myoglobin, could ultimately be achieved. With the model of the target molecule in hand, one could then proceed to investigate the various energy terms involved in holding the protein molecule together. Thus, the goal of this interactive effort involving human and machine was eventually to generate a theoretical formulation for the lowest energy state of a protein molecule, to predict its structure, and to have that prediction confirmed by x-ray crystallographic images. [15]

The enormous number of redundant trial calculations involved in Levinthal’s work hints at the desirability of combining an expert system with a visualization system. E.J. Corey and W. Todd Wipke, working nearby at Harvard, took this next step. Space limitations prevent me from discussing their work here. In developing their work, Wipke and Corey drew upon a prototype expert system at Stanford called DENDRAL, the result of a collaboration among computer scientist Edward Feigenbaum, biologist Joshua Lederberg, and organic chemist Carl Djerassi, working on another of the NIH initiatives to bring computers directly into the laboratory. DENDRAL was an early effort in the field of what Feigenbaum and his mentors Herbert Simon and Marvin Minsky termed “knowledge engineering.” In effect it attempted to put the human inside the machine.



DENDRAL: THE AI APPROACH AT STANFORD

DENDRAL aimed at emulating an organic chemist [16] operating in the harsh environment of Mars. The ultimate goal was to create an automated laboratory as part of the Viking mission planned to land a mobile instrument pod on Mars in 1975. Given the data of a mass spectrum of an unknown compound, the specific goal was to determine the structure of the compound. To accomplish this, DENDRAL would analyze the data, generate a list of plausible candidate structures, predict the mass spectra of those structures from the theory of mass spectrometry, and select as a hypothesis the structure whose spectrum most closely matched the data.
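The overall logic was a plan-generate-test loop. The following sketch is a schematic, hypothetical rendering of that loop; the real DENDRAL reasoned over chemical graphs and mass-spectral fragmentation rules, and the candidate names and peak values below are placeholders rather than real spectra.

```python
# Schematic generate-and-test loop in the spirit of DENDRAL (illustrative only).
def generate_and_test(observed_peaks, candidates, predict_spectrum, score):
    """Return the candidate structure whose predicted spectrum best matches the data."""
    return max(candidates, key=lambda c: score(predict_spectrum(c), observed_peaks))

# toy stand-ins: predicted peak sets per candidate, scored by overlap with the data
predicted = {"structure-A": {15, 29, 43}, "structure-B": {29, 43, 57}}
best = generate_and_test(
    observed_peaks={29, 43, 57},
    candidates=predicted.keys(),
    predict_spectrum=lambda c: predicted[c],
    score=lambda pred, obs: len(pred & obs),
)
print(best)   # "structure-B": its predicted peaks overlap the observed set best
```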

A key part of this program was the representation of chemical structure in terms of topological graph theory. Chemical graphs were the visual “language” to augment the theoretical and practical knowledge of the chemist with the calculating power of the computer. This part of the effort was contributed by Lederberg, who had been awarded the Nobel Prize in 1958 for his work on genetic exchange in bacteria and who had been interested in the introduction of information concepts into biology for most of his professional life. A self-described man with a Leibnizian dream of a universal calculus for the alphabet of human thought, Lederberg was drawn to mass spectrometry and the topological mapping of molecules in part by the dream of mathematizing biology, starting with organic chemistry. The structures of organic molecules are bewilderingly complex, and while sprinkled with lots of theory derived from quantum mechanics and thermodynamics, the “theory” of organic chemistry does not have an elegant axiomatic structure analogous, say, to Newtonian mechanics. Lederberg felt that a first step toward such a quantitative, predictive theory would be a rational systematization of organic chemistry. Trampling upon a purist’s notion of theory, Lederberg thought that computers were the royal road to mathematization in chemistry:

Could not the computer be of great assistance in the elaboration of novel and valid theories? I can dream of machines that would not only execute experiments in physical and chemical biology but also help design them, subject to the managerial control and ultimate wisdom of their human programmer. [17]
Mass spectrometry, the area upon which Feigenbaum and Lederberg concentrated with Carl Djerassi, was a particularly appropriate challenge. It differed in at least one crucial respect from the molecular modeling of proteins considered above. Whereas in those areas a well-understood theory, such as the quantum mechanical theory of the atomic bond, was the basis for developing the computer program to examine effects in large calculations, there was no theory of mass spectrometry that could be transferred to the program from a textbook. [18] The field has bits of theory to draw upon, but it has developed mainly by following rules of thumb, which are united in the person of the chemist-expert. The field thrives on tacit knowledge. The following excerpt from a memo Feigenbaum wrote after his first meetings with Lederberg on the DENDRAL project provides a vivid sense of the objective and the problems faced:
The main assumption we are operating under is that the required information is buried in chemists’ brains if only we can extract it. Therefore, the initiative for the interaction must come from the system not the chemist, while allowing the chemist the flexibility to supply additional information and to modify the question sequence or content of the system. ...What we want to design then is a question asking system [that] will gather rules about the feasibility of the chemical molecules and their subgraphs being displayed. [19]
In short, Feigenbaum sought to emulate a gifted chemist with the computer. That chemist was Carl Djerassi, named “El Supremo” by his graduate students and post-docs. Djerassi’s astonishing achievements as a mass spectrometrist relied on his ability to feel his way through the process without the aid of a complete theory, relying rather on experience, tacit knowledge, hunches, and rules of thumb. In interviews, Feigenbaum elicited this kind of information from Djerassi, in a process that heightened awareness of the structure of the field for both participants. The process of involving a computer in chemical research, in this way, organized a variety of kinds of information in a crucial step toward theory.



A PARADIGM SHIFT IN BIOLOGY

Thus far I have been considering efforts to predict structure from physical principles as one of the paths through which computer science and computer-based information technology began to reshape biology. The Holy Grail of biology has always been the construction of a mathematized theoretical biology, and for most molecular biologists the journey there has been directed by the notion that the information for the three dimensional folding and structure of proteins is uniquely contained in the linear sequence of their amino acids. [20] As we have seen, the molecular dynamics approach assumed that if all the forces between atoms in a molecule, including bond energies and electrostatic attraction and repulsion, are known, then it is possible to calculate the three-dimensional arrangement of atoms that requires the least energy. Because this method requires intensive computer calculations, shortcuts have been developed that combine computer-intensive molecular dynamics computations, artificial intelligence and interactive computer graphics in deriving protein structure directly from chemical structure.

While theoretically elegant, the determination of protein structure from chemical and dynamical principles has been hobbled with difficulties. In the abstract, analysis of physical data generated from protein crystals, such as x-ray and nuclear magnetic resonance data, should offer rigorous ways to connect primary amino acid sequences to 3D structure. But the problems of acquiring good crystals and the difficulty of getting NMR data of sufficient resolution are impediments to this approach. Moreover, while quantum mechanics provides a solution to the protein folding problem in theory, the computational task of predicting structure from first principles for large protein molecules containing many thousands of atoms has proven impractical. Furthermore, unless it is possible to grow large, well-ordered crystals of a given protein, x-ray structure determination is not an option. The development of methods of structure determination by high resolution 2-D NMR has alleviated this situation somewhat, but this technique is also costly and time-consuming, requiring large amounts of protein of high solubility and severely limited by protein size. These difficulties have contributed to the slow rate of progress in registering atomic coordinates of macromolecules.

An indicator of the difficulty of pursuing this approach alone is suggested by the growth of databanks of atomic coordinates for proteins. The Protein Data Bank (PDB) was established in 1971 as a computer-based archival resource for macromolecular structures. The purpose of the PDB was to collect, standardize, and distribute atomic co-ordinates and other data from crystallographic studies. In 1977 the PDB listed atomic coordinates for 47 macromolecules. [21] In 1987 that number began to increase rapidly at a rate of about 10 percent per year due to the development of area detectors and widespread use of synchrotron radiation, so that by April 1990 atomic coordinate entries existed for 535 macromolecules. Commenting on the state of the art in 1990, Holbrook et al. noted that crystal determination could require one or more man-years. [22] Currently (1999) the Biological Macromolecule Crystallization Database (BMCD) of the Protein Data Bank contains entries for 2526 biological macromolecules for which diffraction quality crystals have been obtained. These include proteins, protein:protein complexes, nucleic acid, nucleic acid:nucleic acid complexes, protein:nucleic acid complexes, and viruses. [23]

While structure determination was moving at a snail’s pace, beginning in the 1970s another stream of work contributed to the transformation of biology into an information science. The development of restriction enzymes, recombinant DNA techniques, gene cloning techniques, and PCR was resulting in a flood of data on DNA, RNA, and protein sequences. Indeed, more than 140,000 genes were cloned and sequenced in the twenty years from 1974 to 1994, of which more than 20 percent were human genes. [24] By the early 1990s, well before the beginning of the Human Genome Initiative, the NIH GenBank database (release 70) contained more than 74,000 sequences, while the Swiss Protein database (Swiss-Prot) included nearly 23,000 sequences. Protein databases were doubling in size every 12 months, and some were predicting that as a result of the technological impact of the Human Genome Initiative ten million base pairs a day would be sequenced by the year 2000. Such an explosion of data encouraged the development of a second approach to determining the function and structure of protein sequences: namely, prediction from sequence data alone. This “bioinformatics” approach identifies the function and structure of unknown proteins by applying search algorithms to existing protein libraries in order to determine sequence similarity, percentages of matching residues, and the statistical significance of each database match.
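A minimal sketch of the underlying idea of such a search follows. It is an illustration only: it scores a query against a toy library by percent identity over an ungapped, position-by-position comparison, whereas the search tools of the period used far more sophisticated alignment and statistics; the sequences and names are invented.

```python
# Toy database search: rank library sequences by percent identity to a query.
def percent_identity(query, target):
    """Percent of matching residues, comparing position by position (no gaps),
    over the length of the shorter sequence."""
    n = min(len(query), len(target))
    matches = sum(1 for a, b in zip(query, target) if a == b)
    return 100.0 * matches / n if n else 0.0

library = {"protA": "MKTAYIAKQR", "protB": "MKTWYIAQQR", "protC": "GGGGGGGGGG"}
query = "MKTAYIAQQR"
hits = sorted(((percent_identity(query, seq), name) for name, seq in library.items()),
              reverse=True)
for score, name in hits:
    print(f"{name}: {score:.0f}% identity")
```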

A key project illustrating the ways in which computer science and molecular biology began to merge in the formation of bioinformatics was the MOLGEN project at Stanford and the events related to the formation and subsequent development of BIONET. MOLGEN was a continuation of the projects in artificial intelligence and knowledge engineering begun at Stanford with DENDRAL. MOLGEN was started in 1975 within the Heuristic Programming Project, with Edward Feigenbaum as principal investigator directing the thesis projects of Mark Stefik and Peter Friedland. [25] The aim of MOLGEN was to model the experimental design activity of scientists in molecular genetics. [26] Before an experimentalist sets out to achieve some goal, he produces a working outline of the experiment, guiding each step of the process. The central idea of MOLGEN was that in designing a new experiment scientists rarely plan from scratch. Instead they find a skeletal plan, an overall design that has worked for a related or more abstract problem, and then adapt it to the particular experimental context. Like DENDRAL, this approach was heavily dependent upon large amounts of domain-specific knowledge in the field of molecular biology and especially upon good heuristics for choosing among alternative implementations.

MOLGEN’s designers chose molecular biology as appropriate for the application of artificial intelligence because the techniques and instrumentation generated in the 1970s seemed ripe for automation. The advent of rapid DNA cloning and sequencing methods had had an explosive effect on the amount of data that could be most readily represented and analyzed by a computer. Moreover, it appeared that very soon progress in analyzing the information in DNA sequences would be limited by the ability to combine the available search and statistical tools appropriately. MOLGEN was intended to apply rules to detect profitable directions of analysis and to reject unpromising ones. [27]

Peter Friedland was responsible for constructing the knowledge-base component of MOLGEN, and though not himself a molecular biologist, he made a major contribution to the field by assembling the rules and techniques of molecular biology into an interactive, computerized system of analytical programs. Friedland worked with Stanford molecular biologists Douglas Brutlag, Laurence Kedes, John Sninsky, and Rosalind Grymes, who provided expert knowledge on enzymatic methods, nucleic acid structures, detection methods, and pointers to key references in all areas of molecular biology. Along with providing an effective encyclopedia of information about technique selection in planning a laboratory experiment, the knowledge base contained a number of tools for automated sequence analysis. Brutlag, Kedes, Sninsky, and Grymes were interested in having a battery of automated tools for sequence analysis, and they contracted with Friedland and Stefik—both gifted computer program designers—to build them in exchange for contributing their expert knowledge to the project. [28] This collaboration of computer scientists and molecular biologists helped biology along the road to becoming an information science.

Among the programs Friedland and Stefik created for MOLGEN was SEQ, an interactive, self-documenting program for nucleic acid sequence analysis which had 13 different procedures with over 25 different sub-procedures, many of which could be invoked simultaneously to provide different analytical methods for any sequence of interest. SEQ brought together in a single program methods for primary sequence analysis described in the literature by Korn et al., Staden, and numerous others. [29] SEQ also performed homology searches, specifying the degree of homology, and searches for dyad symmetry (inverted repeats) in DNA sequences. [30] Another feature of SEQ was its ability to prepare restriction maps, with the names and locations of the restriction sites marked on the nucleotide sequence, in addition to a facility for calculating the lengths of DNA fragments from restriction digests of any known sequence. Another program in the MOLGEN suite was GA1 (later called MAP). Constructed by Stefik, GA1 was an artificial intelligence program that allowed generation of restriction enzyme maps of DNA structures from segmentation data. [31] It would construct and evaluate all logical alternative models that fit the data and rank them in relative order of fit. A further program in MOLGEN was SAFE, which aided in enzyme selection for gene excision. SAFE took amino acid sequence data and predicted the restriction enzymes guaranteed not to cut within the gene.
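As a rough illustration of the restriction-map facility described above, the sketch below locates recognition sites in a linear sequence and reports the resulting fragment lengths. It is a toy, not the MOLGEN code: it cuts at the start of each recognition site, whereas real enzymes (EcoRI, for example) cut at a defined position within the site, and the test sequence is invented.

```python
# Toy restriction map: find recognition sites and fragment lengths for a linear sequence.
def restriction_map(sequence, enzymes):
    """enzymes: dict of enzyme name -> recognition site. Returns, per enzyme,
    the cut positions (site starts, a simplification) and fragment lengths."""
    results = {}
    for name, site in enzymes.items():
        cuts = [i for i in range(len(sequence) - len(site) + 1)
                if sequence[i:i + len(site)] == site]
        bounds = [0] + cuts + [len(sequence)]
        fragments = [b - a for a, b in zip(bounds, bounds[1:]) if b > a]
        results[name] = {"cut_positions": cuts, "fragment_lengths": fragments}
    return results

dna = "GAATTCAAGGGAATTCTT"
print(restriction_map(dna, {"EcoRI": "GAATTC"}))
# sites at positions 0 and 10; fragments of length 10 and 8 for this toy sequence
```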

In its first phase of development (1977-1980) MOLGEN consisted of the programs described above and a knowledge base containing information on about 300 laboratory methods and 30 strategies for using them. It also contained the best currently available data on about 40 common phages, plasmids, genes, and other known nucleic acid structures. The second phase of development beginning in 1980 scaled up both analytical tools and knowledge base. Perhaps the most significant aspect of the second phase was making MOLGEN available to the scientific community at large on the Stanford University Medical Experimental national computer resource, SUMEX-AIM. SUMEX-AIM, supported by the Biotechnology Resources Program at NIH since 1974, had been home to DENDRAL and several other programs. The new experimental resource on SUMEX, comprising the MOLGEN programs and access to all major genetic databases, was called GENET. In February 1980 GENET was made available to a carefully limited community of users. [32]

MOLGEN and GENET were immediate successes with the molecular biology community. In their first few months of operation in 1979 more than 200 labs (with several users in each of those labs) accessed the system. By November 1, 1982 more than 300 labs from 100 institutions were accessing the system around the clock. [33] Traffic on the site was so heavy that restrictions had to be implemented and plans for expansion considered. In addition to the academic users, a number of biotech firms, such as Monsanto, Genentech, Cetus, and Chiron, used the system heavily. In order to ensure that the academic community had unrestricted access to the SUMEX computer and that the NIH would be satisfied that commercial users were not getting unfair access to the resource, Feigenbaum, principal investigator in charge of the SUMEX resource, and Thomas Rindfleisch, facility manager, decided to exclude commercial users. [34]

To provide commercial users with their own unrestricted access to GENET and the MOLGEN programs, Brutlag, Feigenbaum, Friedland, and Kedes formed a company, IntelliGenetics, which would offer the suite of MOLGEN software for sale or rental to the emerging biotechnology industry. With 125 research labs doing recombinant DNA research in the US alone and a number of new genetic engineering firms starting up, opportunities looked outstanding. No one was then supplying software in this rapidly growing genetic engineering marketplace. With its exclusive licensing arrangement with Stanford for the MOLGEN software, IntelliGenetics was poised to lead a huge growth area. The business plan expressed well the excellent position of the company:
A major key to the success of IntelliGenetics will be the fact that the recombinant DNA research revolution is so recent. While every potential customer is well capitalized, few have the manpower they say they need; this year several firms are hiring 50 molecular geneticist Ph.D.s, and one company speaks of 1000 within five years. These firms require computerized assistance—for the storage and analysis of very large amounts of DNA sequence information which is growing at an exponential rate—and will continue to do so for the foreseeable future (10 years). Access to this information and the ability to perform rapid and efficient pattern recognition among these sequences is currently being demanded by most of the firms involved in recombinant DNA research.

The programs offered by IntelliGenetics will enable the researchers to perform tasks that are: 1) virtually impossible to perform with hand calculations, and 2) extremely time-consuming and prone to human error.
In other words, IntelliGenetics offers researcher productivity improvement to an industry with expanding demand for more researchers which is experiencing a severe supply shortage. [35]
The resource that IntelliGenetics eventually offered to commercial users was BIONET. Like GENET, its prototype, BIONET combined in one computer site databases of DNA sequences with programs to aid in their analysis.

Prior to the startup of BIONET, GENET was not the only resource for DNA sequences. Several researchers were making their databases available. Margaret Dayhoff had created a database of DNA sequences and some software for sequence analysis for the National Biomedical Research Foundation that was marketed commercially. Walter Goad, a physicist at Los Alamos National Laboratory, collected DNA sequences from the published literature and made them freely available to researchers. But by the late 1970s the number of bases sequenced was already approaching 3 million and was expected to double soon. Some form of easy communication between labs and effective data handling was considered a major priority in the biological community. While experiments were going on with GENET, a number of nationally prominent molecular biologists had been pressing to start an NIH-sponsored central repository for DNA sequences. An early meeting organized by Joshua Lederberg was held in 1979 at Rockefeller University. The proposed NIH initiative was originally supposed to be coordinated with a similar effort at the European Molecular Biology Laboratory (EMBL) in Heidelberg, but the Europeans became dissatisfied with the lack of progress on the American end and decided to go ahead with their own databank. EMBL announced the availability of its Nucleotide Sequence Data Library in April 1982, several months before the American project was funded. Finally, in August 1982 the NIH awarded a contract for $3 million over 5 years to the Boston-based firm of Bolt Beranek and Newman (BB&N) to set up the national database known as GenBank in collaboration with Los Alamos National Laboratory. IntelliGenetics submitted an unsuccessful bid for that contract.

The discussions leading up to GenBank included consideration of funding a more ambitious databank, known as “Project 2,” which was to provide a national center for the computer analysis of DNA sequences. Budget cuts forced the NIH to abandon that scheme. [36] However, they returned to it the following year thanks to the persistence of IntelliGenetics representatives. Although GenBank launched a formal national DNA sequence collection effort, the need for computational facilities voiced by molecular biologists was still left unanswered. In September 1983, after a review process that took over a year, the NIH Division of Research Resources awarded IntelliGenetics a $5.6 million five-year contract to establish BIONET. [37] The contract started on March 1, 1984 and ended on February 27, 1989.

BIONET first became available to the research community in November 1984. The fee for use was $400 per year per laboratory, and it remained at that level throughout the first five years. BIONET’s use grew impressively. Initially the IntelliGenetics team set the target for user subscriptions at 250 labs. However, the annual report for the first year’s activities of BIONET, in March 1985, listed 350 labs with nearly 1,132 users. By August 1985 that number had increased dramatically to 450 labs and 1,500 users. [38] In April 1986, for example, BIONET had 464 laboratories comprising 1,589 users. By October 1986 the numbers were 495 labs and 1,716 users. [39] By 1989, 900 laboratories in the U.S., Canada, Europe, and Japan (comprising about 2,800 researchers) subscribed to BIONET, and 20 to 40 new laboratories joined each month. [40]

BIONET was intended to establish a national computer resource for molecular biology satisfying three goals. A first goal was to provide a way for academic biologists to obtain access to computational tools to facilitate their nucleic acid (and possibly protein) related research. In addition to giving researchers ready access to national databases on DNA and protein sequences, BIONET would provide a library of sophisticated software for sequence searching, matching, and manipulation. A second goal was to provide a mechanism to facilitate research into improving such tools. The BIONET contract provided research and development support of further software, both in-house research by IntelliGenetics scientists and through collaborative ventures with outside researchers. A third goal of BIONET was to enhance scientific productivity through electronic communications.

The stimulation of collaborative work through electronic communication was perhaps the most impressive achievement of BIONET. BIONET was much more than the Stanford GENET plus the MOLGEN-IntelliGenetics suite of software. Whereas GENET with its pair of ports could accommodate only two users at any one time, BIONET had 22 ports providing an estimated annual 30,000 connect hours. [41] All subscribers to BIONET were provided with email accounts. For most molecular biologists this was something entirely new, since most university labs were just beginning to be connected with regular email service. At least 20 different bulletin boards on numerous topics were supported by BIONET. In an effort to change the culture of molecular biologists by accustoming them to the use of electronic communications and more collaborative work, BIONET users were required to join one of the bulletin board groups.

BIONET subscribers had access to the latest versions of the most important databases for molecular biology. Large databases available at BIONET were (i) GenBank™, the National Institutes of Health DNA sequence library; (ii) EMBL, the European Molecular Biology Laboratory nucleotide sequence library; (iii) NBRF-PIR, the National Biomedical Research Foundation's protein sequence database (this database is part of the Protein Identification Resource [PIR] supported by NIH's Division of Research Resources); (iv) SWISS-PROT, a protein sequence database founded by Amos Bairoch of the University of Geneva and subsequently managed and distributed by the European Molecular Biology Laboratory; (v) VectorBank™, IntelliGenetics' database of cloning vector restriction maps and sequences; (vi) the Restriction Enzyme Library, a complete list of restriction enzymes and cutting sites provided by Richard Roberts at Cold Spring Harbor; and (vii) Keybank, IntelliGenetics' collection of predefined patterns or "keys" for database searching. Several smaller databases were also available, including a directory of molecular biology databases, a collection of literature references to sequence analysis papers, and a complete set of detailed molecular biological laboratory protocols (especially for E. coli and yeast work). [42]

Perhaps the most important contribution made by BIONET to establishing molecular biology as an information science was negotiated at the renewal of GenBank. As described above, BB&N was awarded the first 5-year contract to manage GenBank. The contract was up for renewal in 1987, and given its track record in managing BIONET, IntelliGenetics submitted a proposal to manage GenBank. GenBank users had become dissatisfied with the serious delay in sequence data publication. GenBank was 2 years behind in disseminating sequence data it had received. [43] At a meeting in Los Alamos in 1986, Walter Goad noted that GenBank had 12 million base pairs. Other sequence collections available to researchers contained 14-15 million base pairs, so that GenBank was at least 14-20% out of date. [44] Concerned that researchers would turn to other, more up-to-date data sources, the NIH listed this as one of the issues they wanted IntelliGenetics to address in their proposal to manage GenBank. [45]

IntelliGenetics proposed to solve this problem by automating the submission of gene and protein sequences. Instead of laboriously searching the published scientific literature for sequence data, rekeying them into a GenBank standard electronic format, and checking them for accuracy, which was the standard method employed at that time, IntelliGenetics would automate the submission procedure with an online submission program, XGENPUB (later called “AUTHORIN”).

In fact, IntelliGenetics was already progressing toward automating all levels of sequence entry and (as much as possible) analysis. As early as 1986 IntelliGenetics included SEQIN in PC/GENE, its commercial software package designed for microcomputers. SEQIN was designed for entering and editing nucleic acid sequences, and it already had the functionality needed to deposit sequences with GenBank or EMBL electronically. [46] Transferring this program to the mainframe was a straightforward move. Indeed, the online entry of original sequence data was already a feature of BIONET, since large numbers of researchers were using the IntelliGenetics GEL program on the BIONET computer. GEL was a program that accepted and analyzed data produced by all the popular sequencing methods. It provided comprehensive record-keeping and analysis for an entire sequencing project from start to finish. The final product of the GEL program was a sequence file suitable for analysis by other programs such as SEQ. XGENPUB added a natural extension to this capability by allowing the scientist to annotate a sequence according to the standard GenBank format and mail the sequence and its annotation to GenBank electronically. The interface was a forms-oriented display editor that would automatically insert the sequence in the appropriate place in the form by copying it from a designated file on the BIONET computer. When completed, the form could be forwarded electronically to GenBank, the National Institutes of Health DNA sequence library, at Los Alamos; to EMBL, the nucleotide sequence database of the European Molecular Biology Laboratory; or to NBRF-PIR, the National Biomedical Research Foundation's protein sequence database. [47]

Creating a new culture requires both the carrot and the stick. Making the online programs available and easy to use was one thing. Getting all molecular biologists to use them was another. To further encourage molecular biologists to comply with the new procedure of submitting their data online, the major molecular biology journals agreed to require evidence that the data had been submitted before they would consider a manuscript for review. Nucleic Acids Research was the first journal to enforce this transition to electronic data submission. [48] With these new policies and networks in place, BIONET was able to reduce the time from submission to publication and distribution of new sequence data from two years to 24 hours. As noted above, just a few years earlier, at the beginning of BIONET, there were only 10 million base pairs published, and these had been the result of several years’ effort. The new electronic submission of data generated 10 million base pairs a month. [49] Walter Gilbert may have angered some of his colleagues at the 1987 Los Alamos Workshop on Automation in Decoding the Human Genome when he stated that, “Sequencing the human genome is not science, it is production.” [50] But he surely had his finger on the pulse of the new biology.



THE MATRIX OF BIOLOGY

The explosion of data on all levels of the biological continuum made possible by the new biotechnologies and represented powerfully by organizations such as BIONET was a source of both exhilaration and anxiety. Of primary concern to many biologists was how best to organize this massive outpouring of data in a way that would lead to deeper theoretical insight, perhaps even a unified theoretical perspective for biology. The National Institutes of Health were among those most concerned about these issues, and they organized a series of workshops to consider the new perspectives emerging from recent developments. The meetings culminated in a report, Models for Biomedical Research: A New Perspective (1985), prepared by a committee chaired by Harold Morowitz. The panelists foresaw the emergence of a new theoretical biology “different from theoretical physics, which consists of a small number of postulates and the procedures and apparatus for deriving predictions from those postulates.” The new biology was far more than just a collection of experimental observations. Rather, it was a vast array of information gaining coherence through organization into a conceptual matrix. [51] A point in the history of biology had been reached where new generalizations and higher-order biological laws were being approached but were obscured by the simple mass of data and volume of literature. To move toward this new theoretical biology the committee proposed a multi-dimensional matrix of biological knowledge:
That is the complete data base of published biological experiments structured by the laws, empirical generalizations, and physical foundations of biology and connected by all the interspecific transfers of information. The matrix includes but is more than the computerized data base of biological literature, since the search methods and key words used in gaining access to that base are themselves related to the generalizations and ideas about the structure of biological knowledge. [52]
New disciplinary requirements were imposed on the biologist who wanted to interpret and use the matrix of biological knowledge:
The development of the matrix and the extraction of biological generalizations from it are going to require a new kind of scientist, a person familiar enough with the subject being studied to read the literature critically, yet expert enough in information science to be innovative in developing methods of classification and search. This implies the development of a new kind of theory geared explicitly to biology with its particular theory structure. It will be tied to the use of computers, which will be required to deal with the vast amount and complexity of the information, but it will be designed to search for general laws and structures that will make general biology much more easily accessible to the biomedical scientist. [53]
Similar concerns about managing the explosion of new information motivated the Board of Regents of the National Library of Medicine. In its Long Range Plan of 1987 the NLM drew directly on the notion of the matrix of biological knowledge and elaborated upon it explicitly in terms of fashioning the new biology as an information science. [54] The Long Range Plan contained a series of recommendations that were the outcome of studies done by five different panels, including a panel that considered issues connected with building factual databases, such as sequence databases.

In the view of the panel, the field of molecular biology is opening the door to an era of unprecedented understanding and control of life processes, including “automated methods now available to analyze and modify biologically important macromolecules.” [55] The report characterized biomedical databases as representing the universal hierarchy of biological nature: cells, chromosomes, genes, proteins. Factual databases were being developed at all levels of the hierarchy, from cells to base-pair sequences. Because of the complexity of biological systems, basic research in the life sciences is increasingly dependent on automated tools to store and manipulate the large bodies of data describing the structure and function of important macromolecules. But, the NLM Long Range Plan stated, although the critical questions being asked can often only be answered by relating one biological level to another, methods for automatically suggesting links across levels are non-existent. [56]

A singular and immediate window of opportunity exists for the Library in the area of molecular biology information. Because of new automated laboratory methods, genetic and biochemical data are accumulating far faster than they can be assimilated into the scientific literature. The problems of scientific research in biotechnology are increasingly problems of information science. By applying its expertise in computer technologies to the work of understanding the structure and function of living cells on a molecular level, NLM can assist and hasten the Nation’s entry into a remarkable new age of knowledge in the biological sciences. [57]

To support and promote the entry into the new age of biological knowledge, the Plan recommended building a National Center for Biotechnology Information to serve as a repository and distribution center for this growing body of knowledge and as a laboratory for developing new information analysis and communications tools essential to the advance of the field. The proposal recommended $12.75 million per year for 1988-1990, with an additional $10 million per year for work in medical informatics. [58] The program would emphasize collaboration between computer and information scientists and biomedical researchers. In addition, the NLM would support research in the areas of molecular biology database representation, retrieval linkages, and modeling systems, while examining interfaces based on algorithms, graphics, and expert systems. The recommendation also called for the construction of online data delivery through linked regional centers and distributed database subsets.



BRAVE NEW THEORY

Two different styles of work have characterized the field of molecular biology. The biophysical approach has sought to predict the function of a molecule from its structure. The biochemical approach, on the other hand, has been concerned with predicting phenotype from biochemical function. If there has been a unifying framework for the field, at least from its early days up through the 1980s, it was provided by the “central dogma” emerging from the work of Watson, Crick, Monod, and Jacob in the late 1960s, schematized as follows:



DNA → RNA → Protein → Function



In this paper I have singled out molecular biologists whose Holy Grail has always been to construct a mathematized, predictive biological theory. In terms of the “central dogma” the measure of success in the enterprise of making biology predictive would be—and has been since the days of Claude Bernard—rational medicine. If one had a complete grasp of all the levels from DNA to behavioral function, including the processes of translation at each level, then one could target specific proteins or biochemical processes that may be malfunctioning and design drugs specifically to repair these disorders. For molecular biologists with high theory ambitions the preferred path toward achieving this goal has been based on the notion that the function of a molecule is determined by its three-dimensional folding and that the structure of proteins is uniquely contained in the linear sequence of their amino acids. [59] But determination of protein structure and function is only part of the problem confronting a theoretical biology. A fully fledged theoretical biology would want to be able to determine the biochemical function of the protein structure as well as its expected behavioral contribution within the organism. Thus biochemists have resisted the road of high theory and have pursued a solidly experimental approach aimed at eliciting common models of biochemical function across a range of mid-level biological structures, from proteins and enzymes through cells. Their approach has been to identify a gene by some direct experimental procedure—determined by some property of its product or otherwise related to its phenotype—to clone it, to sequence it, to make its product, and to continue to work experimentally so as to seek an understanding of its function. This model, as Walter Gilbert has observed, was suited to “small science,” experimental science conducted in a single lab. [60]

The emergence of organizations like the Brookhaven Protein Data Bank in 1971, GenBank in 1982, and BIONET in 1984, and the massive amount of sequencing data that began to become available in university and company databases, and more recently publicly through the Human Genome Initiative, have complicated this picture immensely through an unprecedented influx of new data. In the process a paradigm shift has occurred in both the intellectual and institutional structures of biology. According to some of the central players in this transformation, at the core is biology’s switch from having been an observational science, limited primarily by the ability to make observations, to being a data-bound science limited by its practitioners’ ability to understand large amounts of information derived from observations. To understand the data, the tools of information science have not only become necessary handmaidens to theory: they have also fundamentally changed the picture of biological theory itself. A new picture of theory radically different from even the biophysicists’ model of theory has come into view. Disciplinarily, biology has become an information science. Institutionally it is becoming “big science.” Gilbert characterizes the situation sharply:
To use this flood of knowledge, which will pour across the computer networks of the world, biologists not only must become computer-literate, but also change their approach to the problem of understanding life.

The next tenfold increase in the amount of information in the databases will divide the world into haves and have-nots, unless each of us connects to that information and learns how to sift through it for the parts we need. [61]
The new data-bound biology Gilbert hints at in this scenario is genomics, the theoretical component of which might be termed “computational biology,” while its instrumental and experimental component might be called “bioinformatics.” The fundamental dogma of this new biology, as characterized by Douglas Brutlag, reformulates the central dogma of Jacob-Monod in terms of “information flow”: [62]



Genetic information → Molecular structure → Biochemical function → Biologic behavior

Walter Gilbert describes the newly forming genomic view of biology:
The new paradigm now emerging is that all the “genes” will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture, only then turning to experiment to follow or test that hypothesis. The actual biology will continue to be done as “small science”—depending on individual insight and inspiration to produce new knowledge—but the reagents that the scientist uses will include a knowledge of the primary sequence of the organism, together with a list of all previous deductions from that sequence. [63]
Genomics, computational biology, and bioinformatics restructure the playing field of biology, bringing a substantially modified toolkit to the repertoire of molecular biology skills developed in the 1970s. Along with the biochemistry components, new skills are now required, including machine learning, robotics, databases, statistics and probability, artificial intelligence, information theory, algorithms, and graph theory. [64]

Proclamations of the sort made by Gilbert and other promoters of genomics may seem like hyperbole. But the Human Genome Initiative and the information technology that enables it have changed molecular biology in fundamental ways, and indeed, may suggest similar changes in store for other domains of science. The online DNA and protein databases I have described have not just been repositories of information for insertion into the routine work of molecular biology, and the software programs discussed in connection with IntelliGenetics and GenBank are more than retrieval aids for transporting that information back to the lab. As a set of final reflections, I want to look in more detail at some ways this software has been used to address the problems of molecular biology in order to gain a sense of the changes taking place.



BIOLOGY IN SILICO

To appreciate the relationship between genomics and earlier work in molecular biology it is useful to compare approaches to the determination of structure and function. Rather than deriving structure and function from first principles of the dynamics of protein folding, the bioinformatics approach involves comparing new sequences with preexisting ones and discovering structure and function by homology to known structures. This approach examines the kinds of amino acid sequences or patterns of amino acids found in each of the known protein structures. The sequences of proteins whose structures have already been determined and are on file in the PDB are examined to infer rules or patterns applicable to novel protein sequences to predict their structure. For instance, certain amino acids, such as leucine and alanine, are very common in α-helical regions of proteins, whereas other amino acids, such as proline, are rarely if ever found in α-helices. Using patterns of amino acids or rules based on these patterns, the genome scientist can attempt to predict where helical regions will occur in proteins whose structure is unknown and for which a complete sequence exists. Clearly the lineage in this approach is work on automated learning first begun in DENDRAL and carried forward in other AI projects related to molecular biology such as MOLGEN.
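
The flavor of such pattern-based prediction can be conveyed by a minimal sketch in Python. It flags windows of a sequence whose average helix propensity exceeds a cutoff; the propensity values are a small illustrative subset in the spirit of published tables, and the function name, window size, and cutoff are assumptions of the sketch rather than parameters of any program discussed here.

```python
# Minimal sketch of propensity-based helix prediction: residues in windows
# whose average helix propensity exceeds a cutoff are flagged as candidate
# helical positions. Values and parameters are illustrative only.

HELIX_PROPENSITY = {
    "A": 1.42, "L": 1.21, "E": 1.51, "M": 1.45,  # strong helix formers
    "P": 0.57, "G": 0.57, "N": 0.67, "Y": 0.69,  # helix breakers
}
DEFAULT = 1.00  # residues not listed get a neutral propensity

def predict_helical_regions(sequence, window=6, cutoff=1.1):
    """Return a list of (start, end) index pairs for candidate helices."""
    scores = [HELIX_PROPENSITY.get(aa, DEFAULT) for aa in sequence]
    helical = [False] * len(sequence)
    for i in range(len(sequence) - window + 1):
        if sum(scores[i:i + window]) / window >= cutoff:
            for j in range(i, i + window):
                helical[j] = True
    # collapse consecutive flagged positions into regions
    regions, start = [], None
    for i, flag in enumerate(helical + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            regions.append((start, i - 1))
            start = None
    return regions

print(predict_helical_regions("MAALEELAAKLPGGNDYALEM"))
```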

The great challenge in the study of protein structure has been to predict the fold of a protein segment from its amino acid sequence. Before the advent of sequencing technology it was generally assumed that every unique protein sequence would produce a three-dimensional structure radically different from every other protein. But the new technology revealed that protein sequences are highly redundant: only a small percentage of the total sequence is crucial to the structure and function of the protein. Moreover, while similar protein sequences generally indicate similar folded conformations and functions, the converse does not hold. In some proteins, such as the nucleotide-binding proteins, the structural features encoding a common function are conserved while primary sequence similarity is almost non-existent. [65] Methods that detect similarities solely at the primary sequence level turned out to have difficulty addressing functional associations in such sequences. A number of features often only implicit in the protein’s primary sequence of amino acids turned out to be important in determining structure and function.

Such findings implied the need for more sophisticated techniques of searching than simply finding identical matches between sequences in order to elicit information about similarities between higher-order structures such as folds. One solution adopted early on by programs such as SEQ was to assume that if two DNA segments are evolutionarily related, their sequences will probably be related in structure and function. The related descendants are identifiable as homologues. For instance, there are more than 650 globin sequences in the protein sequence databases, all of them very similar in structure. These sequences are assumed to be related by evolutionary descent rather than having been created de novo. Many programs for searching sequence databases have been written, including an important early method developed in 1970 by Needleman and Wunsch and incorporated into SEQ for aligning sequences based on homologies. [66] The method of homology depends upon assumptions about the genetic events that could have occurred in the divergent (or convergent) evolution of proteins; namely, that homologous proteins are the result of gene duplication and subsequent mutations. If one assumes that, following the duplication, point mutations occur at a constant or variable rate but randomly along the genes of the two proteins, then for a relatively short period of time the protein pairs will have nearly identical sequences; over longer periods, insertions and deletions introduce gaps between the otherwise matching stretches of base-pairs in the two sequences. Needleman and Wunsch determined the degree of homology between protein pairs by counting the number of non-identical pairs (amino acid replacements) in the homologous comparison and using this number as a measure of evolutionary distance between the amino acid sequences. A second approach was to count the minimum number of mutations represented by the non-identical pairs.
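
The dynamic-programming idea behind the Needleman-Wunsch method can be suggested by the following sketch, which uses a simple match/mismatch/gap scoring scheme; the scores and sequences are illustrative and do not reproduce the 1970 paper’s exact parameterization or the SEQ implementation.

```python
# Minimal sketch of Needleman-Wunsch global alignment with simple
# match/mismatch/gap scores: fill a score matrix, then trace back
# to recover one optimal alignment.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # score matrix, initialized with gap penalties along the borders
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = F[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            F[i][j] = max(diag, F[i-1][j] + gap, F[i][j-1] + gap)
    # traceback to recover one optimal alignment
    align_a, align_b, i, j = "", "", n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch):
            align_a, align_b, i, j = a[i-1] + align_a, b[j-1] + align_b, i - 1, j - 1
        elif i > 0 and F[i][j] == F[i-1][j] + gap:
            align_a, align_b, i = a[i-1] + align_a, "-" + align_b, i - 1
        else:
            align_a, align_b, j = "-" + align_a, b[j-1] + align_b, j - 1
    return F[n][m], align_a, align_b

print(needleman_wunsch("HEAGAWGHEE", "PAWHEAE"))
```

Even this toy version makes clear why alignment time grows with the product of the two sequence lengths, the cost pressure discussed in note 80.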

Another example of a key tool used in determining structure-function relationships is the search for sequences that correspond to small conserved regions of proteins, modular structures known as motifs. [67] Several different kinds of motifs are related to secondary and tertiary structure. Protein scientists distinguish among four hierarchical levels of structure. Primary structure is the specific linear sequence of the 20 possible amino acids making up the building blocks of the protein. Secondary structure consists of regularly repeating local patterns of polypeptide structure: the α-helix, the β-sheet, and reverse turns. Supersecondary structure refers to a few common motifs of interconnected elements of secondary structure. Segments of α-helix and β-strand often combine in specific structural motifs. One example is the α-helix-turn-helix motif found in DNA-binding proteins; this motif is 22 amino acids in length and enables the protein to bind to DNA. Another motif at the supersecondary level is known as the Rossmann fold, in which three α-helices alternate with three parallel β-strands. This has turned out to be a general fold for binding mono- or dinucleotides, and is the most common fold observed in globular proteins. [68]

A higher order of modular structure is found at the tertiary level. Tertiary structure is the overall spatial arrangement of the polypeptide chain into a globular mass of hydrophobic side chains forming the central core, from which water is excluded, and more polar side chains favoring the solvent-exposed surface. Within tertiary structures are certain domains on the order of 100 amino acids, which are themselves structural motifs. Domain motifs have been shown to be encoded by exons, individual DNA sequences that are directly translated into peptide sequences. Assuming that all contemporary proteins have been derived from a small number of original ones, Walter Gilbert, et al. have argued that the global number of exons from which all existing protein domains have been derived is somewhere between 1,000 and 7,000. [69]

Motifs are powerful tools for searching databases of known structure and function to determine the structure and function of an unknown gene or protein. The motif can serve as a kind of probe for searching the database or some new sequence, testing for the presence of that motif. The PROSITE database, for example, contains more than 1,000 such motifs. [70] With such a library of motifs one can take a new sequence and use each of the motifs to get clues to its structure. Suppose the sequence of a protein has been determined. The most common way to examine a new gene or protein for its biologic function is simply to compare its sequence with all known DNA or protein sequences in the databases and note any strong similarities. The particular gene or protein that has just been determined will of course not be found in the databases, but a homologue from another organism or a gene or protein having a related function may be found. [71] In either case the evolutionary similarity implies a common ancestor and hence a common function. Searching with motif probes refines the determination of the fold regions of the protein. These methods become more and more successful as the databases grow larger and as the sensitivity of the search procedures increases.
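
A rough sense of motif probing is given by the sketch below, which translates a simplified PROSITE-like pattern syntax into regular expressions and scans a new sequence against a small motif “library.” The two patterns are invented for illustration and are not actual PROSITE entries.

```python
# Minimal sketch of using a small library of motifs as probes against a new
# protein sequence. The pattern syntax is a simplified PROSITE-like notation:
# x = any residue, (n) = repeat count, [..] = allowed set, {..} = excluded set.
import re

def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a Python regex."""
    regex = ""
    for element in pattern.split("-"):
        element = element.replace("x", ".")                     # any residue
        element = element.replace("{", "[^").replace("}", "]")  # exclusions
        element = re.sub(r"\((\d+)\)", r"{\1}", element)        # repeat counts
        regex += element
    return regex

MOTIF_LIBRARY = {
    "zinc-finger-like (illustrative)": "C-x(2)-C-x(12)-H-x(3)-H",
    "N-glycosylation-like (illustrative)": "N-{P}-[ST]-{P}",
}

def scan(sequence):
    """Report every motif hit as (motif name, start position, matched text)."""
    hits = []
    for name, pattern in MOTIF_LIBRARY.items():
        for m in re.finditer(prosite_to_regex(pattern), sequence):
            hits.append((name, m.start(), m.group()))
    return hits

print(scan("MKCVHCAAAAAAAAAAAAHAAAHNGSAQR"))
```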

The all-or-nothing character of consensus sequences—a sequence either matches or it doesn’t—led researchers to modify this technique to introduce degrees of similarity among aligned sequences as a way of detecting similarities between proteins, even distantly related ones. Knowing the function of a protein in some genome, such as that of E. coli, for instance, might suggest the same function for a closely related protein in an animal or human genome. [72] Moreover, as noted above, different amino acids can fit the same pattern, such as the helix-turn-helix, so representations of sequence patterns that accept alternative amino acids at a given position, or that allow regions containing a variable number of amino acids, are desirable extensions of straightforward consensus sequence comparison. One such technique is to use weights or frequencies to specify greater tolerance in some positions than in others. An illustration of the success of this approach is provided by the DNA-binding proteins mentioned above, which contain a helix-turn-helix motif 22 amino acids in length. [73] Comparison of the linear amino acid sequences of these proteins revealed no consensus sequence that could distinguish them from any other protein. By determining the frequency with which each amino acid appears at each position, and then converting these numbers to a measure of the probability of occurrence of each amino acid, a weight matrix is constructed. This weight matrix can be applied to measure the likelihood that any given sequence 22 amino acids long is related to the helix-turn-helix family. A further modification of the weight matrix is the profile, which allows one to estimate the probability that any amino acid will appear in a specific position. [74]
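
The weight-matrix idea can be rendered schematically as follows: tally the amino acids observed at each position of an aligned motif, smooth the counts into frequencies, and score a candidate window by its summed log-odds against a background distribution. The toy alignment, pseudocounts, and uniform background in the sketch are assumptions for illustration only.

```python
# Minimal sketch of a position-specific weight matrix: per-position
# frequencies (with pseudocounts) are compared against a uniform
# background, and a candidate window is scored by summed log-odds.
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)   # assume uniform background frequencies

def build_weight_matrix(aligned, pseudocount=1.0):
    length = len(aligned[0])
    matrix = []
    for pos in range(length):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        freqs = {aa: (column.count(aa) + pseudocount) / total for aa in AMINO_ACIDS}
        matrix.append(freqs)
    return matrix

def score_window(matrix, window):
    """Sum of log-odds scores; higher means more motif-like."""
    return sum(math.log(matrix[i][aa] / BACKGROUND) for i, aa in enumerate(window))

alignment = ["QTELA", "QSELA", "RTELG", "QTDLA"]   # toy aligned motif instances
wm = build_weight_matrix(alignment)
print(score_window(wm, "QTELA"), score_window(wm, "PPPPP"))
```

The profile described in the text extends this same table of position-specific scores so that any amino acid may appear, with greater or lesser probability, at each position.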

In addition to consensus sequences, weight matrices, and profiles, a further class of strategies for determining structure-function relations is provided by sequence alignment scoring methods. In order to detect homologies between distantly related proteins, one method is to assign a measure of similarity to each pair of amino acids and then add up these pairwise scores for the entire alignment. [75] Related proteins will not have identical amino acids aligned, but they do have chemically similar or replaceable amino acids in similar positions. In a scoring method developed by Schwartz and Dayhoff, for example, amino acid pairs that are identical or chemically similar were given positive scores, and pairs of amino acids that are not related were assigned negative similarity scores.
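
A minimal sketch of this style of scoring: a small invented similarity table, standing in for a Dayhoff-type matrix, assigns positive values to identical or chemically similar pairs and a penalty to unrelated ones, and a gapless alignment is scored by summing over aligned positions.

```python
# Minimal sketch of scoring an already-aligned, gapless pair of sequences by
# summing per-pair similarity values. The tiny table below is illustrative,
# not an actual PAM or Dayhoff matrix; unlisted pairs default to a penalty.

SIMILARITY = {
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,   # chemically similar pairs
    ("K", "R"): 2, ("D", "E"): 2, ("S", "T"): 1,
}

def pair_score(a, b, identity=4, mismatch=-1):
    if a == b:
        return identity
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), mismatch))

def alignment_score(seq1, seq2):
    """Score two equal-length, already-aligned sequences."""
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))

print(alignment_score("KDILST", "RELVTS"))
```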

A dramatic illustration of how sequence alignment tools can be brought to bear on determining function and structure is provided by the case of cystic fibrosis. Cystic fibrosis is caused by aberrant regulation of chloride transport across epithelial cells in the pulmonary tree, the intestine, the exocrine pancreas, and apocrine sweat glands. The disorder was identified as due to defects in the cystic fibrosis transmembrane conductance regulator protein (CFTR). The CFTR gene was isolated in 1989 and subsequently identified as producing a chloride channel whose activity depends on phosphorylation of particular residues within the regulatory region of the protein. Using computer-based sequence alignment tools of the sort described above, it was established that consensus sequences for nucleotide-binding folds that bind ATP are present near the regulatory region and that 70 percent of cystic fibrosis mutations are accounted for by a 3 base-pair deletion that removes a phenylalanine residue within the first nucleotide-binding fold. A significant portion of the remaining cystic fibrosis mutations affect a second nucleotide-binding domain near the regulatory region. [76]

In working out the folds and binding domains of the CFTR protein, Hyde, Emsley, Hartshorn, et al. (1990) used sequence alignment methods similar to those available in early versions of the IntelliGenetics software suite. [77] In 1992 IntelliGenetics introduced BLAZE, an even more rapid search program running on a massively parallel computer. As an example of how computational genomics can be used to solve structure-function problems in molecular biology, Brutlag repeated the CFTR case using BLAZE. [78] A sequence similarity search compared the CFTR protein to more than 26,000 proteins in a protein database of more than 9 million residues, resulting in a list of the 27 most similar proteins, all of which strongly suggested that CFTR is a membrane protein involved in secretion. Another feature of the comparison was that significant homologies appeared with ATP-binding transport proteins, further strengthening the identification of CFTR as a membrane protein. The search algorithm identified two consensus sequence motifs in the protein sequence of the cystic fibrosis gene product that corresponded to the two sites on the protein involved in binding nucleotides. The search also turned up distant homologies between the CFTR protein and proteins of E. coli and yeast. The entire search took three hours. Such examples offer convincing evidence that the tools of computational molecular biology can lead to an understanding of protein function.

The methods for analyzing sequence data discussed above were just the beginning of an explosion of database-mining tools for genomics that continues today. [79] In the process biology is becoming even more aptly characterized as an information science. [80] Advances in the field have led to large-scale automation of sequencing in genome centers employing robots. The success of large-scale gene sequencing has in turn spawned a similar effort to automate the analysis of proteins, a new area complementary to genomics called proteomics. Similar in concept to genomics, which seeks to identify all genes, proteomics aims to develop techniques that can rapidly identify the type, amount, and activities of the thousands of proteins in a cell. Indeed, new biotechnology companies have started marketing technologies and services for mining protein information en masse. Oxford Glycosciences (OGS) in Abingdon, England, has automated the laborious technique of two-dimensional gel electrophoresis. [81] In the OGS process, an electric current applied to a sample on a polymer gel separates the proteins, first by their unique electric charge characteristics and then by size. A dye attaches to each separated protein arrayed across the gel, and a digital imaging device automatically detects protein levels by how much the dye fluoresces. Each of the 5,000 to 6,000 proteins that may be assayed in a sample in the course of a few days is channeled through a mass spectrometer that determines its amino acid sequence. The identity of the protein can be determined by comparing the amino acid sequence with information contained in numerous gene and protein databases. One imaged array of proteins can be contrasted with another to find proteins specific to a disease.

In order to keep pace with this flood of data emerging from automated sequencing, genome researchers have in turn looked increasingly to artificial intelligence, machine learning, and even robotics in developing automated methods for discovering patterns and protein motifs from sequence data. The power of these methods is their ability both to represent structural features, rather than strictly evolutionary steps, and to discover motifs from sequences automatically. They have been used to extract conserved residues, discover pairs of correlated residues, and find higher-order relationships between residues as well. Techniques drawn from machine learning have included perceptrons, discriminant analysis, neural networks, Bayesian networks, hidden Markov models, minimal length encoding, and context-free grammars. [82] Important methods for evaluating and validating novel protein motifs have also derived from the machine learning area.

An example of this effort to scale up and automate the discovery of structure and function is EMOTIF (for “electronic-motif”), a program for discovering conserved sequence motifs from families of aligned protein sequences developed by the Brutlag Bioinformatics Group at Stanford. [83] Protein sequence motifs are usually generated manually with a single “best” motif optimized at one level of specificity and sensitivity. Brutlag’s aim was to automate this procedure. An automated method requires knowledge about sequence conservation. For EMOTIF, this knowledge is encoded as a particular allowed set of amino acid substitution groups. Given an aligned set of protein sequences, EMOTIF works by generating a set of motifs with a wide range of specificities and sensitivities. EMOTIF can also generate motifs that describe possible subfamilies of a protein superfamily. The EMOTIF program works by generating a new database, called IDENTIFY, of 50,000 motifs from the combined 7000 protein alignments in two widely used public databases, the PRINTS and BLOCKS databases. By changing the set of substitution groups the algorithm can be adapted for generating entirely new sets of motifs.
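
The role substitution groups play in motif generation can be suggested by the sketch below, which, for each column of a toy alignment, reports the conserved residue, the smallest covering substitution group, or a wildcard. This is only a schematic rendering of the idea, not the EMOTIF algorithm itself, which enumerates motifs across a range of specificities and sensitivities; the substitution groups and sequences shown are illustrative.

```python
# Minimal sketch of deriving a motif from an aligned protein family using a
# fixed set of allowed amino acid substitution groups. Groups are illustrative.

SUBSTITUTION_GROUPS = [
    set("ILVM"),      # aliphatic/hydrophobic
    set("FYW"),       # aromatic
    set("KRH"),       # basic
    set("DE"),        # acidic
    set("ST"),        # small hydroxyl
]

def column_pattern(residues):
    """Return the tightest description of one alignment column."""
    observed = set(residues)
    if len(observed) == 1:
        return observed.pop()                           # fully conserved position
    for group in SUBSTITUTION_GROUPS:
        if observed <= group:
            return "[" + "".join(sorted(group)) + "]"   # conserved substitution group
    return "."                                          # unconstrained position

def derive_motif(aligned):
    return "".join(column_pattern([seq[i] for seq in aligned])
                   for i in range(len(aligned[0])))

family = ["GKSTLE", "GRSTIE", "GKTTVE"]          # toy aligned family members
print(derive_motif(family))                      # prints G[HKR][ST]T[ILMV]E
```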

Highly specific motifs are well suited for searching entire proteomes. IDENTIFY assigns biological functions to proteins based on links between each motif and the BLOCKS or PRINTS entries describing the family of proteins from which it was derived. Because these protein families typically have several members, a match to a motif may provide an association with several other members of the family. In addition, when a match to a motif is obtained, that motif may be used to search sequence databases, such as SWISS-PROT and GENPEPT, for other proteins that share it. In their paper introducing these new programs, Nevill-Manning, Wu, and Brutlag showed that EMOTIF and IDENTIFY automatically assigned functions to 25-30% of the proteins in several bacterial genomes and to 172 proteins of previously unknown function in the yeast genome.

Many molecular biologists who welcomed the Human Genome Initiative with open arms undoubtedly believed that when the genome was sequenced everyone would return to the lab to conduct their experiments in a business-as-usual fashion, empowered with a richer set of fundamental data. The developments in automation, the resulting explosion of data, and the introduction of tools of information science to master this data have changed the playing field forever: there may be no “lab” to return to. In its place is a workstation hooked to a massively parallel computer, producing simulations by drawing on the data streams of the major databanks and carrying out “experiments” in silico rather than in vitro. The result of biology’s metamorphosis into an information science just may be the relocation of the lab to the industrial park and the dustbin of history.





ENDNOTES

[1] See Richard Mark Friedhoff and William Benzon, The Second Computer Revolution: Visualization, New York: W.H. Freeman, 1989. Other important discussions are in Information Technology and the Conduct of Research: The User’s View, Report of the Panel on Information Technology and the Conduct of Research, National Academy of Sciences, 1989; B.H. McCormick, T.A. DeFanti, and M.D. Brown, Visualization in Scientific Computing, NSF Report, published as a special issue of Computer Graphics, Vol. 21 (6) (1987). An equally impressive survey is the special issue on computational physics in Physics Today, October, 1987. See especially the articles by Norman Zabusky, “Grappling with Complexity,” Physics Today, October, 1987, pp. 25-27; Karl-Heinz A. Winkler, et al., “A Numerical Laboratory,” ibid., pp. 28-37; Martin Karplus, “Molecular Dynamics Simulations of Proteins,” ibid., pp. 68-72.

By “computational science” I mean the use of computers in the discipline sciences as distinct from computer science. The discipline sciences include the problem domains of the physical and life sciences, economics, medicine, much of applied mathematics, and so forth. See McCormick, et al. “Visualization in Scientific Computing,” p. 11.

[2] The sciences of visualization are defined by McCormick, et al., “Visualization in Scientific Computing,” p. A-1, as follows: “Images and signals may be captured from cameras or sensors, transformed by image processing, and presented pictorially on hard or soft copy output. Abstractions of these visual representations can be transformed by computer vision to create symbolic representations in the form of symbols and structures. Using computer graphics, symbols or structures can be synthesized into visual representations. It should be noted that computational science, simulation, and visualization need not be interwoven.” Early workers in computational physics, for example, devoted their efforts to the underlying physics for simulating phenomena with little emphasis on visualization (Ulam and Teller, for instance, as discussed by Peter Galison in chapter 9 of Image and Logic). In recent years, however, numerical simulations in the physical sciences have reached a degree of complexity such that they are incomprehensible without visual representations. See the remarkable statements of several leading scientists in the supercomputing field quoted in the abstract from the session on “The Physical Simulation and Visual Representation of Natural Phenomena,” from the 1987 meeting of the Association for Computing Machinery, in Computer Graphics, Vol. 21, no. 4 (1987), pp. 335-336.

[3] Stephen S. Hall, “Protein Images Update Natural History.” Science 267(3 February) (1995): 620-624 points to the discovery of the structure of HGH (human growth hormone) receptors as an example of a structure discovery that could not have been accomplished without high-powered interactive graphics.

[4] Richard Doyle, On Beyond Living: Rhetorical Transformations of the Life Sciences, Stanford: Stanford University Press, 1997.

[5] Diane M. Ramsey, ed., Image Processing in Biological Science (Proceedings of a Conference held November, 1966), Berkeley and Los Angeles: University of California Press, 1968, pp. xiii-xiv.

[6] See Lusted’s foreword to Robert S. Ledley, Use of Computers in Biology and Medicine, New York: McGraw-Hill, 1965, p. ix. This book was written as part of the Committee’s recommendation. At the time it was written Ledley was affiliated with the Division of Medical Sciences, National Academy of Sciences-National Research Council. Lusted went on to make the following recommendation:

3. The committee feels that there is a need for biomedical computer research centers, which might be established at national research laboratories or in association with academic institutions. I believe that a pilot project should be made available as soon as possible to determine the utility of such centers. In our judgment the purpose of these centers would be to cooperate in large-scale biomedical research projects that utilize computers as a necessary adjunct and to make computer facilities available for the use of biomedical research workers from other institutions. Precedents for such large-scale cooperative efforts have already been set in basic physics and other areas of science. (ibid., pp. ix-x.)
[7] Robert S. Ledley, Use of Computers in Biology and Medicine , p. xi.

[8] Illustrated in Progress in Stereochemistry , 4(1968).

[9] See Cyrus Levinthal, “Molecular Model-Building by Computer,” Scientific American , Vol. 214, no. 6, 1966, pp. 42-52.

[10] Conventions for the description of protein molecules are provided in J.T. Edsall, P.K. Flory, J.C. Kendrew, A.M. Ligouri, G. Nemathy, G.N Ramachandra, and H.A. Scherga, Journal of Molecular Biology, Vol. 15 (1966), p. 399. Methods for working with computer input are described by G. Nemathy and H.A. Scherga, Biopolymers, Vol. 3 (1966), p.155.

[11] Anthony G. Oettinger, “The Uses of Computers in Science,” Scientific American , Vol. 215, no. 3, 1966, pp. 161-172, quoted from p. 161.

[12] Cyrus Levinthal, “Molecular Model-Building by Computer,” pp. 48-49.

[13] Lou Katz and Cyrus Levinthal, "Interactive Computer Graphics and the Representation of Complex Biological Structures," Annual Reviews in Biophysics and Bioengineering, 1(1972): 465-504.

[14] Further discussion of this aspect of the model-building program can be found in C. Levinthal, C.D. Barry, S.A. Ward, and M. Zwick, “Computer Graphics in Macromolecular Chemistry,” in D. Secrest and J. Nievergelt, eds., Emerging Concepts in Computer Graphics, New York: W.A. Benjamin, 1968, pp.231-253.

[15] Stephen S. Hall, “Protein Images Update Natural History.” Science 267(3 February) (1995): 620-624 points to the discovery of the structure of HGH (human growth hormone) receptors as an example of a structure discovery that could not have been accomplished without high-powered interactive graphics.

[16] Joshua Lederberg, “How DENDRAL Was Conceived and Born,” Stanford Technical Reports. 048087-54, Knowledge Systems Laboratory Report No. KSL 87-54.

Joshua Lederberg, Georgia L. Sutherland, Bruce G. Buchanan, Edward Feigenbaum, “A Heuristic Program for Solving a Scientific Inference Problem: Summary of Motivation and Implementation,” Stanford Technical Reports 026104, Stanford Artificial Intelligence Project Memo AIM-104, November, 1969, p. 2.

[17] Joshua Lederberg, “Topology of Molecules,” in The Mathematical Sciences: A Collection of Essays, edited by the National Research Council Committee on Support of Research in the Mathematical Sciences, Cambridge, Mass.: MIT Press, 1969, pp. 37-51, quoted from pp. 37-38.

[18] See Joshua Lederberg, Georgia L. Sutherland, Bruce G. Buchanan, and Edward A. Feigenbaum, “A Heuristic Program for Solving a Scientific Inference Problem: Summary of Motivation and Implementation,” Stanford Artificial Intelligence Project Memo AIM-104, November, 1969.

[19] “Second Cut at Interaction Language and Procedure,” in “Chemistry Project,” Edward Feigenbaum Papers, Stanford Special Collections SC-340, Box 13.

[20] Christian B. Anfinsen, “Principles that Govern the Folding of Protein Chains,” Science 181, no. 4096 (1973): 223-230, discusses the work for which he was awarded the Nobel Prize for Chemistry in 1972: “This hypothesis (the ‘thermodynamic hypothesis’) states that the three-dimensional structure of a native protein in its normal physiological milieu...is the one in which the Gibbs free energy of the whole system is lowest; that is, that the native conformation is determined by the totality of interatomic interactions and hence by the amino acid sequence, in a given environment.” (p. 223)

[21] Bernstein, F. C., T. F. Koetzle, et al. (1977). “The Protein Data Bank: A computer based archival file for macromolecular structure.” Journal of Molecular Biology 112: 535-542.

[22] Holbrook, S. R., S. M. Muskal, et al. (1993). Predicting Protein Structural Features with Artificial Neural Networks. Artificial Intelligence and Molecular Biology . L. Hunter, ED. Menlo Park, CA, AAAI Press: 161-194.

[23] Biological Macromolecule Crystallization Database and the NASA Archive for Protein Crystal Growth Data (version 2.00) is located on the web at: http://wwwbmcd.nist.gov:8080/bmcd/bmcd.html

[24] D. Brutlag, Understanding the Human Genome. Scientific American Introduction to Molecular Medicine . P. Leder, D. A. Clayton and E. Rubenstein, ED. New York, NY, Scientific American, Inc., 1994: p. 159.

[25] E. A. Feigenbaum and N. Martin, Proposal: MOLGEN - A Computer Science Application to Molecular Genetics, Heuristic Programming Project, Stanford University, Technical Report No. HPP-78-18, 1977.

[26] P. Friedland, Knowledge-Based Experiment Design in Molecular Genetics . Ph.D. Thesis, Computer Science, Stanford University, Stanford,1979.

[27] E. A., Feigenbaum, B. Buchanan, et al. A Proposal for Continuation of the MOLGEN Project: A Computer Science Application to Molecular Biology , Computer Science Department, Stanford University, Heuristic Programming Project, Technical Report No. HPP-80-5, April, 1980, Section 1., p.1.

[28] Douglas Brutlag, personal communication. Peter Friedland, personal communication. After his work on MOLGEN and at IntelliGenetics (discussed below) Friedland went on to become chief scientist at the NASA-Ames Laboratory for Artificial Intelligence in 1987.

[29] L.J. Korn, C.L. Queen, and M.N. Wegman. “Computer Analysis of Nucleic Acid Regulatory Sequences.” Proceedings of the National Academy of Sciences 74 (1977): 4516-4520; R. Staden. “Sequence Data Handling by Computer.” Nucleic Acids Research 4 (1977): 4037-4051; R. Staden. “Further Procedures for Sequence Analysis by Computer.” Nucleic Acids Research 5 (1978): 1013-1015; R. Staden. “A Strategy of DNA Sequencing Employing Computer Programs.” Nucleic Acids Research 6 (1979): 2602-2610.

[30] P. Friedland, D.L. Brutlag, J. Clayton, and L.H. Kedes. “SEQ: A Nucleotide Sequence Analysis and Recombinant System.” Nucleic Acids Research 10 (1982): 279-294.

[31] M. Stefik. “Inferring DNA Structures from Segmentation Data.” Artificial Intelligence 11 (1977): 85-114.

[32] T. Rindfleisch, P. Friedland, and J. Clayton. The GENET Guest Service on SUMEX, SUMEX-AIM Report, 1981: Stanford University Special Collections, Friedland Papers, Fldr GENET.

[33] Doug Brutlag, Personal Communication, 6/19/99. Also discussed in the official site review for BIONET conducted by the NIH Special Study Section, March 17-19, 1983, “BIONET, National Computer Resource for Molecular Biology,” Stanford University Special Collections, Brutlag Papers, p. 2. Also discussed in Roger Lewin, “National Networks for Molecular Biologists,” Science 223 (1984): 1379-1380.

[34] This was announced to the GENET community by Allan Maxam, the chairman of the national advisory board. See: Allan M. Maxam to GENET Community. Subject: Closing of GENET: August 23,1982. Stanford University Special Collections, Peter Friedland Papers, Fldr GENET.

[35] Business Plan for IntelliGenetics, May 8, 1981, p. 5. Stanford Special Collections, Brutlag Papers, Fld IntelliGenetics. Emphasis in the original. Details of the software licensing arrangement and the revenues generated are discussed in a letter to Niels Reimers, Stanford Office of Technology Licensing on the occasion of renegotiating the terms. See: Peter Friedland to Niels Reimers. Subject: Software Licensing Agreement: April 2,1984. Stanford University Special Collections, Fldr IntelliGenetics.

[36] Roger Lewin, “National Networks for Molecular Biologists,” Science 223 (1984): 1379-1380.

[37] Lewin noted that this was the largest award of its kind by NIH to a for-profit organization. See ibid., p. 1380.

[38] Minutes of the Meeting of the National Advisory Committee for BIONET, March 23, 1985 (Final version prepared 1 August 1985),p. 4. In Stanford University Special Collections, Brutlag Papers, Fld. BIONET.

[39] “BIONET users status,” from BIONET managers’ meeting, April 3, 1986; and ibid., October 9, 1986. In Stanford University Special Collections, Brutlag Papers, Fld. BIONET.

[40] Joel Huberman. “BIONET: Computer Power for the Rest of Us.” (1989): p. 1.

[41] Peter Friedland, "BIONET Organizational Plans," 27 April 1984, Company Confidential Memo. Stanford University Special Collections, Brutlag Papers, Fldr BIONET, p. 1. A published version of these objectives appeared as: Dennis H. Smith, Douglas Brutlag, Peter Friedland, and Laurence H. Kedes, “BIONET tm: national computer resource for molecular biology,” Nucleic Acids Research , 14(1)(1986): 17-20.

[42] IntelliGenetics, Introduction to Bionet tm: A Computer Resource for Molecular Biology , User manual for Bionet subscribers, Release 2.3, Mountain View, CA, IntelliGenetics, 1987, p. 23, “Databases available on BIONET.”

[43] Douglas Brutlag, Personal communication, June 19, 1999.

[44] Steve Boswell, “Los Alamos Workshop—Exploring the Role of Robotics and Automation in Decoding the Human Genome,” IntelliGenetics trip report, January 9, 1987. In Stanford Special Collections, Brutlag Papers, Fld. BIONET.

[45] Barbara H. Duke, Contracting Officer, NIH, to IntelliGenetics, Inc. "Request for Revised Proposal in Response to Request for Proposals RFP No. NIH-GM-87-04 entitled ‘Nucleic Acid Sequence Data Bank,’" June 3, 1987, Letter with attachment. Stanford Special Collections, Brutlag Papers, Fld. BIONET.

[46] See PC/Gene: Microcomputer Software for Protein Chemists and Molecular Biologists, User Manual , Mountain View, CA, IntelliGenetics, 1986, p. 99-120.

[47] Douglas L Brutlag, and David Kristofferson. “BIONET: An NIH Computer Resource for Molecular Biology.” Biomolecular Data: A Resource in Transition . Ed. R. R. Colwell. Oxford: Oxford University Press, 1988. 287-294. Also see, “Automatic Data Submission to GenBank, EMBL, and NBRF-PIR,” BIONET News , Vol 1, No. 1, April 1988.

[48] Ibid.

[49] Douglas Brutlag, Personal Communication, June 19, 1999. See nomination for Smithsonian-Computerworld Award in Stanford Special Collections, Brutlag Papers, Fld. Smithsonian Computerworld Award.

[50] Quoted from Steve Boswell, “Los Alamos Workshop—Exploring the Role of Robotics and Automation in Decoding the Human Genome,” IntelliGenetics trip report, January 9, 1987, p. 2. In Stanford Special Collections, Brutlag Papers, Fld. BIONET.

[51] H. Morowitz, Models for Biomedical Research: A New Perspective . Washington, D.C., National Academy of Sciences Press, 1985, p. 21.

[52] Ibid., p. 65.

[53] Ibid., p. 67.

[54] Board of Regents, NLM Long Range Plan (Report of the Board of Regents), Bethesda, MD, National Library of Medicine, (1987).

[55] Ibid., p.26.

[56] Ibid., pp. 26-27.

[57] Ibid., p. 29.

[58] Ibid., pp. 46-47. The figures for Medical Informatics were $7.4, $9.9, and $13 million for 1988-90.

[59] Christian B. Anfinsen, “Principles that Govern the Folding of Protein Chains,” Science 181, no. 4096 (1973): 223-230, discusses the work for which he was awarded the Nobel Prize for Chemistry in 1972: “This hypothesis (the ‘thermodynamic hypothesis’) states that the three-dimensional structure of a native protein in its normal physiological milieu...is the one in which the Gibbs free energy of the whole system is lowest; that is, that the native conformation is determined by the totality of interatomic interactions and hence by the amino acid sequence, in a given environment.” (p. 223)

[60] Walter Gilbert, “Towards a Paradigm Shift in Biology,” Nature, 349 (1991), p. 99.

[61] Ibid.

[62] Douglas L. Brutlag, “Understanding the Human Genome,” in P. Leder, D.A. Clayton, and E. Rubenstein, eds., Scientific American: Introduction to Molecular Medicine, New York: Scientific American, Inc., 1994, pp. 153-168.

[63] Walter Gilbert, “Towards a Paradigm Shift in Biology,” Nature, 349 (1991), p. 99.

[64] These are the disciplines graduate students and postdocs in molecular biology in Brutlag’s lab at Stanford are expected to work with. Source: Douglas Brutlag, “Department Review: Bioinformatics Group, Department of Biochemistry, Stanford University,1998,” personal communication.

[65] M.G. Rossman, D. Moras, K.W. Olsen, “Chemical and Biological Evolution of a Nucleotide-binding Protein,” Nature 250 (1974): pp. 194-199; T.E. Creighton, Proteins: Structure and Molecular Properties, New York: W.H. Freeman and Co., 1983; J.J. Birktoft, L.J. Banaszak, “Structure-function Relationships Among Nicotinamide-adenine Dinucleotide Dependent Oxidoreductases,” in M.T.W. Hearn, ed., Peptide and Protein Reviews, New York: Marcel Dekker, Vol. 4, 1984, pp. 1-47.

[66] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins.” Journal of Molecular Biology 48(1970): 443.

[67] Since insertions and deletions (gaps) within a motif are not easily handled from a mathematical point of view, a more technical term, “alignment block,” has been introduced that refers to conserved parts of multiple alignments containing no insertions or deletions. Peer Bork, and Toby J. Gibson. “Applying Motif and Profile Searches,” Computer Methods for Macromolecular Sequence Analysis . Ed. R.F. Doolittle. Vol. 266. Methods in Enzymology. San Diego: Academic Press, 1996. 162-184, especially p. 163.

[68] J. S. Richardson and D. C. Richardson, “Principles and Patterns of Protein Conformation,” in Prediction of Protein Structure and the Principles of Protein Conformation, G. D. Fasman, ed., New York: Plenum Press, 1989.

[69] R. L. Dorit, L. Schoenbach, and W. Gilbert. “How big is the universe of exons?” Science 250 (1990): p. 1377.

[70] A. Bairoch. “PROSITE: A Dictionary of Sites and Patterns in Proteins.” Nucleic Acids Research 19 (1991): 2241.

[71] Bork, Ouzounis, and Sander state that the likelihood of identifying homologues is currently higher than 80% for bacteria, 70% for yeast, and about 60% for animal sequence series. See P. Bork, C. Ouzounis, and C. Sander, Current Opinion in Structural Biology 4 (1994): 393; Peer Bork and Toby J. Gibson, “Applying Motif and Profile Searches,” cited in note 67 above.

[72] Laszlo Patthy. “Consensus Approaches in Detection of Distant Homologies.” Computer Methods for Macromolecular Sequence Analysis . Ed. R.F. Doolittle. Vol. 266. Methods in Enzymology. San Diego: Academic Press, 1996. 184-198.

[73] R. G. Brennan and B. W. Matthews, “The Helix-Turn-Helix Binding Motif.” Journal of Biological Chemistry 264 (1989): 1903.

[74] M. Gribskov, A. D. McLachlan, et al. “Profile analysis: Detection of distantly Related Proteins,” Proceedings of the National Academy of Sciences 84(1987): 4355; M. Gribskov, M. Homyak, et al. “Profile scanning for three-dimensional structural patterns in protein sequences,” Computer Applications in the Biosciences 4(1988): 61.

[75] R. M. Schwartz and M. O. Dayhoff, “Matrices for Detecting Distant Relationships,” Atlas of Protein Sequence and Structure 5 (1979) (Supplement 3): p. 353.

[76] S. C. Hyde, P. Emsley, et al. (1990). “Structural Model of ATP-binding Proteins Associated with Cystic Fibrosis, Multidrug Resistance and Bacterial Transport.” Nature 346: 362-365; B. S. Kerem, J. M. Rommens, et al. “Identification of the Cystic Fibrosis Gene: Genetic Analysis,” Science 245 (1989): 1073-1080; B. S. Kerem, J. Zielenski, et al. “Identification of Mutations in Regions Corresponding to the Two Putative Nucleotide (ATP)-Binding Folds of the Cystic Fibrosis Gene,” Proceedings of the National Academy of Sciences 87 (1990): 8447-8451; J. R. Riordan, J. M. Rommens, et al., “Identification of the Cystic Fibrosis Gene: Cloning and Characterization of Complementary DNA,” Science 245 (1989): 1066-1073.

[77] Hyde, Emsley, et al. used the Chou-Fasman algorithm (1973) for identifying consensus sequences and the Quanta™ modeling package produced by Polygen Corp., Waltham, Mass., for modeling the protein and its binding sites. See S. C. Hyde, P. Emsley, et al. (1990). “Structural Model of ATP-binding Proteins Associated with Cystic Fibrosis, Multidrug Resistance and Bacterial Transport.” Nature 346: 362-365.

[78] D. Brutlag, “Understanding the Human Genome” Scientific American Introduction to Molecular Medicine . P. Leder, D. A. Clayton and E. Rubenstein, Eds. New York, NY, Scientific American, Inc., 1994: pp. 164-166.

[79] See for instance the National Institute of General Medical Science, “(NIGMS), Protein Structure Initiative Meeting Summary,” April 24, 1998, at: http://www.nih.gov/nigms/news/reports/protein_structure.html

[80] I have focused on the development of software in this discussion. But a further crucial stimulus to the takeoff of bioinformatics, of course, has been hardware and networking developments. The growth of databases and the complexity of the searches that were to be undertaken stimulated the demand for faster algorithms, more powerful computer systems, and network bandwidth. At the beginning of this “bioinformatics revolution” in the 1970s, for example, a search on a DNA sequence of typical size would be performed by a computer capable of performing one million instructions per second (one MIP) and would take approximately 15 minutes. Throughout the late 1970s and 1980s mini-computers and personal computer workstations continued to increase in power at about the same rate as the growth of the databases, so that a typical search still took around 15 minutes. By the end of the 1980s, however, the growth in sequence data—now hundreds of megabytes in size—had overtaken the ability of computers to search it with acceptable turnaround time. Shortcut search methods and more efficient code helped, but the most rigorous and sensitive searches began to require hours of computing time to align and score even a single query sequence against a database of sequences. The NIH and NSF responded to the challenge by supporting research and development of new computer architectures, regional supercomputer centers, and several large-scale computing initiatives. (See Thomas P. Hughes, et al., eds., Funding a Revolution: Government Support for Computing Research, Washington, D.C., National Academy Press, 1999.) Commercial vendors such as DEC, Sun Microsystems, Cray Computers, and MasPar Computer Corporation tried to meet the large-scale computing needs of geneticists with, for example, massively parallel computers, such as the MasPar MP-1 computer. In early 1992, the MasPar MP-1104 with 4,096 processors could search the entire Swiss-Protein database in 30 seconds with a query of 100 amino acids, and a query of 1000 amino acids could be executed on the GenBank database (74,000 sequences) in 15 minutes. (See IntelliGenetics, Inc., and MasPar Computer Corporation, “BLAZE: A Massively Parallel Sequence Similarity Search Program for Molecular Biologists,” Product Information Bulletin, May 1992.)

[81] See the discussion of this technology at the NIGMS site listed in note 79, as well as at the site of Oxford Glycosciences: http://www.ogs.com/proteome/home.html

[82] See especially the papers in L. Hunter, ed., Artificial Intelligence and Molecular Biology , Menlo Park, CA, AAAI Press, 1993.

[83] C. G. Nevill-Manning, T. D. Wu, et al. “Highly Specific Protein Sequence Motifs for Genomic Analysis.” Proceedings of the National Academy of Sciences 95 (1998): in press. EMOTIF can be viewed at http://motif.stanford.edu/emotif