Report

The Automation of Science


Science  03 Apr 2009:
Vol. 324, Issue 5923, pp. 85-89
DOI: 10.1126/science.1165620

Abstract

The basis of science is the hypothetico-deductive method and the recording of experiments in sufficient detail to enable reproducibility. We report the development of Robot Scientist “Adam,” which advances the automation of both. Adam has autonomously generated functional genomics hypotheses about the yeast Saccharomyces cerevisiae and experimentally tested these hypotheses by using laboratory automation. We have confirmed Adam's conclusions through manual experiments. To describe Adam's research, we have developed an ontology and logical language. The resulting formalization involves over 10,000 different research units in a nested treelike structure, 10 levels deep, that relates the 6.6 million biomass measurements to their logical description. This formalization describes how a machine contributed to scientific knowledge.

Computers are playing an ever-greater role in the scientific process (1). Their use to control the execution of experiments contributes to a vast expansion in the production of scientific data (2). This growth in scientific data, in turn, requires the increased use of computers for analysis and modeling. The use of computers is also changing the way that science is described and reported. Scientific knowledge is best expressed in formal logical languages (3). Only formal languages provide sufficient semantic clarity to ensure reproducibility and the free exchange of scientific knowledge. Despite the advantages of logic, most scientific knowledge is expressed only in natural languages. This is now changing through developments such as the Semantic Web (4) and ontologies (5).

A natural extension of the trend to ever-greater computer involvement in science is the concept of a robot scientist (6). This is a physically implemented laboratory automation system that exploits techniques from the field of artificial intelligence (7–9) to execute cycles of scientific experimentation. A robot scientist automatically originates hypotheses to explain observations, devises experiments to test these hypotheses, physically runs the experiments by using laboratory robotics, interprets the results, and then repeats the cycle.

High-throughput laboratory automation is transforming biology and revealing vast amounts of new scientific knowledge (10). Nevertheless, existing high-throughput methods are currently inadequate for areas such as systems biology. This is because, even though very large numbers of experiments can be executed, each individual experiment cannot be designed to test a hypothesis about a model. Robot scientists have the potential to overcome this fundamental limitation.

The complexity of biological systems necessitates the recording of experimental metadata in as much detail as possible. Acquiring these metadata has often proved problematic. With robot scientists, comprehensive metadata are produced as a natural by-product of the way they work. Because the experiments are conceived and executed automatically by computer, it is possible to completely capture and digitally curate all aspects of the scientific process (11, 12).

To demonstrate that the robot scientist methodology can be both automated and made effective enough to contribute to scientific knowledge, we have developed Robot Scientist “Adam” (13) (Fig. 1). Adam's hardware is fully automated such that it requires a technician only to periodically add laboratory consumables and to remove waste. It is designed to automate the high-throughput execution of individually designed microbial batch growth experiments in microtiter plates (14). Adam measures growth curves (phenotypes) of selected microbial strains (genotypes) growing in defined media (environments). Growth of cell cultures can be easily measured in high throughput, and growth curves are sensitive to changes in genotype and environment.

Fig. 1.

The Robot Scientist Adam. The advances that distinguish Adam from other complex laboratory systems are the individual design of the experiments to test hypotheses and the utilization of complex internal cycles. Adam's basic operations are selection of specified yeast strains from a library held in a freezer, inoculation of these strains into microtiter plate wells containing rich medium, measurement of growth curves on rich medium, harvesting of a defined quantity of cells from each well, inoculation of these cells into wells containing defined media (minimal synthetic dextrose medium plus up to four added metabolites from a choice of six), and measurement of growth curves on the specified media. To achieve this functionality, Adam has the following components: a, an automated –20°C freezer; b, three liquid handlers (one of which can separately control 96 fluid channels simultaneously); c, three automated +30°C incubators; d, two automated plate readers; e, three robot arms; f, two automated plate slides; g, an automated plate centrifuge; h, an automated plate washer; i, two high-efficiency particulate air filters; and j, a rigid transparent plastic enclosure. There are also two bar code readers, seven cameras, 20 environment sensors, and four personal computers, as well as the software. Adam is capable of designing and initiating over a thousand new strain and defined-growth-medium experiments each day (from a selection of thousands of yeast strains), with each experiment lasting up to 5 days. The design enables measurement of OD595nm for each experiment at least once every 30 min (more often if running at less than full capacity), allowing accurate growth curves to be recorded (typically we take over a hundred measurements a day per well), plus associated metadata. See the supporting online material for pictures and a video of Adam in action.

We applied Adam to the identification of genes encoding orphan enzymes in Saccharomyces cerevisiae: enzymes catalyzing biochemical reactions thought to occur in yeast, but for which the encoding gene(s) are not known (15). Setting up Adam for this application required (i) a comprehensive logical model encoding knowledge of S. cerevisiae metabolism [∼1200 open reading frames (ORFs), ∼800 metabolites] (15), expressed in the logic programming language Prolog; (ii) a general bioinformatic database of genes and proteins involved in metabolism; (iii) software to abduce hypotheses about the genes encoding the orphan enzymes, done by using a combination of standard bioinformatic software and databases; (iv) software to deduce experiments that test the observational consequences of hypotheses (based on the model); (v) software to plan and design the experiments, which are based on the use of deletion mutants and the addition of selected metabolites to a defined growth medium; (vi) laboratory automation software to physically execute the experimental plan and to record the data and metadata in a relational database; (vii) software to analyze the data and metadata (generate growth curves and extract parameters); and (viii) software to relate the analyzed data to the hypotheses; for example, statistical methods are required to decide on significance. Once this infrastructure is in place, no human intellectual intervention is necessary to execute cycles of simple hypothesis-led experimentation. [For more details of the software, and its application to a related functional genomics problem, see (16) and figs. S1 and S2.]
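The deduction step in (iv) can be illustrated with a toy forward-chaining model. This is a minimal sketch in Python rather than the Prolog used for Adam's model, and it assumes a drastically simplified pathway: the reaction list, the gene labels `GENE_A`, `CANDIDATE`, and `GENE_C`, and the function names are all hypothetical. It shows only the logic by which a hypothesis (a candidate gene encodes an enzyme) yields a testable prediction about a deletion mutant grown with or without an added metabolite.

```python
def reachable(reactions, medium, deleted_genes=frozenset()):
    """Forward-chain from the medium through reactions whose genes are intact."""
    known = set(medium)
    changed = True
    while changed:
        changed = False
        for genes, substrates, products in reactions:
            # Assumption: a reaction is active if any one encoding gene survives.
            active = any(g not in deleted_genes for g in genes)
            if active and set(substrates) <= known and not set(products) <= known:
                known |= set(products)
                changed = True
    return known


def predicts_growth(reactions, medium, essential, deleted_genes=frozenset()):
    """Growth is predicted when every essential metabolite can be produced."""
    return set(essential) <= reachable(reactions, medium, deleted_genes)


# Hypothetical three-step pathway: (encoding genes, substrates, products).
TOY_REACTIONS = [
    (("GENE_A",), ("glucose",), ("2-oxoadipate",)),
    (("CANDIDATE",), ("2-oxoadipate",), ("2-aminoadipate",)),
    (("GENE_C",), ("2-aminoadipate",), ("lysine",)),
]
MINIMAL = {"glucose"}
ESSENTIAL = {"lysine"}

wild_type = predicts_growth(TOY_REACTIONS, MINIMAL, ESSENTIAL)
deletant = predicts_growth(TOY_REACTIONS, MINIMAL, ESSENTIAL, {"CANDIDATE"})
rescued = predicts_growth(
    TOY_REACTIONS, MINIMAL | {"2-aminoadipate"}, ESSENTIAL, {"CANDIDATE"}
)
```

In this toy model the hypothesis predicts that the deletant fails on minimal medium and is rescued by adding the downstream metabolite. The all-or-nothing outcome is a simplification: the evidence reported here rests on observed differences in growth curves, not on a Boolean growth call.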

Adam formulated and tested 20 hypotheses concerning genes encoding 13 orphan enzymes (16) (Table 1). The weight of the experimental evidence for the hypotheses varied (based on observations of differential growth), but 12 hypotheses with no previous evidence were confirmed with P < 0.05 for the null hypothesis.

Table 1.

The orphan enzymes and Adam's hypotheses. The hypothesized genes are those that Adam abduced to encode an orphan enzyme. Prob. is Adam's Monte Carlo estimate of the probability of obtaining the observed discrimination accuracy or better with a random labeling of replicates. The discrimination is between the differences in growth curves observed with the addition of specified metabolites to the wild type and the deletant. Acc. is the highest accuracy for a metabolite species in discriminating between the growth curves observed with the addition of specified metabolites to the wild type and the deletant. No. is the number of metabolites tested. Existing annotation is the summary from the Saccharomyces Genome Database of the annotation of the ORF. Dry is the summary of whether the annotated function is the same as predicted by Adam. If a gene already has an associated function, we do not consider this to be contradictory to Adam's conclusions unless this function is capable of explaining the observed growth phenotype, for example, BCY1. ida indicates inferred from direct assay and iss, inferred from sequence or structural similarity (5). Wet is the result of our manual enzyme assays. See (16) for details.
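One way to read the Prob. column is as a permutation-style Monte Carlo estimate. The sketch below assumes, for illustration only, that each replicate is reduced to a single number and that discrimination uses a one-threshold rule; the classifier and growth-curve features Adam actually used are described in (16) and are not reproduced here. The function names are hypothetical.

```python
import random


def best_threshold_accuracy(values, labels):
    """Highest accuracy of any single-threshold rule separating two classes."""
    n = len(values)
    best = 0.0
    for t in sorted(set(values)):
        # Score the rule "class 1 when value >= t" and its mirror image.
        correct = sum((v >= t) == bool(l) for v, l in zip(values, labels))
        best = max(best, correct / n, (n - correct) / n)
    return best


def monte_carlo_p(values, labels, n_perm=2000, seed=0):
    """Estimate P(accuracy >= observed) under random relabeling of replicates."""
    rng = random.Random(seed)
    observed = best_threshold_accuracy(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_threshold_accuracy(values, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Made-up replicate summaries: 0 = wild type, 1 = deletant.
values = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
accuracy, p_value = monte_carlo_p(values, labels)
```

With perfectly separated made-up data such as these, only a small fraction of random relabelings match the observed accuracy, so the estimated probability is low.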


Because Adam's experimental evidence for its conclusions is indirect, we tested Adam's conclusions with more direct experimental methods. The enzyme 2-aminoadipate:2-oxoglutarate aminotransferase (2A2OA) catalyzes a reaction in the lysine biosynthetic pathways of fungi. Adam hypothesized that three genes (YER152C, YJL060W, and YGL202W) encode this enzyme and observed results consistent with all three hypotheses (Table 1). To test Adam's conclusions, we purified the protein products of these genes and used them in in vitro enzyme assays, which confirmed Adam's conclusions [supporting online material (SOM)] (Fig. 2).

Fig. 2.

Assay results for 2A2OA activity. The proteins encoded by YGL202W, YJL060W, YER152C, and YDL168W were expressed from OpenBiosystems (www.openbiosystems.com) yeast ORF clones and purified. Activity was tested in an assay of NADPH (reduced form of nicotinamide adenine dinucleotide phosphate) production based on (22). L-α-aminoadipic acid and 2-oxoglutarate were provided as substrates and pyridoxal phosphate as cofactor. Glutamate production was assayed by using commercially available yeast glutamate dehydrogenase, which uses NADP as cofactor and deaminates glutamate, producing ammonia and NADPH and regenerating 2-oxoglutarate (16). Also consistent with 2A2OA activity is experimental evidence indicating a higher activity with L-α-aminoadipic acid over either alanine or aspartate (16).

To further test Adam's conclusions, we examined the scientific literature on the 20 genes investigated (Table 1) (16). This revealed the existence of strong empirical evidence for the correctness of six of the hypotheses; that is, the enzymes were not actually orphans (Table 1). Adam considered them to be orphans because it used an incomplete bioinformatic database. These six genes therefore constitute a positive control for Adam's methodology. A possible error was also revealed (Table 1) (SOM).

To better understand why the identity of the genes encoding these enzymes has remained obscure for so long, we investigated their comparative genomics in detail (16). The likely explanation is a combination of three complicating factors: gene duplications with retention of overlapping function, enzymes that catalyze more than one related reaction, and existing functional annotations. Adam's systematic bioinformatic and quantitative phenotypic analyses were required to unravel this web of functionality.

Use of a robot scientist enables all aspects of a scientific investigation to be formalized in logic. For the core organization of this formalization, we used the ontology of scientific experiments: EXPO (11, 12). This ontology formalizes generic knowledge about experiments. For Adam, we developed LABORS, a customized version of EXPO, expressed in the description logic language OWL-DL (17). Application of LABORS produces experimental descriptions in the logic-programming language Datalog (18). In the course of its investigations, Adam observed 6,657,024 optical density (OD595nm) measurements (forming 26,495 growth curves). These data are held in a MySQL relational database. Use of LABORS resulted in a formalization of the scientific argumentation involving over 10,000 different research units (segments of experimental research). This has a nested treelike structure, 10 levels deep, that logically connects the experimental observations to the experimental metadata (Fig. 3). This structure resembles the trace of a computer program and takes up 366 Mbytes (16). Making such experimental structures explicit renders scientific research more comprehensible, reproducible, and reusable. This paper may be considered as simply the human-friendly summary of the formalization.
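The nested structure of research units can be pictured as a tree whose leaves carry observations. The following is a minimal sketch in Python, not the LABORS or Datalog representation itself: the `ResearchUnit` class, its fields, and the example values are hypothetical, and only the level names (study, cycle, trial, test, replicate) are taken from Fig. 3.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResearchUnit:
    kind: str  # e.g. "study", "cycle", "trial", "test", "replicate"
    name: str
    children: List["ResearchUnit"] = field(default_factory=list)
    observations: List[float] = field(default_factory=list)  # OD readings


def depth(unit):
    """Number of levels from this unit down to its deepest descendant."""
    return 1 + max((depth(c) for c in unit.children), default=0)


def count_units(unit):
    """Total number of research units in the subtree, including this one."""
    return 1 + sum(count_units(c) for c in unit.children)


def count_observations(unit):
    """Total number of observations attached anywhere in the subtree."""
    return len(unit.observations) + sum(
        count_observations(c) for c in unit.children
    )


# Made-up fragment: two replicates of one test, nested up to a study.
replicates = [
    ResearchUnit("replicate", f"r{i}", observations=[0.05, 0.11, 0.24])
    for i in range(2)
]
test = ResearchUnit("test", "deletant plus metabolite", children=replicates)
trial = ResearchUnit("trial", "t1", children=[test])
cycle = ResearchUnit("cycle", "c1", children=[trial])
study = ResearchUnit("study", "s1", children=[cycle])
```

Here `depth(study)` is 5 and `count_observations(study)` is 6. For scale, the figures reported above imply an average of roughly 251 measurements per growth curve (6,657,024 / 26,495).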

Fig. 3.

Structure of the Robot Scientist investigation (a fragment). It consists of two main parts: an investigation into the automation of science and an investigation into the reuse of formalized experiment information. The top levels involve AI research (red), which requires research in functional genomics (blue) and systems biology (yellow). Each level of research unit (studies, cycles, trials, tests, and replicates) is characterized by a specific set of properties (fig. S3) (16). Such a nested structure is typical of many scientific experiments, where the testing of a top-level hypothesis requires the planning of many levels of supporting work. What is atypical in Adam's work is the scale and depth of the nesting.

A major motivation for the formalization of experimental knowledge is the expectation that such knowledge is more easily reused to answer other scientific questions. To test this, we investigated whether we could reuse Adam's functional genomic research (16). An example question investigated was the relative growth rates (μmax) in rich and defined media of the deletion strains compared with those of the wild type. What was observed, in both media, was a skewed distribution, with a few deletants having a much lower μmax than that of the wild type, but most having a slightly higher μmax. These observations question the common assumption that wild-type S. cerevisiae is optimized for μmax and provide quantitative test data for yeast systems biology models (19).
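A common way to extract μmax from a growth curve is to take the steepest slope of ln(OD) against time. The sketch below follows that approach under stated assumptions: OD values are blank-corrected and strictly positive, times are in hours, and the window size is arbitrary. The procedure actually used to derive μmax from Adam's data is described in (16); the function name `mu_max` is hypothetical.

```python
import math


def mu_max(times_h, od, window=5):
    """Largest least-squares slope of ln(OD) versus time over a sliding window."""
    if len(times_h) != len(od) or len(od) < window:
        raise ValueError("need equal-length series with at least `window` points")
    log_od = [math.log(v) for v in od]  # requires strictly positive OD values
    best = float("-inf")
    for i in range(len(od) - window + 1):
        t = times_h[i:i + window]
        y = log_od[i:i + window]
        t_mean = sum(t) / window
        y_mean = sum(y) / window
        sxx = sum((a - t_mean) ** 2 for a in t)
        sxy = sum((a - t_mean) * (b - y_mean) for a, b in zip(t, y))
        best = max(best, sxy / sxx)
    return best


# Synthetic curve: exponential growth at 0.4 per hour, then a plateau.
times = [0.5 * k for k in range(30)]
curve = [0.01 * math.exp(0.4 * min(t, 8.0)) for t in times]
estimate = mu_max(times, curve)
```

On this synthetic curve the estimate recovers the growth rate of the exponential phase, about 0.4 per hour. Comparing such estimates between deletants and the wild type gives the kind of distribution discussed above.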

It could be argued that the scientific knowledge “discovered” by Adam is implicit in the formulation of the problem and is therefore not novel. This argument that computers cannot originate anything is known as Lady Lovelace's objection (20): “The Analytical Engine has no pretensions to originate anything. It can do whatever we know how to order it to perform” (her italics). We accept that the knowledge automatically generated by Adam is of a modest kind. However, this knowledge is not trivial, and in the case of the genes encoding 2A2OA, it sheds light on, and perhaps solves, a 50-year-old puzzle (21).

Adam is a prototype and could be greatly improved. Its hardware and software are “brittle,” so although Adam is capable of running for a few days without human intervention, it is advisable to have a technician nearby in case of problems. The integration of Adam's artificial intelligence (AI) software also needs to be enhanced so that it works seamlessly. To extend Adam, we have developed software to enable external users to propose hypotheses and experiments, and we plan to automatically publish the logical descriptions of automated experiments. The idea is to develop a way of enabling teams of human and robot scientists to work together. The greatest research challenge will be to improve the scientific intelligence of the software. We have shown that a simple form of hypothesis-led discovery can be automated. What remain to be determined are the limits of automation.

Supporting Online Material

www.sciencemag.org/cgi/content/full/324/5923/85/DC1

Materials and Methods

Figs. S1 to S3

Table S1

References and Notes
