News this Week

Science  11 Feb 2011:
Vol. 331, Issue 6018, pp. 654
  1. Around the World

    1 - Sandwich, U.K.
    Pfizer Axes R&D Lab
    2 - Washington, D.C.
    No More Earmarks?
    3 - Sark, Channel Islands
    Island Recognized for Its Dark Skies
    4 - Kyushu, Japan
    Volcano May Be Gathering Steam
    5 - Northern Queensland, Australia
    Great Barrier Reef Escapes Battering
    6 - Lake Vostok, Antarctica
    Sealed-Off Lagoon Still Tantalizingly Out of Reach
    7 - London and Brussels
    European Research Funding Set For Makeover

    Sandwich, U.K.

    Pfizer Axes R&D Lab

    Pfizer, the world's largest drug company, is slashing costs—mainly by dropping some areas of research such as work on allergies, respiratory diseases, and antibacterials—in an effort to make its stock more attractive to investors. CEO Ian Read, who took the helm in December, announced that the company will cut R&D spending by about 20% in 2012, down from a projected $8.5 billion to $7 billion or less. The company is streamlining its operations following its takeover in 2009 of another large drug company, Wyeth. Worst affected by the cuts will be Pfizer's research facility in Sandwich, U.K., now slated to be closed. The company also plans to shrink its lab in Groton, Connecticut, but will add several hundred positions to the research staff in Cambridge, Massachusetts. The aim, Read said in a teleconference with investors, is to “fix our innovative core” by instilling an “entrepreneurial sense” and a “results-oriented culture in research” (see p. 658).

    Washington, D.C.

    No More Earmarks?

    The new U.S. Congress has sworn off earmarks, the controversial practice of directing spending for activities not requested by a federal agency.

    The Republican-led House of Representatives has banned earmarks, which totaled $16 billion in 2010, including $2 billion to specific research and construction projects at hundreds of U.S. universities. Last month, in his State of the Union address, President Barack Obama said he would veto any spending bill that contained earmarks. On 1 February, the last major holdout threw in the towel.

    “The handwriting is clearly on the wall,” said Senator Daniel Inouye (D–HI), chair of the Senate Appropriations Committee and one of the biggest practitioners of earmarking. “It makes no sense to accept earmark requests that have no chance of being enacted into law.” But Inouye says he hopes the next Congress can come up with a “transparent and fair” process that removes the odor from earmarks (see p. 661).

    Sark, Channel Islands

    Island Recognized for Its Dark Skies

    CREDIT: BOB EMBLETON

    Sark, the smallest of the four main Channel Islands, has no paved roads, no cars, and no public street lighting. When it gets dark, it gets really dark, making for spectacular Milky Way views.

    The island's rustic ways have now earned it the title of the world's first “dark sky island,” bestowed by the Tucson-based International Dark-Sky Association (IDA), which raises awareness of light pollution and its effects. Many of the island's 650 residents have modified the lighting on their homes and businesses to minimize the amount of light spilling upward, says Steve Owens, a member of the IDA committee that identifies and recognizes sites with suitably dark skies. With the new recognition, he notes, Sark will likely see a boost in tourism, especially among amateur astronomers.

    Kyushu, Japan

    Volcano May Be Gathering Steam

    CREDIT: TAKAYUKI KANEKO, EARTHQUAKE RESEARCH INSTITUTE, UNIVERSITY OF TOKYO

    Japanese scientists and public safety officials are warily eyeing the volcano Shinmoedake, which started rumbling on 26 January and has had at least minor eruptions nearly every day since. As Science went to press, a lava dome had grown to nearly fill the 700-meter-wide crater of the peak on Kyushu Island. “There could be much stronger eruptions over the next 2 weeks,” says Ryusuke Imura, a volcanologist at Kagoshima University. Experts expect eruptions of ash and pumice rather than massive lava flows because of the volcano's structure. Imura says Shinmoedake could keep erupting for months, as it did starting in 1716 when eruptions went on for over a year.

    Northern Queensland, Australia

    Great Barrier Reef Escapes Battering

    Cyclone Yasi, the biggest and most powerful storm to hit the country in a century, tore roofs off houses along Northern Queensland's coast, destroyed banana and sugar cane crops, and sent 20,000 Australians packing to evacuation centers. But the Category 5 cyclone's powerful winds, which reached up to 290 km/h, may have left the Great Barrier Reef largely unscathed, says Ove Hoegh-Guldberg, director of the University of Queensland's Global Change Institute. Unlike Cyclone Hamish of March 2009, which tracked south along the reef for 18 hours, blasting almost half the coral cover from some exposed patches, Yasi moved due west, cutting quickly across a reef that was covered by high tide. “If a hurricane has a silver lining, this was it,” he says. But Hoegh-Guldberg worries that Yasi only worsened soil and nutrient runoff into reef waters—visible by satellite as a huge plume—that was caused by last month's epic flooding of the region.

    Lake Vostok, Antarctica

    Sealed-Off Lagoon Still Tantalizingly Out of Reach

    The Russian team that has been drilling 24 hours a day to reach subglacial Lake Vostok, which lies at the bottom of a 3750-meter-thick ice sheet in Antarctica, has come up 29.53 meters short. The team must pack its bags and board an airplane before temperatures drop so low that the plane's hydraulic fluid freezes. The researchers will rest up, reacquaint themselves with their families, and prepare for the next Antarctic summer, when it is hoped the drill head will puncture the pristine waters to reveal new life forms. Valery Lukin, head of the Arctic and Antarctic Research Institute in St. Petersburg, says, “There is no frustration,” and quotes Fridtjof Nansen, a Norwegian Arctic explorer: “The highest virtue of a polar explorer is the skill of waiting.”

    London and Brussels

    European Research Funding Set For Makeover

    Máire Geoghegan-Quinn, the European Research, Innovation and Science Commissioner, has called for rebranding Europe's massive funding scheme, the Framework Programme (FP). Speaking to the Royal Society on 7 February in London, she proposed a replacement enterprise that would combine traditional FP funds and other dispersed European research monies. A contest to name it is planned.

    The speech followed a 4 February summit in Brussels where European Union heads of state for the first time dedicated an afternoon to discussing innovation. There, leaders pledged to finish establishing a “European Research Area” by 2014, endorsed the creation of E.U.-wide intellectual property rules, and voiced support for an E.U.-wide venture capital fund. They also called for E.U. financial regulations to be simplified by the end of the year, a move observers say is key for making funding programs more grantee-friendly. http://scim.ag/EU-research

  2. Random Sample

    Noted

    U.K. archaeologists are protesting a 2008 change in the licensing of excavations that requires reburying human remains found in England or Wales within 2 years. A letter published 4 February in The Guardian and signed by 40 archaeologists lamented that the new rules “are impeding scientific research, preventing new discoveries from entering museums, and are not in the public interest.”

    They Said It

    “It's a bit of a bother. We're getting a lot of inquiries from overseas asking if it's true.”

    —Eisuke Aizawa, spokesperson for the Japan Aerospace Exploration Agency (JAXA), after London's The Telegraph and other newspapers reported erroneously that JAXA and a Japanese fishing net maker had teamed up to make “a giant net several kilometers in size” that would sweep up abandoned satellites and drag them into the atmosphere to burn up. JAXA is exploring creative ways to take down space junk, but a giant fishing net is not among them. http://scim.ag/space-net


    Legal Muddle Hindering U.S. Stem Cell Research

    CREDIT: ADAPTED FROM A. D. LEVINE, CELL STEM CELL 8 (4 FEBRUARY 2011)

    The ongoing uncertainty about the legality of U.S. funding for research on human embryonic stem cells (hESCs) is slowing the work of many researchers across the stem cell field, a new survey finds.

    Last August, a federal court briefly halted federal funding for hESC research; the injunction is now being appealed. Meanwhile, public policy researcher Aaron Levine of the Georgia Institute of Technology in Atlanta contacted 1410 U.S. stem cell researchers, 370 of whom filled out an online survey in November. About 75% of 206 hESC researchers reported a moderate or substantial impact from the policy uncertainty, Levine reported last week in Cell Stem Cell. Effects included delays in starting projects or hiring staff members, added costs to separate hESC and non-hESC work, and decisions to move away from using hESCs. Even stem cell scientists who don't study hESCs reported negative impacts, such as disrupted collaborations.

    New Zealand's Lost-and-Found Pink Terraces

    CREDIT: THE TERRACES, CHARLES BLOMFIELD (1885), MUSEUM OF NEW ZEALAND TE PAPA TONGAREWA/THE BRIDGEMAN ART LIBRARY

    The Pink Terraces that fringed Lake Rotomahana were once New Zealand's proudest tourist attraction, until volcanic activity in 1886 submerged the enormous silica deposits. Now a research expedition has once again caught sight of the former “Eighth Wonder of the World.”

    The Pink Terraces, along with the White Terraces on the opposite side of the same lake, formed as water from hot, mineral-rich springs cascaded down hillsides through a series of pools. In June 1886, volcanic activity rocked nearby Mount Tarawera and blocked the stream flowing from the lake. Water levels rose about 100 meters, dramatically expanding the lake's size, says Cornel de Ronde, a geologist with GNS Science, a government-owned research institute based in Lower Hutt, New Zealand.

    In a 2-week-long field study, de Ronde and colleagues, including researchers from Woods Hole Oceanographic Institution (WHOI) in Massachusetts, used a 2-meter-long autonomous underwater vehicle (AUV) to scan the lakebed with sonar. Near the terraces' former location and about 60 meters down, they detected long, crescent-shaped features—the bottom tiers of the Pink Terraces. There were no signs of their upper levels or of the White Terraces.

    Data from the expedition could elucidate how other land-based hydrothermal systems, such as the hot springs and geysers in Yellowstone National Park in the United States, might respond to geological disturbances, says de Ronde. “I never really thought in my wildest dreams we'd find these,” he says. “For New Zealanders, this is the equivalent of finding the Titanic.”

    By the Numbers

    $42 billion — The reduction, proposed last week by the Republican leadership in the House of Representatives, in what U.S. civilian agencies will spend during the rest of fiscal year 2011 on all activities, including all domestic research. The plan cuts total spending to $419 billion, $58 billion below President Barack Obama's 2011 request, which remains in limbo.

    61% — The percentage of North America's nearly 10,000 threatened plant species that are not maintained in seed banks or living collections, according to a new assessment of 230 collections.

    $60 million — The budget of a new initiative by the Howard Hughes Medical Institute to develop science documentary features for television. Each hourlong episode of Planet Earth, the BBC's celebrated 11-part nature series, cost about $2 million.

  3. Newsmakers

    The Million-Dollar Biomarker

    CREDIT: BRUCE WAHL, BETH ISRAEL DEACONESS MEDICAL CENTER

    A Boston neurologist has scored $1 million for developing a new way to track the progression of amyotrophic lateral sclerosis (ALS).

    Seward Rutkove, who works at Beth Israel Deaconess Medical Center, learned about the competition run by Prize4Life, an ALS patient group, in 2007 while at a scientific meeting. Someone on the group's scientific advisory board spotted his poster and told him, “You should be applying for this,” he recalls. Rutkove had worked with Carl Shiffman and Ronald Aaron, physicists at Boston's Northeastern University, to measure how deteriorating muscle fibers respond to a painless electrical current from electrodes placed on the skin and was in the midst of more testing. In 2008, Rutkove won a $50,000 “progress prize” from Prize4Life, but the biggie remained out of his reach. Further studies in patients and animals made the work a winner.

    Would he have done this without the prize as an incentive? “Yes,” says Rutkove, “but it has helped me focus.” He plans to invest a portion of his winnings in a company he's helped found to commercialize the technology.

    German Anesthesiologist Under Fire

    An ongoing investigation into the work of a German anesthesiologist may lead to as many as 90 retractions. The Klinikum Ludwigshafen and the German state medical association of Rheinland-Pfalz have announced that 90 of 115 studies by Joachim Boldt that they examined lacked proper approval from an institutional review board (IRB). In response, 11 scientific journals posted a joint letter on their Web sites stating that lack of IRB approval is grounds for automatic retraction of an article.

    Boldt was fired in November from his job as head of anesthesia at the Klinikum Ludwigshafen after an investigation into a 2009 paper raised suspicions that the study described in the article had never taken place. He has published more than 200 articles, many about the safety of hydroxyethyl starch, which is used as a fluid replacement during surgery. Although the investigation has so far found no evidence that study subjects were harmed, experts in Germany are now recommending against routine use of the fluid given the questions surrounding Boldt's work.

  4. Human Genome 10th Anniversary

    What Would You Do?

    1. Jennifer Couzin-Frankel

    As technology makes it easier to sequence people's DNA for research, scientists are facing tough decisions over what information to give back.

    Question bank.

    The UK Biobank holds more than 500,000 samples available for DNA studies.

    CREDITS: (PHOTO) PHIL NOBLE/REUTERS; (DNA ILLUSTRATION) C. BICKEL/SCIENCE

    The parents sat together in the exam room facing Leslie Biesecker, the geneticist in whose study they had enrolled their young daughter. She had unexplained mental retardation and a host of other problems. A close look at her chromosomes might illuminate why.

    And indeed it had. Biesecker shared the news that the little girl had a deletion in one chromosome, a chunk of DNA gone missing when she was conceived. Given that the parents had voluntarily enrolled her in a study whose goal was to find DNA deletions like this one, he expected them to be pleased, or at least relieved.

    The reality was different.

    “The father was enraged, enraged,” slamming his closed fist down on the table, Biesecker remembers now, more than a decade later. “Here was someone involved in a study with the express focus of finding what was causing their daughter's disability, and he was horrified when we found it.” The reason, the father suggested, was that the missing DNA couldn't be replaced. His daughter would never be normal.

    That moment stayed with Biesecker, a reminder that research participants may harbor intense hopes they expect scientists to confirm, or may not know what they want until the results are laid out in front of them. It left him treading carefully, though doggedly, into uncharted territory, as he began plotting how to return genetic findings to people participating in research.

    With genetic studies multiplying and sequencing costs plunging, more than a million people worldwide are, sometimes unknowingly, sharing their DNA with hundreds or even thousands of researchers. And it's slowly dawning on many scientists and ethicists that even if the DNA was offered to study diabetes or heart disease or some other specific condition, it may surrender many other secrets. Is a study participant at a high risk, or even just a higher risk, of breast cancer? Does she have a sex chromosome anomaly or carry a cystic fibrosis mutation that could threaten her offspring?

    Whether to divulge results like these, and how, is arguably the most pressing issue in genetics today. It “comes up in every conversation,” says Jean McEwen, a program director at the Ethical, Legal and Social Implications (ELSI) Research Program, which is housed in the U.S. National Human Genome Research Institute (NHGRI) in Bethesda, Maryland, where Biesecker also works. “This issue, which was a few years ago kind of theoretical, is becoming real.”

    This News Focus article, the related podcast by its author, and another News Focus on the genomic data explosion (p. 666) are part of a collection this month reflecting on the 10th anniversary of the publication of the human genome. All the stories, and other related material (see also Essays p. 689), will be gathered at http://scim.ag/genome10

    ELSI is now accepting applications for more than $7.5 million in studies on how to share genetic results with research participants. In December, 28 researchers convened by the U.S. National Heart, Lung and Blood Institute (NHLBI) in Bethesda published a set of “ethical and practical” guidelines for returning such results. Hospitals struggling with the issue are running focus groups and mailing surveys to patients and families, querying them on what they might want to learn, however unexpected, about their or their child's DNA.

    “Do you really want to know that your child is going to get Alzheimer's disease when they're 60?” asks Ingrid Holm, a pediatric geneticist and endocrinologist at Children's Hospital Boston, which is launching a registry designed to return genetic research results. People “say they want everything back,” she continues. “I'm not sure they know what everything means.”

    When to share

    The landscape in genetic testing has shifted irrevocably just in the past year or so. Until recently, technology and cost limited geneticists to querying very narrow stretches of DNA, or sequencing a relative handful of DNA variants across the genome. But high-powered, next-generation DNA-sequencing machines are quickly making those approaches obsolete. With the new technology, it's possible to affordably sequence a person's “exome,” all the DNA that generates proteins, which, when defective, can drive disease. Sequencing entire genomes of many research volunteers could soon be the new norm.

    Even simple quality-control measures common to genetic studies can wind up posing a dilemma for researchers. Labs often verify that samples are correctly labeled: that a female sample actually has two X chromosomes, for example, and a male's has an X and a Y. This double-checking can turn up sex chromosome disorders, like Klinefelter syndrome in which men are XXY or Turner syndrome in which women have one X chromosome instead of two. People with Klinefelter's or Turner's vary in their symptoms, and a scientist may suddenly face the prospect of telling someone who donated DNA that their sex chromosomes are abnormal and that they are likely infertile.

    CREDIT (ILLUSTRATION): N. KEVITIYAGALA/SCIENCE; (PHOTO DETAIL) STACY HOWARD/CDC

    Furthermore, the new breed of genetic studies is often a fishing expedition. When researchers hunt all over the genome for DNA behind a particular disease, it's easy enough to collide with the unexpected. “You're using a technology that isn't just looking for the gene for X,” says Bartha Maria Knoppers, who studies law and genetics at McGill University in Montreal, Canada. “You're scanning the whole genome; you're going to see Y and Z.”

    While many genetic studies strip DNA samples of personal identifiers and assign each a number, such codes can often be linked to an individual by a central computer or by the researcher who collected the samples in the first place. In some studies, the DNA is truly anonymous and researchers couldn't contact the donors even if they wanted to, but “we've gotten away from that,” says Benjamin Wilfond, a physician and bioethicist at Seattle Children's Hospital in Washington.

    Genetics isn't the first field to come up against so-called incidental findings. At least 20% of “virtual” colonoscopies, which use CT scans, reveal something atypical outside the colon. And a 2007 study found that magnetic resonance imaging (MRI) scans of the brains of adults in a Dutch population study turned up an unexpected abnormality 13% of the time. The unsettling finds included aneurysms, asymptomatic strokes, and tumors. There's little public guidance for researchers on how to handle incidental findings like these, according to Susan Wolf, a law professor specializing in bioethics at the University of Minnesota Law School in Minneapolis.

    Those enmeshed in genetics, facing potentially many more such cases, are now seeking common ground. “I think there is growing consensus,” says Wolf, that what she calls “some really big-ticket items” should be shared with research participants. But despite “widespread agreement that that category exists, there is real disagreement and ferment” over what it encompasses. Some people, but not all, would include mutations in genes such as BRCA1; MSH2, which predisposes carriers to colon cancer; and factor V Leiden, which can cause blood-clotting problems and recurrent miscarriages but is treatable.

    Many favor sharing results, whether from a functional MRI or a genetic test, that are both “clinically relevant,” meaning they have a real impact on someone's health, and “medically actionable,” meaning something can be done to alleviate the risk once information is shared. It's defining these terms that's the problem. Is a gene that confers a 30% chance of developing a disease clinically relevant? How about 5% or 1%? And what qualifies as actionable? A relatively clear-cut case is a woman found to carry certain mutations in the BRCA1 gene; her risk of developing breast cancer is about 60% and is much increased for ovarian cancer. She could take advantage of intensive surveillance or have her breasts and ovaries removed to reduce her chance of cancer, something hundreds of women with BRCA1 mutations have done.

    Because preventive care can make a real difference for someone who carries a BRCA1 mutation, many researchers believe that these results are worth sharing. The NHLBI working group agreed and endorsed disclosing many findings that are clinically relevant and medically actionable. But drawing such boundaries “speaks to the kind of narrowness of the medical profession and a certain patronizing view,” says Robert Green, a neurologist at Boston University who has been studying how people respond to learning their genetic risk for Alzheimer's disease, which can't yet be prevented or treated. Most people “don't make the distinction between medically actionable and medically not actionable that the medical and research communities keep trying to make.”

    Of course, the data that are shared must be accurate, says Ellen Wright Clayton, who studies law and genetics at Vanderbilt University in Nashville, and they should be useful. But “deciding your threshold for that is an intensely value-laden question. … The issue about what's returnable is anything but scientific.”

    Many who have a voice in the discussion, such as Clayton, say they would shy away from sharing genetic results. One reason is that there could be legal implications if results are incorrect. Some researchers are double-checking findings in U.S. labs certified under the Clinical Laboratory Improvement Amendments, so-called CLIA labs; others are shifting their research work to CLIA labs.

    Then there's the issue of informed consent. Typically, informed consent forms for genetic studies are explicit in saying that results will not be returned. Although consent forms may change in the near future—and in a handful of cases already have—for now, when something comes up, researchers must ask themselves whether it rises to “a level where you're going to break that contract,” says Holm. In one case at Boston Children's, a blood sample from a child in an autism study suggested a fusion of two genes that would mean a still-undiagnosed cancer. A closer look dismissed this possibility, but had the result been accurate, the researchers assumed they would have shared it with the parents. The family of a boy in a research study at Children's who was found to have Klinefelter's was not told, however.

    CREDITS (ILLUSTRATION): N. KEVITIYAGALA/SCIENCE; (PHOTO DETAIL) NATIONAL INSTITUTE ON AGING/NIH

    Klinefelter's and other sex chromosome anomalies make researchers especially uneasy, in part because they're fairly common. If an older man in a genetic study is discovered to have Klinefelter's, how should one decide whether to divulge that, asks Clayton, who's aware of such a case right now. If the individual agreed not to get information back, Clayton's doubtful it should be shared. “What good is going to come out of that?” she asks.

    Others have erred on the side of openness. Alan Shuldiner studies the genetics of heart disease and diabetes at the University of Maryland School of Medicine in Baltimore and works with the Old Order Amish of Lancaster, Pennsylvania. Seven years ago he was parsing the DNA of 2000 Amish for sitosterolemia, a rare disease that causes the accumulation of plant sterols and leads to atherosclerosis and early death. Sitosterolemia is recessive, meaning that each parent must carry a copy of the defective gene to pass the disease along to their child. In his study, Shuldiner found one adult who carried two copies of the mutated gene and had the disease; because it can be treated by diet modifications, there was no question that this person should be told.

    But another 80 or so Amish turned up as healthy carriers, far more than expected given that fewer than 100 cases of sitosterolemia have been described in the general population. Shuldiner hadn't considered this outcome when designing the study. He consulted with his Amish advisory board, “who really felt we should share this information.” He sent a letter to all the Amish in the study—carriers and noncarriers alike—asking them to return a postcard stating whether they wanted their results. The “overwhelming majority” did, he says, and received them, along with counseling.

    What it takes

    Shuldiner's story is unusual, because he has nurtured a personal relationship with his research subjects over many years—something of a throwback in an era of massive biobanks and central DNA repositories accessed by hundreds of geneticists. The push to share data among scientists, across institutions and national borders, means that when a volunteer proffers DNA to one researcher, it often becomes accessible to many others who have no connection to the donor.

    This is especially true for biobanks, DNA collections that allow researchers everywhere to borrow samples. The UK Biobank alone has more than 500,000 of them. If a scientist using a biobank sample chances upon a disease mutation and wants to get back to the donor, where does she turn? DNA and tissue deposited in such banks are usually stripped of identifying information, and the researcher who first collected them may have retired, or moved, or died. That's one reason Knoppers and Wolf hope biobanks themselves will help coordinate delivery of these findings, something they're only beginning to contemplate.

    “Ethicists sit around a table and talk about” the importance of returning DNA results, “but if you talk to people like myself who have actually helped run biobanks, you can't imagine how unsuited we are to doing this,” says Green. Biobanks would have to reach out to the hundreds of thousands of people who have already shared DNA samples and inquire whether they might want information back; currently, virtually all biobank consent forms say that genetic results will not be returned. Even if informed consent forms change, the banks might then need to interact with researchers uncertain about what to share with a DNA donor and make decisions, often on a case-by-case basis, before recontacting a participant with a potentially upsetting research finding.

    “If we're really going to commit to taking this on as a part of every major research study, what is that going to do to the research enterprise?” asks ELSI's McEwen. “We're becoming almost a clinical feedback center.”

    One country may find out the answer to McEwen's question especially quickly. In 2007, Spain passed a law requiring that the physician in charge of a genetic study share information that “is necessary in order to avoid serious damage” to the health of the participant or that of his “biological family members.” Knoppers, who has concerns about legislating this issue, notes that the law incorrectly assumes that a physician is invariably involved. Often, those running the research are Ph.D.s who have never cared for a patient.

    Hampering the debate is an absence of data, with only assumptions to fall back on: assumptions by researchers about what's useful to study participants and the feasibility and impact of sharing genetic findings, and assumptions by participants about how they might benefit from the data they receive.

    There's a push now to move beyond guesswork. “I wanted to see what it was really going to take” to return genetic results, says NHGRI's Biesecker. In 2007, he enrolled the first volunteer in a DNA sequencing study called ClinSeq that now has more than 850 participants. Initially, ClinSeq focused on analyzing 200 to 400 genes that were mostly linked to heart disease, but the plan was always to expand well beyond that when the technology allowed, which Biesecker is now doing. His group is sequencing the exome of every participant to identify DNA behind a host of diseases. With permission from the volunteers, the researchers are then offering to disclose portions of what they find.

    It's a delicate process. “I would call you up and say, ‘Hey, you might remember you signed up for this study a year and a half ago. We have a medically significant result; it is the kind of result that might tell you about your future disposition to develop a disorder,’” says Biesecker. If the participant is interested, the finding is validated and the individual comes in to learn about it, a meeting that normally takes at least an hour.

    One thing Biesecker has learned is that generating data is the easy part. He has sequenced the exomes of more than 400 people and communicated results to about 10. Interpreting and validating the findings takes time, and so far Biesecker has focused on just a handful of genetic findings beyond those related to heart disease. They include BRCA mutations and others that dramatically increase cancer risk, or mutations that predispose to late-onset neurological disorders. The middle-aged men and women in ClinSeq can also learn about recessive mutations they carry; because they are past reproductive age, the information isn't relevant to them personally but they could share it with their children, now young adults, whose own offspring could be affected by a genetic disease.

    Biesecker already sees a problem with expanding ClinSeq's strategy across an entire population: It's not sustainable, he says, to spend hours and hours parsing one person's genome, then bring them in for a 2-hour face-to-face meeting. “The way we do it now doesn't scale,” he says. “It just doesn't.”

    Farther up the East Coast, at Boston Children's, Holm is grappling with the same issue. In October 2009, Children's launched The Gene Partnership project, a DNA registry that has so far enrolled 1000 patients and families for a range of genetic studies. It plans to return many findings related to disease risk, with guidance from an outside group of experts and Children's families, including 7000 to whom Holm sent surveys last month. Although the project will begin with face-to-face meetings for delivering any news, it anticipates shifting at some point to a Web portal that will notify participants that genetic results are available and offer them a phone call with a genetic counselor to learn more. That risks fomenting confusion about what specific findings mean, because sometimes “the only way” to ensure that people understand “is to go face to face,” says David Miller, a geneticist at Children's who works with patients who have developmental disabilities and isn't involved in the effort. But, he admits, “I don't have the right answer either.”

    CREDITS (ILLUSTRATION): N. KEVITIYAGALA/SCIENCE; (PHOTO DETAIL) ISTOCKPHOTOS.COM

    What happens next

    In the early days of widespread clinical gene sequencing—meaning about 3 years ago—the big question was how individuals would react when they learned what was buried in their DNA. Would knowledge of a looming fatal disease cause depression or even suicide attempts? Would those who learned about an uptick in heart attack or colon cancer risk embark on intense exercise regimes or overhaul their diets in hopes of staying healthy?

    Last month, a study published online in The New England Journal of Medicine reported that among 2000 people who bought genetic tests, 90% experienced no distress from the results. In Green's work telling people if they carry the APOE4 gene variant, which predisposes to Alzheimer's disease, he has found that they generally handle the news well and don't regret having learned it.

    But these examples are very different from what may become a more common scenario: an individual who donated DNA 5 years ago, has forgotten that the possibility of data return was listed in the consent form, and has no idea this information is barreling toward him or her. There's no easy way to study this. Biesecker has found that most people in ClinSeq have taken the news of a disease gene mutation in stride. Still, one was distressed and has not shared the results with family members. And only a handful of ClinSeq participants have gotten results so far.

    Another concern is the impact on the health care system when individuals receive a data dump of genetic information. “If you tell a million people that they've got 500 risk factors, and you tell their doctors, … how does this alter all the surveillance and treatment options” that are available? asks Green. This is a huge concern of Clayton's and a big reason why she generally opposes sharing genetic findings. “I think it will kill the health care system,” she says.

    CREDITS (ILLUSTRATION): N. KEVITIYAGALA/SCIENCE; (PHOTO DETAIL) WIKIPEDIA

    Holm takes the opposite view, arguing that imparting these findings could actually reduce health care costs because care might become more personalized. And either way, she says, “you can't say we're not going to do this” because of a potential cost crunch.

    Although some researchers have shared results from their genetic studies with participants, that's still uncommon; exome sequencing, which will expose many more incidental findings, is just on the cusp of rapid expansion. The Spanish law has generated much discussion but has apparently had little practical impact—yet.

    Still, geneticists need to start thinking about what, if anything, they are willing to tell their research subjects—and how they might approach breaking the news. Biesecker, reflecting back on that conversation years ago with the couple whose daughter had missing DNA, remembers that he asked the parents' permission to invite along the Ph.D. who made the discovery. She joined them for that conversation—and the father's reaction so disturbed her that she needed counseling afterward to cope with it.

  5. Human Genome 10th Anniversary

    Will Computers Crash Genomics?

    1. Elizabeth Pennisi

    New technologies are making sequencing DNA easier and cheaper than ever, but the ability to analyze and store all that data is lagging.

    CREDIT: ALVARO ARTEAGA/ALVAREJO.COM

    Lincoln Stein is worried. For decades, computers have improved at rates that have boggled the mind. But Stein, a bioinformaticist at the Ontario Institute for Cancer Research (OICR) in Toronto, Canada, works in a field that is moving even faster: genomics.

    The cost of sequencing DNA has taken a nosedive in the decade since the human genome was published—and it is now dropping by 50% every 5 months. The amount of sequence available to researchers has consequently skyrocketed, setting off warnings about a “data tsunami.” A single DNA sequencer can now generate in a day what it took 10 years to collect for the Human Genome Project. Computers are central to archiving and analyzing this information, notes Stein, but their processing power isn't increasing fast enough, and their costs are decreasing too slowly, to keep up with the deluge. The torrent of DNA data and the need to analyze it “will swamp our storage systems and crush our computer clusters,” Stein predicted last year in the journal Genome Biology.
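
    As a rough, hedged illustration of the mismatch Stein describes, the short Python sketch below compounds the 5-month cost-halving time quoted above against an assumed Moore's-law-style doubling of computing power every 24 months; the 24-month figure and the 5-year horizon are assumptions chosen for illustration, not numbers from this article.

        # Back-of-envelope sketch: sequencing-cost decline vs. compute growth.
        # The 5-month halving time is from the text; the 24-month compute
        # doubling time and the 5-year horizon are assumptions.
        SEQ_COST_HALVING_MONTHS = 5
        COMPUTE_DOUBLING_MONTHS = 24

        def fold_change(months, period):
            """How many times a quantity doubles (or halves) over `months`."""
            return 2 ** (months / period)

        years = 5
        months = years * 12
        seq_cost_drop = fold_change(months, SEQ_COST_HALVING_MONTHS)
        compute_gain = fold_change(months, COMPUTE_DOUBLING_MONTHS)

        print(f"Over {years} years, sequencing cost falls ~{seq_cost_drop:,.0f}-fold,")
        print(f"while computing power grows only ~{compute_gain:.0f}-fold,")
        print(f"leaving a ~{seq_cost_drop / compute_gain:,.0f}-fold gap for data to outrun compute.")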

    Funding agencies have neglected bioinformatics needs, Stein and others argue. “Traditionally, the U.K. and the U.S. have not invested in analysis; instead, the focus has been investing in data generation,” says computational biologist Chris Ponting of the University of Oxford in the United Kingdom. “That's got to change.”

    This News Focus article and another News Focus on sharing genomic data with trial participants (p. 662) are part of a collection this month reflecting on the 10th anniversary of the publication of the human genome. All the stories, and other related material (see also Essays p. 689), will be gathered at http://scim.ag/genome10

    Within a few years, Ponting predicts, analysis, not sequencing, will be the main expense hurdle to many genome projects. And that's assuming there's someone who can do it; bioinformaticists are in short supply everywhere. “I worry there won't be enough people around to do the analysis,” says Ponting.

    Recent reviews, editorials, and scientists' blogs have echoed these concerns (see Perspective on p. 728). They stress the need for new software and infrastructures to deal with computational and storage issues.

    In the meantime, bioinformaticists are trying new approaches to handle the data onslaught. Some are heading for the clouds—cloud computing, that is, a pay-as-you-go service, accessible from one's own desktop, that provides rented time on a large cluster of machines that work together in parallel as fast as, or faster than, a single powerful computer. “Surviving the data deluge means computing in parallel,” says Michael Schatz, a bioinformaticist at Cold Spring Harbor Laboratory (CSHL) in New York.

    Dizzy with data

    The balance between sequence generation and the ability to handle the data began to shift after 2005. Until then, and even today, most DNA sequencing occurred in large centers, well equipped with the computer personnel and infrastructure to support the analysis of a genome's data. DNA sequences churned out by these centers were deposited and stored in centralized public databases, such as those run by the European Bioinformatics Institute (EBI) in Hinxton, U.K., and the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland. Researchers elsewhere could then download the data for study. By 2007, NCBI had 150 billion bases of genetic information stored in its GenBank database.

    Then several companies in quick succession introduced “next-generation” machines, faster sequencers that spit out data more cheaply. But the technologies behind these machines generate such short stretches of sequence—typically just 50 to 120 bases—that far more sequencing is required to assemble those fragments into a cohesive genome, which in turn greatly ups the computer memory and processing required. It once was enough to sequence a genome 10 times over to assemble it accurately; now it takes 40 or more passes. In addition, the next-generation machines produce their sequence data at incredible rates that devour computer memory and storage. “We all had a moment of panic when we saw the projections for next-generation sequencing,” recalls Schatz.
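
    To make the arithmetic behind that shift concrete, here is a minimal sketch; the 10x-versus-40x coverage figures and the roughly 100-base short reads come from the text, while the 3-gigabase genome size and the 700-base long-read length are assumptions added for illustration.

        # Illustrative coverage arithmetic; genome size and long-read length are
        # assumptions, coverage levels and short-read length are from the text.
        GENOME_SIZE = 3.0e9  # bases, roughly a human genome

        def reads_and_bases(coverage, read_length):
            """Reads and total raw bases needed for a given average coverage."""
            total_bases = coverage * GENOME_SIZE
            return total_bases / read_length, total_bases

        old_reads, old_bases = reads_and_bases(coverage=10, read_length=700)   # long-read era
        new_reads, new_bases = reads_and_bases(coverage=40, read_length=100)   # short-read era

        print(f"~10x, ~700-base reads: {old_reads:.1e} reads, {old_bases/1e9:.0f} Gb raw")
        print(f"~40x, ~100-base reads: {new_reads:.1e} reads, {new_bases/1e9:.0f} Gb raw")
        print(f"That is ~{new_reads/old_reads:.0f} times as many reads to store, align, and assemble.")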

    Those projections are already being realized. A massive study of genetic variation, the 1000 Genomes Project, generated more DNA sequence data in its first 6 months than GenBank had accumulated in its entire 21-year existence. And ambitious projects like ENCODE, which aims to characterize every DNA sequence in the human genome that has a function, offer jaw-dropping data challenges. Among other efforts, the project has investigated dozens of cell lines to identify every DNA sequence to which 40 transcription factors bind, yielding a complex matrix of data that needs to be not only stored but also represented in a way that makes sense to researchers. “We're moving very rapidly from not having enough data to going, ‘Oh, where do we start?’” says EBI bioinformaticist Ewan Birney.

    Moreover, as so-called third generation machines—which promise even cheaper, faster production of DNA sequences (Science, 5 March 2010, p. 1190)—become available, more, and smaller, labs will start genome projects of their own. As a result, the amount and kinds of DNA-related data available will grow even faster, and the sheer volume could overwhelm some databases and software programs, says Katherine Pollard, a biostatistician at the Gladstone Institutes of the University of California (UC), San Francisco. Take Genome Browser, a popular UC Santa Cruz Web site. The site's programs can compare 50 vertebrate genomes by aligning their sequences and looking for conserved or nonconserved regions, which reveal clues about the evolutionary history of the human genome. But the software, like most available genome analyzers, “won't scale to thousands of genomes,” says Pollard.

    The spread of sequencing technology to smaller labs could also increase the disconnect between data generation and analysis. “The new technology is thought of [as] being democratizing, but the analytical capacity is still focused in the hands of a few,” warns Ponting. Although large centers may be stretching their computing, and their laborpower, to new limits, they basically still have the means to interpret what they find. But small labs, many of which underestimate computational needs when budgeting time and resources for a sequencing project, could be in over their heads, he warns.

    Clouds on the horizon

    James Taylor, a bioinformaticist at Emory University in Atlanta, saw some of the demands for data analysis coming. In 2005, he and Anton Nekrutenko of Pennsylvania State University (Penn State), University Park, pulled together various computer genomics tools and databases under one easy-to-use framework. The goal was “to make collaborations between experimental and computational researchers easier and more efficient,” Taylor explains. They created Galaxy, a software package that can be downloaded to a personal computer or accessed on Penn State's computers via any Internet-connected machine. Galaxy allows any investigator to do basic genome analyses without in-house computer clusters or bioinformaticists. The public portal for Galaxy works well, but, as a shared resource, it can get bogged down, says Taylor. So last year, he and his colleagues tried a cloud-computing approach to Galaxy.

    Cloud computing can mean various things, including simply renting off-site computing memory to store data, running one's own software on another facility's computers, or exploiting software programs developed and hosted by others. Amazon Web Services and Microsoft are among the heavyweights running cloud-computing facilities, and there are not-for-profit ones as well, such as the Open Cloud Consortium.

    CREDIT: ALVARO ARTEAGA/ALVAREJO.COM

    For Taylor's team, entering the cloud meant developing a version of Galaxy that would tap into rented off-site computing power. They set up a “virtual computer” that could run the Galaxy software on remote hardware using data uploaded temporarily into the cloud's off-site computers. To test their strategy, they worked with Penn State colleague Kateryna Makova, who wanted to look at how the genomes of mitochondria vary from cell to cell in an individual. That involved sequencing the mitochondrial genomes from the blood and cheek swabs of three mother-child pairs, generating in one study some 1.8 gigabases of DNA sequence, about 1/10 of the amount of information generated for the first human genome.

    Analyzing these data on the Penn State computers would have been a long and costly process. But when they uploaded their data to the cloud system, the processing took just an hour and cost $20, Taylor reported in May 2010 at the Biology of Genomes meeting in Cold Spring Harbor, New York. “This is a particularly cost-effective solution when you need a lot of computing power on an occasional basis,” he says. With the help of the cloud, he has access to many computers but doesn't have the overhead costs of maintaining a powerful computer network in-house. “We're going to encourage more people to move to the cloud,” he adds.

    CSHL's Schatz and Ben Langmead, a computer scientist at the Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland, are already there and are helping to make that shift possible for others. In 2009, the pair published one of the first results from marrying cloud computing and genomics. They wanted to identify common sites of DNA variation known as single-nucleotide polymorphisms (SNPs), but to do so they needed to hunt through short sequences of human DNA totaling an amount equivalent to 38 copies of the human genome. With the help of a cloud-based cluster of 320 computers, they identified 3.7 million SNPs in less than 4 hours and for less than $100. “We estimate it would have taken a single computer several hundred hours for the analysis,” says Schatz.
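
    The arithmetic implied by those figures can be sketched as follows; treating “several hundred hours” as 300 hours is an assumption, and the per-node-hour price is derived from the quoted totals rather than from any published rate.

        # Rough speedup and cost arithmetic from the figures quoted in the text.
        single_machine_hours = 300      # "several hundred hours" (assumed 300)
        cloud_nodes = 320               # cluster size from the text
        wall_clock_hours = 4            # "less than 4 hours"
        total_cost_usd = 100            # "less than $100"

        speedup = single_machine_hours / wall_clock_hours
        node_hours = cloud_nodes * wall_clock_hours
        price_per_node_hour = total_cost_usd / node_hours

        print(f"Wall-clock speedup: ~{speedup:.0f}x")
        print(f"Node-hours consumed: {node_hours}")
        print(f"Implied price: ~${price_per_node_hour:.2f} per node-hour")
        # Parallel efficiency is well under 100% (300 CPU-hours of work spread
        # over 1280 node-hours), but the job finishes overnight instead of in weeks.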

    At the Biology of Genomes meeting, Langmead and Schatz unveiled two new cloud-computing initiatives. Langmead described a computer program called Myrna that determines the differential expression of genes from RNA sequence data and is designed for the parallel processing performed by cloud-computing facilities. Schatz introduced another program, Contrail, that can assemble genomes from data that next-generation sequencing machines generate and deposit into a cloud.

    So much for so little.

    The decline in sequencing costs (red line) has led to a surge in stored DNA data.

    Low cost and speed aren't the only advantages of the cloud approach, says Langmead. “The cloud user never has to replace hard drives, renew service contracts, worry about electricity usage and cooling, deal with flooding or other natural disasters, et cetera,” he points out. For small labs that lack their own powerful computer clusters, “cloud computing may represent the democratization of computation,” says Schatz.

    But cloud computing is “not mature,” cautions Vivien Bonazzi, program director for computational biology and bioinformatics at NHGRI. Putting data into a cloud cluster by way of the Internet can take many hours, even days, so cloud providers and their customers often resort to the “sneaker net”: overnight shipment of data-laden hard drives. And with the exception of Galaxy, Myrna, and a few other computer tools, not much genomics software is configured for the massively parallel processing approach taken by cloud computers. “It is currently too difficult to develop cloud software that's truly easy to use,” says Langmead.

    Also, cloud computing works best if an analysis can be divided into many separate tasks handled by multiple processors. But the connections among the cloud's processors can be fairly slow, so computations requiring processors to talk to each other can get bogged down, says Langmead. Some researchers worry that the burgeoning cloud-computing industry won't agree on standards that will allow for connections between clouds, such that data stored on one cloud can be accessible to another. “Cloud computing is hot and sexy,” says Bonazzi. “But it's not the answer to everything.”

    Storage issues

    Cloud computing offers a possible solution to other problems facing the bioinformatics community: data storage and transfer. Because storage costs are dropping much more slowly than the costs of generating sequence data, “there will come a point when we will have to spend an exponential amount on data storage,” says Birney.

    That has created pressure to let go of the field's long-standing tendency to archive all raw sequence data. Because the raw material from next-generation machines is in the form of high-resolution images, it soaks up huge amounts of computer storage. So scientists are considering discarding the original image files once they produce the preliminarily processed sequence data, which is more easily kept. Eventually, it may be more economical to save no raw data and just resequence a DNA sample if necessary. But for now, as to what should be kept, “there's a lot of thrashing still to happen,” says Bonazzi.

    Putting the data in an off-site facility could relieve some of the pressure, says OICR's Stein. The economies of scale available to large cloud-providing companies can produce significant cost savings, meaning it might be cheaper to rent transient storage space from the cloud in some cases. Storage costs at Amazon Web Services top out at 14 cents a gigabyte per month, according to Amazon's Deepak Singh. “In comparison, it commonly costs 50 cents to $1 per gigabyte for high-end storage on a local system,” Schatz says. For NCBI, however, it's still more cost-effective to keep GenBank and its other databases in-house, says Don Preuss of NCBI.
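
    A simple break-even sketch shows why renting makes sense mainly for transient data; treating the local figure as a one-time hardware cost, and ignoring power, cooling, and staff, is a simplifying assumption.

        # Break-even sketch using the per-gigabyte figures quoted in the text.
        CLOUD_USD_PER_GB_MONTH = 0.14        # Amazon figure cited above
        LOCAL_USD_PER_GB = (0.50, 1.00)      # "50 cents to $1 per gigabyte" locally

        for local_cost in LOCAL_USD_PER_GB:
            breakeven_months = local_cost / CLOUD_USD_PER_GB_MONTH
            print(f"Local storage at ${local_cost:.2f}/GB: renting wins only if the data "
                  f"lives in the cloud less than ~{breakeven_months:.1f} months")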

    Putting data in a cloud may help in other ways as well. Right now, anyone wanting to analyze a genome has to download it from a public archive such as GenBank—and as these data sets get larger, such transfers become slower. Moreover, downloaded copies of these data sets, some now out of date, have proliferated around the world, each one taking up storage space that eats into bioinformatics budgets. In his vision, says Stein, “you have one copy of the data located in this common cloud that everyone uses” and it won't be necessary to download or upload the data between computers for processing.

    Encouraged by the genomics community, NCBI has put a copy of the data from the pilot project of the 1000 Genomes effort into off-site storage run by a cloud-computing provider. And U.S. East Coast users of Ensembl, the EBI sequence database, are automatically funneled into a cloud environment as part of a test of the strategy.

    One worry about this approach is the security of the data. Data involving the health of human subjects, which is being linked more and more to genome information, requires extra precautions that make some researchers hesitant about clouds. However, at least one cloud-computing company already has clients whose human data are covered by the strict health information protection laws of the United States, so there are indications that this concern can be allayed.

    All these issues came to the fore last year, when NHGRI hosted several meetings on cloud computing and on informatics and analysis, says Bonazzi. Also, at a retreat last summer, the case was made for more bioinformatics training and education. “One thing that is clear is that as computation becomes more and more necessary throughout biomedical research, the way these [infrastructure] resources are funded will have to change to be more efficient,” says Taylor. For now, NHGRI has no programs in place to address these needs. “But they are on our radar,” says Bonazzi.

    Like Stein, she worries about swamped storage systems and overwhelmed computer clusters. But Bonazzi remains sanguine. “Do I think these problems will be solved?” she says. “I'm optimistic.” And even Stein is trying to think positively. “I'm very good at predicting disasters that never happen,” he says. There's always sunlight above the clouds.

  6. Computer Models

    Coming Soon to a Lab Near You: Drag-and-Drop Virtual Worlds

    1. Robert F. Service

    Researchers at Microsoft hope to convince scientists that transparent, easy-to-tweak numerical simulations are as straightforward as clicking a mouse.

    Model builder.

    “I'm interested in tools that change the way science is done,” Stephen Emmott says.

    CREDIT: COURTESY OF MICROSOFT RESEARCH

    CAMBRIDGE, UNITED KINGDOM—Techies love to hate Microsoft. They curse the “blue screen of death” that appears when a computer running the company's flagship Windows operating system crashes. They deride what they say are Windows's bloated code and security flaws. And they complain that the software giant is perpetually behind the curve on new technologies such as smart phones and tablet computers. In short, techies—many scientists included—are a tough audience.

    So in 2003, Stephen Emmott could have been forgiven if he had walked the other way when Microsoft executives asked him to come aboard and help the company figure out what it should be doing in science. Emmott, then a neuroscientist at University College London who had worked previous stints at Bell Laboratories and NCR, accepted the challenge, provided he could build a cutting-edge computational sciences laboratory within Microsoft's research division to tackle knotty scientific challenges. If successful, the software the group created would help other scientists make broad impacts on their fields as well.

    It's too early to say whether this strategy will make money for Microsoft in the long run. Indeed, for now, Emmott says that he and his colleagues plan to share their wares freely with the academic scientific community. But Emmott's vision is now in full gear. He spent his first year selling his ideas within the company and began hiring staff members. Now Microsoft Research's computational science lab has 40 Ph.D.s and students and continues to grow.

    A couple of the researchers are software engineers—obviously Microsoft's stock in trade—but most come from disciplines as varied as ecology, neuroscience, mathematics, and developmental biology. Their hope, say Emmott and others, is to transform the way scientists study complex, ever-changing systems, such as the global carbon cycle and information processing inside cells. To do so, they're working to develop a suite of new software tools including novel programming languages that better represent biological systems and computer models that work across multiple scales, simulating carbon budgets at the levels of leaves, trees, and forests, for example. They're also striving to make those tools simple to use, thereby extending the types of studies that can be done by researchers who aren't full-time programmers. “I'm interested in tools that change the way science is done,” Emmott says.

    Prototype versions of several of these tools are now up and running and being put through their paces by researchers at Microsoft. One program, currently called Microsoft Computational Science Studio, contains components that are able to handle disparate types of data, quickly plug them into a model, and visualize the interactions. Other packages help biologists design and simulate DNA circuits for biological computers and manage wireless sensor networks for tracking animal behavior. Carol Barford, an ecologist at the University of Wisconsin, Madison, says she has used other software packages produced by academics to build and visualize complex models. She recently began working with Microsoft's software to investigate how future climate-change scenarios might affect agricultural production around the globe. “It's the slickest one I've ever seen,” she says.

    Capturing complexity

    So why is a computer software company known primarily for its operating systems and business software mucking around with modeling the global carbon cycle and working to understand the human immune system? Sitting in his ground-floor office across the road from the University of Cambridge's famed Cavendish Laboratory where J. J. Thomson discovered the electron and James D. Watson and Francis Crick deciphered the structure of DNA, the 50-year-old neuroscientist spells out his thinking. For starters, Emmott says, science is “set to be the driver of our times.” So progress on new computational tools and methods has the potential to make an impact on numerous fields. As well, he adds, scientific problems at the frontier of computing are perfect for honing talent and ideas that may lead to new or better Microsoft products.

    A good way to start that improvement is by making computer models simpler to navigate and understand. Computer models, of course, aim both to approximate the real world and to predict how it might change in the future. That's relatively straightforward when a model's key inputs, or parameters, are known; that is why engineers can land a rocket on the moon and construct bridges capable of withstanding gale-force winds. But there are a host of problems, called inverse problems, for which not all of the right parameters are known. For those, researchers must sift through vast numbers of observations to work out which parameters to plug into their models and what values to assign them. To make matters more challenging, researchers are often also unclear about how a complex system's key parameters interact. That makes accurate model building and prediction dicey at best.
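
    To make the notion of an inverse problem concrete, here is a minimal sketch in Python: a toy model with two unknown parameters is fitted to synthetic noisy observations. The model, parameter names, and data are illustrative and are not drawn from Microsoft's software.

```python
# Toy inverse problem: infer unknown model parameters from noisy observations.
# Illustrative only; not code from Microsoft's computational science tools.
import numpy as np
from scipy.optimize import curve_fit

def growth_model(t, rate, capacity):
    # Hypothetical logistic-style response governed by two unknown parameters.
    return capacity / (1.0 + np.exp(-rate * (t - 5.0)))

rng = np.random.default_rng(0)
t_obs = np.linspace(0.0, 10.0, 50)
# Synthetic "observations"; in a real inverse problem these come from the field.
y_obs = growth_model(t_obs, rate=0.9, capacity=3.2) + rng.normal(0.0, 0.1, t_obs.size)

# Ask which parameter values best explain the observations.
(rate_est, capacity_est), _ = curve_fit(growth_model, t_obs, y_obs, p0=[1.0, 1.0])
print(f"estimated rate = {rate_est:.2f}, estimated capacity = {capacity_est:.2f}")
```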

    Take a complex climate model, for example. Over decades, researchers at labs around the world—including the Met Office Hadley Centre for Climate Prediction and Research in Exeter, U.K., and the National Center for Atmospheric Research (NCAR) in Boulder, Colorado—have built enormously complex general circulation models (GCMs) that predict the future state of Earth's climate by tracking how parameters such as wind currents, sea surface temperatures, polar ice cover, clouds, and rising greenhouse gas levels interact. As new observations revise, for example, the amount of light Earth's changing ice cover reflects into space, researchers can tweak that parameter in their model and run a new simulation to gauge the likely impact. Typically, however, this process is very slow. “It can take months or years to iterate current models,” Emmott says.

    Quick turnaround.

    Computational Science Studio is one of several new software tools aimed at making complex models easier to build, test, and refine.

    CREDIT: COURTESY OF MICROSOFT RESEARCH

    Equally challenging is that such complex models are written in computer code that is impenetrable to most researchers outside the group responsible for updating it. That makes it difficult for researchers working in one particular area, such as tracking tree mortality rates in the Amazon, to get a sense of whether their data might influence the broader climate picture. That, in turn, slows the search for additional important parameters that might improve the models. “Now we have loads of data,” says Rosie Fisher, an ecophysiologist at NCAR. “What we need are ways to quickly find the patterns that emerge from that data. It is a huge software problem. So it is very exciting that the Microsoft people are willing to look at this.”

    From office to lab

    Why haven't others set their sights on such a goal before? Academic groups, Emmott explains, are adept at creating software tools and models to meet their own needs. But they typically don't have the time, money, or inclination to make them broadly useful to other researchers. “It was never really someone's job to do it,” Emmott says. “The need was always to get your own research done rather than providing a service to the community.” In the modeling arena, he adds, he hopes to streamline modeling software to make complex models far more accessible. “In essence, we want to do for modeling software what Microsoft programs such as Word and Excel did for business software,” Emmott says.

    That's where Computational Science Studio comes in. At the heart of the program—and others the lab is developing—is a software module code named Scientific Data Set (SDS), a sort of universal translator capable of recognizing and interpreting a wide variety of common data types, such as time series, satellite and medical images, and multidimensional numerical arrays. Users can also add new types of their own. The SDS allows “complete promiscuity” in working with virtually any type of data, Emmott says. With this ability, programmers then created software to allow virtually anyone to plug in different data sets and at the click of a button set up a model of how they interact. A visualization component renders the relationships, such as mapping out how different levels of deforestation in the Amazon rainforest would impact surface temperatures in Africa.
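
    The article leaves the SDS interface itself unspecified, but the “universal translator” idea can be sketched as a thin wrapper that puts time series and gridded arrays into one common form a model can consume. Every class, method, and variable name below is hypothetical.

```python
# Hypothetical sketch of a "universal translator" data layer.
# Names and structure are illustrative, not the actual Scientific Data Set API.
import numpy as np

class DataSet:
    """Common container: an array of values plus the metadata a model needs."""
    def __init__(self, values, dims, units=""):
        self.values = np.asarray(values, dtype=float)
        self.dims = tuple(dims)   # e.g., ("time",) or ("lat", "lon")
        self.units = units

    @classmethod
    def from_time_series(cls, values, units=""):
        return cls(values, dims=("time",), units=units)

    @classmethod
    def from_grid(cls, grid, units=""):
        return cls(grid, dims=("lat", "lon"), units=units)

def couple(a, b, interaction):
    """Plug two data sets into a 'model' given by an interaction function."""
    return interaction(a.values, b.values)

co2 = DataSet.from_time_series(np.linspace(280, 400, 12), units="ppm")
warming = DataSet.from_time_series(np.linspace(0.0, 1.2, 12), units="degC")
# A deliberately trivial interaction: the correlation between the two series.
print(couple(co2, warming, lambda x, y: np.corrcoef(x, y)[0, 1]))
```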

    Last summer, Drew Purves, who heads Microsoft's Computational Ecology and Environmental Sciences (CEES) group, demonstrated the modeling package to Simon Lewis, an ecologist at the University of Leeds in the United Kingdom, and some of his colleagues. The demo showed how the drag-and-drop modeling software could plug in a wide range of data on biological processes—such as rates of photosynthesis and soil nitrogen fixation—and integrate them with changing CO2 and temperature levels to show how a change in climate might affect the amount of carbon stored in forests. Among other things, the demo revealed how different deforestation rates could shift the timing of projected temperature increases by 8 years by 2050. The model ran on a desktop computer in just a few minutes. “I was pretty impressed that they could be so computationally efficient,” Lewis says.

    That efficiency could be vital to improving how current climate models handle the effect of biological feedbacks on future climate, currently one of the biggest uncertainties those models grapple with. For example, forests currently store as much carbon as is present in the atmosphere. Most climate modelers expect average temperatures to warm by between 1.6°C and 4.3°C by 2100 given midrange carbon-emissions scenarios. Less clear is how vegetation, from forests and grasslands to savannas and croplands, will respond. If the extra CO2 in the atmosphere makes most plants grow faster, this could ameliorate some of the warming. Yet if the higher temperatures increase plant mortality, this could cause gigatons of carbon now stored in tree trunks and roots to wind up in the atmosphere and accelerate warming. Current models, known as dynamic global vegetation models, project widely different outcomes. By the year 2100, vegetation might be a carbon sink for 11 gigatons of carbon a year, or it might release an additional 6 gigatons of carbon every year beyond humanity's contribution.

    Today's GCMs simulate detailed physical processes, such as ocean and atmospheric circulation. But they've had a harder time incorporating the myriad biological processes. Part of the challenge, Emmott explains, is that biological processes vary widely at different scales. At the level of an individual leaf and tree, rates of photosynthesis and nutrient uptake are key to understanding a tree's health and viability, whereas the competition of trees for light and how trees disperse their seeds become important at the level of a stand of forest.
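
    The multi-scale point can be made with a deliberately simple sketch in which leaf-level uptake is aggregated to trees and then to a forest stand. The numbers and formulas below are placeholders, not values from the Microsoft models.

```python
# Toy illustration of modeling across scales: leaf-level rates aggregate up to
# trees and then to a forest stand. All numbers and formulas are placeholders.
LEAF_UPTAKE_KG_C_PER_YEAR = 0.05  # assumed annual carbon uptake of one leaf

def tree_uptake(n_leaves, shading=1.0):
    # At tree scale, competition for light (a crude "shading" factor)
    # rescales the leaf-level rate.
    return n_leaves * LEAF_UPTAKE_KG_C_PER_YEAR * shading

def stand_uptake(trees):
    # At stand scale, sum over trees; real models would also include seed
    # dispersal, mortality, and competition between neighbors.
    return sum(tree_uptake(n_leaves, shading) for n_leaves, shading in trees)

# A small stand described as (number of leaves, shading factor) per tree.
stand = [(20000, 1.0), (15000, 0.7), (30000, 0.9)]
print(f"stand carbon uptake: {stand_uptake(stand):.0f} kg C per year")
```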

    Conventional models have struggled to incorporate such complexity across multiple scales. But the plug-and-play environment of the new software makes it a more manageable task. In fact, since their early demos to Lewis's group, among others, Purves and colleagues have constructed a more detailed carbon cycle model incorporating many biological processes. The unpublished preliminary models predict that the carbon stored in vegetation by 2100 will fall within the range forecast by previous models.

    Programmer for life.

    Andrew Phillips (right) helped design software that shows how to engineer bacteria to grow in a “Turing pattern” (above).

    CREDITS (TOP TO BOTTOM): COURTESY OF MICROSOFT RESEARCH; COURTESY OF MICROSOFT RESEARCH AND JIM HASELOFF, UNIVERSITY OF CAMBRIDGE

    The new carbon cycle model is far from the last word on the matter. Rather, the hope, Purves says, is that this ability to quickly build and test models will allow researchers, and entire research communities, to speed the cycle of improving their models. “One of the things we lack is the ability to explore a large number of scenarios,” Purves says. Computational Science Studio and the lab's other new tools can help remedy that, says Matthew Smith, an ecologist in the CEES group. “The idea here is, you plug it in and ask if it is important,” he says. “You can form your tests so much more quickly, and this allows you to cycle through them much faster.” Equally important, the software should make it easier for researchers to test their ideas without becoming experts in writing code.

    Another advantage of Microsoft's drag-and-drop modeling software is that it makes it easy to see what assumptions are built into the model, and it can even specify the degree of uncertainty in different components. Ultimately, Smith and Purves say, this kind of more generic and transparent modeling platform could help climate scientists and other groups compare their wares. “GCMs all have different data fed into them,” Purves says. “We should take several different models and train them with the same data” and compare their outcomes, he adds. Eventually, that should reduce the models' collective uncertainties and improve their predictions.
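
    Purves's suggestion can be illustrated in miniature: train two different model forms on exactly the same data, then score both on the same held-out observations. The polynomial models and error metric below are placeholders standing in for the far more complex GCMs.

```python
# Toy version of "train several models with the same data and compare them."
# Placeholder polynomial models stand in for general circulation models.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)
y = 2.0 * x + 0.5 * x**2 + rng.normal(0.0, 0.05, x.size)

train, test = slice(0, 150), slice(150, 200)  # identical split for every model

candidates = {
    "linear": np.polyfit(x[train], y[train], deg=1),
    "quadratic": np.polyfit(x[train], y[train], deg=2),
}

for name, coeffs in candidates.items():
    pred = np.polyval(coeffs, x[test])
    rmse = float(np.sqrt(np.mean((pred - y[test]) ** 2)))
    print(f"{name}: held-out RMSE = {rmse:.3f}")
```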

    Beyond climate

    Emmott and his colleagues have set their sights on modeling far more than climate. They've also recently developed new programming languages and other tools for modeling complex biology. In one example, they've modeled a set of immune molecules known as the major histocompatibility complex class I. MHC-I molecules grab small protein fragments known as peptides and present them on the outer surface of cells. Immune sentries called T cells then inspect those peptides for foreign signatures common to viruses and other invaders and kill cells that might spread infection. Much is known about many of the key molecular MHC-I players, but the complexity of their interactions has prevented biologists from constructing a good model of how they behave in cells.

    So Emmott and his colleagues used Computational Science Studio to plug in the key molecular players. The model enabled them to compare different theories of how the MHC-I system works. The prevailing view, Emmott explains, has been that a process known as peptide editing governs which peptides are presented to T cells and thus are most likely to generate an immune response. The team's latest model suggests that peptide editing indeed “accounts for a lot of the data,” Emmott says. But the model gave an even better fit when the team added a secondary step, known as peptide filtering, in which a protein called tapasin recognizes foreign proteins and prioritizes which ones are displayed. This preliminary work also needs to be fleshed out before being published, Emmott says, but it underscores that plug-and-play models can test new ideas very quickly.

    Not everyone at the lab is trying to simulate natural processes. Andrew Phillips, a computer scientist turned biologist, is leading a group developing computing languages and models for programming biological systems, from DNA strands to cells. In one project, Phillips and several colleagues created a new programming language for designing circuits in which tailored DNA strands interact to carry out a computation through a process called strand displacement. On 17 June 2009 in the Journal of the Royal Society Interface, Phillips and Microsoft colleague Luca Cardelli reported that they could use their setup to design simple logic gates and catalytic circuits, among other functions. They are testing the results with real DNA in test tubes.

    In a second project, Phillips and colleagues created a programming language and models for designing genetic circuits that function inside cells. The team simply writes a program for a desired function, and the software will design the DNA strands needed for cells to pull it off. In one example, Phillips starts with an input that allows cells to express green fluorescent protein and writes a program to make a colony of cells in a petri dish express a pattern of colored regions known as a Turing pattern. The software then automatically generates the set of DNA sequences needed to produce the pattern. At this stage, the result is still an onscreen simulation, but Phillips and his colleagues are partnering with others to try to replicate it in real cell colonies.

    As in other areas, Microsoft's computational scientists aren't alone in their efforts to push the envelope on synthetic biology. But the Cambridge team's new software languages and models could bring such work—which now requires heavy lifting by highly specialized labs—within reach of a far broader audience. If so, their stock among scientists could be on the rise.

  7. News

    Rescue of Old Data Offers Lesson for Particle Physicists

    1. Andrew Curry*

    Old data tends to get forgotten as physicists move on to new and better machines. The tale of the JADE experiment suggests that they should be more careful.

    Back on track.

    Particle-track data from the 1980s-era JADE experiment after restoration by Siegfried Bethke's team in 1999.

    CREDIT: COURTESY OF SIEGFRIED BETHKE

    In the mid-1990s, Siegfried Bethke decided to take another look at an experiment he had participated in around 2 decades earlier as a young particle physicist at DESY, Germany's high-energy physics lab near Hamburg. Called JADE, it was one of five experiments at DESY's PETRA collider, which smashed positrons and electrons into each other. JADE probed the strength of the force that binds quarks and gluons into protons and neutrons; it finished in 1986, when DESY closed down PETRA to build a more powerful collider. In the decades since, new theoretical insights had come along, and Bethke hoped the old data from JADE—taken at lower collision energies—would yield fresh information.

    What the physicist found was a disaster. Since JADE shut down and the experiment's funding ended, the data had been scattered across the globe, stored haphazardly on old tapes, or lost entirely. The fate of the JADE data is, however, typical for the field: Accustomed to working in large collaborations and moving swiftly on to bigger, better machines, particle physicists have no standard format for sharing or storing information. “There's funding to build, collect, analyze, and publish data, but not to preserve data,” says Salvatore Mele, a physicist and data preservation expert at the CERN particle physics lab near Geneva, Switzerland.

    This tendency has prompted some in the field to call for better care to be taken of data after an experiment has finished. For a very small fraction of the experiment's budget, they argue, data could be preserved in a form usable by later generations of physicists. To promote this strategy, researchers from a half-dozen major labs around the world, including CERN, formed a working group in 2009 called Data Preservation in High Energy Physics (DPHEP). One of the group's aims is to create the new post of “data archivist,” someone within each experimental team who will ensure that information is properly managed.

    Physics archaeology

    For the founders of DPHEP, Bethke's struggles with the JADE data are both an inspiration and a cautionary tale. It took Bethke, now the head of the Max Planck Institute for Physics in Munich, Germany, nearly 2 years—and a lot of luck—to reconstruct the data. Originally stored on magnetic tapes and cartridges from old-style mainframes, most of it had been saved by a sentimental colleague who copied the few gigabytes of data to new storage media every few years. Other data files turned up at the University of Tokyo. A stack of 9-track magnetic storage tapes was stashed in a Heidelberg physics lab. One critical set of calibration numbers survived only as ASCII text printed on reams of green printer paper found when a DESY building was being cleaned out. Bethke's secretary spent 4 weeks reentering the numbers by hand.

    Even then, much of the data couldn't be read. Software routines written in arcane FORTRAN dialects such as SHELTRAN and MORTRAN, tweaked for 1970s-era computers on which memory was at a premium, and stored on long-deactivated personal accounts, were lost forever. A graduate student spent a year recreating the code used to run the numbers.

    The recovery work was motivated by more than Bethke's nostalgia. In the years since JADE ended, new theories about what physicists call the strong coupling strength had emerged. These predict phenomena that can best be seen at lower energies than today's colliders are able to replicate. Bethke's team ultimately squeezed more than a dozen high-impact scientific publications out of the resurrected JADE data. Some of the data helped confirm quantum chromodynamics, the theory governing the interior of atomic nuclei, and was cited by the committee that awarded the 2004 Nobel Prize in physics to David Gross, David Politzer, and Frank Wilczek. “It was like physics archaeology,” Bethke says today. “It took a lot of work. It shouldn't be like that. If this was properly planned before the end of the experiment, it could have all been saved.”

    The usefulness of JADE's old data may not be an isolated occurrence. “Big installations are more high-energy, but they don't replace data taken at lower energy levels,” says Cristinel Diaconu, a particle physicist at DESY. “The reality is a lot of experiments done in the past are unique; they're not going to be repeated at that energy.”

    If anything, the need to better preserve particle physics data has grown more urgent in the past few years as CERN's Large Hadron Collider (LHC) captured the world's attention and a handful of other high-profile projects—BaBar at the SLAC National Accelerator Laboratory, Japan's KEK collider, and the latest DESY experiments—wrap up work and prepare to disband. “In the past, experiments were smaller and more frequent. Now we build very big devices that cost a lot of money and person power over a number of years,” says Diaconu. “Each experiment is one application, built specifically for the task.” The LHC alone represents nearly a half-century of work, with 20 years invested in design and construction and 20 years of scheduled operation. There will never be another experiment like it.

    Down but not out.

    Siegfried Bethke works on the JADE detector in 1984 (above). A display screen (top) announces the end of the experiment.

    CREDITS: SIEGFRIED BETHKE

    The issue, experts say, isn't data degradation. “The problem starts when the experiment is over, and the data used by one group of people is only understood by those people,” Diaconu says. “When they go off and do other things, the data is orphaned; it has no parents anymore.” The orphan metaphor only goes so far: After a certain point, orphaned data can't be adopted by later researchers who weren't part of the original team. Even given the raw data, only someone intimately involved in the original experiment can make sense of it. “The analysis is so complex that to understand the data you have to be there with it, working on the experiment,” says SLAC database manager Travis Brooks. “There's a whole spectrum of things you need to keep around if you want petabytes [1015 bytes] of data to be useful.”

    That spectrum includes everything from internal notes that explain the ins and outs of analyses, to subprograms designed to massage numbers for specific experiments. And then there's the fuzzy-sounding “metainfo,” the hacks and undocumented software tweaks made by a team in the midst of a project and then quickly forgotten.

    Making it worse, particle physicists don't usually share their data outside their collaborations the way researchers in most other fields do. “We don't publish the data, because it's something like a petabyte—you can't just attach the raw data in a ZIP file,” Brooks says. As a result, there's been no incentive to find a standard format for the raw information that would be readable to outsiders.

    A data librarian

    To give shuttered experiments a future, the DPHEP working group is looking for ways to keep data in working order long after the original collaboration has disbanded. Typically, software that can make sense of the data is custom-made to run on servers that are optimized for the experiment and shut down when funding runs out. And the constant churn of technology can make software and storage media obsolete within a matter of years. “The data can't be read if the software can't be run,” Brooks says.

    One option is to “virtualize” the software, creating a digital layer that simulates the computers the experiment was originally run on. With regular updates and maintenance, software designed to run on the UNIX machines of today could be rerun on the computers of the future, much as people nostalgically play old Atari games on new PCs.

    To capture and preserve the less tangible aspects of a particle physics experiment, the working group has suggested the job of data archivist. The archivist would be in charge of baby-sitting the data and standardizing the software used to read it, helping to justify huge investments in the big machines of physics by making data usable by future researchers or useful as a teaching tool. The idea has been endorsed by the International Committee for Future Accelerators, an advisory group that helps coordinate international physics experiments. DPHEP is also pushing data preservation among funding agencies, arguing that the physics experiments of the future should be designed with a data-preservation component to help justify their cost.

    Diaconu admits that the idea has a way to go before it captures the minds of young physicists focused on publishing new data. “Some people say, ‘Can you imagine how boring, to sit and look at old data for 20 years?’” he says. “But look at a librarian. Part of their job is taking care of books and making sure you can access them.” A data archivist would be a mix of librarian, IT expert, and physicist, with the computing skills to keep porting data to new formats but savvy enough about the physics to be able to crosscheck old results on new computer systems.

    The DPHEP group estimates that archivists—and the computing and storage resources they'd need to keep data current long after an experiment ended—would cost 1% of a collider's total budget. That can be a hefty financial commitment: It would amount to $90 million for CERN. But keeping data in a usable form would provide a return on the investment in the form of later analyses, the group argues. Says Diaconu: “Data collection may stop, but it's not true that's the end of the experiment.”

    • * Andrew Curry is a freelance writer based in Berlin.

  8. News

    Is There an Astronomer in the House?

    1. Sarah Reed*

    With biomedical researchers analyzing stars and astronomers tackling cancer, two unlikely collaborations creatively solve data problems.

    Surprisingly similar.

    A picture of the center of our galaxy and a slide of stained cancerous tissue show a common need to pick out indistinct objects in both types of images.

    CREDITS (LEFT TO RIGHT): EUROPEAN SOUTHERN OBSERVATORY/VISTA AND CASU, UNIVERSITY OF CAMBRIDGE; WALTON & IRWIN, UNIVERSITY OF CAMBRIDGE

    In 2004, Alyssa Goodman had a problem. An astronomer at Harvard University, she and her colleagues had just wrapped up a project called COMPLETE, a survey of star-forming regions; now they had to analyze massive amounts of data that were tricky to visualize in only two dimensions. Goodman wanted a three-dimensional view of the regions, but the tools available to astronomers weren't up to the task. So she went in search of the answer elsewhere.

    Goodman presented her problem at a workshop called Visualization Research Challenges, held at the headquarters of the U.S. National Institutes of Health (NIH) in Bethesda, Maryland. In the audience was Michael Halle, a radiologist at Brigham and Women's Hospital in Boston, who recognized that the technology Goodman needed already existed in medicine. His department had previously developed a piece of visualization software, called 3D Slicer, for use with medical scans such as MRIs. Halle thought it could handle Goodman's astronomical data set as well.

    His hunch was right. And the unusual collaboration that formed between his team and Goodman's still exists today at Harvard in a data-analysis project called Astronomical Medicine.

    It's not the only odd pairing of astronomers and biomedical researchers motivated by the need to deal with data. At the University of Cambridge's Institute of Astronomy (IoA) in the United Kingdom, Nicholas Walton uses sophisticated computer algorithms to analyze large batches of images, picking out faint, fuzzy objects. When he isn't looking for distant galaxies, nebulae, or star clusters, the astronomer lends his data-handling skills to the hunt for cancer.

    From stars to biomarkers

    Walton and his colleagues work on a project called PathGrid in which image-analysis software developed for astronomy is being used to automate the study of pathology slides. Pathologists stain tissue samples to identify various biomarkers that indicate a cancer's aggressiveness. Currently, they must inspect each slide personally with a microscope, but PathGrid aims to improve on this time-consuming and subjective endeavor.

    The key behind the project is the surprising similarity between images of tissue samples and the cosmos: Spotting a cancerous cell buried in normal tissue is like finding a single star in a crowded stellar field. “There's a natural overlap in astronomy and medicine for needing to identify and quantify indistinct objects in large data sets,” says oncologist James Brenton of the Cancer Research UK Cambridge Research Institute, who, with Walton, leads the PathGrid project.

    When deciding on the best course of treatment for a person with breast cancer, pathologists look for different biomarkers—specific proteins—in the patient's cancerous tissue. For example, an overexpression of the biomarker human epidermal growth factor receptor 2 (HER2) indicates a more aggressive form of breast cancer with a poorer prognosis. To spot such biomarkers, pathologists use a technique called immunohistochemical (IHC) screening. First they treat the tissue with antibodies that bind to targeted proteins, such as HER2; then a secondary antibody highlights the binding by undergoing a chemical reaction that produces a colored stain.

    At present, however, there are only a handful of well-validated biomarkers for cancer, and even fewer that reveal how a patient is likely to respond to a specific treatment, says Brenton. “That's because there's a bottleneck between new biomarker discoveries and being able to put them into clinical practice,” he says. “Discoveries are made with relatively small groups of tens to a few hundred patients, but their usefulness needs to be validated on sample sizes of hundreds or several thousands of patients.”

    Initial discovery studies are small-scale because pathologists must manually assess images in the IHC screening process, qualitatively scoring them for the abundance of a particular biomarker as well as the intensity of the staining. What was needed, Brenton thought, was a way of automating this time-consuming task: a computer algorithm that could accurately pick out stained tissue of varying shapes and sizes in cluttered images.

    That's when Walton came on the scene. “We've been developing algorithms to extract information from large telescope surveys at the IoA for years. The algorithms are robust to various backgrounds, such as stars, galaxies, and gas,” says Walton.

    He and Brenton met at the first Cambridge eScience Centre Scientific Forum held in 2002. The scientists, along with colleagues from their respective fields, talked at length about the possibility of using astronomy algorithms in cancer-screening image analysis. Walton and his IoA colleague Mike Irwin found that transferring those algorithms to medical use was painless. “In a pilot scheme to investigate the feasibility of the project, we found that we had to make virtually no changes at all, just tweaking the odd parameters here and there,” says Walton.
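
    In highly simplified form, the object extraction being transferred here works by smoothing an image, subtracting a coarse background estimate, thresholding, and labeling whatever connected blobs remain, whether those blobs are stars or stained cells. The generic sketch below uses standard NumPy and SciPy routines and is not the IoA or PathGrid pipeline.

```python
# Generic object-extraction sketch: find bright blobs against a cluttered
# background, be they stars or stained cells. Illustrative only; not the
# actual IoA/PathGrid code.
import numpy as np
from scipy import ndimage

def extract_objects(image, sigma=2.0, nsigma=5.0):
    smoothed = ndimage.gaussian_filter(image.astype(float), sigma)
    background = ndimage.median_filter(smoothed, size=25)  # coarse background
    residual = smoothed - background
    mask = residual > nsigma * residual.std()              # n-sigma detection
    labels, n_objects = ndimage.label(mask)
    # Brightness-weighted centroid of each detected object.
    return ndimage.center_of_mass(residual, labels, list(range(1, n_objects + 1)))

# Synthetic test image: noisy background plus three small bright blobs.
rng = np.random.default_rng(2)
img = rng.normal(10.0, 1.0, (128, 128))
for y, x in [(30, 40), (80, 90), (100, 20)]:
    img[y - 1:y + 2, x - 1:x + 2] += 40.0
print(extract_objects(img))
```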

    PathGrid has performed well in tests, Walton and Brenton say. In a study that checked 270 breast cancer images for a biomarker called estrogen receptor (ER), PathGrid agreed with pathologists' scorings for 88% of the positive slides and 93% of the negative ones. (An ER-positive tumor has a better prognosis than an ER-negative one and is treated by suppressing the production of the hormone estrogen.)

    A larger test, which looked for the biomarker HER2 in more than 2000 images, yielded even more impressive success rates of 96% and 98%, respectively. Walton says he thinks the results would have been even higher if not for “the subjective manner in which pathologists rate images.” PathGrid is consistent yet speedy, says Walton. To analyze a batch of a few hundred images for one specific biomarker would take PathGrid only a few minutes, he says, compared with about 3 hours for a pathologist.

    PathGrid is just one of several “virtual pathology” projects now under way, notes Laoighse Mulrane of the University College Dublin School of Biomolecular and Biomedical Science, a cancer researcher who recently reviewed this area. Yet its “novel” use of astronomy-based techniques makes it stand out from the pack, he says: “The collaborative efforts of the groups involved should be applauded.”

    PathGrid is now ready to undergo trials within a hospital environment in the United Kingdom. “Hopefully, if everything goes well, it could be used as routine, automated screening in hospitals within 3 years,” says Walton.

    Stellar views.

    3D imagery can give a clearer picture of the inner workings of the human body (top), and astronomers are using related visualization software to study distant star-forming regions (bottom).

    CREDITS (TOP AND BOTTOM): ANDRÁS JAKAB/UNIVERSITY OF DEBRECEN, HUNGARY

    Adding another dimension

    The partnership between astronomy and medicine works both ways, as Goodman and her colleagues have shown with the Astronomical Medicine project. For the COMPLETE survey, Goodman already had algorithms capable of handling enormous data sets. But they were geared toward dealing with 2D images, whereas she also wanted to see how the velocity of gas in star-forming regions changed along the line of sight—essentially treating velocity like a third dimension. “COMPLETE contained the largest ‘position-position-velocity’ maps of star-forming regions that had been made to date, and we wanted to see and understand this data all at once,” she says.

    Analyzing 3D images has been an important part of diagnostic medicine for many years, but until Halle heard Goodman's plea for help, nobody had considered adapting the technique to astronomy. After the NIH workshop, the pair quickly began working together and soon brought Michelle Borkin, a Harvard Ph.D. student in applied physics, on board the project.

    Borkin was instantly hooked. “The very first time that I saw our astronomical data come to life in 3D Slicer was amazing,” says Borkin. “Viewing the data in 3D is far more intuitive to understand than looking at it in 2D. I was instantly able to start making new discoveries that are incredibly difficult to do otherwise, such as spotting elusive jets of gas ejected from newborn stars.” While continuing to research star-forming regions using 3D Slicer, the Harvard team is currently working on projects that will give back to the medical world, developing tools based on algorithms used in astronomy to visualize, for example, coronary arteries.

    The ways in which these two interdisciplinary projects have been able to share tools are special cases, cautions Stephen Wong, a biomedical informatics scientist at the Methodist Hospital Research Institute in Houston, Texas. In general, Wong says, using secondhand algorithms is not a good idea: “To be effective, image-processing algorithms and analysis tools have to be customized and specific to the particular problems under investigation.”

    Interdisciplinary research also places demands on scientists' already hectic schedules and has to be fitted around their traditional career duties. “I spend about 10% of my time working on PathGrid and the rest on my day job as part of the European Space Agency's Gaia spacecraft science team,” says Walton.

    Yet for the scientists involved in both projects, taking on this supplementary work is a labor of love. “Usually, you become an expert in just one field,” says Goodman, “but I've had the opportunity to learn something completely new in my 40s. I think people should go into interdisciplinary research, not just because the world might learn something, but because you will too.”

    As for the astronomers on the PathGrid team, they're able to make a boast that few of their stargazing colleagues can match. “It's great to think that something I'm doing is going to have an impact on how a cancer patient is treated and help to improve their chances of survival,” says Walton.

    • * Sarah Reed is a freelance writer and former Science intern.

  9. News

    May the Best Analyst Win

    1. Jennifer Carpenter

    Exploiting crowdsourcing, a company called Kaggle runs public competitions to analyze the data of scientists, companies, and organizations.

    Global contest.

    Kaggle's competitions draw entries from many countries (arrow thickness reflects number of competitors from a country).

    CREDIT: ADAPTED FROM KAGGLE

    Last May, Jure Žbontar, a 25-year-old computer scientist at the University of Ljubljana in Slovenia, was among the 125 million people around the world paying close attention to the televised finale of the annual Eurovision Song Contest. Started in 1956 as a modest battle between bands or singers representing European nations, the contest has become an often-bizarre affair in which some acts seem deliberately bad—France's 2008 entry involved a chorus of women wearing fake beards and a lead singer altering his vocals by sucking helium—and the outcome, determined by a tally of points awarded by each country following telephone voting, has become increasingly politicized.

    Žbontar and his friends gather annually and bet on which of the acts will win. But that year he had an edge because he had spent hours analyzing the competition's past voting patterns: he was among the 22 entrants in, and the eventual winner of, an online competition to predict the song contest's results.

    The competition was run by Kaggle, a small Australian start-up company that seeks to exploit the concept of “crowdsourcing” in a novel way. Kaggle's core idea is to facilitate the analysis of data, whether it belongs to a scientist, a company, or an organization, by allowing outsiders to model it. To do that, the company organizes competitions in which anyone with a passion for data analysis can battle it out. The contests offered so far have ranged widely, encompassing everything from ranking international chess players to evaluating whether a person will respond to HIV treatments to forecasting if a researcher's grant application will be approved. Despite often modest prizes—Žbontar won just $1000—the competitions have so far attracted more than 3000 statisticians, computer scientists, econometrists, mathematicians, and physicists from approximately 200 universities in 100 countries, Kaggle founder Anthony Goldbloom boasts.

    And the wisdom of the crowds can sometimes outsmart those offering up their data. In the HIV contest, entrants significantly improved on the efforts of the research team that posed the challenge. Citing Žbontar's success as another example, Goldbloom argues that Kaggle can help bring fresh ideas to data analysis. “This is the beauty of competitions. He won not because he is perhaps the best statistician out there but because his model was the best for that particular problem. … It was a true meritocracy,” he says.

    Meeting the mismatch

    Trained as an econometrician, Goldbloom set up his Melbourne-based company last year to meet a mismatch between people collecting data and those with the skills to analyze it. While writing about business for The Economist, Goldbloom noted that this disconnect afflicted many fields he was covering. He pondered how to attract data analysts, like himself, to solve the problems of others. His solution was to entice them with competitions and cash prizes.

    This was not a completely novel idea. In 2006, Netflix, an American corporation that offers on-demand video rental, set up a competition with a prize of $1 million to design software that could better predict which movies customers might like than its own in-house recommendation software, Cinematch. Grappling with a huge data set—millions of movie ratings—thousands of teams made submissions until one claimed the prize in 2009 by showing that its software was 10% better than Cinematch. “The Netflix Prize and other academic data-mining competitions certainly played a part in inspiring Kaggle,” Goldbloom says.

    The prizes in the 13 Kaggle competitions so far range from $150 to $25,000 and are offered by the individuals or organizations setting up the contests. For example, chess statistician Jeff Sonas and the German company ChessBase, which hosts online games, sponsored a Kaggle challenge to improve on the player-ranking system developed many decades ago by Hungarian-born physicist and chess master Arpad Elo. Its top prize was a DVD signed by several world chess champions.

    Still, Kaggle has shown that it doesn't take a million-dollar prize to pit data analyst against data analyst. Kaggle's contests have averaged 95 competitors so far, and the chess challenge drew 258 entries. “When I started running competitions, I found they were more popular and effective than I could have imagined,” Goldbloom says. “And the trend in the number of teams entering seems to be increasing with each new competition.”

    Statistician Rob Hyndman of Monash University, Clayton, in Australia, recently used Kaggle to lure 57 teams, including some from Chile, Antigua and Barbuda, and Serbia, into improving the prediction of how much money tourists spend in different regions of the world. “The results were amazing. … They quickly beat our best methods,” he says.

    Hyndman suspects that part of Kaggle's success is offering feedback to competitors. Kaggle works by releasing online a small part of an overall data set. Competitors can analyze this smaller data set and develop appropriate algorithms or models to judge how the variables influence a final outcome. In the chess challenge, for example, a model could incorporate a player's age, whether they won their previous game, if they played with white or black pieces, and other variables to predict whether a player will win their next game. The Kaggle competitors then use their models to predict outcomes from an additional set of inputs, and Kaggle evaluates those predictions against real outcomes and feeds back a publicly displayed score. In the chess challenge, the results of more than 65,000 matches between 8631 top players were offered as the training data set, and entrants had to predict the winners of nearly 8000 other already-played games.
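
    In outline, that feedback loop amounts to releasing a training set, keeping the outcomes of a second set hidden, and scoring every submission against those hidden outcomes. Below is a minimal sketch with made-up data and a made-up accuracy metric, not Kaggle's actual infrastructure.

```python
# Minimal sketch of a competition-style feedback loop: entrants see only the
# training data; the host scores predictions against hidden outcomes.
# Made-up data and scoring; not Kaggle's actual system.
import numpy as np

rng = np.random.default_rng(3)
features = rng.normal(size=(1000, 5))
outcomes = (features @ np.array([0.8, -0.5, 0.3, 0.0, 0.1]) > 0).astype(int)

train_X, train_y = features[:700], outcomes[:700]    # released to entrants
hidden_X, hidden_y = features[700:], outcomes[700:]  # kept by the host

def score_submission(predictions):
    """Host-side scoring: fraction of hidden outcomes predicted correctly."""
    return float(np.mean(predictions == hidden_y))

# One entrant's (deliberately simple) model: a linear score fitted by least
# squares on the training data, then thresholded on the hidden inputs.
weights = np.linalg.lstsq(train_X, train_y * 2.0 - 1.0, rcond=None)[0]
submission = (hidden_X @ weights > 0).astype(int)
print("public leaderboard score:", score_submission(submission))
```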

    During a competition, which usually lasts 2 months, people or teams can keep submitting new entries but no more than two a day. “Seeing your rivals, and that they are close, spurs you on,” says Hyndman.

    Kaggle encourages the sponsors of the competition to release the winning algorithm—although they are not always persuaded to do so—and asks the winning team to write a blog post about how they tackled the problem and why they think their particular approach worked well. Goldbloom hopes that this means other entrants get something out of the competition despite not winning. They not only hone analytical skills by taking part, he says, but also are able to learn from other approaches.

    Business solution.

    Anthony Goldbloom (left) founded Kaggle to run contests to solve data problems.

    CREDIT: A. GOLDBLOOM; MELBOURNE (INSET)

    Predicting potential

    Although only a handful of its competitions have finished, Kaggle has had promising results so far. Each contest has generated a better model for its data than what was used beforehand.

    Bioinformaticist William Dampier of Drexel University in Philadelphia, Pennsylvania, organized the competition to predict, from their DNA, how a person with HIV might respond to a cocktail of antiretroviral drugs. This problem had been tackled extensively in academia, where the best models predicted the response of a patient to a set of three drugs with about 70% accuracy. By the end of the 3-month contest, the best entry was predicting a person's drug response with 78% accuracy. Dampier says even this gain in accuracy could help doctors refine their treatment strategies beyond the current “guess the drug and check back later” approach.

    Dampier considers Kaggle's approach innovative, noting that it draws in data analyzers with various backgrounds and perspectives who are not shackled by a field's dogma. Such outsiders, he suspects, are more likely to see something different and useful in the data set. “The results talk, not your position or your prestige. It is simply how well you can predict the data set,” says Dampier.

    His point is well illustrated by Žbontar. Despite not tabbing Eurovision's actual winner, Germany, his overall prediction of the results beat a team from the SAS Institute—a data-mining company—and a team from the Massachusetts Institute of Technology. His submission incorporated both past national voting patterns—Eastern European countries tend to vote for each other, for example—and betting odds for the current contest.

    Goldbloom also attributes Kaggle's success to crowdsourcing's capacity to harness the collective mind. “Econometrists, physicists, electrical engineers, actuaries, computer scientists, bioinformaticists—they all bring their own pet techniques to the problem,” says Goldbloom. And because Kaggle encourages competitors to trade ideas and hints, they can learn from each other.

    One sponsor of a Kaggle competition estimates that some entrants may have spent more than 100 hours refining their data analysis. That raises the question: What's the attraction, given the small prizes? Many data analysts, Goldbloom discovered, crave real-world data to develop and refine their techniques. Timothy Johnson, an 18-year-old math undergraduate at the California Institute of Technology in Pasadena, says working with the real data of the chess-ranking competition—he finished 29th—was more challenging, educational, and “fun” than analyzing the fabricated data sets offered in classes.

    For Chris Raimondi, a search-engine expert based in Baltimore, Maryland, and winner of the HIV-treatment competition, the Kaggle contest motivated him to hone his skills in a newly learned computer language called R, which he used to encode the winning data model. Raimondi also enjoys the competitive aspect of Kaggle challenges: “It was nice to be able to compare yourself with others; … it became kind of addictive. … I spent more time on this than I should.”

    What has proved tricky for Kaggle is persuading companies, agencies, and researchers to open up their data. Goldbloom tries to assuage companies' concerns about putting some of their data up on the Web by pointing out that they will get a competitive advantage if the Kaggle contestants produce a better solution to their data problems. So far, two private companies, one government agency, and three universities are among the groups to have used Kaggle.

    As for researchers, Goldbloom says most reject his advances with an almost “visceral reaction.” Overcoming such reluctance to expose data may be key to his company's survival. No one pays to enter a competition, so Kaggle depends on charging a fee to those running a contest—the sum changes from competition to competition. “We aren't profitable yet, but we have some huge projects coming up and we hope to be profitable by the end of the year,” says Goldbloom.

    Žbontar hopes Kaggle survives, as he's looking forward to bettering his prediction model for this year's Eurovision Song Contest and perhaps prying his friends out of more beer money. In a blog post analyzing his victory this past year, he issued this playful challenge: “I have many ideas for next year, which I will, for the moment at least, keep to myself.”
