News this Week

Science 02 Aug 1996:
Vol. 273, Issue 5275, p. 525
  1. Computers in Biology

    Gilbert J. Chin and Tim Appenzeller

    Bioinformatics and the Internet are the linked themes of this year's special issue on computers. They are two of computer science's boom areas, where growth and growing pains are both extraordinary. The first story in our News coverage looks at one of the growing pains—Internet congestion—and means of alleviating it. Other News stories examine the proliferation of computational tools for making sense of new DNA and protein sequences, and the new access to those tools afforded by the World Wide Web and the Web-based language Java. A final story looks at the development of the Internet in Russia, which will provide vital links for one of the world's largest scientific communities.

    The Articles look at computational tools for organizing and analyzing large amounts of biological data. John White and colleagues describe the collection and annotation of microscopic images that capture the development of the nematode Caenorhabditis elegans, one of the most widely studied organisms in developmental biology. The database allows any researcher to access the life story of any cell in the wild-type nematode, and in theory that of any mutant, with precise resolution in space and time. Liisa Holm and Chris Sander focus on another kind of database, this one consisting of high-resolution protein structures. The authors carry the neophyte through the methods for organizing protein shapes into a database and comparing newly solved structures to known ones.

    Also on the Web: Computers '95: Fluid Dynamics.

  2. Networks: Fast Lanes on the Internet

    Traffic jams on the networks are slowing scientific collaboration. Possible solutions range from reservations-only service on the existing Internet to high-speed links just for scientists

    When the Internet was being promoted just a few years ago as the tool of the future for scientific collaborations, Paul Woodward was exactly the kind of researcher the system's architects had in mind. An astronomer who directs the University of Minnesota's Laboratory for Computational Science and Engineering, Woodward is part of a team at Minnesota and the University of Colorado collaborating on studies of convection in the sun. The Internet was supposed to provide a way for the two groups to exchange data and work closely together without ever leaving their labs. But it hasn't worked out that way. “When we want to sit down and look at data sets critically and brainstorm, either we go there or they come here,” says Woodward. The reason: traffic congestion on the information superhighway.

    “We'd like to be able to have visualization of this data and point things out to each other as if we were in the same room,” says Woodward. But such graphical, real-time simulation is now impractical over the network. In principle, the Internet is just about capable of providing the 2 megabits per second (Mbs) of bandwidth (capacity) that Woodward would need to send his pictures back and forth. But in practice, he can count on only 0.2 to 0.7 Mbs, far too little for the electronic collaboration he envisions.

    It's a problem familiar to every World Wide Web user who has waited in frustration as Netscape endlessly displays the message: “Host contacted. Waiting for reply.” And it's only going to get worse. Most universities and research institutes are connected to the Internet via cables that carry 1.5 or 45 Mbs, but the Internet is so clogged that everything slows down. It's like a highway with a speed limit of 55 mph where everyone ends up going 35 because there's too much traffic—an example, says computer scientist David Farber of the University of Pennsylvania, of how “success tends to get you into trouble.” The networks have far more users now than even 2 years ago, and a larger fraction of them are running video and audio programs requiring lots of bandwidth. The proliferation of glitzy Web sites with sound and color graphics has only added to the traffic jams as people try to download elaborate pictures from popular sites. “These things simply take more capacity than old e-mail,” says Mark Luker, director of the National Science Foundation's (NSF's) networking program.

    For most scientists, like everyone else, the delays in e-mail and Web browsing are an inconvenience. But for researchers like Woodward, the holdups are intolerable. Indeed, some of scientists' most ambitious visions for Internet use—operating telescopes or other instruments from a distance or collaborating in real time to run models or analyze large data sets—are on hold. Says Tom DeFanti, director of the Electronic Visualization Laboratory at the University of Illinois, Chicago, “The Internet has become a mass-migration highway. Anyone trying to get work done is looking at alternatives.”

    The alternatives DeFanti refers to can be as simple as changing work habits. At AT&T Research, says Steve Crandall, a staff scientist, “people modify their behavior. They come in in the early morning, six or seven o'clock, to get some of their high-bandwidth work done.” Universities and research laboratories can also address the problem by increasing the bandwidth of their connections to the Internet. But high-bandwidth connections are expensive, and they provide no solution to congestion elsewhere on the Internet.

    That is why more and more researchers are concluding that what's needed are strategies that make distinctions among users, offering high bandwidth and prompt service to those who need it while leaving e-mail and other services that can tolerate delays to fend for themselves on congested lines. Such solutions would benefit commercial users as well as scientists, because the delays that are now mostly a nuisance could become a real impediment to commercial expansion of the Internet.

    One set of strategies would create fast lanes on the existing Internet by prioritizing data traffic, allowing scientific and commercial traffic to bypass the congestion; another would create “private roads”: experimental high-speed networks reserved exclusively for scientists who do intensive computing. Neither solution will be easy to implement. The idea of giving some traffic preferential treatment conflicts with the egalitarian culture of the Internet and would probably require changes in pricing for Internet services. Specialized high-speed networks, meanwhile, are themselves vulnerable to being overwhelmed by traffic growth. But somehow, says Hans-Werner Braun, a staff scientist who works on networking at the San Diego Supercomputer Center, “we need to make the current Internet more predictable and more verifiable. If you're a telephone or airline customer, you have certain expectations [about performance]. We need to have similar expectations for the Internet.”

    Sharing the pain

    The Internet is now a best-effort system in which all data packets are treated equally. Current Internet routers—the way stations along the network, which read the addresses of incoming packets and send them along to their destinations—work on a first-in, first-out scheduling algorithm. “As packets arrive, they're stuck in a queue, and as bandwidth becomes available they're shipped out,” explains Sally Floyd, a network researcher at Lawrence Berkeley National Laboratory (LBNL). When the networks become congested, service degrades equally for all packets, whether they are e-mail, which can tolerate delay, or real-time video, which cannot.
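
    To make the contrast with the prioritizing schemes discussed below concrete, here is a toy sketch, in Python, of that first-in, first-out behavior (not any router's actual code, and the link capacity is an assumed figure): every packet joins one queue, so under congestion a burst of video packets delays the e-mail behind it just as long.

    ```python
    from collections import deque

    LINK_CAPACITY = 3  # packets the link can forward per time step (an assumed figure)

    def fifo_router(arriving_packets):
        """arriving_packets: one list of (kind, id) tuples per time step."""
        queue = deque()
        for step, batch in enumerate(arriving_packets):
            queue.extend(batch)                      # arrivals join the back of the line
            for _ in range(min(LINK_CAPACITY, len(queue))):
                kind, pid = queue.popleft()          # departures leave strictly in arrival order
                print(f"t={step}: forwarded {kind} packet {pid}")
            if queue:
                print(f"t={step}: {len(queue)} packets still waiting (all kinds equally)")

    # A burst of video ahead of a single e-mail packet: the e-mail waits its full turn.
    fifo_router([[("video", i) for i in range(5)] + [("email", 0)], []])
    ```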

    One barrier to tackling congestion on the Internet is the lack of good statistics on how data is flowing, where the crunch points are, and how much different services contribute to the congestion. “There are no widely accepted metrics to assess service,” Braun says. The fundamental dilemma confronting network planners is clear, however: “If you have more people using the network than you have bandwidth for, should you push people off, or should you let everyone suffer?” asks Scott Shenker, a computer scientist at Xerox PARC.

    On the U.K.-U.S. trans-Atlantic link, one of the most heavily congested segments of the Internet, network researchers are testing a scheme for pushing off some users—with a minimum of pain. “At the moment the international link is unusable,” says Jon Crowcroft of University College, London, and a member of the Internet Architecture Board. Then he corrects himself. “No, there was a recent upgrade. Before that it was unusable. Now it's workable about a third of the day.”

    To improve that figure, the network community in Britain is trying to reduce the amount of Web traffic. When U.K. users taking part in the experiment want to visit a Web site in the United States, they point their browsers at one of six Web cache sites within Britain. If another U.K. user has recently visited the U.S. Web site, it will have been stored on the cache and won't have to be called up from across the Atlantic again. If the site requested isn't yet in the cache, the system will go out and fetch it, then store it for later users. It also checks to see if the page has been modified since a given date.
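
    The logic of such a cache can be sketched in a few lines. The Python below is a hypothetical illustration, not the software the U.K. sites actually run, and the one-hour freshness window is an assumed policy: it serves a stored copy when one exists, fetches across the Atlantic only on a miss, and revalidates older pages with an If-Modified-Since request, as described above.

    ```python
    import time
    import urllib.error
    import urllib.request

    CACHE = {}       # url -> (time fetched, page body)
    MAX_AGE = 3600   # seconds before a stored page is revalidated (assumed policy)

    def get(url):
        entry = CACHE.get(url)
        if entry and time.time() - entry[0] < MAX_AGE:
            return entry[1]                              # fresh hit: no trans-Atlantic traffic
        request = urllib.request.Request(url)
        if entry:                                        # stale copy: ask only whether it changed
            stamp = time.strftime("%a, %d %b %Y %H:%M:%S GMT", time.gmtime(entry[0]))
            request.add_header("If-Modified-Since", stamp)
        try:
            body = urllib.request.urlopen(request).read()
        except urllib.error.HTTPError as err:
            if err.code == 304 and entry:                # 304 Not Modified: reuse the stored copy
                body = entry[1]
            else:
                raise
        CACHE[url] = (time.time(), body)
        return body
    ```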

    “Caches aren't great, but they will do as a stopgap,” says Crowcroft. No one in the United Kingdom will be forced to go through caches, but as an incentive, part of the international link will be set aside for traffic routed through them. “If people go [to U.S. sites] via the U.K. national Web cache sites, they will get very good performance, but otherwise they will get bad performance. With wide publicity we hope this will fix a large proportion of the problem,” Crowcroft explains. At least 30 British universities are already using the caches. A similar system has been in place in Israel since November 1995, where all university Web traffic is required to go through cache servers. By relieving congestion due to Web use, these measures should open the bandwidth needed for critical applications.

    Still under development are congestion-cutting schemes that would carry such prioritizing much further. One would rely on the routing computers to assign priorities automatically to different kinds of Internet traffic. At LBNL, for example, Floyd is working on a scheduling algorithm called class-based queuing. “The idea is that the router can classify arriving packets into classes,” Floyd says, then give them different grades of service. For example, traffic for the Web might be put into a class of its own and restricted when the network is congested. However, given the huge commercial push to expand Web use and make it more user-friendly, any scheme to improve scientific use at the expense of Web use may well conflict with growing commercial interests.
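
    A toy Python sketch of the idea (illustrative only, not Floyd's algorithm, and with invented per-class quotas): arriving packets are sorted into classes, and each class gets only its share of the link per time step, so a flood of Web traffic can no longer starve the others.

    ```python
    from collections import deque

    SHARES = {"web": 1, "email": 1, "realtime": 2}   # assumed packets-per-step quota per class

    def classify(packet):
        return packet["class"]          # a real router would infer this from ports and protocols

    def class_based_router(arrivals, steps=5):
        queues = {cls: deque() for cls in SHARES}
        for pkt in arrivals:
            queues[classify(pkt)].append(pkt)
        for t in range(steps):
            for cls, quota in SHARES.items():
                for _ in range(min(quota, len(queues[cls]))):
                    pkt = queues[cls].popleft()
                    print(f"t={t}: sent {cls} packet {pkt['id']}")

    # Ten Web packets no longer delay the lone real-time packet queued behind them.
    class_based_router([{"class": "web", "id": i} for i in range(10)]
                       + [{"class": "realtime", "id": 0}, {"class": "email", "id": 0}])
    ```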

    Reservations required

    What's more, automatic schedulers can't evaluate the traffic's urgency by distinguishing, say, a recreational video from a scientific visualization effort. But another set of approaches to reducing congestion—so-called reservation schemes—lets users divide their own Internet traffic into priority classes. The leading reservation scheme is a system called Resource Reservation Protocol (RSVP), which Shenker, Steve Deering, and others at Xerox PARC and elsewhere helped develop.


    Computers use RSVP to request a specific “quality of service” from the network on behalf of an application such as video teleconferencing or real-time simulation. That request might be something like “I need 2 Mbs of bandwidth for the next 2 hours between New York and Phoenix.” The system then sends a quality-of-service request from router to router along the data's path, and the routers try to reserve the necessary bandwidth. If some routers along the path have not yet implemented RSVP, the request is ignored for those segments of the trip. Thus RSVP can't guarantee a particular quality of service for an entire trip across the Internet. But even if only some routers take part, they will improve the average performance of the network considerably. “The feeling in the Internet community is that most users would rather have better service than an absolute guarantee,” says Robert Braden, head of the RSVP project at the University of Southern California's Information Sciences Institute.
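
    The hop-by-hop behavior described above can be sketched schematically in Python (an illustration of the idea, not the RSVP protocol itself; the router names and capacities are invented): the request visits each router on the path, RSVP-capable routers with spare capacity set bandwidth aside, and the rest simply ignore it.

    ```python
    class Router:
        def __init__(self, name, supports_rsvp, capacity_mbs):
            self.name = name
            self.supports_rsvp = supports_rsvp
            self.free = capacity_mbs         # bandwidth not yet promised to anyone

        def reserve(self, mbs):
            if not self.supports_rsvp:
                return "ignored (router has no RSVP)"
            if self.free < mbs:
                return "refused (link already committed)"
            self.free -= mbs
            return f"reserved {mbs} Mbs"

    def request_path(routers, mbs):
        # Partial coverage still helps the average case; only a complete chain of
        # reservations would amount to a guarantee.
        return {router.name: router.reserve(mbs) for router in routers}

    path = [Router("New York", True, 45), Router("Chicago", False, 45), Router("Phoenix", True, 45)]
    print(request_path(path, 2))   # the "2 Mbs from New York to Phoenix" request above
    ```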

    RSVP has been running in experimental test beds for the past couple of years, and a number of vendors are starting to produce routers that implement the scheme. RSVP and other reservation systems require more than hardware, however; they also need some mechanism for deciding who should get reservations for high-quality resources. “As soon as you provide any kind of preferred service, you need a mechanism to prevent abuse,” says Braden.

    The obvious mechanism is pricing: Make it more expensive to obtain a higher priority (see related story). But the prospect of charging more for first-class service makes some network researchers reluctant to embrace resource reservations. Even Deering, who worked on RSVP, says he now has his doubts about it: “Reservations are expensive and complex and haul in a kind of charging system, and what's the criterion for saying who gets reservations and who doesn't?” He acknowledges that “special critical applications like remote surgery need reservations. It's [reservations for] every phone call that I don't approve of.” One major concern is that reservations might price out those with less money, such as schools.

    The same fears apply to another technology for opening up fast lanes on the Internet: a system known as ATM (asynchronous transfer mode). Unlike RSVP, which offers a first-class version of existing Internet service, ATM is a new way of packaging and sending data. It replaces the variable-size data packets that travel over the existing Internet with data cells of a fixed size. And instead of allowing the packets in a single message to wend their way to their destination via many different routes, it opens up a single “virtual circuit” from source to destination. Both features make the system more predictable than the standard Internet. In the existing network, unpredictable delays can result when, for example, a small packet gets stuck behind a large one while the large packet waits for enough bandwidth to become available. But an ATM-based network can predict, based on traffic, how long a transmission will take.
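
    A small sketch of the cell idea (illustrative only; real ATM cells carry a 48-byte payload behind a 5-byte header, a detail the story omits): chopping every message into equal-sized cells, each tagged with its virtual circuit, makes the time to drain a queue a simple function of its length.

    ```python
    CELL_PAYLOAD = 48   # bytes of data per cell (the standard ATM figure)

    def segment(message: bytes, circuit_id: int):
        """Split one message into fixed-size cells tagged with their virtual circuit."""
        cells = []
        for offset in range(0, len(message), CELL_PAYLOAD):
            chunk = message[offset:offset + CELL_PAYLOAD].ljust(CELL_PAYLOAD, b"\x00")
            cells.append((circuit_id, chunk))
        return cells

    def drain_time_ms(queued_cells, cells_per_ms):
        # With equal cells, a short transmission is never stuck behind one huge packet:
        # its wait is just (cells ahead of it) / (cells the link moves per millisecond).
        return len(queued_cells) / cells_per_ms

    cells = segment(b"A" * 2000, circuit_id=7)
    print(len(cells), "cells, drained in", drain_time_ms(cells, cells_per_ms=100), "ms")
    ```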


    Moreover, because ATM sets up a virtual circuit for each transmission, it allows a user to request a specific quality of service in advance. A user running a multimedia application, for example, would request—and presumably pay for—service with very little cell delay because it's a real-time application, while e-mail users would be content with cheaper, slower service.

    With software changes, ATM can run over existing Internet cables and routers, says Mark Laubach, who chairs an Internet Engineering Task Force working group on ATM, “but it doesn't work well unless done in hardware,” meaning expensive new cables and other equipment. Still, ATM networks are up and running now: for example, the Bay Area Gigabit Testbed, an experimental high-speed ATM network connecting 15 sites in Northern California. It is used for a variety of collaborative scientific experiments including remote studies with optical and electron microscopes. MCI's Internet backbone (a major Internet pathway linking smaller users, the way an interstate highway connects feeder roads) also uses ATM.

    Some lucky scientists stymied by the congestion on the Internet don't have to bother with caching, RSVP, or ATM. They can move off the existing network altogether. One “private roadway” already available to scientists is the NSF-sponsored vBNS (very high-speed Backbone Network Service), which connects five NSF supercomputing centers at 155 Mbs on an ATM network and provides bandwidth for cutting-edge network applications and research. It is not meant to be used for day-to-day operations such as e-mail and ftp, but that restriction may be difficult to maintain because the NSF is tying more universities and other sites into the vBNS.

    That's the paradox of the Internet—and the reason that congestion is likely to plague scientists for the foreseeable future. Scientists move to high-speed networks, eventually everyone else jumps on board, and then the scientists have to move up another notch. “A few of us are out on the edge doing these things on very fast machines, and then 10 years later everyone else is doing it,” says Paul Bash, a research scientist at Argonne National Laboratory. The Internet began as an experiment in computer networking, then became a popular phenomenon. Now it's groaning under the demand, and researchers are trying to make it safe for science again.

    Ellen Germain is a science writer in Arlington, Virginia.

  3. Networks: Will Pricing Be the Price of a Faster Internet?

    Clever technology may succeed in opening some fast lanes on the Internet for scientific users who need high capacity (see related story). But many Internet researchers say that keeping those fast lanes from clogging like the rest of the Internet will take something more than technology: some form of economic incentive—pricing, in other words—so that when the network is congested, bandwidth will go to the users who pay for it. At the moment, a surgical team doing real-time surgery over a video link, say, doesn't have any more claim on Internet resources than does a teenager using up at least as much bandwidth by watching recreational videos. “Pricing is a time-honored, tried and true method of dealing with this,” says economist Jeffrey MacKie-Mason of the University of Michigan.

    The basic idea is simple, MacKie-Mason explains. Varying prices by time of day or type of service (e-mail, video, Web traffic, etc.) will help control demand because only those people who really need the highest bandwidth service will pay for it. “For congestion purposes, [the function of] pricing isn't to raise money but to allow people to express a preference of how much they value [the services],” he says. Already, researchers are developing the accounting software and payment schemes that would be needed. They are also debating the administrative and sociological aspects of pricing—who should administer it, and how it will affect the current Internet free-for-all.

    [Figure: Jeffrey MacKie-Mason. Photo credit: Univ. of Michigan]

    Internet experts who favor differential pricing say that it is fairer than the current system, in which an institution pays a fixed annual fee to its Internet service provider for unlimited use of its connection to the Internet, no matter how many users the institution has or what Internet services they favor. While Internet providers such as CompuServe and America Online bill individuals by duration of connection and sometimes by type of service, most universities and laboratories don't impose similar charges on their users. At the University of California, Berkeley, says computer scientist Pravin Varaiya, “20% of the users account for 90% of the traffic.” As a result, he says, “light users subsidize heavy users.”

    What's more, the revenues that universities or Internet providers would derive from congestion pricing could be used to add capacity to the Internet. “You have to have the money to pay for expanding the capacity, and it's better for that to come from the users who actually use it,” says Hal Varian, dean of the School of Information Management and Systems at Berkeley. That's already happening in the commercial world: MCI recently announced a venture to establish new high-speed backbone connections that will be available to customers—mainly businesses—willing to pay for premium service to insure that their traffic gets through when the Internet is too crowded.

    The everyday Internet doesn't yet have a system for metering usage, but Varian says, “I feel the problem is more tractable than people are willing to admit.” Some accounting software is already available. In New Zealand, the University of Waikato operates the single Internet gateway to the United States, via Hawaii, on behalf of the New Zealand universities. It meters international Internet traffic from the universities according to type of service (such as ftp or e-mail), time of day, and number of bytes, and bills them accordingly. Varaiya and his colleagues have developed software that also authenticates the user, a measure to help prevent fraud. The accounting and authentication only add about 150 milliseconds of overhead time to the operation. “Accounting doesn't add too much in terms of time and money,” Varaiya says.
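
    A hypothetical Python illustration of metering of that general kind (the categories and rates below are invented for the example, not the Waikato gateway's real tariff): each traffic record is billed by service type, time of day, and bytes carried, and the charges are totaled per institution.

    ```python
    RATES_PER_MB = {                      # assumed prices per megabyte
        ("ftp", "peak"): 0.40,    ("ftp", "off-peak"): 0.15,
        ("email", "peak"): 0.10,  ("email", "off-peak"): 0.05,
        ("web", "peak"): 0.30,    ("web", "off-peak"): 0.12,
    }

    def period(hour):
        return "peak" if 8 <= hour < 18 else "off-peak"

    def bill(records):
        """records: iterable of (institution, service, hour of day, bytes) tuples."""
        totals = {}
        for institution, service, hour, nbytes in records:
            charge = (nbytes / 1e6) * RATES_PER_MB[(service, period(hour))]
            totals[institution] = totals.get(institution, 0.0) + charge
        return totals

    print(bill([("University A", "ftp", 14, 25_000_000),    # daytime file transfer
                ("University B", "email", 2, 3_000_000)]))  # overnight e-mail
    ```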

    The system includes a purchasing agent that can ask the user to decide if a requested service is too expensive. A user might set up the agent to accept charges of less than 50 cents, but to pop up a window asking for the user's okay on charges greater than that. Varaiya is now planning to try out the system on a group of about 200 users on campus. “We want to give them variable rates of service and a pricing structure and see how they react,” he explains.

    Along with software for doing the accounting, a pricing scheme requires a means of payment. The traditional method is centralized billing, as is done by telephone companies. That's how New Zealand's system works. But other proposals would eliminate billing and require users to pay as they go, perhaps by attaching “digital stamps,” purchased in advance from their Internet service provider, to each message. Varian suggests that Internet tolls could also be collected by the micropayment technologies currently being developed for commercial uses of the Internet—systems capable of coping with price increments of thousandths of a cent. A user might have $50 in a micropayment “card,” and a few thousandths of a cent would be deducted automatically each time the user sent out e-mail or browsed the Web.
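
    A toy sketch of the prepaid card idea in Python (the class and threshold logic are invented for illustration; the dollar amounts echo the examples above): tiny charges are deducted automatically, and anything above a set threshold is referred back to the user, much like Varaiya's purchasing agent.

    ```python
    class MicropaymentCard:
        def __init__(self, dollars, approval_threshold_cents=50):
            self.balance = dollars
            self.threshold = approval_threshold_cents

        def charge(self, cents, approved=False):
            if cents > self.threshold and not approved:
                raise PermissionError("charge exceeds threshold; ask the user first")
            if self.balance < cents / 100:
                raise ValueError("card exhausted; buy more credit from the provider")
            self.balance -= cents / 100
            return self.balance

    card = MicropaymentCard(50.00)   # the $50 card from the example above
    card.charge(0.003)               # a few thousandths of a cent for an e-mail message
    card.charge(0.005)               # ... and a bit more for a Web page
    print(f"${card.balance:.5f} remaining")
    ```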

    The big unknown, the pricing enthusiasts agree, is how Internet users would respond to congestion pricing. “Would they turn off the images in Netscape?” Varian wonders. “No one knows.” Nor does anyone know whether Internet users and administrators could even be persuaded to adopt a pricing scheme—something to which Internet culture is historically hostile. Says MacKie-Mason, “There's a cultural resistance to allowing people to buy their way to the head of the line.” The question is how much congestion users will tolerate before that cultural resistance breaks down.

  4. Computational Molecular Biology: Software Matchmakers Help Make Sense of Sequences

    Gene sequencers are spinning out data at a mind-boggling rate. They have already sequenced the complete genomes of several bacteria and brewer's yeast, they will have completed the genome of the roundworm Caenorhabditis elegans in a couple of years, and they intend to wrap up the human genome by 2005. A string of the four letters A, G, T, and C, designating the four nucleotides that make up DNA, is unreeling from sequencing labs at an ever-increasing pace, now nearly a million nucleotides a day. For the human genome alone, the sequence will total 3 billion nucleotides.

    All this would be little more than so much genetic ticker tape without some way to decipher its real meaning, which is largely hidden in the genes—the stretches of DNA, amounting to barely 3% of the human genome, coding for the proteins that are the workhorse molecules of life. The first step is to recognize the genes from their distinctive sequences of nucleotides. The next is to infer the function of the proteins they code for—and the key to doing that is to find related genes and proteins whose functions are already known.

    As molecular evolutionist Russell Doolittle of the University of California, San Diego, explains, “The structures of all these proteins and the genes that code for them are all related through a big evolutionary expansion—some small number run through biochemical Xeroxes and used over and over in different settings.” The challenge of learning the function of a newly generated sequence is the kind of challenge that computer scientists in other fields have been wrestling with for decades: spotting obvious, or less than obvious, similarities in different strings of data.

    Welcome to the world of computational molecular biology. Over the past few years, biologists-turned-computer scientists and computer scientists-turned-biologists have begun churning out algorithms to find genes and other significant features in DNA sequences and to compare and contrast DNA, RNA, and protein sequences. The explosion has been triggered not only by supply—the information spewing from the genome projects—but also by demand from biologists hooked up to the World Wide Web, says David States, a computational biologist at the Institute for Biomedical Computing (IBC) at Washington University in St. Louis. “Most biologists in academic settings now have access to the Internet and Web browsers,” says States, and that allows them to send their sequences to on-line analytical tools—or even borrow the tools and wield them on their own workstations (see p. 591).

    This past June, States and his colleagues at the IBC hosted the Fourth International Conference on Intelligent Systems in Molecular Biology (ISMB) in St. Louis to survey the explosion. The computational tools under discussion ranged from simple programs that search for similarities between known and unknown sequences to ambitious efforts to find complete genes in DNA sequences and relate the proteins they produce to known protein structures. Many of the new tools rely on techniques developed by researchers in machine learning and artificial intelligence, and the hottest subject of the conference, known as hidden Markov models, springs directly from statistics and linguistics.

    Sustaining all these efforts is a sense of mission, says Doolittle. The ISMB researchers “are missionaries and proselytizers, and they have this great esprit de corps.” With the tools now under development, biologists “should be able to relate proteins whose relationships weren't detectable and do faster searching of genomes and comparisons of genomes,” says Doolittle—“all sorts of things that weren't possible before.”

    Make me a match. One reason sequence comparisons are so powerful is evolution's conservative style. While the 20-letter amino acid alphabet of proteins could in theory spell out a nearly infinite number of proteins, actual proteins are variations on a limited number of themes. Human beings alone have perhaps 100,000 proteins, but we and other organisms “are dipping into a pool of relatively slowly evolving proteins we all share,” says David Lipman, head of the National Center for Biotechnology Information (NCBI). Altogether, the number of different protein families is “maybe less than 1000.” The result is that comparing an unknown gene to known ones has a reasonable chance of coming up with a match—providing the computer algorithm can recognize subtle similarities.

    The first problem is to find the genes, which in higher organisms, known as eukaryotes, come interspersed with pieces of noncoding DNA called introns. One approach is to look for the telltale patterns of DNA that mark the boundaries between the coding and noncoding regions. Researchers have come up with various pattern-recognition techniques for that purpose, says David Haussler, a computer scientist at the University of California, Santa Cruz. Among them are neural networks—computer algorithms that “learn,” refining their ability to recognize a pattern as they are exposed to more examples of it. The most widely used gene-finder for eukaryotes, the GRAIL program developed by Ed Uberbacher and Richard Mural of Oak Ridge National Laboratory in Tennessee, takes this approach. But along with exploiting clues in the unknown sequence itself, researchers can also determine whether it codes for a protein—and glean hints to that protein's function—by comparing it with known genes.
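
    A very crude sketch of boundary spotting, for flavor only (this is not GRAIL, which relies on trained neural networks; it leans on the textbook rule, not stated above, that most eukaryotic introns begin with the letters GT and end with AG): even a naive scan for those two signals turns up candidate coding/noncoding boundaries for a real gene-finder to weigh.

    ```python
    def candidate_introns(dna, min_length=20):
        """Yield (start, end) spans that begin with 'GT' and end with 'AG'."""
        dna = dna.upper()
        for start in range(len(dna) - 1):
            if dna[start:start + 2] != "GT":              # possible donor (5') splice site
                continue
            for end in range(start + min_length, len(dna) - 1):
                if dna[end:end + 2] == "AG":              # possible acceptor (3') splice site
                    yield (start, end + 2)
                    break                                 # report only the nearest acceptor

    example = "ATGGCCGTAAGT" + "T" * 30 + "AGGCTTAA"      # a made-up stretch of DNA
    print(list(candidate_introns(example)))               # [(6, 44), (10, 44)]
    ```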

    For the past few years, the two workhorse programs for that kind of comparison have been BLAST, written by researchers at the NCBI, and FASTA, written by computational biologist Bill Pearson at the University of Virginia. Both take an unknown sequence—DNA, RNA, or protein—and compare it to known sequences, looking for the best possible match. The programs then calculate the match's statistical power, which provides “a basis for saying that the relationship between two sequences may have some biological meaning,” says geneticist Warren Gish, an author of BLAST, now at Washington University. “If something is known about the biological function of the database sequence, then we might infer our query sequence has the same or similar function, or the same or similar structure.”

    Both BLAST and FASTA are variations on an algorithm written in the 1980s by Mike Waterman at the University of Southern California in Los Angeles and Temple Smith of Boston University, but they use shortcuts that reduce computing time. BLAST, for instance, starts by scanning known sequences for short stretches of nucleotides or amino acids that are similar but not necessarily identical. The program then uses a scoring matrix for each match, awarding a positive, negative, or zero score, depending on how good a match it is. If the match is sufficiently close, then the program uses the sequence as a seed to proceed in both directions, comparing longer alignments “to see just how big an alignment score one can get,” says Gish.

    The most recent version of BLAST, which Gish discussed at the ISMB meeting, does a better job than its predecessors of taking into account small insertions or deletions of amino acids. As Stanford University computational biologist Michael Levitt explains, “It often happens that two sequences that are very similar to one another differ by just a few amino acids inserted or deleted relative to one another.” These insertions and deletions can make sequence comparison difficult by knocking related sequences out of “register.” Because of them, the old version of BLAST would often miss a significant match. But by adding up the scores of multiple high-scoring segments on the same sequence, then subtracting a penalty for any insertions or deletions, the new algorithm should now catch the similarity.
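
    A greatly simplified sketch of that seed-and-extend idea, in Python (a toy, not the BLAST program: it uses short exact words as seeds, a bare +1/−1 score in place of a real scoring matrix, extends only to the right, and ignores the gap penalties just described):

    ```python
    WORD = 3   # seed length (an assumed, toy value)

    def match_score(a, b):
        return 1 if a == b else -1        # stand-in for a real substitution scoring matrix

    def seeds(query, subject):
        """Yield (query index, subject index) pairs where a short word matches exactly."""
        words = {query[i:i + WORD]: i for i in range(len(query) - WORD + 1)}
        for j in range(len(subject) - WORD + 1):
            word = subject[j:j + WORD]
            if word in words:
                yield words[word], j

    def extend(query, subject, qi, sj):
        """Extend a seed to the right, remembering the best score reached along the way."""
        score = best = sum(match_score(query[qi + k], subject[sj + k]) for k in range(WORD))
        qi, sj = qi + WORD, sj + WORD
        while qi < len(query) and sj < len(subject):
            score += match_score(query[qi], subject[sj])
            best = max(best, score)
            qi, sj = qi + 1, sj + 1
        return best

    def best_hit(query, subject):
        return max((extend(query, subject, qi, sj) for qi, sj in seeds(query, subject)), default=0)

    print(best_hit("HEAGAWGHEE", "PAWHEAE"))   # a tiny protein-like example; prints 3
    ```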

    Programs like BLAST and FASTA can find matches for 40% to 50% of all new protein or gene sequences, says Lipman. Beyond this comes the twilight zone of sequence similarities, in which the potential homologies are less obvious because the evolutionary relations are more distant. One of the newest attempts to push into the twilight zone relies on the sophisticated statistics of hidden Markov models (HMMs).

    These algorithms have their roots in the statistics work of the Russian mathematician A. Markov, who died in 1922. HMMs were developed in the 1960s and later put to work in speech recognition programs, which address a problem similar to the one facing computational biologists: analyzing an unfamiliar string of data—a string of sounds in this case—to work out how similar it is to a known string. Haussler was the first to suggest that these software algorithms could be put to work on genome database searching problems, in a technical report he published in 1992 with his colleagues Anders Krogh, Saira Mian, and Kimmen Sjölander. The report quickly circulated through the community, says Sean Eddy of Washington University, “and while it was clearly not yet ready for prime time, it was also clear it had an awful lot of potential.”


    Getting the essence. Instead of starting with an unknown sequence and looking for a match, HMM algorithms go the other way: They analyze a range of known sequences from a single family of proteins or genes, looking for the essential features of that family—a step generically known as creating a profile. The result is a model of what new members of the same family should look like—a hidden Markov model. For example, says Haussler, an HMM for the globin protein family, which includes hemoglobin, would try to capture the features that make globin proteins unmistakable: “The globin starts with a variable number of amino acids that occur before the first helix, called the A helix, which consists of 16 amino acids. The 16 positions in the A helix have propensities to be certain amino acids, and you can go through them and describe these propensities. Then after the first helix, there's a loop region consisting of a variable number of amino acids; then you start the B helix, etc. At some point you get to a position where an amino acid actually binds the heme iron, and that position is quite conserved among different globins. It has to be a histidine.”

    In the course of its learning process, the HMM takes examples of known globin sequences and a priori knowledge about the variability typically found in amino acid sequences, says Haussler. It then churns out a probability distribution for each globin residue position along the way, taking into account insertions and deletions that might change the register of one globin protein compared to another.

    “It's not a black-and-white pattern recognition method,” says Philippe Bucher, of the Swiss Institute for Experimental Cancer Research outside Lausanne. “It doesn't say this is allowed, this is not. It says that in the fifth position, there is a high probability that this amino acid is found and a very low probability that another amino acid is found, etc.” Once HMMs for enough gene or protein families have been constructed, says Eddy, “we can take a newly predicted sequence, hand it to that software, and have it say it is very likely to belong to some specific protein family, say, or maybe it's a new family entirely.”
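
    The flavor of such a model can be sketched in a few lines of Python. This strips out the insertion and deletion states of a real hidden Markov model and uses invented probabilities for a three-position toy family (position 2 strongly prefers histidine, the way the heme-binding position does in the globins); a candidate sequence is scored position by position against the family's preferences relative to chance.

    ```python
    import math

    BACKGROUND = 1 / 20            # chance of seeing any given amino acid in a "random" protein

    PROFILE = [                    # invented position-specific probabilities for a toy family
        {"A": 0.5, "G": 0.3, "S": 0.2},
        {"H": 0.9, "Q": 0.1},
        {"L": 0.4, "I": 0.4, "V": 0.2},
    ]

    def log_odds(sequence):
        """Sum over positions of log( P(residue | family) / P(residue | background) )."""
        score = 0.0
        for column, residue in zip(PROFILE, sequence):
            probability = column.get(residue, 0.001)   # small floor instead of a zero probability
            score += math.log(probability / BACKGROUND)
        return score

    print(log_odds("AHL"))   # fits the family's preferences: strongly positive
    print(log_odds("WWW"))   # does not: strongly negative
    ```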

    HMMs were one of the hottest items at the June meeting, says computational biologist Chris Sander of the European Molecular Biology Laboratory in Heidelberg. But he adds that no one knows how useful they will turn out to be because they have yet to be widely used. (“What makes HMMs so popular,” says Lipman, not entirely seriously, “is that the name is so tantalizing. Something is hidden and we're finding it and we have a Russian name to do it.”) Eddy says, however, that HMM software has performed well at Washington University's Genome Sequencing Center, where he and his colleagues have used it for day-to-day analysis of sequences generated by a C. elegans sequencing project, and it seems to find matches for 5% more new sequences than does BLAST or FASTA. Adds Haussler, who has been testing HMMs for their ability to find distant matches, “It's not a panacea, but it should get you that little extra push.”

    HMMs, however, will never answer biologists' ultimate question, which is what a new gene's protein actually does. Sequence similarity to a known gene or protein doesn't give the full answer, because genes and proteins with completely different sequences sometimes perform similar functions. The key to a protein's function is its three-dimensional (3D) structure, and proteins with very different sequences occasionally fold up into similar shapes. So some biologists have tried skipping the process of matching sequences to known sequences and instead tried to match the new sequences directly to a structure.

    Following a thread. The front-runners so far in this endeavor have been a class of algorithms, known appropriately as threading algorithms, that take an unknown sequence and try to thread it through a known structure to see how well it might fit. What makes threading algorithms promising, says Sander, is that instead of trying to predict the 3D structure of a protein based only on its sequence—a goal that computational biologists acknowledge lies far in the future—they proceed by making comparisons. Given a new sequence, he says, “they ask does it fit one of the several hundred known structures, yes or no? And if it does fit, what is the precise, best arrangement of [amino acid] sequences in that 3D structure? It's a clever way of simplifying the problem.”


    To answer those questions, Sander explains, threading algorithms look at how the unknown sequence and the known structure match up with respect to properties that affect a protein's folding, such as whether it is hydrophilic or hydrophobic at particular points. “When you thread the sequence through the structure,” says Sander, “for each arrangement, you ask does a hydrophobic residue of the sequence end up in hydrophobic position of the structure, yes or no? If it does you give it a one; if it doesn't you give it a zero. Now you add up those numbers for all positions in the protein. And that gives you a number for one arrangement of the sequence in the structure. Then you push [the sequence] through further, and for the new arrangement you ask what is the number for this function and so on. And then you do it for every other known 3D protein sequence, and you compare what you have at the end” to find the most likely structure of the protein.
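
    A bare-bones Python sketch of the scoring Sander describes (illustrative only: the hydrophobic set is a common simplification, and the “structures” are reduced to invented strings of buried and exposed positions): slide the sequence along each known structure, count how often a hydrophobic residue lands in a buried position, and keep the best-scoring arrangement.

    ```python
    HYDROPHOBIC = set("AVLIMFWC")      # a common, simplified set of hydrophobic amino acids

    # Each known structure is reduced here to a string of position types:
    # 'b' = buried (hydrophobic environment), 'e' = exposed to water.
    KNOWN_STRUCTURES = {
        "globin-like": "ebbbeebbbbee",
        "barrel-like": "bebebebebebe",
    }

    def arrangement_score(sequence, environments):
        """Count positions where a hydrophobic residue sits in a buried position."""
        return sum(1 for residue, env in zip(sequence, environments)
                   if residue in HYDROPHOBIC and env == "b")

    def thread(sequence):
        best = None
        for name, envs in KNOWN_STRUCTURES.items():
            for offset in range(len(envs) - len(sequence) + 1):   # each way of threading it through
                score = arrangement_score(sequence, envs[offset:offset + len(sequence)])
                if best is None or score > best[0]:
                    best = (score, name, offset)
        return best

    print(thread("KAVLLE"))   # (best score, best-fitting structure, offset along it)
    ```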

    Threading algorithms originated in work by a number of investigators in the late 1980s. After an initial burst of popularity, they are now in what Sander describes as the “consolidation phase,” which means, he says, that “there are improvements being reported consistently but with less excitement than the original round.” Not only are the algorithms themselves being refined, says Lipman, but “as more structures become known, these threading methods will become that much more effective.”

    One recent illustration of their power, says Lipman, came last year when researchers discovered the obesity gene, so called because of its effect on mice when it is mutated (Science, 2 December 1994, p. 1477). Although researchers were unable to find a sequence match for the protein encoded by the gene, known as leptin, threading techniques showed that the new sequence was likely to have a structure resembling a well-studied class of proteins known as helical cytokines. A year later, when the receptor for the protein was sequenced, it turned out to be a cytokine receptor, confirming the threading prediction.

    Lipman thinks that the prediction could have given the researchers a head start in the search for the receptor. “You could have leveraged the information,” he says. “If you had looked in the sequence databases for examples of cytokine receptors that are not identical to ones we already know about, you would have been able to pull out a handful, and in that handful was the sequence for the leptin receptor a whole year before it was published. So these kinds of predictions could be extraordinarily useful.”

    Whatever the ultimate value of any particular technique, biocomputing experts say that their arsenal of comparative methods will become more powerful as the databases of known genes expand. “We have these islands of knowledge, and we can exploit each one to carry us to the next island of knowledge,” says Lipman. “It's going to be easier and easier to do this sort of thing.” Biologists will be well on their way to turning the data unreeling from the genome labs into real knowledge.


  5. Bioinformatics: Working the Web With a Virtual Lab and Some Java

    Curt Jamison knows the drill all too well. Say you have a new piece of DNA to analyze, says the University of Maryland geneticist. You're looking for sequences similar to those in known DNA and clues to the function of their proteins. There are plenty of databases and analytical software out there, but until this year, anyone trying to use these biocomputing tools faced what Jamison calls “a kind of Tower of Babel. There are a lot of tools available, but they all speak a different language.” Jamison recalls having to translate sequence data from MS-DOS text to a different format to run BLAST, a sequence comparison program, on a Sun workstation. Then he had to convert all the sequences back into straight text to send them over the Internet to another computer running DNAstar, a program that aligns matching sequences to show similarities and differences. There had to be a better way, he thought: Biocomputing needed a common language, or at least some common ground.

    Now Jamison and other biologists are finding plenty of both, from two separate developments. The common ground takes the form of the Biology Workbench, unveiled in June by the National Center for Supercomputing Applications (NCSA) at the University of Illinois. At one site on the World Wide Web, biologists can find a “gateway” that provides one point of access to a far-flung collection of protein, DNA, and bibliographic databases and tools for searching and swapping data and analyzing sequences. The Workbench itself runs on NCSA supercomputers, but any biologist with access to the Web can use them. “It's point-and-click biology,” says Shankar Subramaniam, who led the NCSA team (including Jamison) that built the bench.

    The common language is Java, a programming language that allows a user to retrieve little programs called “applets” from remote sites on the Web and run them on a local machine without worrying about software compatibility. While the Biology Workbench brings data and tools together at a common site, Java brings both back to the user, where biologists can create sophisticated displays from data retrieved from the source—homing in, for instance, on specific data subsets—without spending time going back and forth at each step.

    Just what that means to scientists became clear in June, at the Meeting on the Interconnection of Molecular Biology Databases in St. Louis, where researchers saw prototypes of Java genome browsers that allowed them to literally zoom in on the fly, going from a section of a Drosophila chromosome down to the DNA sequences, or browse along comparative linkage maps that stretched a grass gene alongside a corn gene. Some of these browsers are versatile enough that researchers are adapting them to work on the human genome, or on yeast. “Java has come on the scene like gangbusters, and it blew away everything else,” says bioinformatics specialist Stanley Letovsky, who works on the Genome Data Base (GDB) at Johns Hopkins University.

    These feats haven't toppled the Tower of Babel completely. The Workbench doesn't have Java's interactive finesse, and Java itself has some problems. Some observers caution that its graphics capabilities are limited, and Java is the focus of both security concerns (see related story) and complaints that existing security features limit its usefulness, actually preventing researchers from saving their work on their own machines. Moreover, it's not a universal language, for applets can't go anywhere on the Web for data; they can only access the servers where they originated, which limits the kind of analysis they can do. Still, says Harvard University's William Gelbart, a principal investigator of the Flybase Drosophila genetics project, “we view Java as the wave of the future. It's democratic,” because it lets anyone with a Java-equipped Web browser tap into the latest software. Or as Jamison puts it: “Power to the people.”


    Putting the Web to work. Jamison (then at NCSA), Subramanian, and their colleagues already had that slogan in mind 2 years ago, even before Java, when they laid the groundwork for the Biology Workbench. They realized that by combining the universal access provided by the Web with something called dynamic federation, they could put a whole arsenal of biocomputing power into the hands of any Web-equipped biologist. NCSA's Bruce Schatz explains the concept: “One [Web site] takes your query—give me all proteins with this particular sequence, for instance—and sends it to 30 different sources in their own formats,” he says. “It gets the data back, translates it into your language, and gives you the information. So you can send queries to many databases across the Web and change the results back into Web pages. It's completely transparent.”
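
    A toy Python sketch of the fan-out Schatz describes (the sources, query functions, and formats here are all invented stand-ins): one gateway sends the query to each source in that source's own terms, then translates every answer back into a single common form.

    ```python
    def query_source_a(protein):                    # imagined source returning plain strings
        return [f"SRC-A::{protein}_1"]

    def query_source_b(protein):                    # imagined source returning records
        return [{"db": "SRC-B", "hit": f"{protein}-like sequence"}]

    SOURCES = {
        "Source A": (query_source_a, lambda raw: {"source": "Source A", "id": raw.split("::")[1]}),
        "Source B": (query_source_b, lambda raw: {"source": "Source B", "id": raw["hit"]}),
    }

    def federated_query(protein):
        """Send the query everywhere, then normalize each source's answers into one format."""
        results = []
        for name, (query, translate) in SOURCES.items():
            for raw in query(protein):
                results.append(translate(raw))
        return results

    print(federated_query("trypsin"))   # one uniform list, whatever each source returned
    ```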

    Subramaniam, Jamison, and their NCSA group were already at work on such a federation of databases, called ENQuire, that searched both large and small genome databases. They realized that they could make their gateway still more useful by adding tools for analyzing sequences for structure and function, which would run on NCSA supercomputers. The result, in June, was the Biology Workbench.

    In a typical search, says Subramaniam, a user might type the name of the digestive protein “trypsin” in the Workbench query box, then select three databases (PIR, GenBank, and PDB) to search. A set of similar protein sequences identified by other researchers soon appears, and the user clicks the “import” button to bring them to the Workbench. Clicking the “Multiple Sequence Alignment” button aligns them. Another tool called MSAShade highlights all the matches and mismatches, showing which sequence is closest to the original. A protein structure prediction tool will even pop up a three-dimensional image of the molecule, which gives clues to its function. “So by running a machine with just a Web browser, a user has access to programs that pack all this power,” Jamison says. The Workbench now provides access to about 30 databases and 100 analysis tools.

    He's quick to add, however, that while everyone can get to the Workbench, it's not a spectacularly fast or flexible working environment. The interface is written in the Web's vernacular, hypertext markup language, or HTML, and as languages go it's pretty static. “HTML basically works with forms, like documents. There are whole classes of things that you can't do with it,” Jamison says. “You can highlight a chunk of sequence, for instance, but to do something with it, you have to fill in its parameters on a form. Then the server does the processing and ships it back to you.” To focus on a smaller chunk of the sequence, you have to fill in more parameters and resubmit. “HTML is back and forth, back and forth, every time you want to add a term. Or maybe you want to rotate a protein view or loosen a folding bond. HTML can't do that for you.”

    Some genes with your Java? But the Java language can, to a certain extent. “What Java does is move the code over to the client,” says Gregg Helt, a biologist in the University of California, Berkeley's, Drosophila Genome Project and author of the Java-based Drosophila Genome Browser prototype. “And that makes things more dynamic.” Java applets are embedded in HTML pages, the standard form for displaying information on the Web, but they run on the user's machine and pull data off the server they came from to reshape it at the user's command. The user doesn't need to go back to the original server for each operation. So, says Helt, you get the advantages of the Web's universal access and Java's facile visualization and interaction.

    Actually, Helt says, “I was opposed to Java at first” because its graphics seemed limited and he worried about bugs. “I set out to prove it didn't work. What I found was that, hey, it works pretty well.” It was good enough, in fact, for him to put the genome browser on the Web earlier this year. It shows a 3-megabase region of the Drosophila genome as a physical map, with chromosome bands. Users can zoom in on a band to see the subbands and still finer landmarks known as contigs and P1s. “You can't do that with HTML,” notes Helt. “It would have to go to the server and get another picture file.”

    Then, says Helt, you can unpack and analyze the information hidden in each of those features. “Click on the P1, and you get a window with an annotated map of that sequence: BLAST homologies, gene predictions, and known GenBank entries. You can get down to the DNA level, and for a P1 that's 80,000 base pairs. You get a DNA viewer in a pop-up window, and you can select features in the annotated map and the viewer will move to them.” And once users have selected a gene, another window will let them see what cells in the early fruit fly embryo express it.

    Helt has also included a primer-prediction tool for scientists who want to amplify a particular segment with the polymerase chain reaction. A window lets a user specify primer length, temperature, and other options; the tool then recommends primer sequences, and arrows appear on the map to show where they would bind. “That's a way to find the best primers to pull out a particular region,” Helt says.

    Researchers who have browsed the browser are impressed. “One of the biggest wins is the platform independence” that comes with the Web and Java, says Nomi Harris of Lawrence Berkeley National Laboratory's Human Genome Informatics Group. “I also like the hierarchical way the display works, so you can see many levels of detail.” Harris, in fact, is working to adapt the fly map browser to work on human genome data. Over at Stanford University, Michael Cherry, who works on yeast and Arabidopsis genome sequencing projects, is also working on an adaptation. “Greg has done an excellent job,” he says.

    And Gelbart notes that the applet, running on the Web, solves a long-standing Flybase problem: keeping the various maps up to date. “The important thing in these projects is to get information out in a timely manner. Until now, if we wanted high-quality dynamic maps, the best we could do was the Encyclopedia of Drosophila, a CD-ROM that has a lot of this material. We'd come out with a new one every 6 months or so. But with Java on the Web, as we update maps and graphics, you update.”

    A brew of applets. But Java isn't limited to fly genetics. Applets are popping up like Starbucks coffee houses. In St. Louis, Jamison showed off a comparative map viewer for plant genomes. “We want to make information available about agriculturally important organisms,” he says. The viewer takes data about plant genomes stored in a database at the U.S. National Agricultural Library and aligns linkage maps, which lets a user scrolling down the maps find homologies across species. Researchers at Stanford are working on SStruct, an applet that allows a user to choose an RNA sequence and see its secondary structure, which influences how it interacts with other molecules in the cell. GDB, at Johns Hopkins, which contains location information for markers linked to genetic diseases, is working on a “multimap” that will let users view the same region drawn from several different data sources, revealing gaps in marker coverage. ACEDB, a database first developed for the Caenorhabditis elegans sequencing project, is getting a Java front-end called Jade. And Subramaniam says he plans to come out with a Java-ized Workbench.

    Nor are all applets linked to specific databases. The University of Pennsylvania's Computational Biology and Informatics Laboratory has been trying to develop a library of adaptable, reusable software modules for standard biocomputing tasks—a.k.a. bioWidgets. “One big problem in the genome project is that we keep reinventing the wheel,” says Penn's Christian Overton. “Groups that write software don't shrink-wrap their programs and provide support for other groups.” Widgets, an idea that came from Penn's David Searls, could eliminate the wasted effort, says Overton. “They are small, clean graphical modules like a map viewer or a sequence viewer. …Widgets have been implemented in several languages now. But the problem was: How to get them across the Internet?”

    The solution seems to be Java, Overton says. With collaborators from Berkeley and several other places, the Penn group is starting a widget consortium, to create them and make them available as applets. “The response has just been amazing. Everybody wants in,” Overton says. Jamison explains why: “That will save me a heck of a lot of work.”

    That savings may not come right away. Helt, Jamison, and their colleagues admit that unlike other computer languages such as Perl, Java doesn't come with many drawing routines that are useful for scientific graphics, so the programmers have to build the routines from scratch. And that takes time. A bigger problem is that for security reasons Web browsers don't let Java applets store data on the local computer, which means that all the vaunted Java interactivity goes to waste at the end of a session. Although you can zoom in on a region on a linkage map and jot some markers down in a notebook, “this is a fairly big hole,” Jamison says. “A lot of what we do, we'd like to save.” (Netscape and Sun Microsystems—who gave the world Java—have made noises about changing this soon.)


    True transparency. Some researchers add that Java's fortes—visualizing data and some limited analysis—are not enough to turn the world of biology upside down. “I think everyone should look at Java seriously,” says David Lipman, director of the National Center for Biotechnology Information at the U.S. National Institutes of Health. But he notes that applets may not give biology the same kind of boost that the field would get from, say, a comprehensive, searchable library of protein motifs—a tool for detecting distant evolutionary relationships that does not presently exist.

    What could truly speed things along, just about everyone agrees, is true transparency: If every database, every analysis tool, on every platform, could interact, then the Web would be any browser's oyster. “There are literally hundreds of biological databases out there. You're not ever going to get them all in one place” or write applets that can interact with them all, says Overton. The trick will be to get them to interact with one another. The tool for that task, many feel, will be something called CORBA.

    It stands for Common Object Request Broker Architecture, a moniker that means, in essence, that every database everywhere on the Web will have the same wrapping on the outside. “CORBA is a standard for packaging remote objects,” says Tom Flores of the European Bioinformatics Institute (EBI) in Cambridge, U.K. “It's a kind of ‘middleware’ between the databases and the clients. With CORBA, I don't need to know anything about a particular implementation. You could change from flat files to a relational database, and I don't need to know. The program gives me a handle on the object.”
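
    A rough Python analogy to the wrapping Flores describes (CORBA itself specifies interfaces in its own definition language and works across machines; this only shows the idea, with made-up data): clients program against one stable interface, and the storage behind it can change from flat files to a relational database without the client noticing.

    ```python
    from abc import ABC, abstractmethod

    class SequenceDatabase(ABC):                 # the agreed-on "wrapper" every site exposes
        @abstractmethod
        def fetch(self, accession: str) -> str: ...

    class FlatFileBackend(SequenceDatabase):
        def __init__(self, records):
            self.records = dict(records)         # stand-in for parsing flat files
        def fetch(self, accession):
            return self.records[accession]

    class RelationalBackend(SequenceDatabase):
        def __init__(self, rows):
            self.rows = dict(rows)               # stand-in for issuing a SQL query
        def fetch(self, accession):
            return self.rows[accession]

    def client_code(db: SequenceDatabase, accession):
        # The client never knows, or cares, which implementation sits behind the interface.
        return db.fetch(accession)

    print(client_code(FlatFileBackend({"X0001": "MADEUPSEQUENCE"}), "X0001"))
    print(client_code(RelationalBackend([("X0001", "MADEUPSEQUENCE")]), "X0001"))
    ```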

    EBI plans to put CORBA handles on several objects in the coming year. Flores says they have just won preliminary approval from the European Union for a grant to place CORBA wrapping around several large databases such as EMBL, SWISS-PROT, PIR, and GDB, and several new and specialized ones, such as P53 and TRANSFAC. And then, says Flores, “we'd really have some power. Think of the serious research you could do while sitting at home.”

  6. Bioinformatics: Do Java Users Live Dangerously?

    Java is giving researchers a dose of excitement by allowing them to visualize or analyze data with software commandeered from distant machines (see related story). It is also fueling visions of a bustling Internet economy. The language, which makes it possible to download foreign programs over the Internet and run them on a local computer, will allow users to invite electronic shopkeepers into their home machines to conduct transactions. But these very attributes are also giving security experts the jitters.

    The ability to download and run foreign programs—applets, in Java-speak—could in theory expose a host machine to computer viruses and other digital mischief. Java's designers were well aware of this vulnerability, so they included an array of software guardians that screen each applet for admission and then keep it in what Java engineers call a sandbox, where it can't run amok. But several computer experts have learned to use bugs in Java to bypass its safeguards. Each new security breach has made the vision of a bustling Internet economy seem more distant and has raised concerns about other uses of Java as well. The breaches have also rattled staff at JavaSoft, the Sun Microsystems unit responsible for Java. “I've named gray hairs on my head after [the attacks],” says Marianne Mueller, a staff engineer at the company.

    The latest Java bug was found by David Hopwood, a student at Oxford University. On 1 June, he announced on the Web that he had created an applet capable of undermining Java's security safeguards. “He found a bug—a subtle bug,” agrees Mueller. Hopwood's subtle bug exposes the “security manager” at Java's front door to attack. The security manager acts as a gatekeeper. Whenever an applet tries to do something potentially mischievous, like renaming files or creating new directories on the hard drive, the security manager slams the door. But Hopwood wrote a program that, when downloaded by a Java host, kills the security manager and replaces it with an impostor—a phony who dozes at the gate, leaving it open for later invaders.

    To make matters worse, three Princeton University researchers had already found a bug that enables a rogue applet to escape from the sandbox. “If you combine the two attacks,” says Hopwood, “you can run any code”—even code that tells your computer to send bank records to Taiwan or to erase your hard drive. “It's a vicious attack,” agrees Drew Dean, a member of the Princeton team.

    “I'm feeling kind of bloodied,” admits Mueller. JavaSoft and Netscape, whose World Wide Web browser runs Java applets, are both working to patch the holes. But new holes keep cropping up, in part a reflection of Java's power. “There's a tension between being secure and doing interesting things,” says Mueller. “Often, we're between a rock and a hard place.” Security experts also blame a hurried production schedule. “Overall, companies are racing too rapidly to add new features” to software including Java, says Edward Felten, head of the Java research effort at Princeton. “And new code means new bugs.” But he notes that JavaSoft is working hard to identify any new vulnerabilities—even going so far as to fund the Princeton group's efforts.

    So far, the attacks seem to be limited to laboratory exercises. Although rumors of Java viruses are rife, Hopwood calls them “hype. Sensationalist hype.” Even so, Hopwood, Dean, and Felten all disable Java and JavaScript in their browsers when wandering through the Internet.

    Although only the computer experts are nervous now, consumers might have reason to worry in the future. If Java becomes the backbone of a new virtual marketplace, applets will interact with shoppers on their home computers. This means that Java applets will have to be able to accept money—and hostile applets could easily eavesdrop on transactions and skim off some of the proceeds. In that case, “[a Java-based attack] would be a good way to steal money a little at a time from a lot of people,” says Dean. “It's all fun and games until there's real money involved.”

  7. Internet: High-Speed Network Will Link Russia's Far-Flung Universities

    MOSCOW—On 10 June, Russia's far-flung scientific enterprise suddenly seemed a little less dispersed. At a new university computer center in Yaroslavl, northeast of Moscow, science education officials and journalists took part in a 2-hour teleconference over the Internet with a similar gathering 3500 kilometers away at Novosibirsk University in Siberia. The topic was computer networking plans in Russia, and the event's technical underpinnings provided a glimpse of the future: fiber-optic cables that allowed data to be exchanged four times faster than over most connections in Russia and, at each end, a spanking new Internet center equipped with dozens of workstations.

    Over the next 5 years, 30 similar university Internet centers will be set up all across Russia as part of a $130 million initiative jointly financed by George Soros, the American businessman and philanthropist, and the Russian government. The symbolic unveiling of the first two centers drew a crowd worthy of a major state function. Participants included Vice Premier of Russia Vladimir Kinelyov, Minister of Science and Technological Policy Boris Saltykov, and Deputy Chair of the State Committee on Higher Education (Goskomvuz) Alexander Tikhonov, along with a handful of university rectors. They all came to pay homage to the power of computer networking. These and future centers, they hope, will enable Russian scientists and academics to tap into the information resources of the global Internet (see other stories in this special section) and collaborate more effectively with their colleagues at home and abroad.

    “In a few years there will be a large community of people in the universities who cannot imagine their life and work without the Internet, which we don't have at the moment,” says Pavel Arsenyev, the director of the Internet Centers at Universities program, as the Soros initiative is called. The centers are also meant to play a broader role in society, by laying the groundwork for local social and educational programs. In a letter read at the ceremony, Soros predicted, “The full impact of the centers will be felt by the year 2000, when Russia together with the rest of the world will fully enter the Information Age.”

    The initiative, announced on 15 March, came as a surprise because the American billionaire had more than once declared that he would end his support of Russian science by 1996. Instead, the announcement marked the largest single commitment Soros has yet made in Russia: $100 million over 5 years to set up the Internet centers, equip them, and cover their operating costs. The Russian government will provide another $30 million through Goskomvuz and the Ministry of Science. The government's contribution will fund high-capacity fiber-optic cables and satellite links, to fulfill the program's goal of providing each center with a 256-kilobit-per-second link to the Internet.

    The program will concentrate on upgrading Internet access for provincial universities—a change from the focus of Soros's past support, two-thirds of which has gone to scientific institutions in the big centers, mostly in Moscow and St. Petersburg. Arsenyev says that Soros wanted to redress the balance, and Saltykov applauds the shift. “It would be very just and fair,” he told Science, “to develop the provincial centers of education and research.” Saltykov points out that Russia has “old and very good universities in many provincial cities, particularly in Siberia, like Tomsk, for example. They are strong in both education and research, and connecting them to the global communications network would give them unlimited access to the information and enormous opportunities to develop.”

    By 18 April, Goskomvuz had drawn up a list of 32 Russian universities, and less than 2 months later the centers in Yaroslavl and Novosibirsk were ready for business. Saltykov notes that the first two centers had a head start. In Novosibirsk, Soros funding had already helped to link the 26 research institutes in Academgorodok (Academic Town, a special area where most of the research institutes are located) and the State Public Scientific-Technical Library of the Siberian Branch of the Russian Academy of Sciences into a single network and couple it to the Internet through links with a capacity of 2 megabits per second. In Yaroslavl, too, a local network was already in place, with connections to the European academic network and the FREEnet, an existing low-capacity network that links Russian Academy of Sciences institutions, research centers, and universities.

    To help professors and students take advantage of these links, the new centers offer labs of personal computers, Web-authoring facilities, and training in the use of the Internet. The Yaroslavl and Novosibirsk projects are also meant to establish an Internet infrastructure for the outside community. In Yaroslavl, for example, the university center will serve as an Internet access provider for the local schools, all of which are linked to the center through dedicated telephone lines, and for other local nonprofit organizations like libraries and museums. The center will also host a World Wide Web site giving information on the city, its culture and education, and the research that goes on in local institutions.

    In spite of the enthusiasm from Russia's top science officials, some of those involved in the project have a nagging worry: uncertain government funding for the new cables and satellite connections. The Russian government managed to fund only about two-thirds of its commitments under last year's science budget. “I hope that there will be no delays in funding,” says Saltykov.

    German Mironov, the Yaroslavl State University rector, is certain of the ultimate payoff. Besides benefiting the academic community, he says, the proliferation of Internet access “will help Russia's formerly closed society become fully open.”

    Andrey Allakhverdov is a free-lance writer in Moscow.
