News Focus: Scientific Publishing

Are You Ready to Become a Number?


Science  27 Mar 2009:
Vol. 323, Issue 5922, pp. 1662-1664
DOI: 10.1126/science.323.5922.1662

Life could be a lot easier if every scientist had a unique identification number. The question is: Who should provide them?

A-1262-2007 and A-1270-2007 are happy to be numbers—and they think you will be, too. The two clinical researchers at Maastricht University Medical Center in the Netherlands—named Jochen Cals and Daniel Kotz, respectively—find it bizarre that in this day and age, it can be next to impossible to find all the papers written by a given scientific author. Enter “Smith, J.” into the PubMed search engine, they say, and you're deluged with more than 15,000 abstract titles. Good luck sorting out who wrote what.

That's why, in a paper in The Lancet last summer, the duo recommended that every scientist sign up for ResearcherID, a free system that promises to do away with such confusion by assigning every scientist a number. If everyone enrolled, they claimed, it would be much simpler to retrieve someone's complete publication record or to follow someone's career path. “It would make life a lot easier,” says A-1262-2007.

He's not the only one to think so. With global scientific production growing fast, it's becoming harder and harder to tell authors apart. A universal numbering system could aid scientists trying to stay on top of the literature, help universities more readily track staff productivity, and enable funding agencies to better monitor the bang they're getting for their buck. An effective identification number might also make it easier to find information about an author's affiliations, collaborators, interests, or simply their current whereabouts.

ResearcherID, launched officially in January 2008, is only one in a wave of initiatives trying to pin a number on researchers. It's the creation of digital information company Thomson Reuters, which hopes to enhance the value of its paid services. Meanwhile, universities, librarians, national agencies, and publishers have devised, or are still hatching, potentially competitive identification systems, each with slightly different purposes in mind.

A-1262-2007 and A-1270-2007 endorsed ResearcherID not because it's perfect, they stress, but because it's the first global scheme ready and available now. But some predict that an ID scheme currently in development by CrossRef, an organization that unites more than 600 scientific publishers, has the potential to emerge as the dominant system, if only because publishers can force scientists to cooperate if they want to get their papers printed. Others say the U.S. National Center for Biotechnology Information (NCBI) may have a strong suit because it could incorporate its system into PubMed Central, the free and immensely popular database of medical and life sciences research at the National Institutes of Health (NIH), of which NCBI is part.

But for the moment, the plenitude of plans is a problem, some say, because to be truly useful, a numbering system has to be universal. “There are initiatives in four or five different silos,” says Clifford Lynch, director of the Coalition for Networked Information (CNI), a Washington, D.C.-based group that promotes the use of technology in scholarly communication. “The lack of interconnection is striking.”

Dr. Who?

The confusion over who's who in science has many sources. Common names such as James Smith and Mary Johnson are one—and with the number of published papers growing by an average of 3% annually, it's only getting worse. Some people also change their names when they get married or divorced, effectively splitting their scientific record in two. Adding to the confusion, journals have widely varying style rules for noting first names and initials. There's only one published scientist in the world with the last name of Varmus, for instance—that's Nobel laureate Harold Eliot Varmus—but his name appears on 352 scientific papers in six different ways.

Today's scientific explosion in Asia is fast exacerbating the problem. Names printed in Chinese characters are not usable in most online searching systems. For papers in English, Chinese authors usually “transliterate” their names using the so-called Pinyin system, which leads to many ambiguities. At least 20 different Chinese names, many of them common, are transliterated as “Wang Hong,” for instance. Korean and Japanese names have the same problem. The Vietnamese use Roman script, but an estimated 40% of them have the family name Nguyen, which puts the Smiths to shame.

Of course, other information can help distinguish one author from another. The J. A. Smith who co-authored a 2008 paper on women, anger, and aggression is probably not the J. A. Smith who studied how to control the size of gold clusters in polyaniline. (And if you're in doubt, you can look at where they work.) Conversely, if the same e-mail address appears on two papers, it's a safe bet that they were written by one and the same researcher. But often, it's not so easy.
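The shared-e-mail heuristic above can be sketched in a few lines. This is a minimal illustration of the idea, not any vendor's actual algorithm; the record fields and addresses are invented for the example.

```python
# Heuristic from the text: a shared (normalized) e-mail address is strong
# evidence that two papers have the same author, while a bare name match
# like "Smith, J." is not decisive on its own.

def same_author(rec_a: dict, rec_b: dict) -> bool:
    """Return True only on strong evidence: both records share an e-mail."""
    email_a = rec_a.get("email", "").strip().lower()
    email_b = rec_b.get("email", "").strip().lower()
    return bool(email_a) and email_a == email_b

paper1 = {"name": "Smith, J. A.", "email": "j.smith@uni.example"}
paper2 = {"name": "Smith, J.",    "email": "J.Smith@uni.example"}
paper3 = {"name": "Smith, J. A.", "email": "jas@other.example"}

print(same_author(paper1, paper2))  # True: same address despite name variants
print(same_author(paper1, paper3))  # False: the name alone proves nothing
```

Normalizing case and whitespace before comparing is what lets the two "Smith" spellings match here.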

There are other problems that a numbering system could do away with, such as spelling errors. Of the more than 200 papers by French epidemiologist Antoine Flahault that are in one literature database, 14 are registered under the name “Flahaut.” He can't wait to become a number.

A fine balance

It's no surprise that Thomson Reuters was the first out of the gate with a worldwide system to assign unique numbers to researchers. Its livelihood, selling analyses of the scientific literature, depends on accurately matching papers with people.

Software can achieve that goal to a certain extent. So-called disambiguation algorithms can crawl through the literature and try to figure out which papers belong to the same author. They use names, as well as words in the abstract and all kinds of “metadata,” such as affiliation, scientific field, co-authors, address, and citations. Thomson Reuters uses disambiguation software in its popular ISI Web of Knowledge. Reed Elsevier has built similar algorithms into Scopus, a rival literature search system launched in 2005.

At the moment, neither PubMed nor Google Scholar, two of the most popular literature search sites, uses disambiguation systems. Such software is expensive and time-consuming to develop, and the algorithms are far from perfect. They can incorrectly conclude that papers by different authors belong to one person, or fail to realize that various papers came from the same author.

There's a fine balance between these two types of mistakes, says Elsevier's Niels Weertman; if an algorithm reduces the number in one category, it usually increases the prevalence of the other. The software in Scopus errs on the side of caution; it only assigns two papers to the same person if it has a high degree of confidence. That means, for instance, that if you search for “Varmus” in Scopus, there appear to be two scientists by that name, one of whom wrote 338 papers and the other 14. When Weertman plugs his own last name into Scopus's name search field, the results include the works of Johannes Weertman and Julia Weertman, who have published many papers in the same journals and who are both professors emeriti in the Department of Materials Science and Engineering at Northwestern University in Evanston, Illinois. (They're also married.) Their intellectual heritage is so hard to disentangle that Scopus cuts it up into more than 60 different clusters.
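The trade-off Weertman describes can be made concrete with a toy similarity score. This is my own simplification, not Scopus's real software: the feature weights and the threshold are invented, and real systems use far richer metadata. The point is only that a high merge threshold avoids wrongly fusing two authors at the cost of splitting one author's record into several clusters.

```python
# Toy disambiguation score built from shared metadata. A conservative
# (high) threshold prefers split errors over merge errors, as the text
# says Scopus does.

def similarity(a: dict, b: dict) -> float:
    score = 0.0
    if a["surname"] == b["surname"]:
        score += 0.3
    if a["affiliation"] == b["affiliation"]:
        score += 0.3
    score += 0.2 * len(set(a["coauthors"]) & set(b["coauthors"]))
    return min(score, 1.0)

MERGE_THRESHOLD = 0.8  # err on the side of caution

p1 = {"surname": "Weertman", "affiliation": "Northwestern",
      "coauthors": ["Lee", "Patel"]}
p2 = {"surname": "Weertman", "affiliation": "Northwestern",
      "coauthors": ["Lee"]}   # same person: one shared co-author tips it over
p3 = {"surname": "Weertman", "affiliation": "Northwestern",
      "coauthors": ["Gao"]}   # spouse: same name and department, no overlap

print(similarity(p1, p2) >= MERGE_THRESHOLD)  # True  -> merged
print(similarity(p1, p3) >= MERGE_THRESHOLD)  # False -> cautiously kept apart
```

Lowering `MERGE_THRESHOLD` would fuse the third record too, which is exactly the kind of merge error a conservative system trades away.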

Identity solution.

Jochen Cals (left) and Daniel Kotz argue that a global system of unique identifiers, such as their ResearcherIDs, will help scientists better search research literature and work with colleagues.


Because of such imprecision, disambiguation algorithms can't assign each scientist a single, unique number. (In Scopus, Varmus now has two numbers, whereas the Weertmans together have 65.) And that's why science information companies need human help. Elsevier, for instance, has staffers who “curate” the data by doing additional search work—mostly at offices in Asia—and manually merge separate groups of papers that belong to the same person.

Another strategy is to let scientists themselves take care of the job—after all, they should know which papers are part of their oeuvre. That's what motivated Thomson Reuters to launch ResearcherID. If you sign up—which some 30,000 scientists have done—you are first assigned a unique number. (Cals's number, A-1262-2007, means he was the 1262nd person to sign up in 2007.) Then you can “claim” the papers the company's software suggests are yours, coupling them to your number; you can also add others that the search engine missed—for instance because you wrote them before your divorce—or upload lists of your own citations.
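Going only by the example in the text (A-1262-2007 = the 1,262nd registrant of 2007), the ID format can be pulled apart mechanically. The field names below are my own; Thomson Reuters does not document them this way here.

```python
# Parse a ResearcherID of the form <letter>-<ordinal>-<year>, assuming the
# structure implied by the article's example "A-1262-2007".
import re

def parse_researcher_id(rid: str) -> dict:
    m = re.fullmatch(r"([A-Z])-(\d+)-(\d{4})", rid)
    if not m:
        raise ValueError(f"not a recognized ResearcherID: {rid!r}")
    prefix, ordinal, year = m.groups()
    return {"prefix": prefix, "ordinal": int(ordinal), "year": int(year)}

print(parse_researcher_id("A-1262-2007"))
# {'prefix': 'A', 'ordinal': 1262, 'year': 2007}
```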

For such a voluntary disambiguation system to work, scientists need an incentive, however. ResearcherID's carrot is that scientists can analyze citations to their papers, or place a “widget” on their Web site or blog that automatically retrieves a list of their most current papers whenever someone clicks on that page. They can also post a profile page with information about themselves on the ResearcherID site, much like people do on sites such as Facebook or LinkedIn. And like those sites, these services are free. Scopus currently doesn't allow users to register with a unique number, but it has plans to do so as well, Weertman says. Scientists can also suggest corrections to Scopus.

Publishers' plans

Some universities and research organizations have also begun setting up numbering systems. Part of the motivation stems from the advent of institutional repositories, online databases in which researchers can post copies of their papers and other data. Repositories are an answer to the pressure to make the fruits of taxpayer-funded research freely available to the public. They also help research managers track staff productivity and scientific impact. But again, achieving either goal is hard if you can't easily tell who's who—as the Weertmans at Northwestern demonstrate.

Several countries, meanwhile, have developed ID systems at the national level. In the Netherlands—a front-runner when it comes to institutional repositories—every researcher has had a Digital Author Identifier (DAI) since 2007. The system was designed by SURF, a technology development foundation in which most research institutes and universities collaborate. So far, however, few Dutch scientists know about the numbers; Cals, for instance, wasn't aware he had one before he recommended ResearcherID in The Lancet, and he doubts that national numbers are very useful in an international endeavor such as science. But Gera Pronk, who helped develop DAIs, says national systems could eventually be knit together into an international one.

Another ID system is in the works at NIH, the biggest funder of biomedical research in the world. NIH wants to better track what its grantees are publishing, whether it's papers, book chapters, patents, or other output. NCBI Director David Lipman says his institute is consequently developing a unique identifier system for grantees that could later be expanded to all biomedical authors. Lipman has already discussed this idea with the editors of Nature, Science, and the Proceedings of the National Academy of Sciences, who have endorsed the concept.

Scientific publishers have their own reasons to support a numbering system: It would make it easier to do business. Giving scientists a number should speed up manuscript handling, help locate reviewers for a paper and detect conflicts of interest, facilitate royalty payments, and give marketing departments a leg up. For publishers, CrossRef was a logical candidate to develop a personal ID system; it already provides the infrastructure for Digital Object Identifiers (DOIs), the numbers that uniquely identify each published scientific article and that make it possible to click from one citation to the next on the Web.

The system now in development, called ContributorID, would ideally provide one identification with which a researcher could interact with any scientific publisher, whether as an author, journal editor, or reviewer, says CrossRef's Geoffrey Bilder. Whenever a research team submits a manuscript, each member would include a ContributorID number, establishing an enduring link to the paper's DOI. If publishers have trouble selling scientists on the system's benefits, such as doing away with a multitude of login data, they could bring out a stick: You can't publish without a number.

ContributorID would compete with Thomson Reuters and Elsevier, both of which are CrossRef members. Still, Weertman and James Pringle, vice president for product development at Thomson Reuters, say their companies won't necessarily oppose the plan, because there may be ways to cooperate. For instance, CrossRef is interested in using the disambiguation software developed by either company. And, Bilder points out, any universal numbering system, even one developed by a third party, could add value to both companies' products.


Who controls it?

With so many initiatives, there's lots of discussion about which one could—or should—prevail. Because a numbering system would be for the ages, some say it shouldn't be in private hands or held by a single company. “I would be very worried if an individual publisher controlled this,” says CNI's Lynch, adding that he would be “much more comfortable” if it were operated by NCBI or a broad group like CrossRef, whose membership includes so-called open-access publishers and scientific societies like AAAS, the publisher of Science. Given the power of CrossRef members to enforce the system, Lynch predicts that “it will probably carry the day.” But Pronk says a publisher-operated system may remain unpalatable to universities; they are more likely to stick with their own, she says.

Meanwhile, some say the current lack of coordination is not just wasteful but could add to the confusion. Like people who have accounts at several social networking sites, researchers could end up with a whole series of numbers. Lipman says that may be the only way to make progress, however. “If we all had to sit down and talk until we agreed on a system, we would never get anything done,” he says. And there's nothing wrong with letting a couple of systems evolve, he says; they can always be linked or merged later on.

Whichever system emerges victorious, it will still face problems. One is how to authenticate that the scientists claiming papers are who they say they are. ResearcherID does not currently verify that. “You can log in and claim every paper by Albert Einstein and have a lot of fun,” Lynch says. Pringle says the system is “self-policing”: If authors claim papers they have not actually written, others will protest soon enough, he says. Whether that really suffices remains to be seen. But for keeping track of NIH grantees or dealing with publishers, more secure identification systems are necessary, akin to those used for logging in to an online bank account.

Then there's the problem of what to do with the millions of old papers whose authors cannot help disambiguate their work because they are dead or no longer active. “Nobody has the time and money to do the detective work to get all the retrospective stuff 100% right,” says Lynch. “It will always be a little probabilistic and flaky.”

Unless, that is, the detective work is left to the wisdom of crowds, says Lipman. After all, if people are enthusiastic enough to set up Wikipedia pages about the most arcane topics, they may also be willing to help sort out who's who in 2 centuries of scientific literature.
