Sharing by design: Data and decentralized commons


Science  11 Dec 2015:
Vol. 350, Issue 6266, pp. 1312-1314
DOI: 10.1126/science.aaa7485

Ambitious international data-sharing initiatives have existed for years in fields such as genomics, earth science, and astronomy. But to realize the promise of widespread sharing of scientific data, intellectual property, data privacy, national security, and other legal and policy obstacles must be overcome (1). Although these issues have attracted much attention in some circles, they have often taken a back seat to addressing technical challenges. Yet failure to account for legal and policy issues at the outset of a large transborder data-sharing project can lead to undue resource expenditures and data-sharing structures that may offer fewer benefits than hoped. Drawing on our experience with the Belmont Forum, a multinational earth change–research program, we propose a framework to help plan data-sharing arrangements with a focus on early-stage decisions including options for legal interoperability.

A rich literature beginning with the work of Ostrom (2) addresses the organization and governance of common pool resources shared by communities of users in contexts ranging from the global environment to communal living spaces. More recent work has expanded these principles to knowledge commons: collections of intangible resources, such as digital libraries, scholarly publications, and scientific data (3). Responding to calls for increased international scientific collaboration, several expert bodies have developed high-level principles for transborder data sharing (4–6). Although these efforts lay the groundwork for broad data-pooling initiatives, critical design decisions must be made before larger issues of governance and operation.

A SPECTRUM OF CENTRALIZATION. Although little empirical research exists on commons structures for data sharing and related costs, we have observed four basic structural models for scientific data pools along a continuum ranging from the most to the least centralized (see the table).

(i) fully centralized: all data are aggregated in a single, centrally managed repository;

(ii) intermediate distributed: repositories are distributed and separately maintained, but may be interconnected by a central access portal, share technical service components, and utilize a common data-exchange format [sometimes called a federated database system (7)];

(iii) fully distributed: repositories are maintained locally and are not technically integrated, but share a common legal and policy framework that allows access on uniform terms and conditions (legal interoperability);

(iv) noncommons: repositories are largely disaggregated and lack technical and legal interoperability and, at most, may share a common index.
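
The four models above can be read as a classification over three properties: whether data sit in a single repository, whether separate repositories are technically linked, and whether they share a legal and policy framework. The following is a minimal sketch of that reading, not a formal definition from the literature. The names `Structure` and `classify_pool` and the three boolean parameters are illustrative assumptions, and real data pools may fall between categories.

```python
from enum import Enum


class Structure(Enum):
    """The four structural models, ordered from most to least centralized."""
    FULLY_CENTRALIZED = "fully centralized"
    INTERMEDIATE_DISTRIBUTED = "intermediate distributed"
    FULLY_DISTRIBUTED = "fully distributed"
    NONCOMMONS = "noncommons"


def classify_pool(single_repository: bool,
                  technically_linked: bool,
                  legally_interoperable: bool) -> Structure:
    """Map three yes/no properties of a data pool to one of the four models.

    The three parameters are a simplifying assumption made for illustration.
    """
    if single_repository:
        # (i) all data aggregated in one centrally managed repository
        return Structure.FULLY_CENTRALIZED
    if technically_linked:
        # (ii) separate repositories joined by a portal, shared services,
        # or a common data-exchange format
        return Structure.INTERMEDIATE_DISTRIBUTED
    if legally_interoperable:
        # (iii) no technical integration, but uniform terms of access
        return Structure.FULLY_DISTRIBUTED
    # (iv) neither technical nor legal interoperability
    return Structure.NONCOMMONS


if __name__ == "__main__":
    # Locally maintained repositories that share only a legal framework.
    print(classify_pool(False, False, True).value)
```

Under these assumptions, a set of unlinked repositories that share a common access policy is classified as fully distributed rather than as a noncommons.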

Centralized repositories with curation, analytics, and quality control can enhance the value of the data they contain [e.g., the GenBank repository of DNA and RNA sequence data (8)]. Centralized structures, however, come at a cost and may be impractical in many transborder collaborations because of political, legal, and organizational issues. But the alternative to a fully centralized commons need not be a noncommons. The shortfalls of noncommons models include incompatible data formats, inability to search across data sets, underutilization of data resources, individualized and inefficient access requirements, and difficulties moving data across national boundaries. Distributed commons structures, however, offer a meaningful subset of benefits with lower cost and resource commitments than fully centralized models.

For example, an online portal through which researchers can access multiple independent repositories may feel like a centralized commons to users, but avoids the cost and governance overhead of a centralized repository [e.g., the Global Earth Observation System of Systems (GEOSS)]. Portal-based structures may make it easier for a central administrator to provide users with value-added services and aggregated statistics [e.g., the World Data Center for Microorganisms (9)], and allow users to more easily query, combine, and analyze multiple data sources (7).

Even if resources do not exist to link repositories technically, there are advantages to fostering legal interoperability among distributed repositories (10). To achieve this across jurisdictions, rules for data access and usage must be compatible with each other, must comply with laws and regulations of relevant jurisdictions, and must address rights of ownership and control granted to data generators (11). Legal interoperability can enable researchers to access and use data across multiple repositories without seeking authorization on a case-by-case basis, which increases the likelihood that more data will be put to productive use.

Perhaps the most straightforward path to legal interoperability is simply to contribute data to the public domain and waive all future rights to control it (11). This approach has been advocated by more than 250 organizations that have endorsed the 2010 Panton Principles for open data in science (12). Alternatively, researchers who wish to receive attribution credit for their contributions, but are otherwise willing to relinquish control over them, have released data under standardized Creative Commons licenses that have been widely used for other online content, including open-source software, music, and photographs.

Despite the simplicity and appeal of these approaches, they are not always feasible. Data will often remain subject to legal regulation if, for instance, they explicitly or implicitly reveal personally identifiable information, were obtained from human research subjects, relate to sensitive technologies, or disclose infrastructural details. Wilbanks and others, recognizing these requirements, have called for new models of informed consent and privacy protection to facilitate broad, socially beneficial sharing of at least some categories of such data (13).

Structural models for scientific data pools

Data-sharing options

DESIGN CONSIDERATIONS. If a collaborative research project has sufficient resources to create a centralized data repository with accompanying infrastructure and staffing (potentially millions of dollars up-front and thereafter for fully staffed and curated repositories), important benefits can be achieved. In most cases, however, this level of funding will not be available and a distributed data commons could be a desirable alternative. We found, in our experience with the Belmont Forum, that the project's leadership gave substantial weight to early aspirational statements regarding broad data sharing. Sufficient consideration may not have been given to potentially useful distributed data structures. When, at the conclusion of a lengthy planning stage, it became apparent that a centralized commons was beyond budgetary constraints, the decision was made to settle for no commons at all and rely on lofty but nonspecific data-sharing principles to motivate researchers to share data on their own (14). To help avoid such dilemmas in the future, we offer the following actionable framework for evaluating distributed data commons early in the project-planning phase:

How many data repositories are under consideration? If the number is small, then fully distributed, unlinked repositories (i.e., no commons) may suffice. Researchers may easily access each repository, and the cost of a commons structure can be avoided.

Are there resources to develop a common data portal? As the number of repositories increases, some form of commons structure will likely facilitate data sharing and usage. Although the cost is not trivial, a common portal can enhance the value and usability of the data. If funding for a data portal is not available, planners may wish to consider a fully distributed commons with legal interoperability.

Are data regulated in the relevant jurisdictions? This question is relevant no matter which commons structure is selected. If data are not regulated or subject to human-subject, privacy, health, or similar legal regimes, consider releasing data to the public domain or licensing under a common-use license. If data are regulated in one or more relevant jurisdictions, planners should consider engaging legal experts to develop a common data access and use policy that complies with regulations in each jurisdiction. For example, if data include human genetic information, both genetic nondiscrimination laws and data privacy regulations should be considered. Legal interoperability, and the ability for users to access and use all data on consistent terms via a single authorization, will be achieved only if the most stringent jurisdiction's regulations are observed in each case or are otherwise addressed (13).
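
The three questions above can be sketched as a simple decision procedure. This is one possible reading of the framework under stated assumptions, not a tool used by the Belmont Forum. The names `Plan` and `recommend_plan` are hypothetical, and the cutoff `small_threshold` is an arbitrary placeholder, because the framework does not say how many repositories count as "small".

```python
from dataclasses import dataclass


@dataclass
class Plan:
    structure: str       # one of the four structural models
    legal_approach: str  # how access and use terms would be set


def recommend_plan(num_repositories: int,
                   can_fund_central_repository: bool,
                   can_fund_portal: bool,
                   data_regulated: bool,
                   small_threshold: int = 3) -> Plan:
    """Walk the early-stage planning questions in order.

    small_threshold is a hypothetical cutoff chosen for illustration only.
    """
    if can_fund_central_repository:
        # Sufficient resources for a staffed, curated central repository.
        structure = "fully centralized"
    elif num_repositories <= small_threshold:
        # Few repositories: researchers can reach each one directly.
        structure = "noncommons"
    elif can_fund_portal:
        # Many repositories and funding for a common data portal.
        structure = "intermediate distributed"
    else:
        # No portal funding: rely on legal interoperability alone.
        structure = "fully distributed"

    # The regulation question applies whichever structure is selected.
    if data_regulated:
        legal_approach = ("common access and use policy developed with "
                          "legal experts for each relevant jurisdiction")
    else:
        legal_approach = "public domain release or common-use license"

    return Plan(structure, legal_approach)


if __name__ == "__main__":
    plan = recommend_plan(num_repositories=20,
                          can_fund_central_repository=False,
                          can_fund_portal=False,
                          data_regulated=True)
    print(plan.structure)
    print(plan.legal_approach)
```

In this sketch the structural choice and the legal approach are decided separately, reflecting the point that regulation matters no matter which commons structure is selected.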

Although the Belmont Forum will doubtless produce a wealth of valuable earth science data, initial appreciation of data-sharing options might have facilitated decision-making and planning among its many national participants and might have resulted in a more robust data-sharing structure. Addressing these design choices early—while acknowledging budgetary, legal, and political constraints—can save planning and implementation costs later.

References and Notes

  1. Policy RECommendations for Open Access to Research Data in Europe (RECODE), http://recodeproject.eu.
  2. Acknowledgments: J.H.R. has received support from the National Human Genome Research Institute (award no. P50HG003391). J.L.C. and J.H.R. served as members of the U.S. delegation to the Belmont Forum organized by NSF.