Perspective: Social Science

Computational Social Science


Science  06 Feb 2009:
Vol. 323, Issue 5915, pp. 721-723
DOI: 10.1126/science.1167742

We live life in the network. We check our e-mails regularly, make mobile phone calls from almost any location, swipe transit cards to use public transportation, and make purchases with credit cards. Our movements in public places may be captured by video cameras, and our medical records stored as digital files. We may post blog entries accessible to anyone, or maintain friendships through online social networks. Each of these transactions leaves digital traces that can be compiled into comprehensive pictures of both individual and group behavior, with the potential to transform our understanding of our lives, organizations, and societies.

The capacity to collect and analyze massive amounts of data has transformed such fields as biology and physics. But the emergence of a data-driven “computational social science” has been much slower. Leading journals in economics, sociology, and political science show little evidence of this field. But computational social science is occurring—in Internet companies such as Google and Yahoo, and in government agencies such as the U.S. National Security Agency. Computational social science could become the exclusive domain of private companies and government agencies. Alternatively, there might emerge a privileged set of academic researchers presiding over private data from which they produce papers that cannot be critiqued or replicated. Neither scenario will serve the long-term public interest of accumulating, verifying, and disseminating knowledge.

Data from the blogosphere.

Shown is a link structure within a community of political blogs (from 2004), where red nodes indicate conservative blogs, and blue liberal. Orange links go from liberal to conservative, and purple ones from conservative to liberal. The size of each blog reflects the number of other blogs that link to it. [Reproduced from (8) with permission from the Association for Computing Machinery]

What value might a computational social science—based in an open academic environment—offer society, by enhancing understanding of individuals and collectives? What are the obstacles that prevent the emergence of a computational social science?

To date, research on human interactions has relied mainly on one-time, self-reported data on relationships. New technologies, such as video surveillance (1), e-mail, and “smart” name badges, offer a moment-by-moment picture of interactions over extended periods of time, providing information about both the structure and content of relationships. For example, group interactions could be examined through e-mail data, and questions about the temporal dynamics of human communications could be addressed: Do work groups reach a stasis with little change, or do they dramatically change over time (2)? What interaction patterns predict highly productive groups and individuals? Can the diversity of news and content we receive predict our power or performance (3)? Face-to-face group interactions could be assessed over time with “sociometers.” Such electronic devices could be worn to capture physical proximity, location, movement, and other facets of individual behavior and collective interactions. The data could raise interesting questions about, for example, patterns of proximity and communication within an organization, and flow patterns associated with high individual and group performance (4).

We can also learn what a “macro” social network of society looks like (5), and how it evolves over time. Phone companies have records of call patterns among their customers extending over multiple years, and e-commerce portals such as Google and Yahoo collect instant messaging data on global communication. Do these data paint a comprehensive picture of societal-level communication patterns? In what ways do these interactions affect economic productivity or public health? It is also increasingly easy to track the movements of people (6). Mobile phones allow the large-scale tracing of people's movements and physical proximities over time (7). Such data may provide useful epidemiological insights: How might a pathogen, such as influenza, driven by physical proximity, spread through a population?
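The epidemiological question above can be made concrete with a toy simulation. The following is only a minimal sketch, assuming a standard discrete-time susceptible-infectious-recovered (SIR) process on a fixed proximity network; it is not a method taken from the cited work. The function names, the `contacts` structure, and all parameter values are hypothetical choices for illustration.

```python
import random


def count_states(state):
    """Return (susceptible, infectious, recovered) counts."""
    values = list(state.values())
    return (values.count("S"), values.count("I"), values.count("R"))


def simulate_sir(contacts, first_case, p_transmit, days_infectious, max_days, rng):
    """Toy SIR spread over a proximity network (illustrative assumption).

    contacts: dict mapping each person to the set of people they are
              physically near every day (assumed symmetric and static).
    Returns one (S, I, R) tuple per simulated day, starting with day 0.
    """
    if days_infectious < 1:
        raise ValueError("days_infectious must be at least 1")

    state = {person: "S" for person in contacts}
    state[first_case] = "I"
    days_left = {first_case: days_infectious}
    history = [count_states(state)]

    for _ in range(max_days):
        if not days_left:  # nobody is infectious any more
            break
        # Each infectious person may infect each susceptible neighbor.
        newly_infected = set()
        for person in sorted(days_left):
            for neighbor in sorted(contacts[person]):
                if state[neighbor] == "S" and rng.random() < p_transmit:
                    newly_infected.add(neighbor)
        # Advance infection clocks; people recover when their clock runs out.
        for person in list(days_left):
            days_left[person] -= 1
            if days_left[person] == 0:
                state[person] = "R"
                del days_left[person]
        # People infected today become infectious tomorrow.
        for person in newly_infected:
            state[person] = "I"
            days_left[person] = days_infectious
        history.append(count_states(state))
    return history


# Hypothetical proximity data: who is regularly near whom.
example_contacts = {
    "ana": {"ben", "cho"},
    "ben": {"ana", "cho"},
    "cho": {"ana", "ben", "dev"},
    "dev": {"cho"},
}
for day, counts in enumerate(
    simulate_sir(example_contacts, "ana", 0.3, 2, 30, random.Random(0))
):
    print(day, counts)
```

In practice the static `contacts` dictionary would be replaced by time-stamped proximity records of the kind mobile phones can provide, which is where such data could add information beyond what a fixed network captures.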

The Internet offers an entirely different channel for understanding what people are saying, and how they are connecting (8). Consider, for example, tracing the spread of arguments, rumors, or positions about political and other issues in the blogosphere during this past political season (9), as well as the behavior of individuals “surfing” the Internet (10), where the concerns of an electorate become visible in the searches they conduct. Virtual worlds, which by their nature capture a complete record of individual behavior, offer ample opportunities for research: experimentation that would otherwise be impossible or unacceptable (11). Similarly, social network Web sites offer a unique opportunity to understand the impact of a person's position in the network on everything from their tastes to their moods to their health (12), and natural language processing offers increased capacity to organize and analyze the vast amounts of text from the Internet and other sources (13).

In short, a computational social science is emerging that leverages the capacity to collect and analyze data with unprecedented breadth, depth, and scale. Substantial barriers, however, might limit progress. Existing ways of conceiving human behavior were developed without access to terabytes of data describing minute-by-minute interactions and locations of entire populations of individuals. For example, what does existing sociological network theory, built mostly on a foundation of one-time “snapshot” data, typically with only dozens of people, tell us about massively longitudinal data sets of millions of people, including location, financial transactions, and communications? These vast, emerging data sets on how people interact surely offer qualitatively new perspectives on collective human behavior, but our current paradigms may not be receptive.

There are also enormous institutional obstacles to advancing a computational social science. In terms of approach, the subjects of inquiry in physics and biology present different challenges to observation and intervention. Quarks and cells neither mind when we discover their secrets nor protest if we alter their environments during the discovery process. As for infrastructure, the leap from social science to a computational social science is larger than from biology to a computational biology, largely due to the requirements of distributed monitoring, permission seeking, and encryption. There are fewer resources available in the social sciences, and even the physical (and administrative) distance between social science departments and engineering or computer science departments tends to be greater than for the other sciences.

Perhaps the thorniest challenges exist on the data side, with respect to access and privacy. Many of these data are proprietary (e.g., mobile phone and financial transactional information). The debacle following AOL's public release of “anonymized” search records of many of its customers highlights the potential risk to individuals and corporations in the sharing of personal data by private companies (14). Robust models of collaboration and data sharing between industry and academia are needed to facilitate research, safeguard consumer privacy, and provide liability protection for corporations. More generally, properly managing privacy issues is essential. As the recent U.S. National Research Council report on geographical information system data highlights, it is often possible to pull individual profiles out of even carefully anonymized data (15). Last year, the U.S. National Institutes of Health and the Wellcome Trust abruptly removed a number of genetic databases from online access (16). These databases were seemingly anonymized, simply reporting the aggregate frequency of particular genetic markers. However, research revealed the potential for deanonymization, based on the statistical power of the sheer quantity of data collected from each individual in the database (17).
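One simple way to see why removing names does not guarantee anonymity is to count how many records remain unique on a handful of seemingly harmless attributes. The sketch below is an illustrative assumption, not the statistical approach of the work cited above: it groups records by a chosen set of "quasi-identifiers" and reports the smallest group size (the k in k-anonymity) and the fraction of records that are one of a kind. The function name, field names, and sample records are all hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Summarize how distinguishable records are on the given fields.

    records: list of dicts with direct identifiers already removed.
    quasi_identifiers: field names an outsider might plausibly know.
    Returns the smallest group size and the share of unique records.
    """
    if not records:
        raise ValueError("records must not be empty")
    keys = [tuple(record[field] for field in quasi_identifiers) for record in records]
    group_sizes = Counter(keys)
    unique_records = sum(1 for key in keys if group_sizes[key] == 1)
    return {
        "smallest_group": min(group_sizes.values()),
        "unique_fraction": unique_records / len(keys),
    }


# Hypothetical "anonymized" records: no names, yet some rows stand alone.
sample = [
    {"zip3": "021", "birth_year": 1970, "sex": "F"},
    {"zip3": "021", "birth_year": 1970, "sex": "F"},
    {"zip3": "021", "birth_year": 1984, "sex": "M"},
    {"zip3": "946", "birth_year": 1959, "sex": "F"},
]
print(reidentification_risk(sample, ["zip3", "birth_year", "sex"]))
```

A record that is unique on such fields can be linked to a named individual by anyone holding a second data set with the same attributes, which is the kind of risk a review of "anonymized" data would need to weigh.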

Because a single dramatic incident involving a breach of privacy could produce rules and statutes that stifle the nascent field of computational social science, a self-regulatory regime of procedures, technologies, and rules is needed that reduces this risk but preserves research potential. As a cornerstone of such a self-regulatory regime, U.S. Institutional Review Boards (IRBs) must increase their technical knowledge to understand the potential for intrusion and individual harm, because new possibilities do not fit their current paradigms for harm. Many IRBs would be poorly equipped to evaluate the possibility that complex data could be deanonymized. Further, it may be necessary for IRBs to oversee the creation of a secure, centralized data infrastructure. Existing data sets are currently scattered among many groups, with uneven skills and understanding of data security and widely varying protocols. Researchers themselves must develop technologies that protect privacy while preserving data essential for research. These systems, in turn, may prove useful for industry in managing customer privacy and data security (18).

Finally, the emergence of a computational social science shares with other nascent interdisciplinary fields (e.g., sustainability science) the need to develop a paradigm for training new scholars. Tenure committees and editorial boards need to understand and reward the effort to publish across disciplines. Initially, computational social science needs to be the work of teams of social and computer scientists. In the long run, the question will be whether academia should nurture computational social scientists, or teams of computationally literate social scientists and socially literate computer scientists. The emergence of cognitive science offers a powerful model for the development of a computational social science. Cognitive science has involved fields ranging from neurobiology to philosophy to computer science. It has attracted the investment of substantial resources to create a common field, and has produced enormous progress for the public good in the last generation. We would argue that a computational social science has a similar potential, and is worthy of similar investments.

www.sciencemag.org/cgi/content/full/323/5915/721/DC1

