Predicting human behavior: The next frontiers


Science  03 Feb 2017:
Vol. 355, Issue 6324, pp. 489
DOI: 10.1126/science.aam7032


Machine learning has provided researchers with new tools for understanding human behavior. In this article, we briefly describe some successes in predicting behaviors and outline the challenges for the next few years.

Advances in machine learning are revolutionizing how we understand offline and online human behavior. The ability to classify objects of interest from a training set, whether those objects are terrorists (1), machines that need maintenance (2), or emails containing a malicious link (3), represents the greatest success in the field. Typically, no single machine learning algorithm does everything well. Although accuracy is crucial, the acceptable accuracy varies with the problem being studied, and accuracy is not enough. All too often, researchers explain why their predictions are right but say nothing about why their predictions might be wrong. Knowing both enables decision-makers to make better decisions. Especially in high-risk situations, predictions must have accompanying explanations that provide deeper understanding of the situation being studied. A predictive model must also provide one or more prescriptions for potential future actions. Today’s machine learning methods do not necessarily satisfy these three criteria: accuracy, explanation, and prescription. What constitutes an ideal predictive algorithm depends on the application. Oftentimes, stakeholders (e.g., social media platforms and search engines) will use varying definitions of accuracy that meet their particular needs. Moreover, domain experts may use extensive knowledge of the domain to suggest relevant independent variables to be included in a data set. Often, they will explain predictions using both the technical accuracy measures generated by a predictive model and stories from their discipline that are more understandable to their audiences. All of this suggests that in real-world systems, computer scientists need to team with stakeholders to generate high-impact results.

In our opinion, the next generation of predictive models must deal with four major challenges.

First, the maxim that more data lead to better predictive models is not always true, because noise in the data can overwhelm predictive models. The ability to deal with noisy, incomplete, and inconsistent data will be at the heart of next-generation predictive models. For instance, when identifying “bots” on Twitter (4) that are seeking to sway opinion to be positive about a political candidate, we needed to ignore the huge numbers of bots that were seeking to achieve other ends—such as spreading spam or seeking to influence opinions about other topics or to deceive users into clicking on links that generate revenue for the person who included that link in their tweet. Moreover, data about many Twitter handles are limited and, in some cases, intentionally misleading. Bot developers go to considerable effort to ensure that their bots elude detection.

A second challenge is that of rare-event prediction. For instance, companies monitoring their internal networks to identify users who may steal secrets would collect information about all employee activity on the company network, including employee email, uploads (to websites), downloads onto memory sticks, and much more. Most employees are honest, with only a small fraction engaging in bad behavior. In such cases, the data are called “imbalanced”: machine learning algorithms have difficulty distinguishing these “rare” individuals from innocent users, and predictive models typically perform poorly.
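The effect of imbalance on evaluation can be illustrated with a small sketch. Everything in it is a made-up assumption for illustration: the population of 1000 employees, the 10 insiders, and the two hypothetical predictors are not taken from any real system. It shows only that a predictor that never flags anyone can score high on plain accuracy while catching no insiders, which is why measures such as precision and recall are usually reported alongside accuracy for rare events.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 marks the rare class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def recall(y_true, y_pred):
    """Fraction of true rare cases that were flagged."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(y_true, y_pred):
    """Fraction of flagged cases that were truly rare cases."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical population: 10 insiders (label 1) among 1000 employees.
y_true = [1] * 10 + [0] * 990

# Predictor A (hypothetical): declares every employee honest.
always_honest = [0] * 1000

# Predictor B (hypothetical): flags 8 of the 10 insiders and 40 honest employees.
detector = [1] * 8 + [0] * 2 + [1] * 40 + [0] * 950

for name, y_pred in [("always honest", always_honest), ("detector", detector)]:
    print(
        f"{name}: accuracy={accuracy(y_true, y_pred):.3f} "
        f"recall={recall(y_true, y_pred):.3f} "
        f"precision={precision(y_true, y_pred):.3f}"
    )
```

In this toy setting the trivial predictor has the higher accuracy of the two, even though it is useless for the task, while the detector's low precision shows how many honest employees would be flagged for every true insider.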

The generation and reduction to practice of robust multistage predictive modeling for emergent phenomena is an important third challenge. For instance, social movements have been classified into five stages (5): genesis of the movement, increase in social unrest, enthusiastic mobilization to develop an organization, maintenance of the organization, and termination (when the movement starts to die down). When a protest is in an early stage (for example, when people are expressing grievances on Twitter), some stakeholders would benefit from a prediction of the likelihood of violence occurring in any of the future stages.

A fourth challenge is that human behavior changes dynamically. Adversaries (e.g., malware developers or terrorists) are constantly adapting to their environment. Here, a form of higher-order prediction (prediction about the prediction model) is key. We need to be able to predict when the model will go wrong or when human behavior will change, so that we can develop a new prediction model before too many mistakes are made. For instance, the developers of the OpFake Android malware initially designed it to automatically send text messages from infected phones to premium rate messaging services that would bill the owner of the phone; later, they adapted their system to commit bank fraud as well. The development of predictive models that can identify such behavioral changes as they occur, or even before, is sorely needed.
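One very simple way to notice that a deployed model has started to go wrong is to track its recent error rate against the error rate measured at deployment. The sketch below is a reactive monitor under stated assumptions, not the higher-order prediction called for above: it raises an alarm only after errors have accumulated. The class name `ErrorRateMonitor`, the window size, the tolerance, and the simulated stream of outcomes are all hypothetical choices made for illustration.

```python
from collections import deque


class ErrorRateMonitor:
    """Flags when the recent error rate exceeds a baseline by a tolerance.

    Hypothetical helper for illustration; window and tolerance are arbitrary.
    """

    def __init__(self, baseline_error, window=50, tolerance=0.10):
        self.baseline_error = baseline_error
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # keeps only the last `window` outcomes

    def update(self, was_wrong):
        """Record one prediction outcome; return True if an alarm should be raised."""
        self.recent.append(1 if was_wrong else 0)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough evidence yet
        recent_error = sum(self.recent) / len(self.recent)
        return recent_error > self.baseline_error + self.tolerance


# Simulated outcomes (made up): the model is wrong about 5% of the time for the
# first 100 predictions, then behavior changes and it is wrong 50% of the time.
stream = [i % 20 == 0 for i in range(100)] + [i % 2 == 0 for i in range(100)]

monitor = ErrorRateMonitor(baseline_error=0.05, window=50, tolerance=0.10)
first_alarm = next(
    (i for i, wrong in enumerate(stream) if monitor.update(wrong)), None
)
print("first alarm at prediction index:", first_alarm)
```

A smaller window reacts faster to a change but is more easily triggered by chance runs of errors; a larger window is steadier but lets more mistakes through before the alarm. Anticipating the change before it shows up in the error rate would require additional signals that this sketch does not model.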

The explosion in open-source data and advances in machine learning have revolutionized how we reason about human behavior. Over the next few years, with the emergence of the “Internet of Things,” we can expect a second explosion of diverse, heterogeneous data. We can expect to be beset with problems linked to incomplete, inconsistent, imbalanced, and noisy data. The ability to generate accurate predictions and high-quality analyses that include support for and evidence against predictions, and the ability to provide actionable decisions, will be critical as machine learning systems go viral. A data-driven, multidisciplinary, multistakeholder approach is critical to the success of future predictive modeling.


