NULab core faculty member Nicholas Beauchamp has published new research in the American Journal of Political Science. Beauchamp’s article, “Predicting and Interpolating State-Level Polls Using Twitter Textual Data,” combines 1,200 state-level polls from the 2012 presidential campaign with over 100 million political tweets to show that “when properly modeled, the Twitter-based measures track and to some degree predict opinion polls.” Below is Beauchamp’s description of his research and its significance:

In these polarized times, when presidential elections are regularly close and the stakes are high, many of us are interested in tracking polling data from as many swing states as possible, in as close to real time as possible. But even at the height of a campaign such polling is expensive and somewhat sporadic, and for those interested in political data from non-swing states, virtually nonexistent. More generally, researchers, journalists, and citizens are often interested in many forms of public opinion—about economics, social movements, or other public institutions—which are polled only sporadically at best.


Figure 1: Twitter-estimated election results for training states (triangles) and test states (circles).

In many ways social media would seem an ideal solution to this problem: hundreds of millions of users generate terabytes of data every day from all corners of the globe on any conceivable issue. The problem, however, is that no matter how numerous they are, these people are almost never properly representative of an entire nation, state, or community. Merely counting tweets or measuring sentiment will be quite misleading if those you are counting are unrepresentative of the population you are interested in.

Instead of attempting to unskew or reweight these data, the approach here is to use what few polls we do have to discover the subset of words (and hashtags) that correlates with properly representative measures of public opinion. During the 2012 election, sporadic polls from swing states were collected, and hundreds of millions of politics-related tweets were located down to the state level. Machine learning methods were then used to automatically find subsets of words within those tweets that correlated with variation in the polls over time and across states. Once trained, the model was tested, showing that the Twitter text could not only estimate poll differences in unpolled states and on unpolled days, but also anticipated the daily fluctuations of the polls better than a careful extrapolation using polling data alone.
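The core idea—learning which word frequencies track polled opinion on polled state-days, then extrapolating to unpolled ones—can be sketched in a few lines. This is a minimal illustration on synthetic data using plain ridge regression; the paper's actual pipeline, vocabulary, and model choices differ, and every name and number here is an assumption for demonstration only.

```python
import numpy as np

# Illustrative sketch only: synthetic data, not the paper's actual method.
rng = np.random.default_rng(0)

n_polled, n_unpolled, vocab = 200, 50, 50  # state-day rows x word columns

# Synthetic word counts per state-day, normalized to frequencies.
X = rng.poisson(5.0, size=(n_polled + n_unpolled, vocab)).astype(float)
X /= X.sum(axis=1, keepdims=True)

# Pretend only the first 10 words truly track the poll margin.
true_w = np.zeros(vocab)
true_w[:10] = rng.normal(0.0, 1.0, 10)
y = X @ true_w + rng.normal(0.0, 0.001, n_polled + n_unpolled)

# "Polled" state-days train the model; "unpolled" ones are held out.
X_train, y_train = X[:n_polled], y[:n_polled]
X_test, y_test = X[n_polled:], y[n_polled:]

# Ridge regression in closed form: the penalty keeps the word weights
# stable when the vocabulary is large relative to the number of polls.
lam = 1e-6
w = np.linalg.solve(X_train.T @ X_train + lam * np.eye(vocab),
                    X_train.T @ y_train)

# Correlation between predicted and true margins on unpolled state-days.
pred = X_test @ w
corr = np.corrcoef(pred, y_test)[0, 1]
```

On this toy data the held-out correlation is high because the signal is planted; the substantive point is only the train-on-polled, predict-on-unpolled structure.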

Figure 2: Tracking vote intention in Ohio using Twitter.  Open circles: traditional tracking polls (smoothed); closed circles: Twitter estimates.

This method can give a sense of daily changes in public opinion at the state level, reflecting the instant shifts characteristic of social media rather than relying on slow and expensive surveys, while remaining properly representative and accurate in a way that merely counting hashtags or sentiments is not. While this approach would not dispense with polling, it could be a useful augmentation, using social media to extrapolate to unpolled regions at fine-grained timescales of days or even hours. And as a bonus, the words themselves give some insight into the campaign—not just what people are talking about most, but the subset of talk that most correlates with national opinion variation. In 2012, topics reflective of national opinion included Benghazi and the 47%, but with so many issues swirling around the 2016 election, this approach may let us discern which issues actually reflect national opinion change, and which are just talk.

Read the abstract and access the full article here.