Agile methods for predicting contest outcomes by social media analysis

People think, or seem to assume, that there is some magical machine that spits out accurate predictions of future events from social media data.

There is not, and that’s why each credible analysis takes human time and effort.

But therein also lies the challenge: when fast decisions are needed, time-consuming analyses reduce agility. Real-time events require real-time analysis, whereas data analysis is often a cumbersome and time-consuming effort that includes data collection, cleaning, machine training, etc.

It’s a project for weeks or days, not for hours. All the practical issues of the analysis workflow make it difficult to provide accurate predictions at a fast pace (although there are other challenges as well).

An example is Underhood.co – they predicted that Saara Aalto would win X-Factor UK based on social media sentiment, but ended up being wrong. While there are many potential reasons for this, my conclusion is that their indicators lacked sufficient predictive power. They were too reliant on aggregates (in this case country-level data), and had a problematic approach to begin with – as with any prediction, the odds change on the go as new information becomes available, so you should never predict the winner weeks ahead.

Of course, theirs was just a publicity stunt where they hoped being right would prove the value of their service. Another example is the US election, where prediction markets were completely wrong about the outcome. That was, according to my theory, because of the wrong predictors – polls ask what your preference is or what you would do, whereas social media engagement shows what people actually do (on social media), and as such is closer to real behavior, hence a better predictor.

Even though I think human analysts will still be needed in the near future, we need more solutions for quick collection and analysis of social media data, especially ones that combine human and machine work in the best possible way.

Some of these approaches can be based on automation, but others can be methodological, such as quick definition of relevant social media outlets for sampling.

Here are some ideas I have been thinking of:

I. Data collection

Quick definition of choice space (e.g., candidates in a political election, X-Factor contestants)
Identification of related social media outlets (i.e., communities, topic hashtags)
Collecting the sample (API, scraping, or crowdsourced copy-paste)

Each part is case-dependent and idiosyncratic – for whatever event, and I’m thinking of competitions here, you have to do this work from scratch. Ultimately, you cannot get the whole Internet as your data, but you want the sample to be as representative as possible. For example, it was obvious that Twitter users showed much more negative sentiment towards Trump than Facebook users, and on both platforms there were supporter groups and topic concentrations that should first be identified before any data collection.

Then, the actual data collection is tricky. People again seem to assume all data is easily accessible. It’s not – even where a platform offers an API (Twitter, Facebook, YouTube, and Reddit all do), access is typically rate-limited or restricted in scope, so it may not deliver the comments you need quickly enough. This means the comments that you use for predicting the outcome (by analyzing each candidate’s relative share of the total as well as the strength of the sentiment beyond pos/neg) often need to be fetched either by web scraping or by manually copying them to a spreadsheet.

Due to the large volumes of data, crowdsourcing could be useful, e.g., setting up a Google Sheet where crowdworkers each paste the text material in a clean format. The raw text content (e.g., tweets, Facebook comments, Reddit comments) is put in a separate sheet for each candidate.
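The bucketing described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the pasted material arrives as (candidate, platform, text) rows; the function names and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

def bucket_comments(rows):
    """Group raw comment rows into per-candidate, per-platform lists."""
    buckets = defaultdict(lambda: defaultdict(list))
    for candidate, platform, text in rows:
        cleaned = " ".join(text.split())  # collapse stray whitespace from copy-paste
        if cleaned:  # skip rows that are empty after cleaning
            buckets[candidate][platform].append(cleaned)
    return buckets

def comment_share(buckets):
    """Return each candidate's share of all collected comments."""
    counts = {
        candidate: sum(len(texts) for texts in platforms.values())
        for candidate, platforms in buckets.items()
    }
    total = sum(counts.values())
    return {c: (n / total if total else 0.0) for c, n in counts.items()}

# Hypothetical sample rows, as crowdworkers might paste them.
rows = [
    ("Candidate A", "twitter", "  Voting for A   tonight! "),
    ("Candidate A", "reddit", "A was great"),
    ("Candidate A", "facebook", "A deserves to win"),
    ("Candidate B", "twitter", "B all the way"),
    ("Candidate B", "twitter", "   "),
]
buckets = bucket_comments(rows)
shares = comment_share(buckets)
```

Keeping the platform as a second key makes it possible to break the data down by source later without collecting it again.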

II. Data analysis

Cluster visualization (defining clusters, visualizing their respective sizes (plot # of voters), breakdown by source platform and potential other factors)
Manual training (classifying the sentiment, or “likelihood to vote”)
Machine classification (calculating the number of likely voters)

In every statistical analysis, the starting point should be visualizing the data. This shows an aggregate “helicopter view” of the situation. Such a snapshot is also useful for demonstrating the results to the end user, to let the data speak for itself. Candidates are bubbles in the chart, sized in proportion to their calculated number of likely voters.
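One detail worth getting right in a bubble chart is that the area, not the radius, should scale with the count; otherwise large candidates look disproportionately big. A minimal sketch, assuming the likely-voter counts are already available; the function name, the `max_radius` parameter, and the sample counts are hypothetical.

```python
import math

def bubble_radii(likely_voters, max_radius=50.0):
    """Scale bubble radii so that bubble AREA is proportional to the voter count."""
    largest = max(likely_voters.values(), default=0)
    if largest <= 0:
        return {name: 0.0 for name in likely_voters}
    return {
        name: max_radius * math.sqrt(count / largest)
        for name, count in likely_voters.items()
    }

# Hypothetical counts: A has four times the likely voters of B,
# so A's bubble gets twice the radius (four times the area).
radii = bubble_radii({"Candidate A": 400, "Candidate B": 100})
```

The resulting radii can be passed to whatever plotting tool is used for the dashboard.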

The data could be broken down according to source platforms, or other factors, by using the candidate as a point of gravity for the cluster.

Likelihood to vote could be classified on a scale rather than as a binary. That is, instead of saying “sentiment is positive: YES/NO”, we could ask “How likely is the person to vote?”, which is the same as asking how enthusiastic or engaged he or she is. Therefore, a scale is better, e.g. ranging from -5 (definitely not voting for this candidate) to +5 (definitely voting for this candidate). The manual training, which could also be done with the help of the crowd, helps the machine classifier improve its accuracy on the go.
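To make the idea concrete, here is a deliberately naive sketch of how manually scored comments could feed a machine scorer. It is a stand-in for a real classifier, not a recommendation: each word gets the average score of the training comments it appears in, and a new comment is scored as the average of its known words. All names and training examples are hypothetical.

```python
from collections import defaultdict

PUNCTUATION = ".,!?\"'"

def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def train_word_scores(labeled):
    """Average the manual scores (-5..+5) of the comments each word appears in."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for text, score in labeled:
        if not -5 <= score <= 5:
            raise ValueError("score must be between -5 and +5")
        for word in set(tokenize(text)):
            sums[word] += score
            counts[word] += 1
    return {word: sums[word] / counts[word] for word in sums}

def predict_score(text, word_scores):
    """Score an unseen comment as the mean score of its known words (0.0 if none)."""
    known = [word_scores[w] for w in tokenize(text) if w in word_scores]
    return sum(known) / len(known) if known else 0.0

# Hypothetical manually scored comments.
labeled = [
    ("Voting for her tonight!", 5),
    ("Never voting for her", -5),
    ("She was great tonight", 3),
]
word_scores = train_word_scores(labeled)
```

As more comments are scored by hand, calling `train_word_scores` again on the larger set updates the scorer, which is the "on the go" improvement in its simplest form.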

Based on the training data, the classifier would generalize the classification to all material. The material is bucketed so that each candidate is evaluated separately and the number of likely voters can be calculated. It is possible that the machine classifier could benefit from training input from both candidates, inasmuch as the language showing positive and negative engagement is not significantly different.

It is important to note that negative sentiment does not really matter. What we are interested in is the number of likely voters. This is because of the election dynamics – it does not matter how poor a candidate’s aggregate sentiment is, i.e. the ratio between haters and sympathizers, as long as his or her number of likely voters is higher than that of the competition. This effect was evident in the recent US presidential election.
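A small worked example shows how the two measures can point in opposite directions. It is a sketch under stated assumptions: the scores are invented, and the cutoff of +3 for counting someone as a "likely voter" is an arbitrary choice for illustration.

```python
def likely_voters(scores, threshold=3):
    """Count comments whose score is at or above the (assumed) likely-voter cutoff."""
    return sum(1 for s in scores if s >= threshold)

def sentiment_ratio(scores):
    """Ratio of positive to negative comments (sympathizers per hater)."""
    positive = sum(1 for s in scores if s > 0)
    negative = sum(1 for s in scores if s < 0)
    return positive / negative if negative else float("inf")

# Invented scores on the -5..+5 scale.
# Candidate A is polarizing: many haters, but also many committed supporters.
scores_a = [5, 4, 4, 3, -5, -5, -4, -3, -2, -1]
# Candidate B is mildly liked: few haters, but few committed supporters.
scores_b = [4, 3, 2, 1, 1, -1]

voters_a, voters_b = likely_voters(scores_a), likely_voters(scores_b)
ratio_a, ratio_b = sentiment_ratio(scores_a), sentiment_ratio(scores_b)
```

Here candidate A has the worse sentiment ratio (4 positive to 6 negative, against 5 to 1 for B) but twice as many likely voters (4 against 2), which is the situation described above.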

The crucial thing is to keep the process alive during the whole election/competition period. There is no point before the end at which it becomes certain that one loses and the other remains, although the gap can become substantial and thereby increase the accuracy of the prediction.

III. Presentation of results

constantly updating feed (à la Facebook video stream)
cluster visualization
search trend widget (source: Google Trends)
live updating predictions (manual training –> machine model)

The results could be shown to the end user in the form of a dashboard. A search trend graph and the above-mentioned cluster visualization could be viable options. In addition, it would be interesting to see the count of voters evolving over time, in such a way that it, along with the visualization, could be “played back” to examine the development over time.

In other words, an interactive visualization. As noted, the prediction, or the count of likely voters, should update in real time as a result of the combined human-machine work.
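The “playback” idea boils down to keeping a running total per candidate for each time step. A minimal sketch, assuming each likely voter is recorded as a (day, candidate) pair; the function name and the sample observations are hypothetical.

```python
from collections import defaultdict

def playback_frames(observations):
    """Turn (day, candidate) observations into cumulative counts per day."""
    days = sorted({day for day, _ in observations})
    candidates = sorted({candidate for _, candidate in observations})
    per_day = defaultdict(int)
    for day, candidate in observations:
        per_day[(day, candidate)] += 1
    running = {candidate: 0 for candidate in candidates}
    frames = []
    for day in days:
        for candidate in candidates:
            running[candidate] += per_day[(day, candidate)]
        frames.append((day, dict(running)))  # copy, so earlier frames stay fixed
    return frames

# Hypothetical observations: one entry per newly identified likely voter.
observations = [(1, "A"), (1, "B"), (2, "A"), (2, "A"), (3, "B")]
frames = playback_frames(observations)
```

Each frame is one step of the playback, so a dashboard could redraw the cluster visualization once per frame.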

Conclusion and discussion

The idea behind developing more agile methods for using social media data to predict contest outcomes is that the accuracy of a prediction depends on the choice of indicators rather than the finesse of the method. For example, complex Bayesian models wrongly predicted that Hillary Clinton would win the election. It’s not that the models were poorly built; they just used the wrong indicators, namely polling data.

This is the usual case of ‘garbage in, garbage out’, and it shows that the choice of indicators is more important than technical features of the predictive model.

The choice of indicators should be based on their predictive power. Although I don’t have strict evidence for it, it intuitively makes sense that social media engagement is a stronger indicator in many instances than survey data, because it’s based on actual preferences instead of stated preferences.

Social scientists know from a long tradition of survey research that there are a myriad of social effects reducing the reliability of data (e.g., social desirability bias). Those, I would argue, are a much smaller issue in social media engagement data.

However, to be fair, there can be issues of bias in the social media engagement data. The major concern is low participation rate: a common heuristic is that 1/10 of participants actually contribute in writing, while the other 9/10 are readers whose real thoughts remain unknown.

It’s then a question of how well the vocal minority reflects the opinion of the silent majority. In some cases, though, this is irrelevant for competitions if the overall voting share remains low. For example, if turnout is 60%, it is relatively much more important to mobilize the active base than if turnout were close to 100%, where one would need universal acceptance.

Another issue is non-representative sampling. This is a concern when the voting takes place offline, and online data does not accurately reflect the voting of those who do not express themselves online. However, as social media participation is constantly increasing, this becomes less of a problem. In addition, compared to other methods of data collection – apart from stratified polling, perhaps – social media is likely to give good results for predicting competitions because of the political nature of such contests.

People who strongly support a candidate are more likely to be vocal about it, and the channel for voicing their opinion is social media.

The value of social media engagement as a predictor appears to be underestimated at present, as suggested by the heavy emphasis on political polls and the near absence of discussion of social media data. As a direct consequence, those who are able to leverage social media data in the proper way will gain a competitive advantage, be it in betting markets or for any other purpose where prediction accuracy plays a key role.

The prediction work will remain a hybrid effort by man and machine.