March 30, 2017
The 45th President of the USA
The problem of predicting election outcomes with social media is that the data, such as likes, are aggregate, whereas the election system is not — apart from simple majority voting, in which you only have the classic representativeness problem that Gallup solved in 1936. To solve the aggregation problem, one needs to segment the polling data so that it 1) corresponds to the prevailing election system and 2) accurately reflects the voters according to that system. For example, in the US presidential election each state has a certain number of electoral votes. To win, a candidate needs to reach 270 electoral votes.
One obvious solution would be to track likes back to profiles and determine the state from information the user has made public. This way we could filter out foreign likers as well. However, there are some issues with using likes as indicators of votes. Most importantly, “liking” something on social media does not in itself predict an individual’s future behavior to a sufficient degree.
Therefore, I suggest here a simple polling method via social media advertising (Facebook Ads) and online surveys (SurveyMonkey). Polling partly faces the same problem of predicting future behavior as using likes as the overarching indicator, which is why in the latter part of this article I discuss how these approaches could be combined.
At this point, it is important to acknowledge that online polling does have significant advantages relating to 1) anonymity, 2) cost, and 3) speed. That is, people may feel more at ease expressing their true sentiment to a machine than to another human being. Second, the method has the potential to collect a sizeable sample more cost-effectively than calling. Finally, a major advantage is that due to the scalable nature of online data collection, the predictions can be updated faster than with call-based polling. This is particularly important because election cycles can involve quick and hectic turns. If the polling delay stretches from a few days to a week, it is too late to react to final-week campaign events, which may still carry great weight in the outcome. In other words: the fresher the data, the better. (An added bonus is that by taking several samples, we could factor momentum, i.e. the growth speed of a candidate’s popularity, into our model – albeit this can be achieved with traditional polling as well.)
The method, social media polling or SMP, is described in the following picture.
Figure 1 Social media polling
1. Define segmentation criteria
First, we understand the election system. For example, in the US system every state has a certain weight expressed by its share of the total electoral votes. There are 50 states, so these become our segmentation criteria. If we deem further segmentation appropriate (e.g., gender, age), we can create additional segments, which are reflected in the target groups and surveys. (These sub-segments can also be analyzed in the actual data later on.)
2. Create unique surveys
Then, we create a unique survey for each segment so that the answers are bucketed. The questions of each survey are identical – they are just behind different links to enable easy segmentation. We create a survey rather than use a visible poll (app) or a picture-type poll (“like if you vote Trump, heart if you vote Hillary”) because we want to avoid social desirability bias. A click on Facebook leads the user to the unique survey of their segment, and their answers are not visible to the public.
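As a rough sketch of this step, the segment-to-survey mapping is just a lookup from segment name to a unique link. The URL pattern below is a made-up placeholder, not a real SurveyMonkey collector – in practice each link would point to a separate collector with identical questions:

```python
# A minimal sketch: one unique survey link per segment (here: US states).
# The URL pattern is hypothetical, for illustration only.
STATES = ["Alabama", "Alaska", "Arizona"]  # ...all 50 in the real setup

def build_segment_links(states):
    """Map each segment to its own survey URL so answers get bucketed."""
    return {
        state: "https://www.surveymonkey.com/r/election-poll-"
               + state.lower().replace(" ", "-")
        for state in states
    }

links = build_segment_links(STATES)
```

The same dictionary then drives ad creation: each state’s Facebook target group gets its own link as the ad destination.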
3. Determine sample size
Calculating sample size is one of those things that will make your head spin, because there is no easy answer as to what a good sample size is. Instead, “it depends.” However, we can use some heuristic rules to come up with decent alternatives in the context of elections. Consider two potential sample sizes.
These are seen as decent options among election pollsters. However, the margin of error is still quite sizeable in both. For example, if there are two candidates whose “true” support values are A = 49% and B = 51%, the large margin of error can easily lead us astray. We could solve this by increasing the sample size, but if we want to reduce the margin of error from +/- 3% to, say, 1%, the required sample size grows dramatically (more precisely, with 95% confidence and a population size of 1M, it is 9,512 – impractically high for a 50-state model). In other words, we have to accept the risk of wrong predictions in this type of situation.
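These numbers can be checked with the standard formulas. The sketch below uses Cochran’s sample-size formula with a finite-population correction and the worst-case p = 0.5; the candidate sample sizes of 500 and 1,000 in the comments are assumptions for illustration:

```python
import math

def margin_of_error(n, confidence_z=1.96, p=0.5):
    """Worst-case margin of error for a simple random sample of size n."""
    return confidence_z * math.sqrt(p * (1 - p) / n)

def required_sample_size(margin, population, confidence_z=1.96, p=0.5):
    """Cochran's formula with finite-population correction.

    margin     -- desired margin of error (e.g. 0.03 for +/-3%)
    population -- size of the population being sampled
    p = 0.5 maximizes variance, giving the most conservative estimate.
    """
    n0 = (confidence_z ** 2) * p * (1 - p) / margin ** 2
    return n0 / (1 + (n0 - 1) / population)

# n=500 gives roughly +/-4.4% and n=1000 roughly +/-3.1% at 95% confidence;
# pushing the margin down to +/-1% for a 1M population requires far more:
print(int(required_sample_size(0.01, 1_000_000)))
```

With these inputs, the +/-1% case works out to the ~9,512 respondents mentioned above.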
All states have over 1,000,000 people, so each of them is considered a “large” population (this is a mathematical property – the required sample size stabilizes after the population reaches a certain size). Although the US is characterized as one population, in the context of election prediction it is actually several different populations (because we have independent states that vote). The procedure we apply is stratified random sampling, in which the large general population is split into sub-groups. In practice, each sub-group requires its own sample, and therefore our approach requires a considerably larger total sample than a prediction that only considered the whole population of the country. But exactly because of this, it should be more accurate.
So, with this lengthy explanation, let us say we satisfice with a sample size of 500 per state. That would be 500×50 = 25,000 respondents. If it cost $0.60 to get a respondent via Facebook ads, data collection would cost $15,000. For repeated polling, there are a few strategies. First, the sample size can be reduced for states that show a large difference between the candidates; in other words, we don’t need a large number of respondents if we “know” the popularity gap between the candidates is wide. The important thing is that the situation is measured periodically and sample sizes are flexibly adjusted according to known results. In a similar vein, we can increase the sample size in states where the competition is tight, to reduce the margin of error and thus increase the accuracy of our prediction. To my understanding, not all pollsters use this opportunity for flexible sampling efficiently.
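The flexible-sampling idea can be sketched as a simple allocation rule: the next wave’s per-state sample depends on the previously observed gap between the candidates. The thresholds and sizes below are illustrative placeholders, not calibrated values:

```python
def adaptive_sample_size(last_gap, base=500, floor=200, ceiling=1000):
    """Allocate the next polling wave's sample for a state from the
    previous wave's observed candidate gap (in percentage points).

    Tight races get more respondents to shrink the margin of error;
    "safe" states get fewer. Thresholds here are made up for illustration.
    """
    if last_gap >= 15:    # safe state: a small sample suffices
        return floor
    if last_gap <= 3:     # toss-up: buy extra precision
        return ceiling
    return base

# Previous-wave gaps (invented numbers) -> next-wave sample plan.
plan = {state: adaptive_sample_size(gap)
        for state, gap in {"Michigan": 0.5,
                           "California": 25.0,
                           "Ohio": 6.0}.items()}
```

The total budget then shifts automatically from safe states towards battlegrounds between waves.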
4. Create Facebook campaigns
For each segment, a target group is created in Facebook Ads. The target group is used to advertise to that particular group; for example, the Michigan survey link is only shown to people from Michigan. That way, we minimize the risk of people outside the segment responding (however, they can be excluded later by IP). At this stage, creating attractive ads helps keep the cost per response low.
5. Run until sample size is reached
The administrator observes the results and stops the data collection once a segment has reached the desired sample size. When all segments are ready, the data collection is stopped.
6. Verify data
Based on IP, we can filter out respondents who do not belong to our particular geographical segment (= state).
Ad clicks can be used to assess sample representativeness by other factors – in other words, we can use Facebook’s campaign reports to segment by age and gender. If a particular group is under-represented, we can correct for this by shifting the targeting towards them and resuming data collection. However, we can also accept the under-representation if we have no valid reference model for the voting behavior of the sub-segments. For example, millennials might be under-represented in our data, but this might correspond with their general voting behavior as well – if we assume the survey response rate corresponds with the voting rate of each segment, then there is no problem.
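If we do want to correct a known skew analytically rather than by re-targeting ads, a standard option is post-stratification: weight each demographic group’s observed support by that group’s assumed share of the electorate instead of its share of respondents. All numbers below are invented for illustration:

```python
def poststratify(candidate_share_by_group, population_share_by_group):
    """Weight each group's observed candidate support by the group's
    known (or assumed) share of the electorate.

    candidate_share_by_group  -- {group: observed support for candidate A}
    population_share_by_group -- {group: share of voters, sums to 1.0}
    """
    return sum(candidate_share_by_group[g] * w
               for g, w in population_share_by_group.items())

# Millennials under-represented among respondents? The weight below
# reflects their assumed share of actual voters, not of the sample.
support = poststratify(
    {"18-34": 0.60, "35-54": 0.48, "55+": 0.41},
    {"18-34": 0.25, "35-54": 0.40, "55+": 0.35},
)
```

The quality of the correction obviously stands or falls with the turnout assumptions fed into the weights.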
7. Analyze results
The analysis process is straightforward:
segment-level results x weights = prediction outcome
For example, in the US presidential election, the segment-level results would be per state (whoever polls highest in a state is its predicted winner), multiplied by each state’s share of electoral votes. The candidate who reaches at least 270 electoral votes is the predicted winner.
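This aggregation step can be sketched as winner-take-all per state followed by a majority check. The `majority` threshold is parameterized so the toy data below (invented states and numbers) can exercise it:

```python
def predict_winner(state_polls, electoral_votes, majority=270):
    """Winner-take-all aggregation: the candidate leading a state's poll
    takes all of that state's electoral votes.

    state_polls     -- {state: {candidate: polled support}}
    electoral_votes -- {state: number of electoral votes}
    Returns the predicted winner, or None if nobody reaches the majority.
    """
    totals = {}
    for state, poll in state_polls.items():
        leader = max(poll, key=poll.get)
        totals[leader] = totals.get(leader, 0) + electoral_votes[state]
    winner = max(totals, key=totals.get)
    return winner if totals[winner] >= majority else None

# Toy example with made-up states and one electoral vote each.
toy_polls = {"A-state": {"X": 0.51, "Y": 0.49},
             "B-state": {"X": 0.40, "Y": 0.60},
             "C-state": {"X": 0.55, "Y": 0.45}}
toy_ev = {"A-state": 1, "B-state": 1, "C-state": 1}
```

With real data, `electoral_votes` would hold the actual per-state counts and `majority` would stay at 270.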
Now, as for other methods, we can use behavioral data. I have previously argued that behavioral data is a stronger indicator of future actions since it is free from reporting bias. In other words, people say they will do something but don’t end up doing it. This is a very common problem, both in research and in daily life.
To correct for that, we consider two approaches here:
1) The volume of likes method, which parallels a like to a vote (the more likes a candidate has in relation to another candidate, the more likely they are to win)
For this method to work, the “intensity of a like”, i.e. its correlation with behavior, should be determined, as not all likes are indicators of voting behavior. Likes don’t readily translate into votes, and there does not appear to be other information in the like itself that we can use to examine this correlation further (a like is a like). We could, however, add contextual information about the person, or use rules such as “the more likes a person gives, the more likely (s)he is to vote for a candidate.”
Or, we could use another solution which I think is better:
2) Text analysis/mining
By analyzing the social media comments of a person, we can better infer the intensity of their attitude towards a given topic (in this case, a candidate). If a person uses strongly positive vocabulary when referring to a candidate, (s)he is more likely to vote for him/her than if the comments are negative or neutral. Notice that a mere positive–negative range is not enough, because positivity has degrees of intensity we have to consider. Saying “he is okay” is different from saying “omg he is god emperor”. The more excitement and associated feelings a person exhibits – which need to be carefully mapped and defined in the lexicon – the more likely (s)he is to vote.
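A toy sketch of such intensity scoring: a hand-built lexicon mapping phrases to intensity weights, summed per comment. The lexicon and weights below are invented for illustration; a real one would need to be carefully built and validated case by case, with proper tokenization instead of naive substring matching:

```python
# Invented intensity lexicon: phrase -> weight on an enthusiasm scale.
INTENSITY = {
    "okay": 1, "fine": 1, "great": 3,
    "love": 4, "amazing": 4, "god emperor": 5,
    "bad": -2, "terrible": -4, "hate": -4,
}

def intensity_score(comment):
    """Sum the intensity weights of lexicon phrases found in the comment,
    so 'he is okay' scores far below 'omg he is god emperor'.
    Naive substring matching -- illustration only."""
    text = comment.lower()
    return sum(w for phrase, w in INTENSITY.items() if phrase in text)

intensity_score("he is okay")             # 1
intensity_score("omg he is god emperor")  # 5
```

The scores can then feed the likely-voter threshold discussed later, rather than a flat positive/negative label.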
As I mentioned, even this approach risks shortcomings in representativeness. First, the population on Facebook may not correspond with the population at large. It may be that the user base is skewed by age or some other factor. The choice of platform greatly influences the sample; for example, Snapchat users are on average younger than Facebook users, whereas Twitter users are more liberal. It is not clear whether Facebook’s user base is a skewed sample or not. Second, the people voicing their opinions may be part of a “vocal minority” as opposed to a “silent majority”. In that case, we can apply the logic of the Gaussian distribution and assume that the general population leans more towards the middle ground than the extremes – if, in addition, we assume this central tendency to be symmetrical (meaning people in the middle are equally likely to tip towards either candidate in a two-way race), the analysis of the extremes can still yield a valid prediction.
Another limitation may be that advertising targeting is not equivalent to random sampling, but carries some kind of bias. That bias could emerge, e.g., from 1) the ad algorithm favoring a particular sub-set of the target group, i.e. showing more ads to them, whereas we would like to get all types of respondents; or 2) self-selection, in which the respondents are of a similar kind and again not representative of the population. Off the top of my head, I’d say number two is less of a problem, because the people who show enough interest are also the ones who vote – remember, we essentially don’t need to care about the opinions of people who don’t vote (that’s how elections work!). But number one could be a serious issue, because the ad algorithm directs impressions based on responses and might pick up some hidden pattern we have no control over. Basically, the only thing we can do is examine the superficial segment information in the ad reports and evaluate whether the ad rotation was sufficient.
As both approaches – traditional polling and social media analysis – have their shortcomings and advantages, it might be feasible to combine the data in a mixed model which would factor in 1) the count of likes, 2) the count of comments with high affinity (= positive sentiment), and 3) polled preference data. A deduplication process would be needed to avoid double-counting those who both liked and commented – this requires associating likes and comments with individuals. Note that the hybrid approach requires geographic information as well, because otherwise the segmentation is diluted. Anyhow, taking the user as the central entity could be a step towards determining voting propensity:
user (location, count of likes, count of comments, comment sentiment) –> voting propensity
Another way to see this is that enriching likes with relevant information (with regard to the election system) can help model social media data in a more granular and meaningful way.
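That user-centric mapping might be sketched as a scoring function like the one below. The weights and caps are arbitrary placeholders, not fitted values – in a real system they would be estimated against known outcomes:

```python
def voting_propensity(likes, comments, mean_comment_sentiment):
    """Combine per-user engagement signals into one propensity score
    in [0, 1]. All weights here are illustrative assumptions.

    likes, comments        -- engagement counts for one candidate
    mean_comment_sentiment -- average sentiment of the user's comments,
                              on a -1..+1 scale
    """
    engagement = min(1.0, 0.1 * likes + 0.2 * comments)  # cap raw volume
    sentiment = max(0.0, mean_comment_sentiment)         # ignore negatives
    return round(0.5 * engagement + 0.5 * sentiment, 3)
```

Location (the missing element of the tuple above) would then route each scored user into the correct state segment.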
March 30, 2017
Had an interesting chat with Sami Kuusela from Underhood.co. Based on that, I got some inspiration for an analysis framework which I’ll briefly describe here.
Figure 1 Identifying and analyzing topical text material
I’m not sure whether steps 4 and 5 would improve the system’s ability to identify topics. It might be that a more general model is not required because the system can already detect the key themes. It would be interesting to test this with a developer.
The whole point is to acknowledge that each large topic naturally divides into smaller sub-topics, which are dimensions that people perceive as relevant for that particular topic. For example, in politics these could be things like “economy”, “domestic policy”, “immigration”, “foreign policy”, etc. While the dimensions can have some consistency within a field – e.g., all political candidates share some dimensions – the exact mix is likely to be unique; e.g., the dimensions of social media texts relating to Trump are likely to be considerably different from those relating to Clinton. That’s why the analysis ultimately needs to be done case by case.
In any case, it is important to note that instead of giving a general sentiment or engagement score for, say, a political candidate, we can use an approach like this to give a more in-depth, segmented view of them. This leads to a better understanding of “what works and what doesn’t”, which is information that can be used in strategic decision-making. In addition, the topic-segmented sentiment data could be associated with predictors in a predictive model, e.g. by multiplying each topic’s sentiment with the weight of the respective topic (assuming the topic corresponds with the predictor).
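The multiplication described above is just a weighted sum over topics; a sketch with invented sentiment scores and topic weights:

```python
def weighted_topic_score(topic_sentiment, topic_weights):
    """Aggregate per-topic sentiment into one score by multiplying each
    topic's sentiment with its assumed weight as a predictor."""
    return sum(topic_sentiment[t] * w for t, w in topic_weights.items())

# All numbers invented for illustration: sentiment on -1..+1, weights sum to 1.
score = weighted_topic_score(
    {"economy": 0.4, "immigration": -0.2, "foreign policy": 0.1},
    {"economy": 0.5, "immigration": 0.3, "foreign policy": 0.2},
)
```

The interesting part, of course, is where the weights come from – ideally from how strongly each topic historically predicts the outcome.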
This is just a conceptual model. As said, it would be interesting to test it. There are many potential issues, such as handling cluster overlap (some text pieces naturally belong to several clusters, which can cause classification problems) and hierarchy (e.g., “employment” sits under “economy” and should hence influence the latter’s sentiment score).
March 30, 2017
People think, or seem to assume, that there is some magical machine that spits out accurate predictions of future events from social media data. There is not, and that’s why every credible analysis takes human time and effort. But therein also lies the challenge: when fast decisions are needed, time-consuming analyses reduce agility. Real-time events would require real-time analysis, whereas data analysis is often a cumbersome and time-consuming effort, including data collection, cleaning, machine training, etc.
It’s a project of days or weeks, not hours. All the practical issues of the analysis workflow make it difficult to provide accurate predictions at a fast pace (although there are other challenges as well).
An example is Underhood.co – they predicted Saara Aalto would win X-Factor UK based on social media sentiment, but ended up being wrong. While there are many potential reasons for this, my conclusion is that their indicators lacked sufficient predictive power. They were too reliant on aggregates (in this case country-level data) and had a problematic approach to begin with – just as with any prediction, the odds change on the go as new information becomes available, so you should never predict the winner weeks ahead. Of course, theirs was just a publicity stunt where they hoped being right would prove the value of their service. Another example, of course, is the US election, where prediction markets were completely wrong about the outcome. That was, according to my theory, because of wrong predictors – polls ask what your preference is or what you would do, whereas social media engagement shows what people actually do (in social media), and as such is closer to real behavior, hence a better predictor.
Even if I do think human analysts are still needed in the near future, more solutions for quick collection and analysis of social media data are needed, especially to combine the human and machine work in the best possible way. Some of these approaches can be based on automation, but others can be methodological, such as quick definition of relevant social media outlets for sampling.
Here are some ideas I have been thinking of:
I. Data collection
Each part is case-dependent and idiosyncratic – for whatever event (I’m thinking of competitions here), you have to do this work from scratch. Ultimately, you cannot get the whole Internet as your data, but you want the sample to be as representative as possible. For example, it was obvious that Twitter users showed much more negative sentiment towards Trump than Facebook users, and on both platforms there were supporter groups/topic concentrations that should be identified before any data collection. Then, the actual data collection is tricky. People again seem to assume all data is easily accessible. It’s not – API access, rate limits, and retrievable fields vary by platform, and the comment data you need is not always available in bulk. This means the comments that you use for predicting the outcome (by analyzing their relative share of the total as well as the strength of the sentiment beyond pos/neg) may need to be fetched by web scraping or by manually copying them to a spreadsheet. Due to the large volumes of data, crowdsourcing could be useful – e.g., setting up a Google Sheet into which crowdworkers paste the text material in a clean format. The raw text content, e.g. tweets, Facebook comments, Reddit comments, is put in a separate sheet for each candidate.
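Whichever way the text is collected, the crowd-pasted material needs basic cleaning before analysis. A small sketch that normalizes whitespace and drops empty rows and duplicates:

```python
import re

def clean_comments(raw_rows):
    """Normalize crowd-pasted text: collapse internal whitespace, strip
    edges, drop empty rows and case-insensitive duplicates
    (order-preserving)."""
    seen, out = set(), []
    for row in raw_rows:
        text = re.sub(r"\s+", " ", row).strip()
        if text and text.lower() not in seen:
            seen.add(text.lower())
            out.append(text)
    return out

cleaned = clean_comments(["  Great guy!\n", "great guy!", "", "He lies."])
```

For real data, near-duplicate detection (retweets, copy-paste campaigns) would be the next, harder step.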
II. Data analysis
In every statistical analysis, the starting point should be visualizing the data. This gives an aggregate “helicopter view” of the situation. Such a snapshot is also useful for demonstrating the results to the end user, letting the data speak for itself. Candidates are bubbles in the chart, with their sizes proportional to the number of calculated likely voters. The data could be broken down by source platform, or other factors, using the candidate as the point of gravity for the cluster.
Likelihood to vote could be classified on a scale, not as a binary. That is, instead of saying “sentiment is positive: YES/NO”, we could ask “How likely is the person to vote?”, which is the same as asking how enthusiastic or engaged he or she is. Therefore, a scale is better, e.g. ranging from -5 (definitely not voting for this candidate) to +5 (definitely voting for this candidate). Manual training, which could also be done with the help of the crowd, helps the machine classifier improve its accuracy on the go. Based on the training data, it would generalize the classification to all material. Now, the material is bucketed so that each candidate is evaluated separately and the number of likely voters can be calculated. It is possible that the machine classifier could benefit from training input from both candidates, inasmuch as the language showing positive and negative engagement is not significantly different.
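Once each unique commenter has a score on the -5…+5 scale (from human coders or the trained classifier), counting likely voters is a simple threshold rule. The cut-off of +3 and all scores below are illustrative choices:

```python
def count_likely_voters(scored_commenters, threshold=3):
    """Count commenters whose enthusiasm score on the -5..+5 scale meets
    the (illustrative) 'likely to vote for this candidate' threshold."""
    return sum(1 for score in scored_commenters if score >= threshold)

# One score per unique commenter, per candidate bucket (invented data).
trump_scores = [5, 4, -3, 2, 5, -5, 3]
clinton_scores = [3, 3, -1, 0, 4, -4, 2]
count_likely_voters(trump_scores)    # 4
count_likely_voters(clinton_scores)  # 3
```

Note that negative scores simply fall out of the count, which matches the point below: haters do not subtract votes.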
It is important to note that negative sentiment does not really matter. What we are interested in is the number of likely voters. This is because of the election dynamics – it does not matter how poor a candidate’s aggregate sentiment is, i.e. the ratio between haters and sympathizers, as long as his or her number of likely voters is higher than that of the competition. This effect was evident in the recent US presidential election.
The crucial thing is to keep the process alive during the whole election/competition period. There is no point at which it becomes certain that one candidate will lose and the other will win, although the divide can become substantial and thereby increase the accuracy of the prediction.
III. Presentation of results
The results could be shown to the end user in the form of a dashboard. A search-trend graph and the above-mentioned cluster visualization could be viable alternatives. In addition, it would be interesting to see the count of voters evolve over time – in such a way that it, along with the visualization, could be “played back” to examine the development over time. In other words, an interactive visualization. As noted, the prediction, or the count of likely votes, should update in real time as a result of the combined human–machine work.
The idea behind developing more agile methods for using social media data to predict contest outcomes is that the accuracy of a prediction rests on the choice of indicators rather than the finesse of the method. For example, complex Bayesian models falsely predicted Hillary Clinton would win the election. It’s not that the models were poorly built; they just used the wrong indicators, namely polling data. This is the usual case of ‘garbage in, garbage out’, and it shows that the choice of indicators is more important than the technical features of the predictive model.
The choice of indicators should be based on their predictive power, and although I don’t have hard evidence for it, it intuitively makes sense that social media engagement is in many instances a stronger indicator than survey data, because it is based on actual preferences instead of stated preferences. Social scientists know from the long tradition of survey research that there is a myriad of social effects reducing the reliability of the data (e.g., social desirability bias). Those, I would argue, are a much smaller issue in social media engagement data.
However, to be fair, there can be issues of bias in social media engagement data too. The major concern is the low participation rate: a common heuristic is that 1/10 of participants actually contribute by writing, while the other 9/10 are readers whose real thoughts remain unknown. It is then a question of how well the vocal minority reflects the opinion of the silent majority. In some cases this is irrelevant: for competitions where the overall voting share remains low – say 60% – it is relatively much more important to mobilize the active base than if turnout were close to 100%, where one would need near-universal acceptance.
Another issue is non-representative sampling. This is a concern when the voting takes place offline and online data does not accurately reflect the votes of those who do not express themselves online. However, as social media participation is constantly increasing, this is becoming less of a problem. In addition, compared to other methods of data collection – apart from stratified polling, perhaps – social media is likely to give good results on competitive predictions because of their political nature. People who strongly support a candidate are more likely to be vocal about it, and the channel for voicing their opinion is social media.
It is evident that the value of social media engagement as a predictor is currently underestimated, as evidenced by the large emphasis put on political polls and the virtually non-existent discussion of social media data. As a direct consequence, those who are able to leverage social media data in the proper way will gain a competitive advantage, be it in betting markets or any other context where prediction accuracy plays a key role. The prediction work will remain a hybrid effort by man and machine.