The black sheep problem in machine learning


Just a picture of a black sheep.

Introduction. Hal Daumé III wrote an interesting blog post about language bias and the black sheep problem. In the post, he defines the problem as follows:

The “black sheep problem” is that if you were to try to guess what color most sheep were by looking and language data, it would be very difficult for you to conclude that they weren’t almost all black. In English, “black sheep” outnumbers “white sheep” about 25:1 (many “black sheep”s are movie references); in French it’s 3:1; in German it’s 12:1. Some languages get it right; in Korean it’s 1:1.5 in favor of white sheep. This happens with other pairs, too; for example “white cloud” versus “red cloud.” In English, red cloud wins 1.1:1 (there’s a famous Sioux named “Red Cloud”); in Korean, white cloud wins 1.2:1, but four-leaf clover wins 2:1 over three-leaf clover.

Thereafter, Hal accurately points out:

“co-occurance frequencies of words definitely do not reflect co-occurance frequencies of things in the real world”

But the mistake made by Hal is to assume language describes objective reality (“the real world”). Instead, I would argue that it describes social reality (“the social world”).

Black sheep in social reality. The higher occurrence of ‘black sheep’ tells us that in social reality, there is a concept called ‘black sheep’ which is more common than the concept of white (or any color) sheep. People are using that concept not to describe sheep, but as an abstract concept describing other people (“she is the black sheep of the family”). Then, we can ask: Why is that? In what contexts is the concept used? And we can try to teach the machine its proper use through associations of that concept to other contexts (much like we teach kids when saying something is appropriate and when not). As a result, the machine may create a semantic web of abstract concepts which, even if it does not lead the machine to understand them, at least helps guide its usage of them.

We, the human. That’s assuming we want it to get closer to the meaning of the word in social reality. But we don’t necessarily want to focus on that, at least as a short-term goal. In the short term, it might be more purposeful to understand that language is a reflection of social reality. This means we, the humans, can understand human societies better by analyzing it. Rather than trying to teach machines to impute data to avoid what we label an undesired state of social reality, we should use the outputs provided by the machine to understand where and why those biases take place. And then we should focus on fixing them. Most likely, technology plays only a minor role in that.

Conclusion. The “correction of biases” is equivalent to burying your head in the sand: even if they magically disappeared from our models, they would still remain in the social reality, and through the connection of social reality and objective reality, echo in the everyday lives of people.


How to teach machines common sense? Solutions for ambiguity problem in artificial intelligence



The ambiguity problem illustrated:

User: “Siri, call me an ambulance!”

Siri: “Okay, I will call you ‘an ambulance’.”

You’ll never reach the hospital, and end up bleeding to death.


Two potential solutions:

A. machine builds general knowledge (“common sense”)

B. machine identifies ambiguity & asks for clarification from humans

The whole “common sense” problem can be solved by introducing human feedback into the system. We really need to tell the machine what is what, just like a child. It is iterative learning, in which trials and errors take place.

But, in fact, A. and B. converge by doing so. Which is fine, and ultimately needed.

Contextual awareness

To determine which solution to an ambiguous situation is proper, the machine needs contextual awareness. This can be achieved by storing contextual information from each ambiguous situation, and by having it explained “why” a particular piece of information resolves the ambiguity. It’s not enough to say “you’re wrong”; there needs to be an explicit association to a reason (concept, variable). Equally, it’s not enough to say “you’re right”; again the same association is needed.

The process:

1) try something

2) get told it’s not right, and why (linking to contextual information)

3) try something else, corresponding to why

4) get rewarded, if it’s right.

The problem is, currently machines are being trained by data, not by human feedback.

New thinking on teaching the machine

So we would need to build machine-training systems which enable training by direct human feedback, i.e. a new way to teach and communicate with the machine. It’s not a trivial thing, since the whole machine-learning paradigm is based on data. From data and probabilities, we would need to move into associations and concepts. A new methodology is needed. Potentially, individuals could train their own AIs like pets (think Tamagotchi), or we could use large numbers of crowd workers who would explain to the machine why things are how they are (i.e., create associations). A specific type of markup (=communication) would probably also be needed.

Through mimicking human learning we can teach the machine common sense. This is probably the only way; since common sense does not exist beyond human cognition, it can only be learnt from humans. An argument can be made that this is like going back in time, to an era where machines followed rule-based programming (as opposed to being data-driven). However, I would argue rule-based learning is much closer to human learning than the current probability-based one, and if we want to teach common sense, we therefore need to adopt the human way.

Conclusion: machines need education

Machine learning may be up to par, but machine training certainly is not. The current machine learning paradigm is data-driven, whereas we could look into concept-driven training approaches.


Rule-based AdWords bidding: Hazardous loops


1. Introduction

In rule-based bidding, you want to sometimes have step-backs where you first adjust your bid based on a given condition, and then adjust it back after the condition has passed.

An example. A use case would be to decrease bids for the weekend, and increase them back to the normal level for weekdays.

However, defining the step-back rate does not work the way most people would think. I’ll tell you how.

2. Step-back bidding

For step-back bidding you need two rules: one to change the bid (increase/decrease) and another one to do the opposite (decrease/increase). The values applied by these rules must cancel one another.

So, if your first rule raises the bid from $1 to $2, you want the second rule to drop it back to $1.

Call these

x = raise by percentage

y = lower by percentage

Where most people get confused is by assuming x=y, so that you use the same value for both the rules.

Example 1:

x = raise by 15%

y = lower by 15%

That should get us back to our original bid, right? Wrong.

If you do the math (1*1.15*0.85), you get 0.9775, whereas you want 1 (to get back to the baseline).

The more you iterate with the wrong step-back value, the farther from the baseline you end up. To illustrate, see the following simulation, where the loop is applied weekly for three months (12 weeks * 2 = 24 data points).

Figure 1 Bidding loop

As you can see, the wrong method will take you farther and farther from the correct pattern as time goes by. For a weekly rule the difference might be manageable, especially if the rule’s incremental change is small, but imagine if you are running the rule daily or each time you bid (intra-day).

3. Solution

So, how to get to 1?

It’s very simple, really. Consider

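  • B = baseline value (your original bid)
  • x = the value of the first rule (e.g., raise bid by 15% -> 0.15)
  • y = the multiplier applied by the second rule (dependent on the first rule)

You want to solve y from

B * (1 + x) * y = B

That is,

y = 1 / (1 + x)

The baseline B cancels out, so the step-back depends only on x. For the value in Example 1,

y = 1 / (1 + 0.15) ≈ 0.8696

so the second rule must lower the bid by 1 - y ≈ 13.04%, not by 15%. Multiplying the increased value by y brings you back to the baseline, so that (with B = 1)

1.15 * (1 / 1.15) = 1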
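
The drift described above can be checked with a short simulation. This is only a minimal sketch: the function names are made up for illustration, the baseline is normalized to 1, and the loop runs weekly for 12 weeks as in the example.

```python
def step_back_multiplier(x):
    """Multiplier that exactly undoes a raise of x (e.g. x=0.15 for +15%)."""
    return 1 / (1 + x)


def simulate(baseline, x, lower_multiplier, weeks):
    """Apply raise-then-lower once per week; record the bid after each rule."""
    bid = baseline
    history = []
    for _ in range(weeks):
        bid *= 1 + x              # first rule: raise by x
        history.append(bid)
        bid *= lower_multiplier   # second rule: step back
        history.append(bid)
    return history


# Naive loop: lower by the same percentage that was used for the raise.
naive = simulate(1.0, 0.15, 1 - 0.15, weeks=12)
# Corrected loop: lower with the multiplier 1 / (1 + x).
correct = simulate(1.0, 0.15, step_back_multiplier(0.15), weeks=12)

print(round(naive[-1], 4), round(correct[-1], 4))
```

With these inputs the naive loop ends roughly a quarter below the baseline after 12 weeks, while the corrected loop returns to 1 (up to floating-point rounding).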


Remember to consider elementary mathematics when applying AdWords bidding rules!


Search engine optimization from a journalist’s perspective


The media depend on advertising revenue. It is a constant subject of debate how much journalists should write stories that attract clicks and impressions relative to stories of high societal importance. These two do not always go hand in hand.

The importance of social media and search engines in a journalist’s work

In practice, journalists have to take into account how appealing their stories are on social media. This matters even if one would like to write only about socially important topics, because gaining attention amid competing content is the only way to get a message through. For social media, this means paying attention to things such as 1) writing a compelling headline, 2) choosing a compelling preview image, and 3) editing the Open Graph metadata (which affects how the link looks on social media).

In addition to social media, a journalist must take search engine optimization into account, because alongside social media, search engines are typically a significant source of traffic. The better the articles are optimized, the more likely they are to rank high in Google’s results for important keywords.

What does a journalist need to know about search engine optimization?

When writing stories, a journalist should take the following into account with regard to search engines:

  1. Keywords: everything should start from identifying the right keywords with which the article should be found. Keyword research tools, such as Google’s Keyword Planner, are useful here.
  2. Main headline and subheadings: the chosen keywords should appear in the story’s headline and subheadings. Subheadings (h2) are important because they create a structure that a search engine can understand, and they support users’ natural, scanning-based way of reading online.
  3. Links: the story should contain links to other sources, marked with appropriate anchor texts. Not “for more information on dietary supplements, click here”, but “for example, Helsingin Sanomat has written several stories on dietary supplements”.
  4. Text: paragraphs should be short, written in clearly readable language, and contain a suitable number of the keywords being optimized. A suitable number is one that feels natural; there must not be too much repetition, because Google may interpret it as an attempt at manipulation.

Above all, the written article should be both pleasant for the user to read and easy for a search engine to understand. When these two are combined, the basics of search engine optimization are in place.


Affinity analysis in political social media marketing – the missing link


Introduction. Hm… I’ve figured out how to execute a successful political marketing campaign on social media [1], but one link is still missing. Namely, applying affinity analysis (cf. market basket analysis).

Discounting conversions. Now, you are supposed to measure “conversions” by some proxy – e.g., time spent on site, number of pages visited, email subscription. Determining which measurable action is the best proxy for likelihood of voting is a crucial sub-problem, which you can approach with several tactics. For example, you can use the closest action to final conversion (vote), i.e. micro-conversion. This requires that you understand the sequence of actions leading to final conversion. You could also use a relative cut-off point; e.g., the top n percent of visitors by engagement are considered converted.

Anyhow, this is very important because once you have secured a vote, you don’t want to waste your marketing budget by showing ads to people who already have decided to vote for your candidate. Otherwise, you risk “preaching to the choir”. Instead, you want to convert as many uncertain voters to voters as possible, by using different persuasion tactics.
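
The relative cut-off idea could be sketched as follows. This is a minimal illustration, not a tested method: the visitor ids, the engagement figures (seconds on site) and the 20% share are all made-up assumptions.

```python
def converted_by_percentile(engagement, top_share=0.10):
    """Return the ids of visitors in the top `top_share` by engagement.

    `engagement` maps a visitor id to an engagement score, for example
    seconds spent on the campaign site (a hypothetical proxy).
    """
    ranked = sorted(engagement, key=engagement.get, reverse=True)
    cutoff = max(1, round(len(ranked) * top_share))
    return set(ranked[:cutoff])


# Hypothetical engagement data for ten visitors.
engagement = {
    "v1": 35, "v2": 610, "v3": 48, "v4": 120, "v5": 15,
    "v6": 240, "v7": 480, "v8": 60, "v9": 95, "v10": 20,
}

converted = converted_by_percentile(engagement, top_share=0.20)
print(sorted(converted))
```

Visitors in the returned set would then be excluded from further ads, so that the budget goes to the uncertain voters.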

Affinity analysis. Affinity analysis can be used to accomplish this. In ecommerce, you would use it as a basis for a recommendation engine for cross-selling or up-selling (“customers who bought this item also bought…” à la Amazon). First you determine which sets of products are most popular, and then show those combinations to buyers interested in any item belonging to that set.

In political marketing, affinity analysis means that because a voter is interested in topic A, he’s likely also interested in topic B. Therefore, we will show him information on topic B, given our existing knowledge of his interests, in order to increase the likelihood of conversion. This is a form of associative learning.

Operationalization. But operationalizing this is where I’m still in doubt. One solution could be building an association matrix based on website behavior, and then form corresponding retargeting audiences (e.g., website custom audiences on Facebook). The following picture illustrates the idea.

Figure 1 Example of affinity analysis (1=Visited page, 0=Did not visit page)

For example, we can see that themes C&D and A&F commonly occur together, i.e. people visit those sub-pages in the campaign site. You can validate this by calculating correlations between all pairs. When you set your data in binary format (0/1), you can use Pearson correlation for the calculations.
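
The pairwise correlation check could look like the sketch below. The visit matrix is invented for illustration (the figure’s actual data is not reproduced here), and the theme names A to F are simply the labels used in the text.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined; treat as no association
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one list per theme page, one position per visitor
# (1 = visited the page, 0 = did not visit).
visits = {
    "A": [1, 1, 0, 1, 0, 1, 0, 0],
    "B": [0, 1, 0, 0, 1, 0, 1, 0],
    "C": [0, 0, 1, 0, 1, 0, 1, 1],
    "D": [0, 0, 1, 0, 1, 1, 1, 1],
    "E": [1, 0, 0, 1, 0, 0, 1, 0],
    "F": [1, 1, 0, 1, 0, 0, 0, 0],
}

scores = {
    (a, b): pearson(visits[a], visits[b])
    for a, b in combinations(visits, 2)
}
top_pairs = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_pairs)
```

In this made-up data the two strongest pairs are A&F and C&D, mirroring the pattern described above.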

Facebook targeting. Knowing this information, we can build target audiences on Facebook, e.g. “Visited /Theme_A; NOT /Theme_F; NOT /confirmation”, where confirmation indicates conversion. Then, we would show ads on Theme F to that particular audience. In practice, we could facilitate the process by first identifying the most popular themes, and then finding the associated themes. Once the user has been exposed to a given theme and has not converted, he needs to be exposed to another theme (with the highest association score). The process is continued until themes run out, or the user converts, whichever comes first. Applying the earlier logic of determining a proxy for conversion, visiting all theme sub-pages can also be used as a measure for conversion.
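
The exposure logic described above (show the most strongly associated unseen theme until the user converts or themes run out) might be sketched like this. The association scores and the `converts_after` parameter are hypothetical; in practice conversion would be observed, not known in advance.

```python
def next_theme(last_seen, seen, association):
    """Pick the unseen theme most strongly associated with the last seen one."""
    candidates = {
        theme: score
        for theme, score in association.get(last_seen, {}).items()
        if theme not in seen
    }
    if not candidates:
        return None  # themes ran out
    return max(candidates, key=candidates.get)


def exposure_sequence(start, association, converts_after=None):
    """List the themes shown to a user who starts from `start`.

    `converts_after` simulates the theme after which the user converts
    (None means the user never converts).
    """
    seen = [start]
    while seen[-1] != converts_after:
        theme = next_theme(seen[-1], set(seen), association)
        if theme is None:
            break
        seen.append(theme)
    return seen


# Hypothetical association scores between theme pages.
association = {
    "A": {"F": 0.77, "E": 0.26, "B": -0.13},
    "F": {"A": 0.77, "E": 0.47, "B": -0.13},
    "E": {"F": 0.47, "A": 0.26, "B": -0.13},
    "B": {"A": -0.13, "E": -0.13, "F": -0.13},
}

print(exposure_sequence("A", association))
print(exposure_sequence("A", association, converts_after="E"))
```

Each step in the sequence corresponds to one retargeting audience of the form “visited the previous theme; NOT the next theme; NOT /confirmation”.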

Finally, it is possible to use more advanced methods of associative learning. That is, we could determine that {Theme A, Theme F} => {Theme C}, so that themes A and F predict interest in theme C. However, it is more appropriate to predict conversion rather than interest in other themes, because ultimately we’re interested in persuading more voters.


[1] Posts in Finnish:


Total remarketing – the concept


Here’s a definition:

Total remarketing is remarketing in all possible channels with all possible list combinations.

Channels include:

  • Programmatic display networks (e.g., Adroll)
  • Google (GDN, RLSA)
  • Facebook (Website Custom Audience)
  • Facebook (Video viewers / Engaged with ads)
  • etc.

How to apply:

  1. Test 2-3 different value propositions per group
  2. Prefer up-selling and cross-selling over discounts (the goal is to increase AOV, not reduce it; e.g. you can include a $20 gift voucher when basket size exceeds $100)
  3. Configure well; exclude those who bought; use the information you have to improve remarketing focus (e.g. time on site, products or categories visited; the same remarketing for all groups is like the same marketing for all groups)
  4. Consider automation options (dynamic retargeting; behavior based campaign suggestions for the target)

The problem of machine learning getting stuck


Even a machine can sometimes freeze.

A machine learns like a human: from empirical observations (= data).

For this reason, just as it is hard for a human to unlearn bad habits and attitudes (prejudices, stereotypes), it is hard for a machine to quickly unlearn an erroneous interpretation.

The question is not about unlearning, which is probably impossible in many cases, but about learning new things so that the old memory structures (= the model’s features) are efficiently replaced with new ones. Efficiently, because the longer the old invalid models stay in use, the more damage machine decision-making has time to do. The problem is accentuated in large-scale decision-making systems, where the machine may be responsible for thousands or even millions of decisions within a short time.

Example: A machine has learned to diagnose disease X based on symptoms {x}. Then new research emerges according to which disease X is associated with symptoms {y}, which are close to symptoms {x} but not identical. It takes the machine a long time to learn the new association if it has to identify the links between different symptoms and diseases by itself while simultaneously forgetting old models.

How can this process be sped up? That is, how can we keep the benefits of machine learning (= finding the right features, e.g. symptom combinations, faster than a human) while still letting a human correct the learned model in a guided way, based on better knowledge?

Technically, the problem could be framed through a so-called bandit algorithm: if the algorithm carries out both exploration and exploitation, one could try to tackle the problem by restricting the search space. The machine could also be fed enough evidence for it to learn the relationship quickly. That is, if the machine has not found the same thing as a given scientific study, the data from that study could be used to train the machine to such an extent (even by over-weighting it, if it would otherwise drown among the other data) that the classification model corrects itself.
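
The over-weighting idea can be illustrated with a toy calculation. This is a deliberately simplified sketch, not a real classifier: the record counts, the weight of 40 and the 0.5 decision threshold are all hypothetical, and the "model" is just a weighted share of positive diagnoses among patients showing symptoms {y}.

```python
def weighted_positive_rate(groups):
    """Weighted share of positive diagnoses.

    `groups` is a list of (n_positive, n_total, weight) tuples, one per
    data source.
    """
    positive = sum(pos * weight for pos, _, weight in groups)
    total = sum(n * weight for _, n, weight in groups)
    return positive / total


# Hypothetical legacy records: patients with symptoms {y}, of whom few
# were diagnosed with disease X under the old knowledge.
legacy = (100, 1000)
# Hypothetical new study: a small sample where {y} strongly indicates X.
study = (45, 50)

unweighted = weighted_positive_rate([(*legacy, 1), (*study, 1)])
overweighted = weighted_positive_rate([(*legacy, 1), (*study, 40)])

print(round(unweighted, 3), round(overweighted, 3))
```

Without extra weight the study data drowns in the legacy data and the estimate stays well below 0.5; with the study over-weighted, the estimate crosses the threshold and the diagnosis would flip.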


In 2016, Facebook bypassed Google in ads. Here’s why.



The past year, 2016, was the first year I thought Facebook would end up beating Google in the ad race, despite the fact that Google still dominates in revenue ($67Bn vs. $17Bn in 2015). I’ll explain why.

First, consider that Google’s growth is restricted by three things:

  1. natural demand
  2. keyword volumes, and
  3. approaching a perfect market.

More demand than supply

First, at any given time there is a limited number of people interested in a product/service. The interest can be of purchase intent or just general interest, but either way it translates into searches. Each search is an impression that Google can sell to advertisers through its AdWords bidding. The major problem is this: even when I’d like to spend more money on AdWords, I cannot. There is simply not enough search volume to satisfy my budget (in many cases there is, but in highly targeted and profitable campaigns many times there isn’t). So, the excess budget I will spend elsewhere where the profitable ad inventory is not limited (that is, Facebook at the moment).

Limited growth

According to estimates, search volume is growing by 10-15% annually [1]. Yet, Google’s revenue is expected to grow by as much as 26% [2]. Over the years, Google’s growth rate in terms of search volume has decreased substantially, although this is perceived as a natural phenomenon (after a trillion searches it’s hard to keep growing at double digits). In any case, these dynamics feed into the ad auction: when the volumes don’t grow much and new advertisers keep entering, there is more competition over the same searches. In other words, supply stays stable but demand increases, resulting in more intense bid wars.

Approaching perfect market

For a long time now, I’ve added a 15% increase to internal budgeting for AdWords, and last year that was hard to maintain. Google is still a profitable channel, but the advertisers’ surplus is decreasing year by year, incentivizing them to look for alternative channels. While Google is restrained by its natural search volumes, Facebook’s ad inventory (=impressions) is practically limitless. The closer AdWords gets to a perfect market (=no economic rents), the less attractive it is for savvy marketers. Facebook is less exploited, and allows rents.

What will Google do?

Finally, I don’t like the Alphabet business. From the outset it signals to investors that Google is in the “whatever comes to mind” business instead of keeping a strategic focus on search. Most likely Alphabet will end up draining resources from the core company, producing losses and diverting human capital away from succeeding in the online ads business (which is where their money comes from). In contrast, Facebook is very focused on social; it buys off competitors and improves fast. That said, I do have to recognize that Google’s advertising system is still much better than that of Facebook, and in fact still the best in the world. But momentum seems to be shifting to Facebook’s side.


The maximum number of impressions (=ad inventory) of Facebook is much higher than that of Google, because Google is limited by natural demand and Facebook is not. In the marketplace, there is always more supply (advertisers with something to sell) than demand (people searching), which is why advertisers want to spend more than what Google enables. These factors, combined with Facebook’s continuously increasing ability to match interested people with the right type of ads, make Facebook’s revenue potential much bigger than Google’s.

From the advertiser’s perspective, Facebook and Google both are and are not competitors. They are competitors for ad revenue, but they are not competitors in the online channel mix. Because Google is for demand capture and Facebook for demand creation, most marketers want to include both in their channel mix. This means Google’s share of online ad revenue might decrease, but a rational online advertiser will not drop it, so it will remain a (less important) channel for the foreseeable future.





Buying and selling complement bundles: When individual selling maximizes profit



When we were young, my brother and I used to buy and sell game consoles on (local eBay) and on various gamer discussion forums (Konsolifin BBS, for example). We didn’t have much money, so this was a great way to earn some cash, and it taught us some useful business lessons over the years.

What we would often do was buy a bundle (console+games), break it apart and sell the pieces individually. At that time we didn’t know anything about economics, but intuitively it felt like the right thing to do. Indeed, we would always make money with that strategy, as we knew the market prices (or their range) of each individual item.

Looking back, I can now try and explain with economic terms why this was a successful strategy. In other words, why individual selling of items in a complement bundle is a winning strategy.

Why does individual selling provide a better profit than the selling of a bundle?

Let’s first define the concepts.

  • individual selling = buy complement bundle, break it apart and sell individual pieces
  • a complement bundle = a central unit and its complements (e.g., a game console and games)

Briefly, it is so because the tastes of the market are randomly distributed and do not align with the exact contents of the bundle. It then follows that the exact set of complements does not maximize any individual’s utility, so they will bid accordingly (e.g., “I like those two games (out of five), but not the other three, so I don’t put much value on them”) and the market price of the bundle will settle below the full value of its individual parts.

In contrast, by breaking the bundle apart and selling the pieces individually, each complement can be appraised at full value (“I like that game, so I’ll pay its real value”). In other words, the seller will need to find a buyer for each piece who appreciates that piece to its full value (=has a preference for it).

The intuition

Tastes and preferences differ, which is reflected in individuals’ utility functions and therefore in their willingness to pay. Selling a bundle is a compromise from the perspective of the seller: he compromises on his full price, because the buyer is willing to pay only according to his preferences (utility function), which do not match completely with the contents of the bundle.


There are two exceptions I can think of:

1) Highly valued complements (or homogeneous tastes)

Say all the complements are of high value in the market (e.g., popular hit games). Then, a large portion of the market assigns full value to them, and the bundle price settles close to or equal to the sum of individual full prices. Similarly, if all the buyers value the complements in a similar way, i.e. their taste is homogeneous, the randomness required for individual selling to perform does not exist.

2) Information asymmetry

Sometimes, you can get a higher price by selling a bundle than by selling the individual pieces. We would use this strategy when the value of the complements was very little to an “expert”. A less experienced buyer would see “a game console + 5 games” and find it attractive. The 5 games, however, had very little value in the market, so it made sense to include them in the bundle to attract less-informed buyers. In other words, we were benefiting from information asymmetries.

Finally, the buyer of a complement bundle needs to be aware of the market price (or the range of it) of each item. Otherwise, he might end up paying more than the value of the sum of individual items.


Finding bundles and selling the pieces individually is a great way for young people to practice business. Luckily, there are always sellers in the market who are not looking to optimize their asking price, but appreciate the speed and comfort associated with selling bundles (i.e., dealing with one buyer). The actors with more time and less sensitivity to comfort can then take advantage of that condition to make some degree of profit.

EDIT: My friend Zeeshan pointed out that a business may actually prefer bundling even when the price is lower than in individual selling, if they assign a transaction cost (search, bargaining) to individual selling and the sum of transaction costs of selling individual items is higher than the sum of differences between the full price and bundle price of complements. (Sounds complicated but means that you’d spend too much time selling each item in comparison to profit.) For us as kids this didn’t matter since we had plenty of time, but for businesses the cost of selling does matter.
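
Zeeshan’s condition can be made concrete with a small comparison of net proceeds. This is a sketch under stated assumptions: the prices are invented, the transaction cost is modeled as a flat amount per sale, and the bundle is assumed to incur that cost once.

```python
def net_proceeds(item_prices, bundle_price, cost_per_sale):
    """Net proceeds of selling items one by one versus as a single bundle."""
    individual = sum(item_prices) - cost_per_sale * len(item_prices)
    bundle = bundle_price - cost_per_sale  # one buyer, one transaction
    return individual, bundle


def best_strategy(item_prices, bundle_price, cost_per_sale):
    """Return 'individual' or 'bundle', whichever nets more."""
    individual, bundle = net_proceeds(item_prices, bundle_price, cost_per_sale)
    return "individual" if individual > bundle else "bundle"


# Hypothetical market prices: one console and five games.
item_prices = [100, 20, 15, 10, 5, 5]
bundle_price = 120  # the bundle sells below the sum of its parts (155)

print(best_strategy(item_prices, bundle_price, cost_per_sale=2))   # cheap time
print(best_strategy(item_prices, bundle_price, cost_per_sale=10))  # costly time
```

When time is cheap (as it was for us as kids), individual selling wins; once each sale carries a meaningful cost, the bundle becomes the better option even at a lower price.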


Polling social media users to predict election outcomes


The 45th President of the USA


The problem of predicting election outcomes with social media is that the data, such as likes, are aggregate, whereas the election system is not — apart from simple majority voting, in which you only have the classic representativeness problem that Gallup solved in 1936. To solve the aggregation problem, one needs to segment the polling data so that it 1) corresponds to the prevailing election system and 2) accurately reflects the voters according to that system. For example, in the US presidential election each state has a certain number of electoral votes. To win, a candidate needs to reach 270 electoral votes.

Disaggregating the data

One obvious solution would be to trace likes back to profiles and determine the state based on information the user has made public. This way we could filter out foreign likers as well. However, there are some issues with using likes as indicators of votes. Most importantly, “liking” something on social media does not in itself predict the future behavior of an individual to a sufficient degree.

Therefore, I suggest here a simple polling method via social media advertising (Facebook Ads) and online surveys (Survey Monkey). Polling partly faces the same problem of predicting future behavior as using likes as the overarching indicator, which is why in the latter part of this article I discuss how these approaches could be combined.

At this point, it is important to acknowledge that online polling does have significant advantages relating to 1) anonymity, 2) cost, and 3) speed. That is, people may feel more at ease expressing their true sentiment to a machine than another human being. Second, the method has the potential to collect a sizeable sample in a more cost-effective fashion than calling. Finally, a major advantage is that due to the scalable nature of online data collection, the predictions can be updated faster than via call-based polling. This is particularly important because election cycles can involve quick and hectic turns. If the polling delay is from a few days to a week, it is too late to react to final week events of a campaign which may still carry a great weight in the outcome. In other words: the fresher the data, the better. (An added bonus is that by doing several samples, we could consider momentum i.e. growth speed of a candidate’s popularity into our model – albeit this can be achieved with traditional polling as well.)

Social media polling (SMP)

The method, social media polling or SMP, is described in the following picture.

Figure 1 Social media polling

The process:

1. Define segmentation criteria

First, we need to understand the election system. For example, in the US system every state has a certain weight expressed by its share of total electoral votes. There are 50 states, so these become our segmentation criteria. If we deem it appropriate to do further segmentation (e.g., gender, age), we can do so by creating additional segments which are reflected in target groups and surveys. (These sub-segments can also be analyzed in the actual data later on.)

2. Create unique surveys

Then, we create a unique survey for each segment so that the answers will be bucketed. The questions of the survey are identical – they are just behind different links to enable easy segmentation. We create a survey rather than use a visible poll (app) or picture-type of poll (“like if you vote Trump, heart if you vote Hillary”), because we want to avoid social desirability bias. A click on Facebook will lead the user to the unique survey of their segment, and their answers won’t be visible to the public.

3. Determine sample size

Calculating sample size is one of those things that will make your head spin, because there’s no easy answer as to what is a good sample size. Instead, “it depends.” However, we can use some heuristic rules to come up with decent alternatives in the context of elections. Consider two potential sample sizes.

Option 1:

  • Sample size: 500
  • Confidence level: 95%
  • Margin of error: +/- 4.4%

Option 2:

  • Sample size: 1,000
  • Confidence level: 95%
  • Margin of error: +/- 3%

These are seen as decent options among election pollsters. However, the margin of error is still quite sizeable in both of them. For example, if there are two candidates and their “true” support values are A=49%, B=51%, the large margin of error makes it easy to go wrong. We could solve this by increasing the sample size, but the problem is that if we would like to reduce the margin of error from +/- 3% to say 1%, our required sample size grows dramatically (more precisely, with a 95% confidence and population size of 1M, it’s 9512, which is impractically high for a 50-state model). In other words, we have to accept the risk of wrong predictions in this type of situation.
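
The figures above can be reproduced with the standard formulas for a proportion. This is a sketch assuming the usual worst case p = 0.5, z = 1.96 for 95% confidence, and the common finite population correction; the function names are illustrative only.

```python
from math import ceil, sqrt

Z_95 = 1.96  # z-score for a 95% confidence level


def margin_of_error(n, p=0.5, z=Z_95):
    """Margin of error for a proportion estimated from a sample of size n."""
    return z * sqrt(p * (1 - p) / n)


def required_sample_size(margin, population=None, p=0.5, z=Z_95):
    """Sample size needed for a given margin of error.

    If `population` is given, the finite population correction is applied.
    """
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    if population is None:
        return ceil(n0)
    return ceil(n0 / (1 + (n0 - 1) / population))


print(round(margin_of_error(500) * 100, 1))    # percentage points
print(round(margin_of_error(1000) * 100, 1))
print(required_sample_size(0.01, population=1_000_000))
```

The two margins come out at roughly 4.4 and 3.1 percentage points, and the required sample for a 1% margin lands at about 9,500, in line with the figure quoted above (the last digit depends on the rounding convention used).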

Most states have over 1,000,000 people, and even the least populous ones have several hundred thousand, so each of them is considered a “large” population (this is a mathematical thing: the required sample size stabilizes after reaching a certain population size). Although the US is characterized as one population, in the context of election prediction it’s actually several different populations (because we have independent states that vote). The procedure we apply is stratified random sampling, in which the large general population is split into sub-groups. In practice, each sub-group requires its own sample, and therefore our approach requires a considerably larger sample size than a prediction that would only consider the whole population of the country. But exactly because of this, it should be more accurate.

So, with this lengthy explanation, let us say we settle for a sample size of 500 per state. That would be 500×50=25,000 respondents. If it cost $0.60 to get a respondent via Facebook ads, the cost of data collection would be $15,000. For repeated polling, there are a few strategies. First, the sample size can be reduced for states that show a large difference between the candidates. In other words, we don’t need to collect a large number of respondents if we “know” the popularity difference between candidates is high. The important thing is that the situation is measured periodically, and sample sizes are flexibly adjusted according to known results. In a similar vein, we can increase the sample size for states where the competition is tight, to reduce the margin of error and therefore to increase the accuracy of our prediction. To my understanding, the opportunity of flexible sampling is not efficiently used by all pollsters.
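
The flexible sampling idea could be sketched as a simple rule that maps the previously measured lead to the next sample size. The thresholds (10 and 3 percentage points) and the alternative sample sizes (250 and 1,000) are hypothetical choices for illustration; only the base size of 500 and the $0.60 cost per respondent come from the text.

```python
COST_PER_RESPONDENT = 0.60  # dollars, as assumed in the text


def sample_size_for(lead, base=500, small=250, large=1000,
                    safe_lead=0.10, tight_lead=0.03):
    """Choose the next sample size from the lead measured in the last poll."""
    lead = abs(lead)
    if lead >= safe_lead:
        return small   # clear difference: fewer respondents are enough
    if lead <= tight_lead:
        return large   # tight race: reduce the margin of error
    return base


def polling_cost(sample_sizes):
    """Total data collection cost for the given per-state sample sizes."""
    return sum(sample_sizes) * COST_PER_RESPONDENT


# Baseline: 500 respondents in each of the 50 states.
print(round(polling_cost([500] * 50), 2))

# Hypothetical leads from a previous round for three states.
previous_leads = {"Michigan": 0.01, "California": 0.25, "Ohio": 0.06}
next_sizes = {state: sample_size_for(lead)
              for state, lead in previous_leads.items()}
print(next_sizes)
```

The baseline reproduces the $15,000 figure, and the second round shifts respondents from the clear-cut state to the tight one.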

4. Create Facebook campaigns

For each segment, a target group is created in Facebook Ads. The target group is used to advertise to that particular group; for example, the Michigan survey link is only shown to people from Michigan. That way, we minimize the risk of people outside the segment responding (however, they can be excluded later on by IP). At this stage, creating attractive ads helps keep the cost per response low.

5. Run until sample size is reached

The administrator observes the results and stops the data collection once a segment has reached the desired sample size. When all segments are ready, the data collection is stopped.

6. Verify data

Based on IP, we can filter out respondents who do not belong to our particular geographical segment (=state).

Ad clicks can be used to determine sample representativeness by other factors – in other words, we can use Facebook’s campaign reports to segment age and gender information. If a particular group is under-represented, we can correct by altering the targeting towards them and resume data collection. However, we can also accept the under-representation if we have no valid reference model as to voting behavior of the sub-segments. For example, millennials might be under-represented in our data, but this might correspond with their general voting behavior as well – if we assume survey response rate corresponds with voting rate of the segments, then there is no problem.
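The two verification steps could be sketched roughly as follows. I assume here that each response already carries a state resolved from the respondent's IP address by some geolocation lookup (the `ip_state` field is hypothetical), and that we have reference shares per age group to compare against; the 5 percentage point tolerance is an arbitrary illustration.

```python
def filter_by_state(responses, target_state):
    """Keep only respondents whose IP-derived state matches the segment."""
    return [r for r in responses if r["ip_state"] == target_state]

def underrepresented(sample_counts, reference_shares, tolerance=0.05):
    """Return groups whose sample share falls short of the reference share.

    sample_counts: respondents per group, e.g. from the ad reports.
    reference_shares: expected share per group (hypothetical reference model).
    """
    total = sum(sample_counts.values())
    flagged = {}
    for group, ref_share in reference_shares.items():
        share = sample_counts.get(group, 0) / total
        if ref_share - share > tolerance:
            flagged[group] = (share, ref_share)
    return flagged

# Made-up example data for a Michigan segment.
responses = [
    {"id": 1, "ip_state": "MI"},
    {"id": 2, "ip_state": "OH"},  # outside the segment, gets dropped
    {"id": 3, "ip_state": "MI"},
]
print(len(filter_by_state(responses, "MI")))

counts = {"18-29": 10, "30-49": 45, "50+": 45}
reference = {"18-29": 0.20, "30-49": 0.40, "50+": 0.40}
print(underrepresented(counts, reference))
```

Whether a flagged group should actually be corrected depends on the reference model: if the reference shares describe the population rather than likely voters, the gap may be harmless, as discussed above.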

7. Analyze results

The analysis process is straightforward:

segment-level results x weights = prediction outcome

For example, in the US presidential election, the segment-level result would be the winner of each state (whoever polls highest there), weighted by the number of electoral votes of that state. The candidate who gets at least 270 electoral votes is the predicted winner.
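As a minimal sketch of this calculation: the state names, poll shares and electoral vote counts below are made up, and the winning threshold is lowered to fit the toy example (with all states included it would be 270). The sketch also assumes every state is winner-take-all, which is a simplification, since Maine and Nebraska split their electoral votes.

```python
def predict_winner(state_polls, electoral_votes, votes_to_win=270):
    """Sum electoral votes per candidate, winner-take-all per state.

    state_polls: {state: {candidate: polled share}}
    electoral_votes: {state: number of electoral votes}
    Returns the totals and the predicted winner (None if no one
    reaches the threshold).
    """
    totals = {}
    for state, shares in state_polls.items():
        leader = max(shares, key=shares.get)  # highest polled share
        totals[leader] = totals.get(leader, 0) + electoral_votes[state]
    winner = next((c for c, v in totals.items() if v >= votes_to_win), None)
    return totals, winner

# Toy example with three hypothetical states.
toy_polls = {
    "State A": {"X": 0.52, "Y": 0.48},
    "State B": {"X": 0.45, "Y": 0.55},
    "State C": {"X": 0.51, "Y": 0.49},
}
toy_votes = {"State A": 10, "State B": 6, "State C": 4}

totals, winner = predict_winner(toy_polls, toy_votes, votes_to_win=11)
print(totals, winner)
```

Note that a 51-49 poll counts exactly as much as a 70-30 one here, which is why the margin of error in tight states matters so much for the final prediction.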

Other methods

Now, as for other methods, we can use behavioral data. I have previously argued that behavioral data is a stronger indicator of future actions since it’s free from reporting bias. In other words, people say they will do something, but won’t end up doing it. This is a very common problem, both in research and in daily life.

To correct for that, we consider two approaches here:

1) The volume of likes method, which parallels a like to a vote (the more likes a candidate has in relation to another candidate, the more likely they are to win)

For this method to work, the “intensity of like”, i.e. its correlation to behavior, should be determined, as not all likes are indicators of voting behavior. Likes don’t readily translate into votes, and there does not appear to be other information we can use to further examine their correlation (a like is a like). We could, however, add contextual information about the person, or use rules such as “the more likes a person gives, the more likely (s)he is to vote for a candidate.”

Or, we could use another solution which I think is better:

2) Text analysis/mining

By analyzing social media comments of a person, we can better infer the intensity of their attitude towards a given topic (in this case, a candidate). If a person is using strongly positive vocabulary while referring to a candidate, then (s)he is more likely to vote for him/her than if the comments are negative or neutral. Notice that the mere positive-negative range is not enough, because positivity has degrees of intensity we have to consider. Saying “he is okay” is different from saying “omg he is god emperor”. The more excitement and associated feelings – which need to be carefully mapped and defined in the lexicon – a person exhibits, the more likely that person is to vote.
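A very small sketch of lexicon-based intensity scoring is below. The lexicon and its weights are invented for illustration only; a real lexicon would need the careful mapping described above, and this simple phrase matching ignores negation, sarcasm and context.

```python
import re

# Hypothetical lexicon: phrase -> intensity weight (negative = hostile).
INTENSITY_LEXICON = {
    "okay": 1,
    "good": 2,
    "great": 3,
    "love": 4,
    "god emperor": 5,
    "bad": -2,
    "terrible": -4,
}

def comment_intensity(comment, lexicon=INTENSITY_LEXICON):
    """Sum the weights of all lexicon phrases found in the comment."""
    text = comment.lower()
    score = 0
    for phrase, weight in lexicon.items():
        # Match whole words/phrases only, so "okay" does not match "okayish".
        pattern = r"\b" + re.escape(phrase) + r"\b"
        score += weight * len(re.findall(pattern, text))
    return score

for comment in ("he is okay", "omg he is god emperor", "he is terrible"):
    print(comment, "->", comment_intensity(comment))
```

The only thing the sketch is meant to show is that both positive comments get a positive score, but with different magnitudes, which is the distinction a plain positive/negative classification would lose.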


As I mentioned, even this approach risks falling short on representativeness. First, the population on Facebook may not correspond with the population at large. It may be that the user base is skewed by age or some other factor. The choice of platform greatly influences the sample; for example, Snapchat users are on average younger than Facebook users, whereas Twitter users are more liberal. It is not clear whether Facebook’s user base represents a skewed sample or not. Second, the people voicing their opinions may be part of a “vocal minority” as opposed to the “silent majority”. In that case, we can apply the logic of the normal (Gaussian) distribution and assume that the general population leans more toward the middle ground than toward the extremes. If, in addition, we assume the central tendency to be symmetrical (meaning people in the middle are equally likely to tip toward either candidate in a two-way race), the analysis of extremes can still yield a valid prediction.

Another limitation may be that ad targeting is not equivalent to random sampling, but carries some kind of bias. That bias could emerge e.g. from 1) the ad algorithm favoring a particular sub-set of the target group, i.e. showing more ads to them, whereas we would like to get all types of respondents; or 2) self-selection, in which the respondents are of a similar kind and again not representative of the population. Off the top of my head, I’d say number two is less of a problem because those people who show enough interest are also the ones who vote – remember, essentially we don’t need to care about the opinions of the people who don’t vote (that’s how elections work!). But number one could be a serious issue, because the ad algorithm directs impressions based on responses and might identify some hidden pattern we have no control over. Basically, the only thing we can do is examine the superficial segment information in the ad reports, and evaluate whether the ad rotation was sufficient or not.

Combining different approaches

As both approaches – traditional polling and social media analysis – have their shortcomings and advantages, it might be feasible to combine the data under a mixed model which would factor in 1) count of likes, 2) count of comments with high affinity (=positive sentiment), and 3) polled preference data. A deduplication step would be needed so that those who both liked and commented are not counted twice – this requires associating likes and comments with individuals. Note that the hybrid approach requires geographic information as well, because otherwise segmentation is diluted. Anyhow, taking the user as the central entity could be a step towards determining voting propensity:

user (location, count of likes, count of comments, comment sentiment) –> voting propensity

Another way to see this is that enriching likes with relevant information (with regard to the election system) can help model social media data in a more granular and meaningful way.
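To illustrate the user-centric idea, here is one possible sketch. Everything in it is an assumption on my part: the event format, the field names, the weights, and the choice of a logistic function to squash the score into a 0-1 propensity. In practice the weights would have to be estimated from data on actual voting behavior.

```python
import math
from dataclasses import dataclass, field

@dataclass
class UserActivity:
    location: str
    likes: int = 0
    sentiments: list = field(default_factory=list)  # one score per comment, -1..1
    polled_for_candidate: bool = False

def aggregate_by_user(events):
    """Group likes, comments and poll answers by user id.

    Keying on the user is what deduplicates: someone who both liked
    and commented ends up as one record, not two supporters.
    """
    users = {}
    for e in events:
        u = users.setdefault(e["user_id"], UserActivity(location=e["location"]))
        if e["type"] == "like":
            u.likes += 1
        elif e["type"] == "comment":
            u.sentiments.append(e["sentiment"])
        elif e["type"] == "poll":
            u.polled_for_candidate = e["prefers_candidate"]
    return users

def voting_propensity(u, w_likes=0.5, w_comments=0.5, w_sentiment=1.5,
                      w_poll=2.0, bias=-1.0):
    """Hypothetical logistic score in (0, 1); weights are placeholders."""
    mean_sentiment = sum(u.sentiments) / len(u.sentiments) if u.sentiments else 0.0
    z = (bias
         + w_likes * math.log1p(u.likes)
         + w_comments * math.log1p(len(u.sentiments))
         + w_sentiment * mean_sentiment
         + w_poll * float(u.polled_for_candidate))
    return 1 / (1 + math.exp(-z))

events = [
    {"user_id": "u1", "location": "MI", "type": "like"},
    {"user_id": "u1", "location": "MI", "type": "comment", "sentiment": 0.9},
    {"user_id": "u2", "location": "MI", "type": "like"},
]
users = aggregate_by_user(events)
for user_id, activity in users.items():
    print(user_id, activity.location, round(voting_propensity(activity), 2))
```

Keeping the location on each user record is what allows the propensities to be rolled up by state afterwards, in the same way as the polled results.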