
Behind Opinion Polls: Fundamentals and Reliability


This article is quite technical and requires some mathematical background to follow in full. Each technical point has a reference to the corresponding Wikipedia article for further explanation.

It is election time in France, and the daily news is paced by polls: Nicolas Sarkozy is falling by 2% in today's IPSOS poll, Francois Hollande slipped below 33% after the crisis, and so on. Everybody seems to give these polls a lot of credit, forgetting most of the time that they are only statistical estimates of reality. But how reliable are they? The results of the first round of the French presidential election seemed to surprise so many people and media outlets that it pushed me to dig into my college maths lessons to look for a scientific explanation, if there is one.

Polling

Organizing the Opinion Poll

Opinion polls aim to estimate the percentage of people who will vote for a given candidate. Polling institutes want to produce accurate polls, but they cannot poll too large a sample, as that would be time-consuming and expensive.

They choose a representative sample of the population and ask each person whom they plan to vote for. The responses are assumed to be independent.

Once the responses are collected, how do we use them to build an estimate of the votes, and how precise is this estimate?

The natural and logical estimate is to take the sample mean as an estimator. How do we justify such an estimate and what is its precision? We’ll see that what sounds like common sense has a strong mathematical justification.

Building the Estimate

We are in a situation where we have n people

\[ X_1, X_2,…X_n\]

who can choose between k candidates. We define the variable

\[ \delta_{i,j} \]

which is equal to 1 if person i votes for candidate j and 0 if not.

The true vote intentions we want to estimate are denoted:

\[ \pi_1, \pi_2,…\pi_k\]

with the constraint:

\[ \sum_j^k \pi_j = 1 \]

The probability of observing the result of the poll is:

\[ p_{\pi}(X_1, X_2,…,X_n) = \prod_i^n p_{\pi} (X_i)\]

as the n people are chosen independently.

\[ \prod_i^n p_{\pi} (X_i) = \prod_j^k \prod_i^n \pi_j^{\delta_{i,j}} = \prod_j^k \pi_j^{n_j}\]

where

\[ n_j = \sum_i^n \delta_{i,j}\]

The strategy we adopt is, given the n people, to maximize this probability of observing the output. Maximizing this probability is the same as minimizing the -log of the probability, as -log is a strictly decreasing function. With this transformation, the computation of the extremum is much easier as the product becomes a sum.

\[ \arg\max_{\pi}[p_{\pi}(X_1, X_2,…,X_n)] = \arg\min_{\pi}[-log(p_{\pi}(X_1, X_2,…,X_n))] = \arg\min_{\pi}[ -\sum_j^k n_j log(\pi_j) ]\]

To find the minimum under the above constraint we can use a Lagrange multiplier and minimize:

\[ -\sum_j^k n_j log(\pi_j) + \lambda(\sum_j^k \pi_j -1)\]

Setting the derivative with respect to each of the unknowns to zero gives

\[ -\frac{n_j}{\pi_j} + \lambda = 0 \quad \Rightarrow \quad \pi_j = \frac{n_j}{\lambda} \]

and, since the counts sum to n, the constraint forces

\[ \lambda = \sum_j^k n_j = n \]

which gives us the estimates that minimize the above function

\[ \hat{\pi_1}, \hat{\pi_2},…\hat{\pi_k}\]
\[ \hat{\pi_j} = \frac{n_j}{n} \]

We have just justified the use of the empirical mean as an estimate of the true mean, given a set of observations, using maximum likelihood estimation.
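As a sanity check on this result, here is a minimal Python sketch. The ten responses and the candidate names A, B, C are made up for illustration, and the grid search is only a brute-force stand-in for the Lagrange multiplier computation. It tallies the counts, computes the sample proportions, and compares them with the share vector that maximizes the log-likelihood on a coarse grid.

```python
import math
from collections import Counter

def log_likelihood(counts, pi):
    """Log of prod_j pi_j^{n_j} for counts n_j and candidate shares pi_j."""
    return sum(n_j * math.log(p_j) for n_j, p_j in zip(counts, pi) if n_j > 0)

# Hypothetical poll: n = 10 people choosing among k = 3 candidates.
responses = ["A", "B", "A", "C", "A", "B", "A", "A", "B", "C"]
candidates = ["A", "B", "C"]

tally = Counter(responses)
counts = [tally[c] for c in candidates]   # the n_j
n = len(responses)
pi_hat = [n_j / n for n_j in counts]      # the estimates n_j / n

# Brute force: try every share vector on a grid of step 0.05 that sums to 1.
steps = 20
grid = (
    [a / steps, b / steps, (steps - a - b) / steps]
    for a in range(1, steps)
    for b in range(1, steps - a)
)
best = max(grid, key=lambda pi: log_likelihood(counts, pi))

print("sample proportions:", pi_hat)
print("best grid point:   ", best)
```

On this toy sample the best grid point coincides with the sample proportions, which is what the derivation predicts.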

Reliability

Given the estimator, how much trust can we place in it? How close are the estimated and true values? The law of large numbers tells us that when the size of the sample goes to infinity, the estimated value tends to the true value. But in reality an infinite number of people is never polled; usually just a few thousand are. From one sample to the next the estimate won't be the same. So how confident are the polling institutes in their estimates? The reliability of the estimator depends on the statistical fluctuations of the estimate around the true value.
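To make these fluctuations concrete, the following sketch simulates many polls of the same population and measures how much the estimate moves from one sample to the next. The true share of 28%, the sample sizes and the 200 repetitions are arbitrary choices for illustration, not figures from any real poll.

```python
import random
import statistics

def run_poll(true_share, n, rng):
    """Fraction of n independent respondents who pick the candidate."""
    return sum(rng.random() < true_share for _ in range(n)) / n

rng = random.Random(0)   # fixed seed so the run is repeatable
true_share = 0.28        # hypothetical true vote intention

spread = {}
for n in (100, 1000, 2500):
    estimates = [run_poll(true_share, n, rng) for _ in range(200)]
    spread[n] = statistics.pstdev(estimates)
    print(f"n={n:5d}  mean={statistics.mean(estimates):.4f}  spread={spread[n]:.4f}")
```

The spread of the estimates shrinks as the sample grows, but only slowly: roughly as one over the square root of the sample size.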

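The central limit theorem tells us that the random variables (m is the true mean of a single response and sigma its standard deviation):

\[ \frac{n_j - nm}{\sigma \sqrt{n}} = \frac{\sqrt{n}(\hat{\pi_j} - m)}{\sigma} \]

converge in distribution to a normal random variable with mean 0 and variance 1. Here a single response is 1 with probability p and 0 otherwise, where p is the true value for candidate j, so

\[ m = p = \pi_j, \quad \sigma = \sqrt{p(1-p)} \]

From here we have the approximation: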

\[ P(|\hat{\pi_j} - \pi_j| \leq \epsilon) = P(\frac{\sqrt{n}|\hat{\pi_j} - \pi_j|}{\sqrt{p(1-p)}} \leq \frac{\sqrt{n}\epsilon}{\sqrt{p(1-p)}}) \approx \int_{-a}^{a} g(x)dx\]

where g is the density of the normal distribution and

\[ a = \frac{\sqrt{n}\epsilon}{\sqrt{p(1-p)}} \]

from which we can write

\[ \epsilon = \frac{a\sqrt{p(1-p)}}{\sqrt{n}} \]

So given a confidence level t, we can build a confidence interval that contains the true value with probability t. Since p is unknown, we use the bound sqrt(p(1-p)) ≤ 1/2, which holds for every p and gives the widest interval:

\[ I_t = [\hat{\pi}_{j,n} - \frac{a_t}{2\sqrt{n}}, \hat{\pi}_{j,n} + \frac{a_t}{2\sqrt{n}}] \]

where

\[ \int_{-a_t}^{a_t} g(x)dx = t\]

The value of a can be found in a table of the normal distribution. For t equal to 95% the value of a is 1.96, so a/2 is approximately 1, which gives us the interval in which the true value lies with a 95% chance:

\[ I_{0.95} \approx [\hat{\pi}_{j,n} - \frac{1}{\sqrt{n}}, \hat{\pi}_{j,n} + \frac{1}{\sqrt{n}}] \]
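The numbers in this derivation can be reproduced with a few lines of Python. This is a sketch rather than a reference implementation: the helper names `quantile` and `margin` are made up here, and the value p = 0.28 is an arbitrary example. It uses `statistics.NormalDist` from the standard library in place of a printed table.

```python
from math import sqrt
from statistics import NormalDist

def quantile(t):
    """a_t such that a standard normal lies in [-a_t, a_t] with probability t."""
    return NormalDist().inv_cdf((1 + t) / 2)

def margin(n, t=0.95, p=None):
    """Half-width of the interval; p=None uses the worst case sqrt(p(1-p)) <= 1/2."""
    sd = 0.5 if p is None else sqrt(p * (1 - p))
    return quantile(t) * sd / sqrt(n)

n = 1000
print(f"a_0.95            = {quantile(0.95):.3f}")
print(f"worst-case margin = {margin(n):.4f}")
print(f"rule of thumb     = {1 / sqrt(n):.4f}")
print(f"margin at p=0.28  = {margin(n, p=0.28):.4f}")
```

For a poll of 1000 people the worst-case margin is about 3 points either way, and the 1/sqrt(n) rule of thumb is slightly wider than that, so it errs on the cautious side.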

In Practice

With the simple slider and graphic you can see for yourself how the number of people polled influences an opinion poll. The box represents the 95% confidence interval, meaning there is a 95% chance that the true vote intention lies within it.

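[Interactive chart: slider for the number of people polled, initially 1000, with a box showing the 95% confidence interval]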
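The same relationship can be read the other way round: how many people must be polled to reach a given precision? The sketch below inverts the 1/sqrt(n) rule of thumb derived earlier. The function name `sample_size` and the choice to express the margin in percentage points are assumptions made for this example.

```python
from math import ceil

def sample_size(margin_points):
    """Smallest n whose 95% margin 1/sqrt(n) is at most margin_points (in % points)."""
    return ceil((100 / margin_points) ** 2)

for points in (5, 3, 2, 1):
    print(f"+/- {points} points needs about {sample_size(points)} people")
```

Halving the margin requires four times as many respondents, which is why polls rarely go far beyond a few thousand people.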

Conclusion

This quick refresher should remind us that opinion polls can be quite far from reality. And there is one variable that I didn't take into account: some people don't dare to name the candidate they really plan to vote for. This is especially true for extremist votes, and it helps explain why we can see such a discrepancy between the estimates and the actual results of an election.
