gen-better-polls's Introduction

GEN Hackathon - Better Poll Visualization

At the Süddeutsche Zeitung Editor's Lab we were working on a better way to deal with opinion polls.

In autumn 2017, Germany will hold its next general election. In the months to come, opinion polls will become an even more important component of reporting on German politics.

Traditionally, media outlets report on a new poll in the following style: If the election were held today, party x would get y per cent of the votes. This is a decline of z per cent compared to the previous week.

This has two major shortcomings.

Polling data are blurry

As is the case with a lot of data, we readers are tempted to take poll figures as fact, simply because they are reported with decimal places. But polls carry an uncertainty that has two main sources:

  1. Most of the time, only the key figure is communicated: the mean value. Statistically, this value lies within an error range. A better approach is to publish the mean along with a confidence interval, which can be read as: "With 95 per cent confidence, the party's result lies between 10 and 15 per cent."

  2. Every polling institute has its own way of conducting its survey: How big is the sample size? How do they weight different demographics? How do they treat undecided voters or non-voters? Therefore every survey is wrong in its own way. But on an aggregate level they provide valid information about potential voting patterns of the electorate.

Therefore a smarter way of reporting on opinion polls is to use as much data as possible.

Data Source

The most comprehensive overview of German opinion polls can be found on Wahlrecht.de, a website maintained by volunteers.

Calculation of the confidence interval

The data on Wahlrecht.de include each party's survey result and the sample size. This makes it possible to calculate a standard error (se) and a confidence interval (ci) from the party result p.

The corresponding formula:

se = sqrt(p * (1-p) / n)

where se is the standard error, p the survey result (as a proportion between 0 and 1) and n the sample size.

Assuming the data follow a normal distribution, and using a significance level of 0.05 (z-value of 1.96), the confidence interval can be computed:

Half-width of the ci: delta = 1.96 * se = 1.96 * sqrt(p * (1-p) / n)

Lower limit: ci_lower = p - delta

Upper limit: ci_upper = p + delta
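The formulas above can be sketched in a few lines of base R. The helper name `poll_ci` and the example numbers are made up for illustration; this is not the code used in the project's task scripts.

```r
# Hypothetical helper: 95% confidence interval for a single poll result.
# p is the party's share as a proportion (0..1), n is the sample size.
poll_ci <- function(p, n, z = 1.96) {
  se <- sqrt(p * (1 - p) / n)  # standard error of a proportion
  delta <- z * se              # half-width of the confidence interval
  list(se = se, delta = delta, ci_lower = p - delta, ci_upper = p + delta)
}

# Made-up example: 32 per cent in a poll of 1,000 respondents.
ci <- poll_ci(p = 0.32, n = 1000)
print(round(unlist(ci), 4))
```

With these made-up inputs the half-width comes out at roughly 2.9 percentage points, which shows how wide the error range is even for a fairly typical sample size.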

Calculation of a weighted average

In order to offer a single value, we compute an average of the latest poll from every polling institute included in our data set (in March 2017, seven institutes in total). Instead of a simple arithmetic mean, we use a weighted average, with the weights given by the sample size of each poll. So with individual survey results p_1 ... p_k and sample sizes n_1 ... n_k, the average is:

p = (p_1 * n_1 + p_2 * n_2 + ... + p_k * n_k) / (n_1 + n_2 + ... + n_k)
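As a minimal sketch of this weighted average in base R (the helper name `weighted_poll_average` and the three poll results are invented for illustration):

```r
# Hypothetical helper: sample-size-weighted mean of poll results.
weighted_poll_average <- function(p, n) {
  stopifnot(length(p) == length(n), all(n > 0))
  sum(p * n) / sum(n)
}

# Made-up results for one party from three institutes.
p <- c(0.30, 0.32, 0.34)      # survey results as proportions
n <- c(1000, 2000, 1000)      # sample sizes
avg <- weighted_poll_average(p, n)
print(avg)
```

Base R's `weighted.mean(p, n)` computes the same quantity; the explicit version is shown here only to mirror the formula.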

Calculation of the total error bars

Now we have an individual error bar for every survey and a weighted average of all the surveys. What is missing are the error bars for the average. These are calculated with linear error propagation. We use linear propagation instead of squared propagation because we assume that the different surveys are not statistically independent. So with individual survey errors delta_1 ... delta_k and sample sizes n_1 ... n_k, the error of the average is:

delta = (delta_1 * n_1 + delta_2 * n_2 + ... + delta_k * n_k) / (n_1 + n_2 + ... + n_k)

And for the total confidence interval we get

Lower limit: ci_lower = p - delta

Upper limit: ci_upper = p + delta
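The following base R sketch puts the pieces together for one party. The function name `combined_poll` and the input numbers are assumptions for illustration. The `delta_squared` value is included only for comparison: it is the textbook propagation for independent surveys, which the approach described here deliberately does not use.

```r
# Hypothetical helper: weighted average with linearly propagated error bars.
combined_poll <- function(p, n, z = 1.96) {
  delta_i <- z * sqrt(p * (1 - p) / n)   # half-width of each survey's ci
  p_avg <- sum(p * n) / sum(n)           # sample-size-weighted average
  delta <- sum(delta_i * n) / sum(n)     # linear error propagation
  # For comparison only: squared propagation, valid for independent surveys.
  delta_squared <- sqrt(sum((delta_i * n)^2)) / sum(n)
  c(p = p_avg, delta = delta, delta_squared = delta_squared,
    ci_lower = p_avg - delta, ci_upper = p_avg + delta)
}

# Made-up results for one party from three institutes.
res <- combined_poll(p = c(0.30, 0.32, 0.34), n = c(1000, 2000, 1000))
print(round(res, 4))
```

The linear error is never smaller than the squared one, so treating the surveys as dependent gives the more cautious (wider) error bars.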



Display of results

Out of that data, we produce two different graphics. One shows the current political mood, using only the latest poll from every institute. The other one shows the development over a longer time, calculating for every day the average of the latest polls available on that day.
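The day-by-day calculation for the long-term chart could look roughly like the base R sketch below. The column names (`datum`, `institut`, `anteil`, `befragte`) follow the snippet quoted in the issues section of this page; the data, the institute labels and the helper `average_on` are invented, and the real task scripts may work differently.

```r
# Made-up polls for a single party.
polls <- data.frame(
  datum    = as.Date(c("2017-03-01", "2017-03-03", "2017-03-05")),
  institut = c("A", "B", "A"),
  anteil   = c(0.30, 0.34, 0.32),   # survey result as a proportion
  befragte = c(1000, 1000, 2000),   # sample size
  stringsAsFactors = FALSE
)

# Weighted average on a given day, using each institute's latest poll
# published on or before that day.
average_on <- function(df, day) {
  past <- df[df$datum <= day, ]
  if (nrow(past) == 0) return(NA_real_)
  past <- past[order(past$datum, decreasing = TRUE), ]
  latest <- past[!duplicated(past$institut), ]   # newest row per institute
  sum(latest$anteil * latest$befragte) / sum(latest$befragte)
}

days <- seq(as.Date("2017-03-01"), as.Date("2017-03-05"), by = "day")
trend <- data.frame(
  datum = days,
  roll_avr = vapply(seq_along(days),
                    function(i) average_on(polls, days[i]), numeric(1))
)
print(trend)
```

On days without a new poll the average simply carries forward, because each institute's most recent result stays in the calculation until it is replaced.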

Who would you vote for if the federal election were held on Sunday?

Poll results do not provide exact values; they indicate a range within which a party's result probably lies. The institutes use different methods, which lead to different results. The line shows the weighted mean of the most recent poll from each of seven institutes.

Who would you vote for if the federal election were held on Sunday?

Poll results do not provide exact values; they indicate a range within which a party's result probably lies. The institutes use different methods, which lead to different results. The bars show the weighted mean of the most recent poll from each of seven institutes.

Source: http://www.wahlrecht.de



Usage

Requirements

NodeJS 4+, R 3.3+

Installation

Ensure you have NodeJS and R installed. Then run Rscript install.R in the project folder to install the required packages.

This project consists of a set of small scripts that may be used independently of each other. See the tasks directory.

To completely build the project run either Rscript main.R or npm start. This will

  • scrape the data from the given website and generate a file data/data-input-longform.csv
  • perform the statistical transformation and save the result in data/data-latest-average.csv
  • create a visualization of the transformed data in data/assets/plot.svg and
  • update the README.md with the latest scraped images and markdown snippets from data/*.md

These tasks are also mapped in the package.json and may be started using npm run <task>

Task: scraper

To scrape the poll data from www.wahlrecht.de, run Rscript tasks/scrape-wahlrechtde-umfragen.R. This will create a table at data/data-input-longform.csv.

Task: calculations

Rscript tasks/calculations-latest_polls_weights.R transforms the table in data/data-input-longform.csv using our statistical method and stores the following result:

  • data/data-rolling-average-and-error.csv (data for the time-based chart)

Task: plot

In order to visualize the data in data/data-rolling-average-and-error.csv, run Rscript tasks/chart-longterm-polls.R and Rscript tasks/chart-sunday-polls.R. This will create the images above and store them in data/assets/.

FAQ

  • Some errors may be solved by running the scripts (such as main.R) in RStudio instead of on the command line.

gen-better-polls's People

Contributors

benurb, mschories

gen-better-polls's Issues

SZoSansCond-Light etc. Fonts

Your R code appears to use some proprietary fonts, such as SZoSansCond-Light. Are these available to the public? When running your code, I had to comment out all of these font statements for it to run.

weighted_average function in calculations-latest_polls_weights.R

You may also want to consider using the summarise function instead of mutating an entire column and then extracting the first element, which seems a bit clumsy. This also goes for the weighted_error function. For example:

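library(dplyr)

# Weighted average of each institute's latest poll for one party up to date_in
weighted_average <- function(df_in, party_in, date_in) {
  df_in %>%
    as_tibble() %>%                        # tbl_df() is deprecated in current dplyr
    filter(datum <= date_in) %>%           # ignore polls published after date_in
    filter(partei == party_in) %>%
    group_by(institut) %>%
    filter(datum == max(datum)) %>%        # keep the latest poll per institute
    ungroup() %>%
    summarise(roll_avr = sum(befragte * anteil) / sum(befragte)) %>%
    pull(roll_avr)                         # return a plain number
}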

License?

We are interested in using the code in this repository (especially the scraper).
If you are fine with this, could you add a LICENSE to this repository that would allow the usage?
