Summer School: Big Data Analysis in the Social Sciences

Part of the

2017 ECPR Summer School in Methods and Techniques

CEU, Budapest, August 7 - August 11, 2017

Instructors

Pablo Barberá
TA: Juraj Medzihorsky

Outline

Massive-scale datasets from web sources and social media, newly digitized text sources, and large longitudinal survey studies present exciting opportunities for the study of social and political behaviour, but at the same time their size and heterogeneity present significant challenges. This course will introduce participants to new computational methods and tools required to explore and analyse Big Data in the social sciences using the R programming language. It will be structured around techniques to deal with the 3 V's of Big Data: volume, variety, and veracity. First, we will cover the basics of parallel programming and cloud computing to analyse large-scale datasets. Second, we will learn how to scale human tasks through the use of machine learning methods. Finally, we will discuss how to automatically discover insights from large text and network datasets and validate the output of this analysis. The course will follow a "learning-by-doing" approach, with short theoretical sessions followed by "data challenges" where participants will need to apply new methods.

Additional information and the schedule are available on the ECPR Summer School website.

Setup and Preparation

There are two ways you can follow the course and run the code contained in this GitHub repository. The recommended method is to connect to the provided RStudio server where all the R packages have already been installed, and all the R code is available. To access the server, visit bigdata.pablobarbera.com and log in with the information provided during class.

Alternatively, you can run the code on your own laptop. You will need R and RStudio installed.

If you're using your own laptop, you can download the course materials by clicking on each link in this repository, download the repository as a zip file, or "clone" it, using the buttons found on the right side of your browser window as you view this repository. If you do not have a git client installed on your system, you will need to get one here and also make sure that git is installed.

You can also subscribe to the repository if you have a GitHub account, which will send you updates each time new changes are pushed to the repository.

Day 1

The course will begin with a discussion of the concept of "Big Data" and the research opportunities and challenges of the use of massive-scale datasets in the social sciences. The first session will also provide a foundation of R coding skills upon which we will rely during the rest of the course. Here, we will go over existing packages to efficiently analyze large-scale datasets in R, how to parallelize for loops, and how to read and write large files.
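
As a rough illustration of what parallelizing a for loop can look like, the sketch below uses the `parallel` package that ships with R. It is not taken from the course materials: the function `slow_square`, the two-worker cluster, and the 100-iteration task are assumptions chosen only to keep the example small.

```r
library(parallel)

# Hypothetical task: an artificially slow function of one input
slow_square <- function(i) {
  Sys.sleep(0.01)  # stand-in for real work
  i^2
}

# Sequential version: a plain for loop filling a preallocated vector
seq_result <- numeric(100)
for (i in 1:100) {
  seq_result[i] <- slow_square(i)
}

# Parallel version: the same iterations split across two worker processes
cl <- makeCluster(2)                      # assumed number of workers
par_result <- unlist(parLapply(cl, 1:100, slow_square))
stopCluster(cl)                           # always release the workers

# Both versions should produce the same values
all.equal(seq_result, par_result)
```

Each iteration here is independent of the others, which is what makes the loop a candidate for this treatment. For very fast iterations, the cost of starting workers and moving data can outweigh the gain.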

Slides: Big Data in the social sciences (.pdf)

Slides: Efficient data analysis with R (.pdf)

Code: Efficient programming with R (.html)

Challenge 1: Writing more efficient code

Code: Parallel computing with R (.html)

Challenge 2: Parallel computing

Day 2

The second session will focus on the most common application of Big Data in the social sciences: large-scale text classification. After a quick overview of the basics of machine learning, we will discuss specific details of the implementation of supervised learning algorithms in massive-scale datasets, and in particular recently-developed methods in computer science such as stochastic gradient descent, xgboost, and ensemble classifiers. Our emphasis will lie on the practical aspects: we will study these methods in the context of an application of sentiment analysis to newspaper articles, and will go through the entire research process, from the creation of a training dataset labeled by humans using crowd-sourcing platforms, to the application and validation of the machine learning algorithm, and passing through all the intermediate steps, such as cleaning and preprocessing the corpus of documents.
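
To make the idea behind stochastic gradient descent concrete, here is a minimal base R sketch that fits a logistic regression one observation at a time and compares the result with `glm`. The simulated data, the helper `sgd_logit`, the number of epochs, and the step-size schedule are all assumptions for illustration, not the implementation used in the course.

```r
set.seed(42)

# Simulated data (an assumption): intercept plus two features
n <- 2000
X <- cbind(1, rnorm(n), rnorm(n))
beta_true <- c(-0.5, 2, -1)
y <- rbinom(n, 1, plogis(as.vector(X %*% beta_true)))

# Hypothetical helper: logistic regression fitted by stochastic gradient descent
sgd_logit <- function(X, y, epochs = 30, rate = 0.1) {
  beta <- rep(0, ncol(X))
  for (e in seq_len(epochs)) {
    step <- rate / sqrt(e)                 # shrink the step size each epoch
    for (i in sample(nrow(X))) {           # visit observations in random order
      p <- plogis(sum(X[i, ] * beta))      # predicted probability for one row
      beta <- beta + step * (y[i] - p) * X[i, ]  # single-row gradient update
    }
  }
  beta
}

beta_sgd <- sgd_logit(X, y)
beta_glm <- unname(coef(glm(y ~ X - 1, family = binomial)))

# The two sets of estimates should be close, though not identical
round(rbind(sgd = beta_sgd, glm = beta_glm), 2)
```

Because each update touches a single row, the full dataset never has to be processed at once, which is the property that makes this family of methods attractive for massive-scale data.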

Slides: Supervised machine learning (.pdf)

Code: Regularized regression

Code: SVM, Random Forests, and beyond

Challenge 1: Text classification

Code: Large-scale text classification

Slides: Creating training datasets (.pdf)

Challenge 2: Crowd-sourcing the creation of datasets

Day 3

Exploratory data analysis can be a powerful tool for social scientists when they are interested in analyzing a new dataset. The third session will cover the existing tools for large-scale discovery in "Big Data" using R, applied to textual datasets. We will start with different techniques, such as collocation analysis, TF-IDF feature weighting and word2vec, which will allow us to identify salient themes and ideas across documents. Then, we will move to topic models, which allow researchers to automatically identify latent classes of documents in a corpus, with an application to the classification of Facebook posts by politicians into relevant political issues. This session will also cover other dimensionality reduction techniques that are commonly used in the social sciences to visualize large-scale datasets.
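
As a small worked example of TF-IDF feature weighting, the base R sketch below builds a document-term matrix for three toy documents and reweights it. The documents are invented for illustration, and the weighting shown (raw term count times the log of inverse document frequency) is only one of several common variants; dedicated text analysis packages may use different defaults.

```r
# Toy corpus (an assumption for illustration)
docs <- c(d1 = "tax cuts and jobs",
          d2 = "jobs and health care",
          d3 = "health care reform and tax reform")

# Tokenize on spaces and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Term frequencies: one row per document, one column per term
tf <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

# Inverse document frequency: log(number of documents / documents containing the term)
df <- colSums(tf > 0)
idf <- log(nrow(tf) / df)

# TF-IDF: scale each term's column by its idf
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 2)
```

The word "and" appears in every document, so its weight is zero everywhere, while "reform" appears twice in a single document and receives the largest weight. This is how the weighting pushes distinctive terms to the top.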

Slides: Topic discovery in text datasets (.pdf)

Code: Discovery in text

Challenge 1: Discovery in text

Code: Topic models

Code: STM

Code: Ideological Scaling with Wordfish

Day 4

In the fourth session we will turn our attention to social networks, and in particular the detection of communities of individuals with shared interests or political preferences. The running example in this part of the course will be the classification of Twitter users along a latent ideological dimension based on the structure of the networks in which they are embedded. A theme common to this session and the previous one will be the emphasis on validation: once an unsupervised model is completed, how can we measure the quality of the results? We will discuss basic concepts of measurement theory, and best practices in the validation of the results from unsupervised statistical models.

Slides: Advanced network analysis (.pdf)

Code: Community detection in networks (.html)

Challenge 1: Community detection

Code: Latent-space models (.html)

Challenge 2: Latent-space models

Day 5

The course will conclude with an introduction to cloud computing and database management for social scientists. Most available online resources and courses on these topics assume students are proficient in UNIX or have a background in programming. Here, however, we will start from scratch and focus on the coding skills required to conduct statistical analyses with data hosted in the “cloud”, while at the same time helping participants become familiar with programming concepts that can facilitate future collaborations with computer scientists. We will cover the most important commands in UNIX, which are required, for example, to interact with the High-Performance Clusters (HPC) now available at most universities, and test our skills in an online virtual machine hosted on Amazon Elastic Compute Cloud (EC2). In the second half of this session, we will learn the basics of SQL, and run our own queries on a dataset with over a billion geolocated tweets hosted in Google BigQuery.

Slides: Introduction to cloud computing (.pdf)

Code: UNIX and cloud computing (.html)

Slides: SQL databases (.pdf)

Code: Querying an SQL database (.html)

Challenge 1: Querying an SQL database (.html)

Code: Introduction to Google BigQuery (.html)

Challenge 2: Querying billion-row datasets with SQL and Google BigQuery

Code: Setting up RStudio Server on Amazon Web Services

Slides: Creating training datasets (.pdf)

Challenge 3: Crowd-sourcing the creation of datasets