
Social Network Analysis of Twitter Topic-Network Structures during the 2019 European Elections


(Figure: graph of the populism topic-network.)

This study explores how Twitter users interact differently depending on the topic of discussion in online conversations.

We examine how node-level network topology measures may affect the likelihood of interaction and the extent to which two nodes interact in distinct topic-network structures. We collect 21 million sampled tweets via the Twitter API, tracking more than 700 keywords during the 2019 European Elections. We perform our analysis on a final data set of 2,259,717 tweets on five politics-related topics: brexit, populism, refugees, terrorism, and unemployment.

A more in-depth explanation of our methodology can be found in Section 3 of this paper. In the releases tab you can find a pre-compiled PDF version of the paper, as well as the source code needed for reproduction.

Project Structure

This project is structured in five sequential stages.

  1. code/data_collection: First, primary data is collected in real time via the Twitter real-time filter API endpoints, then stored in a MongoDB (NoSQL) database for real-time querying and visualization, as well as in an Amazon Web Services (AWS) S3 bucket for long-term storage.

Check out Electionstats (https://electionstats.eu), our side project: a dashboard that visualizes Twitter data at the aggregate level in real time during the 2019 EU election.

  2. code/data_prep: Second, data is cleaned, selected, and prepared with a five-step pipeline in Python. These scripts transform the raw data set into nodes and edges tables for each analyzed topic.

  3. plots: Then, these tables are imported into Gephi 0.9.2 to visualize their graph structures and compute network measures.

  4. code/data_analysis: Subsequently, the tables updated with the measures exported from Gephi are imported into R. The R scripts construct adjacency lists and run three generalized linear models for each topic, with node-level network metrics as covariates: first, a multiple regression model on the number of interactions for every pair of users; second, a logistic model assessing the probability of interaction between two users; finally, a conditional (filtered) multiple regression model on the number of interactions, restricted to pairs of users that actually interacted.

  5. paper: Lastly, we write up the study as a LaTeX paper.
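The dyad-level outcomes used by the three models in stage 4 can be sketched in Python (a toy illustration with invented data and names; the actual construction happens in the R scripts under code/data_analysis):

```python
from collections import Counter

def dyad_outcomes(interactions):
    """From an edge list of (source, target) user pairs, build the two
    outcomes the models use: an interaction count per pair (for the
    multiple regression) and a 0/1 indicator (for the logistic model)."""
    counts = Counter(interactions)
    indicator = {pair: 1 for pair in counts}  # 1 if the pair ever interacted
    return counts, indicator

# Toy edge list: one entry per retweet/reply/mention.
interactions = [("a", "b"), ("a", "b"), ("b", "c")]
counts, indicator = dyad_outcomes(interactions)
print(counts[("a", "b")], indicator.get(("a", "c"), 0))  # 2 0
```

The conditional model in the third step then simply restricts the count outcome to the pairs where the indicator is 1.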

After execution, this is how your directory tree should look:

eu2019
│   README.md
│   LICENSE.md
│   initialize.py # This file initializes the directories and runs the data preparation pipeline.
│
└───assets
│    ├──   raw_data
│    │      ├── master_data.json # Raw Twitter data (access on request)
│    │      └── tbc_topic_*.csv # List of tracked keywords per topic (one file for each)
│    ├──   processed_data
│    │      ├── conv_id.txt # List of user IDs that are part of a conversation
│    │      ├── filtered_*.json # Filtered tweets per topic (one file for each)
│    │      └── filtered_parsed_*.csv
│    ├──   derived
│    │      ├── nodes_*.csv # Nodes tables per topic
│    │      └── edges_*.csv # Edges tables per topic
│    └──   temp
└───code
│    ├──   data_collection # Collection from API, storage and queries to MongoDB
│    │      ├── 1_tweepy_collection.py
│    │      ├── 2_insertToMongo_S3_deleteFile.py
│    │      ├── 3_QueryMongo_Lang.py
│    │      ├── 4_QueryMongo_Topics.py
│    │      ├── 5_QueryMongo_Parties.py
│    │      ├── 6_QueryMongo_Time.py
│    │      ├── 7_QueryMongo_EPGroups.py
│    │      └── migration.py  
│    ├──   data_prep # Five-step Python pipeline
│    │      ├── dataPrep.py
│    │      ├── getConversationId.py
│    │      ├── filterConversation.py
│    │      ├── parse.py
│    │      ├── getNodes.py
│    │      ├── getEdges.py
│    │      ├── sample.py
│    │      ├── sample-nodes.py
│    │      └── randomize.py  
│    ├──   data_analysis
│    │      └── regression
│    │           ├── regression.Rproj
│    │           ├── looping-model.R
│    │           ├── mean-center-nodes.R
│    │           ├── get-aggregate-coef.R
│    │           ├── descriptive-stats.R
│    │           ├── correlation-plots.R
│    │           ├── single-topic-visualization.R
│    │           ├── single-summary-stats.R
│    │           └── *_output.txt # Regression output tables (txt and TeX)
│    └──   temp
└───paper
│    ├──   section-1.tex
│    ├──   section-2.tex
│    ├──   section-3.tex
│    ├──   section-4.tex
│    ├──   section-5.tex
│    ├──   section-6.tex
│    ├──   appendix-a.tex
│    ├──   appendix-b.tex
│    ├──   bib.bib
│    ├──   tex-paper.tex
│    ├──   tex-paper.pdf
│    └──   images
│           └── *.png
└───plots
     └──   *.png
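The tbc_topic_*.csv files above list the tracked keywords per topic. As an illustration, assigning a tweet to topics by keyword matching might look like this sketch (keywords and function name invented; the real filtering lives in code/data_prep and the MongoDB query scripts):

```python
def match_topics(text, topic_keywords):
    """Return the topics whose keywords appear in the tweet text."""
    text = text.lower()
    return [topic for topic, kws in topic_keywords.items()
            if any(kw in text for kw in kws)]

# Hypothetical, much-shortened keyword lists.
topic_keywords = {
    "brexit": ["brexit", "article 50"],
    "refugees": ["refugee", "asylum"],
}
print(match_topics("No-deal Brexit looms", topic_keywords))  # ['brexit']
```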

Requirements

Data Collection

  • A Twitter Developer account with access to the Twitter API
  • An always-on machine (AWS EC2)
  • Python 3.5+
  • An S3 bucket on AWS
  • AWS CLI correctly configured
  • SnakeTail for logging (optional)

Use the package manager pip to install Tweepy:

pip install tweepy
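A minimal collection sketch, assuming the Tweepy 3.x streaming API (credentials, MongoDB, and S3 details omitted; the project's actual collector is code/data_collection/1_tweepy_collection.py):

```python
import json

try:
    import tweepy  # only needed for live collection
except ImportError:
    tweepy = None

TRACKED = ["brexit", "populism"]  # the project tracks 700+ keywords from tbc_topic_*.csv

def to_record(status_json):
    """Keep only the fields this sketch needs from a raw status payload."""
    s = json.loads(status_json)
    return {"id": s["id_str"], "user": s["user"]["screen_name"], "text": s["text"]}

if tweepy is not None:
    class Collector(tweepy.StreamListener):
        def on_data(self, raw):
            record = to_record(raw)
            # ... append to a local JSON file, insert into MongoDB, upload to S3
            return True

    # auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)  # your credentials
    # auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    # tweepy.Stream(auth, Collector()).filter(track=TRACKED)
```

Note that Tweepy 4.x replaced `StreamListener` with `tweepy.Stream` subclassing, so this sketch applies to the 3.x releases current at the time of the project.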

Data Preparation

Use the package manager pip to install the following libraries:

pip install pandas numpy pathlib boto3
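The nodes/edges construction at the end of the pipeline can be illustrated with pandas (column names invented here; see getNodes.py and getEdges.py for the real logic):

```python
import pandas as pd

# Toy parsed interactions (in the pipeline: filtered_parsed_*.csv).
tweets = pd.DataFrame({
    "source": ["a", "a", "b"],
    "target": ["b", "b", "c"],
})

# Edges table: one row per (source, target) pair with an interaction weight.
edges = (tweets.groupby(["source", "target"])
               .size()
               .reset_index(name="weight"))

# Nodes table: every user appearing as source or target.
nodes = pd.DataFrame({"id": pd.unique(tweets[["source", "target"]].values.ravel())})
```

Tables in this shape can be imported directly into Gephi as nodes and edges spreadsheets.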

Data Analysis

  • R (RStudio is strongly suggested)

Use install.packages() to install the following packages:

install.packages(c("tidyverse", "modelr", "broom", "bit64", "data.table",
                   "stargazer", "censReg", "car", "ggpubr", "Hmisc", "corrplot"))

Usage

Data Preparation Pipeline

Automated execution is available for the data preparation pipeline only:

  1. Download the files from the releases tab into one directory.
  2. Run initialize.py.

Want more?

You can find the source code for our real-time dashboard here.

Contributing

Feel free to reproduce this study or emulate this setup for your own project. Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Restricted Data Files

You can request access to the raw data or derived files by email at the following address:

a.d.antonacci [AT] tilburguniversity [DOT] edu

License

All materials are licensed under a Creative Commons CC-BY-SA-4.0 license.
