
Collect historical tweets with Airflow

This repository contains an Airflow workflow for the collection of historical tweets based on a search query. The method combines the package getOldTweets3 with the Twitter status lookup API. Airflow is used to programmatically author, schedule and monitor the collection process.

The Twitter Standard Search API only searches against a sampling of recent tweets published in the past 7 days, which makes it of little use for collecting historical tweets. The package getOldTweets3 is a Twitter web-scraping package that was developed to solve this problem. It can be used to collect variables like "id", "permalink", "username", "to", "text", "date" (in UTC), "retweets", "favorites", "mentions", "hashtags" and "geo". Unfortunately, not all relevant variables are returned and the data can be a bit messy (e.g. broken URLs). To collect the full set of variables, we use the Twitter status lookup API. This API is less restrictive than the Twitter Standard Search API, but it requires a list of status identifiers as input. These identifiers are collected with getOldTweets3 and passed to the lookup API. The result is a complete set of information on every tweet collected.
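As an illustration of this two-stage approach, the sketch below first scrapes tweet identifiers with getOldTweets3 and then hydrates them through tweepy's status lookup endpoint. The query, date range and maximum number of tweets are arbitrary examples and the credential placeholders must be replaced with real keys; this is not the repository's actual code.

import GetOldTweets3 as got
import tweepy

# Stage 1: scrape candidate tweets (and their status ids) for a search query.
criteria = (got.manager.TweetCriteria()
            .setQuerySearch("climate change")   # example query
            .setSince("2018-01-01")
            .setUntil("2018-02-01")
            .setMaxTweets(100))
scraped = got.manager.TweetManager.getTweets(criteria)
tweet_ids = [tweet.id for tweet in scraped]

# Stage 2: pass the ids to the status lookup API (at most 100 ids per request).
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

full_tweets = []
for i in range(0, len(tweet_ids), 100):
    full_tweets.extend(api.statuses_lookup(tweet_ids[i:i + 100]))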

[Figure: the Twitter collection DAG]

The workflow consists of several steps (Operators in Airflow). The first step is the collection of tweets with getOldTweets3 (get_old_tweets_*). Because the result of getOldTweets3 is not always complete, this step is run 3 times (see the DAG file to change the number of runs). The next step, merge_get_old_tweets, finds the unique status identifiers across the 3 runs. In lookup_tweets, each status identifier is passed to the Twitter status lookup API and the results are stored in the folder lookup/. In the last step, validate_get_old_tweets, the completeness of the lookup process is evaluated: the identifiers returned by the lookup API are compared with the identifiers collected in the getOldTweets3 runs and the completeness is reported.
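A minimal sketch of how these tasks could be wired together in Airflow 1.x is shown below. The task ids follow the description above, but the Python callables are placeholders for the repository's actual task functions.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


# Placeholder callables; the real scraping, merging, lookup and validation
# logic lives in the repository's task modules.
def collect_tweets(**context):
    pass

def merge_ids(**context):
    pass

def lookup_ids(**context):
    pass

def validate_ids(**context):
    pass


dag = DAG(
    dag_id="tweet_collector",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@monthly",   # one interval per month, useful for backfills
)

get_old_tweets = [
    PythonOperator(task_id=f"get_old_tweets_{i}", python_callable=collect_tweets,
                   provide_context=True, dag=dag)
    for i in range(1, 4)            # the scraper is run 3 times
]
merge = PythonOperator(task_id="merge_get_old_tweets", python_callable=merge_ids,
                       provide_context=True, dag=dag)
lookup = PythonOperator(task_id="lookup_tweets", python_callable=lookup_ids,
                        provide_context=True, dag=dag)
validate = PythonOperator(task_id="validate_get_old_tweets", python_callable=validate_ids,
                          provide_context=True, dag=dag)

get_old_tweets >> merge >> lookup >> validate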

Installation and preparation

This project runs on Python 3.6+ and depends on tools like Airflow, tweepy and getOldTweets3. Install all the dependencies from the requirements.txt file.

pip install -r requirements.txt

Create a json file with your Twitter credentials (e.g. ~/Credentials/twitter_cred.json). Read more about Twitter access tokens on the Twitter developer documentation.

{
    "consumer_key":"XXXXXXXXXXXXXXXXXXXXXXXXX",
    "consumer_secret":"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "access_token":"00000000-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "access_token_secret":"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}

Initialise and start Airflow. Please read the Airflow documentation if you are not familiar with setting it up.

export AIRFLOW_HOME=/PATH/TO/YOUR/PROJECT/getOldTweets_airflow

# initialize the database
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080

# start the scheduler
airflow scheduler

Open another terminal and add the Twitter credentials to the environment variables.

export TWITTER_CREDENTIALS=~/Credentials/twitter_cred.json
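
Inside the collection tasks, the credentials file can then be loaded from this environment variable. A minimal sketch, assuming tweepy's OAuth handler is used to build the API client:

import json
import os

import tweepy

# Read the path from TWITTER_CREDENTIALS and load the JSON file created above.
with open(os.path.expanduser(os.environ["TWITTER_CREDENTIALS"])) as f:
    cred = json.load(f)

auth = tweepy.OAuthHandler(cred["consumer_key"], cred["consumer_secret"])
auth.set_access_token(cred["access_token"], cred["access_token_secret"])
api = tweepy.API(auth, wait_on_rate_limit=True)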

Usage

Airflow can be used to schedule jobs, but also to perform backfill operations. Backfilling is very useful for collecting historical tweets. By default, the pipeline is split into monthly intervals.

Edit the search query in the file dag/dag_tweet_search.py by changing QUERY_SEARCH and/or QUERY_LANG. It is recommended to save the file with another file name and dag_id.
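For example, a DAG that collects English tweets about a topic could set the following (the values are illustrative; only the variable names come from the DAG file):

# dag/dag_tweet_search.py (illustrative values)
QUERY_SEARCH = "climate change"   # search query passed to getOldTweets3
QUERY_LANG = "en"                 # restrict the search to English tweets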

The following backfill operation collects all tweets from 2018. The results are stored in 12 different files, one for each month.

airflow backfill tweet_collector -s 2018-01-01 -e 2018-12-31

Navigate to http://localhost:8080/ to monitor the collection process.

[Figure: tree view of the backfill runs in the Airflow UI]

The format of this command is: airflow backfill dag_id -s start_date -e end_date

Results can be found in the output folder.

License

BSD-3

Contact

This is a project by Parisa Zahedi ([email protected], @parisa-zahedi) and Jonathan de Bruin ([email protected], @J535D165).
