
twarc


twarc is a command line tool and Python library for archiving Twitter JSON data. Each tweet is represented as a JSON object that is exactly what was returned from the Twitter API. Tweets are stored as line-oriented JSON. twarc can run in several modes (search, filter stream, hydrate and more, described below), and in each mode it will stop and resume activity as needed to work within the Twitter API's rate limits.

Install

  1. install Python (2 or 3)
  2. pip install twarc

Twitter API Keys

Before using twarc you will need to register an application at apps.twitter.com. Once you've created your application, note down the consumer key, consumer secret and then click to generate an access token and access token secret. With these four variables in hand you are ready to start using twarc.

The first time you run twarc it will prompt you for these keys and store them in a .twarc file in your home directory. Sometimes it can be handy to store authorization keys for multiple Twitter accounts, so you can add multiple profiles to your .twarc file, for example:

[main]
consumer_key=lksdfljklksdjf
consumer_secret=lkjsdflkjsdlkfj
access_token=lkslksjksljk3039jklj
access_token_secret=lksdjfljsdkjfsdlkfj

[another]
consumer_key=lkjsdflsj
consumer_secret=lkjsdflkj
access_token=lkjsdflkjsdflkjj
access_token_secret=lkjsdflkjsdflkj

You then use the other profile with the --profile option:

twarc.py --profile another --search ferguson

twarc will also look for authentication keys in the environment if you would prefer to set them there using the following names:

  • CONSUMER_KEY
  • CONSUMER_SECRET
  • ACCESS_TOKEN
  • ACCESS_TOKEN_SECRET

And finally you can pass the authorization keys as arguments to twarc:

twarc.py --consumer_key foo --consumer_secret bar --access_token baz --access_token_secret bez --search ferguson
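
Whichever way you supply them, these are the same four keys that the library interface (described in "Use as a Library" below) expects. As a minimal sketch, not part of twarc itself, you could read them from the environment variables listed above and build a client:

# A minimal sketch: build a Twarc client from the environment variables
# listed above (assumes all four are set).
import os
from twarc import Twarc

t = Twarc(
    os.environ["CONSUMER_KEY"],
    os.environ["CONSUMER_SECRET"],
    os.environ["ACCESS_TOKEN"],
    os.environ["ACCESS_TOKEN_SECRET"],
)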

Search

When running in search mode twarc will use Twitter's search API to retrieve as many tweets as it can find that match a particular query. For example, to collect all the tweets mentioning the keyword "ferguson" you would run:

twarc.py --search ferguson > tweets.json

This command will walk through each page of the search results and write each tweet to stdout as line-oriented JSON. Twitter's search API only makes (roughly) the last week's worth of tweets available, so time is of the essence if you are trying to collect tweets for something that has already happened.

Search for tweets within a given area

You can specify a search term or omit one to search for all tweets within a given radius of a given latitude/longitude:

twarc.py --search ferguson --geocode 38.7442,-90.3054,1mi
twarc.py --geocode 38.7442,-90.3054,1mi

See the API documentation for more details on how these searches work.

Filter Stream

In filter stream mode twarc will listen to Twitter's filter stream API for tweets that match a particular filter. You can filter by keywords using --track, user identifiers using --follow and places using --locations. Similar to search mode twarc will write these tweets to stdout as line oriented JSON:

Stream tweets containing a keyword

twarc.py --track "ferguson,blacklivesmatter" > tweets.json

Stream tweets from/to users

Note: you must use user identifiers rather than screen names. For example, these are the user ids for @guardian and @nytimes:

twarc.py --follow "87818409,807095" > tweets.json

Stream tweets from a location

Note: the leading dash needs to be escaped in the bounding box or else it will be interpreted as a command line argument!

twarc.py --locations "\-74,40,-73,41" > tweets.json

Note that the syntax for Twitter's filter stream queries is slightly different from queries in the search API, so please consult the documentation on how best to express the filter options you are using. Note: the track, follow and locations options can be combined, which has the effect of a boolean OR.

Hydrate

The Twitter API's Terms of Service prevent people from making large amounts of raw Twitter data available on the Web. The data can be used for research and archived for local use, but not shared with the world. Twitter does allow files of tweet identifiers to be shared, which can be useful when you would like to make a dataset of tweets available. You can then use Twitter's API to hydrate the data, that is, to retrieve the full JSON for each identifier. This is particularly important for the verification of social media research.

In hydrate mode twarc will read a file of tweet identifiers and use Twitter's lookup API to fetch the full JSON for each tweet and write it to stdout as line-oriented JSON:

twarc.py --hydrate ids.txt > tweets.json
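
Going the other way, if you want to produce a file of tweet identifiers from tweets you have already collected (for example, to share a dataset), a small sketch like this will do; it assumes the standard id_str field present in each tweet object:

import json

# Read line-oriented tweet JSON and write one tweet id per line.
with open('tweets.json') as tweets, open('ids.txt', 'w') as ids:
    for line in tweets:
        tweet = json.loads(line)
        ids.write(tweet['id_str'] + '\n')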

Sample Stream

In sample stream mode twarc will listen to Twitter's sample stream API for a random sample of recent public statuses. Similar to search mode and filter stream mode, twarc will write these tweets to stdout as line oriented JSON:

twarc.py --sample > tweets.json

User Timeline

In user timeline mode twarc will use Twitter's user timeline API to collect the most recent tweets posted by the user indicated by screen_name:

twarc.py --timeline screen_name > tweets.json

or by user_id:

twarc.py --timeline_user_id user_id > tweets.json

User Lookup

In user lookup mode twarc will use Twitter's user lookup API to collect fully hydrated user objects for up to 100 users per request as specified by a list of one or more user screen names:

twarc.py --lookup_screen_names screen_names > users.json

or user_ids:

twarc.py --lookup_user_ids user_ids > users.json

Follower Ids

In follower id mode twarc will use Twitter's follower id API to collect the follower user ids for exactly one screen name per request, specified as an argument:

    twarc.py --follower_ids screen_name > follower_ids.txt

The result will include exactly one user id per line. The response order is reverse chronological, or most recent followers first.

This can work in concert with user lookup mode, where you can pass the resulting follower id list to --lookup_user_ids:

    twarc.py --lookup_user_ids `cat follower_ids.txt` > followers.json

Friend ids

Like follower id mode, friend id mode uses Twitter's friend id API to collect the friend user ids for exactly one screen name per request, specified as an argument:

twarc.py --friend_ids screen_name > friend_ids.txt

As in follower id mode, the results will be in reverse chronological order, and they can be fed to --lookup_user_ids in the same way demonstrated above.

Trend modes

Twitter's API offers three calls for fetching information about trends. All three return JSON data, but each has its own focus. To keep things simple, twarc returns the JSON data straight from the API, but with one result per line, unless specified otherwise. Details about each call follow.

Trends available

As of October 2016, Twitter gathers trend information for 467 distinct regions, including "Worldwide", 62 countries, 402 towns, and two areas that don't neatly fit into these categories, Ahsa and Soweto. In trends available mode, twarc returns the entire list of all of these regions from the trends available API:

    twarc.py --trends_available

Each region will be written to stdout, one JSON record per line.
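
Because each line is a standalone JSON record it is easy to post-process. For example, here is a rough sketch that picks out the WOE ids for one country from saved output; the country, woeid and name fields come from Twitter's trends payload, so check them against your own data:

import json

# Assumes you saved the output first:
# twarc.py --trends_available > trends_available.json
with open('trends_available.json') as f:
    for line in f:
        place = json.loads(line)
        if place.get('country') == 'United States':
            print(place['woeid'], place['name'])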

Trends place

The trends place API returns a list of all trends for a particular region (one of the regions listed in the results for --trends_available). It also includes some extra values: as_of and created_at, which are both W3C Date Time compatible timestamps, and a list of the regions for which trends are provided in the results, including a region name and WOE id for each. The API call only accepts one id at a time, though, and the dates do not provide much extra information, so twarc simplifies the response to just the list of applicable trends.

    twarc.py --trends_place WOEID

Each trend will be written to stdout, one JSON record per line.

A variation on this call will exclude hashtags from trend lists:

    twarc.py --trends_place_exclude WOEID

The result format is the same; it will simply exclude hashtags. This is a feature of the Twitter API call, not post-processing in twarc.

Trends closest

The trends closest API returns a list of the regions closest to a specific latitude and longitude. This API call only accepts one latitude/longitude pair per call.

    twarc.py --trends_closest 39.9062,-79.4679

The result format is the same as for trends available.

Archive

When you install twarc you will also get the twarc-archive.py command line tool in addition to twarc.py. It uses twarc as a library to periodically collect data matching a particular search query. It's useful if you don't necessarily want to collect tweets as they happen with the streaming API, and are content to run it periodically from cron to collect what you can. You will want to adjust the schedule so that it runs at least every 7 days (the search API window), and often enough to keep up with the volume of tweets being collected. The script will keep the files organized, and it uses the most recent file to determine when it can stop collecting, so there are no duplicates.

For example, this will collect all the tweets mentioning the word "ferguson" from the search API and write them to a unique file in /mnt/tweets/ferguson:

twarc-archive.py ferguson /mnt/tweets/ferguson

Use as a Library

If you want, you can use twarc programmatically as a library to collect tweets. You first need to create a Twarc instance (using your Twitter credentials), and then use it to iterate through search results, filter results or lookup results.

from twarc import Twarc

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)
for tweet in t.search("ferguson"):
    print(tweet["text"])

You can do the same for a filter stream of new tweets that match a track keyword:

for tweet in t.filter(track="ferguson"):
    print(tweet["text"])

or location:

for tweet in t.filter(locations="-74,40,-73,41"):
    print(tweet["text"])

or user ids:

for tweet in t.filter(follow='12345,678910'):
    print(tweet["text"])

Similarly you can hydrate tweet identifiers by passing in a list of ids or a generator:

for tweet in t.hydrate(open('ids.txt')):
    print(tweet["text"])

Utilities

In the utils directory there are some simple command line utilities for working with the line-oriented JSON, like printing out the archived tweets as text or html, extracting the usernames, referenced URLs, etc. If you create a script that is handy please send a pull request.

When you've got some tweets you can create a rudimentary wall of them:

% utils/wall.py tweets.json > tweets.html

You can create a word cloud of tweets you collected about nasa:

% utils/wordcloud.py tweets.json > wordcloud.html

gender.py is a filter which allows you to filter tweets based on a guess about the gender of the author. So for example you can filter out all the tweets that look like they were from women, and create a word cloud for them:

% utils/gender.py --gender female tweets.json | utils/wordcloud.py > tweets-female.html

You can output GeoJSON from tweets where geo coordinates are available:

% utils/geojson.py tweets.json > tweets.geojson

Optionally you can export GeoJSON with centroids replacing bounding boxes:

% utils/geojson.py tweets.json --centroid > tweets.geojson

And if you do export GeoJSON with centroids, you can add some random fuzzing:

% utils/geojson.py tweets.json --centroid --fuzz 0.01 > tweets.geojson

To filter tweets by presence or absence of geo coordinates (or Place, see API documentation):

% utils/geofilter.py tweets.json --yes-coordinates > tweets-with-geocoords.json
% cat tweets.json | utils/geofilter.py --no-place > tweets-with-no-place.json

To filter tweets by a GeoJSON fence (requires Shapely):

% utils/geofilter.py tweets.json --fence limits.geojson > fenced-tweets.json
% cat tweets.json | utils/geofilter.py --fence limits.geojson > fenced-tweets.json

If you suspect you have duplicates in your tweets you can dedupe them:

% utils/deduplicate.py tweets.json > deduped.json

You can sort by ID, which is analogous to sorting by time:

% utils/sort_by_id.py tweets.json > sorted.json

You can filter out all tweets before a certain date (for example, if a hashtag was used for another event before the one you're interested in):

% utils/filter_date.py --mindate 1-may-2014 tweets.json > filtered.json

You can get an HTML list of the clients used:

% utils/source.py tweets.json > sources.html

If you want to remove the retweets:

% utils/noretweets.py tweets.json > tweets_noretweets.json

Or unshorten urls (requires unshrtn):

% cat tweets.json | utils/unshorten.py > unshortened.json

Once you unshorten your URLs you can get a ranked list of most-tweeted URLs:

% cat unshortened.json | utils/urls.py | sort | uniq -c | sort -nr > urls.txt
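
Writing your own utility along these lines is straightforward, since each input line is a complete tweet. Here is a rough template that prints the most common hashtags; hashtags.py is just a hypothetical name, and the entities.hashtags structure is standard in Twitter's tweet JSON, but check it against your data:

import collections
import json
import sys

# Count hashtag text across all tweets read from stdin.
counts = collections.Counter()
for line in sys.stdin:
    tweet = json.loads(line)
    for tag in tweet.get('entities', {}).get('hashtags', []):
        counts[tag['text'].lower()] += 1

for tag, count in counts.most_common(25):
    print(count, tag)

which you could run like the other utilities:

% cat tweets.json | python hashtags.py > hashtags.txt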

twarc-report

Some further utility scripts to generate CSV or JSON output suitable for use with D3.js visualizations are found in the twarc-report project. The utility directed.py, formerly part of twarc, has moved to twarc-report as d3graph.py.

Each script can also generate an html demo of a D3 visualization, e.g. timelines or a directed graph of retweets.
