twitter-dataset-collector

The project facilitates the distribution of Twitter datasets by downloading sets of tweets (if still available) using their IDs as input, and by providing a simple wrapper around the popular Twitter4j client. It is very similar to the popular twitter-corpus-tools, but it is considerably simpler to use, since it has very few dependencies (jsoup and twitter4j) and uses simple text-based input/output file formats. We recommend using the collector in moderation (i.e. do not use too many threads and do not leave it running for too long) and only for research purposes (i.e. to distribute Twitter datasets and reproduce results).

Instructions

Twitter fetching

For large sets of tweets (many tens of thousands), you probably want to use the multi-threaded implementation of the collector. In class `eu.socialsensor.twcollect.TweetCorpusDownloader`, just run the method:

void downloadIdsMultiThread(String idsFile, String responsesLogFile, boolean resume, int nrThreads)

providing as arguments the name of the file (idsFile) with the tweet IDs, one ID per line; the name of the file (responsesLogFile) where the collected tweets (plus some additional information) will be logged; a boolean flag (resume) indicating whether downloading should resume or start from scratch; and the number of threads (nrThreads) used to parallelize the requests to and responses from twitter.com. Note that if resume is set to true and the file responsesLogFile already exists (from a previous execution of the method), the collector will skip the requests for tweets that have already been collected.
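The resume behaviour described above can be sketched as follows. This is a minimal illustration, not the collector's actual code: the class `ResumeSketch` and its methods are hypothetical, and it assumes that each line of responsesLogFile starts with the tweet ID followed by a tab, which this README does not specify.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of twitter-dataset-collector.
final class ResumeSketch {

    // ASSUMPTION: every log line begins with the tweet ID, then a tab.
    static Set<String> collectedIds(List<String> logLines) {
        Set<String> done = new HashSet<>();
        for (String line : logLines) {
            String id = line.split("\t", 2)[0].trim();
            if (!id.isEmpty()) {
                done.add(id);
            }
        }
        return done;
    }

    // Returns the IDs that still need a request, preserving input order.
    static List<String> pendingIds(List<String> allIds, List<String> logLines, boolean resume) {
        Set<String> done = resume ? collectedIds(logLines) : Collections.<String>emptySet();
        List<String> pending = new ArrayList<>();
        for (String line : allIds) {
            String id = line.trim();
            if (!id.isEmpty() && !done.contains(id)) {
                pending.add(id);
            }
        }
        return pending;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: ResumeSketch <idsFile> <responsesLogFile>");
            return;
        }
        List<String> ids = Files.readAllLines(Paths.get(args[0]));
        // A missing log file simply means nothing has been collected yet.
        List<String> log = Files.exists(Paths.get(args[1]))
                ? Files.readAllLines(Paths.get(args[1]))
                : Collections.<String>emptyList();
        System.out.println(pendingIds(ids, log, true).size() + " tweets left to fetch");
    }
}
```

With resume set to false, every ID in the input is returned, which corresponds to starting from scratch.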

The same arguments (with the exception of nrThreads) can be used with the single-threaded implementation:

void downloadIds(String idsFile, String responsesLogFile, final boolean resume)
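What nrThreads controls in the multi-threaded variant can be illustrated with a fixed-size thread pool. The sketch below shows the general technique only and is not the project's implementation: `ParallelFetchSketch` and the `fetcher` function (a stand-in for the actual request to twitter.com) are hypothetical. With nrThreads set to 1, the requests are issued one after another, as in the single-threaded case.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical helper, not part of twitter-dataset-collector.
final class ParallelFetchSketch {

    // Applies `fetcher` to every ID using at most nrThreads concurrent workers.
    // `fetcher` is an assumed stand-in for the real HTTP request and must not return null.
    static Map<String, String> fetchAll(List<String> ids, int nrThreads,
                                        Function<String, String> fetcher) {
        Map<String, String> responses = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(nrThreads);
        for (String id : ids) {
            pool.execute(() -> {
                responses.put(id, fetcher.apply(id));
            });
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                pool.shutdownNow(); // give up on tasks that are still pending
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return responses;
    }

    public static void main(String[] args) {
        // A fake fetcher keeps the example offline.
        Map<String, String> out =
                fetchAll(Arrays.asList("1", "2", "3"), 2, id -> "response for " + id);
        System.out.println(out.size() + " responses collected");
    }
}
```

Keeping the pool small is in line with the recommendation to use the collector in moderation.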

Twitter4j wrapper

To make use of the Twitter Streaming API, use the Twitter4j wrapper class `eu.socialsensor.twcollect.StreamCollector`. To do so, first fill in your API credentials in the `twitter4j.properties` file and then run the `main` method of the class. By default, the class uses a predefined list of Twitter user IDs (from the file `seeds.txt`) as well as a set of keywords (from the file `keywords.txt`) as filters for the Streaming API. Note that the shutdown hook of the class may not work properly (and may therefore fail to finalize resources), e.g. when the `main` method is invoked and terminated from within the Eclipse IDE.
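Before starting the stream, it can help to verify that the credentials are actually filled in. The sketch below checks a `twitter4j.properties` file for the four OAuth keys that Twitter4J reads (`oauth.consumerKey`, `oauth.consumerSecret`, `oauth.accessToken`, `oauth.accessTokenSecret`). The `CredentialsCheck` class is hypothetical and not part of this project.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of twitter-dataset-collector.
final class CredentialsCheck {

    // OAuth keys that Twitter4J looks up in twitter4j.properties.
    static final List<String> REQUIRED = Arrays.asList(
            "oauth.consumerKey",
            "oauth.consumerSecret",
            "oauth.accessToken",
            "oauth.accessTokenSecret");

    // Returns the required keys that are absent or blank, in declaration order.
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // The default path is an assumption; pass another one as the first argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "twitter4j.properties");
        if (!Files.exists(path)) {
            System.out.println("No properties file found at " + path);
            return;
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(path)) {
            props.load(in);
        }
        System.out.println("Missing credentials: " + missingKeys(props));
    }
}
```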

Technical considerations

Running the multi-threaded implementation with 10 threads, we processed the set of over 27K tweets contained in the file tweets_27K.txt in a bit more than an hour, i.e. we achieved an average throughput of 6.7 tweets/sec. With the single-threaded implementation, it took us around 1.25 secs per tweet. The tweet IDs were collected during the US elections of November 6, 2012, and the test was performed on December 11, 2013 (more than a year later). As a result, only 78% of the 27,250 tweets were downloaded. There were 5,616 tweets that were no longer available (probably removed by their authors), 321 tweets from suspended accounts, and 45 tweets that failed to download for other reasons.
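The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported in this section; the class and method names are illustrative.

```java
import java.util.Locale;

// Illustrative arithmetic check, not part of twitter-dataset-collector.
final class ThroughputCheck {

    // Tweets that were actually downloaded, given the failure counts.
    static int downloaded(int total, int unavailable, int suspended, int otherFailures) {
        return total - unavailable - suspended - otherFailures;
    }

    // Wall-clock minutes needed to process `tweets` at a given rate.
    static double minutesAt(int tweets, double tweetsPerSecond) {
        return tweets / tweetsPerSecond / 60.0;
    }

    public static void main(String[] args) {
        int total = 27_250;
        int ok = downloaded(total, 5_616, 321, 45);
        System.out.println("downloaded: " + ok); // downloaded: 21268
        System.out.println(String.format(Locale.ROOT,
                "success rate: %.1f%%", 100.0 * ok / total)); // success rate: 78.0%
        System.out.println(String.format(Locale.ROOT,
                "10 threads: %.1f min", minutesAt(total, 6.7))); // 10 threads: 67.8 min
        // 1.25 secs per tweet corresponds to 1 / 1.25 = 0.8 tweets/sec.
        System.out.println(String.format(Locale.ROOT,
                "1 thread: %.1f min", minutesAt(total, 1 / 1.25))); // 1 thread: 567.7 min
    }
}
```

The roughly 68 minutes at 6.7 tweets/sec agrees with the reported "a bit more than an hour", while the single-threaded rate would have needed about 9.5 hours for the same set.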

Contact

For more information or support, contact: [email protected] or [email protected]
