
discursive's People

Contributors

bstarling, hadoopjax, jbrambledc, sjacks26, wanderingstar


discursive's Issues

Build an analytics layer atop the Twitter data

Ultimately, we'll want to build analytical products using the Twitter data as a source. To do that, we'll identify distinct user handles and capture referenced URLs and common hashtags. Additionally, we'll use some NLP techniques to dig deeper into the Tweet texts and profile descriptions.

The Twitter data is stored in S3 as JSON, so pulling that data into an analytical environment (maybe a Jupyter notebook?) could be a great task for beginners to tackle, as there's plenty of documentation online describing how to perform data analysis in Python with JSON data sources!
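Below is a minimal sketch of how an analyst might pull one file of collected tweets out of S3 and start answering those questions in plain Python; the bucket and key names are made up, and it assumes one JSON tweet per line.

```python
import json
from collections import Counter

import boto3

s3 = boto3.client("s3")

def load_tweets(bucket, key):
    """Download one file of newline-delimited tweet JSON and parse it."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    return [json.loads(line) for line in body.splitlines() if line.strip()]

# Hypothetical bucket/key; substitute wherever the stream writes its files.
tweets = load_tweets("discursive-tweets", "stream/2017-03-01T12:00:00.json")

# Distinct user handles, referenced URLs, and common hashtags.
handles = {t["user"]["screen_name"] for t in tweets}
urls = [u["expanded_url"] for t in tweets for u in t["entities"]["urls"]]
hashtags = Counter(h["text"].lower()
                   for t in tweets for h in t["entities"]["hashtags"])

print(len(handles), "distinct handles")
print(hashtags.most_common(10))
```

The same snippet drops straight into a Jupyter notebook cell, and the parsed list of dicts feeds naturally into pandas via `pandas.DataFrame(tweets)`.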

Command line interface (CLI)

Related to my proposal on storage options #13, I think we are getting to the point where a full-fledged CLI would be useful.

When designing it, it's important to set sane defaults so it "just works" without arguments. This will help users unfamiliar with the tool get up and running quickly.

Brainstorming here, but these are some arguments we may want to accept via the CLI (a rough sketch follows the list). This list is probably too large; whatever we don't support in the CLI may be more appropriate for a settings file:

  • Stream tweets or search tweets
  • Tweet limit used in stream or search
  • Topics to stream/search
  • Input file which contains list of topics
  • Output type (file, ES, sqlite)
  • File output location (S3, local file etc.)
  • Location of credential file (Twitter, AWS)
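To make the discussion concrete, here's a rough argparse sketch of what that interface could look like; every flag name and default below is a suggestion, not a settled design.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        prog="discursive",
        description="Collect tweets via the streaming or search API.")
    # Sane defaults throughout, so running with no arguments just works.
    parser.add_argument("mode", nargs="?", choices=["stream", "search"],
                        default="stream", help="collection mode (default: stream)")
    parser.add_argument("--limit", type=int, default=1000,
                        help="tweet limit for the stream or search")
    parser.add_argument("--topics", nargs="*", default=["politics"],
                        help="topics to stream/search")
    parser.add_argument("--topics-file",
                        help="input file containing one topic per line")
    parser.add_argument("--output", choices=["file", "es", "sqlite"],
                        default="file", help="output type")
    parser.add_argument("--output-path", default="tweets.json",
                        help="file output location (local path or S3 URL)")
    parser.add_argument("--credentials", default="config.py",
                        help="location of the Twitter/AWS credential file")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)  # hand off to the stream/search entry point from here
```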

Poll account data daily

A process that accepts a list of users and polls daily. It should save a timestamp and these user attributes:

  • Followers
  • Bio
  • Location
  • Profile picture/banner?

This allows us to see how a profile changes over time, spot the fastest-growing profiles, etc.
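A minimal sketch of the polling step, assuming Tweepy-style auth (the credentials, attribute set, and output format are placeholders; scheduling would be handled by cron or similar):

```python
import datetime
import json

import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

def poll_users(screen_names):
    """Yield a timestamped snapshot of profile attributes per user."""
    now = datetime.datetime.utcnow().isoformat()
    for name in screen_names:
        user = api.get_user(screen_name=name)
        yield {
            "polled_at": now,
            "screen_name": user.screen_name,
            "followers": user.followers_count,
            "bio": user.description,
            "location": user.location,
            "profile_image": user.profile_image_url_https,
            # The banner is absent for some accounts, hence the getattr.
            "banner": getattr(user, "profile_banner_url", None),
        }

for snapshot in poll_users(["example_user"]):  # run daily via cron
    print(json.dumps(snapshot))
```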

Fix index_twitter_search.py

Looking for someone to take a look at index_twitter_search.py and get it working. It currently prints malformed data to the console when run. Contact @nick on Slack to get access to the Elasticsearch index for your local testing if you pick this up!
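For orientation, here's a guess at the shape a working version might take, assuming the Tweepy v3 search cursor and the elasticsearch-py client of that era (the host, query, and index names are hypothetical); the key point is to index the raw JSON payload instead of printing Status objects:

```python
import tweepy
from elasticsearch import Elasticsearch

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

es = Elasticsearch(["http://localhost:9200"])

for status in tweepy.Cursor(api.search, q="politics", count=100).items(500):
    # status._json is the full tweet payload as a dict.
    es.index(index="tweets", doc_type="tweet", body=status._json)
```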

Build core NLP capability for analyzing Tweets

This is a placeholder for the design (and associated discussion) of our foundational NLP capability for analyzing collected Tweets. Divya (@divya on Slack) and Wendy (@wwymak on Slack) will be taking the lead, with support from anyone/everyone else who wants to help! The goal is to publish a proposed design for the implementation to this issue later this week and get feedback from the community. Anyone else who's interested in participating, please don't hesitate to contact them.
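As a trivial illustration of the kind of building block involved (not the proposed design), NLTK ships a tweet-aware tokenizer that handles hashtags, @-handles, and elongated words:

```python
from collections import Counter

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                           reduce_len=True)

def term_counts(texts):
    """Count tokens across a batch of tweet texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenizer.tokenize(text))
    return counts

print(term_counts(["Loving #opensource!!!! @someone check this out"]))
```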

Write valid JSON to S3

To enable analysts, we need to ensure valid JSON is written to S3 from the StreamListener class in index_twitter_stream.py.

The write to S3 should create a key (filename) with a timestamp as this stream runs every 15 minutes. Bonus points for creating a way to zip/concatenate files with filenames as arguments (so analysts could, for instance, take all Tweets from a given day, week, etc.).
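A minimal sketch of what that could look like, assuming the Tweepy v3 StreamListener and a hypothetical bucket name; on_data receives the raw JSON string for each tweet, so buffering lines and writing one timestamped newline-delimited JSON file per run keeps every object valid:

```python
import datetime

import boto3
import tweepy

class S3StreamListener(tweepy.StreamListener):
    def __init__(self, bucket, limit=1000):
        super().__init__()
        self.bucket = bucket
        self.limit = limit
        self.lines = []

    def on_data(self, raw_data):
        self.lines.append(raw_data.strip())
        if len(self.lines) >= self.limit:
            self.flush()
            return False  # stop; the next 15-minute run starts fresh

    def flush(self):
        key = "stream/%s.json" % datetime.datetime.utcnow().isoformat()
        body = "\n".join(self.lines).encode("utf-8")
        boto3.client("s3").put_object(Bucket=self.bucket, Key=key, Body=body)
        self.lines = []
```

The timestamped keys also make the bonus concatenation step easy: list keys by prefix (day, week, etc.) and join the files.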

Community detection

There are lots of ways to do community detection using Twitter data. We'll want to discuss the nuts and bolts on Slack, but once we select an implementation we like we can track progress here. There's lots of neat emerging research we could try out, too (e.g. https://arxiv.org/pdf/1608.01771v1.pdf)!
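For a flavor of the simplest possible baseline (not a chosen implementation), networkx can run modularity-based community detection over a retweet graph built from our collected JSON:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def retweet_graph(tweets):
    """Add an edge from each retweeter to the original author."""
    g = nx.Graph()
    for t in tweets:
        if "retweeted_status" in t:
            g.add_edge(t["user"]["screen_name"],
                       t["retweeted_status"]["user"]["screen_name"])
    return g

# Tiny inline sample; in practice, feed in the parsed tweets from S3.
sample = [
    {"user": {"screen_name": "a"},
     "retweeted_status": {"user": {"screen_name": "b"}}},
    {"user": {"screen_name": "c"},
     "retweeted_status": {"user": {"screen_name": "b"}}},
]
for community in greedy_modularity_communities(retweet_graph(sample)):
    print(sorted(community))
```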

Configurable storage options

To generalize the tool and make it useful to the widest audience, I think it would be beneficial to support multiple options for saving tweets, with the ability to configure one or more of them. To accomplish this, I think we should refactor the storage-related work into a separate process that uses config to determine how/where to save tweets (see the sketch after this list). If we agree on this approach, we can open new issues for each backend we want to add.

For starters:

  • ES
  • SQLite
  • JSON file (local or S3)
  • CSV file (local or S3)
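One possible shape for the refactor (class and config names here are hypothetical): each backend exposes a single save() method, and config determines which backends are instantiated, so adding ES or CSV later is just another class.

```python
import json
import sqlite3

class JsonFileStore:
    def __init__(self, path):
        self.f = open(path, "a")

    def save(self, tweet):
        self.f.write(json.dumps(tweet) + "\n")

class SqliteStore:
    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tweets (id TEXT, body TEXT)")

    def save(self, tweet):
        self.db.execute("INSERT INTO tweets VALUES (?, ?)",
                        (tweet["id_str"], json.dumps(tweet)))
        self.db.commit()

BACKENDS = {"json": JsonFileStore, "sqlite": SqliteStore}

def stores_from_config(config):
    """config example: [{"type": "json", "path": "tweets.json"}]"""
    return [BACKENDS[c.pop("type")](**c) for c in config]

stores = stores_from_config([{"type": "json", "path": "tweets.json"},
                             {"type": "sqlite", "path": "tweets.db"}])
for store in stores:  # one tweet fans out to every configured backend
    store.save({"id_str": "1", "text": "hello"})
```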

Fix "build_status_attr.py" in "discursive-graph-data" branch

This issue pertains to the discursive-graph-data branch

build_status_attr.py is currently a mess. It needs to return the entire Tweet object from Tweepy (so, include "entities"). The returned object will contain important information about a given Tweet (status) that we can use for analysis.

Currently, it throws TypeError: 'int' object is not iterable

If you would like to work on this issue and need help setting up access to the Elasticsearch resource for your local testing, please contact @nick on Slack. This is a relatively painless process as long as you have an AWS account. Alternatively, if you want to test disconnected from AWS resources, you can comment out any of the AWS- or Elasticsearch-affiliated references (you'll still need Twitter, though).
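A guess at the shape of the fix (I haven't run the branch): return the complete payload from the Tweepy status rather than iterating over individual fields, some of which are plain ints, which is the likely source of the TypeError.

```python
def build_status_attr(status):
    """Return the full Tweet (status) object as a dict, entities included."""
    tweet = status._json  # complete payload from Tweepy
    # Convenience field pulled from entities; keep the rest intact.
    tweet["hashtags"] = [h["text"] for h in tweet["entities"]["hashtags"]]
    return tweet
```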

Update README

We need to update the README to reflect:

  • Document @jbrambleDC's changes to the topics feed (topics now come from a topics.txt file)
  • Provide a more robust description of the Search and Stream options from the Twitter API
  • Remove the Twitter Collection Terms section, since collection terms are now driven by the user's preference rather than hard-coded
  • Update the roadmap to reflect our current goals (ask around on our Slack channel!)

Move keyword/tracking code from index_twitter_stream.py into a separate input file

Rather than keeping the tracking topics we run against the Twitter streaming API in our codebase, let's move them to an input file. Then, as a user, I can point to a file (e.g. on S3 or locally) to use as a source for my keywords! A loader sketch follows the bullet below.

  • Bonus points if you want to move the Tweet counter (StreamListener self.limit) and any Elasticsearch configs to a user-managed config file!
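A minimal loader sketch; the s3:// convention here is a suggestion, not something the codebase supports yet:

```python
import boto3

def load_topics(source):
    """Return one topic per non-empty line of a local file or s3:// URL."""
    if source.startswith("s3://"):
        bucket, _, key = source[len("s3://"):].partition("/")
        obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
        text = obj["Body"].read().decode("utf-8")
    else:
        with open(source) as f:
            text = f.read()
    return [line.strip() for line in text.splitlines() if line.strip()]

# e.g. stream.filter(track=load_topics("topics.txt"))
```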

Migrate to Kinesis & Lambda (serverless)

In conversations with @ASRagab, @bstarling and @nataliaking over the past several weeks, we've contemplated migrating the Discursive application to a 'serverless' architecture supported by Kinesis and Lambda. In terms of desired functionality (delivering Twitter data to researchers), the master branch works just fine, albeit requiring a level of infrastructure expertise our research colleagues may not possess. To that end, we would like to gather community feedback on whether this migration to a serverless architecture is something we should pursue. Please provide feedback if you have it; we'd love to hear from you!
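To ground the discussion, the producer side might be as small as the sketch below, assuming a Kinesis stream named discursive-tweets (hypothetical); a Lambda consumer subscribed to the stream would then batch records out to S3 or Elasticsearch, with no servers for researchers to manage.

```python
import boto3

kinesis = boto3.client("kinesis")

def publish_tweet(raw_json, user_id):
    """Push one raw tweet into the stream, partitioned by author."""
    kinesis.put_record(StreamName="discursive-tweets",
                       Data=raw_json.encode("utf-8"),
                       PartitionKey=str(user_id))
```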
