
discursive's People

Contributors

bstarling, hadoopjax, jbrambledc, sjacks26, wanderingstar


discursive's Issues

Build an analytics layer atop the Twitter data

Ultimately, we'll want to build analytical products using the Twitter data as a source. To do that, we'll identify distinct user handles and capture referenced URLs and common hashtags. Additionally, we'll use some NLP techniques to dig deeper into the Tweet texts and profile descriptions.

The Twitter data is stored in S3 as JSON, so pulling that data into an analytical environment (maybe a Jupyter notebook?) could be a great task for beginners to tackle, as there's plenty of documentation online describing how to perform data analysis in Python with JSON data sources!
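Below is a minimal sketch of how an analyst might pull one file of collected tweets out of S3 and start answering those questions in plain Python; the bucket and key names are made up, and it assumes one JSON tweet per line.

```python
import json
from collections import Counter

import boto3

s3 = boto3.client("s3")

def load_tweets(bucket, key):
    """Download one file of newline-delimited tweet JSON and parse it."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    return [json.loads(line) for line in body.splitlines() if line.strip()]

# Hypothetical bucket/key; substitute wherever the stream writes its files.
tweets = load_tweets("discursive-tweets", "stream/2017-03-01T12:00:00.json")

# Distinct user handles, referenced URLs, and common hashtags.
handles = {t["user"]["screen_name"] for t in tweets}
urls = [u["expanded_url"] for t in tweets for u in t["entities"]["urls"]]
hashtags = Counter(h["text"].lower()
                   for t in tweets for h in t["entities"]["hashtags"])

print(len(handles), "distinct handles")
print(hashtags.most_common(10))
```

The same snippet drops straight into a Jupyter notebook cell, and the parsed list of dicts feeds naturally into pandas via `pandas.DataFrame(tweets)`.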

Command line interface (CLI)

Related to my proposal on storage options #13, I think we are getting to the point where a full-fledged CLI would be useful.

When designing it, it's important to set sane defaults so it "just works" without arguments. This will help users unfamiliar with the tool get up and running quickly.

Brainstorming here, but these are some arguments we may want to accept via the CLI (a rough sketch follows the list). This list is probably too large; whatever we don't support in the CLI may be more appropriate for a settings file:

  • Stream tweets or search tweets
  • Tweet limit used in stream or search
  • Topics to stream/search
  • Input file which contains list of topics
  • Output type (file, ES, sqlite)
  • File output location (S3, local file etc.)
  • Location of credential file (Twitter, AWS)
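To make the discussion concrete, here's a rough argparse sketch of what that interface could look like; every flag name and default below is a suggestion, not a settled design.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        prog="discursive",
        description="Collect tweets via the streaming or search API.")
    # Sane defaults throughout, so running with no arguments just works.
    parser.add_argument("mode", nargs="?", choices=["stream", "search"],
                        default="stream", help="collection mode (default: stream)")
    parser.add_argument("--limit", type=int, default=1000,
                        help="tweet limit for the stream or search")
    parser.add_argument("--topics", nargs="*", default=["politics"],
                        help="topics to stream/search")
    parser.add_argument("--topics-file",
                        help="input file containing one topic per line")
    parser.add_argument("--output", choices=["file", "es", "sqlite"],
                        default="file", help="output type")
    parser.add_argument("--output-path", default="tweets.json",
                        help="file output location (local path or S3 URL)")
    parser.add_argument("--credentials", default="config.py",
                        help="location of the Twitter/AWS credential file")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)  # hand off to the stream/search entry point from here
```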

Poll account data daily

A process that accepts a list of users and polls daily. It should save a timestamp and these user attributes:

  • Followers
  • Bio
  • Location
  • Profile picture/banner?

This allows us to see how a profile changes over time, spot the fastest-growing profiles, etc.
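A minimal sketch of the polling step, assuming Tweepy-style auth (the credentials, attribute set, and output format are placeholders; scheduling would be handled by cron or similar):

```python
import datetime
import json

import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

def poll_users(screen_names):
    """Yield a timestamped snapshot of profile attributes per user."""
    now = datetime.datetime.utcnow().isoformat()
    for name in screen_names:
        user = api.get_user(screen_name=name)
        yield {
            "polled_at": now,
            "screen_name": user.screen_name,
            "followers": user.followers_count,
            "bio": user.description,
            "location": user.location,
            "profile_image": user.profile_image_url_https,
            # The banner is absent for some accounts, hence the getattr.
            "banner": getattr(user, "profile_banner_url", None),
        }

for snapshot in poll_users(["example_user"]):  # run daily via cron
    print(json.dumps(snapshot))
```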

Fix index_twitter_search.py

Looking for someone to take a look at index_twitter_search.py and get it working. It currently prints malformed data to the console when run. Contact @nick on Slack to get access to the Elasticsearch index for your local testing if you pick this up!
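For orientation, here's a guess at the shape a working version might take, assuming the Tweepy v3 search cursor and the elasticsearch-py client of that era (the host, query, and index names are hypothetical); the key point is to index the raw JSON payload instead of printing Status objects:

```python
import tweepy
from elasticsearch import Elasticsearch

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

es = Elasticsearch(["http://localhost:9200"])

for status in tweepy.Cursor(api.search, q="politics", count=100).items(500):
    # status._json is the full tweet payload as a dict.
    es.index(index="tweets", doc_type="tweet", body=status._json)
```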

Build core NLP capability for analyzing Tweets

This is a placeholder for the design (and associated discussion) of our foundational NLP capability for analyzing collected Tweets. Divya (@divya on Slack) and Wendy (@wwymak on Slack) will be taking the lead, with support from anyone/everyone else who wants to help! The goal is to publish a proposed design for the implementation to this issue later this week and get feedback from the community. Anyone else who's interested in participating, please don't hesitate to contact them.
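As a trivial illustration of the kind of building block involved (not the proposed design), NLTK ships a tweet-aware tokenizer that handles hashtags, @-handles, and elongated words:

```python
from collections import Counter

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                           reduce_len=True)

def term_counts(texts):
    """Count tokens across a batch of tweet texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenizer.tokenize(text))
    return counts

print(term_counts(["Loving #opensource!!!! @someone check this out"]))
```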

Write valid JSON to S3

To enable analysts, we need to ensure valid JSON is written to S3 from the StreamListener class in index_twitter_stream.py.

The write to S3 should create a key (filename) with a timestamp as this stream runs every 15 minutes. Bonus points for creating a way to zip/concatenate files with filenames as arguments (so analysts could, for instance, take all Tweets from a given day, week, etc.).
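A minimal sketch of what that could look like, assuming the Tweepy v3 StreamListener and a hypothetical bucket name; on_data receives the raw JSON string for each tweet, so buffering lines and writing one timestamped newline-delimited JSON file per run keeps every object valid:

```python
import datetime

import boto3
import tweepy

class S3StreamListener(tweepy.StreamListener):
    def __init__(self, bucket, limit=1000):
        super().__init__()
        self.bucket = bucket
        self.limit = limit
        self.lines = []

    def on_data(self, raw_data):
        self.lines.append(raw_data.strip())
        if len(self.lines) >= self.limit:
            self.flush()
            return False  # stop; the next 15-minute run starts fresh

    def flush(self):
        key = "stream/%s.json" % datetime.datetime.utcnow().isoformat()
        body = "\n".join(self.lines).encode("utf-8")
        boto3.client("s3").put_object(Bucket=self.bucket, Key=key, Body=body)
        self.lines = []
```

The timestamped keys also make the bonus concatenation step easy: list keys by prefix (day, week, etc.) and join the files.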

Community detection

There are lots of ways to do community detection using Twitter data. We'll want to discuss the nuts and bolts on Slack, but once we select an implementation we like we can track progress here. There's lots of neat emerging research we could try out, too (e.g. https://arxiv.org/pdf/1608.01771v1.pdf)!
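For a flavor of the simplest possible baseline (not a chosen implementation), networkx can run modularity-based community detection over a retweet graph built from our collected JSON:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def retweet_graph(tweets):
    """Add an edge from each retweeter to the original author."""
    g = nx.Graph()
    for t in tweets:
        if "retweeted_status" in t:
            g.add_edge(t["user"]["screen_name"],
                       t["retweeted_status"]["user"]["screen_name"])
    return g

# Tiny inline sample; in practice, feed in the parsed tweets from S3.
sample = [
    {"user": {"screen_name": "a"},
     "retweeted_status": {"user": {"screen_name": "b"}}},
    {"user": {"screen_name": "c"},
     "retweeted_status": {"user": {"screen_name": "b"}}},
]
for community in greedy_modularity_communities(retweet_graph(sample)):
    print(sorted(community))
```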

Configurable storage options

To generalize the tool and make it useful to the widest audience, I think it would be beneficial to support multiple options for saving tweets, with the ability to configure one or more of them. To accomplish this, I think we should refactor the storage-related work into a separate process that uses config to determine how/where to save tweets (see the sketch after this list). If we agree on this approach, we can open new issues for each backend we want to add.

For starters:

  • ES
  • SQLite
  • JSON file (local or S3)
  • CSV file (local or S3)
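One possible shape for the refactor (class and config names here are hypothetical): each backend exposes a single save() method, and config determines which backends are instantiated, so adding ES or CSV later is just another class.

```python
import json
import sqlite3

class JsonFileStore:
    def __init__(self, path):
        self.f = open(path, "a")

    def save(self, tweet):
        self.f.write(json.dumps(tweet) + "\n")

class SqliteStore:
    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tweets (id TEXT, body TEXT)")

    def save(self, tweet):
        self.db.execute("INSERT INTO tweets VALUES (?, ?)",
                        (tweet["id_str"], json.dumps(tweet)))
        self.db.commit()

BACKENDS = {"json": JsonFileStore, "sqlite": SqliteStore}

def stores_from_config(config):
    """config example: [{"type": "json", "path": "tweets.json"}]"""
    return [BACKENDS[c.pop("type")](**c) for c in config]

stores = stores_from_config([{"type": "json", "path": "tweets.json"},
                             {"type": "sqlite", "path": "tweets.db"}])
for store in stores:  # one tweet fans out to every configured backend
    store.save({"id_str": "1", "text": "hello"})
```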

Fix "build_status_attr.py" in "discursive-graph-data" branch

This issue pertains to the discursive-graph-data branch

build_status_attr.py is currently a mess. It needs to return the entire Tweet object from Tweepy (so, include "entities"). The returned object will contain important information about a given Tweet (status) that we can use for analysis.

Currently, it throws TypeError: 'int' object is not iterable

If you would like to work on this issue and need help setting up access to the Elasticsearch resource for your local testing, please contact @nick on Slack. This is a relatively painless process as long as you have an AWS account. Alternatively, if you want to test disconnected from AWS resources, you can comment out any of the AWS- or Elasticsearch-affiliated references (you'll still need Twitter, though).
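A guess at the shape of the fix (I haven't run the branch): return the complete payload from the Tweepy status rather than iterating over individual fields, some of which are plain ints, which is the likely source of the TypeError.

```python
def build_status_attr(status):
    """Return the full Tweet (status) object as a dict, entities included."""
    tweet = status._json  # complete payload from Tweepy
    # Convenience field pulled from entities; keep the rest intact.
    tweet["hashtags"] = [h["text"] for h in tweet["entities"]["hashtags"]]
    return tweet
```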

Update README

We need to update the README to reflect:

  • Document @jbrambleDC's changes to the topics feed (topics now come from a topics.txt file)
  • Provide a more robust description of the Search and Stream options from the Twitter API
  • Remove the Twitter Collection Terms section, since collection terms are now driven by the user's preference rather than hard-coded
  • Update the roadmap to reflect our current goals (ask around on our Slack channel!)

Move keyword/tracking code from index_twitter_stream.py into a separate input file

Rather than keeping the tracking topics we run against the Twitter streaming API in our codebase, let's move them to an input file. Then, as a user, I can point to a file (e.g. on S3 or locally) to use as a source for my keywords! A loader sketch follows the bullet below.

  • Bonus points if you want to move the Tweet counter (StreamListener self.limit) and any Elasticsearch configs to a user-managed config file!
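A minimal loader sketch; the s3:// convention here is a suggestion, not something the codebase supports yet:

```python
import boto3

def load_topics(source):
    """Return one topic per non-empty line of a local file or s3:// URL."""
    if source.startswith("s3://"):
        bucket, _, key = source[len("s3://"):].partition("/")
        obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
        text = obj["Body"].read().decode("utf-8")
    else:
        with open(source) as f:
            text = f.read()
    return [line.strip() for line in text.splitlines() if line.strip()]

# e.g. stream.filter(track=load_topics("topics.txt"))
```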

Migrate to Kinesis & Lambda (serverless)

In conversations with @ASRagab, @bstarling and @nataliaking over the past several weeks, we've contemplated migrating the Discursive application to a 'serverless' architecture supported by Kinesis and Lambda. In terms of desired functionality (delivering Twitter data to researchers), the master branch works just fine, albeit requiring a level of infrastructure expertise our research colleagues may not possess. To that end, we would like to gather community feedback on whether this migration to a serverless architecture is something we should pursue. Please provide feedback if you have it; we'd love to hear from you!
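To ground the discussion, the producer side might be as small as the sketch below, assuming a Kinesis stream named discursive-tweets (hypothetical); a Lambda consumer subscribed to the stream would then batch records out to S3 or Elasticsearch, with no servers for researchers to manage.

```python
import boto3

kinesis = boto3.client("kinesis")

def publish_tweet(raw_json, user_id):
    """Push one raw tweet into the stream, partitioned by author."""
    kinesis.put_record(StreamName="discursive-tweets",
                       Data=raw_json.encode("utf-8"),
                       PartitionKey=str(user_id))
```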
