tweet_clustering_ranking's Introduction

Cluster and Rank Tweets

These scripts download tweets from a list of predefined news feeds, cluster them by the similarity of their word2vec features, and rank the most influential tweets in each cluster. In essence, this is an approach to filtering relevant tweets.

E.g.: Tweets from 9 news feeds ('nytimes', 'thesun', 'thetimes', 'ap', 'cnn', 'bbcnews', 'cnet', 'msnuk', 'telegraph') are downloaded every 10 minutes. Every tweet is preprocessed and its text is transformed into a vector via the word2vec model. Each tweet is then assigned to a cluster, and all tweets belonging to that cluster are ranked by a score computed from other tweet features (favorited, etc.). Once the top tweet for every cluster is determined, it is stored in a dictionary, and if its score exceeds a predefined threshold it is written to a cluster-specific file.
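
The last step above (pick the top tweet per cluster, then write it out only if it clears the threshold) can be sketched roughly as follows. This is a minimal illustration, not the code in live_processing_app.py: the dictionary keys, the threshold value, the output file naming and the function names are all assumptions made for the example.

```python
import os
from collections import defaultdict

# Hypothetical value; the README does not state the real threshold.
SCORE_THRESHOLD = 0.5


def top_tweet_per_cluster(tweets):
    """Return {cluster_id: highest-scoring tweet} for a batch of scored tweets."""
    by_cluster = defaultdict(list)
    for tweet in tweets:
        by_cluster[tweet["cluster"]].append(tweet)
    return {cid: max(group, key=lambda t: t["score"])
            for cid, group in by_cluster.items()}


def write_top_tweets(top_tweets, out_dir, threshold=SCORE_THRESHOLD):
    """Append each top tweet whose score exceeds the threshold to a per-cluster file."""
    written = []
    for cid, tweet in sorted(top_tweets.items()):
        if tweet["score"] > threshold:
            # Assumed naming scheme: one text file per cluster id.
            path = os.path.join(out_dir, f"cluster_{cid}.txt")
            with open(path, "a", encoding="utf-8") as fh:
                fh.write(tweet["text"] + "\n")
            written.append(path)
    return written


# Made-up batch: each tweet already carries a cluster id and a ranking score.
batch = [
    {"text": "Markets rally", "cluster": 0, "score": 0.42},
    {"text": "Election results announced", "cluster": 0, "score": 0.91},
    {"text": "New phone released", "cluster": 1, "score": 0.30},
]
top = top_tweet_per_cluster(batch)
selected = {cid: t for cid, t in top.items() if t["score"] > SCORE_THRESHOLD}
```

Here `top` holds one tweet per cluster, while `selected` keeps only those that would actually be written to disk by `write_top_tweets`.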

Run script: python live_processing_app.py GoogleNews-vectors-negative300.bin

Clustering Tweets

Tweets from the 9 feeds were downloaded initially, and the text of every tweet was transformed into a 7,500-dimensional vector using the word2vec model. Each word is represented as a 300-dimensional numeric vector, so 7,500 dimensions is consistent with 25 word vectors per tweet (25 × 300). Once the features were extracted, a simple K-means algorithm was used to cluster the tweets; experiments with historical tweets suggested that 3 clusters were appropriate.
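
As a rough sketch of this step, the snippet below builds a fixed-length 7,500-dimensional vector per tweet and clusters the result with scikit-learn's KMeans. It rests on several assumptions that the README does not confirm: that the 7,500 dimensions come from concatenating 25 word vectors, that short tweets are zero-padded and long ones truncated, and that unknown words are skipped. A small random lookup table stands in for the pretrained GoogleNews model so the example runs without the large binary file; with gensim, a model loaded via `KeyedVectors.load_word2vec_format(path, binary=True)` could be passed in its place.

```python
import numpy as np
from sklearn.cluster import KMeans

WORD_DIM = 300   # from the README: each word is a 300-dimensional vector
MAX_WORDS = 25   # assumption: 7,500 / 300 = 25 word slots per tweet


def tweet_to_vector(tokens, word_vectors):
    """Concatenate word vectors into one (MAX_WORDS * WORD_DIM,) vector.

    Unknown words are skipped, long tweets are truncated to MAX_WORDS
    words, and short tweets are zero-padded.
    """
    vecs = [word_vectors[t] for t in tokens if t in word_vectors][:MAX_WORDS]
    out = np.zeros((MAX_WORDS, WORD_DIM), dtype=np.float32)
    if vecs:
        out[: len(vecs)] = np.stack(vecs)
    return out.ravel()


# Toy stand-in for the pretrained model: random vectors for a tiny vocabulary.
rng = np.random.default_rng(0)
vocab = ["election", "vote", "market", "stocks", "phone", "launch"]
toy_vectors = {w: rng.normal(size=WORD_DIM).astype(np.float32) for w in vocab}

# Already-tokenized example tweets (made up for illustration).
tweets = [
    ["election", "vote"], ["vote", "election"],
    ["market", "stocks"], ["stocks", "market"],
    ["phone", "launch"], ["launch", "phone"],
]
X = np.stack([tweet_to_vector(t, toy_vectors) for t in tweets])

# 3 clusters, as the README reports for the historical tweets.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
```

A new tweet can then be assigned to a cluster with `kmeans.predict(tweet_to_vector(tokens, toy_vectors).reshape(1, -1))`.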

Ranking Tweets

A Random Forest regression model is trained on the downloaded tweets, using the retweet count as a proxy for the importance of a tweet (a first attempt at quantifying importance). This model uses features other than the tweet text, such as 'favorited', 'retweeted_status', 'retweeted', 'entities', 'favorite_count' and 'possibly_sensitive'. The importance of a tweet is hard to define and is influenced by many factors beyond the tweet itself, but the regression model shows promising results: it seems able to capture the trend in retweet count, at least whether that trend is positive or negative. Building on these results, the approach could be extended in future to predict a tweet's future retweet count by tracking its retweet count over time and generating a labeled dataset.
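
To make the idea concrete, here is a minimal sketch of training such a regressor with scikit-learn. The feature encoding in `tweet_features` (booleans cast to 0/1, entities reduced to a count) is an assumption chosen for illustration, and the training data is synthetic; neither reflects the repository's actual preprocessing or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def tweet_features(tweet):
    """Turn a tweet JSON dict into a numeric feature row (hypothetical encoding)."""
    entities = tweet.get("entities", {})
    return [
        float(tweet.get("favorited", False)),
        float("retweeted_status" in tweet),               # is this tweet itself a retweet?
        float(tweet.get("retweeted", False)),
        float(sum(len(v) for v in entities.values())),    # total hashtags, URLs, mentions, ...
        float(tweet.get("favorite_count", 0)),
        float(tweet.get("possibly_sensitive", False)),
    ]


# Synthetic tweets in which retweet_count loosely follows favorite_count.
rng = np.random.default_rng(0)
train = []
for _ in range(200):
    fav = int(rng.integers(0, 1000))
    train.append({
        "favorited": False,
        "retweeted": False,
        "entities": {"hashtags": [], "urls": []},
        "favorite_count": fav,
        "possibly_sensitive": False,
        "retweet_count": max(0, int(0.5 * fav + rng.normal(0, 5))),
    })

X = np.array([tweet_features(t) for t in train])
y = np.array([t["retweet_count"] for t in train], dtype=float)

# retweet_count is the regression target, used as a proxy for importance.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Score two unseen tweets; a higher predicted retweet count means a higher rank.
candidates = [{"favorite_count": 10}, {"favorite_count": 900}]
low, high = model.predict(np.array([tweet_features(t) for t in candidates]))
```

The predicted values can serve directly as the ranking score within a cluster.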

Other Source Files

read_tweets.py - Provides a simple way to read stored tweets from JSON objects and compute the word2vec representation of each tweet's text

dowload_tweets.py - Provides a simple way to download tweets from the Twitter API and store them in files as JSON objects

*_test.py - Samples for testing the tweepy API and the Google word2vec model
