
Twitter Data Mining

Getting Started

The project includes two programs: Datadownload, which downloads data from Twitter and saves it in NPY format, and Translator, which converts the NPY data into XML format.
The data-collection algorithm:
Use the URL you are interested in as the mother node, then request 1000 tweets that contain that URL and store every hashtag appearing in those tweets in a dictionary together with its frequency.
Request 1000 tweets for each of the top-frequency hashtags in the dictionary and store every URL that appears in those tweets with its frequency.
Then choose the top-frequency URLs for each hashtag and add them to the database as new nodes, with the hashtag as the relationship.
Then use each new node to collect hashtags, and so on.
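The hashtag-counting and top-fraction selection steps above can be sketched in pure Python. The tweet shape below mimics the Twitter API's `entities` field, and `count_hashtags` / `top_fraction` are illustrative names, not the project's actual functions:

```python
from collections import Counter

def count_hashtags(tweets):
    """Tally every hashtag appearing in a batch of tweets
    (each tweet shaped like the Twitter API's `entities` field)."""
    freq = Counter()
    for tweet in tweets:
        for tag in tweet["entities"]["hashtags"]:
            freq["#" + tag["text"].lower()] += 1
    return freq

def top_fraction(freq, rate=0.1):
    """Return the most frequent keys, keeping the top `rate`
    fraction of them (at least one), as the toprate settings do."""
    k = max(1, int(len(freq) * rate))
    return [key for key, _ in freq.most_common(k)]

tweets = [
    {"entities": {"hashtags": [{"text": "python"}, {"text": "data"}]}},
    {"entities": {"hashtags": [{"text": "python"}]}},
]
freq = count_hashtags(tweets)        # Counter({'#python': 2, '#data': 1})
print(top_fraction(freq, rate=0.5))  # keeps top 50% -> ['#python']
```

The same ranking is then reused in the other direction: count the URLs inside each hashtag's tweets and keep the top fraction as the next level of nodes.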

Prerequisites

The things you need to install before running the software, and how to install them:

Python version 3.X

Modules:
Tweepy
Numpy
lxml

Installing

A step-by-step series of examples that tell you how to get a development environment running.
Install the required modules:
pip3 install tweepy
pip3 install numpy
pip3 install lxml

Running the tests

First of all, create a Twitter app at https://apps.twitter.com, then copy the API key/secret and access token/secret into the corresponding variables in Datadownload. If you don't want to create one, feel free to use the default API key and secret.
Second, change the variables as needed:
Dictname: the name of the .npy file that stores the URLs and hashtags.
SpeakDictname: the name of the .npy file that stores the tweet text content.
Frequency_table: the name of the .txt file that stores the frequency of hashtags/URLs.
mother_node: the first URL to expand.
num: the number of node levels you want. The maximum is 4: because of the Twitter API rate limits, even 3 levels take hours to finish, so the program currently supports at most 4 levels of nodes.
toprate: the fraction of top-frequency URLs to keep; the default is 0.1.
toprate_hashtags: the fraction of top-frequency hashtags to keep; the default is 0.1.
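Put together, the variable block at the top of Datadownload.py might look like this (the variable names come from the list above; the values are only examples):

```python
# Example configuration for Datadownload.py (values are illustrative).
Dictname = "url_hashtags.npy"             # URL/hashtag dictionary output
SpeakDictname = "speak_content.npy"       # tweet text content output
Frequency_table = "frequency_table.txt"   # hashtag/URL frequency table
mother_node = "https://www.example.com/"  # the first URL to expand
num = 2                  # node levels to collect (maximum 4)
toprate = 0.1            # fraction of top-frequency URLs to keep
toprate_hashtags = 0.1   # fraction of top-frequency hashtags to keep
```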
Third, run Datadownload.py with python3 Datadownload.py. If the program reaches Twitter's rate limit, it prints the signal "Rate limit reached." along with the waiting time.
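The "print the signal, then wait" behaviour described above follows a generic retry pattern. A minimal pure-Python sketch (not the project's actual code; the `fetch` callable and `RuntimeError` stand in for the real Twitter API call and its rate-limit exception):

```python
import time

def with_rate_limit_retry(fetch, wait_seconds=15 * 60, max_retries=3):
    """Call fetch(); on an error signalling a rate limit, print the
    signal and the waiting time, sleep, then retry."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RuntimeError:
            print("Rate limit reached.")
            print(f"Waiting {wait_seconds} seconds before retry {attempt + 1}.")
            time.sleep(wait_seconds)
    raise RuntimeError("still rate limited after retries")
```

Tweepy can also handle this for you via its wait-on-rate-limit option, but an explicit loop makes the printed signal easy to customize.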
Fourth, when Datadownload completes its work, you should find three files in your folder: two .npy files storing the dictionaries and one .txt file recording the frequency table.
Fifth, run the Translator program to get the XML file.
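A minimal sketch of the Translator step, assuming the .npy file holds a pickled {url: [hashtag, ...]} dictionary (the project's actual dictionary layout and element names may differ):

```python
import numpy as np
from lxml import etree

def npy_dict_to_xml(npy_path, xml_path):
    """Load a dictionary saved with np.save and write it out as XML:
    one <node> per URL, one <hashtag> child per hashtag."""
    data = np.load(npy_path, allow_pickle=True).item()
    root = etree.Element("graph")
    for url, hashtags in data.items():
        node = etree.SubElement(root, "node", url=url)
        for tag in hashtags:
            etree.SubElement(node, "hashtag").text = tag
    etree.ElementTree(root).write(
        xml_path, pretty_print=True, xml_declaration=True, encoding="UTF-8"
    )

# Round-trip demo with a toy dictionary.
np.save("demo.npy", {"https://example.com": ["#python", "#data"]})
npy_dict_to_xml("demo.npy", "demo.xml")
```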

Authors

Tianxin Zhou/ Weike Dai
