twitterner

Real-time event monitoring on social media based on Twitter

Project GitHub link:

Program running environment:

- Python 3.8.3 (default, Jul 2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
- tweepy 3.10.0
- pymongo 3.11.2
- emoji 0.6.0
- requests 2.25.1
- numpy 1.19.2
- nltk 3.4.5
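For convenience, the pinned versions above can be captured in a requirements.txt; this is a sketch for reproducing the environment, not a file shipped with the repository:

```
tweepy==3.10.0
pymongo==3.11.2
emoji==0.6.0
requests==2.25.1
numpy==1.19.2
nltk==3.4.5
```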

Programs are stored in four folders:

1. Data_crawl

crawler.py

**What it does:** crawls tweets through a hybrid architecture of the Twitter Streaming and REST APIs.

./original_data/stream.json: data crawled by the Streaming API is stored in this file.

./original_data/rest.json: data crawled by the REST API is stored in this file.
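The README does not reproduce crawler.py itself, so here is a minimal sketch of such a hybrid crawl under tweepy 3.10; the credentials, the "earthquake" keyword, and the item count are illustrative placeholders, not the repo's values:

```python
import json
import tweepy

# Placeholder credentials -- substitute your own developer keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# REST API: backfill recent tweets into ./original_data/rest.json.
with open("./original_data/rest.json", "a", encoding="utf-8") as f:
    for status in tweepy.Cursor(api.search, q="earthquake",
                                tweet_mode="extended").items(100):
        f.write(json.dumps(status._json) + "\n")

# Streaming API: append live tweets to ./original_data/stream.json.
class SaveListener(tweepy.StreamListener):
    def on_data(self, raw_data):
        with open("./original_data/stream.json", "a", encoding="utf-8") as f:
            f.write(raw_data.strip() + "\n")
        return True  # keep the stream alive

stream = tweepy.Stream(auth, SaveListener())
stream.filter(track=["earthquake"])  # blocks; run last or in a thread
```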


statistical.py

**What it does:** counts the Twitter data that has been crawled.

./original_data/3.29.1_rest.json: data crawled by the REST API is read from this file.

./original_data/3.29.1_stream.json: data crawled by the Streaming API is read from this file.
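A minimal counting sketch, assuming one JSON-encoded tweet per line; the media field tallied here is illustrative, the repo's statistics may differ:

```python
import json

for path in ["./original_data/3.29.1_rest.json",
             "./original_data/3.29.1_stream.json"]:
    total = with_media = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            tweet = json.loads(line)
            total += 1
            if tweet.get("entities", {}).get("media"):
                with_media += 1
    print(f"{path}: {total} tweets, {with_media} with media")
```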


LSH.py

**What it does:** a locality-sensitive hashing (LSH) algorithm is used for Twitter text clustering.

./LSHdata/LSHdata.json: the raw data is read from this file and clustered.
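The repository's own LSH code is not shown here; the sketch below illustrates the idea with the third-party datasketch package (MinHash signatures bucketed so that similar tweets collide). The threshold, num_perm, and tweet field names are assumptions:

```python
import json
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    """Build a MinHash signature over a tweet's token set."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.6, num_perm=128)  # assumed parameters
signatures = {}
with open("./LSHdata/LSHdata.json", encoding="utf-8") as f:
    for i, line in enumerate(f):
        tweet = json.loads(line)
        text = tweet.get("full_text") or tweet.get("text", "")
        signatures[str(i)] = minhash(text)
        lsh.insert(str(i), signatures[str(i)])

# Tweets whose signatures collide in enough bands share a bucket:
for key, sig in signatures.items():
    print(key, "->", lsh.query(sig))
```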


singlepass.py

**What it does:** the single-pass algorithm is used for Twitter text clustering.

./singlepass_data/singlepassdata.json: the raw data is read from this file and clustered.

./textGrouped.json: after executing singlepass.py, the clustered data is written to this file in JSON format.
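As a reference for the algorithm, here is a minimal single-pass clustering sketch; the TF-IDF representation, the "text" field name, and the 0.3 threshold are assumptions rather than the repo's settings:

```python
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def single_pass(texts, threshold=0.3):
    """Assign each text to the closest existing cluster, else start a new one."""
    vectors = TfidfVectorizer().fit_transform(texts).toarray()
    clusters, centroids = [], []
    for i, v in enumerate(vectors):
        if centroids:
            sims = cosine_similarity([v], centroids)[0]
            best = int(sims.argmax())
            if sims[best] >= threshold:
                clusters[best].append(i)
                n = len(clusters[best])
                centroids[best] = (centroids[best] * (n - 1) + v) / n  # running mean
                continue
        clusters.append([i])        # no match: open a new cluster
        centroids.append(v.copy())
    return clusters

with open("./singlepass_data/singlepassdata.json", encoding="utf-8") as f:
    texts = [json.loads(line).get("text", "") for line in f if line.strip()]
print(single_pass(texts))
```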


rankText.py

**What it does:** reads the data in ./textGrouped.json generated by singlepass.py, and performs group priority sorting and term priority sorting.

./textRanked.json: The prioritized groups and terms are stored in this JSON file
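A plausible ranking sketch follows; the grouped-file schema and the size/frequency scoring are assumptions, and the repo's priority measures may differ:

```python
import json
from collections import Counter

with open("./textGrouped.json", encoding="utf-8") as f:
    groups = json.load(f)  # assumed shape: {group_id: [tweet texts]}

ranked = []
# Group priority: larger clusters first; term priority: most frequent terms.
for gid, texts in sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True):
    terms = Counter(w for t in texts for w in t.lower().split())
    ranked.append({"group": gid, "size": len(texts),
                   "top_terms": [w for w, _ in terms.most_common(10)]})

with open("./textRanked.json", "w", encoding="utf-8") as f:
    json.dump(ranked, f, ensure_ascii=False, indent=2)
```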

2. media_download

id_collection.py and rest_crawler.py:

What it does: use the Twitter API (via Tweepy) to crawl text and image data from Twitter for a given keyword.


Usage:
  1. Open rest_crawler.py and modify the values of self.proxy and self.query according to your own proxy settings.

  2. Set the search keyword: self.query.

  3. In the penultimate line, clean_dir means that the tem_data and picture folders are cleaned at startup, i.e. the history is deleted; use with caution.

  4. In the penultimate line, since_id means to find tweets with an ID larger than this one, i.e. the latest tweets; it is used for updates.

  5. For a non-update run, set since_id to None; there is then no ID lower bound, and the crawler will crawl all tweets available within the API's time limit (currently about ten days).

  6. The crawler crawls in order from the newest tweet back to older tweets.

  7. To update, set since_id to the maximum ID of the last crawl, which is usually the first line of all_result.json or updating_all_result.json; the crawler will then retrieve only tweets with a larger ID, i.e. tweets posted after the last crawl.

  8. Open id_collection.py and fill in the four credential fields (the Tweepy account and developer-permission keys), as in the sketch after this list.

  9. After the above setup, run rest_crawler.py directly.
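For reference, the credential setup in step 8 and the since_id pattern from steps 4-7 look roughly like this; every name and value below is a placeholder, and only the four-key OAuth flow is standard Tweepy:

```python
import tweepy

# The "four information" from step 8 -- placeholder values:
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Fresh crawl: since_id=None walks back through everything still available.
# Update crawl: pass the max ID from the previous run instead (step 7).
for status in tweepy.Cursor(api.search, q="flood", since_id=None,
                            tweet_mode="extended").items(200):
    print(status.id, status.full_text[:80])
```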


Folder introduction:
  1. The tem_data folder holds all the text files of the search results.

  2. The picture folder holds all the downloaded picture files.

  3. tem_data/all_result.json stores all search results, whether or not they contain images, one tweet per line in JSON format, with all key-value pairs.

  4. tem_data/tweets_with_picture.json stores tweets with images, one tweet per line in JSON format, keeping only the id, full_text, and image links.

  5. tem_data/updating_all_result.json holds all the updated search results, to distinguish them from the non-updated results.

  6. tem_data/picture_download_error_tweets.json saves tweets that have a picture link but whose picture failed to download; these are not counted in the final saved database.

  7. Pictures in the picture folder are named: ID of the tweet the picture belongs to + the picture format (file extension).


video_download.py

What it does: download videos from the crawled Twitter data and store them in the Video folder.
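video_download.py's internals are not shown in this README; below is a hedged sketch of the usual approach, picking the highest-bitrate MP4 variant from a v1.1 tweet's extended_entities (the output folder name is illustrative):

```python
import requests

def download_video(tweet, out_dir="./Video"):
    """Save the best MP4 variant of each video in a v1.1 tweet dict."""
    for m in tweet.get("extended_entities", {}).get("media", []):
        variants = m.get("video_info", {}).get("variants", [])
        mp4s = [v for v in variants if v.get("content_type") == "video/mp4"]
        if not mp4s:
            continue
        best = max(mp4s, key=lambda v: v.get("bitrate", 0))
        resp = requests.get(best["url"], stream=True, timeout=30)
        resp.raise_for_status()
        with open(f"{out_dir}/{tweet['id_str']}.mp4", "wb") as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
```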

3. event_method1

Execute main.py; the detected events are stored in ./results in txt format.

4. event_method2

Execute event.py; it reads the original data from original_data.json, performs event detection, and outputs the results to the console.
