Project GitHub link:
Program running environment:
Python version: 3.8.3 (default, Jul 2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
tweepy version: 3.10.0
pymongo version: 3.11.2
emoji version: 0.6.0
requests version: 2.25.1
numpy version: 1.19.2
nltk version: 3.4.5
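To verify that the local environment matches the versions listed above, a quick check script (not part of the project, assuming all six packages are installed):

```python
import sys
import emoji, nltk, numpy, pymongo, requests, tweepy

# Print the interpreter version and each dependency's version.
print(sys.version)
for mod in (tweepy, pymongo, emoji, requests, numpy, nltk):
    version = getattr(mod, "__version__", getattr(mod, "version", "unknown"))
    print(mod.__name__, version)
```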
**What it does: Hybrid architecture of the Twitter Streaming & REST APIs.**
./original_data/stream.json: data crawled by the Streaming API is stored in this file.
./original_data/rest.json: data crawled by the REST API is stored in this file.
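A minimal sketch of such a hybrid setup with Tweepy 3.10, using placeholder credentials and an illustrative keyword (the project's actual scripts may differ):

```python
import json
import os
import tweepy

# Placeholder credentials; fill in your own developer keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

os.makedirs("./original_data", exist_ok=True)

class FileListener(tweepy.StreamListener):
    """Append every streamed status to stream.json, one JSON object per line."""
    def on_status(self, status):
        with open("./original_data/stream.json", "a", encoding="utf-8") as f:
            f.write(json.dumps(status._json) + "\n")

    def on_error(self, status_code):
        return False  # stop on errors such as 420 rate limiting

# REST side: page through recent matching tweets and store them in rest.json.
with open("./original_data/rest.json", "a", encoding="utf-8") as f:
    for status in tweepy.Cursor(api.search, q="example_keyword", count=100).items(500):
        f.write(json.dumps(status._json) + "\n")

# Streaming side: collect live tweets for the same keyword.
stream = tweepy.Stream(auth=api.auth, listener=FileListener())
stream.filter(track=["example_keyword"])
```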
**What it does: Count the Twitter data that has been crawled.**
./original_data/3.29.1_rest.json: data crawled by the REST API is read from this file.
./original_data/3.29.1_stream.json: data crawled by the Streaming API is read from this file.
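Counting can be as simple as tallying lines, assuming each file stores one tweet per line:

```python
# Count how many tweets each crawler collected; paths follow the section above.
def count_tweets(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

print("REST:", count_tweets("./original_data/3.29.1_rest.json"))
print("Streaming:", count_tweets("./original_data/3.29.1_stream.json"))
```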
**What it does: Locality-sensitive hashing (LSH) is used for Twitter text clustering.**
./LSHdata/LSHdata.json: the raw data is read from this file and clustered.
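For orientation, a toy locality-sensitive hashing sketch (MinHash signatures with banding); the shingle size, hash family, and band layout here are illustrative, not necessarily what the project's LSH script uses:

```python
import random
from collections import defaultdict

random.seed(42)
NUM_HASHES, BANDS = 20, 5           # 20 signature slots split into 5 bands of 4 rows
ROWS = NUM_HASHES // BANDS
MASKS = [random.getrandbits(32) for _ in range(NUM_HASHES)]

def shingles(text, k=3):
    """Break a text into overlapping k-word shingles."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def minhash(shingle_set):
    # One XOR-mask hash function per signature position; keep the minimum each time.
    return [min((hash(s) ^ m) & 0xFFFFFFFF for s in shingle_set) for m in MASKS]

def lsh_candidates(texts):
    buckets = defaultdict(list)
    for idx, text in enumerate(texts):
        sig = minhash(shingles(text))
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(idx)
    # Texts sharing any band bucket become candidate members of the same cluster.
    return [ids for ids in buckets.values() if len(ids) > 1]

texts = ["breaking news earthquake hits city",
         "breaking news earthquake hits the city",
         "cute cat picture of the day"]
print(lsh_candidates(texts))   # the two near-duplicate tweets collide
```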
**What it does: The single-pass algorithm is used for Twitter text clustering.**
./singlepass_data/singlepassdata.json: the raw data is read from this file and clustered.
./textGrouped.json: after executing singlepass.py, the clustered data is written to this file in JSON format.
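A compact sketch of the single-pass idea: each text joins the first cluster whose centroid is similar enough, otherwise it starts a new cluster. The threshold and the output schema below are assumptions, not necessarily what singlepass.py uses:

```python
import json
import math
from collections import Counter

THRESHOLD = 0.5  # assumed similarity threshold

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def single_pass(texts):
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for idx, text in enumerate(texts):
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= THRESHOLD:
            best["members"].append(idx)
            best["centroid"] += vec        # fold the new document into the centroid
        else:
            clusters.append({"centroid": vec, "members": [idx]})
    return clusters

# Texts would come from ./singlepass_data/singlepassdata.json in the real script.
groups = single_pass(["fire downtown", "downtown fire spreading", "new phone release"])
with open("./textGrouped.json", "w", encoding="utf-8") as f:
    json.dump([c["members"] for c in groups], f)
```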
**What it does: Read the data in ./textGrouped.json generated by singlepass.py, and perform group priority sorting and term priority sorting.**
./textRanked.json: the prioritized groups and terms are stored in this JSON file.
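A hedged sketch of the two sorts, assuming textGrouped.json holds a list of groups of texts (the real schema produced by singlepass.py may differ): groups are ranked by size, and terms within each group by frequency.

```python
import json
from collections import Counter

with open("./textGrouped.json", encoding="utf-8") as f:
    groups = json.load(f)   # assumed: a list of lists of tweet texts

ranked = []
for group in sorted(groups, key=len, reverse=True):   # group priority: larger first
    terms = Counter(t for text in group for t in text.lower().split())
    ranked.append({"size": len(group),
                   "top_terms": [t for t, _ in terms.most_common(10)]})

with open("./textRanked.json", "w", encoding="utf-8") as f:
    json.dump(ranked, f, ensure_ascii=False, indent=2)
```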
**What it does: Use the Twitter API, via Tweepy, to crawl text and image data from Twitter for a given keyword.**

- Open rest_crawler.py and modify the values of self.proxy and self.query according to your own proxies.
- Set the search keyword in self.query.
- In the penultimate line, clean_dir means that the temp_data and picture folders will be cleaned at the start, i.e. the history will be deleted; use with caution.
- In the penultimate line, since_id means to find tweets with an ID larger than this, i.e. the latest tweets, for update runs.
- For a non-update run, set since_id to None; there is then no lower bound, and the crawler will crawl all tweets reachable within the API's lookback window (currently about ten days).
- The crawler crawls in order from the latest tweet backwards to older tweets.
- For an update run, set since_id to the maximum ID of the last crawl, usually found in the first line of all_result.json or updating_all_result.json; the crawler will then retrieve tweets with a larger ID, i.e. the tweets posted since the last crawl.
- Open id_collection.py and fill in the four credentials for your Tweepy account and developer permissions.
- After the above setup, run crawler.py directly (a minimal sketch of the search loop follows this list).
- The tem_data folder holds all the text files of the search results.
- The picture folder holds all the downloaded picture files.
- The tem_data/all_result.json file stores all search results, with or without images, one tweet per line in JSON format, keeping all key-value pairs.
- The tem_data/tweets_with_picture.json file stores tweets with images, one tweet per line in JSON format, keeping only the ID, full_text, and image links.
- The tem_data/updating_all_result.json file holds all the updated search results, to distinguish them from the non-updated results.
- The tem_data/picture_download_error_tweets.json file saves tweets that have a picture link whose download failed; these are not counted in the final saved database.
- Pictures in the picture folder are named: the ID of the tweet the picture belongs to + the picture format.
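A minimal sketch of the REST search loop described above, with placeholder credentials, proxy, and keyword; the real rest_crawler.py adds directory cleaning, image handling, and error logging:

```python
import json
import os
import tweepy

# Placeholders for the four credentials configured in id_collection.py.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth,
                 proxy="127.0.0.1:7890",   # corresponds to self.proxy; use your own
                 wait_on_rate_limit=True)

query = "example keyword"    # corresponds to self.query
since_id = None              # None = full crawl; set to the last max ID for an update run

os.makedirs("temp_data", exist_ok=True)
max_seen = 0
with open("temp_data/all_result.json", "a", encoding="utf-8") as f:
    # api.search pages from the newest tweet backwards, matching the crawl order above.
    for status in tweepy.Cursor(api.search, q=query, since_id=since_id,
                                tweet_mode="extended", count=100).items():
        f.write(json.dumps(status._json, ensure_ascii=False) + "\n")
        max_seen = max(max_seen, status.id)

print("Use this as since_id on the next update run:", max_seen)
```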
**What it does: Download videos from the Twitter stream and store them in the Video folder.**
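A hedged sketch of the download step, assuming the mp4 URL has already been extracted from a tweet's extended_entities (the helper name and URL handling are illustrative, not the project's actual code):

```python
import os
import requests

def download_video(mp4_url, tweet_id, folder="Video"):
    """Stream an mp4 to the Video folder, named after the originating tweet."""
    os.makedirs(folder, exist_ok=True)
    resp = requests.get(mp4_url, stream=True, timeout=30)
    resp.raise_for_status()
    path = os.path.join(folder, f"{tweet_id}.mp4")
    with open(path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    return path
```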
**What it does: Event detection on the crawled Twitter data.**

- Execute mian.py; the detected events are stored in ./results in txt format.
- Execute event.py; it reads the original data from original_data.json, performs event detection, and outputs the results to the console.
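For illustration, a minimal stand-in for event.py's input stage, assuming original_data.json stores one tweet per line; the term-frequency ranking below is only a placeholder for the real event detection logic:

```python
import json
from collections import Counter

# Read one tweet per line from original_data.json.
tweets = []
with open("original_data.json", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            tweets.append(json.loads(line))

# Surface the most frequent terms as rough candidate "event" keywords.
terms = Counter(t for tw in tweets
                for t in tw.get("full_text", tw.get("text", "")).lower().split())
print("candidate event terms:", terms.most_common(10))
```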