Project GitHub link:
Program running environment:
Python version: 3.8.3 (default, Jul 2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
tweepy version: 3.10.0
pymongo version: 3.11.2
emoji version: 0.6.0
requests version: 2.25.1
numpy version: 1.19.2
nltk version: 3.4.5
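To verify that the local environment matches the versions listed above, a quick check script (not part of the project, assuming all six packages are installed):

```python
import sys
import emoji, nltk, numpy, pymongo, requests, tweepy

# Print the interpreter version and each dependency's version.
print(sys.version)
for mod in (tweepy, pymongo, emoji, requests, numpy, nltk):
    version = getattr(mod, "__version__", getattr(mod, "version", "unknown"))
    print(mod.__name__, version)
```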
**What it does: Hybrid architecture of the Twitter Streaming & REST APIs.**
./original_data/stream.json: data crawled by the Streaming API is stored in this file.
./original_data/rest.json: data crawled by the REST API is stored in this file.
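A minimal sketch of such a hybrid setup with Tweepy 3.10, using placeholder credentials and an illustrative keyword (the project's actual scripts may differ):

```python
import json
import os
import tweepy

# Placeholder credentials; fill in your own developer keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

os.makedirs("./original_data", exist_ok=True)

class FileListener(tweepy.StreamListener):
    """Append every streamed status to stream.json, one JSON object per line."""
    def on_status(self, status):
        with open("./original_data/stream.json", "a", encoding="utf-8") as f:
            f.write(json.dumps(status._json) + "\n")

    def on_error(self, status_code):
        return False  # stop on errors such as 420 rate limiting

# REST side: page through recent matching tweets and store them in rest.json.
with open("./original_data/rest.json", "a", encoding="utf-8") as f:
    for status in tweepy.Cursor(api.search, q="example_keyword", count=100).items(500):
        f.write(json.dumps(status._json) + "\n")

# Streaming side: collect live tweets for the same keyword.
stream = tweepy.Stream(auth=api.auth, listener=FileListener())
stream.filter(track=["example_keyword"])
```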
**What it does: Count the Twitter data that has been crawled.**
./original_data/3.29.1_rest.json: data crawled by the REST API is read from this file.
./original_data/3.29.1_stream.json: data crawled by the Streaming API is read from this file.
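Counting can be as simple as tallying lines, assuming each file stores one tweet per line:

```python
# Count how many tweets each crawler collected; paths follow the section above.
def count_tweets(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

print("REST:", count_tweets("./original_data/3.29.1_rest.json"))
print("Streaming:", count_tweets("./original_data/3.29.1_stream.json"))
```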
**What it does: Locality-sensitive hashing (LSH) is used for Twitter text clustering.**
./LSHdata/LSHdata.json: the raw data is read from this file and clustered.
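For orientation, a toy locality-sensitive hashing sketch (MinHash signatures with banding); the shingle size, hash family, and band layout here are illustrative, not necessarily what the project's LSH script uses:

```python
import random
from collections import defaultdict

random.seed(42)
NUM_HASHES, BANDS = 20, 5           # 20 signature slots split into 5 bands of 4 rows
ROWS = NUM_HASHES // BANDS
MASKS = [random.getrandbits(32) for _ in range(NUM_HASHES)]

def shingles(text, k=3):
    """Break a text into overlapping k-word shingles."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def minhash(shingle_set):
    # One XOR-mask hash function per signature position; keep the minimum each time.
    return [min((hash(s) ^ m) & 0xFFFFFFFF for s in shingle_set) for m in MASKS]

def lsh_candidates(texts):
    buckets = defaultdict(list)
    for idx, text in enumerate(texts):
        sig = minhash(shingles(text))
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(idx)
    # Texts sharing any band bucket become candidate members of the same cluster.
    return [ids for ids in buckets.values() if len(ids) > 1]

texts = ["breaking news earthquake hits city",
         "breaking news earthquake hits the city",
         "cute cat picture of the day"]
print(lsh_candidates(texts))   # the two near-duplicate tweets collide
```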
**What it does: The single-pass algorithm is used for Twitter text clustering.**
./singlepass_data/singlepassdata.json: the raw data is read from this file and clustered.
./textGrouped.json: after executing singlepass.py, the clustered data is written to this file in JSON format.
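A compact sketch of the single-pass idea: each text joins the first cluster whose centroid is similar enough, otherwise it starts a new cluster. The threshold and the output schema below are assumptions, not necessarily what singlepass.py uses:

```python
import json
import math
from collections import Counter

THRESHOLD = 0.5  # assumed similarity threshold

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def single_pass(texts):
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for idx, text in enumerate(texts):
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= THRESHOLD:
            best["members"].append(idx)
            best["centroid"] += vec        # fold the new document into the centroid
        else:
            clusters.append({"centroid": vec, "members": [idx]})
    return clusters

# Texts would come from ./singlepass_data/singlepassdata.json in the real script.
groups = single_pass(["fire downtown", "downtown fire spreading", "new phone release"])
with open("./textGrouped.json", "w", encoding="utf-8") as f:
    json.dump([c["members"] for c in groups], f)
```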
**What it does: Read the data in ./textGrouped.json generated by singlepass.py, and perform group priority sorting and term priority sorting.**
./textRanked.json: the prioritized groups and terms are stored in this JSON file.
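A hedged sketch of the two sorts, assuming textGrouped.json holds a list of groups of texts (the real schema produced by singlepass.py may differ): groups are ranked by size, and terms within each group by frequency.

```python
import json
from collections import Counter

with open("./textGrouped.json", encoding="utf-8") as f:
    groups = json.load(f)   # assumed: a list of lists of tweet texts

ranked = []
for group in sorted(groups, key=len, reverse=True):   # group priority: larger first
    terms = Counter(t for text in group for t in text.lower().split())
    ranked.append({"size": len(group),
                   "top_terms": [t for t, _ in terms.most_common(10)]})

with open("./textRanked.json", "w", encoding="utf-8") as f:
    json.dump(ranked, f, ensure_ascii=False, indent=2)
```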
**What it does: Use the Twitter API, via Tweepy, to crawl text and image data from Twitter for a given keyword.**

- Open rest_crawler.py and modify the values of self.proxy and self.query according to your own proxies.
- Set the search keyword in self.query.
- In the penultimate line, clean_dir means that the temp_data and picture folders will be cleaned at the start, i.e. the history will be deleted; use with caution.
- In the penultimate line, since_id means to find tweets with an ID larger than this, i.e. the latest tweets, for update runs.
- For a non-update run, set since_id to None; there is then no lower bound, and the crawler will crawl all tweets reachable within the API's lookback window (currently about ten days).
- The crawler crawls in order from the latest tweet backwards to older tweets.
- For an update run, set since_id to the maximum ID of the last crawl, usually found in the first line of all_result.json or updating_all_result.json; the crawler will then retrieve tweets with a larger ID, i.e. the tweets posted since the last crawl.
- Open id_collection.py and fill in the four credentials for your Tweepy account and developer permissions.
- After the above setup, run crawler.py directly (a minimal sketch of the search loop follows this list).
- The tem_data folder holds all the text files of the search results.
- The picture folder holds all the downloaded picture files.
- The tem_data/all_result.json file stores all search results, with or without images, one tweet per line in JSON format, keeping all key-value pairs.
- The tem_data/tweets_with_picture.json file stores tweets with images, one tweet per line in JSON format, keeping only the ID, full_text, and image links.
- The tem_data/updating_all_result.json file holds all the updated search results, to distinguish them from the non-updated results.
- The tem_data/picture_download_error_tweets.json file saves tweets that have a picture link whose download failed; these are not counted in the final saved database.
- Pictures in the picture folder are named: the ID of the tweet the picture belongs to + the picture format.
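A minimal sketch of the REST search loop described above, with placeholder credentials, proxy, and keyword; the real rest_crawler.py adds directory cleaning, image handling, and error logging:

```python
import json
import os
import tweepy

# Placeholders for the four credentials configured in id_collection.py.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth,
                 proxy="127.0.0.1:7890",   # corresponds to self.proxy; use your own
                 wait_on_rate_limit=True)

query = "example keyword"    # corresponds to self.query
since_id = None              # None = full crawl; set to the last max ID for an update run

os.makedirs("temp_data", exist_ok=True)
max_seen = 0
with open("temp_data/all_result.json", "a", encoding="utf-8") as f:
    # api.search pages from the newest tweet backwards, matching the crawl order above.
    for status in tweepy.Cursor(api.search, q=query, since_id=since_id,
                                tweet_mode="extended", count=100).items():
        f.write(json.dumps(status._json, ensure_ascii=False) + "\n")
        max_seen = max(max_seen, status.id)

print("Use this as since_id on the next update run:", max_seen)
```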
**What it does: Download videos from the Twitter stream and store them in the Video folder.**
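A hedged sketch of the download step, assuming the mp4 URL has already been extracted from a tweet's extended_entities (the helper name and URL handling are illustrative, not the project's actual code):

```python
import os
import requests

def download_video(mp4_url, tweet_id, folder="Video"):
    """Stream an mp4 to the Video folder, named after the originating tweet."""
    os.makedirs(folder, exist_ok=True)
    resp = requests.get(mp4_url, stream=True, timeout=30)
    resp.raise_for_status()
    path = os.path.join(folder, f"{tweet_id}.mp4")
    with open(path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    return path
```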
**What it does: Event detection on the crawled Twitter data.**

- Execute mian.py; the detected events are stored in ./results in txt format.
- Execute event.py; it reads the original data from original_data.json, performs event detection, and outputs the results to the console.
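For illustration, a minimal stand-in for event.py's input stage, assuming original_data.json stores one tweet per line; the term-frequency ranking below is only a placeholder for the real event detection logic:

```python
import json
from collections import Counter

# Read one tweet per line from original_data.json.
tweets = []
with open("original_data.json", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            tweets.append(json.loads(line))

# Surface the most frequent terms as rough candidate "event" keywords.
terms = Counter(t for tw in tweets
                for t in tw.get("full_text", tw.get("text", "")).lower().split())
print("candidate event terms:", terms.most_common(10))
```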