This project crawls post content and comments from the PTT website and asynchronously applies the neural CKIP Chinese NLP tools to the scraped data.
Python version
python == 3.7.5
Clone repository
git clone [email protected]:Retr0327/ptt-crawler.git
Install requirements
cd scraptt && pip install -r requirement.txt
Commands
scrapy crawl <spider-name> -a boards=BOARDS [-a all=True]
[-a index_from=NUMBER -a index_to=NUMBER]
[-a since=YEAR] [-a data_dir=PATH]
arguments:
<spider-name>      the name of the PTT spider to run (one of: boards, ptt_post, ptt_post_segmentation)
-a boards=BOARDS   the PTT board(s) to crawl (required)
Crawl all the posts of a board (board names in the examples below, such as Gossiping, are placeholders):
scrapy crawl ptt_post -a boards=Gossiping -a all=True
Crawl all the posts of a board from a given year onward:
scrapy crawl ptt_post -a boards=Gossiping -a since=2020
Crawl the posts of a board based on its HTML index pages:
scrapy crawl ptt_post -a boards=Gossiping -a index_from=1 -a index_to=10
Crawl the posts of multiple boards (assuming board names are comma-separated):
scrapy crawl ptt_post -a boards=Gossiping,NBA -a all=True
To save the (segmented) post data, add the -a data_dir option to the command, e.g. -a data_dir=./ptt_data
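Once a data_dir is set, the crawl output can be post-processed in Python. The sketch below assumes the pipeline writes JSON files into that directory; this is an assumption, so check the project's Scrapy pipeline for the actual file layout and field names:

```python
import json
from pathlib import Path


def load_posts(data_dir):
    """Collect records from every .json file under data_dir.

    Hypothetical helper: the real file layout and schema depend on
    the project's Scrapy pipeline, not on anything in this README.
    """
    posts = []
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            posts.extend(json.load(f))
    return posts
```

For example, load_posts("./ptt_data") would return a single list of all crawled records, ready for further NLP processing.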
If you have any suggestions or questions, please do not hesitate to email me at [email protected]