This is my first crawler homework using Scrapy. The goal is to find the best-selling products in the Yahoo shop.
I do not yet have a good algorithm for predicting which products sell best, so the crawler follows Yahoo's billboard to retrieve goods with sales potential.
web content --[parsed by]--> crawler --[saved to]--> SQLite DB --[read by]--> report
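The "saved to SQLite" step of this flow could look roughly like the sketch below. It is only an illustration and not the project's actual `pipelines.py` or `db.py`: the class name `SQLitePipeline`, the `products` table, and its `category`/`name`/`price` columns are assumptions chosen to match the fields seen in the report output. The three method names follow Scrapy's item pipeline interface.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical item pipeline that stores each scraped product in SQLite."""

    def __init__(self, db_path="Yahoo.sqlite"):
        # The default path mirrors the Yahoo.sqlite file in the project root.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS products ("
            "category TEXT, name TEXT, price REAL)"  # assumed schema
        )

    def process_item(self, item, spider):
        # Called for every scraped item: insert one row and pass the item on.
        self.conn.execute(
            "INSERT INTO products (category, name, price) VALUES (?, ?, ?)",
            (item["category"], item["name"], item["price"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

To use a pipeline like this in Scrapy, it would be registered under `ITEM_PIPELINES` in `settings.py`.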
$ tree ./
./
├── README.md
├── Yahoo.sqlite
├── juvoplus2
│ ├── __init__.py
│ ├── items.py
│ ├── pipelines.py
│ ├── settings.py
│ ├── spiders
│ │ ├── __init__.py
│ │ ├── crawler.py
│ │ ├── db.py
│ │ └── report.py
│ └── tests
│ ├── __init__.py
│ ├── db_test.py
│ └── report_test.py
├── run_crawler.sh
├── run_report.sh
└── scrapy.cfg
- CRAWLER:
- scrapy
- Beautiful Soup
- json
- sqlite3
- REPORT:
- sqlite3
- argparse
- TEST:
- unittest
- run crawler script:
$ ./run_crawler.sh
- run report script:
$ ./run_report.sh -u TOP_TEN_FOR_ALL
- CRAWLER:
- The target URL is currently a hard-coded string; it could instead be discovered by parsing the page's JavaScript.
- Only the Yahoo "Billboard" is parsed; parsing more categories would enrich the data.
- REPORT:
- Only 2 use cases are supported; more could be added.
- SYSTEM LIMITATION/POTENTIAL RISK:
- As we retrieve more content from Yahoo, performance is the likely bottleneck of this crawler. There are several ways to improve it:
- Replace SQLite with MySQL/PostgreSQL.
- The script already uses the Scrapy framework, which is built on Twisted. When more clients are needed to parse data, we could leverage Twisted's asynchronous model to run more spiders concurrently.
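Before changing the database or adding clients, one inexpensive thing to try for the performance concern above is tuning Scrapy's own concurrency settings. The fragment below is a sketch of what could be added to `juvoplus2/settings.py`; the setting names are standard Scrapy settings, but the values are illustrative assumptions and have not been measured against Yahoo.

```python
# Hypothetical additions to juvoplus2/settings.py; all values are illustrative.

# Maximum number of requests Scrapy performs concurrently across all domains.
CONCURRENT_REQUESTS = 32

# Cap on concurrent requests sent to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Delay in seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 0.25

# Let the AutoThrottle extension adapt the delay to the observed latency.
AUTOTHROTTLE_ENABLED = True
```

Higher concurrency speeds up crawling but also increases load on the target site, so the delay and throttling settings are worth keeping conservative.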
$ ./run_report.sh -h
usage: report.py [-h] [-U] [-u USECASES] [-s SORT_UP_DOWN]
optional arguments:
-h, --help show this help message and exit
-U List Usecases
-u USECASES Apply Usecase
-s SORT_UP_DOWN Apply sort sequence (asc/desc), default is desc
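A command-line interface with the options shown in this help text can be declared with `argparse` roughly as follows. This is a reconstruction from the help output, not the actual contents of `report.py`; the function name `build_parser` and the `dest` names are assumptions.

```python
import argparse


def build_parser():
    """Build a parser matching the -U / -u / -s options in the help text."""
    parser = argparse.ArgumentParser(prog="report.py")
    # -U is a flag without a value: list the available use cases.
    parser.add_argument("-U", dest="list_usecases", action="store_true",
                        help="List Usecases")
    # -u takes the name of the use case to run.
    parser.add_argument("-u", dest="usecases", metavar="USECASES",
                        help="Apply Usecase")
    # -s selects the sort order and falls back to "desc" when omitted.
    parser.add_argument("-s", dest="sort_up_down", metavar="SORT_UP_DOWN",
                        choices=("asc", "desc"), default="desc",
                        help="Apply sort sequence (asc/desc), default is desc")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```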
$ ./run_report.sh -U
TOP_TEN_FOR_ALL
TOP_TWO_PER_CATEGORY
$ ./run_report.sh -u TOP_TEN_FOR_ALL
電視,CHIMEI 奇美 TL-50W600 50吋 廣色域智慧聯網顯示器+視訊盒,20900.0
電冰箱,樂金LG 253公升Smart 變頻上下門冰箱GN-L305SV,16900.0
電視,CHIMEI 奇美 TL-43W600 43吋 廣色域智慧聯網顯示器+視訊盒,15900.0
電冰箱,樂金LG 186公升Smart 變頻上下門冰箱GN-L235SV,14900.0
電冰箱,聲寶250L經典品味雙門電冰箱SR-L25G(S2)璀璨銀,14200.0
電冰箱,TOSHIBA東芝226L二門電冰箱GR-S24TPB(含運送和基本安裝),11900.0
電冰箱,聲寶140L經典品味雙門冰箱SR-L14Q(S1),11400.0
歐系精品包 / 配件,【萬寶龍】小牛皮4810 6卡皮夾,8280.0
歐系精品包 / 配件,LONGCHAMP Fantaisie質感尼龍短把手提/斜背兩用水餃包(芙蓉紅/中),7980.0
歐系精品包 / 配件,LONGCHAMP 尼龍短把手提/斜背兩用水餃包(中/深藍),6100.0
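The sample output above appears to be sorted by price in descending order, so the two use cases could be expressed as SQLite queries along the lines below. This is a sketch under assumptions: the `products` table, its `category`/`name`/`price` columns, and the function names are hypothetical, and ranking by price is inferred from the output rather than confirmed by the source code.

```python
import sqlite3


def top_ten_for_all(conn, order="desc"):
    """Return the ten highest-priced products across all categories."""
    rows = conn.execute(
        "SELECT category, name, price FROM products "
        "ORDER BY price DESC LIMIT 10"
    ).fetchall()
    if order == "asc":
        rows.reverse()  # same ten rows, shown cheapest first
    return rows


def top_two_per_category(conn):
    """Return the two highest-priced products in each category.

    A row is kept when fewer than two rows in its category have a higher
    price. Price ties can therefore yield more than two rows per category.
    """
    return conn.execute(
        "SELECT category, name, price FROM products AS p "
        "WHERE (SELECT COUNT(*) FROM products AS q "
        "       WHERE q.category = p.category AND q.price > p.price) < 2 "
        "ORDER BY category, price DESC"
    ).fetchall()


def format_rows(rows):
    """Format rows as 'category,name,price' lines like the report output."""
    return [",".join(str(value) for value in row) for row in rows]
```

The correlated subquery is used instead of a window function so the query also works on older SQLite versions.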