
yahoobuy's Introduction

Crawler for Yahoo Buy

Description/Target

This is my first crawler homework using Scrapy. The goal is to collect the best-selling products from the Yahoo shop.

Strategy

I currently have no good analysis algorithm for predicting the best-selling products of the Yahoo shop, so I decided to follow Yahoo's billboard to retrieve goods with sales potential.

Data Flow

web content --[parsed by]--> crawler --[save to]--> SQLite DB --[read by]--> report
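The "save to" and "read by" steps of this flow can be sketched with the standard `sqlite3` module. This is only a minimal illustration: the table name `products`, its columns, and the helper names `save_items` and `read_rows` are assumptions, since the real schema in `db.py` is not shown here.

```python
import sqlite3

# Hypothetical schema: the README does not show the real table layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS products (
    category TEXT NOT NULL,
    name     TEXT NOT NULL,
    price    REAL NOT NULL
)
"""


def save_items(conn, items):
    """Store parsed items (dicts with category, name, price) in SQLite."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO products (category, name, price) "
        "VALUES (:category, :name, :price)",
        items,
    )
    conn.commit()


def read_rows(conn):
    """Read the stored rows back, most expensive first, as a report might."""
    return conn.execute(
        "SELECT category, name, price FROM products ORDER BY price DESC"
    ).fetchall()


if __name__ == "__main__":
    # An in-memory database keeps the sketch self-contained;
    # the project itself stores data in Yahoo.sqlite.
    conn = sqlite3.connect(":memory:")
    save_items(conn, [
        {"category": "TV", "name": "example TV", "price": 20900.0},
        {"category": "Fridge", "name": "example fridge", "price": 16900.0},
    ])
    for row in read_rows(conn):
        print(",".join(str(col) for col in row))
```

In the real project the items would come from the Scrapy spider (for example through `pipelines.py`) rather than from a hard-coded list.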

File Structures

$ tree ./
./
├── README.md
├── Yahoo.sqlite
├── juvoplus2
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   ├── spiders
│   │   ├── __init__.py
│   │   ├── crawler.py
│   │   ├── db.py
│   │   └── report.py
│   └── tests
│       ├── __init__.py
│       ├── db_test.py
│       └── report_test.py
├── run_crawler.sh
├── run_report.sh
└── scrapy.cfg

Libraries

  1. CRAWLER:
    1. scrapy
    2. Beautiful Soup
    3. json
    4. sqlite3
  2. REPORT:
    1. sqlite3
    2. argparse
  3. TEST:
    1. unittest

HOW-TO

  1. run crawler script:
    $ ./run_crawler.sh
  2. run report script:
    $ ./run_report.sh -u TOP_TEN_FOR_ALL

The Road Ahead

  1. CRAWLER:
    1. The target URL is currently a static string; it could instead be obtained by parsing the site's JavaScript.
    2. Only Yahoo's "Billboard" is parsed; parsing more categories would enrich the data.
  2. REPORT:
    1. Only 2 use cases are supported; more could be added.
  3. SYSTEM LIMITATION/POTENTIAL RISK:
    • When retrieving more content from Yahoo, performance is the likely limitation of this crawler. There are several ways to improve it:
      1. Replace SQLite with MySQL/PostgreSQL.
      2. The Scrapy framework is embedded in the script, so when more clients are needed to parse data, the "Twisted" framework could be leveraged to scale the spiders.

Example

$ ./run_report.sh -h
usage: report.py [-h] [-U] [-u USECASES] [-s SORT_UP_DOWN]

optional arguments:
  -h, --help       show this help message and exit
  -U               List Usecases
  -u USECASES      Apply Usecase
  -s SORT_UP_DOWN  Apply sort sequence (asc/desc), default is desc

$ ./run_report.sh -U
TOP_TEN_FOR_ALL
TOP_TWO_PER_CATEGORY

$ ./run_report.sh -u TOP_TEN_FOR_ALL
電視,CHIMEI 奇美 TL-50W600 50吋 廣色域智慧聯網顯示器+視訊盒,20900.0
電冰箱,樂金LG 253公升Smart 變頻上下門冰箱GN-L305SV,16900.0
電視,CHIMEI 奇美 TL-43W600 43吋 廣色域智慧聯網顯示器+視訊盒,15900.0
電冰箱,樂金LG 186公升Smart 變頻上下門冰箱GN-L235SV,14900.0
電冰箱,聲寶250L經典品味雙門電冰箱SR-L25G(S2)璀璨銀,14200.0
電冰箱,TOSHIBA東芝226L二門電冰箱GR-S24TPB(含運送和基本安裝),11900.0
電冰箱,聲寶140L經典品味雙門冰箱SR-L14Q(S1),11400.0
歐系精品包 / 配件,【萬寶龍】小牛皮4810 6卡皮夾,8280.0
歐系精品包 / 配件,LONGCHAMP Fantaisie質感尼龍短把手提/斜背兩用水餃包(芙蓉紅/中),7980.0
歐系精品包 / 配件,LONGCHAMP 尼龍短把手提/斜背兩用水餃包(中/深藍),6100.0
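For readers who want to see how such a command line could be wired up, here is a rough sketch using `argparse` and `sqlite3`. It is not the project's `report.py`: the `products(category, name, price)` table, the function names, and the assumption that `TOP_TEN_FOR_ALL` ranks by price (which the sample output above appears to do) are all guesses made for illustration. How `-s` interacts with the ranking is not documented either; here it simply flips the `ORDER BY` direction.

```python
import argparse
import sqlite3


def build_parser():
    """Build a parser with the options listed in the help text above."""
    parser = argparse.ArgumentParser(prog="report.py")
    parser.add_argument("-U", dest="list_usecases", action="store_true",
                        help="List Usecases")
    parser.add_argument("-u", dest="usecases", metavar="USECASES",
                        help="Apply Usecase")
    parser.add_argument("-s", dest="sort_up_down", metavar="SORT_UP_DOWN",
                        choices=["asc", "desc"], default="desc",
                        help="Apply sort sequence (asc/desc), default is desc")
    return parser


def top_ten_for_all(conn, sort="desc"):
    """Assumed meaning of TOP_TEN_FOR_ALL: ten rows ordered by price."""
    # Map the flag to a fixed keyword instead of interpolating user input.
    order = "ASC" if sort == "asc" else "DESC"
    query = ("SELECT category, name, price FROM products "
             f"ORDER BY price {order} LIMIT 10")
    return conn.execute(query).fetchall()


# Only one of the two listed use cases is sketched here.
USECASES = {"TOP_TEN_FOR_ALL": top_ten_for_all}


def run(conn, argv):
    """Parse argv and return the report as a list of output lines."""
    args = build_parser().parse_args(argv)
    if args.list_usecases:
        return list(USECASES)
    if args.usecases not in USECASES:
        raise SystemExit("unknown or missing use case; try -U")
    rows = USECASES[args.usecases](conn, args.sort_up_down)
    return [",".join(str(col) for col in row) for row in rows]


if __name__ == "__main__":
    # Demo data in an in-memory database (hypothetical schema).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (category TEXT, name TEXT, price REAL)")
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                     [("TV", "example TV", 20900.0),
                      ("Fridge", "example fridge", 16900.0)])
    for line in run(conn, ["-u", "TOP_TEN_FOR_ALL"]):
        print(line)
```

Keeping the use cases in a dictionary means `-U` can list them directly and a new report only needs one more entry.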

yahoobuy's People

Contributors

wenlien
