Autohome

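Autohome is built on the Scrapy crawling framework. It runs a targeted crawl of the article pages of 汽车之家 (Autohome) and stores the scraped data in MongoDB. Simple analysis and NLP work on the scraped data are planned for a later stage.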

Runtime environment

  • Python 2.7.10
  • MongoDB 3.2.10
  • Scrapy 1.3.2
  • pymongo 3.4.0

Project structure

│  readme.md
│  requirements.txt
│  scrapy.cfg
│
├─autohome
│  │  __init__.py
│  │  items.py
│  │  pipelines.py
│  │  settings.py
│  │
│  └─spiders
│          __init__.py
│          autohome_spider.py
│
└─support_file
    ├─architecture
    │      autohome_architecture.png
    │      autohome_architecture.vsdx
    │
    └─four_theme
            autohome_four_theme.png
            part1.png
            part2.png
            part3.png
            part4.png
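  • autohome: the main package of the project. Most of the Autohome code lives here, and the spiders subfolder holds the spider itself.
  • support_file: supporting files, mainly the images used in the documentation and their source files.
  • scrapy.cfg: the Scrapy project configuration file.
  • requirements.txt: the third-party packages that Autohome depends on.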

Usage

pip install -r requirements.txt

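On Windows you may see the message "'pip' is not recognized as an internal or external command, operable program or batch file." This usually means pip is not on your PATH; fix that before continuing.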

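  • Choose how the data is stored. By default, items are written to both MongoDB and a local JSON file; to change this, edit ITEM_PIPELINES in Autohome/autohome/settings.py (writing to both at once may cause high disk I/O).
  • From the Autohome root directory, run: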
scrapy crawl autohome_article

This starts the Autohome crawler. The log file is written to the Autohome root directory and is named after the time the crawl was started.
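As a rough illustration of the two settings mentioned above, the fragment below shows how pipeline selection and a time-stamped log file could look in settings.py. ITEM_PIPELINES and LOG_FILE are standard Scrapy settings, and the pipeline class names come from this README. The priority numbers and the log file name format are assumptions, not values taken from the project.

```python
# settings.py (fragment): a sketch, not the project's actual file
from datetime import datetime

# Enable one or both pipelines. The integers are Scrapy priorities
# (lower values run first); 300 and 400 are assumed values.
ITEM_PIPELINES = {
    "autohome.pipelines.AutohomeJsonPipeline": 300,
    "autohome.pipelines.AutohomeMongodbPipeline": 400,
}

# Name the log file after the time the crawl starts.
# The timestamp format is an assumption.
LOG_FILE = datetime.now().strftime("%Y%m%d_%H%M%S") + ".log"
```

To write to only one destination, remove the other entry from the dictionary or comment it out.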

Design overview

Crawler design overview

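  • Autohome crawls the article pages of 汽车之家. The crawl is divided into four themes: article summaries, article details, article comments, and the users who commented. Starting from the crawler's root node, the logic of the four parts is shown in the diagram (see support_file/four_theme/autohome_four_theme.png).

  • Autohome uses the Scrapy framework to crawl all four themes. In the overall flowchart (see support_file/architecture/autohome_architecture.png), the green parts are Scrapy's built-in logic and the blue parts are the crawl logic specific to the 汽车之家 article pages.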
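The four themes can be read as a chain of callbacks, each emitting items for its own theme and scheduling the next one. The sketch below models that chain in plain Python with a small work queue, so it runs without Scrapy or network access. Everything in it is hypothetical: the function names, the fake data, and the ordering (summary, then detail, then comments, then users) are assumptions made for illustration and are not taken from autohome_spider.py.

```python
from collections import deque

# Hypothetical stand-in for the site's pages.
FAKE_SITE = {
    "articles": [{"id": 1, "title": "Sample article"}],
    "details": {1: {"body": "Sample body"}},
    "comments": {1: [{"user_id": 7, "text": "Nice"}]},
    "users": {7: {"name": "sample_user"}},
}


def parse_article_list(_):
    # Theme 1: article summaries from the list page.
    for summary in FAKE_SITE["articles"]:
        yield ("item", "summary", summary)
        yield ("follow", parse_article_detail, summary["id"])


def parse_article_detail(article_id):
    # Theme 2: the full article.
    yield ("item", "detail", dict(FAKE_SITE["details"][article_id], id=article_id))
    yield ("follow", parse_comments, article_id)


def parse_comments(article_id):
    # Theme 3: comments on the article.
    for comment in FAKE_SITE["comments"][article_id]:
        yield ("item", "comment", dict(comment, article_id=article_id))
        yield ("follow", parse_user, comment["user_id"])


def parse_user(user_id):
    # Theme 4: the user who wrote a comment.
    yield ("item", "user", dict(FAKE_SITE["users"][user_id], id=user_id))


def crawl(start_callback):
    """Run callbacks from a FIFO queue, collecting (theme, data) items."""
    queue = deque([(start_callback, None)])
    items = []
    while queue:
        callback, arg = queue.popleft()
        for kind, first, second in callback(arg):
            if kind == "item":
                items.append((first, second))
            else:
                queue.append((first, second))
    return items


items = crawl(parse_article_list)
```

In a Scrapy spider, the "follow" tuples would correspond to yielding Request objects with a callback, and the "item" tuples to yielding items that the pipelines then store.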

Features

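  • Implemented entirely on the Scrapy framework.
  • Two pipelines are defined: AutohomeJsonPipeline, which writes to a local JSON file, and AutohomeMongodbPipeline, which stores items in MongoDB. The ITEM_PIPELINES setting in settings.py controls which pipelines are enabled.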
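For readers unfamiliar with Scrapy pipelines, here is a minimal sketch of what two pipelines with these names might look like. It is not the project's actual pipelines.py. The output file name, the MongoDB URI, and the database and collection names are placeholder assumptions. The hooks open_spider, close_spider and process_item are Scrapy's standard pipeline methods, and MongoClient and insert_one are standard pymongo calls.

```python
import json


class AutohomeJsonPipeline(object):
    """Append each item to a local file, one JSON object per line."""

    def __init__(self, path="autohome_items.jl"):  # hypothetical file name
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "a")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # json.dumps escapes non-ASCII characters by default,
        # so the output file is plain ASCII.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item


class AutohomeMongodbPipeline(object):
    """Insert each item into a MongoDB collection."""

    def __init__(self, uri="mongodb://localhost:27017",
                 db_name="autohome", collection_name="article"):
        # All three defaults are placeholder assumptions.
        self.uri = uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported here so that the JSON pipeline works without pymongo.
        import pymongo
        self.client = pymongo.MongoClient(self.uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item
```

Both classes return the item from process_item, so they can be enabled together and each item passes through both in priority order.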

TODO

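  • Write proxy and user-agent middlewares
  • Improve the speed and completeness of crawling with a simulated login
  • Analyze the structured data that has been scraped
  • Analyze the unstructured data that has been scraped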

Change Log

  • 20170531: Migrated the crawler from the original custom-module implementation to the Scrapy framework

Contributors

  • zhongjiajie
