
cnnb


Uses the Scrapy framework to crawl recent local news from Ningbo Daily, and saves the news to MongoDB, to local folders, and as JSON.


Environment:

PyCharm 2018.3.5 (Professional Edition)

Scrapy 1.5.1

Python 3.7.0

newspaper3k 0.2.8

Create the project

scrapy startproject cnnb
scrapy genspider nbnews cnnb.com.cn

Add the MongoDB connection information to the settings.py file in the project directory.

mongo_host = 'localhost'
mongo_port = 27017
mongo_user = 'root'
mongo_passwd = '1997'
mongo_db_name = 'crawler'
mongo_db_collection = 'nbnews'

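Add the MongoDB operations to pipelines.py, along with the code that saves each article as a txt file. Scrapy's built-in feed exports already cover JSON, CSV, XML, pickle, and marshal, and can also write output to a remote FTP server.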
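The MongoDB part of pipelines.py could look roughly like the sketch below. It is not the project's actual pipeline: the class name MongoPipeline, the helper build_mongo_uri, and the upper-case setting names (MONGO_HOST and so on) are assumptions made for illustration. Scrapy only loads upper-case names from settings.py into crawler.settings, so the lower-case names shown above would have to be imported from the settings module directly or renamed.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, port, user=None, passwd=None):
    """Build a MongoDB connection URI, escaping the credentials."""
    if user and passwd:
        return f"mongodb://{quote_plus(user)}:{quote_plus(passwd)}@{host}:{port}"
    return f"mongodb://{host}:{port}"


class MongoPipeline:
    """Hypothetical pipeline that inserts every scraped item into MongoDB."""

    def __init__(self, uri, db_name, collection_name):
        self.uri = uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Assumption: settings.py defines upper-case variants of the names.
        s = crawler.settings
        uri = build_mongo_uri(
            s.get("MONGO_HOST", "localhost"),
            s.getint("MONGO_PORT", 27017),
            s.get("MONGO_USER"),
            s.get("MONGO_PASSWD"),
        )
        return cls(uri, s.get("MONGO_DB_NAME", "crawler"),
                   s.get("MONGO_DB_COLLECTION", "nbnews"))

    def open_spider(self, spider):
        # Imported here so the module still loads where pymongo is missing.
        import pymongo
        self.client = pymongo.MongoClient(self.uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store a plain dict copy and pass the item on to later pipelines.
        self.collection.insert_one(dict(item))
        return item
```

For Scrapy to call it, the class would also need an entry in ITEM_PIPELINES in settings.py.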

Selectors

Scrapy ships with its own XPath and CSS selectors, so no additional library is needed.

news_list = response.xpath('//div[@class="articleList"]//ul[@class="fiveBox"]//li//a//@href').extract()

Scrapy's built-in XPath selector returns the list of news article URLs. The newspaper3k library then takes each URL and extracts that article's title and body text.
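A minimal sketch of the newspaper3k step follows, assuming each URL is fetched with newspaper's Article class and language="zh". The helper names fetch_article and article_to_item and the dictionary keys are hypothetical; the project's own item fields may differ.

```python
def article_to_item(article, url):
    """Map a parsed article (anything exposing title, text, publish_date)
    to a plain dict. The key names are illustrative assumptions."""
    published = getattr(article, "publish_date", None)
    return {
        "url": url,
        "title": (article.title or "").strip(),
        "content": (article.text or "").strip(),
        # newspaper3k leaves publish_date as None when it finds no date.
        "publish_date": published.strftime("%Y-%m-%d") if published else None,
    }


def fetch_article(url, language="zh"):
    """Download and parse one news page with newspaper3k."""
    # Imported here so article_to_item stays usable without the library.
    from newspaper import Article

    article = Article(url, language=language)
    article.download()
    article.parse()
    return article_to_item(article, url)
```

In the spider, fetch_article would be called once for each entry of news_list. The URLs need to be absolute; response.urljoin can resolve relative ones.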

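Launcher

PyCharm cannot create a Scrapy project directly, and it cannot start one from the IDE the way it starts a Django project, so the crawl has to be started with the terminal command. Starting it from inside PyCharm also goes through that command, issued from a small Python script: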

from scrapy import cmdline
cmdline.execute('scrapy crawl nbnews'.split())

Screenshots:

Local files are grouped by each article's publication date

Folders

Local news file list
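One way to produce the date-based folder layout is sketched below. It assumes one folder per publication date (named YYYY-MM-DD) and one txt file per article. The function name save_article, the folder naming, and the title sanitising rule are assumptions, not the project's actual code.

```python
import re
from datetime import date
from pathlib import Path


def save_article(base_dir, title, text, publish_date):
    """Write one article to <base_dir>/<YYYY-MM-DD>/<title>.txt."""
    folder = Path(base_dir) / publish_date.strftime("%Y-%m-%d")
    folder.mkdir(parents=True, exist_ok=True)

    # Replace characters that are unsafe in file names, plus whitespace.
    safe_title = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_") or "untitled"

    path = folder / f"{safe_title}.txt"
    path.write_text(text, encoding="utf-8")
    return path
```

Calling save_article("news", title, content, date(2019, 3, 1)) would create news/2019-03-01/ if it does not exist and write the article there.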

scrapy crawl nbnews -o nbnews.json uses Scrapy's built-in feed export to write the scraped content out as JSON

Saved as JSON

News information stored in MongoDB

Saved in MongoDB
