haxine / crawlweibo

This project forked from gordon-deng/crawlweibo

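A Python crawler for Weibo. It collects the following attributes: post content, whether the post is original, reposted content, publication time, repost count, comment count, like count, source device, and Weibo ID. The source of each fetched page is analyzed to find the tag that holds each attribute, and the data are extracted from those tags. The collected data are then saved in CSV format for data analysis.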

License: GNU General Public License v3.0

Python 100.00%

crawlweibo's Introduction

CrawlWeiBo

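There are two main ways to collect data from Sina Weibo: through the Sina Weibo API, or by parsing pages with a web crawler. This system uses the crawler-based page-parsing approach, which can get around the limits of the open API and crawl continuously. The crawler takes URLs from a queue in order, downloads the pages they point to, and parses each page as a DOM tree. XPath is used to locate the DOM nodes that hold the key information, and the content of those nodes is then extracted.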
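
The queue, download, parse, and extract steps described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the page snippet, the tag and class names, and the `fetch`, `parse_posts`, and `crawl` helpers are all hypothetical. It uses the limited XPath subset in the standard library's `xml.etree.ElementTree`, which only handles well-formed markup; real Weibo pages would likely need an HTML-tolerant parser.

```python
from collections import deque
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a downloaded page.
# The tag and class names are assumptions, not Weibo's real markup.
SAMPLE_PAGE = """
<html><body>
  <div class="post" id="M_1">
    <p class="text">First post</p>
    <span class="time">2016-01-01 08:00</span>
  </div>
  <div class="post" id="M_2">
    <p class="text">Second post</p>
    <span class="time">2016-01-02 09:30</span>
  </div>
</body></html>
"""


def fetch(url):
    """Stand-in for the download step; a real crawler would issue an HTTP request."""
    return SAMPLE_PAGE


def parse_posts(page_source):
    """Build a DOM tree and pull fields out of the nodes matched by XPath."""
    root = ET.fromstring(page_source.strip())
    posts = []
    for node in root.findall(".//div[@class='post']"):
        posts.append({
            "id": node.get("id"),
            "text": node.findtext("p[@class='text']"),
            "time": node.findtext("span[@class='time']"),
        })
    return posts


def crawl(urls):
    """Process URLs in queue order and collect the extracted posts."""
    queue = deque(urls)
    results = []
    while queue:
        url = queue.popleft()
        results.extend(parse_posts(fetch(url)))
    return results


posts = crawl(["https://example.com/page/1"])
```

Keeping the XPath expressions in one function means only `parse_posts` has to change when the page markup changes.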

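Government Weibo Analysis

According to the requirements, the following data attributes need to be collected:

  • Post content
  • Whether the post is original
  • Reposted content
  • Publication time
  • Repost count
  • Comment count
  • Like count
  • Source device
  • Weibo ID

The source of each fetched page is analyzed to find the tag that holds each attribute, and the data are extracted from those tags. The collected data are then saved in CSV format for data analysis.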
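
Saving the nine attributes to CSV might look like the sketch below. The column names in `FIELDS`, the `write_posts` helper, and the sample record are assumptions made for illustration; the README does not give the project's actual column names.

```python
import csv
import io

# Hypothetical column names, one per attribute in the list above.
FIELDS = [
    "content", "is_original", "reposted_content", "published_at",
    "reposts", "comments", "likes", "source_device", "weibo_id",
]


def write_posts(posts, stream):
    """Write a header row followed by one CSV row per collected post."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(posts)


# Made-up sample record, for illustration only.
sample = [{
    "content": "Road closure notice",
    "is_original": True,
    "reposted_content": "",
    "published_at": "2016-01-01 08:00",
    "reposts": 12,
    "comments": 3,
    "likes": 40,
    "source_device": "iPhone",
    "weibo_id": "M_1",
}]

buffer = io.StringIO()
write_posts(sample, buffer)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="", encoding="utf-8-sig")`; the `utf-8-sig` encoding helps spreadsheet tools detect UTF-8 when the content is Chinese.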

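Manual Selection of Weibo Accounts

Posts are crawled according to the time of the event: one month before and one month after it, three months in total. To automate collection, the PageId is first crawled from the Weibo account and then spliced into the crawl URL as one of its fields, so a Weibo account alone is enough to crawl its posts.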
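
The time window and the URL splicing can be sketched as below, under two loudly stated assumptions. First, "three months" is read here as the calendar month of the event plus the full month on either side. Second, `URL_TEMPLATE` is a placeholder, since the README does not give the real Weibo URL layout. The helper names and the sample PageId are hypothetical as well.

```python
from datetime import date, timedelta

# Placeholder pattern; the real URL layout is not given in the README.
URL_TEMPLATE = "https://example.com/p/{page_id}/posts?start={start}&end={end}"


def shift_month(year, month, delta):
    """Return (year, month) moved by `delta` months."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1


def three_month_window(event_day):
    """First day of the month before the event to last day of the month after."""
    start_year, start_month = shift_month(event_day.year, event_day.month, -1)
    end_year, end_month = shift_month(event_day.year, event_day.month, 2)
    start = date(start_year, start_month, 1)
    # One day before the first of the month two months ahead.
    end = date(end_year, end_month, 1) - timedelta(days=1)
    return start, end


def build_url(page_id, start, end):
    """Splice the PageId and the date range into the crawl URL."""
    return URL_TEMPLATE.format(
        page_id=page_id, start=start.isoformat(), end=end.isoformat()
    )


start, end = three_month_window(date(2016, 1, 15))
url = build_url("1001", start, end)  # "1001" is a made-up PageId
```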
