webscraping-in-action's Introduction

WebScraping-in-Action

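  • Scrape product information from every category of the Ganji.com (赶集网) second-hand market: nearly 100,000 links, crawled with multiple processes and stored in MongoDB.
  • Regular expression essentials, based on material from Google for Education.
  • Scrape product information from Taobao: Taobao fills in its page content via AJAX. To handle this, start from the XHR requests and the JS, then extract the information with regular expressions, and use pandas to clean and reshape the data.
  • Scrape the foreign exchange rates of ** Bank with pandas: pandas only, pandas only, pandas only!
  • Handle Douyu's asynchronous loading with PhantomJS + Selenium: for pages that load their data through asynchronous AJAX requests, one approach is to build the page URLs by concatenation and collect them page by page; another is to use Selenium. PhantomJS turned out to be a bit slow, though.
  • Scrape 591 rental listings (AJAX): Selenium, MongoDB.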
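  • “轻想” (Qingxiang) is a blog-like product. I went from being a seed user when it launched early last year to being a heavy user now, and learned a lot along the way. Here I scrape all of its user data and analyze it. In the process I also worked through a few issues:
    • At first I used time.sleep to space out requests, so that overly frequent access would not trigger anti-scraping measures. When I switched to multiple processes, I found that time.sleep only pauses the thread that calls it and does not hold back the other processes. So how do you use multiple processes and still enforce an interval? Split the whole job into segments: crawl one segment with multiple processes, pause, then crawl the next segment with multiple processes.
    • IP proxy problems. Some free proxy IPs found online are dead and cannot complete a request. What I ran into was a connection that hung for a long time without raising any exception. The fix was the timeout parameter of requests: if a request times out, switch to another IP.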
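The URL-concatenation approach mentioned for AJAX pages can be sketched roughly as follows. This is only an illustration: the endpoint, the `page` parameter name, and the `build_page_urls` helper are hypothetical, and the real XHR URL has to be read from the browser's network panel for each site.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, first_page, last_page, page_param="page", extra_params=None):
    """Return one URL per page number in [first_page, last_page].

    `base_url` and `page_param` are placeholders; use whatever the site's
    own XHR requests actually look like.
    """
    urls = []
    for page in range(first_page, last_page + 1):
        # Copy the fixed query parameters, then add the page number.
        params = dict(extra_params or {})
        params[page_param] = page
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls
```

Calling `build_page_urls("https://example.com/api/rooms", 1, 3)` would yield three URLs ending in `?page=1`, `?page=2` and `?page=3`, which can then be fetched one by one.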
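The "crawl a segment, pause, crawl the next segment" idea could look something like the sketch below. It is not the project's actual code: `chunked`, `crawl_in_batches`, and their default values are names and numbers assumed here for illustration, and `fetch` stands for whatever per-URL function the crawler uses. The pause happens in the parent process between batches, so no worker is sending requests while it sleeps.

```python
import time
from multiprocessing import Pool


def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def crawl_in_batches(urls, fetch, batch_size=100, pause=5.0, processes=4):
    """Crawl `urls` batch by batch, sleeping in the parent between batches.

    `fetch` must be a top-level function so it can be pickled and sent
    to the worker processes.
    """
    results = []
    batches = list(chunked(urls, batch_size))
    with Pool(processes) as pool:
        for index, batch in enumerate(batches):
            # pool.map blocks until every URL in this batch is done.
            results.extend(pool.map(fetch, batch))
            # Sleep between batches, but not after the last one.
            if index < len(batches) - 1:
                time.sleep(pause)
    return results
```

On platforms that start workers with `spawn` (Windows, recent macOS), the call to `crawl_in_batches` should sit under an `if __name__ == "__main__":` guard.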
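Switching proxies on timeout might be written along these lines. This is a minimal sketch under assumptions: `fetch_with_proxy_rotation` and the proxy list are hypothetical, and only the `timeout` and `proxies` arguments of requests come from the library itself. Note that the timeout in requests limits the wait for the connection and for each chunk of the response, not the total download time.

```python
import requests


def fetch_with_proxy_rotation(url, proxy_urls, timeout=5.0, session=None):
    """Try each proxy in turn; return the first good response, or None.

    `proxy_urls` is a list such as ["http://10.0.0.1:8080", ...]
    (placeholder addresses, not real proxies).
    """
    if session is None:
        session = requests.Session()
    for proxy_url in proxy_urls:
        proxies = {"http": proxy_url, "https": proxy_url}
        try:
            # Without `timeout`, a dead proxy can hang indefinitely.
            response = session.get(url, proxies=proxies, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException:
            # Timeouts, connection errors and HTTP error statuses all
            # derive from RequestException: move on to the next proxy.
            continue
    return None
```

Returning `None` when every proxy fails leaves it to the caller to decide whether to retry later or skip the URL.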

webscraping-in-action's People

Contributors

zorro-lin-7
