

LagouSpider

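Crawls the company lists that can be retrieved for every city nationwide on Lagou, then collects the job postings of all those companies.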

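When the code was completed, it had collected 24,250 companies and 130,607 job postings. Lagou limits the number of companies shown on the site: at most 63 pages of the company list can be retrieved per city. Lagou also randomly fills cities that have no registered companies with companies from first-tier cities. As a result, the companies collected are not the full set of companies indexed by Lagou.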

All data is stored in MongoDB. If you are not using Anaconda, you need to install lxml and pandas yourself. In addition, pymongo and fake_useragent must be installed manually.
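Before running the spider, it can help to check that the listed libraries are importable. The following is a minimal sketch, not part of the original project; the helper name `missing_modules` is hypothetical, and the list simply repeats the import names mentioned above.

```python
from importlib.util import find_spec

# Import names of the libraries this README lists as requirements.
REQUIRED_MODULES = ["lxml", "pandas", "pymongo", "fake_useragent"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are available.")
```

Any name reported as missing still has to be installed with your usual package manager.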

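Lagou's anti-scraping measures are not especially hard to deal with. Apart from the city information, both company and job data come back as very tidy JSON.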
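Because the company and job data arrive as JSON, turning a response into flat records takes only a few lines. The sketch below is illustrative only: the payload and its field names (`result`, `positionId`, `positionName`, `city`) are assumptions made up for this example, not the actual Lagou response schema.

```python
import json

# Hypothetical payload: the real field names returned by Lagou may differ.
SAMPLE_RESPONSE = """
{
  "result": [
    {"positionId": 1, "positionName": "Data Engineer", "city": "Beijing"},
    {"positionId": 2, "positionName": "Backend Developer", "city": "Shanghai"}
  ]
}
"""


def parse_positions(raw_text):
    """Turn a JSON response body into a list of flat job records."""
    payload = json.loads(raw_text)
    records = []
    for item in payload.get("result", []):
        records.append({
            "id": item["positionId"],
            "name": item["positionName"],
            "city": item["city"],
        })
    return records
```

Flat dictionaries like these can be inserted into a MongoDB collection or loaded into a pandas DataFrame without further reshaping.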

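The data is collected in this order: get all cities nationwide --> get all companies in each city (up to 63 pages) --> get all job postings for each company --> deduplicate the job postings. Running the main() function creates the "cities", "company list", and "job postings" databases plus one more database for deduplicated links, and finally exports the job postings to a CSV file.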
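The flow described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's actual main() function: the three `fetch_*` callables are hypothetical stand-ins for the real network requests, results are kept in memory rather than in MongoDB, and the record fields (`id`, `company`, `title`) are invented for the example.

```python
import csv
import io

MAX_PAGES = 63  # per-city cap on company-list pages, as stated in this README


def crawl(fetch_cities, fetch_companies, fetch_jobs):
    """Run cities -> companies -> jobs and return jobs deduplicated by id.

    The three fetch_* callables are hypothetical stand-ins for the real
    requests; they are injected so the flow can run offline.
    """
    seen_job_ids = set()
    jobs = []
    for city in fetch_cities():
        for page in range(1, MAX_PAGES + 1):
            companies = fetch_companies(city, page)
            if not companies:
                break  # no more company pages for this city
            for company_id in companies:
                for job in fetch_jobs(company_id):
                    # The same company can show up under several cities,
                    # so jobs are deduplicated by their id.
                    if job["id"] not in seen_job_ids:
                        seen_job_ids.add(job["id"])
                        jobs.append(job)
    return jobs


def jobs_to_csv(jobs, fieldnames=("id", "company", "title")):
    """Serialize job records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(jobs)
    return buffer.getvalue()


# Tiny in-memory stand-ins so the sketch runs without network access.
DEMO_COMPANIES = {"CityA": [[101, 102]], "CityB": [[102]]}
DEMO_JOBS = {
    101: [{"id": 1, "company": 101, "title": "Engineer"}],
    102: [{"id": 2, "company": 102, "title": "Analyst"}],
}


def demo_fetch_cities():
    return list(DEMO_COMPANIES)


def demo_fetch_companies(city, page):
    pages = DEMO_COMPANIES[city]
    return pages[page - 1] if page <= len(pages) else []


def demo_fetch_jobs(company_id):
    return DEMO_JOBS.get(company_id, [])


DEMO_RESULT = crawl(demo_fetch_cities, demo_fetch_companies, demo_fetch_jobs)
```

In the demo data, company 102 is listed under both cities, which mirrors the filler behavior described earlier; its job is kept only once. Passing the fetchers in as arguments keeps the control flow testable without touching the site.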

This script is intended for learning purposes only and must not be used for anything else.
