
sinaweibocrawler's Introduction

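  • Project description
    • Crawls Sina Weibo user data to provide structured data for user profiling, sentiment analysis, relationship modeling, and similar tasks.
  • Third-party libraries the project depends on
    • HTTPClient
    • Jsoup: parses HTML
    • fastjson
  • Core program logic:
    • In useVersion2014/WeiboCrawler3.main(), the WeiboCrawler3 instance crawler calls crawl() to fetch the raw data and save it to files; the remaining code then parses those files on disk, extracting and transforming their contents into the final data.
    • crawl() is the function that performs the actual crawling
      • String html = crawler.getHTML(url) // fetch the page HTML for the given url
      • crawler.isVerification(html) // check whether a verification code (CAPTCHA) must be entered
      • If the connection times out, reconnect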
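The crawl() steps above (fetch, check for a verification page, reconnect on timeout) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `CrawlSketch` class, the `PageFetcher` interface, the `maxAttempts` parameter, and the marker string used to detect a verification page are all assumptions made for the example.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

/** Hypothetical sketch of a fetch loop with timeout retries; not the project's real classes. */
class CrawlSketch {

    /** Abstraction over the HTTP call so the loop can run without a network (assumed helper). */
    interface PageFetcher {
        String getHTML(String url) throws IOException;
    }

    /**
     * Assumed heuristic: a page counts as a verification page if it contains
     * the text "验证码" (verification code). The real check may differ.
     */
    static boolean isVerification(String html) {
        return html != null && html.contains("\u9a8c\u8bc1\u7801");
    }

    /** Fetches the page, retrying only when the connection times out. */
    static String crawl(PageFetcher fetcher, String url, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException lastTimeout = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                String html = fetcher.getHTML(url);
                if (isVerification(html)) {
                    // Retrying does not help here: a human (or a solver) has to step in.
                    throw new IllegalStateException("Verification code required for " + url);
                }
                return html;
            } catch (SocketTimeoutException e) {
                lastTimeout = e; // timed out: loop around and reconnect
            }
        }
        throw lastTimeout; // every attempt timed out
    }
}
```

Passing the fetcher in as an interface keeps the retry policy separate from the HTTP client, so the loop can be tried with a stub before it is wired to a real connection.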
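  • Sina Weibo simulated login logic: Sina.main()
    • Sina.login(username,password)
      • preLogin(encodeAccount(username),client); // Sina Weibo pre-login: obtains the public key used to encrypt the password
      • Encrypt the password
      • Log in
      • Get the result
    • SinaSSOEncoder handles password encryption for single sign-on; it is a dependency of Sina and of no other class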
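The first two login steps (encoding the account name and encrypting the password with the key obtained at pre-login) might look something like the sketch below. It rests on several assumptions that the README does not confirm: that encodeAccount is a URL-encode followed by Base64, that the public key arrives as a hexadecimal modulus and exponent, and that the ciphertext is sent as a hex string. The exact plaintext layout the server expects is not described here, so `encryptPassword` simply encrypts whatever message it is given. `LoginEncodingSketch` is a hypothetical name, not the project's SinaSSOEncoder.

```java
import java.math.BigInteger;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyFactory;
import java.security.PublicKey;
import java.security.spec.RSAPublicKeySpec;
import java.util.Base64;
import javax.crypto.Cipher;

/** Hypothetical sketch of the account-encoding and password-encryption steps. */
class LoginEncodingSketch {

    /** Assumed encoding: URL-encode the user name, then Base64 the result (needs Java 10+). */
    static String encodeAccount(String username) {
        String urlEncoded = URLEncoder.encode(username, StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(urlEncoded.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Encrypts a message with an RSA public key given as hex strings
     * (assumed format of the pre-login response) and returns lowercase hex.
     */
    static String encryptPassword(String modulusHex, String exponentHex, String message)
            throws GeneralSecurityException {
        RSAPublicKeySpec spec = new RSAPublicKeySpec(
                new BigInteger(modulusHex, 16), new BigInteger(exponentHex, 16));
        PublicKey key = KeyFactory.getInstance("RSA").generatePublic(spec);
        Cipher cipher = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] encrypted = cipher.doFinal(message.getBytes(StandardCharsets.UTF_8));
        return toHex(encrypted);
    }

    /** Converts bytes to a lowercase hex string, two digits per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

Because PKCS#1 padding is randomized, encrypting the same password twice yields different ciphertexts; that is expected and does not indicate a bug.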
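  • Notes on the other class files
    • WeiboUser: an object-oriented model of a Sina Weibo user's basic information
    • Reparser: re-parses local HTML files; depends on HTMLParser, has its own main function, and is not used by any other class
    • JWindowsFrame: a window (GUI) program that provides safe multi-threaded crawling
    • ProxyIP
      • Main function excute()
      • String ipLibURL = "http://www.xici.net.co/"; // proxy IP library
      • allIPs = getAllUnverifiedProxyIPs(ipLibURL, JTARunInfo);
      • validIPs = getValidProxyIPs(allIPs);
      • plainIPs = new ProxyIP().classifyIPs(validIPs, plainIPsPath);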
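The validation step in the ProxyIP flow above (turning unverified proxies into a list of usable ones) could be approached as in the following sketch. It is an illustration under assumptions: candidates are taken to be "host:port" strings, the check is supplied as a predicate, and `ProxyFilterSketch`, `parse`, and `isReachable` are hypothetical names rather than the project's own methods.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of filtering a scraped proxy list down to usable entries. */
class ProxyFilterSketch {

    /** Parses "host:port" (assumed input format) without doing a DNS lookup. */
    static InetSocketAddress parse(String hostPort) {
        int colon = hostPort.lastIndexOf(':');
        if (colon <= 0 || colon == hostPort.length() - 1) {
            throw new IllegalArgumentException("Expected host:port but got " + hostPort);
        }
        String host = hostPort.substring(0, colon);
        int port = Integer.parseInt(hostPort.substring(colon + 1));
        return InetSocketAddress.createUnresolved(host, port);
    }

    /** Simple liveness check: can a TCP connection be opened within the timeout? */
    static boolean isReachable(InetSocketAddress proxy, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(proxy.getHostString(), proxy.getPort()), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Keeps the well-formed candidates that pass the supplied check; malformed entries are skipped. */
    static List<InetSocketAddress> getValidProxyIPs(List<String> candidates,
                                                    Predicate<InetSocketAddress> check) {
        List<InetSocketAddress> valid = new ArrayList<>();
        for (String candidate : candidates) {
            try {
                InetSocketAddress address = parse(candidate.trim());
                if (check.test(address)) {
                    valid.add(address);
                }
            } catch (IllegalArgumentException e) {
                // Bad port number or missing colon: ignore this entry.
            }
        }
        return valid;
    }
}
```

A call such as `getValidProxyIPs(allIPs, p -> ProxyFilterSketch.isReachable(p, 3000))` would apply the TCP check. An open port only shows that something is listening; confirming that the proxy actually forwards requests would need a real request sent through it.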
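  • Python crawlers vs. Java crawlers
    • Python's HTTP libraries are richer: what takes dozens of lines of code in Java can often be done in a dozen or so lines of Python.
  • Safe and unsafe ways to open and store a web page (Java/Python); see also my utility library.
  • HTTP packet capture and analysis of the simulated Sina Weibo login
  • Basic crawler skills practice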
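For the item above on safe and unsafe ways to open and store a web page, the README does not spell out what "safe" means. One plausible reading, assumed for this sketch, is that timeouts are set and the stream is always closed, even when an error occurs. `PageSaverSketch` and `savePage` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of a "safe" download: bounded waits and guaranteed stream cleanup. */
class PageSaverSketch {

    /** Downloads the resource at url into target and returns the number of bytes written. */
    static long savePage(URL url, Path target, int timeoutMillis) throws IOException {
        URLConnection connection = url.openConnection();
        connection.setConnectTimeout(timeoutMillis); // do not hang forever while connecting
        connection.setReadTimeout(timeoutMillis);    // do not hang forever while reading
        // try-with-resources closes the stream whether the copy succeeds or throws.
        try (InputStream in = connection.getInputStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

The "unsafe" counterpart would typically open the stream without timeouts and close it only on the success path, which leaks the connection when a read fails halfway through.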

- For questions, please contact: [email protected]

sinaweibocrawler's People

Contributors

cld378632668