WechatSogouJS

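A JavaScript crawler for WeChat official accounts, built on Sogou WeChat search. At present this project only collects (temporary) URLs, so fetch the content by other means before the URLs expire. The Python crawler Chyroc/WechatSogou is recommended for that further retrieval.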

Related links

Thanks to @Chyroc for the WechatSogou project (Chyroc/WechatSogou) and for issue Chyroc/WechatSogou#53, which this project tries to solve.

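Progress and direction

  1. On the framework side, the Chrome Ext and PhantomJS versions are going to be dropped. The Electron version will receive basic maintenance but no new development. All new development will move to NightmareJS.

  2. On the crawl-target side, the PC version of weixin.sogou.com will no longer be the main focus. Development will center on the WAP version, with the PC version as a secondary option.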

Usage of the Electron version:

  1. Create a config.js file:
// this is your config.js file
var settings = {
    username: 'your_Ruokuai_username',
    password: 'your_Ruokuai_password',
    softid: 'your_Ruokuai_softid',
    softkey: 'your_Ruokuai_softkey'
};

exports.settings = settings;
  2. Save config.js in the WechatSogouJS/electron folder.

  3. Enter the electron folder and install the dependencies with npm:

cd electron
npm install
  4. Start the browser:
npm start
  5. Type in your query keyword and click "Search Articles".
  6. In the console on the left-hand side, type:
spider_start(n) // replace 'n' with the number of pages you want; do not type the letter 'n'
  7. After the spider finishes its job, type in the same console:
saveToFile()
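
Before starting the browser, it can help to confirm that config.js no longer contains the placeholder values. The sketch below is a hypothetical helper and not part of WechatSogouJS: the function name `missingSettings` and the rule that a value starting with `your_Ruokuai_` counts as unfilled are assumptions made for illustration.

```javascript
// Hypothetical helper, not part of WechatSogouJS: reports which of the four
// Ruokuai fields expected in config.js are absent, empty, or still placeholders.
const REQUIRED_KEYS = ['username', 'password', 'softid', 'softkey'];

function missingSettings(settings) {
    return REQUIRED_KEYS.filter(function (key) {
        const value = settings ? settings[key] : undefined;
        return typeof value !== 'string' ||
            value.trim() === '' ||
            value.indexOf('your_Ruokuai_') === 0; // assumed placeholder prefix
    });
}

// In the electron folder, the real settings could be loaded with:
//   const settings = require('./config.js').settings;
// A literal object is used here so the sketch runs on its own.
const example = {
    username: 'alice',
    password: 'secret',
    softid: '',
    softkey: 'your_Ruokuai_softkey'
};

console.log(missingSettings(example));
```

An empty array from `missingSettings` means all four fields look filled in; it does not verify that the credentials are accepted by Ruokuai.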

Todo:

  • weixin.sogou.com PC web
    • article search
    • account search
    • pass the Captcha
    • manual login with QQ / wechat QR
  • weixin.sogou.com WAP web
    • article search
    • account search
    • pass the Captcha
    • auto login with QQ
    • clean cookie when blocked by antispider
  • mp.weixin.qq.com TEMP url
    • grab the article page
    • grab the account page
    • pass the Captcha

Features of the Different Approaches (current versions)

  • Chrome Ext
    • easy to code
    • easy to install
    • manual login with QQ / wechat QR
    • manual pass the Captcha
    • (warning: UNSTABLE) auto pass the Captcha
  • Electron
    • hard to code
    • easy to install
    • manual login with QQ / wechat QR
    • (info: STABLE) auto pass the Captcha
  • RECOMMENDED: Nightmare.js
    • easy to code
    • easy to install
    • manual login with QQ / wechat QR
    • (info: STABLE) auto pass the Captcha
    • IMPORTANT: high-level coding
    • IMPORTANT: you can choose headless or non-headless mode
  • DEPRECATED: Phantom.js
    • Headless, not recommended

Contributors

ax4, willzzhu

wechatsogoujs's Issues

[known issue] Redundant data in the search results

Redundant data may occur in the search results, for the following reasons:

  • Search requests are sent too often
  • weixin.sogou sometimes returns a redundant response

To solve:

  1. Use a statistics package to measure the redundancy rate
  2. Adjust the crawler strategy according to the redundancy rate
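
The two steps above could be sketched as follows, using plain JavaScript instead of a statistics package. Everything here is an assumption made for illustration: the function names `redundancyRate` and `nextDelay`, the `account` and `title` fields on each result, and the 20% / 5% thresholds and delay bounds are not taken from the project.

```javascript
// Hypothetical sketch: measure how many search results are repeats, then
// adjust the delay between search requests based on that rate.

function redundancyRate(results) {
    // Each result is assumed to have `account` and `title` fields. The URL is
    // not used as the key, on the assumption that temporary URLs for the same
    // article can differ between responses.
    if (results.length === 0) return 0;
    const seen = new Set();
    let duplicates = 0;
    for (const item of results) {
        const key = item.account + '\u0000' + item.title;
        if (seen.has(key)) duplicates += 1;
        else seen.add(key);
    }
    return duplicates / results.length;
}

function nextDelay(currentDelayMs, rate) {
    // Assumed policy: double the delay when more than 20% of results are
    // redundant, shorten it by 20% when fewer than 5% are, and keep the
    // result between 1 and 60 seconds.
    let delay = currentDelayMs;
    if (rate > 0.2) delay = currentDelayMs * 2;
    else if (rate < 0.05) delay = currentDelayMs * 0.8;
    return Math.min(60000, Math.max(1000, Math.round(delay)));
}

const sample = [
    { account: 'a', title: 'x' },
    { account: 'a', title: 'x' },
    { account: 'b', title: 'y' },
    { account: 'c', title: 'z' }
];
const rate = redundancyRate(sample); // one repeat among four results
console.log(rate, nextDelay(5000, rate));
```

This only addresses the first cause (requests sent too often). Redundant responses from weixin.sogou itself would still need to be filtered out after collection, for example with the same `account` plus `title` key.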
