
interactive-crawl's Introduction

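InteractiveCrawler - an interactive web site crawler

Interactive Crawler is an interactive crawling tool. Besides collecting the hyperlinks in a page, it also captures the requests fired by the page's interaction events. It supports proxy settings and can be combined with other vulnerability scanning tools. Triggering the following events is currently supported: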

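  • Form submissions;
  • onclick events on any tag;
  • JavaScript code inside a tags;
  • Hidden interactions such as drop-down lists;
  • Buttons that have no onclick attribute;
  • Text entry in input boxes outside a form;
  • Others.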
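To make the categories above concrete, here is a minimal sketch that scans static HTML for elements of each kind. It is an illustration only, not the tool's implementation: the real crawler drives a browser and triggers events at runtime, and the names `InteractionFinder` and `find_targets` are hypothetical.

```python
from html.parser import HTMLParser


class InteractionFinder(HTMLParser):
    """Collect elements that an interactive crawler might want to trigger."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0   # greater than 0 while inside a <form>
        self.targets = []     # list of (kind, tag) tuples, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.form_depth += 1
            self.targets.append(("form", tag))
        if "onclick" in attrs:
            self.targets.append(("onclick", tag))
        elif tag == "button":
            self.targets.append(("button-without-onclick", tag))
        href = attrs.get("href") or ""
        if tag == "a" and href.strip().lower().startswith("javascript:"):
            self.targets.append(("javascript-href", tag))
        if tag == "select":
            self.targets.append(("dropdown", tag))
        if tag in ("input", "textarea") and self.form_depth == 0:
            self.targets.append(("input-outside-form", tag))

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


def find_targets(html):
    """Return the (kind, tag) pairs found in an HTML string."""
    finder = InteractionFinder()
    finder.feed(html)
    return finder.targets
```

A static scan like this misses elements created by scripts after load, which is one reason a browser-driven approach is used for the actual crawling.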

Installation

# Download the project
git clone https://github.com/smlins/Interactive-Crawl.git
cd Interactive-Crawl
# Install the Python dependencies
pip install -r requirements.txt
# Crawl http://example.com
python3 crawl.py http://example.com

Options

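Optional arguments:
  -h, --help             Show this help message and exit
  -t, --timeout          Timeout for all page requests; defaults to 5 seconds
  --cookie               Set the Cookie for HTTP requests
  --proxy-server         Set the crawler's proxy address; the crawler forwards all requests to that server port
  --headless             Run the browser in headless mode; with this option the browser window is not shown
  --exclude-links        Do not crawl links that contain this keyword; regular expressions are supported
  --crawl-link-type      Network resource types the crawler collects; defaults to xhr, fetch, document
  --prohibit-load-type   Network resource types that are blocked from loading; defaults to image, media, font, manifest. This option has the highest priority
  --crawl-external-links Also crawl external links (not recommended); by default only links on the same site are crawled
  --intercept-request    Enable HTTP request interception; --prohibit-load-type only takes effect when this option is enabled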

Note: if prohibit-load-type includes the document type, the document type in crawl-link-type has no effect.
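The interaction between these three options can be sketched as a small function. This is one reading of the rules stated above, not code from the project: the function name `effective_types` and its parameters are hypothetical, and the defaults are copied from the option descriptions.

```python
DEFAULT_CRAWL_TYPES = {"xhr", "fetch", "document"}
DEFAULT_PROHIBIT_TYPES = {"image", "media", "font", "manifest"}


def effective_types(crawl_types=None, prohibit_types=None, intercept_request=False):
    """Return (blocked, collected) resource-type sets under the documented rules."""
    crawl = set(DEFAULT_CRAWL_TYPES if crawl_types is None else crawl_types)
    prohibit = set(DEFAULT_PROHIBIT_TYPES if prohibit_types is None else prohibit_types)
    # --prohibit-load-type only takes effect when --intercept-request is enabled.
    blocked = prohibit if intercept_request else set()
    # Blocking has the highest priority: a blocked type can never be collected.
    collected = crawl - blocked
    return blocked, collected
```

For example, with interception enabled and `document` in the prohibited set, only `xhr` and `fetch` remain in the collected set, which matches the note above.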

Examples

# Default start-up
python3 crawl.py http://example.com

# Do not crawl links that contain the following keywords; regular expressions are supported
python3 crawl.py http://example.com --exclude-links "logout|Logout"

# Set a Cookie; every request the crawler sends will carry it
python3 crawl.py http://example.com --cookie "PHPSESSID=gbo3fci62fpig5vp4fq6a950h2; security=impossible"

# Set a proxy
python3 crawl.py http://example.com --proxy-server "127.0.0.1:8000"
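As a rough illustration of how link filtering along the lines of --exclude-links and --crawl-external-links could work, the sketch below applies a regular expression and a same-host check to each discovered URL. It is not taken from the project: `should_crawl` is a hypothetical helper, and treating "same site" as "same hostname" is an assumption.

```python
import re
from urllib.parse import urlsplit


def should_crawl(url, start_url, exclude_pattern=None, crawl_external=False):
    """Decide whether a discovered URL should be queued for crawling."""
    # Skip any link that matches the exclusion regex anywhere in the URL.
    if exclude_pattern and re.search(exclude_pattern, url):
        return False
    # Unless external crawling is enabled, stay on the start URL's host.
    if not crawl_external:
        return urlsplit(url).hostname == urlsplit(start_url).hostname
    return True
```

With the pattern `"logout|Logout"` from the example above, `http://example.com/logout` is skipped while `http://example.com/profile` is kept, so an authenticated session is not ended by the crawler following a logout link.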

https://github.com/ITNAX/Interactive-Crawl/blob/main/image/demo1.gif

Todo

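  • Performance: make sensible use of coroutines to speed up crawling, and add a request-frequency parameter;
  • Fix pyppeteer's incorrect handling of redirected requests while intercepting requests; because of this issue, control over which resource types are loaded is temporarily disabled;
  • Add a script-loading module (possibly implemented with decorators) for custom actions, for example brute-forcing a recognized login page when a matching script exists; later it could also be used for web site fingerprinting.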
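The first item could take a shape like the following sketch, which bounds concurrency with a semaphore and optionally pauses after each request. It is only one possible design, not the planned implementation: `crawl_all`, `max_concurrency`, `delay`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real page visit.

```python
import asyncio


async def crawl_all(urls, fetch, max_concurrency=5, delay=0.0):
    """Run fetch(url) for every URL with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with semaphore:
            result = await fetch(url)
            # Optional pause before releasing the slot, to cap request frequency.
            if delay:
                await asyncio.sleep(delay)
            return result

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(u) for u in urls))


async def fake_fetch(url):
    """Stand-in for a real page visit; it only echoes the URL."""
    await asyncio.sleep(0)
    return f"visited {url}"
```

A call such as `asyncio.run(crawl_all(urls, fake_fetch, max_concurrency=2, delay=0.5))` would visit at most two pages at a time and hold each slot for half a second after its request completes.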
