leetsun / interactive-crawl

This project forked from smlins/interactive-crawl

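Interactive Crawl is an interactive crawler. It triggers the click events on a target website, submits forms automatically, and collects hyperlinks and asynchronous requests.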

License: GNU General Public License v3.0

Python 100.00%

interactive-crawl's Introduction

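InteractiveCrawler - Interactive Web Crawler

Interactive Crawler is an interactive crawling tool. Besides collecting the hyperlinks in a page, it also captures the requests that are issued when the page's interactive events are triggered. It supports proxy settings and can be combined with other vulnerability scanners. The following events can currently be triggered: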

Form submission;

onclick events on any tag;

JavaScript code inside a tags;

Hidden interactions such as drop-down lists;

Buttons without an onclick attribute;

Input boxes outside of forms;

Others;
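
The event categories above can be illustrated with a small static sketch. It is not how the crawler itself works (the tool drives a real browser and triggers the events), and the class name `InteractionFinder`, the helper `find_candidates`, and the category labels are hypothetical. The sketch only uses Python's standard `html.parser` to show which elements in a page would fall into each category.

```python
from html.parser import HTMLParser


class InteractionFinder(HTMLParser):
    """Collect elements that look like interaction targets (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.candidates = []  # list of (category, tag) tuples, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Attribute values can be None (e.g. <a href>), so fall back to "".
        href = (attrs.get("href") or "").strip().lower()
        if tag == "form":
            self.candidates.append(("form", tag))
        elif "onclick" in attrs:
            self.candidates.append(("onclick", tag))
        elif tag == "a" and href.startswith("javascript:"):
            self.candidates.append(("javascript-href", tag))
        elif tag == "button":
            self.candidates.append(("plain-button", tag))
        elif tag == "select":
            self.candidates.append(("dropdown", tag))
        elif tag in ("input", "textarea"):
            # Simplification: this sketch does not check whether the
            # input is nested inside a form.
            self.candidates.append(("input", tag))


def find_candidates(html):
    """Return the (category, tag) pairs found in an HTML string."""
    finder = InteractionFinder()
    finder.feed(html)
    return finder.candidates


sample = """
<div onclick="load()">More</div>
<a href="javascript:void(0)">Open</a>
<button>Save</button>
<select><option>1</option></select>
"""
print(find_candidates(sample))
```

A static scan like this cannot see handlers attached from scripts at runtime, which is one reason a browser-driven approach is needed for the real crawl.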

Installation

# Download the project
git clone https://github.com/smlins/Interactive-Crawl.git
cd Interactive-Crawl
# Install the Python dependencies
pip install -r requirements.txt
# Crawl http://example.com
python crawler.py http://example.com

Options

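Optional arguments:
  -h, --help             Show this help message and exit
  -t, --timeout          Timeout for all page requests; the default is 5 seconds
  --cookie               Cookie to send with HTTP requests
  --proxy-server         Proxy address for the crawler; all requests are forwarded to that server port
  --headless             Run the browser in headless mode; the browser window is not shown
  --exclude-links        Skip links that contain this keyword; regular expressions are supported
  --crawl-link-type      Resource types to crawl; the default is xhr, fetch, document
  --prohibit-load-type   Resource types that must not be loaded; the default is image, media, font, manifest. This option has the highest priority
  --crawl-external-links Also crawl external links (not recommended); by default only links on the same site are crawled
  --intercept-request    Enable HTTP request interception; --prohibit-load-type takes effect only when this option is enabled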

Note: if prohibit-load-type includes the document type, the document type in crawl-link-type has no effect.
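
The interaction between these options can be sketched as a small set computation. This is an assumption about the behavior, not the project's actual code: the function name `effective_crawl_types` is hypothetical, and the sketch generalizes the note about `document` to every resource type, treating `--prohibit-load-type` as overriding `--crawl-link-type` only when `--intercept-request` is enabled.

```python
# Defaults taken from the option descriptions above.
DEFAULT_CRAWL_TYPES = {"xhr", "fetch", "document"}
DEFAULT_PROHIBITED_TYPES = {"image", "media", "font", "manifest"}


def effective_crawl_types(crawl_types, prohibited_types, intercept_request):
    """Return the resource types that would still be crawled (hypothetical helper)."""
    if not intercept_request:
        # Without --intercept-request the prohibit list is ignored.
        return set(crawl_types)
    # The prohibit list has the highest priority, so it removes
    # any overlapping type from the crawl list.
    return set(crawl_types) - set(prohibited_types)


# document is prohibited, so it drops out of the crawl list.
print(sorted(effective_crawl_types(DEFAULT_CRAWL_TYPES, {"image", "document"}, True)))
# Interception is off, so the prohibit list has no effect.
print(sorted(effective_crawl_types(DEFAULT_CRAWL_TYPES, {"image", "document"}, False)))
```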

Examples

# Default invocation
python3 crawl.py http://example.com

# Skip links that contain the following keywords; regular expressions are supported
python3 crawl.py http://example.com --exclude-links "logout|Logout"

# Set a Cookie; every request sent by the crawler will carry it
python3 crawl.py http://example.com --cookie "PHPSESSID=gbo3fci62fpig5vp4fq6a950h2; security=impossible"

# Set a proxy
python3 crawl.py http://example.com --proxy-server "127.0.0.1:8000"
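
To make the `--exclude-links` and `--cookie` values above more concrete, here is a minimal sketch of how such strings could be interpreted. Both helpers, `is_excluded` and `parse_cookie_string`, are hypothetical names and not part of the project. The sketch assumes the exclude pattern is matched anywhere in the URL and that the cookie string uses the usual `name=value; name=value` layout.

```python
import re


def is_excluded(url, pattern):
    """Return True if the pattern matches anywhere in the URL (assumed semantics)."""
    return re.search(pattern, url) is not None


def parse_cookie_string(cookie):
    """Split 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    pairs = {}
    for part in cookie.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep:  # ignore fragments that have no '='
            pairs[name] = value
    return pairs


print(is_excluded("http://example.com/user/logout", "logout|Logout"))
print(is_excluded("http://example.com/profile", "logout|Logout"))
print(parse_cookie_string("PHPSESSID=gbo3fci62fpig5vp4fq6a950h2; security=impossible"))
```

Excluding logout links is useful together with `--cookie`, since following such a link would end the session that the cookie carries.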