中文 | English
SDK for Crawlab, including SDKs for different programming languages such as Python, Node.js and Java, as well as a CLI tool written in Python.
Home Page: https://crawlab.cn
License: BSD 3-Clause "New" or "Revised" License
Version: 0.6.1
It worked fine at first. After setting up a scheduled task and checking a few days later, the tasks show result counts, but opening a task shows no results, and nothing new appears in the database. URL deduplication filtering is enabled.
The best practice for Python dependency specifications is to use >= rather than ==. The current configuration forces every SDK user to install redundant, outdated dependency packages.
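The pinned requirements could be relaxed to lower bounds. A hypothetical sketch of what the install_requires list in setup.py might look like (the version floors below are illustrative, not the project's actual choices):

```python
# Hypothetical sketch of relaxed dependency pins for crawlab-sdk's setup.py.
# Lower bounds (>=) let pip pick releases compatible with the rest of the
# user's environment instead of forcing exact, outdated versions.
install_requires = [
    "requests>=2.22.0",
    "pymongo>=3.10",
    "scrapy>=2.2",
    "click>=7.0",
    "prettytable>=0.7.2",
    "pathspec>=0.8.0",
]
```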
ERROR: crawlab-sdk 0.3.3 requires Click==7.0, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires elasticsearch==7.8.0, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires kafka-python==2.0.1, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires pathspec==0.8.0, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires prettytable==0.7.2, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires psycopg2-binary==2.8.5, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires pymysql==0.9.3, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires requests==2.22.0, which is not installed.
ERROR: crawlab-sdk 0.3.3 has requirement pymongo==3.10.1, but you'll have pymongo 3.11.3 which is incompatible.
ERROR: crawlab-sdk 0.3.3 has requirement scrapy==2.2.0, but you'll have scrapy 2.5.0 which is incompatible.
A bug:
File "/usr/local/lib/python3.8/dist-packages/crawlab/core/client.py", line 99, in update_token
print('error: ' + data.get('error'))
TypeError: can only concatenate str (not "ConnectionError") to str
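The crash happens because `data.get('error')` can hold an exception object (here a `ConnectionError`) rather than a string, and `str + exception` concatenation raises. A minimal sketch of a fix, coercing with `str()` (the function name below is a stand-in for the SDK's `update_token` error path, not the actual code):

```python
# Sketch of the fix: str() accepts both strings and exception objects,
# so the concatenation no longer raises TypeError when the error value
# is a ConnectionError instance.
def report_error(data):
    err = data.get('error')
    if err is not None:
        print('error: ' + str(err))

# works for both shapes of payload
report_error({'error': 'token expired'})
report_error({'error': ConnectionError('connection refused')})
```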
Version 0.6.0-3
1. On Windows, using the upload command produces paths of the form /xxx/xxx\xxx\xxx. The upload succeeds, but everything ends up at the same level.
2. On Mac, uploading the current directory with `upload .` succeeds, but the "." is stripped from every file name (e.g. scrapy.cfg becomes scrapycfg after upload).
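The Windows symptom looks like backslash separators leaking into the remote paths. A stdlib-only sketch of normalizing paths to forward slashes before upload (the function name is hypothetical, not the actual CLI code; the Mac dot-stripping issue would need a separate fix in the CLI's filename handling):

```python
import os

def to_remote_path(local_path):
    """Normalize a local relative path to the /a/b/c form the server expects.

    On Windows os.sep is '\\', so joined paths come out as xxx\\yyy\\zzz;
    replacing both separators keeps behavior identical on every OS.
    """
    return local_path.replace(os.sep, '/').replace('\\', '/')
```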
Hello Crawlab team,
I'm using Crawlab to deploy my Scrapy spiders, and when I try to download the results CSV it takes a long time, sometimes more than 15 minutes. Is there a command in the CLI SDK to download the data directly?
Thank you.
When trying to upload a Scrapy project, I get this error:
/.git/logs/refs/remotes/origin/HEAD
/.git/hooks/commit-msg.sample
/.git/hooks/pre-rebase.sample
/.git/hooks/pre-commit.sample
/.git/hooks/applypatch-msg.sample
/.git/hooks/fsmonitor-watchman.sample
/.git/hooks/pre-receive.sample
/.git/hooks/prepare-commit-msg.sample
/.git/hooks/post-update.sample
/.git/hooks/pre-merge-commit.sample
/.git/hooks/pre-applypatch.sample
/.git/hooks/pre-push.sample
/.git/hooks/update.sample
/.git/hooks/push-to-checkout.sample
/.git/refs/heads/master
/.git/refs/remotes/origin/HEAD
error: invalid character '-' in numeric literal
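The `.git` internals should never reach the server; the SDK already declares `pathspec` for gitignore-style matching, but even a stdlib-only filter would avoid this class of error. A sketch (the function and exclude list are assumptions, not the actual CLI code):

```python
# Skip VCS and cache directories when collecting files for upload.
EXCLUDED_DIRS = {'.git', '.svn', '__pycache__', 'node_modules'}

def should_upload(rel_path):
    """Return True if rel_path contains no excluded directory component."""
    parts = rel_path.replace('\\', '/').strip('/').split('/')
    return not EXCLUDED_DIRS.intersection(parts)
```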
CrawlabPipeline does not keep the client instance, so it has to create a new connection every time, which is inefficient. Should we switch to a more efficient approach?
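One option is to open the client once per spider run and reuse it for every item. A sketch with a client factory injected so the connection logic stays out of the pipeline (the class and method names are hypothetical, not the actual SDK API):

```python
class ReusingCrawlabPipeline:
    """Scrapy pipeline that opens one client per run instead of per item."""

    def __init__(self, client_factory):
        self._factory = client_factory  # e.g. a function returning the SDK client
        self.client = None

    def open_spider(self, spider):
        self.client = self._factory()   # single connection for the whole run

    def process_item(self, item, spider):
        self.client.save_item(dict(item))  # reuse the connection, no reconnect
        return item

    def close_spider(self, spider):
        self.client.close()             # release the connection exactly once
```

Scrapy calls `open_spider`/`close_spider` once per crawl, so the connection lifetime matches the run rather than the item count.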
As the title says.
Storage backend: MySQL.
Using save_item to write to the database, I found in testing that connections stay occupied even after a spider finishes. When several spiders run at once, the database becomes unreachable. Investigation showed the spiders' connections are never released and sit in Sleep state; a single spider can hold over a thousand connections, and the number held seems related to the amount of data scraped. This is a serious bug 😢. I'll retest with batch writes; please respond as soon as possible.
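The Sleep-state pile-up suggests a connection is opened per save and never closed. A sketch of the likely remedy: a context manager that guarantees close (and commit) regardless of errors. The `connect` callable would be `pymysql.connect` in practice; it is passed in here so the pattern stays testable:

```python
import contextlib

@contextlib.contextmanager
def db_conn(connect, **dsn):
    """Yield a DB connection, committing on success and always closing.

    Closing in `finally` prevents the leaked Sleep-state connections
    described above, even when the spider crashes mid-write.
    """
    conn = connect(**dsn)
    try:
        yield conn
        conn.commit()
    finally:
        conn.close()
```

Usage would look like `with db_conn(pymysql.connect, host='localhost', user='crawlab') as conn: ...` around each batch write.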
The current dependency setup has a problem:
replace (
github.com/crawlab-team/crawlab-grpc => /Users/marvzhang/projects/crawlab-team/crawlab-grpc
github.com/crawlab-team/go-trace => /Users/marvzhang/projects/crawlab-team/go-trace
)
Both go-trace and crawlab-grpc now have GitHub repositories, so should these replace directives be changed?
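Since both modules are published now, the local replace directives could presumably be dropped in favor of ordinary require entries in go.mod; a sketch (the version tags below are placeholders, not real releases):

```go
require (
    github.com/crawlab-team/crawlab-grpc vX.Y.Z // placeholder tag
    github.com/crawlab-team/go-trace vX.Y.Z // placeholder tag
)
```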
ERROR crawlab-sdk 0.1.7 has requirement requests==2.22.0, but you'll have requests 2.24.0 which is incompatible.
ERROR crawlab-sdk 0.1.7 has requirement scrapy==1.8.0, but you'll have scrapy 2.2.0 which is incompatible.
Maybe it's time to upgrade the SDK?
When working with the SDK, I found it inconvenient for scheduling and deploying multiple spiders, so I wondered if it could be designed to look like the following:
.
├── packages
│   ├── js_spiders
│   │   ├── js_spider_1
│   │   │   └── index.js
│   │   ├── js_spider_2
│   │   │   └── index.js
│   │   ├── package.json
│   │   └── ...
│   └── py_spiders
│       ├── py_spider_1
│       │   └── main.py
│       ├── py_spider_2
│       │   └── main.py
│       ├── setup.py
│       └── ...
├── crawlab.json
└── makefile
crawlab.json
{
  "spiders": [
    {
      "path": "packages/js_spiders",
      "exclude_path": "node_modules",
      "name": "js spiders",
      "description": "js spiders",
      "cmd": "node",
      "schedules": [
        {
          "name": "js spider 1 cron",
          "cron": "* 1 * * *",
          "command": "node js_spider_1/index.js",
          "param": "",
          "mode": "random",
          "description": "js spider 1 cron",
          "enabled": true
        },
        {
          "name": "js spider 2 cron",
          "cron": "* 2 * * *",
          "command": "node js_spider_2/index.js",
          "param": "",
          "mode": "random",
          "description": "js spider 2 cron",
          "enabled": true
        }
      ]
    },
    {
      "path": "packages/py_spiders",
      "exclude_path": ".venv",
      "name": "py spiders",
      "description": "py spiders",
      "cmd": "python",
      "schedules": [
        {
          "name": "py spider 1 cron",
          "cron": "* 1 * * *",
          "command": "python py_spider_1/main.py",
          "param": "",
          "mode": "random",
          "description": "py spider 1 cron",
          "enabled": true
        },
        {
          "name": "py spider 2 cron",
          "cron": "* 2 * * *",
          "command": "python py_spider_2/main.py",
          "param": "",
          "mode": "random",
          "description": "py spider 2 cron",
          "enabled": true
        }
      ]
    }
  ]
}
I can help implement this if you think it is possible
@tikazyq