
crawlab-sdk's Introduction

Crawlab SDK


SDKs for Crawlab in several programming languages, such as Python, Node.js, and Java, plus a CLI tool written in Python.

crawlab-sdk's People

Contributors

dependabot[bot], hantmac, kuaipao, ma-pony, nieweiming, tikazyq, twinsant, zkqiang

crawlab-sdk's Issues

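Python SDK dependency versions are too old

The best practice for declaring Python dependencies is to use >= rather than ==. The current configuration forces every SDK user to install redundant, outdated dependency packages.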

ERROR: crawlab-sdk 0.3.3 requires Click==7.0, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires elasticsearch==7.8.0, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires kafka-python==2.0.1, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires pathspec==0.8.0, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires prettytable==0.7.2, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires psycopg2-binary==2.8.5, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires pymysql==0.9.3, which is not installed.
ERROR: crawlab-sdk 0.3.3 requires requests==2.22.0, which is not installed.
ERROR: crawlab-sdk 0.3.3 has requirement pymongo==3.10.1, but you'll have pymongo 3.11.3 which is incompatible.
ERROR: crawlab-sdk 0.3.3 has requirement scrapy==2.2.0, but you'll have scrapy 2.5.0 which is incompatible.
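
A minimal sketch of the change requested above, assuming the SDK declares its dependencies as plain `name==version` strings. The helper name `relax_pins` is hypothetical and not part of the SDK; it only illustrates turning exact pins into lower bounds.

```python
import re

# Matches an exact pin such as "Click==7.0" and captures the name, the
# version, and anything that follows (for example an environment marker).
_PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s;]+)(.*)$")


def relax_pins(requirements):
    """Turn exact pins (name==X) into lower bounds (name>=X).

    Lines that are not exact pins are returned unchanged.
    """
    relaxed = []
    for line in requirements:
        match = _PIN.match(line)
        if match:
            name, version, rest = match.groups()
            relaxed.append(f"{name}>={version}{rest}")
        else:
            relaxed.append(line)
    return relaxed


# Pins taken from the error output above.
pinned = ["Click==7.0", "requests==2.22.0", "pymongo==3.10.1", "scrapy==2.2.0"]
print(relax_pins(pinned))
```

Lower bounds let pip keep a newer, already installed version such as pymongo 3.11.3 instead of reporting a conflict. Whether every newer release is actually compatible with the SDK would still need testing.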

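Upload has bugs

Version 0.6.0-3

1. On Windows, when uploading with the upload command, the uploaded path takes the form /xxx/xxx\xxx\xxx. The upload succeeds, but every file ends up at the same directory level.
2. On Mac, when using the command `upload .` to upload the files in the current path, the upload succeeds, but the "." is removed from every file name (for example, scrapy.cfg becomes scrapycfg after upload).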
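
Both symptoms would be consistent with building the remote path through string manipulation (keeping OS-specific separators, or stripping the base directory "." by text replacement), although the issue does not confirm the cause. The sketch below shows one way to derive a remote path with `pathlib` instead. The function name `to_remote_path` is hypothetical and not taken from the CLI.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def to_remote_path(base_dir: PurePath, file_path: PurePath) -> str:
    """Return file_path relative to base_dir, always using '/' separators."""
    # relative_to() removes the base directory by path components,
    # so dots inside file names are never touched.
    relative = file_path.relative_to(base_dir)
    # as_posix() converts backslashes to forward slashes on Windows paths.
    return relative.as_posix()


# Windows-style input: nested directories keep their structure.
print(to_remote_path(PureWindowsPath(r"C:\proj"),
                     PureWindowsPath(r"C:\proj\spiders\main.py")))

# POSIX-style input: the file name keeps its dot.
print(to_remote_path(PurePosixPath("/proj"),
                     PurePosixPath("/proj/scrapy.cfg")))
```

The pure path classes are used here so the example behaves the same on any operating system; real code would more likely use `pathlib.Path`.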

Download results with CLI

Hello Crawlab team,
I'm using Crawlab to deploy my Scrapy spiders. When I try to download the results as CSV, it takes a long time, sometimes more than 15 minutes. Is there a command in the CLI SDK to download the data directly?
Thank you

error: invalid character '-' in numeric literal

When trying to upload a Scrapy project, I get this error:

/.git/logs/refs/remotes/origin/HEAD
/.git/hooks/commit-msg.sample
/.git/hooks/pre-rebase.sample
/.git/hooks/pre-commit.sample
/.git/hooks/applypatch-msg.sample
/.git/hooks/fsmonitor-watchman.sample
/.git/hooks/pre-receive.sample
/.git/hooks/prepare-commit-msg.sample
/.git/hooks/post-update.sample
/.git/hooks/pre-merge-commit.sample
/.git/hooks/pre-applypatch.sample
/.git/hooks/pre-push.sample
/.git/hooks/update.sample
/.git/hooks/push-to-checkout.sample
/.git/refs/heads/master
/.git/refs/remotes/origin/HEAD
error: invalid character '-' in numeric literal
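
The listing above shows the contents of `.git` being included in the upload. The error text matches the format of JSON decoding errors from Go's `encoding/json` package, which suggests a response that was not valid JSON; whether the `.git` files are related is not established in this issue. Independently of the error, skipping `.git` keeps repository internals out of the upload. A minimal sketch, where `iter_upload_files` and its `skip_dirs` parameter are hypothetical names and not part of the CLI:

```python
import os


def iter_upload_files(root, skip_dirs=(".git",)):
    """Yield file paths under root (relative, '/'-separated), skipping skip_dirs."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            full = os.path.join(dirpath, name)
            yield os.path.relpath(full, root).replace(os.sep, "/")


if __name__ == "__main__":
    # List what would be uploaded from the current directory.
    for path in iter_upload_files("."):
        print(path)
```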

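save_item does not release database connections after saving data

The storage backend is MySQL.
When save_item writes to the database, testing shows that connections stay occupied even after the spider has finished. When several spiders are started at the same time, the database can no longer be reached. On inspection, the spiders' connections had not been released and were in the Sleep state. A single spider holds more than a thousand connections, and the number of held connections may be related to the amount of data collected. This is a serious bug 😢😢😢😢😢😢. I will switch to batch writes and test again. Could the author please reply as soon as possible?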
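
The behaviour described above would be consistent with opening a new connection for each saved item, though the issue does not show the SDK code. The sketch below illustrates the general remedy: reuse one connection for all items and close it when the spider finishes. It uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs without a server; `ItemWriter` and `run_spider` are hypothetical names, not SDK APIs.

```python
import sqlite3
from contextlib import closing


class ItemWriter:
    """Hypothetical writer that reuses one connection for every item."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results (title TEXT, url TEXT)"
        )

    def save_item(self, item):
        # Reuse the existing connection instead of opening a new one per item.
        self.conn.execute(
            "INSERT INTO results (title, url) VALUES (?, ?)",
            (item["title"], item["url"]),
        )

    def count(self):
        return self.conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]


def run_spider(items, database=":memory:"):
    """Save all items through one connection and close it when done."""
    # closing() guarantees conn.close() runs even if saving raises.
    with closing(sqlite3.connect(database)) as conn:
        writer = ItemWriter(conn)
        for item in items:
            writer.save_item(item)
        conn.commit()
        return writer.count()


print(run_spider([{"title": "a", "url": "https://example.com/a"}]))
```

With a MySQL driver the same shape applies: create the connection once per spider run and close it in a `finally` block or context manager.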

Problem with go.mod in the Go SDK

The current dependency versions are problematic:
replace (
github.com/crawlab-team/crawlab-grpc => /Users/marvzhang/projects/crawlab-team/crawlab-grpc
github.com/crawlab-team/go-trace => /Users/marvzhang/projects/crawlab-team/go-trace
)
Both go-trace and crawlab-grpc now have GitHub repositories. Should these entries be changed?

About crawlab.json

While working with the SDK, I found it inconvenient for scheduling and deploying multiple spiders, so I wondered whether it could be designed to look like the following:

.
├── packages
│   ├── js_spiders
│   │   ├── js_spider_1
│   │   │   └── index.js
│   │   ├── js_spider_2
│   │   │   └── index.js
│   │   ├── package.json
│   │   └── .....
│   └── py_spiders
│       ├── py_spider_1
│       │   └── main.py
│       ├── py_spider_2
│       │   └── main.py
│       ├── setup.py
│       └── .....
├── crawlab.json
└── makefile

crawlab.json

{
  "spiders": [
    {
      "path": "packages/js_spiders",
      "exclude_path": "node_modules",
      "name": "js spiders",
      "description": "js spiders",
      "cmd": "node",
      "schedules": [
        {
          "name": "js spider 1 cron",
          "cron": "* 1 * * *",
          "command": "node js_spider_1/index.js",
          "param": "",
          "mode": "random",
          "description": "js spider 1 cron",
          "enabled": true
        },
        {
          "name": "js spider 2 cron",
          "cron": "* 2 * * *",
          "command": "node js_spider_2/index.js",
          "param": "",
          "mode": "random",
          "description": "js spider 2 cron",
          "enabled": true
        }
      ]
    },
    {
      "path": "packages/py_spiders",
      "exclude_path": ".venv",
      "name": "py spiders",
      "description": "py spiders",
      "cmd": "python",
      "schedules": [
        {
          "name": "py spider 1 cron",
          "cron": "* 1 * * *",
          "command": "python py_spider_1/main.py",
          "param": "",
          "mode": "random",
          "description": "py spider 1 cron",
          "enabled": true
        },
        {
          "name": "py spider 2 cron",
          "cron": "* 2 * * *",
          "command": "python py_spider_2/main.py",
          "param": "",
          "mode": "random",
          "description": "py spider 2 cron",
          "enabled": true
        }
      ]
    }
  ]
}
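
To illustrate how a tool might consume the proposed file, here is a minimal sketch that reads a crawlab.json of the shape above and lists the enabled schedules for each spider. The function `list_schedules` is hypothetical; the field names come from the proposal, which is not an existing SDK format.

```python
import json


def list_schedules(config_text):
    """Return (spider name, schedule name, cron, command) for enabled schedules."""
    config = json.loads(config_text)
    rows = []
    for spider in config.get("spiders", []):
        for schedule in spider.get("schedules", []):
            # Schedules without an explicit "enabled": true are skipped.
            if schedule.get("enabled", False):
                rows.append(
                    (spider["name"], schedule["name"],
                     schedule["cron"], schedule["command"])
                )
    return rows


# A shortened version of the proposed configuration.
example = """
{
  "spiders": [
    {
      "path": "packages/py_spiders",
      "name": "py spiders",
      "schedules": [
        {"name": "py spider 1 cron", "cron": "* 1 * * *",
         "command": "python py_spider_1/main.py", "enabled": true},
        {"name": "py spider 2 cron", "cron": "* 2 * * *",
         "command": "python py_spider_2/main.py", "enabled": false}
      ]
    }
  ]
}
"""

for row in list_schedules(example):
    print(row)
```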

I can help implement this if you think it is possible
@tikazyq
