
celery-worker-demo's Introduction

Question: Optimal configuration for a python worker (crawler + save db) on celery

Discussion on: https://stackoverflow.com/questions/55165370/optimal-setup-for-a-python-worker-crawler-save-db-on-celery

I have multiple Python workers that crawl certain websites, parse the data, and store it in a Postgres database.

It's unclear to me how to architect the code to make the best use of server resources (the workers are deployed as microservices across multiple pods on Kubernetes). Let's assume there is no rate limit on the requests.

For demo purposes, I created sample code that fetches the top 10k websites, stores them in the database, and then crawls search results from Bing (storing those as well). This can be extended to 1M websites.

Celery uses the gevent pool because the workers do a lot of network I/O. I added psycogreen to patch the Postgres driver (psycopg2) as well, to avoid bottlenecks. To avoid reaching the Postgres max connections limit, I added pgbouncer as a database proxy.

The entry point is ./app/entrypoint.sh and the main logic is in ./app/worker.py

There are 3 sub-questions regarding this implementation:

How to size/tweak the variables?

  • Worker concurrency
  • SQLAlchemy pool_size (normally pgbouncer would take over)
  • worker_prefetch_multiplier
  • broker_pool_limit
  • Number of replicas of the Python worker (and how it affects the database load)
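
One way to reason about these variables together is a rough worst-case budget. The sketch below is not taken from the repository: the names `WorkerSizing` and `connection_budget` and all the numbers are illustrative assumptions. It assumes one gevent worker process per replica, that a greenlet holds at most one database connection at a time, and that Celery reserves up to concurrency × worker_prefetch_multiplier messages per worker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSizing:
    replicas: int             # number of worker pods (assumed one process each)
    concurrency: int          # greenlets per worker (celery --concurrency)
    prefetch_multiplier: int  # worker_prefetch_multiplier
    pool_size: int            # SQLAlchemy pool_size
    max_overflow: int         # SQLAlchemy max_overflow


def connection_budget(s: WorkerSizing) -> dict:
    """Estimate worst-case load; a rough model, not a measurement."""
    # A greenlet needs at most one connection at a time, and the SQLAlchemy
    # pool caps how many connections a single worker process can open.
    per_replica = min(s.concurrency, s.pool_size + s.max_overflow)
    return {
        # Connections the proxy (pgbouncer) may see from all replicas.
        "client_connections": s.replicas * per_replica,
        # Messages held by workers and therefore unavailable to other replicas.
        "reserved_tasks": s.replicas * s.concurrency * s.prefetch_multiplier,
    }


# Illustrative numbers only.
sizing = WorkerSizing(replicas=10, concurrency=100, prefetch_multiplier=1,
                      pool_size=20, max_overflow=10)
budget = connection_budget(sizing)
print(budget)
```

Comparing `client_connections` against pgbouncer's `max_client_conn`, and pgbouncer's server-side pool size against the Postgres `max_connections`, gives a first sanity check before load testing. Real behaviour depends on how long each task holds a connection.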

How to optimize the code?

There seems to be room to improve the code. How can I trace the bottlenecks (I suspect the database or BeautifulSoup, but it is hard to tell when using gevent), and how can I optimize them?
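
A low-tech way to start tracing is to time each stage of a task separately. The sketch below uses only the standard library. The `timed` helper and the stage names are hypothetical, and the `time.sleep` calls stand in for the real fetch, parse, and save steps in ./app/worker.py.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds and call counts per named stage.
stage_totals = defaultdict(float)
stage_counts = defaultdict(int)


@contextmanager
def timed(stage):
    """Add the wall-clock duration of the enclosed block to `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[stage] += time.perf_counter() - start
        stage_counts[stage] += 1


def report():
    """Return (stage, calls, total_seconds) rows, slowest stage first."""
    rows = sorted(stage_totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, stage_counts[name], round(total, 4)) for name, total in rows]


# Stand-ins for the real steps of one task.
with timed("fetch"):
    time.sleep(0.02)                      # e.g. the HTTP request
with timed("parse"):
    sum(i * i for i in range(10_000))     # e.g. HTML parsing (CPU-bound)
with timed("save"):
    time.sleep(0.01)                      # e.g. the database insert

for row in report():
    print(row)
```

Under gevent these are wall-clock numbers, so a stage's time also includes any period during which other greenlets were running. A CPU-bound stage such as parsing never yields, which tends to inflate the timings of every other stage; that pattern is itself a hint about where the bottleneck is.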

The database closes unexpectedly sometimes, why?

When I run the code and trigger 10K+ pulls, the worker hangs after a few pulls and occasionally throws: (psycopg2.OperationalError) server closed the connection unexpectedly. Any recommendations on how to size the database resources to support such tasks?
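
Whatever the root cause turns out to be, the save step can be made more tolerant of dropped connections by retrying with backoff. The sketch below is a generic, standard-library illustration and not the repository's code. `TransientDBError` is a hypothetical stand-in for psycopg2.OperationalError, and `with_retry` and `flaky_save` are made-up names. It mitigates the symptom; it does not explain why the server closes the connection.

```python
import random
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for psycopg2.OperationalError in this sketch."""


def with_retry(func, retry_on=(TransientDBError,), attempts=4,
               base_delay=0.05, sleep=time.sleep):
    """Call func(); on a retryable error, back off exponentially and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff plus jitter so replicas do not retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


calls = {"n": 0}


def flaky_save():
    """Simulated save that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientDBError("server closed the connection unexpectedly")
    return "saved"


# The sleep is stubbed out here only to keep the demonstration instant.
result = with_retry(flaky_save, sleep=lambda _: None)
print(result, "after", calls["n"], "calls")
```

In a real worker, similar behaviour may be available without custom code: Celery tasks accept `autoretry_for` and `retry_backoff` options, and SQLAlchemy's `create_engine` accepts `pool_pre_ping=True` to test a pooled connection before using it. Whether either fits this setup behind pgbouncer is an assumption to verify.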

Development environment

  • Import database schema: docker-compose exec -T postgres psql -U postgres celery-worker-demo < ./celery-worker-demo.sql
  • Start the worker locally: docker-compose up worker
  • Scale the worker to 10: docker-compose scale worker=10
  • Trigger pulling: docker-compose run --rm worker python -c "from worker import pull_domains; pull_domains.s().apply_async();"

For linting, use virtualenv -p python3 env; source env/bin/activate; pip install -r requirements.txt

Production environment

The worker deployment specification for Kubernetes is available in ./worker-k8s.yml

celery-worker-demo's People

Contributors

melalj
