Comments (11)
Only the Run feature from the UI isn't working, right? The problem is that I don't want to run an executor and a task in the scope of a web request; I need to run that task async, and without a remote service that's just impossible.
You can use `airflow run` from the CLI until you move to CeleryExecutor. BTW it's super easy to set up and it can run on the same box. You can use SqlAlchemy as a broker and see how much mileage you get.
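For anyone landing here, the CLI invocation looks roughly like this. The DAG id, task id, and execution date are made-up placeholders, and the snippet guards on the `airflow` binary so it is harmless on a box without Airflow:

```shell
# Sketch: trigger a single task instance from the CLI instead of the UI's
# Run button. "example_dag", "example_task", and the date are hypothetical.
if command -v airflow >/dev/null 2>&1; then
    airflow run example_dag example_task 2015-06-01 || true
    status="airflow-invoked"
else
    status="airflow-not-installed"
fi
# Record what happened so the sketch is observable either way.
echo "$status" > cli_sketch_status.txt
```

The task runs with whatever executor is configured, outside any web request.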
from airflow.
I don't suppose it would be possible to have a quick-start / a few steps added to the documentation to get started with Celery? I'm having trouble convincing my colleagues that maintaining Celery wouldn't be a massive overhead too.
Well, Celery is integrated with Airflow; it's just a Python library that ships with Airflow. The Celery broker (most likely RabbitMQ or Redis) is a piece of infrastructure that is required, and someone needs to keep it up and running. Redis is fairly common nowadays and a breeze to set up; at Airbnb we already had both systems running in production and in-house knowledge about them.
But note that Celery supports using a database (through SqlAlchemy) as a broker, which you should already have set up. So using your same SqlAlchemy connection as a broker seems pretty reasonable to me, even though it is "experimental" as far as Celery support goes.
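A minimal sketch of what that could look like in `airflow.cfg`, assuming the section and key names of Airflow versions from this era (verify against your version); the connection strings are placeholders, and kombu's SQLAlchemy transport uses the `sqla+` URL prefix:

```ini
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://user:pass@localhost/airflow

[celery]
# Reuse the metadata database as the Celery broker (experimental in Celery).
broker_url = sqla+postgresql://user:pass@localhost/airflow
celery_result_backend = db+postgresql://user:pass@localhost/airflow
```

No extra infrastructure beyond the database you already run.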
The thing is, Celery is an async framework that can operate at web scale (a common use case is processing thumbnails for uploaded images outside the scope of a web request), and is set up to handle dozens, if not thousands, of messages per second. A database might have some trouble with that many messages, plus the workers constantly poking at it. But with Airflow, the number of messages you'd send is probably in the few hundreds or thousands a day, so using a database as the Celery broker might be very reasonable, especially in a pre-production-type setup.
As far as getting a proper Celery setup going, people should refer to the Celery docs. I just added a reference in the docs here:
65c5f0a
Hey, it'd be nice to get the best of both worlds in terms of "get going quickly" and "scale to infinity", but the latter has to require some infrastructure.
For the record, when RabbitMQ was having some problems (unrelated to Airflow), I set up a survival Redis box and migrated in about 20 minutes. Of course, productionizing Redis, setting up a slave, and monitoring it is more work, but you can do all of this once Airflow becomes an important part of your ecosystem. This should be somewhat trivial for ops folks or data-infra people: providing the services you need to do your work should be part of their job description. I respect trying to keep the ecosystem simple though!
I'm leaning towards using our Postgres DB as the broker, as it is the quickest route to adoption within the company and will be fine for a while. When we reach scale, I'd lean towards SQS over anything else because it's infrastructure that I don't need to maintain, ansibilize, monitor, etc., and because it scales to hundreds of millions of messages per day.
Postgres/SqlAlchemy should work just fine as a CeleryBroker, please let us know how much mileage you get out of it. I'd bet twelve bucks that it would just never become the bottleneck.
I've opened #63
The experimental status and list of limitations of the SqlAlchemy broker is a real turn-off: http://celery.readthedocs.org/en/latest/getting-started/brokers/sqlalchemy.html#broker-sqlalchemy
I'm looking for a workflow engine that can be lightweight during the adoption phase at our company and fault-tolerant down the line. I've started playing with Celery a bit, but I don't want to stand up RabbitMQ/Redis or any other backend right now, even in production, because there is a cost to launching infrastructure in production -- I need to ansibilize it, set up logging and alerting, set up monit, etc., all before anyone is using it. SQS and Postgres both have limitations as brokers and known bugs.
I liked the support for the LocalExecutor and SequentialExecutor because they were lightweight. If and when adoption grows here, we will consider Celery and setting up Redis/RabbitMQ, but for now we won't. In addition to supporting the broker infrastructure for Celery, I also need to run a separate "airflow worker" and make sure it is fault-tolerant (e.g. monit, etc.). It would have been nice if a worker started in the main "airflow webserver", but I don't see any queue consumers running when I run "airflow webserver".
Finally, I'm not clear why running the LocalExecutor (if I don't have more than 3 flows running) is a bad idea. But, I would like to have the UI features work and I would like to have the dags imported into the DB, not just showing up on the UI.
You're talking about 1 UI feature (TaskDialog->Run) that we lived without for months. It's pretty minimal.
I'm not sure if you have tried it, but `airflow scheduler` does start a working LocalExecutor in the background if it is set up that way.
As for keeping two commands up and running, that should be pretty easy to do. I haven't seen `airflow webserver` and `airflow scheduler` go down in a long time. `nohup` or `screen` should give you mileage beyond POC. Though clearly in a production setup they should be kept up and monitored.
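The pattern being described is just background processes detached from the terminal, with output captured to a log. A runnable sketch, using a short `sh -c` command as a stand-in for `airflow scheduler` so it works anywhere:

```shell
# POC-grade supervision: detach a long-running service from the terminal
# and capture its output. Substitute `airflow scheduler` (or
# `airflow webserver`) for the sh -c stand-in below.
nohup sh -c 'echo scheduler-started; sleep 1' > scheduler.log 2>&1 &
pid=$!
wait "$pid"
# The log file survives the process and the terminal session.
grep -q "scheduler-started" scheduler.log && echo "supervision-ok" > supervision_status.txt
```

With `nohup` the process ignores the hangup signal when you log out; `screen` (or `tmux`) gets you the same effect plus a reattachable session.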
I feel like we have it pretty good on offering variety on the spectrum of ramping up to production. Maybe it could be better, but it's pretty decent as is. I don't see us spending cycles there for the moment.
Sorry, I thought `airflow scheduler` was related to Celery execution... My colleagues and I somehow missed that after going through both the quick start and the tutorial. There is a reference to a "master scheduler" in the tutorial, which led to some conflation of the Airflow (local) scheduler and the Celery scheduler. Makes more sense now, so we will launch with the scheduler and the local executor.
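For anyone else who missed it: the scheduler just drives whichever executor is configured. A minimal `airflow.cfg` sketch for the local-executor setup (section and key names may differ across versions; the connection string is a placeholder):

```ini
[core]
# LocalExecutor runs tasks as subprocesses of the scheduler process;
# it needs a real database (e.g. Postgres), not SQLite.
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://user:pass@localhost/airflow
```

Then run `airflow scheduler` and `airflow webserver` side by side; no Celery, broker, or separate worker required.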
Hopefully this clarifies things a bit
b235411
What are your thoughts on using Disque rather than Redis as the broker?
If you mean as a broker for Celery, Disque doesn't seem to be documented here:
http://celery.readthedocs.org/en/latest/getting-started/brokers/