orchest / orchest

Build data pipelines, the easy way 🛠️

Home Page: https://orchest.readthedocs.io/en/stable/

License: Apache License 2.0

Python 44.85% Shell 1.62% Dockerfile 0.75% HTML 0.03% JavaScript 0.23% SCSS 0.87% Jupyter Notebook 0.45% Mako 0.04% TypeScript 44.89% Makefile 0.04% Go 6.14% Smarty 0.09%
data-science machine-learning pipelines ide jupyter cloud self-hosted jupyterlab notebooks docker

orchest's Introduction

Notice: we’re no longer actively developing Orchest. We could not find a way to make building a workflow orchestrator commercially viable. Check out Apache Airflow for a robust workflow solution.

Build data pipelines, the easy way 🙌

No frameworks. No YAML. Just write your data processing code directly in Python, R or Julia.

💡 Watch the full narrated video to learn more about building data pipelines in Orchest.

Note: Orchest is in beta.

Features

  • Visually construct pipelines through our user-friendly UI
  • Code in Notebooks and scripts (quickstart)
  • Run any subset of a pipeline directly or periodically (jobs)
  • Easily define your dependencies to run on any machine (environments)
  • Spin up services whose lifetime spans across the entire pipeline run (services)
  • Version your projects using git (projects)

When to use Orchest? Read it in the docs.

👉 Get started with our quickstart tutorial or have a look at our video tutorials explaining some of Orchest's core concepts.

Roadmap

Missing a feature? Have a look at our public roadmap to see what the team is working on in the short and medium term. Still missing it? Please let us know by opening an issue!

Examples

Get started with an example project:

👉 Check out the full list of example projects.


Installation

Want to skip the installation and jump right in? Then try out our managed service: Orchest Cloud.

Slack Community

Join our Slack to chat about Orchest, ask questions, and share tips.


License

The software in this repository is licensed as follows:

  • All content residing under the orchest-sdk/ and orchest-cli/ directories of this repository is licensed under the Apache-2.0 license, as defined in orchest-sdk/LICENSE and orchest-cli/LICENSE respectively.
  • Content outside of the above-mentioned directories is available under the AGPL-3.0 license.

Contributing

Contributions are more than welcome! Please see our contributor guides for more details.

Alternatively, you can submit your pipeline to the curated list of Orchest examples that are automatically loaded in every Orchest deployment! 🔥

Contributors

andthewings, astrojuanlu, brunoorchest, cacrespo, cceyda, dependabot[bot], fanahova, fruttasecca, howie6879, humitos, iannbing, jacobodeharo, jerdna-regeiz, joe-bell, kingabzpro, mausworks, mitchglass97, mweltevrede, ncspost, nhaghighat, obulat, ricklamers, samkovaly, sbarrios93, shrikantkarve, vivanvatsa, yannickperrenet

orchest's Issues

SSH support for versioning from JupyterLab terminal

Without SSH support, the user will always have to manually enter their username and password. Another possibility would be to do the versioning outside of Orchest altogether, but that is not ideal if Orchest is installed on a cloud instance.

We need to discuss how this should work in a multi-user context.

@ricklamers @fruttasecca

Add pipeline setting to enable eviction for the memory-server

After opening a pipeline you can go to its settings; here you will see things like "Pipeline name" and a section called "Memory server". We want to add an additional option to this section that enables eviction.

This is done by adding the auto-eviction option to the top-level settings section in the pipeline.json file of that specific pipeline.

{
  "name": "pipeline-name",
  ...
  "settings": {
      "auto-eviction": true
  }
  ...
}
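
For illustration, a minimal sketch of how this setting could be toggled programmatically by editing the pipeline definition directly (the file path is just an example):

import json

# Load the pipeline definition (path is an example).
with open("pipeline.json") as f:
    pipeline = json.load(f)

# Add the auto-eviction option to the top-level settings section.
pipeline.setdefault("settings", {})["auto-eviction"] = True

with open("pipeline.json", "w") as f:
    json.dump(pipeline, f, indent=2)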

Environment variables

Environment variables will replace the current Data sources:

  • Data sources are currently little more than secret management, which can be done better using environment variables. Additionally, environment variables are a concept most users already understand.
  • Arguably, a user wants full authority over their data source connectors and thus we should consider it as part of the code they want to (or have to) write themselves.

The current idea around adding environment variables to Orchest is as follows:

  • ENVs should be treated as secrets and should thus be excluded from versioning.
  • Defined at the project level, but with the possibility of specifying pipeline level overrides.
  • When removing Data sources, the /data concept is kept. However, host system file mounting is no longer supported (symlinks from the /data directory won't work due to Docker); instead you need to put the data you want to use directly in the /data directory.

Implementation details:

  • The values of the ENVs are stored in the orchest-webserver and persisted in the orchest-api whenever a job is started. This is similar to how we treat parameters.
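
For illustration, from a pipeline step's point of view a secret would then simply be read from the environment; the variable name below is just an example:

import os

# Hypothetical secret made available by Orchest as an environment variable,
# defined at the project level and possibly overridden at the pipeline level.
db_password = os.environ.get("DB_PASSWORD")
if db_password is None:
    raise RuntimeError("DB_PASSWORD is not set for this project/pipeline")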

Hide 'run incoming steps' button if a step has no incoming steps

If a step has no incoming steps, then pressing "run incoming steps" will not execute anything, but the orchest-api will still be called. It would be better if the button is not shown to the user at all.

@ricklamers What do you think? We first wanted to add a client side warning, but I felt this is actually out of place since clicking away the warning takes the user more time than executing the "empty" pipeline run.

Kubernetes / Docker Swarm support

In the future we'd like to support more advanced multi-node use cases by building on top of existing container orchestration abstractions. This issue will track progress on this particular feature and the decisions that are made around it.

Add status to jobs

Use

{
  ...
  "total_number_of_pipeline_runs": int,
  "completed_pipeline_runs": int,
  ...
}

that are returned by GET /experiments/<experiment_uuid> to set the status of an experiment in the front-end (inside the table you see after clicking "Experiments" in the left-pane menu). It should be something like "5/9", meaning 5 out of 9 pipeline runs of the experiment have completed.
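
A minimal sketch (in Python, for brevity) of turning those two fields into the displayed status; the field names come from the response above, the helper itself is illustrative:

def experiment_status(experiment: dict) -> str:
    # experiment is the JSON body returned by GET /experiments/<experiment_uuid>
    completed = experiment["completed_pipeline_runs"]
    total = experiment["total_number_of_pipeline_runs"]
    return f"{completed}/{total}"

# experiment_status({"completed_pipeline_runs": 5, "total_number_of_pipeline_runs": 9}) == "5/9"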

Pipeline level parameters

Ability to define pipeline level parameters, besides the already implemented step parameters.

  • Pipeline level parameters to be accessible in all steps. Both the pipeline level and step level parameters will be returned to the user (and so the user has to decide what to do on name collisions).

Implementation details:

  • The parameters will be added as "parameters" to the pipeline definition.
  • When starting a job, the pipeline level parameters have to be passed together with the step parameters, but with a special prefix; otherwise the selection of pipeline runs in jobs will break (see the sketch below).
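
A rough sketch of what merging the two parameter sets when starting a job could look like; the prefix name below is purely hypothetical:

# Hypothetical prefix used to keep pipeline level parameters apart from step
# parameters so that the selection of pipeline runs in jobs keeps working.
PIPELINE_PARAM_PREFIX = "pipeline_"

def build_job_parameters(step_params: dict, pipeline_params: dict) -> dict:
    params = dict(step_params)
    for name, value in pipeline_params.items():
        params[PIPELINE_PARAM_PREFIX + name] = value
    return params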

Select multiple edges/connections in the pipeline editor

In the pipeline editor we would ideally be able to select multiple edges/connections between pipeline steps. This should work similarly to how you can press Control and click on multiple pipeline steps to select them.

This makes it easier to delete multiple connections at the same time whilst making the editing experience more consistent between connections and steps.

Improve interactive session shut down speed

Shutting a session down is reasonably slow due to the graceful shutdown of the dockerized Jupyter kernels. However, if we could kill all session related containers directly, shutdown should be considerably faster.

As a consequence, rebooting would be faster as well.
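
A rough sketch of the "kill instead of graceful shutdown" idea using docker-py; the label used to find session related containers is hypothetical:

import docker

client = docker.from_env()

# Hypothetical label identifying all containers belonging to one interactive session.
session_containers = client.containers.list(
    filters={"label": "orchest-session-uuid=<session-uuid>"}
)

for container in session_containers:
    # SIGKILL instead of a graceful stop, trading cleanliness for shutdown speed.
    container.kill()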

UI element to show that data is stored in the memory-server

When working interactively you might want to see for which steps the data is actively stored in the memory-server. On the other hand, you also don't see what state is still active after running cells in a Jupyter Notebook.

In addition, the fact that a step shows "completed" as its status may already give enough of an indication of whether or not the data is in the store.

@ricklamers @fruttasecca Thoughts?

Make Celery `revoke` persistent

As can be read on Stack Overflow, Celery keeps its list of "revoked" tasks in memory. Therefore a reboot of the container would reschedule revoked Celery tasks due to RabbitMQ persistence (RabbitMQ persistence is implemented in PR #8).
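
One possible direction, assuming Celery's persistent revokes feature applies here, is to point the worker at a state database on a mounted volume so the revoked-task list survives container reboots (the path below is just an example):

# Celery configuration sketch: persist worker state (including revoked task ids)
# to a file that lives on a volume mounted into the container.
worker_state_db = "/userdir/.orchest/celery-worker.state"

# Equivalent command line flag:
#   celery worker --statedb=/userdir/.orchest/celery-worker.state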

Project detection failed

Hi there:

Sometimes I delete a project using the file manager and then import it again.

This causes Orchest to fail to load the project. The steps to reproduce are as follows:

cd orchest/userdir
git clone https://github.com/orchest/quickstart

Orchest will load this project automatically. The next steps in the web UI are as follows:

  • Click FileManage->projects
  • Delete quickstart dir.

Finally, import the quickstart project using Git again.

At this point, orchest can't load the quickstart project.

You can check the screenshot that I provided for details.


Dynamic canvas spawning in the pipeline editor

If you are in the pipeline editor, you will notice (when dragging the canvas or zooming out) that no canvas is drawn to the left and top. When a component of the pipeline is placed there, canvas should dynamically be added at that position.

The idea is basically an implementation of the dynamic canvas spawning seen in https://draw.io/.

Memory size increase of memory-server without requiring session reboot

When changing the size of the memory-server through the settings of a pipeline, the user is required to reboot the entire session and thus loses the state of all kernels.

It would be better if the user could restart only the memory-server itself for the changes to take effect. Possibly the size could even be changed dynamically, but we don't think this is supported without losing the current objects in the plasma store.

Notifications when builds and pipelines have finished executing

Notifications should be made optional.

Without notifications the user always needs to check the pipeline or build in order to know whether or not it has finished.

Notifications can be browser-based (when inside the application), but possibly also delivered via integrations such as email or Telegram.

Firefox sometimes blocks the JupyterLab iframe

This seems to only happen if the policies are set too strictly.

Someone had this issue on Firefox (68.10) on Windows (10). The error was

Content Security Policy: « x-frame-options » ignored due to « frame-ancestors » directive.

Using Chrome fixed it, but we can try to be as well-behaved as possible when it comes to content policies (we are serving from a single nginx proxy after all).

Shareable scripts/modules/packages across projects

Generally speaking, functionality can be shared across projects by creating a package or library and adding it as a dependency in the environments. For packages hosted in a private repository, environment variables (#124) will make it possible to supply credentials.

When it comes to making pipeline components, i.e. scripts, shareable (for example, sharing a notebook between pipelines in different projects), this cannot be done using a package. Instead you could put those scripts in a git repo and use it as a git submodule in the pipelines that make use of it. Additionally we could:

  • Change the pipeline definition such that it becomes possible to refer to scripts in the /data directory, from other steps or even from the pipeline editor. This should also aid the development process (similar to an editable install in pip); a small sketch of what a step could then look like follows below.
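
For example, a step could then use a shared script from /data roughly like this; the directory and module names are hypothetical:

import sys

# Hypothetical shared location inside the /data directory.
sys.path.append("/data/shared-scripts")

import preprocessing  # hypothetical shared module living in /data/shared-scripts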

Whilst thinking about this feature, it is good to keep in mind that one of the goals of Orchest is: a project (after importing) should be runnable. So all dependencies have to be completely resolvable by Orchest.

@howie6879 Did I write this up correctly, or did I miss anything?

Persisting user configurations of integrated IDEs

The configurations of IDEs we integrate in Orchest:

  • JupyterLab
  • VS Code (through code-server #113)

should be persisted so that a user does not have to configure their IDEs every time. Otherwise (as is currently the case), extensions added to JupyterLab, for example, would have to be reinstalled every time JupyterLab is started.

This feature should be easy to use and so purely defining the configuration programmatically (think dotfiles) is probably not the way to go. Additionally, the configurations should be portable between upgrades of Orchest (and possibly upgrades of the specific IDE service containers).

The current idea is to mount a new directory (from userdir/.orchest/...) to the appropriate location in the IDE service container to persist the configurations. For JupyterLab we have to make sure this does not require a rebuild and includes extensions (the jupyter lab clean command suggests this approach is possible since it allows for --extensions, --settings and --static flags).

Julia support

As a fan and user of the language, and having seen this question already mentioned on HN, it would be great to know what it would take to enable support for Julia notebooks - a language that typically plugs right in with Python libraries. My impression right now is that we would need to add a base kernel image, similar to the R support.

Unable to run ./orchest install getting 404 Client Error: Not Found for url: http+docker://localhost/v1.40/networks/orchest

I tried out the installation steps but hit an error when running ./orchest install. See the detailed log below.

versions:

docker --version
Docker version 19.03.13, build 4484c46d9d
Orchest commit: 489bde8f2fe217e56e79cd55cb90d493d53006a3 (Jan 4)

Detailed logs:

Unable to find image 'orchest/orchest-ctl:latest' locally
latest: Pulling from orchest/orchest-ctl                                                            
6ec7b7d162b2: Already exists        
80ff6536d04b: Pull complete                                                                         
6c51d3836e95: Pull complete                         
6ce84404158b: Pull complete                                                                                       
6e001f327b45: Pull complete                                                          
31686f95ea4e: Pull complete                                                                                                         
c6c989f83870: Pull complete                                                         
936cc2d383ad: Pull complete                                                               
Digest: sha256:e22ee169ea6709e29839a865cbd6ffc3f6d5e8390b1f94fb85edc3e920f888c4           
Status: Downloaded newer image for orchest/orchest-ctl:latest                                 
Installation might take some time depending on your network bandwidth. Starting installation...     
Pulling images: 14/14|#############################################################################|
Orchest sends anonymized telemetry to analytics.orchest.io. To disable it, please refer to:      
        https://orchest.readthedocs.io/en/stable/user_guide/other.html#configuration      
                                                                                                                             
Traceback (most recent call last):                                                                  
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 268, in _raise_for_status
    response.raise_for_status()                                                                                   
  File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 943, in raise_for_status   
    raise HTTPError(http_error_msg, response=self)                                                                                                          
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.40/networks/orchest
                                                                                                                                                             
During handling of the above exception, another exception occurred:                                                                                          
                                                                                    
Traceback (most recent call last):                                                        
  File "/usr/local/lib/python3.7/site-packages/app/utils.py", line 156, in install_network
    docker_client.networks.get(config.DOCKER_NETWORK)                                         
  File "/usr/local/lib/python3.7/site-packages/docker/models/networks.py", line 182, in get
    self.client.api.inspect_network(network_id, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/docker/api/network.py", line 213, in inspect_network
    return self._result(res, json=True)
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 274, in _result
    self._raise_for_status(response)
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 270, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.7/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.NotFound: 404 Client Error for http+docker://localhost/v1.40/networks/orchest: Not Found ("network orchest not found")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 268, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.40/networks/create

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/orchest", line 8, in <module>
    sys.exit(__entrypoint())
  File "/usr/local/lib/python3.7/site-packages/app/main.py", line 59, in __entrypoint
    app()
  File "/usr/local/lib/python3.7/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.7/site-packages/app/main.py", line 124, in install
    cmdline.install(lang)
  File "/usr/local/lib/python3.7/site-packages/app/cmdline.py", line 64, in install
    utils.install_network()
  File "/usr/local/lib/python3.7/site-packages/app/utils.py", line 173, in install_network
    config.DOCKER_NETWORK, driver="bridge", ipam=ipam_config
  File "/usr/local/lib/python3.7/site-packages/docker/models/networks.py", line 156, in create
    resp = self.client.api.create_network(name, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/docker/api/network.py", line 153, in create_network
    return self._result(res, json=True)
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 274, in _result
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 270, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.7/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.40/networks/create: Internal Server Error ("failed to update bridge store for object
type *bridge.networkConfiguration: open /var/lib/docker/network/files/local-kv.db: read-only file system")
ERRO[0224] Error waiting for container: container 01b7da9a8884603afad1b15cd954e52866596226fa27c0eb41ceeedb022bf588: driver "btrfs" failed to remove root file
system: Failed to destroy btrfs snapshot /var/lib/docker/btrfs/subvolumes for 213212e826d1f2079d242a75aa60c79d7bbfe7a3ac1b67e0e641d532f80434b2: read-only file system

User authentication/management does not work

When trying to create new users from Settings, clicking the button does not work as expected, as no request is sent to the auth service. Sending the POST request using curl works, as I've been able to create some user accounts this way.

The same thing happens at the Login page: clicking the Login button does not send any request to the auth service.

Note: I'm running Orchest with SSL enabled.

Expand editor support with integrated code-server

This issue keeps track of the integration of the code-server browser based VS Code editor.

As per @howie6879's recommendation we'll look into how and whether it makes sense to expand the available editors beyond JupyterLab.

At the moment VS Code can be used with Orchest by opening the orchest/userdir/ directory in VS Code directly on the host on which Orchest is installed (or through SSH if it's on a remote server).

How about using code-server to edit the .py script?

Great project for me, thx!

I have a little idea in the actual use:

  • JupyterLab works well, but not so well for editing .py scripts; how about using code-server to edit .py scripts?
  • Currently a project has multiple pipelines, and scripts within a project can be shared between its pipelines. My question is whether project-A could share some scripts of project-B, or whether there could be a concept of a public script directory for any pipeline to use.

Looking forward to your reply.

[SUGGESTION] Do not rely on Docker for local development and testing

Hi Team,

Thanks for this great idea, I'm loving it. However, for local development, I don't see a technological reason to rely on Docker. Many of our data scientists do not use Docker (for all sorts of different reasons), and there is no good argument to force them to use it.

Looking at the source quickly, orchest should be able to run without a container. Furthermore, there are obvious pain-points with docker (e.g., #56).

Could you please make the docker dependency optional?

Many thanks!

Ability to dynamically change the pipeline definition through code

We might want to provide language API’s (through the orchest-sdk) to interact with the pipeline definition.

pipeline.add_step("new-title", "filename.py")

Not sure if we ever want to do this, or how exactly. But it's basically pipelines as code instead of JSON files or visual editing.
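
Since a pipeline is ultimately defined in a pipeline.json file, a first approximation of such an API could simply manipulate that file; the step fields below are a rough guess rather than the exact schema:

import json
import uuid

def add_step(pipeline_path: str, title: str, file_path: str) -> None:
    # Sketch: append a new step to the pipeline definition on disk.
    with open(pipeline_path) as f:
        pipeline = json.load(f)

    pipeline.setdefault("steps", {})[str(uuid.uuid4())] = {
        "title": title,
        "file_path": file_path,
        "incoming_connections": [],
    }

    with open(pipeline_path, "w") as f:
        json.dump(pipeline, f, indent=2)

add_step("pipeline.json", "new-title", "filename.py")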
