orchest / orchest

Build data pipelines, the easy way 🛠️

Home Page: https://orchest.readthedocs.io/en/stable/

License: Apache License 2.0

Python 44.85% Shell 1.62% Dockerfile 0.75% HTML 0.03% JavaScript 0.23% SCSS 0.87% Jupyter Notebook 0.45% Mako 0.04% TypeScript 44.89% Makefile 0.04% Go 6.14% Smarty 0.09%
data-science machine-learning pipelines ide jupyter cloud self-hosted jupyterlab notebooks docker

orchest's Introduction

Notice: we’re no longer actively developing Orchest. We could not find a way to make building a workflow orchestrator commercially viable. Check out Apache Airflow for a robust workflow solution.

Build data pipelines, the easy way 🙌

No frameworks. No YAML. Just write your data processing code directly in Python, R or Julia.

💡 Watch the full narrated video to learn more about building data pipelines in Orchest.

Note: Orchest is in beta.

Features

  • Visually construct pipelines through our user-friendly UI
  • Code in Notebooks and scripts (quickstart)
  • Run any subset of a pipeline directly or periodically (jobs)
  • Easily define your dependencies to run on any machine (environments)
  • Spin up services whose lifetime spans across the entire pipeline run (services)
  • Version your projects using git (projects)

When to use Orchest? Read it in the docs.

👉 Get started with our quickstart tutorial or have a look at our video tutorials explaining some of Orchest's core concepts.

Roadmap

Missing a feature? Have a look at our public roadmap to see what the team is working on in the short and medium term. Still missing it? Please let us know by opening an issue!

Examples

Get started with an example project:

👉 Check out the full list of example projects.


Installation

Want to skip the installation and jump right in? Then try out our managed service: Orchest Cloud.

Slack Community

Join our Slack to chat about Orchest, ask questions, and share tips.


License

The software in this repository is licensed as follows:

  • All content residing under the orchest-sdk/ and orchest-cli/ directories of this repository is licensed under the Apache-2.0 license, as defined in orchest-sdk/LICENSE and orchest-cli/LICENSE respectively.
  • Content outside of the above-mentioned directories is available under the AGPL-3.0 license.

Contributing

Contributions are more than welcome! Please see our contributor guides for more details.

Alternatively, you can submit your pipeline to the curated list of Orchest examples that are automatically loaded in every Orchest deployment! 🔥

Contributors

andthewings, astrojuanlu, brunoorchest, cacrespo, cceyda, dependabot[bot], fanahova, fruttasecca, howie6879, humitos, iannbing, jacobodeharo, jerdna-regeiz, joe-bell, kingabzpro, mausworks, mitchglass97, mweltevrede, ncspost, nhaghighat, obulat, ricklamers, samkovaly, sbarrios93, shrikantkarve, vivanvatsa, yannickperrenet

orchest's Issues

SSH support for versioning from JupyterLab terminal

Without SSH support, the user will always have to manually enter their username and password. Another possibility would be to do the versioning outside of Orchest altogether, but that is not ideal if Orchest is installed on a cloud instance.

We need to discuss how this should work in a multi-user context.

@ricklamers @fruttasecca

Add pipeline setting to enable eviction for the memory-server

After opening a pipeline you can go to its settings; here you will see things like "Pipeline name" and a section called "Memory server". We want to add an additional option to this section that enables eviction.

This is done by adding the auto-eviction option to the top-level settings section in the pipeline.json file of that specific pipeline.

{
  "name": "pipeline-name",
  ...
  "settings": {
      "auto-eviction": true
  }
  ...
}
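
For illustration, a minimal sketch of how this setting could be toggled programmatically by editing the pipeline definition directly (the file path is just an example):

import json

# Load the pipeline definition (path is an example).
with open("pipeline.json") as f:
    pipeline = json.load(f)

# Add the auto-eviction option to the top-level settings section.
pipeline.setdefault("settings", {})["auto-eviction"] = True

with open("pipeline.json", "w") as f:
    json.dump(pipeline, f, indent=2)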

Environment variables

Environment variables will replace the current Data sources:

  • Data sources are currently little more than secret management, which can be done better using environment variables. Additionally, environment variables are a concept most users already understand.
  • Arguably, a user wants full authority over their data source connectors and thus we should consider it as part of the code they want to (or have to) write themselves.

The current idea around adding environment variables to Orchest is as follows:

  • ENVs should be treated as secrets and should thus be excluded from versioning.
  • Defined at the project level, but with the possibility of specifying pipeline level overrides.
  • When removing Data sources, the /data concept is kept. However, host system file mounting is no longer supported (symlinks from the /data directory won't work due to Docker); instead you need to put the data you want to use directly in the /data directory.

Implementation details:

  • The values of the ENVs are stored in the orchest-webserver and persisted in the orchest-api whenever a job is started. This is similar to how we treat parameters.
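
For illustration, from a pipeline step's point of view a secret would then simply be read from the environment; the variable name below is just an example:

import os

# Hypothetical secret made available by Orchest as an environment variable,
# defined at the project level and possibly overridden at the pipeline level.
db_password = os.environ.get("DB_PASSWORD")
if db_password is None:
    raise RuntimeError("DB_PASSWORD is not set for this project/pipeline")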

Hide 'run incoming steps' button if a step has no incoming steps

If a step has no incoming steps, then pressing "run incoming steps" will not execute anything, but the orchest-api will still be called. It would be better if the button is not shown to the user at all.

@ricklamers What do you think? We first wanted to add a client side warning, but I felt this is actually out of place since clicking away the warning takes the user more time than executing the "empty" pipeline run.

Kubernetes / Docker Swarm support

In the future we'd like to support more advanced multi-node use cases by building on top of existing container orchestration abstractions. This issue will track progress on this particular feature and the decisions that are made around it.

Add status to jobs

Use

{
  ...
  "total_number_of_pipeline_runs": int,
  "completed_pipeline_runs": int,
  ...
}

that are returned by GET /experiments/<experiment_uuid> to set the status of an experiment in the front-end (inside the table you see after clicking "Experiments" in the left-pane menu). It should be something like "5/9", meaning 5 out of 9 pipeline runs of the experiment have completed.
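
A minimal sketch (in Python, for brevity) of turning those two fields into the displayed status; the field names come from the response above, the helper itself is illustrative:

def experiment_status(experiment: dict) -> str:
    # experiment is the JSON body returned by GET /experiments/<experiment_uuid>
    completed = experiment["completed_pipeline_runs"]
    total = experiment["total_number_of_pipeline_runs"]
    return f"{completed}/{total}"

# experiment_status({"completed_pipeline_runs": 5, "total_number_of_pipeline_runs": 9}) == "5/9"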

Pipeline level parameters

Ability to define pipeline level parameters, besides the already implemented step parameters.

  • Pipeline level parameters to be accessible in all steps. Both the pipeline level and step level parameters will be returned to the user (and so the user has to decide what to do on name collisions).

Implementation details:

  • The parameters will be added as "parameters" to the pipeline definition.
  • When starting a job, the pipeline level parameters have to be passed together with the step parameters, but with a special prefix; otherwise the selection of pipeline runs in jobs will break (see the sketch below).
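
A rough sketch of what merging the two parameter sets when starting a job could look like; the prefix name below is purely hypothetical:

# Hypothetical prefix used to keep pipeline level parameters apart from step
# parameters so that the selection of pipeline runs in jobs keeps working.
PIPELINE_PARAM_PREFIX = "pipeline_"

def build_job_parameters(step_params: dict, pipeline_params: dict) -> dict:
    params = dict(step_params)
    for name, value in pipeline_params.items():
        params[PIPELINE_PARAM_PREFIX + name] = value
    return params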

Select multiple edges/connections in the pipeline editor

In the pipeline editor we would ideally be able to select multiple edges/connections between pipeline steps. This should work similarly to how you can press Control and click on multiple pipeline steps to select them.

This makes it easier to delete multiple connections at the same time whilst making the editing experience more consistent between connections and steps.

Improve interactive session shut down speed

Shutting a session down is reasonably slow due to the graceful shutdown of the dockerized Jupyter kernels. However, if we could kill all session related containers directly, shutdown should be considerably faster.

As a consequence, rebooting would be faster as well.
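
A rough sketch of the "kill instead of graceful shutdown" idea using docker-py; the label used to find session related containers is hypothetical:

import docker

client = docker.from_env()

# Hypothetical label identifying all containers belonging to one interactive session.
session_containers = client.containers.list(
    filters={"label": "orchest-session-uuid=<session-uuid>"}
)

for container in session_containers:
    # SIGKILL instead of a graceful stop, trading cleanliness for shutdown speed.
    container.kill()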

UI element to show that data is stored in the memory-server

When working interactively you might want to see for which steps the data is actively stored in the memory-server. On the other hand, you also don't see what state is still active after running cells in a Jupyter Notebook.

In addition, the fact that a step shows "completed" as its status may already give enough of an indication of whether or not the data is in the store.

@ricklamers @fruttasecca Thoughts?

Make Celery `revoke` persistent

As can be read on Stack Overflow, Celery keeps its list of "revoked" tasks in memory. Therefore a reboot of the container would reschedule revoked Celery tasks due to RabbitMQ persistence (RabbitMQ persistence is implemented in PR #8).
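
One possible direction, assuming Celery's persistent revokes feature applies here, is to point the worker at a state database on a mounted volume so the revoked-task list survives container reboots (the path below is just an example):

# Celery configuration sketch: persist worker state (including revoked task ids)
# to a file that lives on a volume mounted into the container.
worker_state_db = "/userdir/.orchest/celery-worker.state"

# Equivalent command line flag:
#   celery worker --statedb=/userdir/.orchest/celery-worker.state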

Project detection failed

Hi there:

Sometimes I delete a project using the file manager and then import it again.

This causes Orchest to fail to load the project. The steps to reproduce are as follows:

cd orchest/userdir
git clone https://github.com/orchest/quickstart

Orchest will load this project automatically. The next steps in the web UI are as follows:

  • Click FileManage->projects
  • Delete quickstart dir.

Finally, import the quickstart project using Git again.

At this point, orchest can't load the quickstart project.

You can check the screenshot that I provided for details.


Dynamic canvas spawning in the pipeline editor

If you are in the pipeline editor, you will notice (when dragging the canvas or zooming out) that no canvas is drawn to the left and top. When a component of the pipeline is placed there, canvas should dynamically be added at that position.

The idea is basically an implementation of the dynamic canvas spawning seen in https://draw.io/.

Memory size increase of memory-server without requiring session reboot

When changing the size of the memory-server through the settings of a pipeline, the user is required to reboot the entire session and thus loses the state of all kernels.

It would be better if the user could restart only the memory-server itself for the changes to take effect. Possibly the size could even be changed dynamically, but we don't think this is supported without losing the current objects in the plasma store.

Notifications when builds and pipelines have finished executing

Notifications should be made optional.

Without notifications the user always needs to check the pipeline or build in order to know whether or not it has finished.

Notifications can be browser-based (when inside the application), but possibly also delivered via integrations such as email or Telegram.

Firefox sometimes blocks the JupyterLab iframe

This seems to only happen if the policies are set too strictly.

Someone had this issue on Firefox (68.10) on Windows (10). The error was

Content Security Policy: « x-frame-options » ignored due to « frame-ancestors » directive.

Using Chrome fixed it, but we can try to be as well-behaved as possible when it comes to content policies (we are serving from a single nginx proxy after all).

Shareable scripts/modules/packages across projects

Generally speaking, functionality can be shared across projects by creating a package or library and adding it as a dependency in the environments. For packages hosted in a private repository, environment variables (#124) will make it possible to supply credentials.

When it comes to making pipeline components, i.e. scripts, shareable (for example, sharing a notebook between pipelines in different projects), this cannot be done using a package. Instead you could put those scripts in a git repo and use it as a git submodule in the pipelines that make use of it. Additionally we could:

  • Change the pipeline definition such that it becomes possible to refer to scripts in the /data directory, from other steps or even from the pipeline editor. This should also aid the development process (similar to an editable install in pip); a small sketch of what a step could then look like follows below.
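
For example, a step could then use a shared script from /data roughly like this; the directory and module names are hypothetical:

import sys

# Hypothetical shared location inside the /data directory.
sys.path.append("/data/shared-scripts")

import preprocessing  # hypothetical shared module living in /data/shared-scripts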

Whilst thinking about this feature, it is good to keep in mind that one of the goals of Orchest is: a project (after importing) should be runnable. So all dependencies have to be completely resolvable by Orchest.

@howie6879 Did I write this up correctly, or did I miss anything?

Persisting user configurations of integrated IDEs

The configurations of IDEs we integrate in Orchest:

  • JupyterLab
  • VS Code (through code-server #113)

should be persisted so that a user does not have to configure their IDEs every time. Otherwise (as is currently the case), extensions added to JupyterLab, for example, would have to be reinstalled every time JupyterLab is started.

This feature should be easy to use and so purely defining the configuration programmatically (think dotfiles) is probably not the way to go. Additionally, the configurations should be portable between upgrades of Orchest (and possibly upgrades of the specific IDE service containers).

The current idea is to mount a new directory (from userdir/.orchest/...) to the appropriate location in the IDE service container to persist the configurations. For JupyterLab we have to make sure this does not require a rebuild and includes extensions (the jupyter lab clean command suggests this approach is possible since it allows for --extensions, --settings and --static flags).

Julia support

As a fan and user of the language, and having seen this question already mentioned on HN, it would be great to know what it would take to enable support for Julia notebooks - a language that typically plugs right in with Python libraries. My impression right now is that we would need to add a base kernel image, similar to the R support.

Unable to run ./orchest install getting 404 Client Error: Not Found for url: http+docker://localhost/v1.40/networks/orchest

I tried out the installation steps but hit an error when running ./orchest install. See the detailed log below.

versions:

docker --version
Docker version 19.03.13, build 4484c46d9d
Orchest commit: 489bde8f2fe217e56e79cd55cb90d493d53006a3 (Jan 4)

Detailed logs:

Unable to find image 'orchest/orchest-ctl:latest' locally
latest: Pulling from orchest/orchest-ctl                                                            
6ec7b7d162b2: Already exists        
80ff6536d04b: Pull complete                                                                         
6c51d3836e95: Pull complete                         
6ce84404158b: Pull complete                                                                                       
6e001f327b45: Pull complete                                                          
31686f95ea4e: Pull complete                                                                                                         
c6c989f83870: Pull complete                                                         
936cc2d383ad: Pull complete                                                               
Digest: sha256:e22ee169ea6709e29839a865cbd6ffc3f6d5e8390b1f94fb85edc3e920f888c4           
Status: Downloaded newer image for orchest/orchest-ctl:latest                                 
Installation might take some time depending on your network bandwidth. Starting installation...     
Pulling images: 14/14|#############################################################################|
Orchest sends anonymized telemetry to analytics.orchest.io. To disable it, please refer to:      
        https://orchest.readthedocs.io/en/stable/user_guide/other.html#configuration      
                                                                                                                             
Traceback (most recent call last):                                                                  
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 268, in _raise_for_status
    response.raise_for_status()                                                                                   
  File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 943, in raise_for_status   
    raise HTTPError(http_error_msg, response=self)                                                                                                          
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.40/networks/orchest
                                                                                                                                                             
During handling of the above exception, another exception occurred:                                                                                          
                                                                                    
Traceback (most recent call last):                                                        
  File "/usr/local/lib/python3.7/site-packages/app/utils.py", line 156, in install_network
    docker_client.networks.get(config.DOCKER_NETWORK)                                         
  File "/usr/local/lib/python3.7/site-packages/docker/models/networks.py", line 182, in get
    self.client.api.inspect_network(network_id, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/docker/api/network.py", line 213, in inspect_network
    return self._result(res, json=True)
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 274, in _result
    self._raise_for_status(response)
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 270, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.7/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.NotFound: 404 Client Error for http+docker://localhost/v1.40/networks/orchest: Not Found ("network orchest not found")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 268, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.40/networks/create

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/orchest", line 8, in <module>
    sys.exit(__entrypoint())
  File "/usr/local/lib/python3.7/site-packages/app/main.py", line 59, in __entrypoint
    app()
  File "/usr/local/lib/python3.7/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.7/site-packages/app/main.py", line 124, in install
    cmdline.install(lang)
  File "/usr/local/lib/python3.7/site-packages/app/cmdline.py", line 64, in install
    utils.install_network()
  File "/usr/local/lib/python3.7/site-packages/app/utils.py", line 173, in install_network
    config.DOCKER_NETWORK, driver="bridge", ipam=ipam_config
  File "/usr/local/lib/python3.7/site-packages/docker/models/networks.py", line 156, in create
    resp = self.client.api.create_network(name, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/docker/api/network.py", line 153, in create_network
    return self._result(res, json=True)
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 274, in _result
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 270, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.7/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.40/networks/create: Internal Server Error ("failed to update bridge store for object
type *bridge.networkConfiguration: open /var/lib/docker/network/files/local-kv.db: read-only file system")
ERRO[0224] Error waiting for container: container 01b7da9a8884603afad1b15cd954e52866596226fa27c0eb41ceeedb022bf588: driver "btrfs" failed to remove root file
system: Failed to destroy btrfs snapshot /var/lib/docker/btrfs/subvolumes for 213212e826d1f2079d242a75aa60c79d7bbfe7a3ac1b67e0e641d532f80434b2: read-only file system

User authentication/management does not work

When trying to create new users from Settings, clicking the button does not work as expected, as no request is sent to the auth service. Sending the POST request using curl works, as I've been able to create some user accounts this way.

The same thing happens at the Login page: clicking the Login button does not send any request to the auth service.

Note: I'm running Orchest with SSL enabled.

Expand editor support with integrated code-server

This issue keeps track of the integration of the code-server browser based VS Code editor.

As per @howie6879's recommendation we'll look into how and whether it makes sense to expand the available editors beyond JupyterLab.

At the moment VS Code can be used with Orchest by opening the orchest/userdir/ directory in VS Code directly on the host on which Orchest is installed (or through SSH if it's on a remote server).

How about using code-server to edit the .py script?

Great project for me, thx!

I have a little idea in the actual use:

  • JupyterLab works well, but not so well for editing .py scripts; how about using code-server to edit .py scripts?
  • Currently a project has multiple pipelines, and scripts within a project can be shared between its pipelines. My question is whether project-A could share some scripts of project-B, or whether there could be a concept of a public script directory for any pipeline to use.

Looking forward to your reply.

[SUGGESTION] Do not rely on Docker for local development and testing

Hi Team,

Thanks for this great idea, I'm loving it. However, for local development, I don't see a technological reason to rely on Docker. Many of our data scientists do not use Docker (for all sorts of different reasons), and there is no good argument to force them to use it.

Looking at the source quickly, orchest should be able to run without a container. Furthermore, there are obvious pain-points with docker (e.g., #56).

Could you please make the docker dependency optional?

Many thanks!

Ability to dynamically change the pipeline definition through code

We might want to provide language API’s (through the orchest-sdk) to interact with the pipeline definition.

pipeline.add_step("new-title", "filename.py")

Not sure if we ever want to do this, or how exactly. But it's basically pipelines as code instead of JSON files or visual editing.
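
Since a pipeline is ultimately defined in a pipeline.json file, a first approximation of such an API could simply manipulate that file; the step fields below are a rough guess rather than the exact schema:

import json
import uuid

def add_step(pipeline_path: str, title: str, file_path: str) -> None:
    # Sketch: append a new step to the pipeline definition on disk.
    with open(pipeline_path) as f:
        pipeline = json.load(f)

    pipeline.setdefault("steps", {})[str(uuid.uuid4())] = {
        "title": title,
        "file_path": file_path,
        "incoming_connections": [],
    }

    with open(pipeline_path, "w") as f:
        json.dump(pipeline, f, indent=2)

add_step("pipeline.json", "new-title", "filename.py")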
