ploomber / ploomber-engine



A toolbox 🧰 for Jupyter notebooks 📙: testing, experiment tracking, debugging, profiling, and more!

Home Page: https://engine.ploomber.io

License: BSD 3-Clause "New" or "Revised" License

Python 94.63% Jupyter Notebook 5.37%

debug debugging jupyter papermill pipeline notebooks

ploomber-engine's Issues

execute_notebook `cwd` not working as expected

I think cwd is not working outside the pytest environment 😅

For example, consider the following:

.
├── run_execute_notebook.py
└── some_dir
    └── test.ipynb

run_execute_notebook.py

from ploomber_engine import execute_notebook

DIR = "./some_dir"
IPYNB = DIR + "/test.ipynb"
execute_notebook(IPYNB, IPYNB, cwd=DIR)

some_dir/test.ipynb

#%%
import os
os.getcwd()
#%%
! touch test_file.txt
#%%
import matplotlib.pyplot as plt
plt.figure().savefig('test.png')

After running python run_execute_notebook.py, the files are stored not in some_dir but in the top-level directory where Python was run:

.
├── run_execute_notebook.py
├── some_dir
│   └── test.ipynb
├── test.png
└── test_file.txt

test_cwd.zip
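
Until cwd behaves as expected, one possible workaround is to change the process working directory around the call. The sketch below uses only the standard library. working_directory is a hypothetical helper name, not part of ploomber-engine, and the commented call simply assumes the execute_notebook usage shown in the script above.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(path):
    """Temporarily switch the process working directory (hypothetical helper)."""
    previous = Path.cwd()
    os.chdir(path)
    try:
        yield Path.cwd()
    finally:
        # Always restore the original directory, even if the body raises
        os.chdir(previous)


# Assumed usage with the layout from the report:
# with working_directory("some_dir"):
#     execute_notebook("test.ipynb", "test.ipynb")
```

Since the whole process changes directory, relative paths created by the notebook (test.png, test_file.txt) should then land in some_dir. This would not be safe if several notebooks run concurrently in threads of the same process.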

engine that uses IPython directly

Adding another engine that uses IPython directly would enable resource monitoring (CPU, RAM, etc.). The current engine relies on a connection to an IPython kernel, so the papermill process is just sending code and waiting for responses.

ploomber-engine displaying standard error even without `--log-output`

I was testing the feature introduced in #66 and realized that standard error is displayed even when missing the --log-output argument, is this expected?

also, I saw that the progress bars are correctly displayed now but the changelog does not mention anything, is there anything missing here?

cc @mehtamohit013

experiment: showing our community page on errors

We want to incentivize our users to talk to us when they encounter problems by showing them our link to Slack in the exception message.

In ploomber-core 0.1, I added some custom exceptions so we can do that. Essentially, we want to replace some errors with their customized counterparts:

ValueError -> PloomberValueError
TypeError -> PloomberTypeError

see documentation here

Fix stdout and stderr stream; IO()

Currently, printing inside a tqdm loop breaks the progress bar and garbles the output shown to the user. For now, there is a temporary fix that appends \n to avoid this, but it adds extra blank lines. We should fix the output handling and the IO() class in ipython.py.

#66 #1080 Summary

clean notebook before execution

when copying the notebook for execution, we should clean it up and remove all the outputs and cell indexes
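
A minimal sketch of what such a cleanup could look like, assuming the notebook is available as a plain dict in the nbformat v4 layout (a top-level "cells" list, with "outputs" and "execution_count" on code cells). clean_notebook is a hypothetical name, not an existing ploomber-engine function.

```python
import copy


def clean_notebook(nb):
    """Return a copy of a notebook dict with outputs and execution counts cleared."""
    cleaned = copy.deepcopy(nb)
    for cell in cleaned.get("cells", []):
        # Only code cells carry outputs and execution counts
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned
```

Working on a deep copy keeps the input notebook untouched, which matters if the same object is reused after execution.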

track_execution parameters should be optional

Calling track_execution without the parameters argument raises an error. For example:

track_execution("pycaretBinary.ipynb", quiet=True, database="pycaret.db")

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_9979/753018148.py in <cell line: 1>()
----> 1 track_execution("fit.py", quiet=True, database="pycaret.db")

~/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ploomber_core/telemetry/telemetry.py in wrapper(*args, **kwargs)
    674                         result = func(_payload, *args, **kwargs)
    675                     else:
--> 676                         result = func(*args, **kwargs)
    677                 except Exception as e:
    678 

TypeError: track_execution() missing 1 required positional argument: 'parameters'

The workaround was passing parameters={}, but then it prints some weird output: Could not find block with the # parameters comment.

We should make parameters optional and default it to empty, so users don't need this workaround.
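
The requested behavior can be sketched as below. track_execution_sketch is a hypothetical stand-in that only illustrates the defaulting logic. Its argument list is an assumption modeled on the call above and is not the real track_execution signature.

```python
def track_execution_sketch(filename, parameters=None, quiet=False, database=None):
    """Hypothetical stand-in showing an optional `parameters` argument."""
    # Use None as the default and build a fresh dict per call, so callers
    # never share (or accidentally mutate) a single default dictionary
    parameters = {} if parameters is None else dict(parameters)
    return {
        "filename": filename,
        "parameters": parameters,
        "quiet": quiet,
        "database": database,
    }
```

With this shape, the call from the report works without passing parameters={} explicitly.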

Skipping cells during execution

It would be good to have a way to skip certain cells during execution of jupyter notebooks.

My use case is that it is useful to have extra cells which might be useful for a developer while debugging or during interactive development - but shouldn't be run in regular runs.

But there are other scenarios discussed in the thread here:
nteract/papermill#429
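
One way this could work is tag-based filtering. The sketch below assumes a notebook dict in the nbformat v4 layout, where cell tags live under metadata.tags. The function name and the "skip-execution" tag are hypothetical choices for illustration, not an existing ploomber-engine feature.

```python
def select_cells_to_run(nb, skip_tag="skip-execution"):
    """Return the code cells that are not marked with `skip_tag`."""
    selected = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        tags = cell.get("metadata", {}).get("tags", [])
        if skip_tag in tags:
            # Developer-only cell: leave it out of regular runs
            continue
        selected.append(cell)
    return selected
```

Tags can be edited from the Jupyter UI, so a developer could mark debugging cells without changing their source.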

add current path to sys.path while executing

add jupyterbook-like intro page

same as ploomber/jupysql#134 but for ploomber-engine

add jupytercon video

we should add the jupytercon 2023 video to the docs in the quickstart guide: https://engine.ploomber.io/en/latest/quick-start.html#

more info: ploomber/ploomber#1151

user's table is detached from the db name and is static

I'm using a db called tester.db, but the table is called experiments, I also can't change it.

At a minimum, we should document this.
We should also allow the user to configure its name/assign the same name as the db (in case it's a one table use case).

add missing tests

We're testing the overall functionality but we need to add tests for specific methods. We can use the unit tests in nbclient and papermill as guides.

Doc: Doc link is broken in README.md

ploomber-engine cli isn't well documented

The only place in the docs that mentions running notebooks from the CLI is the API spec, and even there it is shown via Python.

We should add the commands descriptions into docs, including scheduling notebooks via terminal:
ploomber-engine to_run.ipynb output.ipynb

(Basically document everything that ploomber-engine --help shows)

create documentation

we need to create documentation for this project using jupyter-book. for an example with full setup, see this. once the skeleton is done, I can configure it, so it deploys using readthedocs.org

create skeleton using jupyter-book
add readthedocs configuration
deploy to readthedocs (I can do this)
add notebook with debuglater tutorial
add debug now tutorial
add notebook profiling tutorial

Note: the tutorials will be a series of commands (not Python code). we can use the %%bash magic to execute bash inside a Jupyter notebook; however, we started experimenting with the bash kernel; let's see if that works here

raise an exception when finishing debugging session

we are currently exiting the process in the client:

ploomber-engine/src/ploomber_engine/client.py

Line 325 in 06a7a3d

sys.exit(1)

but it'd be better to raise a specific exception so clients can decide what to do
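
A rough sketch of the idea, with hypothetical names throughout (DebuggingFinished and the two functions are illustrations, not code from client.py):

```python
class DebuggingFinished(Exception):
    """Hypothetical exception signalling that the debugging session ended."""


def finish_debugging_session():
    # Instead of calling sys.exit(1), raise so the caller decides what to do
    raise DebuggingFinished("debugging session finished")


def run_and_report():
    """Example caller that handles the exception instead of exiting."""
    try:
        finish_debugging_session()
    except DebuggingFinished as exc:
        return f"handled: {exc}"
```

A CLI entry point could still catch the exception and exit with a non-zero status, while library users could keep their process alive.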

Change the url for the docs

We need to switch the docs to serve from our domain:
https://ploomber-engine.readthedocs.io/en/latest/quick-start.html to https://ploomber-engine.ploomber.io/en/latest/quick-start.html

documentation refactoring

Some notes on the ongoing documentation refactoring:

For some tutorials, we need to download sample notebooks, which are then executed as part of the tutorial. If we edit the notebooks with JupyterLab and run the cells that download the sample notebooks, those sample notebooks are downloaded into the doc/ directory. The next time we build the docs, jupyter-book will attempt to execute them as part of the doc build. We should find a way to prevent this.

Modify --profile-runtime and --profile-memory plot file location

Similar to #78, it would be nice if users can also specify a different location for the plots (or even just an option to disable the plots, while still saving the profiling data).

This can be helpful to keep all plots/profiling data in separate locations from the notebook-outdir (to prevent cluttering the notebook outdir if there are lots of parameterized notebooks present)

Current engine:

toi_0-memory-usage.png
toi_0-runtime.png
toi_0-profiling-data.csv
toi_0.ipynb
toi_1-memory-usage.png
toi_1-runtime.png
toi_1-profiling-data.csv
toi_1.ipynb
toi_2-memory-usage.png
toi_2-runtime.png
toi_2-profiling-data.csv
toi_2.ipynb
...

If the plots (and profiling data) can be saved elsewhere:

toi_0.ipynb
toi_1.ipynb
toi_2.ipynb
...

I think it would be easy to do either of the above, and I would be happy to implement this.

@edublancas, which of the two do you think would be better?

Add a --disable-profiling-plots arg
Modify --profile-runtime and --profile-memory to accept a path (similar to #78)

Tracking source code of an experiment

We should have an easy mechanism to help users track their own source code, either via static analysis or another mechanism.
It can go to its own column in the tracker - source_code

Inspection and type of parameters

Hello Eduardo,

Thanks for the ploomber engine. I integrated it in my projects and it works well.

Is there a way to inspect a notebook to get the parameters needed by the notebook and also their types?
If that is possible, it would be nicer to get the schema definition if they are richer types like dataclasses or pydantic models.

Cheers,
Vijay
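
For simple cases, one possible approach is static inspection of the parameters cell with the standard ast module. The sketch below is an assumption about how this could be done, not an existing ploomber-engine API. It only handles plain name = literal assignments and skips anything else, including annotated assignments and richer types such as dataclasses or pydantic models.

```python
import ast


def inspect_parameters(cell_source):
    """Map each simple `name = literal` assignment to (value, type name)."""
    found = {}
    for node in ast.parse(cell_source).body:
        if not isinstance(node, ast.Assign) or len(node.targets) != 1:
            continue
        target = node.targets[0]
        if not isinstance(target, ast.Name):
            continue
        try:
            value = ast.literal_eval(node.value)
        except (ValueError, TypeError):
            # Not a literal (for example, a function call); skip it
            continue
        found[target.id] = (value, type(value).__name__)
    return found
```

For example, a cell containing alpha = 0.1 and names = ["a", "b"] would be reported as a float and a list, while model = build() would be ignored.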

add cwd to execute_notebook (kwarg exists in papermill)

cwd allows the user to set the path to execute a notebook. This feature is present as a kwarg in papermill:
https://github.com/nteract/papermill/blob/54f6c038cdae0c70d5fb04691fa465e12aeb62cb/papermill/execute.py#L29

Question: Re-using interactive shell for multiple notebook executions

We have a scenario where we are running the same notebook thousands of times. And we are seeing memory increasing significantly as we progress. Initial investigation looks like incremental memory is primarily because we import pandas and numpy in the notebooks. We were thinking if we could import pandas and numpy in the client._shell, then re-use that shell, we might be able to manage our memory. I am looking into this now, but wondered if it is something you had already explored or even already support. Thank you.

missing features

there are a few things we're missing to cover the most basic features that papermill offers:

parametrize notebooks
progress bar in PloomberClient
store partially executed notebook if it fails

And to offer all our functionality independently of papermill

debug later in PloomberClient
debuglater argument in execute_notebook

the future of ploomber-engine

This project is growing in complexity and contains some unrelated modules that might make sense to move to a separate repository to reduce complexity. I'm unsure what's the best path forward, but here are my thoughts.

what do we want this package to be

moving forward, I think ploomber-engine should become a toolbox for executing notebooks (a papermill replacement). I imagine ploomber-engine providing a Python and CLI for running notebooks.

Example:

ploomber-engine input.ipynb output.ipynb

With the option to enable features such as debug later:

ploomber-engine input.ipynb output.ipynb --debuglater

or produce profiling plots:

ploomber-engine input.ipynb output.ipynb --profile

or track stuff:

ploomber-engine input.ipynb output.ipynb --track

or combine multiple things:

ploomber-engine input.ipynb output.ipynb --track --profile

Note: My only reservation with ploomber-engine becoming a notebook toolbox is that since it'll host the experiment tracker feature (at least right now), at some point it might not be the right place if we implement advanced experiment tracking features. (see last section of this comment)

things we have here

Currently, we have a few things here:

papermill integration

When installing ploomber-engine, papermill users get a few new custom engines (debug, debuglater and profiling), and they can use them with:

papermill in.ipynb out.ipynb --engine {name}

I'm unsure to what extent papermill users are using ploomber-engine for this purpose; however, ploomber is indeed using it and the documentation mentions how to switch engines by passing a new engine name (this is because ploomber still uses papermill as its execution engine)

I think we should split the core functionality from the integration with papermill. So maybe leave ploomber-engine as the package that provides papermill engines, and create a new package with a new name that contains the core functionality for running notebooks. Then have ploomber-engine be a dependency of that new package and keep ploomber depending on ploomber-engine.

To keep backwards support we can keep our tracker here for some time and show a deprecation warning so users move to the other one.

notebook executor

we have a custom notebook executor (i.e., a papermill replacement) that runs notebooks in the same process. this executor is what allowed us to provide the debugging, profiling and experiment tracking features.

experiment tracker

we have a command-line interface to track ML experiments, which is what we launched in this blog post. This tracker is an extra layer on top of sklearn-evaluation's SQLiteTracker - it runs the notebooks and logs stuff to a SQLite database without the user having to write custom code

The experiment tracker has received a great response from the community, so we'll likely keep investing in it. This raises the question of where such advanced tracking features will live: the SQLiteTracker lives in sklearn-evaluation and the notebook tracker lives here, but if we build a UI for tracking or more advanced features, it's unclear where those should go.

Unable to link to .rst files

The previously established links to .rst files in .md files no longer work with the latest update of jupyter-book (0.14). They have been temporarily removed from the respective .md files, but a permanent solution is needed.

#64

missing track_execution from the api docs

Found this guide here: Issue on page /user-guide/tracking.html

This guide has some functions that are missing from the API guide (track_execution is only mentioned there - I searched for it).

memory leak with "PloomberClient"

Hello,
I am a new user of the framework. I encountered the following problem when using ploomber-engine:

when calling a notebook using ploomber-engine, with the methods 'get_namespace' or 'execute' of PloomberClient , the memory reserved by the kernel used by the notebook is not released.

For example, if we consider the following notebook with one cell, named "temp_test.py", saved as a python file with jupytext.

import torch
with torch.no_grad():   
    tensor = torch.ones((10000,10000))

if we execute it using this script:

import psutil
def print_memory():
    """method to get free and used memory"""
    print(psutil._common.bytes2human(psutil.virtual_memory().free),
      psutil._common.bytes2human(psutil.virtual_memory().used))

from pathlib import Path
from ploomber_engine.ipython import PloomberClient
import jupytext
path_notebook_as_py = "./temp_test.py"
path_notebook = path_notebook_as_py.replace(".py",".ipynb")

jupytext.write(nb=jupytext.read(path_notebook_as_py),
                       fp =path_notebook)

for _ in range(3):
    print_memory()
    
    client = PloomberClient.from_path(path_notebook)

    namespace = client.get_namespace()
    del client
    del namespace
print_memory()

Deleting the client doesn't free the memory reserved by the notebook.

In fact, I get the following output, which indicates that the memory is not entirely freed:

20.6G 2.2G

Executing cell: 1: 100%|██████████████████████████| 1/1 [00:01<00:00,  1.23s/it]

20.1G 2.7G

Executing cell: 1: 100%|██████████████████████████| 1/1 [00:00<00:00, 17.93it/s]

19.7G 3.1G

Executing cell: 1: 100%|██████████████████████████| 1/1 [00:00<00:00, 18.98it/s]

19.4G 3.5G

Is there a clean way to free the memory by the executed notebook?

I also tried this variant :

for _ in range(3):
    print_memory()
    
    client = PloomberClient.from_path(path_notebook)

    nb = client.execute()

print_memory()

but I got the same problem.

Also am I using the api correctly ?
Thank you for your reply.

Tracking to/from S3 bucket

We should enable users to store their tracking data in, and load it from, S3. This will allow more flexibility, especially with production tables (millions of rows, 100+ columns).
example

integration with Airflow

hey @potiuk, moving the conversation from papermill to this repo so my team is aware of this. I'll start the discussion in the Airflow devlist shortly!

from @potiuk (source):

Ah. Nice!

Just thinking if maybe we should just make a copy of the existing papermill provider and do a "ploomber" one. I do see that papermill is pretty inactive, so this would be a good idea.

Can you please start a discussion about this ("replace the papermill provider with ploomber-engine") on the Airflow devlist (see https://lists.apache.org/[email protected])? https://airflow.apache.org/community/ also has more information on joining the list. We now have a pretty formal way of accepting new providers, especially if they are connected to existing services. So we would need to understand the relation between ploomber.io and ploomber-engine, whether you can use the engine with it or only without it, and how "open" the open-source engine is (BSD licensed, I see, which is cool).

I think we need a bit more context (maybe links to some blogs and explanation why you decided to develop it) and I think we can take it from there.

notebook execution fails if there's anything in stdout

ploomber-core 0.2.17 introduced a change to print a Ploomber Cloud link at a regular interval. However, this breaks notebook execution.

We realized this because one of our automated notebooks started failing:


  0%|          | 0/90 [00:00<?, ?it/s]
Executing cell: 1:   0%|          | 0/90 [00:00<?, ?it/s]
Executing cell: 1:   2%|▏         | 2/90 [00:00<00:16,  5.40it/s]
Executing cell: 2:   2%|▏         | 2/90 [00:00<00:16,  5.40it/s]
Executing cell: 2:   2%|▏         | 2/90 [00:01<00:46,  1.89it/s]





Deploy AI and data apps for free on Ploomber Cloud! Learn more: https://docs.cloud.ploomber.io/en/latest/quickstart/signup.html

An error happened while executing the notebook. Partially executed notebook stored at report.ipynb
Traceback (most recent call last):
  File "/usr/share/miniconda3/bin/ploomber-engine", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/usr/share/miniconda3/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/miniconda3/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/share/miniconda3/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/miniconda3/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/miniconda3/lib/python3.11/site-packages/ploomber_engine/cli.py", line 106, in cli
    execute_notebook(
  File "/usr/share/miniconda3/lib/python3.11/site-packages/ploomber_core/telemetry/telemetry.py", line 679, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/miniconda3/lib/python3.11/site-packages/ploomber_engine/execute.py", line 179, in execute_notebook
    out = client.execute(parameters=parameters)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/miniconda3/lib/python3.11/site-packages/ploomber_engine/ipython.py", line 498, in execute
    self._execute()
  File "/usr/share/miniconda3/lib/python3.11/site-packages/ploomber_engine/ipython.py", line 621, in _execute
    self.execute_cell(
  File "/usr/share/miniconda3/lib/python3.11/site-packages/ploomber_engine/ipython.py", line 424, in execute_cell
    output.extend(_process_stdout(stdout, result=result))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/miniconda3/lib/python3.11/site-packages/ploomber_engine/ipython.py", line 68, in _process_stdout
    output_type="stream", text="".join(out), name="stdout"
                               ^^^^^^^^^^^^
TypeError: sequence item 0: expected str instance, bytes found
Closing duckdb://

I replicated it by removing the last_cloud_check from ~/.ploomber/stats/uid.yaml and then tried to run a notebook. The fix should allow successful execution of notebooks even if there are prints to stdout
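
The traceback shows "".join(out) failing because at least one captured chunk is bytes rather than str. A defensive join could look like the sketch below. join_stream is a hypothetical helper, and the UTF-8 default is an assumption about the encoding of the captured output.

```python
def join_stream(chunks, encoding="utf-8"):
    """Join captured output chunks, decoding any bytes items first."""
    parts = []
    for chunk in chunks:
        if isinstance(chunk, bytes):
            # Replace undecodable bytes instead of failing the whole run
            chunk = chunk.decode(encoding, errors="replace")
        parts.append(chunk)
    return "".join(parts)
```

This only addresses the symptom in the join. It would still be worth checking why bytes reach the captured stdout in the first place.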

add missing html_meta to every .md/.rst document

(see ploomber/sklearn-evaluation#253 for context)

Possible to avoid translating `tuple` parameter values as strings, or else allow a user-supplied `Translator`?

Parametrizing a notebook on parameters={"list_param": [0,1], "tuple_param": (0,1)} will result in list_param value being injected as a list, but the tuple_param value being injected as the stringified representation of a tuple, e.g. "(0, 1)" (literally with the quotes). I understand that ploomber-engine strives to match the papermill API, so maybe this is intentional to maintain that compatibility, but is there a chance for the parameter translator to faithfully inject tuples?

Maybe a more sensible request would be to allow the user to pass a custom Translator to PloomberClient, which could pass the custom Translator subclass to parametrize_notebook (below), allowing the user to customize behavior, in the case that it is desired to mimic the papermill API surface by default?

Contents of `translate`

Here we see that tuple falls through to the "last resort" translator which stringifies it.

ploomber-engine/src/ploomber_engine/_translator.py

Lines 76 to 95 in e00bdc2

@classmethod
def translate(cls, val):
    """Translate each of the standard json/yaml types to appropiate objects."""
    if val is None:
        return cls.translate_none(val)
    elif isinstance(val, str):
        return cls.translate_str(val)
    # Needs to be before integer checks
    elif isinstance(val, bool):
        return cls.translate_bool(val)
    elif isinstance(val, int):
        return cls.translate_int(val)
    elif isinstance(val, float):
        return cls.translate_float(val)
    elif isinstance(val, dict):
        return cls.translate_dict(val)
    elif isinstance(val, list):
        return cls.translate_list(val)
    # Use this generic translation as a last resort
    return cls.translate_escaped_str(val)

Abbreviated contents of `parametrize_notebook`

ploomber-engine/src/ploomber_engine/_util.py

Lines 72 to 73 in e00bdc2

def parametrize_notebook(nb, parameters):
    """Add parameters to a notebook object"""

...

ploomber-engine/src/ploomber_engine/_util.py

Lines 92 to 94 in e00bdc2

    params_translated = translate_parameters(
        parameters=parameters, comment="Injected parameters"
    )
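
To illustrate the request, here is a self-contained sketch of a translator subclass that handles tuples. BaseTranslator is a simplified stand-in written for this example, not the real class in _translator.py, and TupleAwareTranslator is a hypothetical name. Whether PloomberClient could accept such a subclass is exactly the open question above.

```python
class BaseTranslator:
    """Simplified stand-in for the translator quoted above (not the real class)."""

    @classmethod
    def translate_escaped_str(cls, val):
        # Last resort: stringify the value and wrap it in quotes
        return '"{}"'.format(str(val).replace('"', '\\"'))

    @classmethod
    def translate_list(cls, val):
        return "[" + ", ".join(cls.translate(v) for v in val) + "]"

    @classmethod
    def translate(cls, val):
        if isinstance(val, (bool, int, float)):
            return repr(val)
        elif isinstance(val, list):
            return cls.translate_list(val)
        return cls.translate_escaped_str(val)


class TupleAwareTranslator(BaseTranslator):
    """Hypothetical subclass that injects tuples as tuples."""

    @classmethod
    def translate_tuple(cls, val):
        inner = ", ".join(cls.translate(v) for v in val)
        # A one-element tuple needs a trailing comma to stay a tuple
        return "(" + inner + ("," if len(val) == 1 else "") + ")"

    @classmethod
    def translate(cls, val):
        if isinstance(val, tuple):
            return cls.translate_tuple(val)
        return super().translate(val)
```

Because translate_list calls cls.translate, tuples nested inside lists are also handled by the subclass.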

engine and papermill differences

When running the same notebook with --log-output papermill shows all of the outputs and ploomber-engine doesn't.

This happens on our posthog reporting notebook.
For instance Cell 26, shows the output in papermill:

But doesn't in ploomber-engine:

Another thing I noticed is that when the notebook runs in ploomber-engine, there's a second progress bar within the cell that messes with the main bar, which might be confusing for users.

adding error message when a notebook fails

in papermill, when a notebook fails execution, a markdown cell is added at the top indicating where the notebook failed. we should add the same here
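
A minimal sketch of how this could be done, assuming the notebook is a dict in the nbformat v4 layout. The function name, the "execution-error" tag, and the message wording are hypothetical and do not reproduce papermill's exact cell.

```python
def add_error_cell(nb, failed_cell_index):
    """Insert a markdown notice at the top of a notebook dict."""
    notice = {
        "cell_type": "markdown",
        # Hypothetical tag so the notice can be found and removed on a re-run
        "metadata": {"tags": ["execution-error"]},
        "source": (
            f"**Execution failed** at code cell {failed_cell_index}. "
            "See that cell's output for the traceback."
        ),
    }
    nb["cells"].insert(0, notice)
    return nb
```

Tagging the notice would make it easy to strip it before the next execution, so repeated failures don't stack several notices.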

add support for stdin in profiling engine

document experiment tracker

in 0.0.11, we added an experiment tracker, see this post but we haven't documented it. we need to add a new section that shows how to use it. we can summarize the existing code in the blog post and then link to the post for people who want a deeper dive into it.

@RodolfoFerro add it to the tasks of the 21-25 nov week

proposal: pytest plugin

this package is pretty cool, but development isn't very active. it implements a pytest plugin so you can test notebooks with:

pytest --nbval

We could implement a basic version since we already have some notebook testing capabilities.

fix client

To enable stdin, we had to modify nbclient. However, since nbclient does not have to support the stdin channel, its code doesn't set a timeout for the channels: it assumes all messages will return a response via the shell and io_pub channels. This causes our modification to hang. The quick fix is to add a short time.sleep call to ensure that cells asking for input are correctly handled, but we should fix this properly.

we can use this as inspiration: https://github.com/jupyter/jupyter_console/blob/main/jupyter_console/ptshell.py

remove pinned jupyter_client

I pinned jupyter_client<8 in the CI config since version 8 broke papermill, we can remove this once a fix is pushed

config files

Can you remove these files from the repo? @idomic

https://github.com/ploomber/ploomber-engine/tree/main/.idea

Modify --save-profiling-data to accept File Path(Ploomber Engine CLI)

When running ploomber-engine through the CLI with the --save-profiling-data argument, the profiling data is stored by default in output-profiling-data.csv. The argument should also accept a custom path for the CSV, such as ./logs/output.csv.
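
The path handling could be sketched as follows. save_profiling_data and the column names are hypothetical, since the actual CSV schema is not shown here. Only the default file name comes from the description above.

```python
import csv
from pathlib import Path


def save_profiling_data(rows, path="output-profiling-data.csv"):
    """Write profiling rows to `path`, creating parent directories if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="") as f:
        # Column names are placeholders for illustration
        writer = csv.DictWriter(f, fieldnames=["cell", "runtime", "memory"])
        writer.writeheader()
        writer.writerows(rows)
    return target
```

Creating the parent directory up front means a path such as ./logs/output.csv works even when logs/ does not exist yet.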

Adding windows to CI

When adding Windows to the CI, it fails. We have to fix the tests. Often it's just incompatibilities in the tests, but there might be cases where there's an actual incompatibility.

https://github.com/ploomber/ploomber-engine/actions/runs/3791627265

add support for updating cells executed count in profiling engine

Tracking requirements of the env

We should have an easy mechanism to help users track their environment, probably via pip freeze.
It can go to its own column in the tracker
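
As an alternative to shelling out to pip freeze, the installed packages can be listed with the standard library. This is only a sketch of one option: installed_requirements is a hypothetical helper, and its output resembles pip freeze but is not guaranteed to match it (for example, it does not handle editable or VCS installs specially).

```python
from importlib import metadata


def installed_requirements():
    """Return sorted 'name==version' strings for installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip broken installs that have no name in their metadata
        if name:
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)
```

The resulting list could be joined with newlines and stored as a single text value in the tracker column.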

incompatibility with matplotlib 3.7

Our tests started failing with matplotlib 3.7, I tracked down the error locally:

# ensure you're running matplotlib 3.7
pip install matplotlib -U

# run tests
pytest tests --ignore=tests/test_engine.py

The error happens in the test_no_outputs test. Upon execution, this is what the notebook looks like:

there's a clear error here since the first cell x = 1 should not produce any outputs; yet, it displays a plot. Seems like with the new matplotlib release, we need to do some "matplotlib state cleaning" when destroying the shell object, so when creating a new one, we don't share state from previous notebook runs. My guess is that we're missing something here

It's still unclear what changed in matplotlib 3.7 so more investigation is needed. for the time being, I pinned matplotlib so our tests pass.

Telemetry for errors

We currently don't track errors on the package.

add save_profiling_data option to execute_notebook

It would be cool if the profiling data to generate the cell-memory-usage and cell-runtime plots were saved in the same directory as the plots.

I'd be happy to work on this :)

switch to conf.py to fix binder links from PRs

In sklearn-evaluation, we switched to using conf.py instead of _config.yml, since the former allows us to embed the logic necessary to customize the binder links when the docs are generated from a PR. We should apply that here. We must also document the process in the contributing repo.
