project-codeflare / codeflare
Simplifying the definition and execution, scaling and deployment of pipelines on the cloud.
Home Page: https://codeflare.dev
License: Apache License 2.0
Describe the bug
Sample pipeline Jupyter notebook errors out due to an undefined variable.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Jupyter notebook on binder should run without exception
Additional context
error while executing cell:
pipeline_output = rt.execute_pipeline(pipeline, ExecutionType.FIT, pipeline_input)
node_0_output = pipeline_output.get_xyrefs(node_0)
In [74]:
outputs[0]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-74-a45df8d4a457> in <module>
----> 1 outputs[0]
NameError: name 'outputs' is not defined
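The failing cell references outputs, but the preceding cell binds node_0_output; a likely fix, assuming that was the intended variable:

```python
node_0_output = pipeline_output.get_xyrefs(node_0)
node_0_output[0]  # instead of the undefined `outputs[0]`
```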
As a CF dev, the current code has become quite complex to manage in two files. This is not good coding practice and needs major refactoring to accommodate ongoing development.
For uniformity of input and output across pipeline stages, we need a list of future references to Xy objects. Ray remote calls always return a future reference, and we return a holder class with references to X and y, which are compute tasks that can be in flight. A list of references allows for the parallelization of operations. Uniformity of data exchange is thus enabled by choosing a list of future references as the way to exchange data.
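A minimal sketch of such holder classes, assuming names consistent with the accessors (get_Xref(), get_yref()) that appear in issues further down this page; the exact shapes are illustrative, not the committed API:

```python
import ray

class Xy:
    """Holder for materialized X and y."""
    def __init__(self, X, y):
        self.X = X
        self.y = y

class XYRef:
    """Holder for Ray futures (ObjectRefs) to X and y that may still be in flight."""
    def __init__(self, Xref, yref):
        self.__Xref = Xref
        self.__yref = yref

    def get_Xref(self):
        return self.__Xref

    def get_yref(self):
        return self.__yref

# Stages exchange a *list* of XYRef: downstream nodes fan out over the list,
# and Ray schedules the underlying tasks in parallel without materializing data.
```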
AND node semantics computes a full cross product. In grid search CV, an AND node like FeatureUnion will require features to be joined on a given input object. For example, consider performing two-fold cross validation on the following pipeline: (PCA (n_components = 5, 10) || Nystroem || SelectKBest) && FeatureUnion. On two-fold CV, we get four objects from the PCA node (2x2) and two objects each from Nystroem and SelectKBest. A regular AND node will compute a 4x2x2 cross product. A lineage AND will compute 4 cross products: (pca_5, Nystroem, SelectKBest) on the two input objects and (pca_10, Nystroem, SelectKBest) on the same two input objects.
Lineage AND solution: select items in the AND node cross product that share the same input object lineage.
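A minimal sketch of the lineage-AND selection, assuming each upstream output is tagged with the id of the input object (here, the CV fold) it descends from; the tagging mechanism itself is an assumption:

```python
from itertools import product

def lineage_and(outputs_per_node):
    """Keep only cross-product combinations whose members share one input lineage."""
    selected = []
    for combo in product(*outputs_per_node):
        lineages = {lineage for lineage, _ in combo}
        if len(lineages) == 1:  # all members descend from the same input object
            selected.append(tuple(obj for _, obj in combo))
    return selected

# Two-fold CV example from above: PCA yields 4 objects (2 params x 2 folds),
# Nystroem and SelectKBest yield 2 each. A regular AND computes 4x2x2 = 16 combos;
# lineage AND keeps only the 4 whose members come from the same fold.
pca = [(0, "pca5_f0"), (1, "pca5_f1"), (0, "pca10_f0"), (1, "pca10_f1")]
nystroem = [(0, "nys_f0"), (1, "nys_f1")]
kbest = [(0, "kb_f0"), (1, "kb_f1")]
print(lineage_and([pca, nystroem, kbest]))
```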
As a CF pipelines developer, using node as a key, as opposed to node_name, causes a lot of overhead. Switching to node_name is an intrusive change, but it will help keep all the core data structures clean!
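A minimal sketch of keying by node_name instead of by node object, assuming a hypothetical get_node_name() accessor; pre_graph mirrors the structure seen in Datamodel.get_pre_edges further down this page:

```python
# Core structures keyed by node_name (str) rather than by the node object itself.
pre_graph = {}   # node_name -> list of upstream node_names
node_map = {}    # node_name -> node object, for lookups when needed

def add_edge(from_node, to_node):
    # get_node_name() is a hypothetical accessor for the node's string name
    f, t = from_node.get_node_name(), to_node.get_node_name()
    node_map[f] = from_node
    node_map[t] = to_node
    pre_graph.setdefault(t, []).append(f)
```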
Describe the bug
Installed RHODS, created a data science project, and deployed a Jupyter notebook with the CodeFlare image.
All working fine. Created a cluster; it deployed Ray head and worker cluster nodes, and all seem to be running fine.
It exposes a route for the Ray dashboard, which is not accessible at all.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The route should take me to the Ray dashboard.
Desktop (please complete the following information):
OS: Mac
Browser: both Chrome and Safari
Version: macOS Monterey 12.6.1
Additional context
Note: I have changed my browser's security settings for the route address to allow insecure content, but it still doesn't work for me.
As a pipelines user, I would like to pick a specific pipeline, store it, and reload it for scoring purposes.
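A minimal sketch of one possible flow, reusing rt.select_pipeline and get_xyrefs (both seen in issues below) plus plain pickle; pipeline_output and node_of_interest stand in for a fitted pipeline output and a chosen node, and picklability of the selected pipeline is an assumption:

```python
import pickle
import codeflare.pipelines.Runtime as rt

# Pick the fitted pipeline of interest out of a pipeline / grid-search output...
chosen = rt.select_pipeline(pipeline_output,
                            pipeline_output.get_xyrefs(node_of_interest)[0])

# ...store it...
with open("chosen_pipeline.pkl", "wb") as fh:
    pickle.dump(chosen, fh)  # assumes the selected pipeline object is picklable

# ...and reload it later for scoring.
with open("chosen_pipeline.pkl", "rb") as fh:
    reloaded = pickle.load(fh)
```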
Describe the bug
Error using getting started docs.
Regarding: https://codeflare.readthedocs.io/en/latest/getting_started/starting.html#codeflare-on-openshift-container-platform-ocp
Section:
Alternatively, you can also build locally with:
git clone https://github.com/project-codeflare/codeflare.git
pip3 install --upgrade pip
pip3 install .
pip3 install -r requirements.txt
SEE PROBLEM BELOW:
$ pip3 install --upgrade pip
Requirement already satisfied: pip in /usr/local/lib/python3.8/site-packages (20.3.3)
Collecting pip
Downloading pip-21.1.2-py3-none-any.whl (1.5 MB)
|████████████████████████████████| 1.5 MB 4.1 MB/s
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.3.3
Uninstalling pip-20.3.3:
Successfully uninstalled pip-20.3.3
Successfully installed pip-21.1.2
$
$ pip3 install .
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/pip/_vendor/pkg_resources/init.py", line 583, in _build_master
ws.require(requires)
File "/usr/local/lib/python3.8/site-packages/pip/_vendor/pkg_resources/init.py", line 900, in require
needed = self.resolve(parse_requirements(requirements))
File "/usr/local/lib/python3.8/site-packages/pip/_vendor/pkg_resources/init.py", line 791, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pip._vendor.pkg_resources.VersionConflict: (pip 21.1.2 (/usr/local/lib/python3.8/site-packages), Requirement.parse('pip==20.3.3'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/opt/[email protected]/bin/pip3", line 33, in
sys.exit(load_entry_point('pip==20.3.3', 'console_scripts', 'pip3')())
File "/usr/local/opt/[email protected]/bin/pip3", line 25, in importlib_load_entry_point
return next(matches).load()
File "/usr/local/Cellar/[email protected]/3.8.7/Frameworks/Python.framework/Versions/3.8/lib/python3.8/importlib/metadata.py", line 77, in load
module = import_module(match.group('module'))
File "/usr/local/Cellar/[email protected]/3.8.7/Frameworks/Python.framework/Versions/3.8/lib/python3.8/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 975, in _find_and_load_unlocked
File "", line 671, in _load_unlocked
File "", line 783, in exec_module
File "", line 219, in _call_with_frames_removed
File "/usr/local/lib/python3.8/site-packages/pip/_internal/cli/main.py", line 9, in
from pip._internal.cli.autocompletion import autocomplete
File "/usr/local/lib/python3.8/site-packages/pip/_internal/cli/autocompletion.py", line 10, in
from pip._internal.cli.main_parser import create_main_parser
File "/usr/local/lib/python3.8/site-packages/pip/_internal/cli/main_parser.py", line 8, in
from pip._internal.cli import cmdoptions
File "/usr/local/lib/python3.8/site-packages/pip/_internal/cli/cmdoptions.py", line 23, in
from pip._internal.cli.parser import ConfigOptionParser
File "/usr/local/lib/python3.8/site-packages/pip/_internal/cli/parser.py", line 12, in
from pip._internal.configuration import Configuration, ConfigurationError
File "/usr/local/lib/python3.8/site-packages/pip/_internal/configuration.py", line 21, in
from pip._internal.exceptions import (
File "/usr/local/lib/python3.8/site-packages/pip/_internal/exceptions.py", line 7, in
from pip._vendor.pkg_resources import Distribution
File "/usr/local/lib/python3.8/site-packages/pip/_vendor/pkg_resources/init.py", line 3252, in
def _initialize_master_working_set():
File "/usr/local/lib/python3.8/site-packages/pip/_vendor/pkg_resources/init.py", line 3235, in _call_aside
f(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pip/_vendor/pkg_resources/init.py", line 3264, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/usr/local/lib/python3.8/site-packages/pip/_vendor/pkg_resources/init.py", line 585, in _build_master
return cls._build_from_requirements(requires)
File "/usr/local/lib/python3.8/site-packages/pip/_vendor/pkg_resources/init.py", line 598, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/usr/local/lib/python3.8/site-packages/pip/_vendor/pkg_resources/init.py", line 786, in resolve
raise DistributionNotFound(req, requirers)
pip._vendor.pkg_resources.DistributionNotFound: The 'pip==20.3.3' distribution was not found and is required by the application
$
To Reproduce
Steps to reproduce the behavior:
Expected behavior
pip3 install . from the instructions runs without failure.
Describe the bug
The 'sklearn' PyPI package is deprecated, use 'scikit-learn' rather than 'sklearn' for pip commands.
To Reproduce
Steps to reproduce the behavior:
Collecting sklearn>=0.0 (from codeflare==0.1.2.dev0)
Using cached sklearn-0.0.post9.tar.gz (3.6 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
rather than 'sklearn' for pip commands.
Here is how to fix this error in the main use cases:
- use 'pip install scikit-learn' rather than 'pip install sklearn'
- replace 'sklearn' by 'scikit-learn' in your pip requirements files
(requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
- if the 'sklearn' package is used by one of your dependencies,
it would be great if you take some time to track which package uses
'sklearn' instead of 'scikit-learn' and report it to their issue tracker
- as a last resort, set the environment variable
SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error
More information is available at
https://github.com/scikit-learn/sklearn-pypi-package
If the previous advice does not cover your use case, feel free to report it at
https://github.com/scikit-learn/sklearn-pypi-package/issues/new
Expected behavior
pip install shouldn't fail; only scikit-learn should be used in requirements.txt and setup.py.
For more information see: https://github.com/scikit-learn/sklearn-pypi-package
Graphviz missing from Binder service
To Reproduce
Steps to reproduce the behavior:
Additional context
The error below is caused by execution of this cell:
non_param_graph = cf_utils.pipeline_to_graph(pipeline)
non_param_graph
ExecutableNotFound: failed to execute ['dot', '-Kdot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH
Describe the bug
Cannot bring up Ray cluster as defined in the OCP tutorial
To Reproduce
Steps to reproduce the behavior:
pip3 install --upgrade codeflare
oc create namespace codeflare
ray up ray/python/ray/autoscaler/kubernetes/example-full.yaml
fails:
$ ray up ray/python/ray/autoscaler/kubernetes/example-full.yaml
Provided cluster configuration file (ray/python/ray/autoscaler/kubernetes/example-full.yaml) does not exist
Expected behavior
Bring up Ray cluster on OCP
Desktop (please complete the following information):
Additional context
OCP Cluster running on IBM Cloud.
$ oc cluster-info
Kubernetes master is running at https://c100-e.jp-tok.containers.cloud.ibm.com:31129
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
CodeFlare commit hash commit a2b290a115b0cc1317270cef6059d5281215842e
The current implementation does not accept user-specified scoring metric(s). For example:
cross_val_score(pipeline, X_test, y_test, scoring="neg_mean_squared_error", cv=10)
The sklearn model evaluation metrics are listed here:
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
One common use case, documented below, is supported by CodeFlare Pipelines:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html#sphx-glr-auto-examples-model-selection-plot-underfitting-overfitting-py
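For reference, the requested call as plain sklearn, runnable end to end (the pipeline here is an ordinary sklearn Pipeline; the CF runtime would need an equivalent hook to pass scoring through):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), Ridge())

# Any sklearn scoring string should be accepted and forwarded by the runtime.
scores = cross_val_score(pipeline, X, y, scoring="neg_mean_squared_error", cv=10)
print(scores.mean())
```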
As a CF pipelines user, I would like to understand the memory consumption when pipelines are executed. Given that pipelines accept nparrays, will Ray's zero-copy sharing help?
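For intuition, a small sketch of Ray's zero-copy behavior for numpy arrays: ray.get returns read-only views backed by the shared object store, so readers on the same node do not duplicate the data.

```python
import numpy as np
import ray

ray.init()

arr = np.zeros((1000, 1000))
ref = ray.put(arr)           # serialized once into the plasma object store
view = ray.get(ref)          # numpy arrays come back as zero-copy views
print(view.flags.writeable)  # False: the memory is shared, not copied

# Workers on the same node that ray.get(ref) share this memory rather than
# duplicating it, which bounds memory consumption for large nparrays.
ray.shutdown()
```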
As a user of CodeFlare pipeline, how do I know the lineage of an object -- what objects and what nodes generated this particular object?
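A sketch of answering that question with the hooks that appear in Runtime.select_pipeline later on this page (get_curr_node_state_ref(), get_prev_xyrefs()); treating them as a public lineage API is an assumption:

```python
import ray

def print_lineage(xyref, depth=0):
    # The node whose execution produced this object
    node = ray.get(xyref.get_curr_node_state_ref())
    print("  " * depth + str(node))
    # Recurse into the objects that fed that node
    for prev in xyref.get_prev_xyrefs():
        print_lineage(prev, depth + 1)
```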
As a CF pipelines user, I would like the ability to select the best or k-best pipelines from a parameter grid search output.
As a Codeflare user, I want to use Ray and Spark alternately to execute my end-to-end ML jobs. Some steps might be executed more efficiently using Ray, while others might run more efficiently using Spark. The plasma store in Ray seems to provide an efficient way to share ObjectRef between Ray and Spark. Currently, the RayDP project supports going from Spark to Ray in a limited way, by running Spark as a Ray actor. However, ObjectRef cannot be shared easily in both directions, Spark-to-Ray and Ray-to-Spark.
A Pandas dataframe created by remote tasks in local Ray plasma stores can be passed via ObjectRef to the Spark driver, to create a Spark dataframe containing a list of ObjectRefs. A .groupby() could then follow partition semantics and write these partitions to the plasma store, instead of using hashPartition(), allowing ObjectRefs to be shared between Ray workers and Spark executors.
[Reference] I have opened an issue on the RayDP repo: oap-project/raydp#164
As a CF pipelines user, I would like support for nested pipelines, where a node of a pipeline can itself be a pipeline.
As a CF pipelines user, I would like to see an ADR capturing the design of grid search CV.
Describe the bug
After running a PREDICT, y_pred cannot be obtained via get_yref(); instead it can be obtained via get_Xref(). Semantically, this seems weird.
To Reproduce
Steps to reproduce the behavior:
y_pred = ray.get(predict_clf_output[0].get_Xref())
The output would match the original sklearn pipeline at the top.
Expected behavior
The predicted output should be obtained from calling get_yref().
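That is, a user would expect the following to work:

```python
y_pred = ray.get(predict_clf_output[0].get_yref())
```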
Describe the bug
Possibly a corner case?
ray-pipeline/codeflare/pipelines/Datamodel.py in get_pre_edges(self, node)
640 """
641 pre_edges = []
--> 642 pre_nodes = self.pre_graph[node]
643 # Empty pre
644 if not pre_nodes:
KeyError: <codeflare.pipelines.Datamodel.EstimatorNode object at 0x7fa2d8920f10>
To Reproduce
## imports and a toy dataset (any X, y reproduces this)
import codeflare.pipelines.Datamodel as dm
import codeflare.pipelines.Runtime as rt
from codeflare.pipelines.Runtime import ExecutionType
from sklearn.datasets import load_iris
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler, RobustScaler

X, y = load_iris(return_X_y=True)

## initialize codeflare pipeline by first creating the nodes
pipeline = dm.Pipeline()
node_a = dm.EstimatorNode('a', MinMaxScaler())
node_b = dm.EstimatorNode('b', StandardScaler())
node_c = dm.EstimatorNode('c', MaxAbsScaler())
node_d = dm.EstimatorNode('d', RobustScaler())
node_e = dm.AndNode('e', FeatureUnion())
node_f = dm.AndNode('f', FeatureUnion())
node_g = dm.AndNode('g', FeatureUnion())
## codeflare nodes are then connected by edges
pipeline.add_edge(node_a, node_e)
pipeline.add_edge(node_b, node_e)
pipeline.add_edge(node_c, node_f)
## node_d does not have a downstream node
# pipeline.add_edge(node_d, node_f)
pipeline.add_edge(node_e, node_g)
pipeline.add_edge(node_f, node_g)
pipeline_input = dm.PipelineInput()
xy = dm.Xy(X,y)
pipeline_input.add_xy_arg(node_a, xy)
pipeline_input.add_xy_arg(node_b, xy)
pipeline_input.add_xy_arg(node_c, xy)
pipeline_input.add_xy_arg(node_d, xy)
## execute the codeflare pipeline
pipeline_output = rt.execute_pipeline(pipeline, ExecutionType.FIT, pipeline_input)
Expected behavior
The pipeline should either execute (ignoring the dangling node_d) or fail with a clear error message, rather than raising a raw KeyError.
Describe the bug
The lale library install does not install all dependencies; the options tried were:
!pip install lale[full]
!pip install 'liac-arff>=2.4.0'
and
!pip install lale
!pip install 'liac-arff>=2.4.0'
To Reproduce
Steps to reproduce the behavior:
Expected behavior
All cells in the examples should run without errors.
Additional context
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-6-44a629fbc523> in <module>
----> 1 (X_train, y_train), (X_test, y_test) = fetch("jungle_chess_2pcs_raw_endgame_complete", "classification")
~/anaconda3/lib/python3.8/site-packages/lale/datasets/openml/openml_datasets.py in fetch(dataset_name, task_type, verbose, preprocess, test_size, astype)
663 from lale.datasets.data_schemas import liac_arff_to_schema
664
--> 665 schema_orig = liac_arff_to_schema(dataDictionary)
666 target_col = experiments_dict[dataset_name]["target"]
667 y: Optional[Any] = None
~/anaconda3/lib/python3.8/site-packages/lale/datasets/data_schemas.py in liac_arff_to_schema(larff)
310
311 def liac_arff_to_schema(larff) -> JSON_TYPE:
--> 312 assert is_liac_arff(
313 larff
314 ), """Your Python environment might contain an 'arff' package different from 'liac-arff'. You can install it with
AssertionError: Your Python environment might contain an 'arff' package different from 'liac-arff'. You can install it with
pip install 'liac-arff>=2.4.0'
or with
pip install 'lale[full]'
As a CFP user, I would like to split a dataset (e.g., an np array or pandas dataframe) into smaller objects that can then be fed into other nodes/pipelines. This is especially useful when we have compute-intensive tasks and would like to parallelize them easily.
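A minimal sketch of the splitting itself; feeding each chunk into its own node (e.g., wrapping each as a dm.Xy) is the assumed follow-on step:

```python
import numpy as np

def split_xy(X, y, k):
    """Split X, y into k row-aligned chunks for parallel processing."""
    return list(zip(np.array_split(X, k), np.array_split(y, k)))

X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, 1000)
chunks = split_xy(X, y, 4)  # each chunk could become a dm.Xy fed to its own node
```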
Node construct
Describe the bug
Jupyter notebook kernel dies
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The jupyter notebook should run without error
Describe the bug
I'm trying to run the example notebooks (in codeflare/notebooks), and came across this error. The error persisted through attempts to restart my kernel, restart the entire machine, and re-clone the repo. Any help, or an explanation of the root cause, is much appreciated!
To Reproduce
Steps to reproduce the behavior:
notebooks/plot_nca_classification.ipynb
knn_pipeline = rt.select_pipeline(pipeline_fitted, pipeline_fitted.get_xyrefs(node_knn)[0])
RaySystemError: System error: buffer source array is read-only
Full stack trace:
RaySystemError: System error: buffer source array is read-only
traceback: Traceback (most recent call last):
File "/home/kastan/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/serialization.py", line 268, in deserialize_objects
obj = self._deserialize_object(data, metadata, object_ref)
File "/home/kastan/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/serialization.py", line 191, in _deserialize_object
return self._deserialize_msgpack_data(data, metadata_fields)
File "/home/kastan/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/serialization.py", line 169, in _deserialize_msgpack_data
python_objects = self._deserialize_pickle5_data(pickle5_data)
File "/home/kastan/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/serialization.py", line 157, in _deserialize_pickle5_data
obj = pickle.loads(in_band, buffers=buffers)
File "sklearn/neighbors/_dist_metrics.pyx", line 223, in sklearn.neighbors._dist_metrics.DistanceMetric.__setstate__
File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
---------------------------------------------------------------------------
RaySystemError Traceback (most recent call last)
/tmp/ipykernel_1251/3313313255.py in <module>
9 test_input.add_xy_arg(node_scalar, dm.Xy(X_test, y_test))
10
---> 11 knn_pipeline = rt.select_pipeline(pipeline_fitted, pipeline_fitted.get_xyrefs(node_knn)[0])
12 knn_score = ray.get(rt.execute_pipeline(knn_pipeline, ExecutionType.SCORE, test_input)
13 .get_xyrefs(node_knn)[0].get_yref())
~/.pyenv/versions/3.8.6/lib/python3.8/site-packages/codeflare/pipelines/Runtime.py in select_pipeline(pipeline_output, chosen_xyref)
381 curr_xyref = xyref_queue.get()
382 curr_node_state_ptr = curr_xyref.get_curr_node_state_ref()
--> 383 curr_node = ray.get(curr_node_state_ptr)
384 prev_xyrefs = curr_xyref.get_prev_xyrefs()
385
~/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
87 if func.__name__ != "init" or is_client_mode_enabled_by_default:
88 return getattr(ray, func.__name__)(*args, **kwargs)
---> 89 return func(*args, **kwargs)
90
91 return wrapper
~/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/worker.py in get(object_refs, timeout)
1621 raise value.as_instanceof_cause()
1622 else:
-> 1623 raise value
1624
1625 if is_individual_id:
Expected behavior
Expected is selecting the pipeline and evaluating its score via a 'SCORE' pipeline.
Thank you for any help! I am a University of Illinois at Urbana-Champaign grad student trying to make the most of your work!
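For context, this error pattern is consistent with Ray's zero-copy deserialization handing sklearn a read-only numpy buffer during unpickling; a minimal sketch of the effect and the usual copy workaround (whether and where CF's runtime should apply the copy is an assumption):

```python
import numpy as np
import ray

ray.init()

ref = ray.put(np.arange(5))
arr = ray.get(ref)
print(arr.flags.writeable)  # False: zero-copy view into the object store

arr_writable = arr.copy()   # the usual workaround: copy before any mutation
arr_writable[0] = 42
ray.shutdown()
```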
Currently, lineage uses SimpleQueue to realize pipelines, but this is available only in Python >= 3.8. This reduces adoption; moving to Queue will give us broader Python version coverage.
Task: replace SimpleQueue with Queue.
Open question: Queue vs SimpleQueue?
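A minimal sketch of the swap; queue.Queue exposes the same put()/get()/empty() surface used in Runtime.select_pipeline above, and chosen_xyref stands in for the pipeline output being walked:

```python
from queue import Queue  # instead of: from queue import SimpleQueue

xyref_queue = Queue()
xyref_queue.put(chosen_xyref)      # seed with the chosen output, as today
while not xyref_queue.empty():
    curr_xyref = xyref_queue.get()
    # ... walk lineage exactly as Runtime.select_pipeline does ...
```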