run-house / runhouse
Like PyTorch for ML infra. Iterable, debuggable, multi-cloud, 100% reproducible across research and production.
Home Page: https://run.house
License: Apache License 2.0
From SyncLinear.com | KIT-81
Maybe use grpclib for non-dev install.
From SyncLinear.com | KIT-14
From SyncLinear.com | KIT-75
Describe the bug
Hi, recently I have constantly been hitting a BadStatusLine issue, as follows. Could it be related to a urllib library issue?
client@4c31ddeb9ade:/zip$ python test_self_hosted_llm.py
INFO | 2023-06-14 18:24:16,048 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-06-14 18:24:16,921 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-06-14 18:24:16,981 | Authentication (publickey) successful!
2023-06-14 18:24:16,982| ERROR | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-06-14 18:24:16,982 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-06-14 18:24:17,115 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-06-14 18:24:17,174 | Authentication (publickey) successful!
INFO | 2023-06-14 18:24:17,224 | Running command on rh-cluster: pkill -f "python -m runhouse.servers.http.http_server"
Warning: Identity file /home/server/.ssh/id_rsa not accessible: Permission denied.
pkill: killing pid 255251 failed: Operation not permitted
pkill: killing pid 255253 failed: Operation not permitted
INFO | 2023-06-14 18:24:17,274 | Running command on rh-cluster: screen -dm bash -c 'python -m runhouse.servers.http.http_server |& tee -a ~/.rh/cluster_server_rh-cluster.log 2>&1'
Warning: Identity file /home/server/.ssh/id_rsa not accessible: Permission denied.
INFO | 2023-06-14 18:24:20,324 | Running command on rh-cluster: ray start --head
WARNING | 2023-06-14 18:24:21,357 | /home/client/.local/lib/python3.10/site-packages/runhouse/rns/function.py:110: UserWarning: reqs and setup_cmds arguments has been deprecated. Please use env instead.
warnings.warn(
INFO | 2023-06-14 18:24:21,358 | Setting up Function on cluster.
INFO | 2023-06-14 18:24:21,495 | Installing packages on cluster rh-cluster: ['transformers', 'torch', 'Package: zip']
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 714, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 466, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 461, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
response.begin()
File "/usr/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.10/http/client.py", line 300, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: ?ÿÿ?ÿÿ ?
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 798, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.10/dist-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 714, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 466, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 461, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
response.begin()
File "/usr/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.10/http/client.py", line 300, in _read_status
raise BadStatusLine(line)
urllib3.exceptions.ProtocolError: ('Connection aborted.', BadStatusLine('\x00\x00\x18\x04\x00\x00\x00\x00\x00\x00\x04\x00?ÿÿ\x00\x05\x00?ÿÿ\x00\x06\x00\x00 \x00þ\x03\x00\x00\x00\x01\x00\x00\x04\x08\x00\x00\x00\x00\x00\x00?\x00\x00'))
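A sketch of one way to read those garbled bytes (an assumption, not a confirmed diagnosis): the payload in the BadStatusLine decodes cleanly as an HTTP/2 frame header, which would mean the HTTP/1.1 client reached a server speaking HTTP/2 (e.g. a gRPC server) on port 50052 — consistent with the failed SSH tunnel above leaving the request pointed at the wrong local service.

```python
# First 9 bytes from the BadStatusLine error, parsed as an HTTP/2 frame header.
raw = b"\x00\x00\x18\x04\x00\x00\x00\x00\x00"

length = int.from_bytes(raw[0:3], "big")          # 24-bit payload length
frame_type = raw[3]                               # 0x04 == SETTINGS
flags = raw[4]
stream_id = int.from_bytes(raw[5:9], "big") & 0x7FFFFFFF

print(length, frame_type, flags, stream_id)       # 24 4 0 0: a SETTINGS frame
```

A SETTINGS frame on stream 0 is exactly what an HTTP/2 server sends first, so checking what is actually listening on 50052 (and whether the tunnel failure above caused a port collision) seems like the first thing to rule out.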
Versions
Please run the following and paste the output below.
wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
python collect_env.py
Python Platform: Linux-5.15.0-60-lowlatency-x86_64-with-glibc2.35
Python Version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
Relevant packages:
awscli==1.27.153
boto3==1.26.153
fsspec==2023.5.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.4.2
runhouse @ file:///tmp/runhouse-0.0.6-py3-none-any.whl
skypilot==0.3.1
sshfs==2023.4.1
sshtunnel==0.4.0
typer==0.9.0
wheel==0.38.4
sh: 1: sky: not found
sh: 1: sky: not found
Describe the bug
The following fails on an M1 MacBook Pro:
conda create -n runhouse python==3.10
conda activate runhouse
pip install --no-cache "runhouse[aws]"
The error is:
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [48 lines of output]
running egg_info
writing lib3/PyYAML.egg-info/PKG-INFO
writing dependency_links to lib3/PyYAML.egg-info/dependency_links.txt
writing top-level names to lib3/PyYAML.egg-info/top_level.txt
Traceback (most recent call last):
File "/Users/abeatson/mambaforge/envs/runhouse3/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/Users/abeatson/mambaforge/envs/runhouse3/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/Users/abeatson/mambaforge/envs/runhouse3/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
self.run_setup()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 311, in run_setup
exec(code, locals())
File "<string>", line 271, in <module>
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/__init__.py", line 103, in setup
return distutils.core.setup(**attrs)
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/dist.py", line 963, in run_command
super().run_command(command)
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 321, in run
self.find_sources()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 329, in find_sources
mm.run()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 551, in run
self.add_defaults()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 589, in add_defaults
sdist.add_defaults(self)
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/sdist.py", line 112, in add_defaults
super().add_defaults()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 251, in add_defaults
self._add_defaults_ext()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 336, in _add_defaults_ext
self.filelist.extend(build_ext.get_source_files())
File "<string>", line 201, in get_source_files
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 107, in __getattr__
raise AttributeError(attr)
AttributeError: cython_sources
[end of output]
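A sketch of a commonly used workaround, assuming the failure is PyYAML 5.x building its wheel under Cython 3 (the "AttributeError: cython_sources" signature is characteristic of that incompatibility): constrain Cython to <3 for the build. With recent pip, PIP_CONSTRAINT also applies inside the isolated build environment.

```python
from pathlib import Path

# Write a pip constraint file pinning Cython below 3 so the affected
# PyYAML sdist can build its wheel. The file path is arbitrary.
constraint = Path("/tmp/pip-constraint.txt")
constraint.write_text("cython<3\n")
print(constraint.read_text().strip())
# Then re-run the install with the constraint applied:
#   PIP_CONSTRAINT=/tmp/pip-constraint.txt pip install --no-cache "runhouse[aws]"
```

Alternatively, if runhouse's pin allows it, forcing a PyYAML 6.x wheel first may sidestep the source build entirely.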
Versions
Please run the following and paste the output below.
wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Output:
python collect_env.py
Python Platform: macOS-13.4-arm64-arm-64bit
Python Version: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:27:15) [Clang 11.1.0 ]
Relevant packages:
wheel==0.42.0
sh: sky: command not found
sh: sky: command not found
At one point we had breakpoints and pdb working using RPyC (it's still in function.py at line 379). It might be worth trying to get that working again.
Another option, barring that:
When the user calls fn.pdb, start a new RPC server on the cluster on a different port, with a new screen name (e.g. fn_name_timestamp), dedicated to this function. Then start an SSH terminal into the cluster with `screen -r screen_name`.
It could also be worth exploring the pty approach Modal took.
Cc @Caroline
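A purely hypothetical sketch of the flow described above, just to make the proposal concrete; the screen-name scheme, host name, and helper function are all made up here, not the runhouse API:

```python
import shlex
import time

# Build the two commands the proposal implies: a dedicated server in a
# named screen session on the cluster, and a local interactive attach.
def pdb_attach_commands(fn_name: str, host: str) -> tuple[str, str]:
    screen_name = f"{fn_name}_{int(time.time())}"  # e.g. fn_name_timestamp
    # On the cluster: a per-function server inside a named screen session.
    start = (f"screen -dmS {shlex.quote(screen_name)} "
             f"python -m runhouse.servers.http.http_server")
    # Locally: attach a real terminal so the pdb prompt is interactive.
    attach = f"ssh -t {host} screen -r {shlex.quote(screen_name)}"
    return start, attach

start_cmd, attach_cmd = pdb_attach_commands("my_fn", "rh-cluster")
print(attach_cmd)
```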
From SyncLinear.com | KIT-73
I am trying Runhouse with a local pre-configured server, but that server needs the "ProxyCommand" option to SSH into. Is there a way the ProxyCommand can be specified in the Cluster API (like in the ssh_creds dict)?
The typical way to SSH into the server is something like this:
ssh -i <identity_file> -o ProxyCommand="ssh -W %h:%p <user>@<frontendproxyhost>" <user>@<targethost>
I do have a workaround of adding the ProxyCommand in ~/.ssh/config, but it would be nice to specify it as params in the rh.cluster API for cases where the SSH command is a bit dynamic (like in my case).
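For reference, the ~/.ssh/config workaround mentioned above might look like this (the host placeholders match the command above; the key path is an assumption):

```
Host <targethost>
    User <user>
    IdentityFile ~/.ssh/id_rsa
    ProxyCommand ssh -W %h:%p <user>@<frontendproxyhost>
```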
Hi, I consistently see my script hang when it copies a local package to the server. Is there any way, from the server side, to display which packages are actually being copied?
/work/rh/scripts/self-hosted.py
INFO | 2023-05-31 20:38:49,626 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-05-31 20:39:24,493 | Running command on rh-cluster: ray start --head
INFO | 2023-05-31 20:39:46,019 | Running command on rh-cluster: mkdir -p ~/.rh; touch /.rh/cluster_config.yaml; echo '{"name": "/rh-cluster", "resource_type": "cluster", "resource_subtype": "Cluster", "ips": ["127.0.0.1"], "ssh_creds": {"ssh_user": "ytang", "ssh_private_key": "/home/ytang/.ssh/id_rsa"}}' > ~/.rh/cluster_config.yaml
Warning: Identity file /home/ytang/.ssh/id_rsa not accessible: No such file or directory.
INFO | 2023-05-31 20:39:50,904 | Setting up Function on cluster.
INFO | 2023-05-31 20:39:51,059 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-31 20:39:51,127 | Authentication (publickey) successful!
2023-05-31 20:39:51,128| ERROR | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-05-31 20:39:51,128 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-05-31 20:39:51,288 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-31 20:39:51,418 | Authentication (publickey) successful!
INFO | 2023-05-31 20:39:51,674 | Copying local package work to cluster
root@35c45fe5c801:/work/rh# cd /work/rh ; /usr/bin/env /usr/bin/python3 /root/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 54577 -- /work/rh/scripts/self-hosted.py
INFO | 2023-05-31 20:46:22,533 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-05-31 20:46:27,993 | Running command on rh-cluster: ray start --head
INFO | 2023-05-31 20:46:28,686 | Running command on rh-cluster: mkdir -p ~/.rh; touch /.rh/cluster_config.yaml; echo '{"name": "/rh-cluster", "resource_type": "cluster", "resource_subtype": "Cluster", "ips": ["127.0.0.1"], "ssh_creds": {"ssh_user": "ytang", "ssh_private_key": "/home/ytang/.ssh/id_rsa"}}' > ~/.rh/cluster_config.yaml
Warning: Identity file /home/ytang/.ssh/id_rsa not accessible: No such file or directory.
INFO | 2023-05-31 20:46:29,852 | Setting up Function on cluster.
INFO | 2023-05-31 20:46:29,917 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-31 20:46:30,028 | Authentication (publickey) successful!
2023-05-31 20:46:30,028| ERROR | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-05-31 20:46:30,028 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-05-31 20:46:30,081 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-31 20:46:30,157 | Authentication (publickey) successful!
INFO | 2023-05-31 20:46:30,413 | Copying local package work to cluster
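As a hypothetical server-side check for the question above (not a runhouse feature): watching the copy destination grow tells you whether the transfer is progressing or truly stalled. The destination path below is an assumption; substitute whatever directory runhouse copies the package into.

```python
import os

# Sum the size of every file under a directory tree; poll this while the
# copy is in flight to see whether bytes are still arriving.
def dir_size(path: str) -> int:
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished mid-walk; skip it
    return total

size = dir_size(".")  # e.g. dir_size(os.path.expanduser("~/work"))
print(size)
```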
Currently, the traceback from a remote error prints to the logs above the rest of the traceback, rather than to stderr (or being formatted separately from other logs).
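One possible shape for the fix, assuming runhouse uses stdlib logging (the logger name here is made up for illustration): give the remote traceback its own non-propagating logger with a dedicated stderr handler, so it is both on stderr and visually distinct from the regular log stream.

```python
import logging
import sys

# A dedicated logger for remote tracebacks, kept out of the normal stream.
tb_logger = logging.getLogger("runhouse.remote_traceback")
tb_logger.propagate = False  # don't duplicate into the root logger
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter("REMOTE TRACEBACK | %(message)s"))
tb_logger.addHandler(handler)

tb_logger.error("Traceback (most recent call last): ...")
```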
From SyncLinear.com | KIT-66
Hi team, the Runhouse docs for on-demand clusters were not super clear about the format of image_id, but helpfully my initial attempts to bring up a GCP cluster with e.g. image_id="pytorch-cpu-latest" (taken from the GCP docs) raised a clear error, e.g. ValueError: Image 'pytorch-latest-cpu' not found in GCP.
I ended up going into the skypilot repo for clarification and found a GCP example in their yaml-spec: projects/deeplearning-platform-release/global/images/family/tf2-ent-2-1-cpu-ubuntu-2004
I modified the above for the image I wanted, projects/deeplearning-platform-release/global/images/family/pytorch-1-13-cpu-v20230807-debian-11-py310, and while runhouse allowed me to submit, it hung until it timed out (and I saw no indication in the GCP Console that the instance was coming up).
I tried to run a similar command via sky launch, and saw the error, which I reported to them in this GitHub issue. I am raising it here as well in case you want to update your wrapping code to catch this error.
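To make the format concrete: the working GCP image_id is an image-family resource path, as in the skypilot yaml-spec example. The regex below is my own sketch of that shape, not skypilot's actual validator.

```python
import re

# The family-path format from the skypilot yaml-spec, with the project and
# family taken from the report above.
image_id = ("projects/deeplearning-platform-release/global/images/family/"
            "pytorch-1-13-cpu-v20230807-debian-11-py310")

# Sketch of the expected shape: projects/<project>/global/images[/family]/<name>
pattern = r"^projects/[\w-]+/global/images/(?:family/)?[\w-]+$"
assert re.fullmatch(pattern, image_id) is not None

# Short names copied from the GCP docs don't match, consistent with the
# ValueError: Image 'pytorch-latest-cpu' not found in GCP.
assert re.fullmatch(pattern, "pytorch-latest-cpu") is None
print("format check passed")
```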
Versions
Please run the following and paste the output below.
Python Platform: Linux-6.4.12-arch1-1-x86_64-with-glibc2.38
Python Version: 3.10.13 (main, Sep 4 2023, 15:52:34) [GCC 13.2.1 20230801]
Relevant packages:
boto3==1.28.40
fastapi==0.103.1
fsspec==2023.5.0
gcsfs==2023.5.0
google-api-python-client==2.97.0
google-cloud-storage==2.10.0
pyarrow==13.0.0
pycryptodome==3.12.0
rich==13.5.2
runhouse==0.0.11
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.41.2
Checking credentials to enable clouds for SkyPilot.
AWS: disabled
Reason: AWS credentials are not set. Run the following commands:
$ pip install boto3
$ aws configure
$ aws configure list # Ensure that this shows identity is set.
For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
Details: `aws sts get-caller-identity` failed with error: [botocore.exceptions.NoCredentialsError] Unable to locate credentials.
Azure: disabled
Reason: ~/.azure/msal_token_cache.json does not exist. Run the following commands:
$ az login
$ az account set -s <subscription_id>
For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
GCP: enabled
Lambda: disabled
Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
https://cloud.lambdalabs.com/api-keys
to generate API key and add the line
api_key = [YOUR API KEY]
to ~/.lambda_cloud/lambda_keys
IBM: disabled
Reason: Missing credential file at /home/user/.ibm/credentials.yaml.
Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
iam_api_key: <IAM_API_KEY>
resource_group_id: <RESOURCE_GROUP_ID>
SCP: disabled
Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
Generate API key and add the following line to ~/.scp/scp_credential:
access_key = [YOUR API ACCESS KEY]
secret_key = [YOUR API SECRET KEY]
project_id = [YOUR PROJECT ID]
OCI: disabled
Reason: `oci` is not installed. Install it with: pip install oci
For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
Cloudflare (for R2 object store): disabled
Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
$ pip install boto3
$ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
$ mkdir -p ~/.cloudflare
$ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2
SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
Managed spot jobs
No in progress jobs. (See: sky spot -h)
Describe the bug
Example code:
https://github.com/fryz/funhouse/tree/zf/fastapi/fastapi
When utilizing FastAPI's Lifespan Events (asynccontextmanager) to bring a cluster up, the cluster comes up but then hangs without returning to the server initialization logic.
Terminating the FastAPI process and bringing it back online recognizes the cluster and works.
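A sketch of one mitigation to try, assuming the hang comes from running the blocking cluster bring-up inside the async lifespan: push it to a worker thread with asyncio.to_thread so the event loop stays free during startup. bring_up_cluster below is a stand-in for the runhouse call in the linked example (e.g. cluster.up_if_not()), not the actual code.

```python
import asyncio
from contextlib import asynccontextmanager

def bring_up_cluster() -> str:
    # Placeholder for the blocking runhouse bring-up call.
    return "up"

@asynccontextmanager
async def lifespan(app=None):
    # Run the blocking bring-up off the event loop; uvicorn's startup
    # (and the reloader) can then proceed normally.
    status = await asyncio.to_thread(bring_up_cluster)
    print(f"cluster status: {status}")
    yield
    # Teardown (e.g. autostop) would go here.

async def main():
    async with lifespan():
        print("server running")

asyncio.run(main())
```

With FastAPI, the same lifespan function would be passed as FastAPI(lifespan=lifespan); whether this resolves the reported hang is untested here.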
Versions
(.venv) zach@Zachs-MacBook-Pro (/Users/zach/git/fun/funhouse/fastapi) (zf/fastapi)
[09:18:06]$ python collect_env.py
Python Platform: macOS-12.4-arm64-arm-64bit
Python Version: 3.11.7 (main, Jan 16 2024, 14:42:22) [Clang 14.0.0 (clang-1400.0.29.202)]
Relevant packages:
boto3==1.34.124
fastapi==0.111.0
fastapi-cli==0.0.4
fsspec==2023.5.0
opentelemetry-instrumentation-fastapi==0.46b0
pycryptodome==3.12.0
rich==13.7.1
runhouse==0.0.28
skypilot==0.5.0
sshfs==2023.4.1
sshtunnel==0.4.0
typer==0.12.3
uvicorn==0.30.1
wheel==0.43.0
Checking credentials to enable clouds for SkyPilot.
AWS: enabled
Hint: AWS SSO is set. To ensure multiple clouds work correctly, please use SkyPilot with static credentials (e.g., ~/.aws/credentials) by unsetting the AWS_PROFILE environment variable.
Azure: disabled
Reason: Getting user's Azure identity failed. Run the following commands:
$ az login
$ az account set -s <subscription_id>
For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
Details: sky.exceptions.CloudUserIdentityError: Failed to import 'knack'. To install the dependencies for Azure, Please install SkyPilot with: pip install skypilot[azure]
Cloudflare, for R2 object store: disabled
Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
$ pip install boto3
$ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
$ mkdir -p ~/.cloudflare
$ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2
Cudo: disabled
Reason: Cudo tools are not installed. Run the following commands:
$ pip install cudo-compute
[ModuleNotFoundError] No module named 'cudo_compute'
Fluidstack: disabled
Reason: Failed to access FluidStack Cloud with credentials. To configure credentials, go to:
https://console.fluidstack.io
to obtain an API key and API Token, then add save the contents to ~/.fluidstack/api_key and ~/.fluidstack/api_token
GCP: disabled
Reason: GCP tools are not installed. Run the following commands:
$ pip install google-api-python-client
$ conda install -c conda-forge google-cloud-sdk -y
Credentials may also need to be set. Run the following commands:
$ gcloud init
$ gcloud auth application-default login
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#google-cloud-platform-gcp
Details: [ModuleNotFoundError] No module named 'googleapiclient'
IBM: disabled
Reason: Missing credential file at /Users/zach/.ibm/credentials.yaml.
Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
iam_api_key: <IAM_API_KEY>
resource_group_id: <RESOURCE_GROUP_ID>
Kubernetes: disabled
Reason: `kubernetes` package is not installed. Install it with: pip install kubernetes
Lambda: disabled
Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
https://cloud.lambdalabs.com/api-keys
to generate API key and add the line
api_key = [YOUR API KEY]
to ~/.lambda_cloud/lambda_keys
OCI: disabled
Reason: `oci` is not installed. Install it with: pip install oci
For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
RunPod: disabled
Reason: Failed to import runpod. To install, run: pip install skypilot[runpod]
SCP: disabled
Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
Generate API key and add the following line to ~/.scp/scp_credential:
access_key = [YOUR API ACCESS KEY]
secret_key = [YOUR API SECRET KEY]
project_id = [YOUR PROJECT ID]
vSphere: disabled
Reason: vSphere dependencies are not installed. Run the following commands:
$ pip install skypilot[vSphere]
Credentials may also need to be set. For more details, see https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#vmware-vsphere
[ModuleNotFoundError] No module named 'pyVmomi'
To enable a cloud, follow the hints above and rerun: sky check
If any problems remain, refer to detailed docs at: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html
🎉 Enabled clouds 🎉
✔ AWS
Clusters
I 06-12 09:18:16 backend_utils.py:2405] Autodowned clusters: fastapi-runhouse-example, arthur-shield-gpu-cluster
No existing clusters.
Managed spot jobs
No in progress jobs. (See: sky spot -h)
Services
No existing services. (See: sky serve -h)
Additional context
Logs from the startup:
(.venv) zach@Zachs-MacBook-Pro (/Users/zach/git/fun/funhouse/fastapi) (zf/fastapi)
[08:59:13]$ fastapi dev app.py
INFO Using path app.py
INFO Resolved absolute path /Users/zach/git/fun/funhouse/fastapi/app.py
INFO Searching for package file structure from directories with __init__.py files
INFO Importing from /Users/zach/git/fun/funhouse/fastapi
╭─ Python module file ─╮
│ │
│ 🐍 app.py │
│ │
╰──────────────────────╯
INFO Importing module app
╭─ Importable FastAPI app ─╮
│ │
│ from app import app │
│ │
╰──────────────────────────╯
╭────────── FastAPI CLI - Development mode ───────────╮
│ │
│ Serving at: http://127.0.0.1:8000 │
│ │
│ API docs: http://127.0.0.1:8000/docs │
│ │
│ Running in development mode, for production use: │
│ │
│ fastapi run │
│ │
╰─────────────────────────────────────────────────────╯
INFO: Will watch for changes in these directories: ['/Users/zach/git/fun/funhouse/fastapi']
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO: Started reloader process [35530] using WatchFiles
I 06-12 08:59:25 optimizer.py:691] == Optimizer ==
I 06-12 08:59:25 optimizer.py:714] Estimated cost: $0.1 / hour
I 06-12 08:59:25 optimizer.py:714]
I 06-12 08:59:25 optimizer.py:837] Considered resources (1 node):
I 06-12 08:59:25 optimizer.py:907] ----------------------------------------------------------------------------------------
I 06-12 08:59:25 optimizer.py:907] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 06-12 08:59:25 optimizer.py:907] ----------------------------------------------------------------------------------------
I 06-12 08:59:25 optimizer.py:907] AWS m6i.large 2 8 - us-east-2 0.10 ✔
I 06-12 08:59:25 optimizer.py:907] ----------------------------------------------------------------------------------------
I 06-12 08:59:25 optimizer.py:907]
I 06-12 08:59:25 cloud_vm_ray_backend.py:4246] Creating a new cluster: 'fastapi-runhouse-example' [1x AWS(m6i.large)].
I 06-12 08:59:25 cloud_vm_ray_backend.py:4246] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
INFO | 2024-06-12 12:59:26.165931 | 3 changes detected
INFO | 2024-06-12 12:59:26.532277 | 280 changes detected
INFO | 2024-06-12 12:59:26.899229 | 122 changes detected
I 06-12 08:59:27 cloud_vm_ray_backend.py:1373] To view detailed progress: tail -n100 -f /Users/zach/sky_logs/sky-2024-06-12-08-59-25-800758/provision.log
I 06-12 08:59:28 provisioner.py:76] Launching on AWS us-east-2 (us-east-2a,us-east-2b,us-east-2c)
I 06-12 09:00:35 provisioner.py:451] Successfully provisioned or found existing instance.
I 06-12 09:02:03 provisioner.py:553] Successfully provisioned cluster: fastapi-runhouse-example
I 06-12 09:02:05 cloud_vm_ray_backend.py:3266] Run commands not specified or empty.
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
fastapi-runhouse-example a few secs ago 1x AWS(m6i.large) UP (down) /Users/zach/git/fun/funho...
arthur-shield-gpu-cluster 1 week ago 1x AWS(g3s.xlarge, {'M60': 1}) UP (down) /Users/zach/git/arthur-sh...
INFO | 2024-06-12 13:02:12.533477 | Restarting Runhouse API server on fastapi-runhouse-example.
INFO | 2024-06-12 13:02:12.540893 | Running command on fastapi-runhouse-example: python3 -m pip install runhouse==0.0.28
Collecting runhouse==0.0.28
Downloading runhouse-0.0.28-py3-none-any.whl (366 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 366.6/366.6 kB 2.8 MB/s eta 0:00:00
Collecting opentelemetry-instrumentation-requests
Downloading opentelemetry_instrumentation_requests-0.46b0-py3-none-any.whl (12 kB)
Collecting pyOpenSSL>=23.3.0
Downloading pyOpenSSL-24.1.0-py3-none-any.whl (56 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.9/56.9 kB 12.3 MB/s eta 0:00:00
Requirement already satisfied: ray[default]!=2.6.0,<=2.6.3,>=2.2.0 in /opt/conda/lib/python3.10/site-packages (from runhouse==0.0.28) (2.4.0)
Collecting pexpect
Downloading pexpect-4.9.0-py2.py3-none-any.whl (63 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.8/63.8 kB 13.8 MB/s eta 0:00:00
Collecting fastapi
Downloading fastapi-0.111.0-py3-none-any.whl (91 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.0/92.0 kB 19.3 MB/s eta 0:00:00
Collecting apispec
Downloading apispec-6.6.1-py3-none-any.whl (30 kB)
Requirement already satisfied: rich in /opt/conda/lib/python3.10/site-packages (from runhouse==0.0.28) (13.7.1)
Collecting opentelemetry-sdk
Downloading opentelemetry_sdk-1.25.0-py3-none-any.whl (107 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 107.0/107.0 kB 1.3 MB/s eta 0:00:00
Collecting opentelemetry-instrumentation-fastapi
Downloading opentelemetry_instrumentation_fastapi-0.46b0-py3-none-any.whl (11 kB)
Collecting typer
Downloading typer-0.12.3-py3-none-any.whl (47 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.2/47.2 kB 9.5 MB/s eta 0:00:00
Collecting sshfs<=2023.4.1,>=2023.1.0
Downloading sshfs-2023.4.1-py3-none-any.whl (15 kB)
Collecting uvicorn
Downloading uvicorn-0.30.1-py3-none-any.whl (62 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.4/62.4 kB 13.7 MB/s eta 0:00:00
Collecting opentelemetry-exporter-otlp-proto-http
Downloading opentelemetry_exporter_otlp_proto_http-1.25.0-py3-none-any.whl (16 kB)
Collecting fsspec<=2023.5.0
Downloading fsspec-2023.5.0-py3-none-any.whl (160 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 160.1/160.1 kB 29.6 MB/s eta 0:00:00
Collecting opentelemetry-instrumentation
Downloading opentelemetry_instrumentation-0.46b0-py3-none-any.whl (29 kB)
Collecting sentry-sdk
Downloading sentry_sdk-2.5.1-py2.py3-none-any.whl (289 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 289.6/289.6 kB 3.7 MB/s eta 0:00:00
Requirement already satisfied: python-dotenv in /opt/conda/lib/python3.10/site-packages (from runhouse==0.0.28) (1.0.1)
Collecting httpx
Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75.6/75.6 kB 1.2 MB/s eta 0:00:00
Collecting opentelemetry-api
Downloading opentelemetry_api-1.25.0-py3-none-any.whl (59 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.9/59.9 kB 10.2 MB/s eta 0:00:00
Requirement already satisfied: wheel in /opt/conda/lib/python3.10/site-packages (from runhouse==0.0.28) (0.38.4)
Requirement already satisfied: setuptools<70.0.0 in /opt/conda/lib/python3.10/site-packages (from runhouse==0.0.28) (65.6.3)
Collecting cryptography<43,>=41.0.5
Downloading cryptography-42.0.8-cp39-abi3-manylinux_2_28_x86_64.whl (3.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.9/3.9 MB 28.9 MB/s eta 0:00:00
Requirement already satisfied: virtualenv<20.21.1,>=20.0.24 in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (20.21.0)
Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (8.1.7)
Requirement already satisfied: jsonschema in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (4.22.0)
Requirement already satisfied: numpy>=1.19.3 in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (1.26.4)
Requirement already satisfied: protobuf!=3.19.5,>=3.15.3 in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (4.25.3)
Requirement already satisfied: grpcio<=1.51.3,>=1.42.0 in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (1.51.3)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (6.0.1)
Requirement already satisfied: frozenlist in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (1.4.1)
Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (3.14.0)
Requirement already satisfied: attrs in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (23.2.0)
Requirement already satisfied: packaging in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (24.1)
Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (2.28.2)
Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (1.0.8)
Requirement already satisfied: aiosignal in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (1.3.1)
Requirement already satisfied: gpustat>=1.0.0 in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (1.1.1)
Requirement already satisfied: colorful in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (0.5.6)
Requirement already satisfied: aiohttp-cors in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (0.7.0)
Requirement already satisfied: opencensus in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (0.11.4)
Requirement already satisfied: py-spy>=0.2.0 in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (0.3.14)
Requirement already satisfied: prometheus-client>=0.7.1 in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (0.20.0)
Requirement already satisfied: pydantic in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (1.10.16)
Requirement already satisfied: aiohttp>=3.7 in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (3.9.5)
Requirement already satisfied: smart-open in /opt/conda/lib/python3.10/site-packages (from ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (7.0.4)
Collecting asyncssh<3,>=2.11.0
Downloading asyncssh-2.14.2-py3-none-any.whl (352 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 352.5/352.5 kB 4.4 MB/s eta 0:00:00
Collecting ujson!=4.0.2,!=4.1.0,!=4.2.0,!=4.3.0,!=5.0.0,!=5.1.0,>=4.0.1
Downloading ujson-5.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (53 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.6/53.6 kB 10.3 MB/s eta 0:00:00
Requirement already satisfied: jinja2>=2.11.2 in /opt/conda/lib/python3.10/site-packages (from fastapi->runhouse==0.0.28) (3.1.4)
Collecting starlette<0.38.0,>=0.37.2
Downloading starlette-0.37.2-py3-none-any.whl (71 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.9/71.9 kB 16.3 MB/s eta 0:00:00
Collecting python-multipart>=0.0.7
Downloading python_multipart-0.0.9-py3-none-any.whl (22 kB)
Collecting orjson>=3.2.1
Downloading orjson-3.10.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (142 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 142.7/142.7 kB 1.9 MB/s eta 0:00:00
Collecting fastapi-cli>=0.0.2
Downloading fastapi_cli-0.0.4-py3-none-any.whl (9.5 kB)
Collecting email_validator>=2.0.0
Downloading email_validator-2.1.1-py3-none-any.whl (30 kB)
Requirement already satisfied: typing-extensions>=4.8.0 in /opt/conda/lib/python3.10/site-packages (from fastapi->runhouse==0.0.28) (4.12.2)
Collecting anyio
Downloading anyio-4.4.0-py3-none-any.whl (86 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.8/86.8 kB 18.8 MB/s eta 0:00:00
Collecting sniffio
Downloading sniffio-1.3.1-py3-none-any.whl (10 kB)
Requirement already satisfied: certifi in /opt/conda/lib/python3.10/site-packages (from httpx->runhouse==0.0.28) (2022.12.7)
Requirement already satisfied: idna in /opt/conda/lib/python3.10/site-packages (from httpx->runhouse==0.0.28) (3.4)
Collecting httpcore==1.*
Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.9/77.9 kB 1.0 MB/s eta 0:00:00
Collecting h11<0.15,>=0.13
Downloading h11-0.14.0-py3-none-any.whl (58 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.3/58.3 kB 14.1 MB/s eta 0:00:00
Collecting importlib-metadata<=7.1,>=6.0
Downloading importlib_metadata-7.1.0-py3-none-any.whl (24 kB)
Collecting deprecated>=1.2.6
Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Requirement already satisfied: googleapis-common-protos~=1.52 in /opt/conda/lib/python3.10/site-packages (from opentelemetry-exporter-otlp-proto-http->runhouse==0.0.28) (1.63.1)
Collecting opentelemetry-exporter-otlp-proto-common==1.25.0
Downloading opentelemetry_exporter_otlp_proto_common-1.25.0-py3-none-any.whl (17 kB)
Collecting opentelemetry-proto==1.25.0
Downloading opentelemetry_proto-1.25.0-py3-none-any.whl (52 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.5/52.5 kB 532.4 kB/s eta 0:00:00
Collecting opentelemetry-semantic-conventions==0.46b0
Downloading opentelemetry_semantic_conventions-0.46b0-py3-none-any.whl (130 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 130.5/130.5 kB 28.8 MB/s eta 0:00:00
Requirement already satisfied: wrapt<2.0.0,>=1.0.0 in /opt/conda/lib/python3.10/site-packages (from opentelemetry-instrumentation->runhouse==0.0.28) (1.16.0)
Collecting opentelemetry-instrumentation-asgi==0.46b0
Downloading opentelemetry_instrumentation_asgi-0.46b0-py3-none-any.whl (14 kB)
Collecting opentelemetry-util-http==0.46b0
Downloading opentelemetry_util_http-0.46b0-py3-none-any.whl (6.9 kB)
Collecting asgiref~=3.0
Downloading asgiref-3.8.1-py3-none-any.whl (23 kB)
Collecting ptyprocess>=0.5
Downloading ptyprocess-0.7.0-py2.py3-none-any.whl (13 kB)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /opt/conda/lib/python3.10/site-packages (from rich->runhouse==0.0.28) (2.18.0)
Requirement already satisfied: markdown-it-py>=2.2.0 in /opt/conda/lib/python3.10/site-packages (from rich->runhouse==0.0.28) (3.0.0)
Requirement already satisfied: urllib3>=1.26.11 in /opt/conda/lib/python3.10/site-packages (from sentry-sdk->runhouse==0.0.28) (1.26.14)
Collecting shellingham>=1.3.0
Downloading shellingham-1.5.4-py2.py3-none-any.whl (9.8 kB)
Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp>=3.7->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (1.9.4)
Requirement already satisfied: async-timeout<5.0,>=4.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp>=3.7->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (4.0.3)
Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.10/site-packages (from aiohttp>=3.7->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (6.0.5)
Requirement already satisfied: cffi>=1.12 in /opt/conda/lib/python3.10/site-packages (from cryptography<43,>=41.0.5->pyOpenSSL>=23.3.0->runhouse==0.0.28) (1.15.1)
Collecting dnspython>=2.0.0
Downloading dnspython-2.6.1-py3-none-any.whl (307 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 307.7/307.7 kB 5.3 MB/s eta 0:00:00
Requirement already satisfied: nvidia-ml-py>=11.450.129 in /opt/conda/lib/python3.10/site-packages (from gpustat>=1.0.0->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (12.555.43)
Requirement already satisfied: psutil>=5.6.0 in /opt/conda/lib/python3.10/site-packages (from gpustat>=1.0.0->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (5.9.8)
Requirement already satisfied: blessed>=1.17.1 in /opt/conda/lib/python3.10/site-packages (from gpustat>=1.0.0->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (1.20.0)
Collecting zipp>=0.5
Downloading zipp-3.19.2-py3-none-any.whl (9.0 kB)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.10/site-packages (from jinja2>=2.11.2->fastapi->runhouse==0.0.28) (2.1.5)
Requirement already satisfied: mdurl~=0.1 in /opt/conda/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich->runhouse==0.0.28) (0.1.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (2.1.1)
Collecting exceptiongroup>=1.0.2
Downloading exceptiongroup-1.2.1-py3-none-any.whl (16 kB)
Collecting websockets>=10.4
Downloading websockets-12.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (130 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 130.2/130.2 kB 2.3 MB/s eta 0:00:00
Collecting uvloop!=0.15.0,!=0.15.1,>=0.14.0
Downloading uvloop-0.19.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 29.5 MB/s eta 0:00:00
Collecting watchfiles>=0.13
Downloading watchfiles-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 12.5 MB/s eta 0:00:00
Collecting httptools>=0.5.0
Downloading httptools-0.6.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (341 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 341.4/341.4 kB 5.4 MB/s eta 0:00:00
Requirement already satisfied: distlib<1,>=0.3.6 in /opt/conda/lib/python3.10/site-packages (from virtualenv<20.21.1,>=20.0.24->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (0.3.8)
Requirement already satisfied: platformdirs<4,>=2.4 in /opt/conda/lib/python3.10/site-packages (from virtualenv<20.21.1,>=20.0.24->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (3.11.0)
Requirement already satisfied: referencing>=0.28.4 in /opt/conda/lib/python3.10/site-packages (from jsonschema->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (0.35.1)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /opt/conda/lib/python3.10/site-packages (from jsonschema->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (2023.12.1)
Requirement already satisfied: rpds-py>=0.7.1 in /opt/conda/lib/python3.10/site-packages (from jsonschema->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (0.18.1)
Requirement already satisfied: opencensus-context>=0.1.3 in /opt/conda/lib/python3.10/site-packages (from opencensus->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (0.1.3)
Requirement already satisfied: six~=1.16 in /opt/conda/lib/python3.10/site-packages (from opencensus->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (1.16.0)
Requirement already satisfied: google-api-core<3.0.0,>=1.0.0 in /opt/conda/lib/python3.10/site-packages (from opencensus->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (2.19.0)
Requirement already satisfied: wcwidth>=0.1.4 in /opt/conda/lib/python3.10/site-packages (from blessed>=1.17.1->gpustat>=1.0.0->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (0.2.13)
Requirement already satisfied: pycparser in /opt/conda/lib/python3.10/site-packages (from cffi>=1.12->cryptography<43,>=41.0.5->pyOpenSSL>=23.3.0->runhouse==0.0.28) (2.21)
Requirement already satisfied: proto-plus<2.0.0dev,>=1.22.3 in /opt/conda/lib/python3.10/site-packages (from google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (1.23.0)
Requirement already satisfied: google-auth<3.0.dev0,>=2.14.1 in /opt/conda/lib/python3.10/site-packages (from google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (2.30.0)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in /opt/conda/lib/python3.10/site-packages (from google-auth<3.0.dev0,>=2.14.1->google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (5.3.3)
Requirement already satisfied: rsa<5,>=3.1.4 in /opt/conda/lib/python3.10/site-packages (from google-auth<3.0.dev0,>=2.14.1->google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (4.7.2)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.10/site-packages (from google-auth<3.0.dev0,>=2.14.1->google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (0.4.0)
Requirement already satisfied: pyasn1<0.7.0,>=0.4.6 in /opt/conda/lib/python3.10/site-packages (from pyasn1-modules>=0.2.1->google-auth<3.0.dev0,>=2.14.1->google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]!=2.6.0,<=2.6.3,>=2.2.0->runhouse==0.0.28) (0.6.0)
Installing collected packages: ptyprocess, zipp, websockets, uvloop, ujson, sniffio, shellingham, sentry-sdk, python-multipart, pexpect, orjson, opentelemetry-util-http, opentelemetry-proto, httptools, h11, fsspec, exceptiongroup, dnspython, deprecated, asgiref, apispec, uvicorn, opentelemetry-exporter-otlp-proto-common, importlib-metadata, httpcore, email_validator, cryptography, anyio, watchfiles, typer, starlette, pyOpenSSL, opentelemetry-api, httpx, asyncssh, sshfs, opentelemetry-semantic-conventions, opentelemetry-instrumentation, fastapi-cli, opentelemetry-sdk, opentelemetry-instrumentation-requests, opentelemetry-instrumentation-asgi, fastapi, opentelemetry-instrumentation-fastapi, opentelemetry-exporter-otlp-proto-http, runhouse
Attempting uninstall: cryptography
Found existing installation: cryptography 39.0.1
Uninstalling cryptography-39.0.1:
Successfully uninstalled cryptography-39.0.1
Attempting uninstall: pyOpenSSL
Found existing installation: pyOpenSSL 23.0.0
Uninstalling pyOpenSSL-23.0.0:
Successfully uninstalled pyOpenSSL-23.0.0
Successfully installed anyio-4.4.0 apispec-6.6.1 asgiref-3.8.1 asyncssh-2.14.2 cryptography-42.0.8 deprecated-1.2.14 dnspython-2.6.1 email_validator-2.1.1 exceptiongroup-1.2.1 fastapi-0.111.0 fastapi-cli-0.0.4 fsspec-2023.5.0 h11-0.14.0 httpcore-1.0.5 httptools-0.6.1 httpx-0.27.0 importlib-metadata-7.1.0 opentelemetry-api-1.25.0 opentelemetry-exporter-otlp-proto-common-1.25.0 opentelemetry-exporter-otlp-proto-http-1.25.0 opentelemetry-instrumentation-0.46b0 opentelemetry-instrumentation-asgi-0.46b0 opentelemetry-instrumentation-fastapi-0.46b0 opentelemetry-instrumentation-requests-0.46b0 opentelemetry-proto-1.25.0 opentelemetry-sdk-1.25.0 opentelemetry-semantic-conventions-0.46b0 opentelemetry-util-http-0.46b0 orjson-3.10.4 pexpect-4.9.0 ptyprocess-0.7.0 pyOpenSSL-24.1.0 python-multipart-0.0.9 runhouse-0.0.28 sentry-sdk-2.5.1 shellingham-1.5.4 sniffio-1.3.1 sshfs-2023.4.1 starlette-0.37.2 typer-0.12.3 ujson-5.10.0 uvicorn-0.30.1 uvloop-0.19.0 watchfiles-0.22.0 websockets-12.0 zipp-3.19.2
Shared connection to 18.119.117.17 closed.
INFO | 2024-06-12 13:02:21.961759 | Running command on fastapi-runhouse-example: mkdir -p ~/.rh; touch ~/.rh/cluster_config.json; echo '{"name": "fastapi-runhouse-example", "resource_type": "cluster", "resource_subtype": "OnDemandCluster", "provenance": null, "visibility": "private", "ips": ["18.119.117.17"], "server_port": 32300, "server_connection_type": "ssh", "den_auth": false, "use_local_telemetry": false, "ssh_port": 22, "api_server_url": "https://api.run.house", "instance_type": "CPU:2+", "provider": "aws", "open_ports": [], "use_spot": false, "region": "us-east-2", "stable_internal_external_ips": [["10.16.96.251", "18.119.117.17"]], "autostop_mins": -1}' > ~/.rh/cluster_config.json
Shared connection to 18.119.117.17 closed.
INFO | 2024-06-12 13:02:23.046387 | Running command on fastapi-runhouse-example: runhouse restart --restart-ray --port 32300 --api-server-url https://api.run.house --default-env-name _cluster_default_env --from-python
INFO | 2024-06-12 13:02:27.383278 | Using port: 32300.
INFO | 2024-06-12 13:02:27.383828 | Setting api_server url to https://api.run.house
INFO | 2024-06-12 13:02:27.383922 | Starting server in default env named: _cluster_default_env
INFO | 2024-06-12 13:02:27.383989 | Creating runtime env for conda env: None
INFO | 2024-06-12 13:02:27.385414 | Starting API server using the following command: screen -dm bash -c "/opt/conda/bin/python3 -m runhouse.servers.http.http_server --port 32300 --api-server-url https://api.run.house --default-env-name _cluster_default_env --from-python 2>&1 | tee -a '/home/ubuntu/.rh/server.log' 2>&1".
Executing `pkill -f "/opt/conda/bin/python3 -m runhouse.servers.http.http_server"`
Executing `pkill -f ".*ray.*6379.*"`
Executing `ray start --head --port 6379 --disable-usage-stats`
Usage stats collection is disabled.
Local node IP: 10.16.96.251
--------------------
Ray runtime started.
--------------------
Next steps
To add another node to this Ray cluster, run
ray start --address='10.16.96.251:6379'
To connect to this Ray cluster:
import ray
ray.init()
To submit a Ray job using the Ray Jobs CLI:
RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
for more information on submitting Ray jobs to the Ray cluster.
To terminate the Ray runtime, run
ray stop
To view the status of the cluster, use
ray status
To monitor and debug Ray, view the dashboard at
127.0.0.1:8265
If connection to the dashboard fails, check your firewall settings and network configuration.
Executing `screen -dm bash -c "/opt/conda/bin/python3 -m runhouse.servers.http.http_server --port 32300 --api-server-url https://api.run.house --default-env-name _cluster_default_env --from-python 2>&1 | tee -a '/home/ubuntu/.rh/server.log' 2>&1"`
INFO | 2024-06-12 13:02:34.807124 | Loaded cluster config from Ray.
INFO | 2024-06-12 13:02:34.809339 | Updated cluster config with parsed argument values.
INFO | 2024-06-12 13:02:34.838024 | Preparing to send telemetry to https://api.run.house:14318
INFO | 2024-06-12 13:02:34.845426 | Successfully added telemetry exporter https://api.run.house:14318
WARNING | 2024-06-12 13:02:34.845612 | Attempting to instrument FastAPI app while already instrumented
WARNING | 2024-06-12 13:02:34.845691 | Attempting to instrument while already instrumented
INFO | 2024-06-12 13:02:36.204163 | Launching Runhouse API server with den_auth=False and use_local_telemetry=False on host=0.0.0.0 and use_https=False and port_arg=32300
INFO: Started server process [29226]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:32300 (Press CTRL+C to quit)
Shared connection to 18.119.117.17 closed.
INFO | 2024-06-12 13:02:37.401994 | Forwarding port 32300 to port 32300 on localhost.
INFO | 2024-06-12 13:02:38.740679 | Server fastapi-runhouse-example is up.
Sentry is attempting to send 2 pending events
Waiting up to 2 seconds
Press Ctrl-C to quit
INFO | 2024-06-12 13:02:39.209612 | Forwarding port 32300 to port 32300 on localhost.
The feature
It would be interesting if Runhouse could also interface with an existing Slurm cluster.
Motivation
I am part of a team managing a Slurm (GPU) cluster. Meanwhile, I have users who are interested in running large language models via Runhouse (https://langchain.readthedocs.io/en/latest/modules/llms/integrations/self_hosted_examples.html). It would be excellent if I could bridge this gap between supply and demand with Runhouse. From what I have read in the documentation so far, Runhouse does not yet come with a Slurm interface.
What the ideal solution looks like
I am completely new to Runhouse, so this may not be the ideal solution model, but I imagine this could be supported as a bring-your-own cluster, with a little extra interaction between Runhouse and Slurm to request the necessary resources (maybe from the Cluster factory method) as one or more jobs in Slurm (probably through the Slurm REST API). Once the jobs are running, Runhouse could contact the nodes involved as a BYO cluster.
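The flow described above could be sketched roughly as follows. This is a hypothetical illustration, not an existing Runhouse feature: the slurmrestd URL, token header, and payload fields are assumptions loosely following the Slurm REST API conventions (the exact schema varies by Slurm version), and the final rh.cluster(...) call assumes Runhouse's bring-your-own-cluster factory with static IPs and SSH credentials.

```python
# Hypothetical sketch: request nodes from Slurm via its REST API, then hand
# the allocated nodes to Runhouse as a bring-your-own (static) cluster.

def build_slurm_job_payload(name, nodes=1, gpus_per_node=1, time_limit_min=60):
    """Build a job-submission body for slurmrestd.

    Field names loosely follow the Slurm REST API; the exact schema
    depends on the Slurm version, so treat this as illustrative.
    """
    return {
        "job": {
            "name": name,
            "nodes": nodes,
            "time_limit": time_limit_min,
            "tres_per_node": f"gres/gpu:{gpus_per_node}",
            "environment": {"PATH": "/usr/bin:/bin"},
        },
        # Keep the allocation alive so Runhouse can SSH into the nodes.
        "script": "#!/bin/bash\nsleep infinity\n",
    }

payload = build_slurm_job_payload("runhouse-alloc", nodes=1, gpus_per_node=1)

# Submission itself might look like this (requires a slurmrestd endpoint and
# a JWT token, so it is commented out; look_up_allocated_node_ips is a
# hypothetical helper that resolves the job's node hostnames to IPs):
# import json, requests
# resp = requests.post(
#     "http://slurm-head:6820/slurm/v0.0.38/job/submit",
#     headers={"X-SLURM-USER-TOKEN": token, "Content-Type": "application/json"},
#     data=json.dumps(payload),
# )
# node_ips = look_up_allocated_node_ips(resp.json()["job_id"])
#
# Once the job is running, point Runhouse at the nodes as a BYO cluster:
# import runhouse as rh
# cluster = rh.cluster(name="slurm-alloc", ips=node_ips,
#                      ssh_creds={"ssh_user": "me", "ssh_private_key": "~/.ssh/id_rsa"})
```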
Accessing a function via http (in addition to grpc)
I'm having this bug when trying to set up a model on Lambda Cloud, running SelfHostedHuggingFaceLLM() after the rh.cluster() call.
from langchain.llms import SelfHostedPipeline, SelfHostedHuggingFaceLLM
from langchain import PromptTemplate, LLMChain
import runhouse as rh
gpu = rh.cluster(name="rh-a10", instance_type="A10:1").save()
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm = SelfHostedHuggingFaceLLM(model_id="gpt2", hardware=gpu, model_reqs=["pip:./", "transformers", "torch"])
I made sure with sky check that the Lambda credentials are set, but the error I get in the log is this, which I haven't been able to solve.
If I can get any help solving this, I would appreciate it.
Describe the bug
SkyPilot wheels failed to build, so rh.cluster().up_if_not()
fails. Much of SkyPilot does build successfully, but for whatever reason the wheels are not built. I wouldn't say this is an issue with SkyPilot alone, because I ran this Runhouse example on my Mac and it worked; whatever the cause, it is affecting Runhouse on Ubuntu 22.04 as well. I have collected all relevant information and attached it as logs.
Repro
Follow the runhouse demo on Ubuntu 22.04
SEE THE ATTACHED LOGS
collect_env.log
runhouse.1.log
runhouse.2.log
runhouse.log
Versions
Please run the following and paste the output below.
wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Additional context
Hi! First off, I just wanna say runhouse is an awesome project! Really gonna revolutionize how people run machine learning workflows!
Describe the bug
I'm running into an issue where I can't run any remote functions on the cluster, but I can do a cluster.run_python(...)
Here's the code I'm running:
import runhouse as rh
cluster = rh.OnDemandCluster(
name="cpu-cluster",
instance_type="CPU:8",
provider="aws", # options: "AWS", "GCP", "Azure", "Lambda", or "cheapest"
)
cluster.up_if_not()
cluster.run_python(['import numpy', 'print(numpy.__version__)'])
print(cluster.check_server()) # ERRORS HERE
This runs fine until the cluster.check_server(), as you can see here:
INFO | 2023-07-10 21:51:56,953 | Loaded Runhouse config from /home/shyam/.rh/config.yaml
Refreshing status for 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--INFO | 2023-07-10 21:51:58,623 | Found credentials in shared credentials file: ~/.aws/credentials
INFO | 2023-07-10 21:52:05,743 | Running command on cpu-cluster: python3 -c "import numpy; print(numpy.__version__)"
1.25.1
INFO | 2023-07-10 21:52:07,304 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-07-10 21:52:07,855 | Authentication (publickey) successful!
INFO | 2023-07-10 21:52:08,095 | Checking server cpu-cluster
Traceback (most recent call last):
File "/home/shyam/Code/trainyard/examples/test.py", line 54, in <module>
print(cluster.check_server())
File "/home/shyam/miniconda3/envs/py310/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 363, in check_server
self.client.check_server(cluster_config=cluster_config)
File "/home/shyam/miniconda3/envs/py310/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 48, in check_server
self.request(
File "/home/shyam/miniconda3/envs/py310/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 41, in request
raise ValueError(
ValueError: Error calling check on server: Internal Server Error
Not sure if I'm doing something wrong here, but I think my credentials work because I can see that the cluster is being created and I can ssh into it. My package versions can be seen below, let me know if you need more information! Thanks!
Versions
Python Platform: Linux-5.8.0-36-generic-x86_64-with-glibc2.31
Python Version: 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0]
Relevant packages:
awscli==1.25.60
azure-cli==2.31.0
azure-cli-core==2.31.0
azure-cli-telemetry==1.0.6
azure-core==1.28.0
boto3==1.24.59
docker==6.1.3
fsspec==2023.1.0
gcsfs==2023.1.0
google-api-python-client==2.92.0
google-cloud-storage==2.10.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.4.2
runhouse==0.0.7
s3fs==2023.1.0
skypilot==0.3.1
sshfs==2023.4.1
sshtunnel==0.4.0
typer==0.9.0
wheel==0.38.4
Checking credentials to enable clouds for SkyPilot.
AWS: enabled
Azure: disabled
Reason: Azure credential is not set. Run the following commands:
$ az login
$ az account set -s <subscription_id>
For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
GCP: disabled
Reason: GCP tools are not installed or credentials are not set. Run the following commands:
$ pip install google-api-python-client
$ conda install -c conda-forge google-cloud-sdk -y
$ gcloud init
$ gcloud auth application-default login
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html
Lambda: enabled
IBM: disabled
Reason: Missing credential file at /home/shyam/.ibm/credentials.yaml.
Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
iam_api_key: <IAM_API_KEY>
resource_group_id: <RESOURCE_GROUP_ID>
Cloudflare (for R2 object store): disabled
Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
$ pip install boto3
$ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
$ mkdir -p ~/.cloudflare
$ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2
SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
cpu-cluster 13 mins ago 1x AWS(m6i.2xlarge) INIT (down) test.py
Managed spot jobs
No in progress jobs. (See: sky spot -h)
In addition, here's the end of the setup of the cluster:
--------------------
Ray runtime started.
--------------------
Next steps
To add another node to this Ray cluster, run
ray start --address='172.31.46.12:6380'
To connect to this Ray cluster:
import ray
ray.init()
Shared connection to 54.166.159.228 closed.
To submit a Ray job using the Ray Jobs CLI:
RAY_ADDRESS='http://127.0.0.1:8266' ray job submit --working-dir . -- python my_script.py
See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
for more information on submitting Ray jobs to the Ray cluster.
To terminate the Ray runtime, run
ray stop
To view the status of the cluster, use
ray status
To monitor and debug Ray, view the dashboard at
127.0.0.1:8266
If connection to the dashboard fails, check your firewall settings and network configuration.
/usr/bin/prlimit
2023-07-10 21:35:36,790 INFO log_timer.py:25 -- NodeUpdater: i-036c634eb67821936: Setup commands succeeded [LogTimer=92341ms]
2023-07-10 21:35:36,791 INFO updater.py:489 -- [7/7] Starting the Ray runtime
2023-07-10 21:35:36,792 VINFO command_runner.py:371 -- Running `export RAY_USAGE_STATS_ENABLED=0;export RAY_OVERRIDE_RESOURCES='{"CPU":8}';((ps aux | grep -v nohup | grep -v grep | grep -q -- "python3 -m sky.skylet.skylet") || nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &); ray stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 ray start --disable-usage-stats --head --port=6380 --dashboard-port=8266 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --temp-dir /tmp/ray_skypilot || exit 1; which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done; python -c 'import json, os; json.dump({"ray_port":6380, "ray_dashboard_port":8266}, open(os.path.expanduser("~/.sky/ray_port.json"), "w"))';`
2023-07-10 21:35:36,792 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_5a4cd850fc/7112f145b3/%C -o ControlPersist=10s -o ConnectTimeout=120s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_USAGE_STATS_ENABLED=0;export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":8}'"'"';((ps aux | grep -v nohup | grep -v grep | grep -q -- "python3 -m sky.skylet.skylet") || nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &); ray stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 ray start --disable-usage-stats --head --port=6380 --dashboard-port=8266 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --temp-dir /tmp/ray_skypilot || exit 1; which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done; python -c '"'"'import json, os; json.dump({"ray_port":6380, "ray_dashboard_port":8266}, open(os.path.expanduser("~/.sky/ray_port.json"), "w"))'"'"';)'`
2023-07-10 21:35:41,238 INFO log_timer.py:25 -- NodeUpdater: i-036c634eb67821936: Ray start commands succeeded [LogTimer=4447ms]
2023-07-10 21:35:41,238 INFO log_timer.py:25 -- NodeUpdater: i-036c634eb67821936: Applied config f62a597a450a8281871e7ace3caa155afb5dfe65 [LogTimer=183192ms]
2023-07-10 21:35:42,755 INFO log_timer.py:25 -- AWSNodeProvider: Set tag ray-node-status=up-to-date on ['i-036c634eb67821936'] [LogTimer=515ms]
2023-07-10 21:35:42,925 INFO log_timer.py:25 -- AWSNodeProvider: Set tag ray-runtime-config=f62a597a450a8281871e7ace3caa155afb5dfe65 on ['i-036c634eb67821936'] [LogTimer=170ms]
2023-07-10 21:35:43,090 INFO log_timer.py:25 -- AWSNodeProvider: Set tag ray-file-mounts-contents=24403a03b3acb79e10305dbf19904b00a057a0a1 on ['i-036c634eb67821936'] [LogTimer=165ms]
2023-07-10 21:35:43,091 INFO updater.py:188 -- New status: up-to-date
2023-07-10 21:35:43,273 INFO commands.py:836 -- Useful commands
2023-07-10 21:35:43,273 INFO commands.py:838 -- Monitor autoscaling with
2023-07-10 21:35:43,274 INFO commands.py:839 -- ray exec /home/shyam/.sky/generated/cpu-cluster.yml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
2023-07-10 21:35:43,274 INFO commands.py:846 -- Connect to a terminal on the cluster head:
2023-07-10 21:35:43,274 INFO commands.py:847 -- ray attach /home/shyam/.sky/generated/cpu-cluster.yml
2023-07-10 21:35:43,274 INFO commands.py:850 -- Get a remote shell to the cluster manually:
2023-07-10 21:35:43,274 INFO commands.py:851 -- ssh -o IdentitiesOnly=yes -i ~/.ssh/sky-key [email protected]
The feature
Support for HPU Habana hardware accelerator in runhouse
Motivation
With the increasing demand for high-performance computing and the need for faster processing of large-scale machine learning and deep learning workloads, HPUs have emerged as powerful hardware accelerators. These accelerators offer significant performance advantages over traditional CPUs and GPUs for tasks involving LLMs, neural networks, large-scale data processing, and scientific simulations.
What the ideal solution looks like
By integrating support for HPUs in runhouse, you would provide developers with a platform that enables them to leverage these advanced hardware accelerators seamlessly. This would open up new possibilities for building and running computationally intensive applications and workflows directly on runhouse infrastructure.
For example, the client would be able to remotely launch applications on an HPU AWS server with:
rh.cluster(name='rh-gaudi', instance_type='dl1.24xlarge', provider='aws').save()
https://aws.amazon.com/ec2/instance-types/dl1/
https://developer.habana.ai/
Additional context
Self-hosted HPU servers should be supported as well.
Describe the bug
Please provide a clear and concise description of what cold start looks like.
I see the docs mention a couple of methods to speed up model load time; it would be great if objective numbers could be added. Ray also provides methods to combat cold start, and I see that library is being used here, but do you use those methods?
For example, in the image below from this article, most providers' cold starts are below 100s, and most providers list P90/P70/P50 values to help frame the cold start problem and its solutions in those terms.
Other relevant stuff:
https://news.ycombinator.com/item?id=35738072
https://www.banana.dev/blog/turboboot
Describe the bug
Testing with the langchain function self_hosted_huggingface_instructor_embedding_documents(), which transfers small files from client to server, the client hits the following error during the process:
INFO | 2023-08-01 21:57:49,547 | Setting up Function on cluster.
INFO | 2023-08-01 21:57:49,547 | Copying folder from file:///root/t to: rh-cls
sky.exceptions.CommandError: Command rsync -Pavz --filter='dir-merge,- .gitignore' -e "ssh -i /root/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ConnectTimeout=30s -o ForwardAgent=yes -o ControlMaster=auto -o ControlPath=/tmp/skypilot_ssh_root/3651d5b8ee/%C -o ControlPersist=300s" '/root/t/' [email protected]:'~/t/' failed with return code 2.
Failed to rsync up: /root/t/ -> ~/t/. Ensure that the network is stable, then retry.
Then, singling the command out and launching it manually:
#rsync -Pavz --filter='dir-merge,- .gitignore' -e "ssh -i /root/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ConnectTimeout=30s -o ForwardAgent=yes -o ControlMaster=auto -o ControlPath=/tmp/skypilot_ssh_root/3651d5b8ee/%C -o ControlPersist=300s" '/root/t/' [email protected]:'~/t/'
protocol version mismatch -- is your shell clean?
(see the rsync manpage for an explanation)
rsync error: protocol incompatibility (code 2) at compat.c(622) [sender=3.2.7]
If relevant, include the steps or code snippet to reproduce the error.
Versions
Please run the following and paste the output below.
wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Python Platform: Linux-5.15.0-60-lowlatency-x86_64-with-glibc2.35
Python Version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
Relevant packages:
boto3==1.28.17
fastapi==0.99.0
fsspec==2023.5.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.5.2
runhouse==0.0.9
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.38.4
Additional context
Eliminate need to .write() a data object to the filesystem before returning from a function, which can be quite expensive - e.g. after I've preprocessed a dataset that's already backed by files in the filesystem, calling .write() is just copying them for no reason, including a costly partition.
Basically, we have the cluster object store, we should use it to avoid fs reads and writes we don't need (and save the user the trouble of knowing they need to .write() before using data remotely). This also saves us the trouble of finding places to write down data when the user doesn't feel like providing a path, or is just working with an anonymous data object (e.g. returning an rh.Table from a preprocessing fn). This will also clean up a sort of API wrinkle where a pinned object is markedly different from a blob (there doesn't need to be a real difference in terms of user intent), and the relationship between data passed to a resource constructor and the written-down data is a little unclear (e.g. if I do rh.table(my_ray_table, path="real/path/to/existing.parquet")
which data should fetch return?).
Basic API concepts:
- rh.table(my_table) saves the table into the cluster's object store, with system=this_cluster and name=f"table_{random_hex}" if no name is given (just like we do to generate random run_keys). The rns_address (whether random or user-provided) is the key in the object store.
- .save just persists in the RNS that the object lives on that cluster in the object store. If the cluster goes down, the table is obviously gone.
- There's no ._data field, because there's no need for a local object store (nothing can .get the object from the local interpreter anyway).
- rh.table(my_table).write() would actually save the table down (same as present behavior), but return a new table object with path set to the fs path. That eliminates the current ._cached_data ambiguity (multiple sources of truth), because the original object still holds the original data, and the new returned object just points to the fs data. rh.table(my_table).write(path="local/path.parquet") is clearer than the present constructor accepting both (we should probably throw an error if both are passed in, because it's ambiguous). One gotcha: if the user sets the name for the in-memory table and then writes it, should the new table have the same name? If they .save it, should we delete the existing object out of the object store so it's clear that there's only one table with that rns_address (and it's not really accessible anymore)? In general, if a user loads an object from_name, the one stored in RNS should be the source of truth, even if there's a local one in the object store.
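The proposed write() semantics (the original object keeps the in-memory data; write returns a new object that only points at the filesystem path) can be sketched locally; InMemoryTable and FileTable below are illustrative stand-ins, not Runhouse classes, with json standing in for parquet:

```python
import json
import os
import tempfile

class FileTable:
    """Points at written-down data; holds no in-memory copy."""
    def __init__(self, path):
        self.path = path

    def fetch(self):
        with open(self.path) as f:
            return json.load(f)

class InMemoryTable:
    """Holds the data itself; .write() returns a new file-backed object,
    leaving this object untouched (one source of truth per object)."""
    def __init__(self, data):
        self.data = data

    def write(self, path):
        with open(path, "w") as f:
            json.dump(self.data, f)
        return FileTable(path)

tbl = InMemoryTable([{"x": 1}, {"x": 2}])
path = os.path.join(tempfile.mkdtemp(), "tbl.json")
written = tbl.write(path)
print(written.fetch() == tbl.data)  # same content, two unambiguous objects
```

Here fetch on the file-backed object always re-reads the path, so there is no cached-data ambiguity to resolve.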
- my_table.fetch() and my_table.stream() from elsewhere should still work, but now via RPCs - the cluster's .get should already work for fetch, but we'd likely need a new one for stream. For fetch, the object needs to be pickleable (not cloudpickle-able) for us to be able to send it over the wire without dealing with python version mismatches (I don't think this is unreasonable).
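The pickleable-vs-cloudpickle-able constraint for fetch can be checked with plain pickle; a minimal sketch (is_wire_safe is a hypothetical helper, not a Runhouse API):

```python
import pickle

def is_wire_safe(obj) -> bool:
    """Return True if obj survives plain pickle, i.e. could be fetched over
    the wire without cloudpickle (and its Python-version coupling)."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

print(is_wire_safe({"rows": [1, 2, 3]}))  # plain data pickles fine: True
print(is_wire_safe(lambda x: x + 1))      # lambdas need cloudpickle: False
```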
We need a way to tell for a given blob or table if we need to use the RPCs instead of the existing fs-based operations, and I'm leaning toward actually breaking out the folder-backed table or blob to be separate classes from the in-memory ones. It would probably make the most sense if the in-memory Blob/Table/KVstore etc. classes were actually the base classes, and the folder-backed ones were subclasses. There are a number of advantages to doing this:
- The in-memory classes don't need a path field.
- rh.blob(my_model) saves into the object store with key blob.name or f"blob_{random_hex}".
- rh.blob(my_model, name="my_model") and rh.Blob.from_name("my_model") should behave identically to rh.pin_to_memory("my_model", my_model) and rh.get_pinned_object("my_model") (except with rns_address as the obj_store key instead of name, but that's an implementation detail), and ideally replace them. The current pinning system isn't very elegant and eats too much user brainspace.
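The naming scheme described above (user-provided name, else a random blob_{hex} key) can be sketched with a plain dict standing in for the cluster-side object store; ObjStore is purely illustrative:

```python
import secrets

class ObjStore:
    """Toy stand-in for the cluster's object store, keyed by rns_address."""
    def __init__(self):
        self._store = {}

    def put(self, obj, name=None):
        # Fall back to a random key, like f"blob_{random_hex}" in the proposal.
        key = name or f"blob_{secrets.token_hex(4)}"
        self._store[key] = obj
        return key

    def get(self, key):
        return self._store[key]

store = ObjStore()
named_key = store.put({"weights": [0.1, 0.2]}, name="my_model")
anon_key = store.put([1, 2, 3])  # gets a key like "blob_3f9c21aa"
print(named_key, store.get(named_key))
```

With this shape, pin_to_memory/get_pinned_object become just put/get with a user-chosen key, which is the unification the proposal is after.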
An immediate implication of the above (because we use pinning for storing results when a user calls fn.remote) is that fn.remote can just wrap the result in a blob before returning, instead of returning the run_key. Wrapping a result in rh.blob is common enough that it makes sense for .remote to mean "please return a remote object." The current .remote behavior of returning the run_key is actually "run this async and return a key to retrieve the result", which I think would make more sense to be called fn.async or fn.submit, considering the fact that most users don't seem to know we support async because the naming is unclear (submit could make it clearer that the function will continue to run in the background even if they kill the interpreter locally). Also, right now we need to INFO log a bunch of instructions for killing or retrieving for every .remote call, but this isn't necessary and looks ugly when the user just wants a remote object back.
Lastly, supporting remote in-memory objects opens the door to remote calls on those objects. We could pretty easily support this just by intercepting any call on the object, and if the rh.blob doesn't have that function/attr, we try RPCing the call over to the cluster. Like this:
class Blob(Resource):
    ...
    def __getattribute__(self, name):
        # Intercept attrs that aren't part of Blob itself and proxy them
        # to the remote object over RPC.
        if not_a_blob_attr(name):
            remote_attr = self.get_attr_over_rpc(name)
            if name == "__call__" or hasattr(remote_attr, "__call__"):
                def newfunc(*args, **kwargs):
                    result = self.call_on_obj_via_rpc(name, *args, **kwargs)
                    # Primitives come back by value; anything else stays remote.
                    return result if self.is_primitive(result) else rh.blob(result)
                return newfunc
            else:
                return remote_attr
        else:
            # Normal attribute lookup for Blob's own attrs.
            return super().__getattribute__(name)
This would make our remote objects real remote objects, and save a lot of trouble creating one-off functions to send to the cluster to call methods on objects. You can do something crazy like:
model = rh.blob(my_model).to(gpu).cuda() # But can't use .to("cuda") because it'd call blob's .to
local_pil_image = model("my_input_string").fetch()
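The interception pattern can be demonstrated locally with a plain in-process object standing in for the RPC target; RemoteProxy is a hypothetical sketch (it uses __getattr__, a simpler though less general hook than the __getattribute__ above), not a Runhouse class:

```python
class RemoteProxy:
    """Toy local model of the proposed remote-object behavior: unknown
    attributes are routed to the wrapped object, as RPCs would be."""

    def __init__(self, obj):
        object.__setattr__(self, "_obj", obj)

    def fetch(self):
        # Stand-in for pulling the object back over the wire.
        return object.__getattribute__(self, "_obj")

    def __getattr__(self, name):
        # Only fires for names not found on the proxy itself.
        obj = object.__getattribute__(self, "_obj")
        attr = getattr(obj, name)
        if callable(attr):
            def newfunc(*args, **kwargs):
                result = attr(*args, **kwargs)
                # Primitives by value; everything else stays "remote".
                if isinstance(result, (int, float, str, bool, type(None))):
                    return result
                return RemoteProxy(result)
            return newfunc
        return attr

nums = RemoteProxy([3, 1, 2])
print(nums.count(1))               # primitive result, returned by value
sorted_copy = nums.copy()          # non-primitive result, wrapped again
print(isinstance(sorted_copy, RemoteProxy), sorted_copy.fetch())
```

The real implementation would replace the direct getattr/call with RPCs to the cluster, but the call-interception and wrap-or-unwrap logic is the same.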
So overall the benefits of this change are:
From SyncLinear.com | KIT-83
From SyncLinear.com | KIT-63
Overview and progress tracker for the secrets management revamp, including new APIs and support for new secret types and providers.
Keeping track of secrets and keys for your various cloud, cluster, and dev accounts, and sharing them across dev environments and teammates is manual and messy. Providing secrets management for Runhouse-adjacent work (e.g. cloud providers for Runhouse clusters, API keys used alongside Runhouse functions, etc) makes it easier to onboard Runhouse Den. Even as a standalone, Runhouse Secrets can be an easy way to get started with storing, keeping track of, and sharing keys.
Runhouse already has basic secrets management support, including saving/syncing provider secrets to default locations, and a login/logout flow. The secrets flow is currently quite separate from the rest of RH resource abstractions, but can benefit from inheriting the properties expected from RH resources, including naming, saving, and sharing.
Converting secrets to an RH resource makes it easier to further develop secrets: supporting sharing across users/devices, adding flexibility in secret types, and extending to new provider-specific secrets.
rh.Secrets.put/get
custom_secret = rh.secret(name="my_secret", values={"my_key": "my_value"})
custom_secret = custom_secret.write(path="~/.rh/secrets/custom_secret.json")
aws_secret = rh.provider_secret("aws") # extracts from default path or env vars
aws_secret.values
>>> {'access_key': 'XXX_KEY', 'secret_key': 'YYY_KEY'}
lambdalabs_secret = rh.provider_secret("lambda", values={"api_key": "*****"}).write()
cluster.sync_secrets(["aws", "lambda"])
env_secret = rh.env_secret(name="my_env_vars", env_vars=["OPENAI_API_KEY"]) # extracts from os.environ
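A minimal local sketch of the Secret-as-resource shape implied by the snippets above; the class name, methods, and json file format here are assumptions for illustration, not the actual Runhouse API:

```python
import json
import os
import tempfile

class Secret:
    """Hypothetical secret resource: named values that can be written to
    and reloaded from a local json file."""
    def __init__(self, name, values):
        self.name = name
        self.values = values

    def write(self, path):
        path = os.path.expanduser(path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump(self.values, f)
        return self

    @classmethod
    def from_file(cls, name, path):
        with open(os.path.expanduser(path)) as f:
            return cls(name, json.load(f))

root = tempfile.mkdtemp()
path = os.path.join(root, "secrets", "custom_secret.json")
Secret("my_secret", {"my_key": "my_value"}).write(path)
print(Secret.from_file("my_secret", path).values)
```

Inheriting from Resource would then layer naming, saving, and sharing on top of this same values/write/load core.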
cc @dongreenberg @jlewitt1angell
From SyncLinear.com | KIT-88
Cc @Caroline
From SyncLinear.com | KIT-38
Rather than always append "./" to reqs, etc.
From SyncLinear.com | KIT-64
Clicking on the Discord link in the README (both the Discord badge and in the Getting Help section) goes to an "Invite Invalid" page.
Is Discord still the recommended way to ask questions, or should I post them as Issues? I'm curious about this project :)
Tried reloading "sd_generate" from inside a notebook, and it hung trying to copy over the "./" of the notebook's environment (which was huge).
From SyncLinear.com | KIT-54
Describe the bug
Hi, with runhouse version 0.0.9 I consistently hit an error when running the following script (it worked with previous versions):
import runhouse as rh
gpu = rh.cluster(ips=['127.0.0.1'],
                 ssh_creds={'ssh_user': 'rhclient', 'ssh_private_key': '/home/rhclient/.ssh/id_rsa'},
                 name='rh-cls')
print("#################Restart server")
print("Exit now")
....
INFO | 2023-07-31 18:30:20,983 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-07-31 18:30:21,832 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-07-31 18:30:21,944 | Authentication (publickey) failed.
INFO | 2023-07-31 18:30:21,951 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-07-31 18:30:22,010 | Authentication (publickey) failed.
2023-07-31 18:30:22,010| ERROR | Could not open connection to gateway
ERROR | 2023-07-31 18:30:22,010 | Could not open connection to gateway
2023-07-31 18:30:22,011| ERROR | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-07-31 18:30:22,011 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-07-31 18:30:22,011 | Server rh-cls is up, but the HTTP server may not be up.
INFO | 2023-07-31 18:30:22,011 | Restarting HTTP server on rh-cls.
INFO | 2023-07-31 18:30:22,011 | Running command on rh-cls: pkill -f "python -m runhouse.servers.http.http_server"
Warning: Permanently added '127.0.0.1' (ED25519) to the list of known hosts.
Permission denied, please try again.
Permission denied, please try again.
[email protected]: Permission denied (publickey,password).
INFO | 2023-07-31 18:30:22,123 | Running command on rh-cls: screen -dm bash -c 'python -m runhouse.servers.http.http_server |& tee -a ~/.rh/cluster_server_rh-cls.log 2>&1'
Warning: Permanently added '127.0.0.1' (ED25519) to the list of known hosts.
Permission denied, please try again.
Permission denied, please try again.
[email protected]: Permission denied (publickey,password).
INFO | 2023-07-31 18:30:27,237 | Checking server rh-cls again.
Traceback (most recent call last):
File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 357, in check_server
self.connect_server_client()
File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 324, in connect_server_client
self._rpc_tunnel, connected_port = self.ssh_tunnel(
File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 411, in ssh_tunnel
ssh_tunnel.start()
File "/home/rhclient/.local/lib/python3.10/site-packages/sshtunnel.py", line 1331, in start
self._raise(BaseSSHTunnelForwarderError,
File "/home/rhclient/.local/lib/python3.10/site-packages/sshtunnel.py", line 1174, in _raise
raise exception(reason)
sshtunnel.BaseSSHTunnelForwarderError: Could not establish session to SSH gateway
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/devspace/test_self_hosted_llm.py", line 14, in
gpu = rh.cluster(ips=['127.0.0.1'],
File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster_factory.py", line 59, in cluster
return Cluster(ips=ips, ssh_creds=ssh_creds, name=name, dryrun=dryrun)
File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 58, in init
self.check_server()
File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 379, in check_server
self.client.check_server(cluster_config=cluster_config)
AttributeError: 'NoneType' object has no attribute 'check_server'
Versions
Please run the following and paste the output below
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Python Platform: Linux-5.15.0-60-lowlatency-x86_64-with-glibc2.35
Python Version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
Relevant packages:
boto3==1.28.15
fastapi==0.99.0
fsspec==2023.5.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.5.1
runhouse==0.0.9
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.38.4
Additional context
I started:
Curious
Describe the bug
Hi,
I'm trying to use a gpu system on our local network. However I'm running into issues.
Basic question: Does the runhouse package need to be installed on the remote gpu system? Couldn't figure this out from the documentation.
Here is the snippet of code I'm trying to run:
import runhouse as rh
import pdb;pdb.set_trace()
cluster = rh.cluster(
name="mlw-cluster",
ips=['xx.xx.xx.xx'],
ssh_creds={'ssh_user': 'lab', 'ssh_private_key':'/export/lab/.ssh/mlw01.key'},
)
def num_cpus():
    import multiprocessing
    return f"Num cpus: {multiprocessing.cpu_count()}"
num_cpus()
num_cpus_cluster = rh.function(name="num_cpus_cluster", fn=num_cpus).to(system=cluster, reqs=["./"])
I get following error in creating the cluster:
(Pdb) c
2023-07-20 10:17:54,985| WAR | MainThrea/1032@sshtunnel | Could not read SSH configuration file: ~/.ssh/config
WARNING | 2023-07-20 10:17:54,985 | Could not read SSH configuration file: ~/.ssh/config
2023-07-20 10:17:54,987| INF | MainThrea/1060@sshtunnel | 1 keys loaded from agent
INFO | 2023-07-20 10:17:54,987 | 1 keys loaded from agent
2023-07-20 10:17:54,988| INF | MainThrea/1117@sshtunnel | 1 key(s) loaded
INFO | 2023-07-20 10:17:54,988 | 1 key(s) loaded
2023-07-20 10:17:54,988| ERR | MainThrea/1314@sshtunnel | Password is required for key /export/lab/.ssh/mlw01.key
ERROR | 2023-07-20 10:17:54,988 | Password is required for key /export/lab/.ssh/mlw01.key
2023-07-20 10:17:54,988| INF | MainThrea/0978@sshtunnel | Connecting to gateway: xx.x.xxx.x:22 as user 'lab'
INFO | 2023-07-20 10:17:54,988 | Connecting to gateway: 172.17.10.110:22 as user 'lab'
2023-07-20 10:17:54,988| DEB | MainThrea/0983@sshtunnel | Concurrent connections allowed: True
2023-07-20 10:17:54,989| DEB | MainThrea/1400@sshtunnel | Trying to log in with key: b'asdWEQWEQWe'
2023-07-20 10:17:55,012| DEB | MainThrea/1204@sshtunnel | Transport socket info: (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0), timeout=0.1
2023-07-20 10:17:55,043| INF | Thread-1/1893@transport | Connected (version 2.0, client OpenSSH_7.6p1)
INFO | 2023-07-20 10:17:55,043 | Connected (version 2.0, client OpenSSH_7.6p1)
2023-07-20 10:17:55,278| INF | Thread-1/1893@transport | Authentication (publickey) successful!
INFO | 2023-07-20 10:17:55,278 | Authentication (publickey) successful!
2023-07-20 10:17:55,279| ERR | MainThrea/1230@sshtunnel | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-07-20 10:17:55,279 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
2023-07-20 10:17:55,280| WAR | MainThrea/1032@sshtunnel | Could not read SSH configuration file: ~/.ssh/config
WARNING | 2023-07-20 10:17:55,280 | Could not read SSH configuration file: ~/.ssh/config
2023-07-20 10:17:55,282| INF | MainThrea/1060@sshtunnel | 1 keys loaded from agent
INFO | 2023-07-20 10:17:55,282 | 1 keys loaded from agent
2023-07-20 10:17:55,282| INF | MainThrea/1117@sshtunnel | 1 key(s) loaded
INFO | 2023-07-20 10:17:55,282 | 1 key(s) loaded
2023-07-20 10:17:55,283| ERR | MainThrea/1314@sshtunnel | Password is required for key /export/lab/.ssh/mlw01.key
ERROR | 2023-07-20 10:17:55,283 | Password is required for key /export/lab/.ssh/mlw01.key
2023-07-20 10:17:55,283| INF | MainThrea/0978@sshtunnel | Connecting to gateway: 172.17.10.110:22 as user 'lab'
INFO | 2023-07-20 10:17:55,283 | Connecting to gateway: 172.17.10.110:22 as user 'lab'
2023-07-20 10:17:55,283| DEB | MainThrea/0983@sshtunnel | Concurrent connections allowed: True
2023-07-20 10:17:55,283| WAR | MainThrea/1618@sshtunnel | It looks like you didn't call the .stop() before the SSHTunnelForwarder obj was collected by the garbage collector! Running .stop(force=True)
WARNING | 2023-07-20 10:17:55,283 | It looks like you didn't call the .stop() before the SSHTunnelForwarder obj was collected by the garbage collector! Running .stop(force=True)
2023-07-20 10:17:55,284| INF | MainThrea/1374@sshtunnel | Closing all open connections...
INFO | 2023-07-20 10:17:55,284 | Closing all open connections...
2023-07-20 10:17:55,284| DEB | MainThrea/1378@sshtunnel | Listening tunnels: None
2023-07-20 10:17:55,284| WAR | MainThrea/1450@sshtunnel | Tunnels are not started. Please .start() first!
WARNING | 2023-07-20 10:17:55,284 | Tunnels are not started. Please .start() first!
2023-07-20 10:17:55,284| INF | MainThrea/1453@sshtunnel | Closing ssh transport
INFO | 2023-07-20 10:17:55,284 | Closing ssh transport
2023-07-20 10:17:55,284| DEB | MainThrea/1477@sshtunnel | Transport is closed
2023-07-20 10:17:55,285| DEB | MainThrea/1400@sshtunnel | Trying to log in with key: b'463095aa1803da78647cd548f37173ef'
2023-07-20 10:17:55,305| DEB | MainThrea/1204@sshtunnel | Transport socket info: (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0), timeout=0.1
2023-07-20 10:17:55,334| INF | Thread-3/1893@transport | Connected (version 2.0, client OpenSSH_7.6p1)
INFO | 2023-07-20 10:17:55,334 | Connected (version 2.0, client OpenSSH_7.6p1)
2023-07-20 10:17:55,578| INF | Thread-3/1893@transport | Authentication (publickey) successful!
INFO | 2023-07-20 10:17:55,578 | Authentication (publickey) successful!
2023-07-20 10:17:55,579| INF | Srv-50053/1433@sshtunnel | Opening tunnel: 0.0.0.0:50053 <> 127.0.0.1:50052
INFO | 2023-07-20 10:17:55,579 | Opening tunnel: 0.0.0.0:50053 <> 127.0.0.1:50052
INFO | 2023-07-20 10:17:55,580 | Checking server mlw-cluster
2023-07-20 10:17:55,814| TRA | Thread-5 /0360@sshtunnel | #1 <-- ('127.0.0.1', 44364) connected
2023-07-20 10:17:55,815| TRA | Thread-5 /0316@sshtunnel | >>> OUT #1 <-- ('127.0.0.1', 44364) send to ('127.0.0.1', 50052): b'504f5354202f636865636b2f20485454502f312e310d0a486f73743a203132372e302e302e313a35303035330d0a557365722d4167656e743a20707974686f6e2d72657175657374732f322e33312e300d0a4163636570742d456e636f64696e673a20677a69702c206465666c6174650d0a4163636570743a202a2f2a0d0a436f6e6e656374696f6e3a206b6565702d616c6976650d0a436f6e74656e742d4c656e6774683a203330300d0a436f6e74656e742d547970653a206170706c69636174696f6e2f6a736f6e0d0a0d0a7b2264617461223a20227b5c6e202020205c226e616d655c223a205c227e2f6d6c772d636c75737465725c222c5c6e202020205c227265736f757263655f747970655c223a205c22636c75737465725c222c5c6e202020205c227265736f757263655f737562747970655c223a205c22436c75737465725c222c5c6e202020205c226970735c223a205b5c6e20202020202020205c223137322e31372e31302e3131305c225c6e202020205d2c5c6e202020205c227373685f63726564735c223a207b5c6e20202020202020205c227373685f757365725c223a205c226c61625c222c5c6e20202020202020205c227373685f707269766174655f6b65795c223a205c222f6578706f72742f6c61622f2e7373682f6d6c7730312e6b65795c225c6e202020207d5c6e7d227d' >>>
2023-07-20 10:17:55,816| TRA | Thread-5 /0333@sshtunnel | <<< IN #1 <-- ('127.0.0.1', 44364) recv: b'5353482d322e302d4f70656e5353485f372e367031205562756e74752d347562756e7475302e350d0a' <<<
INFO | 2023-07-20 10:17:55,816 | Server mlw-cluster is up, but the HTTP server may not be up.
INFO | 2023-07-20 10:17:55,817 | Restarting HTTP server on mlw-cluster.
INFO | 2023-07-20 10:17:55,817 | Running command on mlw-cluster: pkill -f "python -m runhouse.servers.http.http_server"
2023-07-20 10:17:55,817| TRA | Thread-5 /0311@sshtunnel | >>> OUT #1 <-- ('127.0.0.1', 44364) recv empty data >>>
2023-07-20 10:17:55,820| TRA | Thread-5 /0375@sshtunnel | #1 <-- ('127.0.0.1', 44364) connection closed.
INFO | 2023-07-20 10:17:56,571 | Running command on mlw-cluster: screen -dm bash -c 'python -m runhouse.servers.http.http_server |& tee -a ~/.rh/cluster_server_mlw-cluster.log 2>&1'
INFO | 2023-07-20 10:18:02,291 | Checking server mlw-cluster again.
2023-07-20 10:18:02,318| ERR | Thread-3/1893@transport | Secsh channel 1 open FAILED: Connection refused: Connect failed
ERROR | 2023-07-20 10:18:02,318 | Secsh channel 1 open FAILED: Connection refused: Connect failed
2023-07-20 10:18:02,318| TRA | Thread-14/0357@sshtunnel | #2 <-- ('127.0.0.1', 47456) open new channel ssh error: ChannelException(2, 'Connect failed')
2023-07-20 10:18:02,318| ERR | Thread-14/0394@sshtunnel | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-07-20 10:18:02,318 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
Traceback (most recent call last):
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
httplib_response = self._make_request(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 461, in _make_request
httplib_response = conn.getresponse()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 1375, in getresponse
response.begin()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 798, in urlopen
retries = retries.increment(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
httplib_response = self._make_request(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 461, in _make_request
httplib_response = conn.getresponse()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 1375, in getresponse
response.begin()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/export/lab/work/learn_runhouse/testmlw01.py", line 4, in <module>
cluster = rh.cluster(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/rns/hardware/cluster_factory.py", line 59, in cluster
return Cluster(ips=ips, ssh_creds=ssh_creds, name=name, dryrun=dryrun)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 60, in __init__
self.check_server()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 381, in check_server
self.client.check_server(cluster_config=cluster_config)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 48, in check_server
self.request(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 35, in request
response = req_fn(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/api.py", line 115, in post
return request("post", url, data=data, json=json, **kwargs)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Versions
Please run the following and paste the output below.
wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Python Platform: Linux-5.19.0-46-generic-x86_64-with-glibc2.35
Python Version: 3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0]
Relevant packages:
boto3==1.28.6
fastapi==0.99.0
fsspec==2023.6.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.4.2
runhouse==0.0.9
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.1
wheel==0.38.4
SkyPilot collects usage data to improve its services. `setup` and `run` commands are not collected to ensure privacy.
Usage logging can be disabled by setting the environment variable SKYPILOT_DISABLE_USAGE_COLLECTION=1.
Checking credentials to enable clouds for SkyPilot.
AWS: disabled
Reason: AWS credentials are not set. Run the following commands:
$ pip install boto3
$ aws configure
$ aws configure list # Ensure that this shows identity is set.
For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
Details: `aws sts get-caller-identity` failed with error: [botocore.exceptions.NoCredentialsError] Unable to locate credentials.
Azure: disabled
Reason: ~/.azure/msal_token_cache.json does not exist. Run the following commands:
$ az login
$ az account set -s <subscription_id>
For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
GCP: disabled
Reason: GCP tools are not installed. Run the following commands:
$ pip install google-api-python-client
$ conda install -c conda-forge google-cloud-sdk -y
Credentials may also need to be set. Run the following commands:
$ gcloud init
$ gcloud auth application-default login
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#google-cloud-platform-gcp
Details: [builtins.ModuleNotFoundError] No module named 'googleapiclient'
Lambda: disabled
Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
https://cloud.lambdalabs.com/api-keys
to generate API key and add the line
api_key = [YOUR API KEY]
to ~/.lambda_cloud/lambda_keys
IBM: disabled
Reason: Missing credential file at /export/lab/.ibm/credentials.yaml.
Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
iam_api_key: <IAM_API_KEY>
resource_group_id: <RESOURCE_GROUP_ID>
SCP: disabled
Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
Generate API key and add the following line to ~/.scp/scp_credential:
access_key = [YOUR API ACCESS KEY]
secret_key = [YOUR API SECRET KEY]
project_id = [YOUR PROJECT ID]
OCI: disabled
Reason: `oci` is not installed. Install it with: pip install oci
For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
Cloudflare (for R2 object store): disabled
Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
$ pip install boto3
$ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
$ mkdir -p ~/.cloudflare
$ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2
SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
No existing clusters.
Managed spot jobs
No in progress jobs. (See: sky spot -h)
Additional context
Add any other context about the problem here.
When running runhouse start --screen, it shows an error like
python3 command was not found. Make sure you have python3 installed.
but when running without --screen, it works fine.
Versions
Please run the following and paste the output below.
wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Python Platform: Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-glibc2.17
Python Version: 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0]
Relevant packages:
boto3==1.33.11
fastapi==0.103.1
fsspec==2023.5.0
pyarrow==13.0.0
rich==13.5.2
runhouse==0.0.13
skypilot==0.4.0
sshfs==2023.10.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.38.4
Checking credentials to enable clouds for SkyPilot.
AWS: disabled
Reason: AWS credentials are not set. Run the following commands:
$ pip install boto3
$ aws configure
$ aws configure list # Ensure that this shows identity is set.
For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
Details: `aws sts get-caller-identity` failed with error: [botocore.exceptions.NoCredentialsError] Unable to locate credentials.
Azure: disabled
Reason: ~/.azure/msal_token_cache.json does not exist. Run the following commands:
$ az login
$ az account set -s <subscription_id>
For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
GCP: disabled
Reason: GCP tools are not installed. Run the following commands:
$ pip install google-api-python-client
$ conda install -c conda-forge google-cloud-sdk -y
Credentials may also need to be set. Run the following commands:
$ gcloud init
$ gcloud auth application-default login
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#google-cloud-platform-gcp
Details: [builtins.ModuleNotFoundError] No module named 'googleapiclient'
IBM: disabled
Reason: Missing credential file at /home/admins/.ibm/credentials.yaml.
Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
iam_api_key: <IAM_API_KEY>
resource_group_id: <RESOURCE_GROUP_ID>
Kubernetes: disabled
Reason: Credentials not found - check if ~/.kube/config exists.
Lambda: disabled
Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
https://cloud.lambdalabs.com/api-keys
to generate API key and add the line
api_key = [YOUR API KEY]
to ~/.lambda_cloud/lambda_keys
OCI: disabled
Reason: `oci` is not installed. Install it with: pip install oci
For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
SCP: disabled
Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
Generate API key and add the following line to ~/.scp/scp_credential:
access_key = [YOUR API ACCESS KEY]
secret_key = [YOUR API SECRET KEY]
project_id = [YOUR PROJECT ID]
Cloudflare (for R2 object store): disabled
Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
$ pip install boto3
$ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
$ mkdir -p ~/.cloudflare
$ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2
SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
No existing clusters.
Managed spot jobs
No in progress jobs. (See: sky spot -h)
Additional context
Full logs:
runhouse start --port 2222
INFO | 2023-12-11 02:29:30.713426 | NumExpr defaulting to 8 threads.
INFO | 2023-12-11 02:29:32.342877 | Using port: 2222.
INFO | 2023-12-11 02:29:32.343102 | Starting API server using the following command: /home/admins/miniconda3/bin/python3 -m runhouse.servers.http.http_server.
Executing `/home/admins/miniconda3/bin/python3 -m runhouse.servers.http.http_server --port 2222`
INFO | 2023-12-11 02:29:34.061997 | NumExpr defaulting to 8 threads.
INFO | 2023-12-11 02:29:36.233910 | Launching HTTP server on port: 2222.
INFO | 2023-12-11 02:29:36.234118 | Launching Runhouse API server with den_auth=False and use_local_telemetry=False on host: 0.0.0.0 and port: 32300
INFO: Started server process [15764]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:32300 (Press CTRL+C to quit)
^CINFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [15764]
runhouse start --port 2222 --screen
INFO | 2023-12-11 02:29:45.997178 | NumExpr defaulting to 8 threads.
INFO | 2023-12-11 02:29:46.455935 | Using port: 2222.
INFO | 2023-12-11 02:29:46.456143 | Starting API server using the following command: /home/admins/miniconda3/bin/python3 -m runhouse.servers.http.http_server.
Executing `screen -dm bash -c "/home/admins/miniconda3/bin/python3 -m runhouse.servers.http.http_server --port 2222 2>&1 | tee -a '/home/admins/.rh/server.log' 2>&1"`
python3 command was not found. Make sure you have python3 installed.
Integrate via REST API (slurmrestd)
A SlurmCluster subclass which can submit jobs to an existing Slurm cluster.
From SyncLinear.com | KIT-78
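A slurmrestd-based integration could look roughly like the sketch below. This is an assumption-laden illustration, not Runhouse's API: the SlurmCluster name and methods are hypothetical, and the endpoint path, payload shape, and auth headers follow slurmrestd's general conventions but vary by Slurm version, so they should be checked against the target deployment.

```python
# Hypothetical sketch: submit a batch job to an existing Slurm cluster through
# slurmrestd's HTTP API. Endpoint path and headers are assumptions based on
# slurmrestd conventions and must be verified for the target Slurm version.
import json
import urllib.request


class SlurmCluster:
    def __init__(self, base_url, user, token, api_version="v0.0.38"):
        self.base_url = base_url.rstrip("/")
        self.user = user
        self.token = token
        self.api_version = api_version

    def _submit_payload(self, script, name, env=None):
        # slurmrestd expects the batch script plus a job description object.
        return {
            "script": script,
            "job": {
                "name": name,
                "environment": env or {"PATH": "/usr/bin:/bin"},
                "current_working_directory": "/tmp",
            },
        }

    def submit(self, script, name, env=None):
        # POST the job description; returns the parsed JSON response
        # (which normally includes the new job id).
        payload = self._submit_payload(script, name, env)
        req = urllib.request.Request(
            f"{self.base_url}/slurm/{self.api_version}/job/submit",
            data=json.dumps(payload).encode(),
            headers={
                "Content-Type": "application/json",
                "X-SLURM-USER-NAME": self.user,
                "X-SLURM-USER-TOKEN": self.token,
            },
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
```

A slurmrestd integration would avoid needing SSH access to the head node at all, at the cost of requiring JWT auth to be configured on the cluster.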
Super cool. I have an existing PyTorch project that has over 100 .to(device) calls. Is there an easy way to transform our codebase to incorporate Runhouse, or should I manually change all my .to calls to accommodate Runhouse?
cc: @carolineechen @dongreenberg
Please help prioritize our roadmap! We have a long list of projects we'd like to complete to make Runhouse robust 🦾, comprehensive 🎨, and flexible 🙆♀️ across research and production usage. Please comment which items resonate for your use cases, or let us know if there are features we've missed!
Batching is critical for good compute utilization in ML. Assuming fn is written to accept a list of inputs, calling fn.batch(single_item, batch_size=10) should accumulate the inputs on the server and only call fn(list_of_items) when it has a full batch. Open questions:
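Setting the open questions aside, the accumulate-and-flush behavior described above could be sketched as follows. This is purely illustrative (the BatchAccumulator name and the synchronous flush are assumptions, not Runhouse's API); a real server-side version would also need a flush timeout and a way to route each caller its own slice of the batched results.

```python
# Illustrative sketch of server-side batching: queue single inputs and only
# invoke fn(list_of_items) once a full batch has been collected.
class BatchAccumulator:
    def __init__(self, fn, batch_size=10):
        self.fn = fn              # fn must accept a list of inputs
        self.batch_size = batch_size
        self._pending = []

    def batch(self, item):
        """Queue one item; returns the batch's results when it fills,
        or None while still accumulating."""
        self._pending.append(item)
        if len(self._pending) >= self.batch_size:
            return self.flush()
        return None

    def flush(self):
        """Run fn on whatever has accumulated (e.g. on a timeout tick)."""
        if not self._pending:
            return []
        items, self._pending = self._pending, []
        return self.fn(items)
```

One of the open questions above is what a caller holding a not-yet-full batch should receive; here the sketch simply returns None, but futures or async awaitables would be the natural server-side answer.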
From SyncLinear.com | KIT-71
Describe the bug
I'm having an issue when trying to start up a LangChain LLM. After setting up the cluster
gpu = rh.cluster('test', instance_type='T4:1', use_spot=False)
I attempt to create the llm that will run my inferences
from langchain.llms import SelfHostedHuggingFaceLLM
llm = SelfHostedHuggingFaceLLM(model_id='dolly-v2-2-8b', hardware=gpu, model_reqs=['pip:./', 'transformers', 'torch'])
My code appears to run into some error with creating / finding a file. Hoping you all would be able to support.
INFO | 2023-04-20 11:38:47,871 | Setting up Function on cluster.
INFO | 2023-04-20 11:38:47,884 | Upping the cluster test
I 04-20 11:38:53 optimizer.py:617] == Optimizer ==
I 04-20 11:38:53 optimizer.py:628] Target: minimizing cost
I 04-20 11:38:53 optimizer.py:640] Estimated cost: $0.5 / hour
I 04-20 11:38:53 optimizer.py:640]
I 04-20 11:38:53 optimizer.py:712] Considered resources (1 node):
I 04-20 11:38:53 optimizer.py:760] ---------------------------------------------------------------------------------------------------
I 04-20 11:38:53 optimizer.py:760] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 04-20 11:38:53 optimizer.py:760] ---------------------------------------------------------------------------------------------------
I 04-20 11:38:53 optimizer.py:760] Azure Standard_NC4as_T4_v3 4 28 T4:1 eastus 0.53 ✔
I 04-20 11:38:53 optimizer.py:760] ---------------------------------------------------------------------------------------------------
I 04-20 11:38:53 optimizer.py:760]
I 04-20 11:38:53 optimizer.py:775] Multiple Azure instances satisfy T4:1. The cheapest Azure(Standard_NC4as_T4_v3, {'T4': 1}) is considered among:
I 04-20 11:38:53 optimizer.py:775] ['Standard_NC4as_T4_v3', 'Standard_NC8as_T4_v3', 'Standard_NC16as_T4_v3'].
I 04-20 11:38:53 optimizer.py:775]
I 04-20 11:38:53 optimizer.py:781] To list more details, run 'sky show-gpus T4'.
I 04-20 11:38:53 cloud_vm_ray_backend.py:3327] Creating a new cluster: "test" [1x Azure(Standard_NC4as_T4_v3, {'T4': 1})].
I 04-20 11:38:53 cloud_vm_ray_backend.py:3327] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 04-20 11:38:58 cloud_vm_ray_backend.py:1156] To view detailed progress: tail -n100 -f C:\Users\stollbak\sky_logs\sky-2023-04-20-11-38-53-125409\provision.log
---------------------------------------------------------------------------
ScannerError Traceback (most recent call last)
File c:\Python310\lib\site-packages\sky\execution.py:266, in _execute(entrypoint, dryrun, down, stream_logs, handle, backend, retry_until_up, optimize_target, stages, cluster_name, detach_setup, detach_run, idle_minutes_to_autostop, no_setup, _is_launched_by_spot_controller)
265 if handle is None:
--> 266 handle = backend.provision(task,
267 task.best_resources,
268 dryrun=dryrun,
269 stream_logs=stream_logs,
270 cluster_name=cluster_name,
271 retry_until_up=retry_until_up)
273 if dryrun:
File c:\Python310\lib\site-packages\sky\utils\common_utils.py:241, in make_decorator.<locals>._record(*args, **kwargs)
240 with cls(full_name, **ctx_kwargs):
--> 241 return f(*args, **kwargs)
File c:\Python310\lib\site-packages\sky\utils\common_utils.py:220, in make_decorator.<locals>._wrapper.<locals>._record(*args, **kwargs)
219 with cls(name_or_fn, **ctx_kwargs):
--> 220 return f(*args, **kwargs)
File c:\Python310\lib\site-packages\sky\backends\backend.py:56, in Backend.provision(self, task, to_provision, dryrun, stream_logs, cluster_name, retry_until_up)
55 usage_lib.messages.usage.update_actual_task(task)
---> 56 return self._provision(task, to_provision, dryrun, stream_logs,
57 cluster_name, retry_until_up)
File c:\Python310\lib\site-packages\sky\backends\cloud_vm_ray_backend.py:2220, in CloudVmRayBackend._provision(self, task, to_provision, dryrun, stream_logs, cluster_name, retry_until_up)
2217 provisioner = RetryingVmProvisioner(
2218 self.log_dir, self._dag, self._optimize_target,
2219 self._requested_features, local_wheel_path, wheel_hash)
-> 2220 config_dict = provisioner.provision_with_retries(
2221 task, to_provision_config, dryrun, stream_logs)
2222 break
File c:\Python310\lib\site-packages\sky\utils\common_utils.py:241, in make_decorator.<locals>._record(*args, **kwargs)
240 with cls(full_name, **ctx_kwargs):
--> 241 return f(*args, **kwargs)
File c:\Python310\lib\site-packages\sky\backends\cloud_vm_ray_backend.py:1718, in RetryingVmProvisioner.provision_with_retries(self, task, to_provision_config, dryrun, stream_logs)
1715 to_provision.cloud.check_features_are_supported(
1716 self._requested_features)
-> 1718 config_dict = self._retry_zones(
1719 to_provision,
1720 num_nodes,
1721 requested_resources=task.resources,
1722 dryrun=dryrun,
1723 stream_logs=stream_logs,
1724 cluster_name=cluster_name,
1725 cloud_user_identity=cloud_user,
1726 prev_cluster_status=prev_cluster_status)
1727 if dryrun:
File c:\Python310\lib\site-packages\sky\backends\cloud_vm_ray_backend.py:1203, in RetryingVmProvisioner._retry_zones(self, to_provision, num_nodes, requested_resources, dryrun, stream_logs, cluster_name, cloud_user_identity, prev_cluster_status)
1202 try:
-> 1203 config_dict = backend_utils.write_cluster_config(
1204 to_provision,
...
1450 self._close_pipe_fds(p2cread, p2cwrite,
1451 c2pread, c2pwrite,
1452 errread, errwrite)
FileNotFoundError: [WinError 3] The system cannot find the path specified.
Versions
Python Platform: Windows-10-10.0.19044-SP0
Python Version: 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]
Relevant packages:
awscli==1.27.115
azure-cli==2.31.0
azure-cli-core==2.31.0
azure-cli-telemetry==1.0.6
azure-core==1.26.4
boto3==1.26.115
fsspec==2023.4.0
pyarrow==11.0.0
pycryptodome==3.12.0
rich==13.3.4
runhouse==0.0.5
skypilot==0.2.5
sshfs==2023.4.1
sshtunnel==0.4.0
typer==0.7.0
wheel==0.40.0
Checking credentials to enable clouds for SkyPilot.
AWS: disabled
Reason: AWS CLI is not installed properly. Run the following commands:
$ pip install skypilot[aws] Credentials may also need to be set. Run the following commands:
$ pip install boto3
$ aws configure
For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
Azure: enabled
GCP: disabled
Reason: GCP tools are not installed or credentials are not set. Run the following commands:
$ pip install google-api-python-client
$ conda install -c conda-forge google-cloud-sdk -y
$ gcloud init
$ gcloud auth application-default login
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html
Lambda: disabled
Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
https://cloud.lambdalabs.com/api-keys
to generate API key and add the line
api_key = [YOUR API KEY]
to ~/.lambda_cloud/lambda_keys
SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
No existing clusters.
Managed spot jobs
No in progress jobs. (See: sky spot -h)
Additional context
Add any other context about the problem here.
A simple use case is logging in with a system command instead of the Python API:
!runhouse login [TOKEN]
Currently, the CLI is hardcoded with interactive=True (line 27 in 560a528).
It's a minor quality-of-life improvement.
See above.
Excited to get Runhouse integration up on NatML 😄
(some basic notes below, feel free to edit/comment)
GHA Setup
Types of testing, split using pytest.mark
Test profiling
Refactoring
From SyncLinear.com | KIT-85
Tracking issue and design stub for k8s cluster.
From SyncLinear.com | KIT-77
Hi! Runhouse looks fantastic! I am thinking about using it for a few of my use cases.
Describe the bug
When I run the following code-snippet (taken verbatim from the docs), I get an error:
import runhouse as rh

def get_pid(a=0):
    import os
    return os.getpid() + int(a)

server_fn = rh.function(get_pid).to(rh.here)
print(server_fn.endpoint())
This is the error:
INFO | 2024-04-26 09:08:52.384504 | Sending module get_pid to local Runhouse daemon
Traceback (most recent call last):
File "/home/ubuntu/langchain_experiments_2024/02_runhouse/runrunhouse.py", line 7, in <module>
server_fn = rh.function(get_pid).to(rh.here)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/resources/functions/function.py", line 91, in to
return super().to(
^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/resources/module.py", line 450, in to
excluded_state_keys = list(new_module.config().keys()) + [
^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/resources/functions/function.py", line 176, in config
config = super().config(condensed)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/resources/module.py", line 114, in config
config["signature"] = self.signature(rich=True)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/resources/module.py", line 261, in signature
self._signature = self._compute_signature(rich=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/resources/module.py", line 239, in _compute_signature
return {
^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/resources/module.py", line 240, in <dictcomp>
name: self.method_signature(method) if rich else None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/resources/functions/function.py", line 121, in method_signature
return self.method_signature(self._get_obj_from_pointers(*self.fn_pointers))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/resources/module.py", line 369, in _get_obj_from_pointers
obj_store.imported_modules[module_name] = importlib.import_module(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 940, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/home/ubuntu/langchain_experiments_2024/02_runhouse/runrunhouse.py", line 7, in <module>
server_fn = rh.function(get_pid).to(rh.here)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/resources/functions/function.py", line 91, in to
return super().to(
^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/resources/module.py", line 470, in to
system.put_resource(new_module, state, dryrun=True)
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/resources/hardware/cluster.py", line 450, in put_resource
return obj_store.put_resource(serialized_data=data, env_name=env_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/servers/obj_store.py", line 1434, in put_resource
return sync_function(self.aput_resource)(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/utils.py", line 97, in wrapper
return future.result()
^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/utils.py", line 77, in _thread_coroutine
return loop.run_until_complete(coroutine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/servers/obj_store.py", line 1421, in aput_resource
return await self.acall_env_servlet_method(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/servers/obj_store.py", line 261, in acall_env_servlet_method
return await ObjStore.acall_actor_method(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/langchain/lib/python3.11/site-packages/runhouse/servers/obj_store.py", line 278, in acall_actor_method
return await getattr(actor, method).remote(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(AttributeError): ray::EnvServlet.aput_resource_local() (pid=4070735, ip=10.1.0.71, actor_id=6a27e532729d39d91d17627b05000000, repr=<runhouse.servers.env_servlet.EnvServlet object at 0x7f73567aea90>)
File "/home/ubuntu/miniconda3/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/lib/python3.11/site-packages/runhouse/servers/env_servlet.py", line 58, in wrapper
return handle_exception_response(e, traceback.format_exc(), serialization)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/lib/python3.11/site-packages/runhouse/servers/http/http_utils.py", line 165, in handle_exception_response
raise exception
File "/home/ubuntu/miniconda3/lib/python3.11/site-packages/runhouse/servers/env_servlet.py", line 39, in wrapper
output = await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/lib/python3.11/site-packages/runhouse/servers/env_servlet.py", line 113, in aput_resource_local
return await obj_store.aput_resource_local(resource_config, state, dryrun)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/lib/python3.11/site-packages/runhouse/servers/obj_store.py", line 1468, in aput_resource_local
resource = Resource.from_config(config=resource_config, dryrun=dryrun)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/lib/python3.11/site-packages/runhouse/resources/resource.py", line 292, in from_config
sys.modules["runhouse"], resource_type.capitalize(), None
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/lib/python3.11/site-packages/runhouse/resources/module.py", line 517, in __getattribute__
return super().__getattribute__(item)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Function' object has no attribute 'capitalize'
Sentry is attempting to send 2 pending events
Waiting up to 2 seconds
Press Ctrl-C to quit
Versions
Please run the following and paste the output below.
Python Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
Python Version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
Relevant packages:
fastapi==0.110.2
fsspec==2023.5.0
opentelemetry-instrumentation-fastapi==0.45b0
rich==13.7.1
runhouse==0.0.25
sshfs==2023.4.1
typer==0.12.3
uvicorn==0.29.0
wheel==0.41.2
The feature
I wonder if you have any plans to add features and interfaces that allow Runhouse to manage GPU devices on a local network (not on a cloud provider)?
Motivation
I need to deploy to on-premise devices instead of relying entirely on cloud devices.
A new PostgresTable Table subclass. Maybe we should have a SQLTable subclass which defaults to DuckDB, and then supports Postgres, MySQL, SQLite, and others?
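A possible shape for such a subclass, sketched against any DB-API 2.0 connection. The SQLTable name and methods here are hypothetical, and sqlite3 stands in for DuckDB below only because it ships with Python; the same shape would accept a duckdb.connect() or a Postgres connection.

```python
# Hypothetical sketch of a SQLTable resource over a DB-API 2.0 connection.
# Defaults to an in-memory sqlite3 database purely for illustration; the
# issue proposes DuckDB as the real default backend.
import sqlite3


class SQLTable:
    def __init__(self, table_name, connection=None):
        self.table_name = table_name
        # Any DB-API connection works here: duckdb, psycopg2, sqlite3, ...
        self.conn = connection or sqlite3.connect(":memory:")

    def write(self, rows, columns):
        # Create the table on first write, then bulk-insert the rows.
        cols = ", ".join(columns)
        placeholders = ", ".join("?" for _ in columns)
        self.conn.execute(f"CREATE TABLE IF NOT EXISTS {self.table_name} ({cols})")
        self.conn.executemany(
            f"INSERT INTO {self.table_name} ({cols}) VALUES ({placeholders})", rows
        )
        self.conn.commit()

    def fetch(self):
        return self.conn.execute(f"SELECT * FROM {self.table_name}").fetchall()
```

Keeping the backend behind a plain connection object is what would let one subclass cover DuckDB, Postgres, MySQL, and SQLite with only dialect-level differences.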
From SyncLinear.com | KIT-76
Please provide an example of how to launch Runhouse on my local server.
Instead of using AWS, GCP, etc.: I don't have any cloud accounts, but I do have a V100 GPU on my local server.
I would like to know how to set up the server and client sides and how they interact. I have been trying all the examples from
https://github.com/run-house/tutorials, but none of them worked.
Please give more setup instructions and guidance.
Here is what I hit when I was trying to setup on-prem cluster:
$ cat rh.py
import runhouse as rh
from diffusers import StableDiffusionPipeline

def sd_generate(prompt):
    model = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base").to("cpu")
    return model(prompt).images[0]

gpu = rh.cluster(ips=['127.0.0.1'],
                 ssh_creds={'ssh_user': 'htang', 'ssh_private_key': '/home/htang/.ssh/id_rsa'},
                 name='rh-cluster')

sd_generate = rh.function(sd_generate).to(gpu, reqs=["./", "torch", "diffusers"])
img = sd_generate("An oil painting of Keanu Reeves eating a sandwich.")
print(type(img))
img.save("sd.png")
img.show()
$ python rh.py
INFO | 2023-05-28 04:11:35,284 | Loaded Runhouse config from /home/ytang/.rh/config.yaml
INFO | 2023-05-28 04:11:36,858 | Running command on rh-cluster: ray start --head
INFO | 2023-05-28 04:11:37,663 | Running command on rh-cluster: mkdir -p ~/.rh; touch /.rh/cluster_config.yaml; echo '{"name": "/rh-cluster", "resource_type": "cluster", "resource_subtype": "Cluster", "ips": ["127.0.0.1"], "ssh_creds": {"ssh_user": "ytang", "ssh_private_key": "/home/ytang/.ssh/id_rsa"}}' > ~/.rh/cluster_config.yaml
INFO | 2023-05-28 04:11:37,773 | Setting up Function on cluster.
INFO | 2023-05-28 04:11:38,044 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-28 04:11:38,105 | Authentication (publickey) successful!
INFO | 2023-05-28 04:11:38,361 | Running command on rh-cluster: ray start --head
INFO | 2023-05-28 04:11:39,023 | Running command on rh-cluster: mkdir -p ~/.rh; touch /.rh/cluster_config.yaml; echo '{"name": "/rh-cluster", "resource_type": "cluster", "resource_subtype": "Cluster", "ips": ["127.0.0.1"], "ssh_creds": {"ssh_user": "ytang", "ssh_private_key": "/home/ytang/.ssh/id_rsa"}}' > ~/.rh/cluster_config.yaml
INFO | 2023-05-28 04:11:39,137 | Copying local package scripts to cluster
INFO | 2023-05-28 04:11:39,327 | Installing packages on cluster rh-cluster: ['./', 'torch', 'diffusers']
Traceback (most recent call last):
File "/home/ytang/scripts/./rh.py", line 15, in
sd_generate = rh.function(sd_generate).to(gpu, reqs=["./", "torch", "diffusers"])
File "/home/ytang/.local/lib/python3.10/site-packages/runhouse/rns/function.py", line 119, in to
new_function.system.install_packages(new_function.reqs)
File "/home/ytang/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 205, in install_packages
self.client.install_packages(to_install)
File "/home/ytang/.local/lib/python3.10/site-packages/runhouse/servers/grpc/unary_client.py", line 59, in install_packages
server_res = self.stub.InstallPackages(message)
File "/home/ytang/.local/lib/python3.10/site-packages/grpc/_channel.py", line 946, in call
return _end_unary_response_blocking(state, call, False, None)
File "/home/ytang/.local/lib/python3.10/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1685247106.845643418","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1685247106.845642647","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
Right now we support two kinds of RNS stores for saving and loading - the Runhouse RNS and the git repo. MLFlow has a high degree of flexibility in the storage backends users can persist their logs and experiments to, and many DS teams already have these stores set up. MLFlow only provides first-class support for models as saved and loaded primitives from the store (which is funny, because "models" are a primitive we specifically don't support, on purpose). Today, people are saving and loading other infrastructure metadata as free-form strings (e.g. s3 paths), but this has significant limitations:
I think there are a few possible APIs we can provide here:
1. When the user calls rh.Table.from_name("bert_dropout_v5"), we'll be going to MLFlow first to get the full RNS path for that resource, and then to Runhouse to fetch the resource itself. We could also support an API to pull a dict of all available resources for an experiment at once.
2. An mlflow.runhouse integration (or model type) which facilitates saving and loading of Runhouse resources. This would allow saving and loading of resources in a familiar way to MLFlow users, but would also add a lot of new non-model things into the users' model registry.
The easiest way to think about the user journey is like this (showing a notebook-centric workflow in a system like Databricks just to stress the assumptions, but this would all work even more simply in a git+IDE setting): load the resource by name (e.g. rh.Function.from_name("yolo_v5_training_dropout")), copy out the full logic (including functions) from the notebook, or copy out the logic and flow but move the reusable functions into a shared git repo and import them into the script.
Cc @rmehyde, @ankmathur96
From SyncLinear.com | KIT-80
For now, we can do this server-side; it'll only be surfaced if stream_logs=True.
Important context: ray-project/ray#5554
From SyncLinear.com | KIT-72
Please add support for Python 3.11
This is a follow-up to a separate offline discussion about API feedback.
Since the typos were fixed in this commit by @carolineechen , there is no need to submit a fix-up on my end.
The open issues are with the rendering of certain inline markup on the Runhouse website (v. latest), for example:
Docstrings under Package Factory Method, Blob Factory Method:
Page Secrets in Vault
The rendering is correct on the local build.
I also ruled out any conflicting or overriding configurations in the files below:
- docs/conf.py, for any conflicting Sphinx extension or HTML theme
- .readthedocs.yaml under the project root, for any configuration overrides
I was not able to cross-reference the doc built from the main branch, which was "404 not found" at the time of submitting this issue.
See if those issues persist the next time we build the remote doc.
Basic API ideas (WIP):
Create Run object (captures logs, inputs, outputs, other artifacts read or written within call, who ran, where):
res = fn(**kwargs, name="my_run")
A run is a folder (created inside local rh directory by default), and can be sent elsewhere to persist logs, results, artifact info, etc.:
rh.run(name="my_run").to("s3", path="runhouse/nlp_team/bert_ft/results")
Ideally, we can have a "default log store" setting in the user config so the logs from their runs can be sent to the same place by default when they save, rather than having to send each run one by one.
This could be the way for users to configure for artifacts/logs to flow to an existing MLFlow store, or to flow to W&B, Grafana, Datadog, etc.
Save the run to local or RNS (not all runs need to be saved)
rh.run(name="my_run").save()
Creates a run object by tracing the activity within the block - no inputs and outputs, but captures logs (perhaps several logfiles for different calls) and artifacts used:
with rh.run(name="my_run") as r:
Big feature, essentially the same as auto-caching in orchestrators - check if this run was already completed, and load results if so, otherwise run:
res = fn.get_or_run(name="yelp_review_preproc_test")
Create/name a CLI run:
r = my_cluster.run(["python test_bert.py --gpus 4 --model distilbert"], name="test_distilbert_ddp")
Inspiration: this MLFlow example
We can also support event (failure or completion) notifications through knocknock or pagerduty!
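The get_or_run auto-caching idea above could be sketched as below, treating a run as a folder and reloading a named run's result if it already exists. The function name matches the proposal, but the implementation is purely illustrative (a hypothetical .rh_runs directory, JSON-serializable results only).

```python
# Illustrative sketch of get_or_run-style caching: check whether a named run
# already completed, load its persisted result if so, otherwise run and save.
import json
from pathlib import Path


def get_or_run(fn, name, run_dir=".rh_runs", **kwargs):
    result_path = Path(run_dir) / name / "result.json"
    if result_path.exists():
        # Cache hit: skip re-running, exactly like auto-caching in orchestrators.
        return json.loads(result_path.read_text())
    result = fn(**kwargs)
    result_path.parent.mkdir(parents=True, exist_ok=True)
    result_path.write_text(json.dumps(result))
    return result
```

Because the run is just a folder, the same persisted results could later be sent to S3 or another log store with the .to() call sketched earlier in this proposal.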
Cc @Caroline
From SyncLinear.com | KIT-67