google-deepmind / launchpad

License: Apache License 2.0


launchpad's Introduction

Launchpad

Launchpad is a library that simplifies writing distributed programs by seamlessly launching them on a variety of different platforms. Switching between local and distributed execution requires only a flag change.

Launchpad introduces a programming model that represents a distributed system as a graph data structure (a Program) describing the system’s topology. Each node in the program graph represents a service in the distributed system, i.e. the fundamental unit of computation that we are interested in running. As nodes are added to this graph, Launchpad constructs a handle for each of them. A handle ultimately represents a client to the yet-to-be-constructed service. A directed edge in the program graph, representing communication between two services, is created when the handle associated with one node is given to another at construction time. This edge originates from the receiving node, indicating that the receiving node will be the one initiating communication. This process allows Launchpad to define cross-service communication simply by passing handles to nodes.

Launchpad provides a number of node types, including:

PyNode - a simple node executing provided Python code upon entry. It is similar to a main function, but with the distinction that each node may be running in separate processes and on different machines.
CourierNode - enables cross-node communication. CourierNodes can communicate by calling public methods on each other either synchronously or asynchronously via futures. The underlying remote procedure calls are handled transparently by Launchpad.
ReverbNode - exposes the functionality of Reverb, an easy-to-use data storage and transport system primarily used by RL algorithms as an experience replay. You can read more about Reverb here.
MultiThreadingColocation - colocates multiple other nodes in a single process.
MultiProcessingColocation - colocates multiple other nodes as subprocesses.

Using Launchpad involves writing nodes and defining the topology of your distributed program by passing to each node references of the other nodes that it can communicate with. The core data structure dealing with this is called a Launchpad program, which can then be executed seamlessly with a number of supported runtimes.
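
The handle-passing model described above can be illustrated with a toy sketch that uses only the Python standard library. This is not Launchpad's implementation or API: ToyProgram, ToyHandle and add_node are hypothetical names, and the sketch only records the graph instead of launching anything.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)
class ToyHandle:
    """Stands in for a client to a service that has not been built yet."""
    label: str


@dataclass
class ToyProgram:
    """Hypothetical program graph: nodes plus directed communication edges."""
    name: str
    nodes: List[str] = field(default_factory=list)
    edges: List[Tuple[str, str]] = field(default_factory=list)

    def add_node(self, label: str, constructor: Callable[..., Any],
                 *args: Any) -> ToyHandle:
        # The constructor is not called here; a real launcher would call it
        # later, inside whichever process or thread runs the node.
        self.nodes.append(label)
        # Every handle passed at construction time becomes an edge that
        # originates from the receiving node (the one that initiates calls).
        for arg in args:
            if isinstance(arg, ToyHandle):
                self.edges.append((label, arg.label))
        return ToyHandle(label)


class Producer:
    pass


class Consumer:
    def __init__(self, producer):
        self._producer = producer


program = ToyProgram("producer_consumer")
producer_handle = program.add_node("producer", Producer)
program.add_node("consumer", Consumer, producer_handle)
print(program.edges)  # [('consumer', 'producer')]
```

The single edge points from the consumer to the producer because the consumer received the producer's handle, matching the rule that the receiving node initiates communication.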

Supported launch types

Launchpad supports a number of launch types for running programs on a single machine, in a distributed manner, or as a test. The launch type can be set with the launch_type argument of lp.launch, or with the --lp_launch_type command line flag. Please refer to the documentation of LaunchType for details.

Installation
Quick Start
Citing Launchpad
Acknowledgements
Other resources

Installation

Please keep in mind that Launchpad is not hardened for production use, and while we do our best to keep things in working order, things may break or segfault.

⚠️ Launchpad currently only supports Linux based OSes.

The recommended way to install Launchpad is with pip. We also provide instructions to build from source using the same docker images we use for releases.

TensorFlow can be installed separately or as part of the pip install. Installing TensorFlow as part of the install ensures compatibility.

$ pip install dm-launchpad[tensorflow]

# Without Tensorflow install and version dependency check.
$ pip install dm-launchpad

Nightly builds

$ pip install dm-launchpad-nightly[tensorflow]

# Without Tensorflow install and version dependency check.
$ pip install dm-launchpad-nightly

Similarly, Reverb can be installed in a way that ensures compatibility:

$ pip install dm-launchpad[reverb]

Develop Launchpad inside a docker container

The most convenient way to develop Launchpad is with Docker. This way you can compile and test Launchpad inside a container without having to install anything on your host machine, while you can still use your editor of choice for making code changes. The steps are as follows.

Check out Launchpad's source code from GitHub.

$ git clone https://github.com/deepmind/launchpad.git
$ cd launchpad

Build the Docker container to be used for compiling and testing Launchpad. You can specify the tensorflow_pip parameter to set the version of TensorFlow to build against. You can also specify which version(s) of Python the container should support. The command below enables support for Python 3.7, 3.8, 3.9 and 3.10.

$ docker build --tag launchpad:devel \
  --build-arg tensorflow_pip=tensorflow==2.3.0 \
  --build-arg python_version="3.7 3.8 3.9 3.10" - < docker/build.dockerfile

The next step is to enter the built Docker image, binding the checked-out Launchpad sources to /tmp/launchpad within the container.

$ docker run --rm --mount "type=bind,src=$PWD,dst=/tmp/launchpad" \
  -it launchpad:devel bash

At this point you can build and install Launchpad within the container by executing:

$ /tmp/launchpad/oss_build.sh

By default it builds the Python 3.8 version; you can change that with the --python flag.

$ /tmp/launchpad/oss_build.sh --python 3.8

To make sure the installation was successful and Launchpad works as expected, you can run some of the provided examples:

$ python3.8 -m launchpad.examples.hello_world.launch
$ python3.8 -m launchpad.examples.consumer_producers.launch --lp_launch_type=local_mp

To make changes to the Launchpad codebase, edit the sources checked out from GitHub directly on your host machine (outside of the Docker container). All changes are visible inside the Docker container. To recompile, just run the oss_build.sh script again from the Docker container. To reduce compilation time on subsequent runs, do not exit the Docker container.

Citing Launchpad

If you use Launchpad in your work, please cite the accompanying technical report:

@article{yang2021launchpad,
    title={Launchpad: A Programming Model for Distributed Machine Learning
           Research},
    author={Fan Yang and Gabriel Barth-Maron and Piotr Stańczyk and Matthew
            Hoffman and Siqi Liu and Manuel Kroiss and Aedan Pope and Alban
            Rrustemi},
    year={2021},
    journal={arXiv preprint arXiv:2106.04516},
    url={https://arxiv.org/abs/2106.04516},
}

Acknowledgements

We greatly appreciate all the help from the Reverb and TF-Agents teams in setting up the build and test infrastructure for Launchpad.

Other resources


launchpad's Issues

Pip Installation Error

I can't for the life of me get launchpad to install on my linux machine. With a fresh environment and via pip install "dm-launchpad[tensorflow]", I get the following error when trying to import launchpad:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zakka/miniconda/envs/test/lib/python3.8/site-packages/launchpad/__init__.py", line 36, in <module>
    from launchpad.nodes.courier.node import CourierHandle
  File "/home/zakka/miniconda/envs/test/lib/python3.8/site-packages/launchpad/nodes/courier/node.py", line 21, in <module>
    import courier
  File "/home/zakka/miniconda/envs/test/lib/python3.8/site-packages/courier/__init__.py", line 26, in <module>
    from courier.python.client import Client  # pytype: disable=import-error
  File "/home/zakka/miniconda/envs/test/lib/python3.8/site-packages/courier/python/client.py", line 30, in <module>
    from courier.python import py_client
ImportError: libpython3.8.so.1.0: cannot open shared object file: No such file or directory

Error in writing tests for dm-launchpad-nightly==0.3.0.dev20210802

Hi,
I was recently running my test cases and encountered the following error.

E absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --lp_termination_notice_secs before flags were parsed.

I didn't have this issue with dm-launchpad-nightly==0.3.0.dev20210728 so I suspect something was changed in the last few days.

Here is a snippet of my test case. I can include the other relevant files if they are helpful for debugging.

#!/usr/bin/env python3
"""Integration test for the distributed agent."""

from typing import Optional

from absl.testing import absltest
import acme
from acme.testing import fakes
import launchpad as lp
import numpy as np

from magi.agents.impala import agent_distributed as impala_lib
from magi.agents.impala import config
from magi.agents.impala.agent_test import MyNetwork


def network_factory(spec):
    def forward_fn(x, s):
        model = MyNetwork(spec.num_values)
        return model(x, s)

    def initial_state_fn(batch_size: Optional[int] = None):
        model = MyNetwork(spec.num_values)
        return model.initial_state(batch_size)

    def unroll_fn(inputs, state, start_of_episode=None):
        model = MyNetwork(spec.num_values)
        return model.unroll(inputs, state, start_of_episode)

    return {
        "forward": forward_fn,
        "unroll": unroll_fn,
        "initial_state": initial_state_fn,
    }


class DistributedAgentTest(absltest.TestCase):
    """Simple integration/smoke test for the distributed agent."""

    def test_control_suite(self):
        """Tests that the agent can run on the control suite without crashing."""

        def environment_factory():
            environment = fakes.DiscreteEnvironment(
                num_actions=5,
                num_observations=10,
                obs_shape=(5, 10),
                obs_dtype=np.float32,
                episode_length=10,
            )
            return environment

        agent = impala_lib.DistributedIMPALA(
            environment_factory=lambda seed, test: environment_factory(),
            network_factory=network_factory,
            num_actors=2,
            config=config.IMPALAConfig(
                sequence_length=4, sequence_period=4, max_queue_size=1000
            ),
            max_actor_steps=None,
        )
        program = agent.build()

        (learner_node,) = program.groups["learner"]
        learner_node.disable_run()

        lp.launch(program, launch_type="test_mt")

        learner: acme.Learner = learner_node.create_handle().dereference()

        for _ in range(5):
            learner.step()


if __name__ == "__main__":
    import tensorflow as tf

    tf.config.experimental.set_visible_devices([], "GPU")
    absltest.main()

I am using pytest and pytest-xdist to run the test cases. I checked, and running without pytest works fine. I suspect the issue is that, without absl.app.run(), the flags are never parsed, which triggers the error. This is not desirable, as users may want to use launchpad without absl. It would be great if the configuration could be done without absl flags.

Feature request: Support Python installed in non-system locations natively.

Hi,

Thank you so much for providing this awesome library. I have been using it extensively with Acme and Reverb and found it great for my daily research.

Right now, when using launchpad and reverb with a Python that was not installed via the OS's package manager (for example, with Conda), both Launchpad and Reverb throw an error. Here is an example stack trace, tested with launchpad 0.5.0.

python -c "import launchpad"
# Traceback (most recent call last):
#  File "<string>", line 1, in <module>
#  File "/opt/conda/lib/python3.8/site-packages/launchpad/__init__.py", line 36, in <module>
#   from launchpad.nodes.courier.node import CourierHandle
#  File "/opt/conda/lib/python3.8/site-packages/launchpad/nodes/courier/node.py", line 21, in <module>
#    import courier
#  File "/opt/conda/lib/python3.8/site-packages/courier/__init__.py", line 26, in <module>
#    from courier.python.client import Client  # pytype: disable=import-error
#  File "/opt/conda/lib/python3.8/site-packages/courier/python/client.py", line 30, in <module>
#    from courier.python import py_client
# ImportError: libpython3.8.so.1.0: cannot open shared object file: No such file or directory

This happens because Conda installs its copy of libpython3.8.so.1.0 in /opt/conda/lib, which is not in the linker's default search path for shared libraries. Setting LD_LIBRARY_PATH to /opt/conda/lib fixes the problem, but it is a hurdle for new users who want to install launchpad and start using it right away. This issue has been reported previously in tensorflow/agents#724 and google-deepmind/acme#47, where the suggested solution was also to tweak LD_LIBRARY_PATH.
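
As a rough aid for the LD_LIBRARY_PATH workaround, the directory of the running interpreter's shared libpython can be looked up with the standard sysconfig module. This is a minimal sketch, not part of Launchpad: libpython_dir is a hypothetical helper, and it assumes the build exposes the LIBDIR and INSTSONAME (or LDLIBRARY) configuration variables, which is typical on Linux but not guaranteed.

```python
import os
import sysconfig
from typing import Optional


def libpython_dir() -> Optional[str]:
    """Returns the directory holding this interpreter's libpython, if found."""
    libdir = sysconfig.get_config_var("LIBDIR")
    # INSTSONAME is usually the versioned name, e.g. libpython3.8.so.1.0.
    soname = (sysconfig.get_config_var("INSTSONAME")
              or sysconfig.get_config_var("LDLIBRARY"))
    if not libdir or not soname:
        return None
    if not os.path.exists(os.path.join(libdir, soname)):
        return None
    return libdir


directory = libpython_dir()
if directory is None:
    print("No libpython found next to this interpreter.")
else:
    # Prepends the directory without leaving a trailing colon when
    # LD_LIBRARY_PATH is currently unset.
    print(f'export LD_LIBRARY_PATH="{directory}'
          '${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"')
```

Running this with the Conda interpreter from the trace above would be expected to print an export line for /opt/conda/lib, but that depends on how the interpreter was built.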

Looking closer at the shared library produced by bazel, it seems that the rpath does not include a relative path to Conda's lib directory. Inspecting the shared library:

readelf --dynamic /opt/conda/lib/python3.8/site-packages/courier/python/py_client.so
# 0x000000000000001d (RUNPATH)            Library runpath: [$ORIGIN/../../_solib_k8/_U_S_Scourier_Ccourier_Uservice_Ucc_Uproto___Ucourier:$ORIGIN/../../_solib_k8/_U_S_Scourier_Sserialization_Cserialization_Ucc_Uproto___Ucourier_Sserialization:$ORIGIN/../../_solib_k8/_U@tensorflow_Usolib_S_S_Cframework_Ulib___Utensorflow_Usolib_Stensorflow_Usolib:$ORIGIN/../../_solib_k8/_U@python_Uincludes_S_S_Cpython_Uincludes___Upython_Uincludes:$ORIGIN/:$ORIGIN/..]

Tweaking the LD_LIBRARY_PATH is perhaps not ideal. I have been looking into whether the BUILD files can be updated to fix this problem, but my limited knowledge of Bazel does not get me very far. The closest places that I believe would need updating are https://github.com/deepmind/launchpad/blob/3a7d71d32a4f961b5835cc4256ea461880a6bc10/launchpad/build_defs.bzl#L427 and https://github.com/deepmind/launchpad/blob/3a7d71d32a4f961b5835cc4256ea461880a6bc10/launchpad/build_defs.bzl#L499

Perhaps including an additional level up when generating the rpath would fix the problem, but I am not sure whether doing so would have an adverse effect. In contrast, I have not encountered this issue with either tensorflow or jax (pip installed into my conda environment instead of using the conda package manager), which also use pybind, so I suspect they must have done something to solve this. I have also cross-referenced with https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tensorflow.bzl but I couldn't figure out what they did differently. The way TF sets up pybind extensions is very similar to what's done in Launchpad, so I don't know why there is a difference in behavior. JAX uses TF's build defs for the pybind extensions, so I assume the solution to this problem exists in TF's Bazel configuration.

It would be great if both Launchpad and Reverb can support this use case by default. If you can point me in the right direction I am also happy to create a PR to resolve this.

Thanks!

PythonProcess LaunchConfig does not work with the args.

Hi, currently, trying to use the args argument of the multi_processing PythonProcess throws an error because of a call to

https://github.com/deepmind/launchpad/blob/master/launchpad/nodes/python/flags_utils.py#L28

which includes the line.

FLAGS['lp_dummy_config'].flag_type(): {},

The lp_dummy_config flag is not defined. I suspect this happened because the OSS version does not export this flag. Is it possible to fix this issue?

Best,
Yicheng

CourierNode can not run in distributed mode

CourierNode is described as enabling cross-node communication. However, I find that it runs only in local mode.

Example of how to specify local resources.

Hi there, thanks for releasing this! Really appreciated :)

I was just wondering, is there perhaps an example for how to specify the local resources for a launch? I can see that launch allows for an optional Dict[str, Any] to set up the resource configs. I'm guessing here the str keys refer to the nodes in the program but I'm not sure how to specify the "per-node launch configuration" for each node. Would this also be a dictionary, e.g. {"gpus": 2, "cpus": 4}? Apologies if there are examples of this that I missed.

Strange info when running launchtype = LOCAL_MULTI_PROCESSING

Launchpad prints this message when I spawn multiple local processes:

b"Failed to register: Unable to acquire bus name 'launchpad.locallaunch.cacbafcebhbdba'\n"

What causes this?

Force all nodes to quit when exception happens in one program

Hello, is there a way to get all nodes to stop when one node gets an exception?
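
One generic pattern for this, shown here with plain threads rather than Launchpad nodes, is to wrap each node body so that an exception sets a shared stop signal that the other nodes poll. The names guarded, looping_node and failing_node are hypothetical. In a Launchpad node the except branch could instead call lp.stop(), which appears in a later issue on this page, but whether that fits a given launch type is not verified here.

```python
import threading
import time

stop_event = threading.Event()


def guarded(fn):
    """Wraps a node body so any exception asks every other node to stop."""
    def wrapper():
        try:
            fn(stop_event)
        except Exception:
            # One node failed: signal all cooperating nodes to shut down.
            stop_event.set()
    return wrapper


def failing_node(stop):
    raise RuntimeError("simulated failure")


def looping_node(stop):
    # A well-behaved node checks the stop signal between units of work.
    while not stop.is_set():
        time.sleep(0.01)


threads = [threading.Thread(target=guarded(body))
           for body in (looping_node, failing_node)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join(timeout=5)

print("all nodes stopped:", all(not t.is_alive() for t in threads))
```

This only works when every node cooperates by checking the signal; a node blocked in a long call would need a timeout or a process-level kill.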

Update launchpad releases

Hi,
Currently it seems that the latest version of launchpad on PyPI is still built against tensorflow~=2.8.0 and an older version of reverb. This prevents users from updating to a later reverb version (as well as tensorflow>2.8). Is it possible to release a new version of launchpad that builds against the latest Reverb release?

Best,
Yicheng

Cannot set certain XLA_ARGS for `PythonProcess`

When using local_mp, each process that uses jax spawns a huge number of threads. I'm running 128 actors, and each one spawns ~500 threads, meaning the program spawns over 50,000 threads!

This puts me over the ulimit for my university cluster, and I suspect it isn't performant either. The recommended solution is to set XLA_FLAGS="--xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=1". But for some reason this isn't working with PythonProcess. Here's the PythonProcess for each of my nodes:

      PythonProcess(env={
        "CUDA_VISIBLE_DEVICES": str(-1),
        "XLA_FLAGS": "--xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=1",
      })

This results in the error bash: line 1: XLA_FLAGS=--xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=1: command not found in each process that uses a local resource with those envs. Why is the environment variable being treated as a command here? I've also tried enclosing the value in quotes, which did not work. Thank you!
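
One plausible explanation, offered as a guess from the error message and not confirmed against Launchpad's source, is that each KEY=value pair is shell-quoted as a whole word before being placed in front of the command. A value without spaces needs no quotes and works, while a value with a space yields a fully quoted word that bash no longer recognizes as a variable assignment and therefore looks up as a command name. The sketch below uses only the standard shlex module to show the difference.

```python
import shlex

key = "XLA_FLAGS"
value = "--xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=1"

# Quoting the whole assignment: the shell sees one quoted word, which is not
# a variable assignment, so it is treated as the name of a command.
whole_word = shlex.quote(f"{key}={value}")

# Quoting only the value keeps NAME= unquoted, so the shell treats the word
# as an environment assignment for the command that follows.
value_only = f"{key}={shlex.quote(value)}"

# A value without spaces is left untouched either way, which would explain
# why CUDA_VISIBLE_DEVICES in the same dictionary causes no trouble.
no_space = shlex.quote("CUDA_VISIBLE_DEVICES=-1")

print(whole_word)
print(value_only)
print(no_space)
```

If this guess is right, a workaround on the user side would be to set the variable from inside the node (for example via os.environ before importing jax) rather than through the env dictionary.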

Run one of the node in the main process

For launching with multi-process mode, is it possible to launch one of the nodes in the main process where the launchpad program is launched?

My use case is to integrate with logging in Weights and Biases https://wandb.ai/site, where I would like to create the logging context in the main process (to better preserve argv which is captured by W&B in logging my experiment runs).
For example, I have a script to launch an LP program

python scripts/lp_run_impala.py --config.learning_rate=0.001

The script creates an LP program that runs distributed IMPALA training. I created a sink node to which the actors and learners send information to be logged on W&B. I would like W&B to track the command line arguments I supplied to the training script, i.e. python scripts/lp_run_impala.py --config.learning_rate=0.001. This works when W&B is initialized in the main process. However, the sink node is currently launched in a separate process, so it captures the argv of the worker process, which is

<python with no main file> --data_file /tmp/tmpv29m1qx5/job.pkl --lp_task_id 0

I suspect that doing some sort of dereferencing of the node would solve the problem but I am not entirely sure of the details. I would really appreciate the launchpad maintainers' help in setting this up!

Many thanks in advance.
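
A possible workaround that does not require running the node in the main process is to capture sys.argv while the program is being built and pass it to the sink as a constructor argument. The sketch below is plain Python; LoggingSink and launcher_argv are hypothetical names. It assumes that constructor arguments given to a node are forwarded to the worker process, so that something like lp.CourierNode(LoggingSink, launcher_argv=sys.argv) would carry the original command line across.

```python
import sys
from typing import List, Sequence


class LoggingSink:
    """Hypothetical sink node that remembers the launcher's command line."""

    def __init__(self, launcher_argv: Sequence[str]):
        # Copied at construction time from the process that builds the
        # program, so it does not depend on the worker's own sys.argv.
        self._launcher_argv: List[str] = list(launcher_argv)

    def command_line(self) -> str:
        return " ".join(self._launcher_argv)


# Captured in the main process, before any worker process exists.
sink = LoggingSink(launcher_argv=sys.argv)
print(sink.command_line())
```

The sink can then hand command_line() to the logging backend explicitly instead of relying on what the backend reads from the worker's argv.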

Why is courier implemented with C++ ?

Hi.
I am wondering if there are any reasons that courier is implemented with C++ instead of Python ?
Thanks~

Run out of GPU memory when using Launchpad

Hi, I am running a distributed agent in Acme (/acme/examples/control/lp_local_d4pg.py) and I get the following error (ran out of GPU memory):

I0522 20:03:38.349611 140210148575040 node.py:61] Reverb client connecting to: localhost:18200
Traceback (most recent call last):
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/python/process_entry.py", line 80, in <module>
    app.run(main)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/python/process_entry.py", line 75, in main
    functions[task_id]()
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/python/node.py", line 71, in _construct_function
    return functools.partial(self._function, *args, **kwargs)()
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/courier/node.py", line 106, in run
    instance = self._construct_instance()  # pytype:disable=wrong-arg-types
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/python/node.py", line 164, in _construct_instance
    return self._constructor(*args, **kwargs)
  File "/home/lorenzo/git/acme/acme/agents/tf/d4pg/agent_distributed.py", line 149, in actor
    networks = self._network_factory(self._environment_spec.actions)
  File "/home/lorenzo/git/acme/acme/agents/tf/d4pg/agent_distributed.py", line 67, in wrapped_network_factory
    networks_dict = network_factory(action_spec)
  File "/home/lorenzo/git/planning_2d/test_speed.py", line 118, in make_networks
    networks.DiscreteValuedHead(vmin, vmax, num_atoms),
  File "/home/lorenzo/acme/lib/python3.8/site-packages/sonnet/src/base.py", line 126, in __call__
    module.__init__(*args, **kwargs)
  File "/home/lorenzo/git/acme/acme/tf/networks/distributional.py", line 63, in __init__
    vmin = tf.convert_to_tensor(vmin)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1430, in convert_to_tensor_v2_with_dispatch
    return convert_to_tensor_v2(
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1436, in convert_to_tensor_v2
    return convert_to_tensor(
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 264, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 276, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 301, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 97, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 554, in ensure_initialized
    context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory

I tried on different hardware (rtx 1080 and rtx 1650) and I still get the same issue.
If I disable the GPU and run only CPU any number of programs can be spawned successfully instead. Obviously it's much slower though.

Any idea on what caused the error?

I am posting it here instead of acme because the issue seems related to Launchpad

Program cannot be interrupted after registering stop handler.

Hi,

I noticed that the following code cannot be interrupted by Ctrl-C.

import time

from absl import app
import launchpad


def main(_):
    def signal_handler():
        print("Called handler")

    launchpad.register_stop_handler(signal_handler)
    launchpad.unregister_stop_handler(signal_handler)

    print("Start")

    while True:
        print("foo")
        for i in range(100):
            time.sleep(0.1)
            print(i)

if __name__ == "__main__":
    app.run(main)

Looks like the issue is around https://github.com/google-deepmind/launchpad/blob/master/launchpad/launch/worker_manager.py#L76 where the original interrupt handler is cleared and ignored.

I came across this issue as I was debugging why my acme offline experiment loop cannot be interrupted. In that context, it creates an EnvironmentLoop that will register and unregister a custom handler like above (in loop.run). Subsequent learner loops cannot be interrupted. The relevant part of the code is

https://github.com/google-deepmind/acme/blob/993826a95e657b8fe796ca7c640891d0de9d7a31/acme/jax/experiments/run_offline_experiment.py#L113-L114

@qstanczyk looks like you are quite familiar with that part of the code, is there something I can do to help resolve this?

Specify CPU and GPU usage

I am trying to set up the learner so that it uses the GPU, while the actors only use the CPUs. I see there is a local_resources option in the launch function. It seems to require a dictionary with config instructions per node. How would I specify CPU and GPU usage with this launch config? For example, how would one allow the learner in Acme's D4PG (https://github.com/deepmind/acme/tree/master/acme/agents/tf/d4pg) to use the GPU in distributed training?

Thanks.
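
A common way to steer device usage per process is the CUDA_VISIBLE_DEVICES environment variable. The sketch below only builds one environment dictionary per node group and is plain Python; device_env_per_group and the group labels are hypothetical. It assumes that local_resources maps node group labels to PythonProcess objects and that PythonProcess accepts an env dictionary, as in the XLA_FLAGS issue above, so each inner dictionary could be wrapped as PythonProcess(env=...).

```python
from typing import Dict, Iterable


def device_env_per_group(
    groups: Iterable[str],
    gpu_groups: Dict[str, str],
) -> Dict[str, Dict[str, str]]:
    """Maps each node group label to environment variables for its process."""
    envs = {}
    for group in groups:
        # An empty CUDA_VISIBLE_DEVICES hides every GPU from that process;
        # groups listed in gpu_groups get the device ids given there.
        envs[group] = {"CUDA_VISIBLE_DEVICES": gpu_groups.get(group, "")}
    return envs


envs = device_env_per_group(
    groups=["learner", "actor", "evaluator"],
    gpu_groups={"learner": "0"},
)
for group, env in envs.items():
    print(group, env)
```

Here only the learner sees GPU 0, while actors and the evaluator run on CPU, which also avoids several processes trying to reserve the same GPU's memory.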

How to launch my program on multiple machines at once?

Do you have plans to release a version that can run on multiple machines in the future?

Question: How to Debug

How do you debug launchpad programs?

I have tried running the single process version of launchpad (LOCAL_MULTI_THREADING launchtype) and adding breakpoints to the relevant parts of the program. I can debug the building of the program nodes, but I can't seem to debug what happens in lp.launch.

Do I need to dereference and obtain a handle to every node (like done in the tests)?

PyPi and release branch source code mismatch

I noticed that for Launchpad version 0.5.2 there is a mismatch between the package uploaded to PyPI and the head of the release branch. This is problematic because this line differs in the PyPI package and causes a type mismatch: the expected type is int, but in the uploaded package the value is a float due to a division by 1000.

Where can I find the full documentation for LaunchPad

All I could find is the white paper and the docs/ folder in the repo, is there any full documentation available?

Remove dataclasses dependency in the wheel for python > 3.6

Having dataclasses on python 3.8 causes the following error

AttributeError: module 'typing' has no attribute '_ClassVar'

A similar issue is discussed here
pytorch/pytorch#46930

Difference from Ray?

Thank you so much for open-sourcing the library. I really appreciate it.

I'm curious - launchpad's functionality looks very similar to Ray, which also features an RPC-style API and distributed launching. In addition, Ray supports Ray RLlib, just like Launchpad supports Acme. May I ask what are the key differences and advantages to use Launchpad?

Cannot import name 'py_client' from 'courier.python' (unknown location)

When I tried to build launchpad following this, I got:

./run_python_tests.sh

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/launchpad/launchpad/__init__.py", line 36, in <module>
    from launchpad.nodes.courier.node import CourierHandle
  File "/tmp/launchpad/launchpad/nodes/courier/node.py", line 21, in <module>
    import courier
  File "/tmp/launchpad/courier/__init__.py", line 26, in <module>
    from courier.python.client import Client  # pytype: disable=import-error
  File "/tmp/launchpad/courier/python/client.py", line 31, in <module>
    from courier.python import py_client
ImportError: cannot import name 'py_client' from 'courier.python' (unknown location)

lp.stop() in a TestCase does not let the test finish because of SystemExit

To reproduce, with dm-launchpad-nightly

import time

from absl.testing import absltest

import launchpad as lp


def learner():
  for i in range(10):
    time.sleep(1)
  lp.stop()


def actor():
  while True:
    print('act')
    time.sleep(1)


class LPTest(absltest.TestCase):

  def test_stop(self):
    program = lp.Program('test')
    program.add_node(lp.CourierNode(learner), label='learner')
    program.add_node(lp.CourierNode(actor), label='actor')
    lp.launch(programs=program, launch_type='test_mt', test_case=self)


if __name__ == '__main__':
  absltest.main()

Run the above code with pytest and you will get the following error:

(venv) ➜  launchpad-bugs pytest test_learner_actor.py 
============================================================================================================================ test session starts =============================================================================================================================
platform linux -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/zhongwen/PycharmProjects/launchpad-bugs
collected 1 item                                                                                                                                                                                                                                                             

test_learner_actor.py .                                                                                                                                                                                                                                                [100%]

============================================================================================================================== warnings summary ==============================================================================================================================
test_learner_actor.py::LPTest::test_stop
  /home/zhongwen/PycharmProjects/launchpad-bugs/venv/lib/python3.8/site-packages/_pytest/threadexception.py:75: PytestUnhandledThreadExceptionWarning: Exception in thread learner
  
  Traceback (most recent call last):
    File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
      self.run()
    File "/usr/lib/python3.8/threading.py", line 870, in run
      self._target(*self._args, **self._kwargs)
    File "/home/zhongwen/PycharmProjects/launchpad-bugs/venv/lib/python3.8/site-packages/launchpad/launch/worker_manager.py", line 142, in run_inner
      future.set_result(f())
    File "/home/zhongwen/PycharmProjects/launchpad-bugs/venv/lib/python3.8/site-packages/launchpad/nodes/python/node.py", line 72, in _construct_function
      return functools.partial(self._function, *args, **kwargs)()
    File "/home/zhongwen/PycharmProjects/launchpad-bugs/venv/lib/python3.8/site-packages/launchpad/nodes/courier/node.py", line 107, in run
      instance = self._construct_instance()  # pytype:disable=wrong-arg-types
    File "/home/zhongwen/PycharmProjects/launchpad-bugs/venv/lib/python3.8/site-packages/launchpad/nodes/python/node.py", line 165, in _construct_instance
      return self._constructor(*args, **kwargs)
    File "/home/zhongwen/PycharmProjects/launchpad-bugs/test_learner_actor.py", line 10, in learner
      time.sleep(1)
  SystemExit
  
    warnings.warn(pytest.PytestUnhandledThreadExceptionWarning(msg))

MuJoCo error when deployed inside launchpad

Context: I have the MuJoCo physics simulator set up on my machine and it works fine for my RL experiments. However, when I launch multiple instances of the MuJoCo environment inside launchpad nodes, I get the following error:

ERROR: GLEW initialization error: Missing GL version
Press Enter to exit ...

This error happens only when I launch multiple environments on different launchpad nodes, or one environment outside launchpad and one inside a launchpad node (refer to the example below). It does not occur when I do not use launchpad.

Here's minimal code to reproduce this on my machine:

import gym
import launchpad as lp

def make_env():
    env = gym.make('FetchReach-v1')
    _ = env.render(mode='rgb_array', height=64, width=64)

_ = make_env() # Successfully creates the env outside lp

# Fails if the above line isn't commented out
# Only works when the line above or everything below is commented
program = lp.Program(name='agent')
env_node = lp.PyNode(make_env)
program.add_node(env_node, label = 'test_node')
lp.launch(program, terminal='current_terminal')

Edit
The issue seems to be with multi-threading (refer to the code example below). Is there any way of bypassing this error with MuJoCo? I am using research code released by Google Research that happens to run MuJoCo environments in multiple threads.

import gym
import threading

def make_env():
    env = gym.make('FetchReach-v1')
    _ = env.render(mode='rgb_array', height=64, width=64)

_ = make_env() # Successfully creates the env outside lp

thread = threading.Thread(target=make_env)
thread.start()
thread.join()
# >>> ERROR: GLEW initialization error: Missing GL version

Test launchpad setups

Hello there.

My example code setup looks as follows:

import time
import tensorflow as tf
from absl.testing import absltest

import launchpad as lp


class learner():
    def __init__(self) -> None:
        self.step_counter = 0
        print('Init learner')
    def step(self):
        print('Step learner')
        time.sleep(1)
        self.step_counter += 1
    def run(self):
        # while True:
        print('Run learner')
        while True:
            self.step()


class actor():
    def __init__(self) -> None:
        print('Init actor')
        self.step_count = 0
    def step(self):
        print('Step actor')
        time.sleep(1)
        self.step_count += 1
    def run(self):
        # while True:
        print('Run actor')
        while True:
            self.step()


class LPTest(absltest.TestCase):
    def test_stop(self):
        program = lp.Program('test')
        program.add_node(lp.CourierNode(learner), label='learner')
        program.add_node(lp.CourierNode(actor), label='actor')

        (learner_node,) = program.groups['learner']
        learner_node.disable_run()

        (actor_node,) = program.groups['actor']
        actor_node.disable_run()

        lp.launch(programs=program, launch_type='test_mt', test_case=self)
        learner_inst = learner_node.create_handle().dereference()
        learner_inst.step()

        actor_inst = actor_node.create_handle().dereference()
        actor_inst.step()

        print("Learner count: ", learner_inst.step_counter)
        print("Actor count: ", actor_inst.step_count)

        # Just to output the print statements.
        exit()

I want to be able to test if the internal values of the actor and learner process have been updated. However, when I print the counts I get:

Init learner
Init actor
Step learner
Step actor
Learner count:  <function exception_handler.<locals>.inner_function at 0x7f2c35acd5e0>
Actor count:  <function exception_handler.<locals>.inner_function at 0x7f2c35acd700>

Is there a way to get the values for the internal class variables? Furthermore, if I have a ReverbNode how can I ensure that my launchpad program stops after the test is complete? I also don't want to stop the launchpad instances of other tests that might be running. Thanks!
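The printed output above suggests that the dereferenced handle resolves every attribute name to a callable wrapper, so reading `step_counter` gives back a function rather than the integer. A minimal sketch of the usual workaround, exposing the value through a method: `MethodOnlyClient` below is a hypothetical stand-in written only for this illustration (it is not Launchpad's client), and `get_step_counter` is an assumed name.

```python
class Learner:
    """Same shape as the `learner` class above, plus an explicit getter."""

    def __init__(self) -> None:
        self.step_counter = 0

    def step(self) -> None:
        self.step_counter += 1

    def get_step_counter(self) -> int:
        # A method call returns a value, which a test can assert on.
        return self.step_counter


class MethodOnlyClient:
    """Hypothetical stand-in for an RPC client: every attribute is a callable."""

    def __init__(self, target) -> None:
        self._target = target

    def __getattr__(self, name):
        # Only reached for names not found on the client itself.
        def remote_call(*args, **kwargs):
            return getattr(self._target, name)(*args, **kwargs)

        return remote_call


client = MethodOnlyClient(Learner())
client.step()
print(callable(client.step_counter))  # True: attribute access yields a function
print(client.get_step_counter())      # 1: the getter returns the actual value
```

With a getter in place, the test can compare the returned count instead of printing the attribute.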

Distributed processes on multiple machines with launchpad?

Hi, I was wondering if there is a timeline for the release of the version that will support multiple processes on multiple machines. I am working with acme and that would be really cool!

Wait for launch pad program to finish in a subprocess

Hi,

I am currently using LaunchPad for a local distributed RL program. I am using the local multiprocessing backend but would like to have multiple runs of the program in a single process to perform a sweep.

I have the following script (ignoring the details of the agent implementation):

def run():
    program = impala_agent.DistributedIMPALA(
        environment_factory=environment_factory,
        network_factory=network_factory,
        num_actors=8,
        max_actor_steps=int(1e8),
        log_every=10.0,
    ).build()
    local_resources = {
        "learner": PythonProcess(
            env={"JAX_PLATFORM_NAME": "gpu", "CUDA_VISIBLE_DEVICES": "0"}
        ),
        "evaluator": PythonProcess(env={"CUDA_VISIBLE_DEVICES": ""}),
        "actor": PythonProcess(env={"CUDA_VISIBLE_DEVICES": ""}),
        "counter": PythonProcess(env={"CUDA_VISIBLE_DEVICES": ""}),
        "replay": PythonProcess(env={"CUDA_VISIBLE_DEVICES": ""}),
        "coordinator": PythonProcess(env={"CUDA_VISIBLE_DEVICES": ""}),
    }
    lp.launch(
        program,
        lp.LaunchType.LOCAL_MULTI_PROCESSING,
        terminal="current_terminal",
        local_resources=local_resources,
    )


def main(_):
    p = multiprocessing.Process(target=run)
    p.start()
    p.join() # This returns immediately

This currently does not work, because the LaunchPad program kills the entire current process tree at exit (at least in my experience, when I have a reverb node that never terminates properly and has to be killed with os.kill). This prevents me from performing a sweep in the same process with something like:

for params in SWEEPS:
  p = multiprocessing.Process(target=run, args=(params,))
  p.start()
  p.join() # This returns immediately

With the script, the process returns immediately and exits. However, the LP processes are not shut down, and I end up with the agent processes running in the background. In the ideal case, I would love to be able to wait on the LP program to finish and then execute another LP program in the same process, without having to write a shell script to wrap the sweeps.

Thanks!
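One way to get a blocking sweep without a shell script is a small Python driver that starts every run as its own interpreter and waits for it. The sketch below is only an illustration: `run_sweep` is a hypothetical helper, the `-c` commands stand in for invoking the script above with different parameters, and it is an assumption (not verified against Launchpad) that giving each run its own session via `start_new_session=True` keeps the program's shutdown from reaching the driver.

```python
import subprocess
import sys


def run_sweep(commands):
    """Runs each command to completion, one after another; returns exit codes."""
    exit_codes = []
    for cmd in commands:
        # start_new_session=True (POSIX only) puts the child in its own
        # session and process group, separate from this driver process.
        proc = subprocess.Popen(cmd, start_new_session=True)
        exit_codes.append(proc.wait())  # blocks until this run has exited
    return exit_codes


if __name__ == "__main__":
    # Hypothetical stand-in for `python run_agent.py --seed=N`.
    commands = [
        [sys.executable, "-c", f"print('run with seed {seed}')"]
        for seed in range(3)
    ]
    print(run_sweep(commands))
```

This does not clean up node processes that outlive a run; it only makes the driver wait for each top-level process before starting the next.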

Signal is not working

In the deepmind/acme project, I run python3.9 examples/gym/lp_sac_jax.py and I get the following.

RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
I1225 21:19:30.732187 140644471760704 xla_bridge.py:243] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
I1225 21:19:30.732384 140644471760704 xla_bridge.py:243] Unable to initialize backend 'gpu': NOT_FOUND: Could not find registered platform with name: "cuda". Available platform names are: Host Interpreter
I1225 21:19:30.732703 140644471760704 xla_bridge.py:243] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
W1225 21:19:30.732816 140644471760704 xla_bridge.py:248] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[reverb/cc/platform/tfrecord_checkpointer.cc:150]  Initializing TFRecordCheckpointer in /tmp/tmpkxa6r6gu.
I1225 21:19:30.883024 140631112410880 courier_utils.py:51] Binding: run
I1225 21:19:30.884491 140631112410880 lp_utils.py:87] StepsLimiter: Starting with max_steps = 1000000 (actor_steps)
[reverb/cc/platform/tfrecord_checkpointer.cc:386] Loading latest checkpoint from /tmp/tmpkxa6r6gu
[reverb/cc/platform/default/server.cc:71] Started replay server on port 22255
I1225 21:19:30.888980 140629820557056 node.py:62] Reverb client connecting to: localhost:22255
I1225 21:19:32.412005 140624292452096 node.py:62] Reverb client connecting to: localhost:22255
I1225 21:19:32.412815 140605521848064 node.py:62] Reverb client connecting to: localhost:22255
I1225 21:19:32.418541 140630617470720 savers.py:166] Attempting to restore checkpoint: None
I1225 21:19:32.418889 140605488277248 node.py:62] Reverb client connecting to: localhost:22255
I1225 21:19:32.423120 140605479884544 node.py:62] Reverb client connecting to: localhost:22255
Error in atexit._run_exitfuncs: 10s.
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/launchpad/launch/worker_manager.py", line 408, in wait
    raise failure
  File "/usr/lib/python3.9/site-packages/launchpad/launch/worker_manager.py", line 454, in _check_workers
    worker.future.result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 438, in result
    return self.__get_result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/usr/lib/python3.9/site-packages/launchpad/launch/worker_manager.py", line 230, in run_inner
    future.set_result(f())
  File "/usr/lib/python3.9/site-packages/launchpad/nodes/python/node.py", line 75, in _construct_function
    return functools.partial(self._function, *args, **kwargs)()
  File "/usr/lib/python3.9/site-packages/launchpad/nodes/courier/node.py", line 107, in run
    instance = self._construct_instance()  # pytype:disable=wrong-arg-types
  File "/usr/lib/python3.9/site-packages/launchpad/nodes/python/node.py", line 177, in _construct_instance
    self._instance = self._constructor(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/acme/jax/layouts/distributed_layout.py", line 140, in counter
    return savers.CheckpointingRunner(
  File "/usr/lib/python3.9/site-packages/acme/tf/savers.py", line 211, in __init__
    signals.add_handler(signal.SIGTERM, _signal_handler)
  File "/usr/lib/python3.9/site-packages/acme/utils/signals.py", line 31, in add_handler
    signal.signal(signo.value, _wrapped)
  File "/usr/lib/python3.9/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread of the main interpreter
terminate called without an active exception
Fatal Python error: Aborted

Thread 0x00007fea57c76740 (most recent call first):
<no Python frame>
Aborted (core dumped)

The same issue happens with python3.7

Here is a colab reproducing this issue - https://colab.research.google.com/gist/kovkev/a47e34e1fab418189fbe7e04ccc861d0/acme.ipynb

LOCAL_DOCKER

Hi, I'm interested in launching nodes in local Docker containers. There used to be a LOCAL_DOCKER enum value, but it was effectively removed in 9f16e42. Why? What was wrong with this option?

Regards,

Build from source on Apple M1

Hi,

I'm trying to build launchpad from sources on a laptop with the M1 chip.
Is it possible?
I've tried using the Dockerfile provided - unsuccessfully.
Curious to know if anyone has succeeded.
Thanks,

Cyprien

Issue with building LaunchPad with TensorFlow 2.14

Hi,

When I try to build LaunchPad from source to support tensorflow 2.14 I get some new errors. This was not a problem when building against TensorFlow 2.13.

Traceback (most recent call last):
File "", line 1, in <module>
File "/home/yicheng/projects/launchpad/.venv/lib/python3.9/site-packages/tensorflow/__init__.py", line 38, in <module>
from tensorflow.python.tools import module_util as _module_util
File "/home/yicheng/projects/launchpad/.venv/lib/python3.9/site-packages/tensorflow/python/__init__.py", line 42, in <module>
from tensorflow.python.saved_model import saved_model
File "/home/yicheng/projects/launchpad/.venv/lib/python3.9/site-packages/tensorflow/python/saved_model/saved_model.py", line 20, in <module>
from tensorflow.python.saved_model import builder
File "/home/yicheng/projects/launchpad/.venv/lib/python3.9/site-packages/tensorflow/python/saved_model/builder.py", line 23, in <module>
from tensorflow.python.saved_model.builder_impl import _SavedModelBuilder
File "/home/yicheng/projects/launchpad/.venv/lib/python3.9/site-packages/tensorflow/python/saved_model/builder_impl.py", line 27, in <module>
from tensorflow.python.framework import ops
File "/home/yicheng/projects/launchpad/.venv/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 43, in <module>
from tensorflow.python.client import pywrap_tf_session
File "/home/yicheng/projects/launchpad/.venv/lib/python3.9/site-packages/tensorflow/python/client/pywrap_tf_session.py", line 25, in <module>
from tensorflow.python.util import tf_stack
File "/home/yicheng/projects/launchpad/.venv/lib/python3.9/site-packages/tensorflow/python/util/tf_stack.py", line 22, in <module>
from tensorflow.python.util import _tf_stack
ImportError: generic_type: cannot initialize type "StatusCode": an object with that name is already defined

It looks like, with TensorFlow 2.14, importing both TensorFlow and Launchpad in the same process (order doesn't matter) causes the above problem.

Bug with JAX absl flag configuration

Hi,

Thanks for the awesome project!

I found the following issue in dm-launchpad-nightly=0.1.0.dev20210427.
It seems that if I configure JAX absl flags in my entry script, launchpad errors when launching the program.

For a script that looks something like this:

# pylint: disable=unused-import
from absl import app, flags
from jax.config import config
import launchpad as lp

FLAGS = flags.FLAGS

def main(_):
  import os
  os.environ['JAX_PLATFORM_NAME'] = 'cpu'

  program = impala_agent.DistributedIMPALA(
      environment_factory=environment_factory,
      # environment_spec=environment_spec,
      network_factory=network_factory,  # lp_utils.partial_kwargs(helpers.make_networks),
      num_actors=1,
      max_actor_steps=int(1e6)).build()
  lp.launch(program,
            lp.LaunchType.LOCAL_MULTI_PROCESSING,
            terminal='current_terminal')

if __name__ == '__main__':
  config.config_with_absl()  # Errors right now
  app.run(main)  # pylint: disable=no-value-for-parameter

This gives an error:

absl.flags._exceptions.DuplicateFlagError: The flag 'jax_omnistaging' is defined twice. First from jax.config, Second from jax.config.  Description from first occurrence: Deprecated. Setting this flag to False raises an error. Setting it to True has no effect.

I think the issue is probably in https://github.com/deepmind/launchpad/blob/master/launchpad/nodes/python/flags_utils.py#L54. There should be an additional check, when the function is first called, to see whether the flags have already been configured by the user.

Launchpad Release

Hi Launchpad Team,

Thanks for trying to help us with #32. However, I don't think we have a release for the new TF version yet. Does that apply to both the nightly and the stable versions?

I am currently using a wheel built from source as mentioned in #30 (comment). Some additional fixes seem to be required to work with TF 2.12.0. I am happy to create a PR if that is preferred (but maybe it's easier to update from internal). Some upgrades seem to be needed in how absl Status and tsl Status are handled; otherwise things are straightforward.

The update is quite useful as it is currently blocking using JAX agents in dm-acme at HEAD. dm-acme@HEAD now uses jax>0.4.3 which is incompatible with the TF/TFP version used in Acme.

@ddmbr what do you think?

Courier Nodes cannot receive Array-types and hang

A Courier node's interface can accept array types at construction and transmit them, but it is unable to receive them. Moreover, when an unexpected type is sent over RPC, the system silently hangs indefinitely.

Is one-way-only transmission of arrays expected behavior? Here's a minimal reproducible example (Python 3.8):

from typing import Optional

import launchpad as lp
import numpy as np
from absl import app


class Logger:
    def __init__(self, data: Optional[np.ndarray] = None) -> None:
        self._data = data

    def receive(self, x: np.ndarray) -> None:
        print(x)

    def send(self):
        return self._data


def main(_):
    program = lp.Program("test")
    logger_handle = program.add_node(lp.CourierNode(Logger, np.zeros([3])), label="logger")
    lp.launch(program, launch_type=lp.LaunchType.LOCAL_MULTI_THREADING)

    logger_client = logger_handle.dereference()

    # logger_client.receive(np.ones_like([3]))  # Hangs indefinitely
    print(logger_client.send())                            # --> [0, 0, 0]


if __name__ == "__main__":
    app.run(main)

absl-py==1.2.0
cloudpickle==2.2.0
distrax==0.1.2
dm-env==1.5
dm-haiku==0.0.8
dm-launchpad==0.5.2
dm-reverb==0.7.2
dm-tree==0.1.7
jax==0.3.17
jaxlib==0.3.15
networkx==2.8.6
numpy==1.22.4
protobuf==3.19.5
tensorflow==2.8.3
tensorflow-probability==0.16.0

Undefined symbol on importing courier

How I installed packages:

Python 3.9.7 from anaconda (linux x86_64)

tensorflow==2.6.1
dm-launchpad==0.3.1  (or dm-launchpad-nightly)
dm-reverb==0.5.0

When trying to import courier, I get the following error:

.../site-packages/courier/python/py_client.so: undefined symbol: _ZN4absl14lts_2020_09_235MutexD1Ev

With dm-launchpad-nightly (as of 11/3/2021)

.../site-packages/courier/python/py_client.so: undefined symbol: _ZNK4absl12lts_202103246Status4codeEv

What am I missing? Is it related to tensorflow (nightly) or dm-reverb (nightly) version? I know Google uses "single versioning" where everything is like a nightly, not quite ensuring backward compatibility, but in the open source community it is very difficult to find a set of DM packages in their stable versions that are compatible with one another.
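The two missing symbols can be read directly, since they follow the Itanium C++ name-mangling scheme of length-prefixed name components. The sketch below handles only the simple `_ZN[K]<len><name>...` subset seen in the errors above; `nested_name_components` and `abseil_lts_namespace` are hypothetical helper names. The stable build asks for a symbol in Abseil's `lts_2020_09_23` namespace and the nightly for one in `lts_20210324`, which would be consistent with courier and the library providing the Abseil symbols having been built against different Abseil releases (an inference, not something confirmed here).

```python
import re


def nested_name_components(symbol: str):
    """Splits a simple Itanium-mangled nested name into its components."""
    prefix = re.match(r"_ZNK?", symbol)
    if prefix is None:
        raise ValueError(f"not a mangled nested name: {symbol!r}")
    pos = prefix.end()
    parts = []
    # Each component is encoded as <decimal length><identifier>.
    while pos < len(symbol) and symbol[pos].isdigit():
        digits = re.match(r"\d+", symbol[pos:]).group()
        start = pos + len(digits)
        end = start + int(digits)
        parts.append(symbol[start:end])
        pos = end
    return parts


def abseil_lts_namespace(symbol: str):
    """Returns the Abseil LTS inline namespace named in `symbol`, if any."""
    for part in nested_name_components(symbol):
        if part.startswith("lts_"):
            return part
    return None


print(abseil_lts_namespace("_ZN4absl14lts_2020_09_235MutexD1Ev"))
print(abseil_lts_namespace("_ZNK4absl12lts_202103246Status4codeEv"))
```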

Not able to save logs after launchpad.launch

I'm not able to save the logs from my .py file after the launchpad program terminates.
Could you please suggest a solution?
Also, is there a better way to save the logs?

Code:

lp_return = lp.launch(program, launch_type=lp.LaunchType.LOCAL_MULTI_PROCESSING, terminal="current_terminal")
lp_return.wait()
print('-- acme saved')
shutil.make_archive('/content/logs', 'zip', '/root/acme')

log from terminal:
[counter/0] I0503 16:58:32.552900 139934399387520 savers.py:207] Caught SIGTERM: forcing a checkpoint save.
[counter/0] I0503 16:58:32.553121 139934399387520 savers.py:156] Saving checkpoint: /root/acme/2c71038a-cb02-11ec-9553-0242ac1c0002/checkpoints/counter
[reverb/cc/platform/default/server.cc:84] Shutting down replay server
Killing entire runtime.
Killed
/content/gdrive/MyDrive/code/
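Since the log ends with the process being killed, any code placed after `lp_return.wait()` in the same process may never run. A possible workaround is to do the archiving from a separate driver process that outlives the run. This is a sketch under assumptions: `run_then_archive` is a hypothetical helper, the command stands in for the script that calls `lp.launch`, and it is not verified that the kill leaves a driver in a different session untouched.

```python
import shutil
import subprocess
import sys


def run_then_archive(cmd, log_dir, archive_base):
    """Runs `cmd` to completion, then zips `log_dir` whatever the exit code."""
    # Own session (POSIX only), so the child is not in the driver's process group.
    proc = subprocess.Popen(cmd, start_new_session=True)
    exit_code = proc.wait()
    # make_archive appends ".zip" to archive_base and returns the full path.
    archive_path = shutil.make_archive(archive_base, "zip", log_dir)
    return exit_code, archive_path


if __name__ == "__main__":
    # Hypothetical stand-in for the training script; paths are taken from the post.
    code, path = run_then_archive(
        [sys.executable, "-c", "print('training finished')"],
        log_dir="/root/acme",
        archive_base="/content/logs",
    )
    print(code, path)
```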

Supported launchpad types

I'm using lp.batched_handler and getting the following error when the output of the batched function is jnp.array/np.array

File "~/miniconda3/envs/mp2/lib/python3.9/site-packages/courier/python/client.py", line 52, in inner_function
[actor/1]     raise translate_status(e.status) from e
[actor/1] pybind11_abseil.status.StatusNotOk: Unbatching provided object is not currently supported.

I'm facing a similar issue when trying to call a function on the courier node with numpy or jax types:
- The system gets stuck and the function call doesn't happen at all

Launchpad version: 0.5.2 installed from pip

Is converting from jnp.array/np.array to a Python list the only way to handle these data-type issues?

Compress option is not passing through nodes to courier

Hey LP team,

There's a helpful compress option in the courier client:

https://github.com/deepmind/launchpad/blob/7879e9009fee8fb32cdd3133aac92e9bbb6554fa/courier/python/client.py#L117

However, it seems that this option cannot be passed in when creating the node:
https://github.com/deepmind/launchpad/blob/master/launchpad/nodes/courier/node.py#L104

Could you please either

- point us to how to use compress in the courier client, or
- add **kwargs to https://github.com/deepmind/launchpad/blob/master/launchpad/nodes/courier/node.py#L104 ?

Thanks a lot

@ddmbr

Import py_client fails for not provided libpython3.8.so.1.0

Hi,

In my project I want to use dm-acme as the RL framework. However, on an import such as from acme.jax import experiments, I get an error about a missing shared object file, libpython3.8.so.1.0, which dm-launchpad requires.

Error message:

File "/home/.../python3_deps_pypi__dm_launchpad/courier/python/client.py", line 30, in <module>
    from courier.python import py_client
ImportError: libpython3.8.so.1.0: cannot open shared object file: No such file or directory

After an internal discussion we concluded that python 3.8 C extensions are not supposed to ask for libpython3.8.so.1.0 (This is also the conclusion in this thread).

So my question is: is launchpad built against a shared-library Python? If so, is this on purpose, or could it be changed to link against a static one?
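For anyone comparing environments, the standard library can report whether the running interpreter was itself built with a shared libpython and where that library would live. A minimal sketch; `libpython_info` is a hypothetical helper name, and the config variables may be unset on some platforms (for example Windows), in which case the values come back as `None` or `False`.

```python
import sysconfig


def libpython_info() -> dict:
    """Collects build-time settings that describe how libpython is provided."""
    return {
        # 1 when CPython was configured with --enable-shared, else 0 or None.
        "shared": bool(sysconfig.get_config_var("Py_ENABLE_SHARED")),
        # Library file name, e.g. a .so for shared builds or a .a for static ones.
        "ldlibrary": sysconfig.get_config_var("LDLIBRARY"),
        # Directory where that library is installed.
        "libdir": sysconfig.get_config_var("LIBDIR"),
    }


if __name__ == "__main__":
    for key, value in libpython_info().items():
        print(f"{key}: {value}")
```

If `shared` is false, there is no libpython3.8.so.1.0 from that interpreter for the dynamic loader to find, which would match the ImportError above.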