
Comments (2)

fg91 commented on June 26, 2024

Fully agree that this should be simplified.

Questions to discuss:

  • Shared memory:
    • Do we need to specify an amount? We've had this volume configured in our default pod template and never had any issues:

        volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
    • Do we try to merge this into the pod template a user might have provided to the task or should the shared memory volume only be added if the user doesn't provide a pod template?

  • Timeouts:
    • For the join timeout, I feel we should consider the scenario where some workers have a hot start (node is up and image is cached) while other workers have a cold start, i.e. the node needs to be scaled up and the image has to be pulled. I think 15 minutes, as you specified, is a good value here. Are there other opinions?
    • Clarify whether the timeout in the rdzv config is the same timeout as in torch.distributed.init_process_group and decide on a reasonable default value.
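To make the merge question under "Shared memory" concrete, here is a minimal sketch of one possible merge policy. It works on plain dicts shaped like a pod spec manifest rather than on flytekit or kubernetes client objects, and the helper name `add_shared_memory` is hypothetical. The policy it assumes: if the user's primary container already mounts /dev/shm, leave their template untouched; otherwise add the `dshm` volume and mount.

```python
import copy

# Same volume and mount as in the YAML snippet above.
SHM_VOLUME = {"name": "dshm", "emptyDir": {"medium": "Memory"}}
SHM_MOUNT = {"name": "dshm", "mountPath": "/dev/shm"}


def add_shared_memory(pod_spec, primary_container_name):
    """Hypothetical helper: return a copy of pod_spec with /dev/shm merged in.

    pod_spec is a plain dict (or None if the user gave no pod template).
    """
    merged = copy.deepcopy(pod_spec) if pod_spec else {}

    # Find the primary container, creating a stub if it is missing.
    containers = merged.setdefault("containers", [])
    primary = next(
        (c for c in containers if c.get("name") == primary_container_name), None
    )
    if primary is None:
        primary = {"name": primary_container_name}
        containers.append(primary)

    mounts = primary.setdefault("volumeMounts", [])
    if any(m.get("mountPath") == SHM_MOUNT["mountPath"] for m in mounts):
        # The user already mounts something at /dev/shm: respect their choice.
        return merged

    # Add the volume only if no volume with that name exists yet.
    volumes = merged.setdefault("volumes", [])
    if not any(v.get("name") == SHM_VOLUME["name"] for v in volumes):
        volumes.append(copy.deepcopy(SHM_VOLUME))
    mounts.append(dict(SHM_MOUNT))
    return merged
```

Calling `add_shared_memory(None, "pytorch")` yields the default described above, while a user template that already mounts /dev/shm (for example with a size limit) passes through unchanged. The alternative policy, adding the volume only when no pod template is given at all, would reduce to a single `if pod_spec is None` check.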


cosmicBboy commented on June 26, 2024

Just to circle back to this: we opted to:

  1. Initialize the Elastic task config with a default pod template:
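from flytekit import PodTemplate
from kubernetes.client import (
    V1Container,
    V1EmptyDirVolumeSource,
    V1PodSpec,
    V1Volume,
    V1VolumeMount,
)

PodTemplate(
    primary_container_name="pytorch",
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="pytorch",
                # Mount the in-memory volume at /dev/shm for shared memory.
                volume_mounts=[V1VolumeMount(mount_path="/dev/shm", name="dshm")]
            )
        ],
        # emptyDir with medium "Memory" is backed by tmpfs.
        volumes=[V1Volume(name="dshm", empty_dir=V1EmptyDirVolumeSource(medium="Memory"))]
    ),
)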

This would not be exposed to the end user, but they could still override this by specifying pod_template in the @task decorator.

  2. Set the default rdzv_configs join_timeout to 900 (15 minutes). Digging into the pytorch docs/code, it looks like timeout and join_timeout are the same. I think timeout is a legacy argument for the `EtcdRendezvousHandler`.
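As a rough illustration of the second point, the sketch below applies the 900-second default only when the user has set neither key. It assumes, as described above, that `timeout` and `join_timeout` mean the same thing; the helper name `with_default_join_timeout` is hypothetical and not part of flytekit or PyTorch.

```python
DEFAULT_JOIN_TIMEOUT = 900  # seconds, i.e. 15 minutes


def with_default_join_timeout(rdzv_configs=None):
    """Hypothetical helper: fill in join_timeout unless the user set a timeout.

    Assumption: "timeout" is treated as a legacy alias of "join_timeout",
    so an explicit value under either key is left alone.
    """
    configs = dict(rdzv_configs or {})  # copy so the caller's dict is not mutated
    if "join_timeout" not in configs and "timeout" not in configs:
        configs["join_timeout"] = DEFAULT_JOIN_TIMEOUT
    return configs
```

With no arguments this returns `{"join_timeout": 900}`; passing `{"timeout": 60}` or `{"join_timeout": 1200}` returns the user's value unchanged.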
