Comments (6)

wconstab commented on July 1, 2024

My latest thinking is that we don't need any new args to the Stage() ctor, as long as we continue to assume no skip connections.

If we want to enable skip connections later, we could add the args I proposed in the RFC with one adjustment: instead of 'args_rank' it would be 'args_stage', so e.g. it would tell you 'send arg 0 to stage 3, send arg 1 to stage 4', but the Stage would not yet know which rank owned those stages.
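
As a rough illustration only (not an existing API, just the shape of the idea), 'args_stage' could simply map an input-arg index to the destination stage id:

```python
# Hypothetical skip-connection spec: arg index -> destination stage id.
# "send arg 0 to stage 3, send arg 1 to stage 4" -- the Stage does not
# yet know which rank owns stages 3 and 4.
args_stage = {0: 3, 1: 4}
```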

The change I would propose for now is that when you ask the stage for 'get_*_send_ops', it takes an optional stage-to-rank mapping argument. If None, it assumes a linear modulo-pp_size mapping; if not None, it uses the mapping to determine which pp_rank a stage is on.
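
A minimal sketch of that signature, written as a free function for clarity (the names and attributes here are assumptions, not the actual pipelining API):

```python
import torch
import torch.distributed as dist
from typing import Dict, List, Optional

def get_fwd_send_ops(
    stage_index: int,
    pp_size: int,
    fwd_outputs: List[torch.Tensor],
    stage_to_rank: Optional[Dict[int, int]] = None,
) -> List[dist.P2POp]:
    """Build the P2P send ops for a stage's forward outputs (sketch).

    With no skip connections, outputs always go to stage_index + 1.
    If stage_to_rank is None, assume the linear modulo-pp_size placement;
    otherwise use the provided stage-id -> pp_rank mapping.
    """
    dst_stage = stage_index + 1
    if stage_to_rank is None:
        dst_rank = dst_stage % pp_size
    else:
        dst_rank = stage_to_rank[dst_stage]
    return [dist.P2POp(dist.isend, t, dst_rank) for t in fwd_outputs]
```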

Think this through and see if it makes sense; it's just off the top of my head, so it could have issues.

wconstab commented on July 1, 2024

One annoying thing is that which stage counts as the 'next stage' depends on which schedule you use. It would be ideal if we could late-bind that information when the stages and schedule are used together.

Maybe the stage can have a map of stage-id to rank given to it by the schedule either during schedule init or during each call to get_*_ops?
cc @H-Huang
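
One shape that late binding could take (a sketch; 'MySchedule' and 'set_stage_to_rank' are made-up names, not existing APIs): the schedule computes whatever placement it implements and pushes the map into its stages at init time.

```python
# Sketch: the schedule owns the stage-id -> rank mapping and hands it to
# each of its local stages when the schedule is constructed, so that
# later get_*_ops calls can use it.
class MySchedule:
    def __init__(self, stages, pp_size, total_num_stages):
        self.stages = stages
        # Placeholder placement; a real schedule would compute the mapping
        # it actually implements (looped, V-shaped, ...).
        stage_to_rank = {s: s % pp_size for s in range(total_num_stages)}
        for stage in stages:
            stage.set_stage_to_rank(stage_to_rank)  # hypothetical Stage method
```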

H-Huang commented on July 1, 2024

Yeah, I agree, there will need to be a stage-id-to-rank mapping to generate the correct comm ops. There are currently a few assumptions baked into the code that need to be updated:

Assumption 1) The stage-id-to-rank mapping in looped cases is always stage_ids = range(rank, total_num_stages, local_num_stages). We can fix this by adding an explicit stage-id-to-rank mapping (illustrated below).
Assumption 2) You always receive from stage_id - 1 and send to stage_id + 1. We can fix this with the optional arguments mentioned above.
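
To make these assumptions concrete, here is an illustrative comparison with 4 pp ranks and 8 stages (the V-style placement is just an example, chosen to match the stage-3/stage-4 case discussed further down):

```python
pp_size, total_num_stages = 4, 8

# Looped placement matching the linear modulo-pp_size assumption:
# stage i lives on rank i % pp_size, so rank r owns r, r + pp_size, ...
looped = {r: list(range(r, total_num_stages, pp_size)) for r in range(pp_size)}
# {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}

# A V-style placement: each rank owns one stage on the way "down" and its
# mirror on the way back "up", so the two middle stages land on one rank.
v_shape = {r: [r, 2 * pp_size - 1 - r] for r in range(pp_size)}
# {0: [0, 7], 1: [1, 6], 2: [2, 5], 3: [3, 4]}  <- stages 3 and 4 share rank 3
```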

wconstab commented on July 1, 2024

@H-Huang one more design consideration is how we should deal with the communication between the two stages at the bottom of the 'V' that are on the same physical rank.

e.g. say stage 3 needs to send outputs to 4 and recv grads from 4.

  1. Can we use NCCL for this use case today, until we decide to optimize it? (Does NCCL support sending/recving from a rank to itself?)
  2. If we want to avoid doing a comm op, how can we cleanly let stage 3 know about stage 4 and share a tensor? Perhaps the schedule code itself needs to do this by passing the output tensor from 3 as an input to 4 (and skipping generation of the send/recv ops)? See the sketch below.
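
A rough sketch of option (2), where the short circuit lives in the schedule loop (every name here, e.g. forward_one_chunk, local_stages, set_local_fwd_input, is hypothetical):

```python
import torch.distributed as dist

def forward_step(schedule, stage, microbatch):
    """Sketch of option (2): the schedule short-circuits a same-rank handoff."""
    output = stage.forward_one_chunk(microbatch)                   # hypothetical
    next_stage = schedule.local_stages.get(stage.stage_index + 1)  # hypothetical
    if next_stage is not None:
        # The next stage lives on this rank (bottom of the V): hand the
        # tensor over in-process and generate no send/recv ops at all.
        next_stage.set_local_fwd_input(output)                     # hypothetical
    else:
        # Normal case: issue the P2P sends built by the stage.
        for req in dist.batch_isend_irecv(stage.get_fwd_send_ops()):
            req.wait()
```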

H-Huang commented on July 1, 2024

@wconstab I'm not sure about (1); I can test it out. But I think a clean way of doing it is to just check, in get_*_send_ops, whether the rank you are sending to is yourself, and if so, automatically update the respective recv buffers (much like what would be updated by the matching get_*_recv_op). I think all of the changes can remain in the Stage class (the stage would just somehow need to know about the other stages) without any changes to the Schedule implementation. The send/recv ops would just return empty lists in this case (thus batch_isend_irecv becomes a no-op). A sketch of this check is below.
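
Roughly, that check could look like this on the stage side (a sketch; stage_to_rank, peer_stages, and the buffer names are assumptions, not the real implementation):

```python
import torch.distributed as dist
from typing import List

def get_fwd_send_ops(stage) -> List[dist.P2POp]:
    """Sketch of the self-send short circuit (would live on the Stage class)."""
    dst_stage = stage.stage_index + 1
    dst_rank = stage.stage_to_rank[dst_stage]            # assumed mapping
    if dst_rank == stage.my_rank:
        # Destination stage is on this same rank: skip comms and write the
        # output straight into the peer stage's recv buffer, mirroring what
        # the matching get_fwd_recv_op would have filled in.
        peer = stage.peer_stages[dst_stage]              # assumed registry
        peer.fwd_recv_buffer.copy_(stage.fwd_output)     # assumed buffers
        return []  # empty list -> the later batch_isend_irecv is a no-op
    return [dist.P2POp(dist.isend, stage.fwd_output, dst_rank)]
```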

wconstab commented on July 1, 2024

> the stage would just somehow need to know about the other stages

What's your proposal for how to let stages know about the other stages?

  • During Stage init we could not easily pass all the other stages, so let's rule this out.
  • (a) A new method on Stage to 'register peer stages' could be called by the schedule at init time, for all stages on the same rank.
  • (b) Passing the recv 'Stage' object to Stage.get_fwd_send_ops(recv_stage), during schedule step(), might be another way.

I guess (a) is pretty clean if we can do it in a schedule base class (sketched after the list below). We should also define the fallback: if this registration is not performed, what will happen?

  • Will ranks fall back to using NCCL to send/recv to themselves on the same rank?
  • Or will this error?
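
Option (a) might look roughly like this in a schedule base class (a sketch; 'ScheduleBase' and 'register_peer_stages' are made-up names, not existing torch APIs):

```python
class ScheduleBase:  # hypothetical base class
    def __init__(self, stages):
        self.stages = stages
        # Option (a): at schedule init, tell every local stage about the
        # other stages living on this same rank, so it can hand tensors
        # over locally instead of emitting send/recv ops for them.
        peers = {s.stage_index: s for s in stages}
        for s in stages:
            s.register_peer_stages(peers)  # hypothetical Stage method
        # What happens if this registration is skipped (fall back to NCCL
        # self send/recv, or raise an error) is the open question above.
```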
