Comments (2)
Fully agree that this should be simplified.
Questions to discuss:
- Shared memory:
  - Do we need to specify an amount? We've had this volume configured in our default pod template and never had any issues:
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
  - Do we try to merge this into the pod template a user might have provided to the task, or should the shared memory volume only be added if the user doesn't provide a pod template?
- Timeouts:
  - For the join timeout, I feel we should consider the scenario where some workers have a hot start (node is up and image is cached) while other workers have a cold start, i.e. the node needs to be scaled up and the image has to be pulled. I feel 15 minutes, as you specified, is a good value here. Are there other opinions?
  - Clarify whether the `timeout` in the rdzv config is the same timeout as in `torch.distributed.init_process_group`, and decide on a reasonable default value.
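
To make the merge question concrete, here is a minimal sketch of what merging could look like. It works on a plain dict in Kubernetes manifest form rather than on the `kubernetes` client objects, and `add_shared_memory` is a hypothetical helper, not something that exists in flytekit today. The assumption is that a user-provided volume named `dshm` or a mount at `/dev/shm` should win over the default.

```python
import copy

SHM_VOLUME_NAME = "dshm"  # same name as in the YAML above
SHM_MOUNT_PATH = "/dev/shm"


def add_shared_memory(pod_spec: dict, container_name: str) -> dict:
    """Return a copy of pod_spec with a memory-backed /dev/shm volume added.

    Hypothetical helper: pod_spec is a plain dict in Kubernetes manifest form.
    A volume already named "dshm", or a mount already at /dev/shm, is left
    untouched so that user-provided settings take precedence.
    """
    spec = copy.deepcopy(pod_spec)

    # Add the volume only if the user has not defined one with this name.
    volumes = spec.setdefault("volumes", [])
    if not any(v.get("name") == SHM_VOLUME_NAME for v in volumes):
        volumes.append({"name": SHM_VOLUME_NAME, "emptyDir": {"medium": "Memory"}})

    # Add the mount to the named container only, unless /dev/shm is already mounted.
    for container in spec.get("containers", []):
        if container.get("name") != container_name:
            continue
        mounts = container.setdefault("volumeMounts", [])
        if not any(m.get("mountPath") == SHM_MOUNT_PATH for m in mounts):
            mounts.append({"name": SHM_VOLUME_NAME, "mountPath": SHM_MOUNT_PATH})

    return spec
```

Calling it twice gives the same result as calling it once, and the input is never modified, which would make it safe to apply regardless of whether the user passed a template.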
from flyte.
Just to circle back to this, we opted to:
- Initialize the `Elastic` task config with a default pod template:
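    from flytekit import PodTemplate
    from kubernetes.client import V1Container, V1EmptyDirVolumeSource, V1PodSpec, V1Volume, V1VolumeMount

    # Default pod template: mount a memory-backed emptyDir at /dev/shm.
    PodTemplate(
        primary_container_name="pytorch",
        pod_spec=V1PodSpec(
            containers=[
                V1Container(
                    name="pytorch",
                    volume_mounts=[V1VolumeMount(mount_path="/dev/shm", name="dshm")],
                )
            ],
            volumes=[V1Volume(name="dshm", empty_dir=V1EmptyDirVolumeSource(medium="Memory"))],
        ),
    )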
  This would not be exposed to the end user, but they could still override this by specifying `pod_template` in the `@task` decorator.
- Set the default `rdzv_configs` `join_timeout` to `900` (15 minutes). Digging into the PyTorch docs/code, it looks like `timeout` and `join_timeout` are the same; I think `timeout` is a legacy argument for the `EtcdRendezvousHandler`:
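
For the timeout default, a minimal sketch of how it could be applied on top of whatever the user passes in `rdzv_configs`. `with_default_join_timeout` is a hypothetical helper, and the fallback from `timeout` to `join_timeout` is only an assumption based on the observation above that the two look equivalent; it has not been checked against every rendezvous backend.

```python
from typing import Any, Dict, Optional

DEFAULT_JOIN_TIMEOUT_S = 900  # 15 minutes, the value proposed above


def with_default_join_timeout(
    rdzv_configs: Optional[Dict[str, Any]] = None,
    default: int = DEFAULT_JOIN_TIMEOUT_S,
) -> Dict[str, Any]:
    """Return a copy of rdzv_configs with join_timeout filled in if missing.

    Hypothetical helper. Assumption: a user-supplied "timeout" is treated as
    an older spelling of "join_timeout" and is honoured when only it is set.
    """
    configs = dict(rdzv_configs or {})
    if "join_timeout" not in configs:
        configs["join_timeout"] = configs.get("timeout", default)
    return configs
```

An explicit `join_timeout` from the user is always kept, so the 15 minute value only covers the case where nothing was configured, e.g. a mix of hot-start and cold-start workers with no tuning.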