Comments (14)
@zrss could you also share the config.json that you applied for each scaling step?
from kungfu.
What are the flags that you passed to kungfu-run? They are printed at the very beginning of the log, e.g.
$ kungfu-run -w -np 4 echo 2
[arg] [0]=kungfu-run
[arg] [1]=-w
[arg] [2]=-np
[arg] [3]=4
[arg] [4]=echo
[arg] [5]=2
[kf-env]: KUNGFU_GIT_URL=/Users/lg/code/mirrors/github.com/lsds/KungFu
[nic] [0] lo0 :: 127.0.0.1/8, ::1/128, fe80::1/64
[nic] [1] gif0 ::
[nic] [2] stf0 ::
[nic] [3] en0 :: 192.168.1.85/24
[nic] [4] en3 ::
[nic] [5] en4 ::
[nic] [6] en1 ::
[nic] [7] en2 ::
[nic] [8] bridge0 ::
[nic] [9] p2p0 ::
[nic] [10] awdl0 :: fe80::4c97:11ff:feab:ca51/64
[nic] [11] llw0 :: fe80::4c97:11ff:feab:ca51/64
[nic] [12] utun0 :: fe80::9d17:9272:6aa8:a8e8/64
[nic] [13] utun1 :: fe80::e340:7496:425b:77ee/64
[nic] [14] en5 :: fe80::aede:48ff:fe00:1122/64
[I] watching config server
I suspect the -init-version flag is not set correctly for the second scale-up. According to the example, it should be set to -1 if the kungfu-run is not the first generation.
from kungfu.
Thanks for the reply ~

> it should be set to -1 if the kungfu-run is not the first generation.

The kungfu-run params are all the same across my scale up/down cases ...
from kungfu.
@lgarithm, can I draw this conclusion?
- bootstrap a new kungfu job with the default init-version (i.e. leave it unset)
- always set init-version to -1 for a newly added kungfu-run
from kungfu.
@lgarithm can we just set init-version to -1, not only for the newly added kungfu-run but also for the first-generation kungfu-run? It is hard for the cluster to tell whether a runner belongs to the first generation.
from kungfu.
> @lgarithm, can I draw this conclusion?
> - bootstrap a new kungfu job with the default init-version (i.e. leave it unset)
> - always set init-version to -1 for a newly added kungfu-run

Yes, this is correct.
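For readers who script their launcher, the rule confirmed above could be sketched roughly as follows. This is only an illustration: the helper name `kungfu_run_args` and its parameters are hypothetical, and the `-w -np 4 echo 2` part is simply copied from the log earlier in this thread as a placeholder command line.

```python
from typing import List


def kungfu_run_args(num_peers: int, command: List[str], first_generation: bool) -> List[str]:
    """Build an argv list for kungfu-run (hypothetical helper, not part of KungFu)."""
    args = ["kungfu-run", "-w", "-np", str(num_peers)]
    if not first_generation:
        # A newly added runner joins an existing job, so it passes -init-version -1.
        args += ["-init-version", "-1"]
    # A first-generation runner leaves -init-version unset and relies on the default.
    return args + command


first = kungfu_run_args(4, ["echo", "2"], first_generation=True)
added = kungfu_run_args(4, ["echo", "2"], first_generation=False)
print(" ".join(first))
print(" ".join(added))
```

The caller still has to know which generation it is launching, which is the hard part discussed in the following comments.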
from kungfu.
> @lgarithm can we just set init-version to -1, not only for the newly added kungfu-run but also for the first-generation kungfu-run? It is hard for the cluster to tell whether a runner belongs to the first generation.

We can consider this as a future improvement, but currently I can't think of a clean way to do it.
from kungfu.
> @lgarithm can we just set init-version to -1, not only for the newly added kungfu-run but also for the first-generation kungfu-run? It is hard for the cluster to tell whether a runner belongs to the first generation.

If you can manually initialize the first-generation kungfu-runs, then you can always set init-version to -1.
from kungfu.
i.e. start the first-generation kungfu-run with -init-version -1, then run the code at KungFu/srcs/go/kungfu/peer/peer.go (lines 191 to 205 in 06d742e) in your cluster manager.
from kungfu.
@lgarithm thanks for the reply, I'd like to try it; currently it seems to be the only way for us.
To clarify our current architecture:
A host file (it only records the IPs of the containers) is generated by the cluster manager, and we introduced a kungfu-mng process that converts the host file into the config.json of kungfu.
kungfu-mng runs in the container, and every container has the same meta info, including the bootstrap command. That is why we want to set init-version to a fixed value: we can only modify the number of containers through the cluster manager (i.e. the elastic feature of Volcano on K8S).
The cluster manager updates the host file and starts (or shuts down) containers when we scale the kungfu job up (or down).
So for now I can't think of a way to tell, from inside a container, that it is newly added, unless the cluster manager tags the newly added container with some label (for example, a SCALE_OUT env in the newly added container).
kungfu-mng can compare the containers in the host file with the -H argument of the kungfu-run bootstrap command:
- the number of containers (and their IPs) == -H: the first generation
- the number of containers (and their IPs) != -H: not the first generation
Then:
- for the first generation, bootstrap kungfu-run with init-version=0
- otherwise, bootstrap kungfu-run with init-version=-1
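A minimal sketch of the comparison described above, under stated assumptions: the host file is taken to hold one IP per line, and the hosts from the -H argument are taken to be already parsed into a plain list of IPs. The function names are hypothetical and are not part of kungfu-mng or KungFu.

```python
from typing import List


def read_hosts(host_file_text: str) -> List[str]:
    """Return the container IPs from a host file (assumed: one IP per non-empty line)."""
    return [line.strip() for line in host_file_text.splitlines() if line.strip()]


def init_version_for(host_file_text: str, bootstrap_hosts: List[str]) -> int:
    """Pick init-version by comparing the host file with the hosts given to -H."""
    current = sorted(read_hosts(host_file_text))
    if current == sorted(bootstrap_hosts):
        # Same count and same IPs as the bootstrap command: first generation.
        return 0
    # The host list has changed since bootstrap: not the first generation.
    return -1


bootstrap = ["10.0.0.1", "10.0.0.2"]
print(init_version_for("10.0.0.1\n10.0.0.2\n", bootstrap))
print(init_version_for("10.0.0.1\n10.0.0.2\n10.0.0.3\n", bootstrap))
```

Sorting both lists makes the check independent of the order in which the cluster manager writes the IPs.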
from kungfu.
https://github.com/volcano-sh/volcano
from kungfu.
What if the config.json is restored to its original state after two scaling operations?

> - the number of containers (and their IPs) == -H: the first generation
> - the number of containers (and their IPs) != -H: not the first generation
>
> Then:
> - for the first generation, bootstrap kungfu-run with init-version=0
> - otherwise, bootstrap kungfu-run with init-version=-1
from kungfu.
How about adding a version field to the config.json object?
from kungfu.
> What if the config.json is restored to its original state after two scaling operations?
> - the number of containers (and their IPs) == -H: the first generation
> - the number of containers (and their IPs) != -H: not the first generation
>
> Then:
> - for the first generation, bootstrap kungfu-run with init-version=0
> - otherwise, bootstrap kungfu-run with init-version=-1

We (the platform) should enforce that the number of instances cannot go below the default value when scaling down; this simplifies the scenario.

> How about adding a version field to the config.json object?

Good idea. We can post a feature request to the cluster manager for adding a version field to the host file. Generally speaking, version = version + 1 on every scale up/down.
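To illustrate why a version counter would help with the "restored config" case raised above, here is a rough sketch. The record layout (`version` and `hosts` keys) and both function names are hypothetical, and mapping version 0 to init-version=0 and anything later to -1 is an assumption based on the rule discussed earlier in this thread, not a documented KungFu behaviour.

```python
import json
from typing import Dict, List


def bump_version(host_file: Dict, new_hosts: List[str]) -> Dict:
    """Return a new host-file record for one scale up/down, with version + 1."""
    return {"version": host_file["version"] + 1, "hosts": list(new_hosts)}


def init_version_for(host_file: Dict) -> int:
    """Version 0 means the job has never been scaled, i.e. the first generation."""
    return 0 if host_file["version"] == 0 else -1


initial = {"version": 0, "hosts": ["10.0.0.1", "10.0.0.2"]}
scaled_up = bump_version(initial, ["10.0.0.1", "10.0.0.2", "10.0.0.3"])
restored = bump_version(scaled_up, initial["hosts"])

# After scaling up and back down the host list equals the initial one,
# but the version counter still shows that the job has been rescaled.
print(json.dumps(restored))
print(init_version_for(initial), init_version_for(restored))
```

Unlike the host-list comparison, this check does not depend on the host set differing from the bootstrap set.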
from kungfu.