KungFu job hangs in an inconsistent version when I scale down/up multiple times (closed, 14 comments)

lsds commented on June 22, 2024

KungFu job hangs in an inconsistent version when I scale down/up multiple times

from kungfu.

Comments (14)

lgarithm commented on June 22, 2024

@zrss could you also share the config.json that you applied for each scaling step?

lgarithm commented on June 22, 2024

What flags did you pass to kungfu-run? They are printed at the very beginning of the log, e.g.

$ kungfu-run -w -np 4 echo 2
[arg] [0]=kungfu-run
[arg] [1]=-w
[arg] [2]=-np
[arg] [3]=4
[arg] [4]=echo
[arg] [5]=2
[kf-env]: KUNGFU_GIT_URL=/Users/lg/code/mirrors/github.com/lsds/KungFu
[nic] [0] lo0 :: 127.0.0.1/8, ::1/128, fe80::1/64
[nic] [1] gif0 :: 
[nic] [2] stf0 :: 
[nic] [3] en0 :: 192.168.1.85/24
[nic] [4] en3 :: 
[nic] [5] en4 :: 
[nic] [6] en1 :: 
[nic] [7] en2 :: 
[nic] [8] bridge0 :: 
[nic] [9] p2p0 :: 
[nic] [10] awdl0 :: fe80::4c97:11ff:feab:ca51/64
[nic] [11] llw0 :: fe80::4c97:11ff:feab:ca51/64
[nic] [12] utun0 :: fe80::9d17:9272:6aa8:a8e8/64
[nic] [13] utun1 :: fe80::e340:7496:425b:77ee/64
[nic] [14] en5 :: fe80::aede:48ff:fe00:1122/64
[I] watching config server

I suspect the -init-version flag is not set correctly for the second scale-up.
According to the example

https://github.com/lsds/KungFu/blob/master/tests/go/cmd/kungfu-cluster-manager-example/kungfu-cluster-manager-example.go#L89

it should be set to -1 if the kungfu-run is not the first generation.

zrss commented on June 22, 2024

thanks for the reply ~

> it should be set to -1 if the kungfu-run is not the first generation.

The kungfu-run parameters are the same in all of my scale up/down cases ...

zrss commented on June 22, 2024

@lgarithm, can I draw this conclusion?

- bootstrap a new kungfu job with the default init-version (i.e. do not set it)
- always set init-version to -1 for a newly added kungfu-run

rankeey commented on June 22, 2024

@lgarithm can we just set init-version to -1, not only for the newly added kungfu-run but also for the first-generation kungfu-run? It is hard for the cluster to distinguish whether it is the first generation.

lgarithm commented on June 22, 2024

> @lgarithm, can I draw this conclusion?
>
> - bootstrap a new kungfu job with the default init-version (i.e. do not set it)
> - always set init-version to -1 for a newly added kungfu-run

Yes, this is correct.
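
The confirmed rule can be sketched as a small launcher helper. This is only an illustration: `runnerArgs` and its `firstGeneration` parameter are hypothetical names, and the thread does not say how a launcher would learn which generation it belongs to. The `-w`, `-np` and `-init-version` flags are the ones that appear in this thread.

```go
package main

import (
	"fmt"
	"strconv"
)

// runnerArgs builds the argument list for one kungfu-run process.
// firstGeneration must be supplied by the caller (hypothetical input;
// how to determine it is exactly what this thread is discussing).
func runnerArgs(firstGeneration bool, np int, prog ...string) []string {
	args := []string{"kungfu-run", "-w"}
	if !firstGeneration {
		// Runners added by a later scaling step join with -init-version -1.
		// The first generation leaves the flag unset and uses the default.
		args = append(args, "-init-version", "-1")
	}
	args = append(args, "-np", strconv.Itoa(np))
	return append(args, prog...)
}

func main() {
	fmt.Println(runnerArgs(true, 4, "echo", "2"))
	fmt.Println(runnerArgs(false, 4, "echo", "2"))
}
```

The helper only assembles arguments; it does not start any process, so it can be checked without a KungFu installation.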

lgarithm commented on June 22, 2024

> @lgarithm can we just set init-version to -1, not only for the newly added kungfu-run but also for the first-generation kungfu-run? It is hard for the cluster to distinguish whether it is the first generation.

We can consider this as a future improvement, but currently I can't think of a clean way to do it.

lgarithm commented on June 22, 2024

> @lgarithm can we just set init-version to -1, not only for the newly added kungfu-run but also for the first-generation kungfu-run? It is hard for the cluster to distinguish whether it is the first generation.

If you can manually initialize the first-generation kungfu-runs, then you can always set init-version to -1.

lgarithm commented on June 22, 2024

i.e. start the first generation kungfu-run with -init-version -1, then run this

KungFu/srcs/go/kungfu/peer/peer.go

Lines 191 to 205 in 06d742e

    var notify execution.PeerFunc = func(ctrl plan.PeerID) error {
        ctx, cancel := context.WithTimeout(context.TODO(), config.WaitRunnerTimeout)
        defer cancel()
        n, err := p.router.Wait(ctx, ctrl)
        if err != nil {
            return err
        }
        if n > 0 {
            log.Warnf("%s is up after pinged %d times", ctrl, n+1)
        }
        return p.router.Send(ctrl.WithName("update"), stage.Encode(), connection.ConnControl, 0)
    }
    if err := notify.Par(cluster.Runners); err != nil {
        utils.ExitErr(err)
    }

in your cluster manager.

zrss commented on June 22, 2024

@lgarithm thanks for the reply, I'd like to try it. ~~currently, it seems the only way for us to do~~

To clarify:

In our current architecture, the cluster manager generates a host file (it records only the IPs of the containers), and we introduced a kungfu-mng process that converts the host file into the config.json used by KungFu.

kungfu-mng runs inside the container, and every container has the same meta info, including the bootstrap command. That is why we want to set init-version to a fixed value: through the cluster manager we can only change the number of containers (i.e. the elastic feature of Volcano on K8S).

When we scale the kungfu job up or down, the cluster manager updates the host file and starts the new container (or shuts one down).

So right now I can't think of a way to tell, from inside a container, that it was newly added, unless the cluster manager tags the new container somehow (for example, by setting a SCALE_OUT env in it).

Alternatively, kungfu-mng can compare the containers in the host file with the -H argument of the kungfu-run bootstrap command:

- number of containers (and IPs) == -H: the first generation
- number of containers (and IPs) != -H: not the first generation

then

- first generation: bootstrap kungfu-run with init-version=0
- not the first generation: bootstrap kungfu-run with init-version=-1
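
A minimal sketch of that comparison, as kungfu-mng might perform it. The names `hostsFromFlag` and `initVersion` are hypothetical, and the `-H` value is assumed to be a comma-separated list of IPv4 entries of the form `ip` or `ip:slots`; check the real format before relying on this.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// hostsFromFlag extracts the sorted IPs from a -H style value.
// Assumed format for this sketch: "ip[:slots],ip[:slots],..." (IPv4 only).
func hostsFromFlag(h string) []string {
	var ips []string
	for _, part := range strings.Split(h, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		ips = append(ips, strings.SplitN(part, ":", 2)[0])
	}
	sort.Strings(ips)
	return ips
}

// initVersion applies the rule from the comment above: if the host file lists
// exactly the hosts given by -H, treat this as the first generation (0);
// otherwise treat it as a later generation (-1).
func initVersion(hostFile []string, hFlag string) int {
	current := append([]string(nil), hostFile...) // copy before sorting
	sort.Strings(current)
	initial := hostsFromFlag(hFlag)
	if len(current) != len(initial) {
		return -1
	}
	for i := range current {
		if current[i] != initial[i] {
			return -1
		}
	}
	return 0
}

func main() {
	h := "10.0.0.1:1,10.0.0.2:1"
	fmt.Println(initVersion([]string{"10.0.0.1", "10.0.0.2"}, h))
	fmt.Println(initVersion([]string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}, h))
}
```

The result depends only on the current contents of the host file, so a host list that returns to its initial contents after several scaling steps is classified as the first generation again.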

zrss commented on June 22, 2024

https://github.com/volcano-sh/volcano

lgarithm commented on June 22, 2024

What if the config.json is restored to its original contents after two scaling operations?

> - number of containers (and IPs) == -H: the first generation
> - number of containers (and IPs) != -H: not the first generation
>
> then
>
> - first generation: bootstrap kungfu-run with init-version=0
> - not the first generation: bootstrap kungfu-run with init-version=-1

lgarithm commented on June 22, 2024

How about adding a version field to the config.json object?

zrss commented on June 22, 2024

> What if the config.json is restored to its original contents after two scaling operations?
>
> > - number of containers (and IPs) == -H: the first generation
> > - number of containers (and IPs) != -H: not the first generation
> >
> > then
> >
> > - first generation: bootstrap kungfu-run with init-version=0
> > - not the first generation: bootstrap kungfu-run with init-version=-1

We (the platform) should require that the number of instances never drops below the default value when scaling down; that simplifies the scenario.

> How about adding a version field to the config.json object?

Good idea. We can file a feature request with the cluster manager to add a version field to the host file. Generally speaking, version = version + 1 on every scale up/down.
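
To make the version idea concrete, here is a rough sketch of a versioned host file. The `HostFile` type, its JSON field names, and the rule "version 0 means first generation" are all assumptions made for illustration; neither KungFu's config.json nor the cluster manager's host file is known to use this schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HostFile is a hypothetical schema; the field names are illustrative only.
type HostFile struct {
	Version int      `json:"version"`
	Hosts   []string `json:"hosts"`
}

// Scale returns the host file for the next scaling step:
// the new host list, with the version incremented by one.
func (h HostFile) Scale(hosts []string) HostFile {
	return HostFile{Version: h.Version + 1, Hosts: append([]string(nil), hosts...)}
}

// FirstGeneration reports whether no scaling step has happened yet
// (assumption of this sketch: the initial host file has version 0).
func (h HostFile) FirstGeneration() bool { return h.Version == 0 }

func main() {
	v0 := HostFile{Hosts: []string{"10.0.0.1", "10.0.0.2"}}
	v1 := v0.Scale([]string{"10.0.0.1", "10.0.0.2", "10.0.0.3"})
	v2 := v1.Scale(v0.Hosts) // same hosts as v0, but a different version
	for _, h := range []HostFile{v0, v1, v2} {
		b, _ := json.Marshal(h)
		fmt.Println(string(b), "first generation:", h.FirstGeneration())
	}
}
```

Because the version only ever increases, a host list that returns to its initial contents still carries a version greater than zero, which is what removes the ambiguity raised earlier in the thread.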
