Comments (5)

zhuzilin commented on June 12, 2024

You can wrap part of the model with the `torch_scope` interface; everything inside the torch scope will be trained in fp32. For example, in the MoE example:

# The MoE modules are mainly model parallel, so we need to use `torch_scope`
# to separate them from the other chunk-based data parallel modules.
# MoE modules also take care of their own communication, which is why
# we need to disable allreduce in the torch scope.
with torch_scope(do_allreduce=False):
    self.output = fmoe.FMoETransformerMLP(
        num_expert=2,
        world_size=get_world_size(),
        d_model=config.hidden_size,
        d_hidden=config.intermediate_size,
        gate=fmoe.gates.NaiveGate,
    )

Note, however, that if you only want to keep a single layer in fp32, `do_allreduce` here should be set to True.

from patrickstar.

Jack47 commented on June 12, 2024

Nice, so this means this part is managed by torch itself, without ps (PatrickStar) getting involved?

liaojianjin commented on June 12, 2024

That should be right. `torch_scope` just makes a temporary modification to the config:

@contextmanager  # from contextlib; needed since torch_scope is used as `with torch_scope(...)`
def torch_scope(do_allreduce=True):
    r"""All parameters initialized in this scope will not be managed in chunks."""
    _runtime_config.push()
    _runtime_config.config["use_chunk"] = False
    _runtime_config.config["do_allreduce"] = do_allreduce
    yield
    _runtime_config.pop()

Then, after the module's init, the parameters are registered as torch-managed, and the inputs/outputs are kept as float:
if not _runtime_config.use_chunk:
    for name, param in module.named_parameters(recurse=False):
        name = f"{module.__class__.__name__}.{name}_{self.param_idx}"
        register_param(param, ParamType.TORCH_BASED, torch.float, name)
        if _runtime_config.do_allreduce:
            self.client.torch_param_allreduce_list.append(param)
    # We need to cast the inputs to fp32 for the unmanaged modules.
    cast_forward(module, torch.float)
    return
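As a rough, torch-free sketch of what `cast_forward` is doing conceptually (the real helper works on torch modules and dtypes; this only illustrates the wrap-the-forward idea, with all names hypothetical):

```python
def cast_forward(module, cast):
    """Sketch: wrap `module.forward` so every positional input is converted
    with `cast` before the original forward runs. Illustration only, not
    patrickstar's actual implementation."""
    original = module.forward
    def wrapped(*inputs):
        return original(*(cast(x) for x in inputs))
    module.forward = wrapped

class Dummy:
    # Toy "module": forward just reports the type of its input.
    def forward(self, x):
        return type(x).__name__

m = Dummy()
cast_forward(m, float)  # Python float stands in for torch.float here
assert m.forward(3) == "float"  # an int input arrives as float
```

The real version additionally has to cast the outputs back, so that fp16 chunk-managed modules downstream still receive the dtype they expect.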

zhuzilin commented on June 12, 2024

@Jack47 @liaojianjin
We are currently doing a full refactor of PatrickStar... so these features may well change later. For example, we will probably reuse PyTorch's autocast directly instead of implementing our own mixed-precision training. In that case, the problem raised in this issue of keeping layernorm in fp32 should be solved automatically, and there would be no need to re-align precision after migrating. So the currently exposed interface may be rather crude, sorry...

Jack47 commented on June 12, 2024

Got it, 👍
