Comments (2)
Hi @XilunWu, this looks like a DTensor + TP issue. Does it look familiar to you?

@Xingzhi107, do you mind sharing more about train_gpt2.py? We want to understand more about the layer norm module (how it is defined) and how it is annotated (ParallelStyle).
Thanks for your reply!
I defined a class:

# Imports assumed by the snippet (torch 2.2-era module paths):
from functools import partial
from typing import Dict, Optional

import torch.nn as nn
from torch.distributed._tensor import DTensor, distribute_module, distribute_tensor
from torch.distributed._tensor.placement_types import Placement
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.tensor.parallel import ParallelStyle

class ShardParallel(ParallelStyle):
    def __init__(
        self,
        *,
        input_layouts: Optional[Placement] = None,
        output_layouts: Optional[Placement] = None,
        use_local_output: bool = True,
        param_layouts: Optional[Dict[str, Placement]] = None,
    ):
        super().__init__()
        self.input_layouts = input_layouts
        self.output_layouts = output_layouts
        self.use_local_output = use_local_output
        self.param_layouts = param_layouts

    @staticmethod
    def _prepare_input_fn(input_layouts, inputs, device_mesh):
        # Wrap the local input tensor into a DTensor with the requested placement.
        input_tensor = inputs[0]
        if not isinstance(input_tensor, DTensor):
            input_tensor = DTensor.from_local(
                input_tensor, device_mesh, [input_layouts], run_check=False
            )
        return input_tensor

    def _partition_fn(self, name, module, device_mesh):
        # Replace each parameter with a DTensor distributed per param_layouts.
        for param_name, param in module.named_parameters():
            dist_param = nn.Parameter(
                distribute_tensor(param, device_mesh, [self.param_layouts[param_name]])
            )
            module.register_parameter(param_name, dist_param)
        ...

    def _apply(self, module: nn.Module, device_mesh: DeviceMesh) -> nn.Module:
        return distribute_module(
            module,
            device_mesh,
            self._partition_fn,
            partial(self._prepare_input_fn, self.input_layouts),
            partial(self._prepare_output_fn, self.output_layouts, self.use_local_output),
        )
and use:

    parallelize_plan = {
        'ln_1': ShardParallel(
            input_layouts=Replicate(),
            output_layouts=Replicate(),
            param_layouts={'weight': Replicate()},
        )
    }

If _partition_fn is only used on nn.Linear and nn.Embedding, the error occurs here:

    outputs = tp_model(batch['input_ids'].cuda(rank), None, batch['attention_mask'].cuda(rank))
    ...
    # backward pass
    loss.backward()
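For context, here is a minimal sketch of how the plan is applied end to end (the mesh setup and the names model, tp_model, and world_size are illustrative assumptions, not exact code from my script):

    from torch.distributed._tensor import Replicate
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import parallelize_module

    # Hypothetical setup: a 1-D tensor-parallel mesh over all ranks.
    mesh = init_device_mesh("cuda", (world_size,))
    tp_model = parallelize_module(
        model,  # 'ln_1' in the plan is resolved relative to this module
        mesh,
        parallelize_plan={
            'ln_1': ShardParallel(
                input_layouts=Replicate(),
                output_layouts=Replicate(),
                param_layouts={'weight': Replicate()},
            )
        },
    )

Running the training step above then fails in backward with: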
Traceback (most recent call last):
  File "train_gpt2.py", line 255, in <module>
    loss.backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/_tensor/api.py", line 280, in __torch_dispatch__
    return DTensor._op_dispatcher.dispatch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/_tensor/dispatch.py", line 106, in dispatch
    self.sharding_propagator.propagate(op_info)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/_tensor/sharding_prop.py", line 163, in propagate
    output_sharding = self.propagate_op_sharding(op_info.schema)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/_tensor/sharding_prop.py", line 369, in propagate_op_sharding_non_cached
    raise NotImplementedError(
NotImplementedError: Operator aten.native_layer_norm_backward.default does not have a sharding strategy registered.

(The same traceback is printed by both ranks.)
[2024-06-15 04:27:31,514] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 65897) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
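For reference, the failure likely reproduces without the full training script. A minimal sketch, assuming 2 GPUs and an affected torch build (run with torchrun --nproc_per_node=2 repro.py; the file name and tensor sizes are arbitrary):

    import os
    import torch
    import torch.nn as nn
    from torch.distributed._tensor import DTensor, Replicate, distribute_tensor
    from torch.distributed.device_mesh import init_device_mesh

    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    mesh = init_device_mesh("cuda", (2,))

    ln = nn.LayerNorm(16).cuda()
    # Mirror ShardParallel._partition_fn: replace the LayerNorm parameters
    # with replicated DTensors.
    for pname, p in list(ln.named_parameters()):
        ln.register_parameter(pname, nn.Parameter(distribute_tensor(p, mesh, [Replicate()])))

    x = DTensor.from_local(
        torch.randn(4, 16, device="cuda", requires_grad=True),
        mesh, [Replicate()], run_check=False,
    )
    # Forward succeeds; backward dispatches aten.native_layer_norm_backward.default
    # through DTensor sharding propagation, which raises NotImplementedError on
    # builds where no strategy is registered for that op.
    ln(x).to_local().sum().backward()

If I understand the error correctly, newer torch releases do register a sharding strategy for the layer-norm backward op, so upgrading (or leaving ln_1 out of the parallelize_plan so it runs on plain local tensors) may be worth trying; I have not verified which release first added it.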