Comments (4)
@HongLouyemeng Hi.
In our experience, after setting up the environment for running .py scenarios, you should build oneCCL with a debug build type, like:
cmake .. -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCOMPUTE_BACKEND=dpcpp -DCMAKE_BUILD_TYPE=debug && make -j && make install
You can try native gdb:
mpiexec -n 2 gdb -ex run --args python test.py
Or try Intel MPI's gdb tool:
mpiexec -n 2 -gdb python test.py
You can then set a breakpoint in oneCCL's code, which should help.
If you hit an issue and want a backtrace, run this command in the gdb console:
thread apply all bt
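If typing the native gdb command by hand for every run gets tedious, it can be assembled from Python. This is only a sketch under assumptions: `build_gdb_command` and its `batch_backtrace` parameter are hypothetical names, not part of oneCCL or Intel MPI. It relies only on the standard gdb options `-batch`, `-ex` and `--args`, and assumes `mpiexec` and `gdb` are on the PATH.

```python
import shlex


def build_gdb_command(num_ranks, script, python="python", batch_backtrace=False):
    """Hypothetical helper: build the argv for running each MPI rank under gdb.

    With batch_backtrace=True, gdb runs non-interactively and prints a
    backtrace of all threads once the program stops (for example on a crash).
    """
    gdb = ["gdb"]
    if batch_backtrace:
        gdb.append("-batch")  # exit gdb after the listed commands finish
    gdb += ["-ex", "run"]  # start the program immediately
    if batch_backtrace:
        gdb += ["-ex", "thread apply all bt"]  # same command as in the gdb console
    gdb += ["--args", python, script]  # everything after --args is the debuggee
    return ["mpiexec", "-n", str(num_ranks)] + gdb


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_gdb_command(2, "test.py", batch_backtrace=True)))
```

The returned list could be passed to `subprocess.run` as is. Printing it first makes it easy to check that the command matches the one above before launching anything.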
from oneccl.
@HongLouyemeng Hope this helps. I will close this issue.
With the same code launched through torch.multiprocessing, the lldb debugger does not work:
import os
import time
import numpy as np
import torch
import torch.multiprocessing as mp
import torch.distributed as dist
from torch.distributed._tensor import DTensor, DeviceMesh, Shard, Replicate, distribute_tensor, zeros


def run(rank, size):
    a = torch.tensor([[0, 2.], [3, 0]])
    a.neg()


def init_process(rank_id, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '12347'
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    dist.init_process_group(backend, rank=rank_id, world_size=size)
    fn(rank_id, size)


if __name__ == "__main__":
    big_tensor = torch.arange(0, 16).reshape(4, 4)
    size = 1
    processes = []
    mp.set_start_method("spawn")
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
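One possible reason, which is an assumption and not confirmed in this thread, is that the "spawn" start method runs `init_process` in a separate child process, so a debugger started on the parent script never sees the code of interest. A common workaround is to have each child print its PID and pause, then attach with `lldb -p <pid>` or `gdb -p <pid>`. A minimal sketch follows; `wait_for_debugger` and the `WAIT_FOR_DEBUGGER` environment variable are hypothetical names introduced only for illustration.

```python
import os
import sys
import time


def wait_for_debugger(timeout_s=60.0, poll_s=0.5, flag_env="WAIT_FOR_DEBUGGER"):
    """Hypothetical helper: pause this process so a debugger can attach to it.

    Does nothing unless the environment variable named by flag_env is "1".
    Returns True if it paused, False if it was skipped.
    """
    if os.environ.get(flag_env) != "1":
        return False
    print(
        f"PID {os.getpid()} waiting up to {timeout_s}s for a debugger to attach",
        file=sys.stderr,
        flush=True,
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # Sleep in short steps so an attached debugger can interrupt quickly.
        time.sleep(poll_s)
    return True
```

Calling `wait_for_debugger()` as the first statement of `init_process` and launching with `WAIT_FOR_DEBUGGER=1` gives time to attach to the printed PID, set breakpoints, and continue. Because spawned children inherit the parent's environment, the flag set on the command line reaches every rank.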
Thanks for your answer OVO