
hsword / hetu


A high-performance distributed deep learning system targeting large-scale and automated distributed training. If you are interested, please visit/star/fork https://github.com/PKU-DAIR/Hetu.

License: Apache License 2.0

CMake 0.46% Shell 0.22% C++ 12.26% Python 71.04% HTML 0.06% C 1.72% Cuda 14.14% Makefile 0.11%
deep-learning distributed-systems

hetu's People

Contributors

afdwang, ccchengff, codecaution, gooliang, hankpipi, hsword, hugozhl, initzhang, nox-410, pinxuezhao, sj1104

hetu's Issues

How to run ctr examples with GPU

Hi Hetu's authors,

Are there any example scripts for running the CTR models on a GPU? The current examples use heturun for multi-worker training, but this tool has no GPU-related arguments. I have confirmed on my testbed that the current CTR scripts run on the CPU instead of the GPU.

Looking forward to your response. Thanks.

OSError triggered when running the PS demo

The build succeeds, and both single-GPU and allreduce modes run fine,
but running the demo in PS mode raises a shared-library (.so) error:
bash examples/cnn/scripts/hetu_2gpu_ps.sh mlp CIFAR10

The error is as follows:
Cluster: {
Chief: localhost,
Servers(1): {'localhost': 1},
Workers(2): {'localhost': 2},
}
Process Process-1:
Traceback (most recent call last):
File "/home/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/anaconda3/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/Hetu/bin/../python/runner.py", line 27, in start_sched
ht.scheduler_init()
File "/home/Hetu/python/hetu/gpu_ops/executor.py", line 106, in scheduler_init
ps_comm = ll(path_to_lib("libps.so"))
File "/home/anaconda3/lib/python3.8/ctypes/init.py", line 459, in LoadLibrary
return self._dlltype(name)
File "/home/anaconda3/lib/python3.8/ctypes/init.py", line 381, in init
self._handle = _dlopen(self._name, mode)
OSError: /home/Hetu/python/hetu/gpu_ops/../../../build/lib/libps.so: undefined symbol: _ZN6google8protobuf8internal26fixed_address_empty_stringB5cxx11E
Process Process-2:
Traceback (most recent call last):
File "/home/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/anaconda3/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/Hetu/bin/../python/runner.py", line 33, in start_server
ht.server_init()
File "/home/Hetu/python/hetu/gpu_ops/executor.py", line 94, in server_init
ps_comm = ll(path_to_lib("libps.so"))
File "/home/anaconda3/lib/python3.8/ctypes/init.py", line 459, in LoadLibrary
return self._dlltype(name)
File "/home/anaconda3/lib/python3.8/ctypes/init.py", line 381, in init
self._handle = _dlopen(self._name, mode)
OSError: /home/Hetu/python/hetu/gpu_ops/../../../build/lib/libps.so: undefined symbol: _ZN6google8protobuf8internal26fixed_address_empty_stringB5cxx11E
[1,1]:Building MLP model...
[1,1]:Traceback (most recent call last):
[1,1]: File "/home/Hetu/examples/cnn/scripts/../main.py", line 122, in
[1,1]: executor = ht.Executor(eval_nodes, dist_strategy=strategy)
[1,1]: File "/home/Hetu/python/hetu/gpu_ops/executor.py", line 375, in init
[1,1]: config = HetuConfig(eval_node_list=all_eval_nodes,
[1,1]: File "/home/Hetu/python/hetu/gpu_ops/executor.py", line 247, in init
[1,1]: worker_init()
[1,1]: File "/home/Hetu/python/hetu/gpu_ops/executor.py", line 83, in worker_init
[1,1]: ps_comm = ll(path_to_lib("libps.so"))
[1,1]: File "/home/anaconda3/lib/python3.8/ctypes/init.py", line 459, in LoadLibrary
[1,1]: return self._dlltype(name)
[1,1]: File "/home/anaconda3/lib/python3.8/ctypes/init.py", line 381, in init
[1,1]: self._handle = _dlopen(self._name, mode)
[1,1]:OSError: /home/Hetu/python/hetu/gpu_ops/../../../build/lib/libps.so: undefined symbol: _ZN6google8protobuf8internal26fixed_address_empty_stringB5cxx11E
[1,1]:Traceback (most recent call last):
[1,1]: File "/home/Hetu/python/hetu/gpu_ops/executor.py", line 507, in del
[1,1]: if self.config.comp_stream is not None:
[1,1]:AttributeError: 'Executor' object has no attribute 'config'
[1,0]:Exception ignored in: <function Executor.__del__ at 0x7f8772777700>
[1,0]:Traceback (most recent call last):
[1,0]: File "/home/Hetu/python/hetu/gpu_ops/executor.py", line 507, in del
[1,0]: if self.config.comp_stream is not None:
[1,0]:AttributeError: 'Executor' object has no attribute 'config'

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[56167,1],1]
Exit code: 1
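
For reference, this undefined symbol (google::protobuf::internal::fixed_address_empty_string with the cxx11 ABI tag) usually points to a mismatch between the protobuf used when building libps.so and the libprotobuf loaded at runtime (for example an older Anaconda copy), or to mixed _GLIBCXX_USE_CXX11_ABI settings. A minimal diagnostic sketch, not part of Hetu and with an example library path, that reproduces the dlopen and shows which protobuf the Python environment sees:

    # Diagnostic sketch only (not Hetu code); adjust LIBPS to the actual build path.
    import ctypes

    LIBPS = "/home/Hetu/build/lib/libps.so"  # example path from the log above

    try:
        ctypes.CDLL(LIBPS)
        print("libps.so loaded successfully")
    except OSError as err:
        print("dlopen failed:", err)

    try:
        import google.protobuf
        print("protobuf visible to this Python env:", google.protobuf.__version__)
    except ImportError:
        print("google.protobuf is not importable in this environment")

If the build-time and runtime protobuf installations differ, rebuilding Hetu against the runtime protobuf (or aligning the two versions) is the usual first thing to try.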

Get stuck when launching the server in hybrid mode

Hello,
I was trying to run Hetu in hybrid mode with multiple nodes.
My command to launch the server and scheduler is
python -m hetu.launcher ${workdir}/../settings/local_s1.yml -n 1 --sched

and the content of the yml file is as follows:
[screenshot of the yml file omitted]

However, both the scheduler and server processes are stuck in the following loop (line 357 in ps-lite/src/van.cc):
[screenshot of the log omitted]

Without shutting down the previous program, I started the worker, and it could not find the server.
[screenshot of the worker log omitted]

My workers and server are located on the same machine.
Is there anything wrong with my configuration, or did I miss any preceding steps?

Thank you very much.

The Question about reverse_layout_transform_kernel

Hi authors,

When I simulate a configuration with num_local_gpus=2, num_nodes=2, samples=8, hidden=1, as in Figure 6 of https://arxiv.org/pdf/2203.14685.pdf, I find that ha2a_reverse_layout_transform_kernel may not produce the ideal result.

On worker 0, the output of ha2a_reverse_layout_transform_kernel may be 00 01 20 21 10 11 30 31 instead of 00 10 20 30 01 11 21 31.

Should this line of code be changed from

output_data[(gpu_id*data_size_per_gpu+target_node_id*data_size_per_gpu_per_node+target_gpu_id*data_size_per_gpu_per_gpu+offset) * (hidden) + j]=input_data[i * (hidden) + j];

to the following?
output_data[(target_gpu_id*data_size_per_gpu+target_node_id*data_size_per_gpu_per_node+gpu_id*data_size_per_gpu_per_gpu+offset) * (hidden) + j]=input_data[i * (hidden) + j];

Sorry for my poor English.
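
To make the two layouts above concrete, here is a small sketch; the labeling is my assumption (a label "XY" means chunk Y originating from global GPU X, with global id = node_id * num_local_gpus + local_gpu_id), not taken from the kernel itself:

    # Assumption: label "XY" = chunk Y from global GPU X,
    # where global id = node_id * num_local_gpus + local_gpu_id.
    num_nodes, num_local_gpus, chunks_per_gpu = 2, 2, 2

    # Layout expected on worker 0: for each chunk index, take one chunk from
    # every source GPU in global-id order.
    expected = [f"{g}{c}" for c in range(chunks_per_gpu)
                for g in range(num_nodes * num_local_gpus)]

    # Layout reportedly produced: grouped per source GPU, with sources ordered
    # by (local_gpu_id, node_id) instead of by global GPU id.
    observed = [f"{node * num_local_gpus + local}{c}"
                for local in range(num_local_gpus)
                for node in range(num_nodes)
                for c in range(chunks_per_gpu)]

    print(" ".join(expected))  # 00 10 20 30 01 11 21 31
    print(" ".join(observed))  # 00 01 20 21 10 11 30 31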

Error when running RDMA

Hi Hetu Authors,

There are some errors when running Hetu with RDMA. The error logs are:

[libprotobuf ERROR google/protobuf/message_lite.cc:133] Can't parse message of type "ps.PBMeta" because it is missing required fields: (cannot determine missing fields for lite message)
[10-0-10-200:13:06:00] /hetu/ps-lite/include/common/logging.h:317: [13:06:00] /hetu/ps-lite/src/van.cc:544: Check failed: pb.ParseFromArray(meta_buf, buf_size) failed to parse string into protobuf

Stack trace returned 8 entries:
[bt] (0) /hetu/python/hetu/gpu_ops/../../../build/lib/libps.so(dmlc::StackTrace[abi:cxx11]()+0x17f) [0x7f8fe4c8f3bf]
[bt] (1) /hetu/python/hetu/gpu_ops/../../../build/lib/libps.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x3b) [0x7f8fe4c90b0b]
[bt] (2) /hetu/python/hetu/gpu_ops/../../../build/lib/libps.so(ps::Van::UnpackMeta(char const*, int, ps::Meta*)+0x3ae) [0x7f8fe4ccbade]
[bt] (3) /hetu/python/hetu/gpu_ops/../../../build/lib/libps.so(ps::IBVerbsVan::RecvMsg(ps::Message*)+0x182) [0x7f8fe4cdd012]
[bt] (4) /hetu/python/hetu/gpu_ops/../../../build/lib/libps.so(ps::Van::Receiving()+0x216) [0x7f8fe4ccaa36]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44c0) [0x7f9007a054c0]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f9011b846db]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f9011ebd61f]

To compile ps-lite with RDMA (ibverbs), I added the macro in ps-lite/CMakeLists.txt:

add_compile_definitions(DMLC_USE_IBVERBS)
target_link_libraries(ps PRIVATE -lrdmacm -libverbs)

There is also a compilation bug at ibverbs_van.h:657: there is no data member named msg.meta.datasize.

Therefore I implemented a helper function in Message.h for struct Message:

struct Message {
    /** \brief the meta info of this message */
    Meta meta;
    /** \brief the large chunk of data of this message */
    std::vector<SArray<char>> data;

    std::string DebugString() const {
        std::stringstream ss;
        ss << meta.DebugString();
        if (data.size()) {
            ss << " Body:";
            for (const auto &d : data)
                ss << " data_size=" << d.size();
        }
        return ss.str();
    }

    inline size_t data_size() const {
        size_t data_len = 0;
        for (auto &iter : data)
            data_len += iter.size();
        return data_len;
    }
};

and corrected the offending line to use size_t data_len = msg.data_size();

My Hetu is based on commit https://github.com/Hsword/Hetu/tree/120b776d653708adfccbadc8e1b35d633eaf1161.

The test model is wdl_criteo with ps_num=1 and worker_num=8, using the Hybrid communication pattern.

Hope you can help us :)

ModuleNotFoundError: No module named 'hetu_cache'

Traceback (most recent call last):
File "/home/Hetu/examples/ctr/tests/../run_hetu.py", line 200, in
worker(args)
File "/home/Hetu/examples/ctr/tests/../run_hetu.py", line 123, in worker
executor = ht.Executor(eval_nodes, dist_strategy=strategy, cstable_policy=args.cache,
File "/home/Hetu/python/hetu/gpu_ops/executor.py", line 395, in init
config = HetuConfig(eval_node_list=all_eval_nodes,
File "/home/Hetu/python/hetu/gpu_ops/executor.py", line 367, in init
topo_sort_with_hook(self.my_eval_nodes, self)
File "/home/Hetu/python/hetu/gpu_ops/executor.py", line 1319, in topo_sort_with_hook
topo_sort_dfs_with_hook(node, visited, config)
File "/home/Hetu/python/hetu/gpu_ops/executor.py", line 1332, in topo_sort_dfs_with_hook
topo_sort_dfs_with_hook(n, visited, config)
File "/home/Hetu/python/hetu/gpu_ops/executor.py", line 1333, in topo_sort_dfs_with_hook
node.forward_hook(config)
File "/home/Hetu/python/hetu/gpu_ops/ParameterServerCommunicate.py", line 162, in forward_hook
self.cache = CacheSparseTable(
File "/home/Hetu/python/hetu/cstable.py", line 27, in init
import hetu_cache
ModuleNotFoundError: No module named 'hetu_cache'

Question about using Galvatron: AssertionError: 50257 is not divisible by 4

I am very sorry; I may have cloned the wrong branch previously. When I tried to run the train_dist.sh script according to the README using this branch, I ran into difficulties because the system reported that a module named gpt was missing.
Subsequently, I tried switching to another repository, but I encountered the following error:

Traceback (most recent call last):
  File "train_dist.py", line 87, in <module>
    train(args)
  File "train_dist.py", line 33, in train
    model = construct_hybrid_parallel_model(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/models/gpt_hf/GPTModel_hybrid_parallel.py", line 12, in construct_hybrid_parallel_model
    hp_model = construct_hybrid_parallel_model_api(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/core/hybrid_parallel_model.py", line 114, in construct_hybrid_parallel_model_api
    model = construct_tensor_parallel_model(model, config, tp_groups_whole) # get_enc_groups(tp_groups_whole, module_types))
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/models/gpt_hf/GPTModel_tensor_parallel.py", line 75, in construct_tensor_parallel_model
    setattr(model.transformer, 'wte', VocabParallelEmbedding(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/site_package/megatron/core/tensor_parallel/layers.py", line 194, in __init__
    ) = VocabUtility.vocab_range_from_global_vocab_size(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/site_package/megatron/core/tensor_parallel/utils.py", line 110, in vocab_range_from_global_vocab_size
    per_partition_vocab_size = divide(global_vocab_size, world_size)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/site_package/megatron/core/utils.py", line 22, in divide
    ensure_divisibility(numerator, denominator)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/site_package/megatron/core/utils.py", line 16, in ensure_divisibility
    assert numerator % denominator == 0, "{} is not divisible by {}".format(numerator, denominator)
AssertionError: 50257 is not divisible by 4

I tried to modify the vocab_size, but it still doesn't work.
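
For reference, the usual Megatron-style workaround is not to edit vocab_size by hand but to pad the vocabulary so that it divides evenly across the tensor-parallel ranks. A minimal sketch, with an assumed function name and an assumed divisible_by default rather than Galvatron's actual API:

    # Sketch of the common Megatron-style vocab padding; names are illustrative.
    def pad_vocab_size(vocab_size: int, tp_size: int, divisible_by: int = 128) -> int:
        multiple = divisible_by * tp_size
        return ((vocab_size + multiple - 1) // multiple) * multiple

    print(pad_vocab_size(50257, 4))  # 50688, which is divisible by 4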
Then I tried the stable version in this repository and encountered the following error (every rank printed the same traceback; one copy is shown):

Traceback (most recent call last):
  File "train_dist.py", line 99, in <module>
    train(args)
  File "train_dist.py", line 42, in train
    gpt_model = GPTLMHeadModel(config, device='meta' if args.initialize_on_meta else 'cpu')
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/models/gpt.py", line 582, in __init__
    self.transformer = GPTModel(config, process_group=process_group, **factory_kwargs)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/models/gpt.py", line 466, in __init__
    [
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/models/gpt.py", line 467, in <listcomp>
    create_block(config, layer_idx=i, process_group=process_group, **factory_kwargs)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/models/gpt.py", line 279, in create_block
    block = Block(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/modules/block.py", line 68, in __init__
    self.mixer = mixer_cls(dim)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/modules/mha.py", line 456, in __init__
    raise ImportError("fused_dense is not installed")
ImportError: fused_dense is not installed
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2890126) of binary: /home/wyr/anaconda3/envs/galvatron/bin/python3
Traceback (most recent call last):
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_dist.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2890127)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2890128)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2890129)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 2890130)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 2890132)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 2890134)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 2890137)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2890126)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you please help me resolve this issue? Or could you provide some possible solutions? Thank you for your help and support!

Questions about some specific numbers in Galvatron.

I notice that some specific numbers are hard-coded in Galvatron when searching for the optimal strategy. For example, in tools/Galvatron/bert/search_layerwise_hp_8gpus.py#L77~88, the parameter_size of BERT-Large is set to 48.05, tp_activation_per_bsz_dict is a constant dict, and other_model_states and other_activation_per_bsz are two more constants. Another example: in tools/Galvatron/bert/search_layerwise_hp_dist_16gpus.py#L93~94, there are two constant dicts called other_memory_pp_off and other_memory_pp_on. How are these numbers obtained? Are there any scripts or formulas for deriving these constants? Moreover, BERT-Large has about 340M parameters, so I am also confused about how 48.05 is obtained in Galvatron.

It would be of great help if you address the above questions. Many thanks!
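
For what it's worth, one plausible reading of 48.05 is the fp32 parameter memory of a single BERT-Large encoder layer in MiB, rather than the whole 340M-parameter model. A rough back-of-the-envelope check with an assumed per-layer breakdown (this is my reconstruction, not Galvatron's script):

    # Rough per-layer parameter count for BERT-Large (hidden size 1024);
    # the breakdown below is an assumption, not taken from Galvatron.
    h = 1024
    attn  = 4 * h * h + 4 * h   # Q/K/V and output projections, weights + biases
    ffn   = 8 * h * h + 5 * h   # h->4h and 4h->h matmuls with biases
    norms = 2 * 2 * h           # two LayerNorms (gamma and beta each)
    params_per_layer = attn + ffn + norms
    mib = params_per_layer * 4 / 2**20   # 4 bytes per fp32 parameter
    print(params_per_layer, round(mib, 2))  # 12596224 parameters, 48.05 MiB

If that reading is right, the other constants are presumably profiled or derived per layer in a similar way, but the authors would need to confirm.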

AssertionError: assert params_size == len(grads)

Describe the bug
The program unexpectedly terminates because params_size is not equal to the length of grads. I would appreciate the authors' help, thanks.

To Reproduce
Steps to reproduce the behavior:

  1. cd example/moe
  2. NCCL_DEBUG=DEBUG mpirun --mca btl '^openib' -np 1 python test_moe_top.py --top=1 --num_local_experts=2 --batch_size=16
  3. (substitute for step 2 but get the same log) NCCL_DEBUG=DEBUG mpirun --mca btl '^openib' -np 2 python test_moe_top.py --top=1 --num_local_experts=2 --batch_size=16

Logs

$NCCL_DEBUG=DEBUG mpirun --mca btl '^openib' -np 1 python test_moe_top.py --top=1 --num_local_experts=2 --batch_size=16
2022-11-29 10:08:01,460 - __main__ - INFO - Training MoE Examples on HETU
device_id:  0
2022-11-29 10:08:03,679 - __main__ - INFO - Step 0
Traceback (most recent call last):
  File "test_moe_top.py", line 86, in <module>
    loss_val, predict_y, y_val, _  = executor.run(
  File "/home/xinglinpan/Hetu/python/hetu/gpu_ops/executor.py", line 446, in run
    return self.subexecutor[name].run(eval_node_list, feed_dict, convert_to_numpy_ret_vals, **kwargs)
  File "/home/xinglinpan/Hetu/python/hetu/gpu_ops/executor.py", line 972, in run
    self.compute(self.computing_nodes,
  File "/home/xinglinpan/Hetu/python/hetu/gpu_ops/executor.py", line 1048, in compute
    node.compute(input_vals, node_val, cur_stream)
  File "/home/xinglinpan/Hetu/python/hetu/optimizer.py", line 116, in compute
    self.optimizer.update(input_vals, stream_handle)
  File "/home/xinglinpan/Hetu/python/hetu/optimizer.py", line 188, in update
    assert params_size == len(grads)
AssertionError
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[1691,1],0]
  Exit code:    1
--------------------------------------------------------------------------

params_size is 6 and len(grads) is 2

Platform

  • Device: GeForce RTX 2080Ti * 4
  • OS: Linux gpu9 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • CUDA version: 10.2
  • NCCL version: 2.10.3
  • PyTorch version: 1.9.1
  • Python Version: 3.8

Questions about MoE examples.

Hi, I'm unable to run the MoE example (test_moe_top.py).
The error message is as follows:

2024-04-26 15:13:06,594 - __main__ - INFO - Training MoE Examples on HETU
libibverbs: Warning: couldn't load driver '/usr/local/infiniband/lib/libibverbs/libmlx4': /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/infiniband/lib/libibverbs/libmlx4-rdmav2.so)
device_id:  0
Traceback (most recent call last):
  File "test_moe_top.py", line 81, in <module>
    comm_mode=args.comm_mode)
  File "/data/MoE/Hetu-main/python/hetu/gpu_ops/executor.py", line 463, in __init__
    train_name=train_name, val_name=val_name, **kargs)
  File "/data/MoE/Hetu-main/python/hetu/gpu_ops/executor.py", line 418, in __init__
    topo_sort_with_hook(self.my_eval_nodes, self)
  File "/data//MoE/Hetu-main/python/hetu/gpu_ops/executor.py", line 1499, in topo_sort_with_hook
    topo_sort_dfs_with_hook(node, visited, config)
  File "/data/MoE/Hetu-main/python/hetu/gpu_ops/executor.py", line 1506, in topo_sort_dfs_with_hook
    node.backward_hook(config)
  File "/data/MoE/Hetu-main/python/hetu/optimizer.py", line 174, in backward_hook
    cur_node, config.param_allreduce_group.get(cur_param, config.nccl_comm))
AttributeError: 'HetuConfig' object has no attribute 'param_allreduce_group'
Exception ignored in: <function Executor.__del__ at 0x7f63122fdef0>

Duplicate ps when building Hetu

Hello,
I tried to build Hetu with CMake 3.18.2. However, I got the following errors after running cmake:
[screenshot of the CMake error omitted]

It seems this is caused by Hetu and HetuML both defining a target named "ps". Is there a way to build Hetu without hitting this problem?

Best regards
Xiaodian

Invalid device ordinal error

Hello,
I am running HET on the Criteo dataset on a single-GPU node by setting HETU_VERSION = 'gpu' in HYBRID mode. I ran bash examples/ctr/tests/hybrid_wdl_criteo.sh, but I am getting the following error:

[screenshot of the "invalid device ordinal" error omitted]

I tried printing dev_id in gpu_runtime.cc. It always gives 1, 2, 3 instead of 0.

This is my configuration file:

shared :
  DMLC_PS_ROOT_URI : 127.0.0.1
  DMLC_PS_ROOT_PORT : 13100
  DMLC_NUM_WORKER : 2
  DMLC_NUM_SERVER : 1
  DMLC_PS_VAN_TYPE : p3
launch :
  worker : 2
  server : 1
  scheduler : true
nodes:
  - host: lmohit95
    servers: 1
    workers: 2
    chief: true
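
For reference, on a node with only one visible GPU, any device ordinal other than 0 is invalid, so the dev_id values 1, 2, 3 printed from gpu_runtime.cc are out of range. A purely illustrative sketch (not Hetu code) of the mapping I would check first, wrapping each process rank into the number of locally visible GPUs:

    # Illustrative only: keep the ordinal within the locally visible GPU count.
    num_visible_gpus = 1          # single-GPU node
    for rank in [0, 1, 2, 3]:     # process ranks observed in the log
        dev_id = rank % num_visible_gpus
        print(f"rank {rank} -> device ordinal {dev_id}")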
