KeyError(f"None of [{key}] are in the [{axis_name}]" about hetu HOT 13 CLOSED

lmohit95 commented on August 17, 2024

KeyError(f"None of [{key}] are in the [{axis_name}]"

from hetu.

Comments (13)

Hankpipi commented on August 17, 2024

Hi @lmohit95, this has been solved in #52. Sorry for the delay.

lmohit95 commented on August 17, 2024

Thank you. load_data.py works perfectly now. The dataset is downloaded and processed.
But when I run the bash tests/local_dcn_criteo.sh command, I get the following error. I tried changing the batch size at line 62 of run_hetu.py and rerunning the command, but I still get the same error.

I am able to run other DLRM implementations, such as Facebook's open-source DLRM, on the GPU, so I believe the CUDA setup is correct.

Hankpipi commented on August 17, 2024

@lmohit95, the Hetu main branch has been updated to enable dynamic memory allocation. Please pull the new code and try again.

lmohit95 commented on August 17, 2024

> @lmohit95, Hetu main branch has been updated, please try to pull the new code and try again.

Thank you. It works now. Sorry for asking a lot of questions. I am now facing this issue while training on the Criteo dataset.

#50 (comment)

Hankpipi commented on August 17, 2024

>> @lmohit95, Hetu main branch has been updated, please try to pull the new code and try again.
>
> Thank you. It works now. Sorry for asking a lot of questions. I am now facing this issue while training on the Criteo dataset.
>
> #50 (comment)

I mean that this problem was solved by #47, which was merged recently. Does it still happen after you pull these changes?

lmohit95 commented on August 17, 2024

I get the following error when I run bash tests/local_dcn_criteo.sh. I created a hetu_config.yaml file in the tmp folder and copied in the contents provided in README.MD. I am trying to run HET on a single GPU.

To avoid this error, I deliberately set file = None in the __init__ function of distribute.py. With that change, I hit an out-of-memory error instead.

Hankpipi commented on August 17, 2024

@lmohit95, you can also replace line 120 with the following code to avoid the first error:

Hetu/examples/ctr/run_hetu.py, lines 120 to 126 in 1684091:

    if args.comm is None:
        executor = ht.Executor(eval_nodes, ctx=ht.gpu(0), cstable_policy=args.cache,
                               bsp=args.bsp, cache_bound=args.bound, seed=123, log_path=executor_log_path)
    else:
        strategy = ht.dist.DataParallel(aggregate=args.comm)
        executor = ht.Executor(eval_nodes, dist_strategy=strategy, cstable_policy=args.cache,
                               bsp=args.bsp, cache_bound=args.bound, seed=123, log_path=executor_log_path)
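The two branches above differ only in whether the executor receives ctx (single device) or dist_strategy (distributed). A minimal sketch of that selection logic follows, written without importing hetu so it runs anywhere. The names build_executor_kwargs and make_strategy are hypothetical, and the shared values are placeholders rather than the script's real defaults.

```python
from typing import Any, Callable, Dict, Optional


def build_executor_kwargs(
    comm: Optional[str],
    shared: Dict[str, Any],
    local_ctx: Any,
    make_strategy: Callable[[str], Any],
) -> Dict[str, Any]:
    """Return the keyword arguments for one of the two executor branches.

    Hypothetical helper: it only mirrors the if/else in the snippet above
    and does not call any hetu API itself.
    """
    kwargs = dict(shared)  # copy so the caller's dict is not mutated
    if comm is None:
        # Local mode: pin the graph to a single device.
        kwargs["ctx"] = local_ctx
    else:
        # Distributed mode: hand over a strategy instead of a device.
        kwargs["dist_strategy"] = make_strategy(comm)
    return kwargs


# Placeholder values, not the real defaults of run_hetu.py.
shared = {"cstable_policy": None, "bsp": False, "cache_bound": 100, "seed": 123}
# Stand-in for ht.dist.DataParallel(aggregate=comm).
fake_strategy = lambda comm: ("DataParallel", comm)

local = build_executor_kwargs(None, shared, "gpu:0", fake_strategy)
dist = build_executor_kwargs("PS", shared, "gpu:0", fake_strategy)  # "PS" is a placeholder
```

In the real script the resulting dictionary would be unpacked into ht.Executor(eval_nodes, **kwargs), with ht.gpu(0) as local_ctx.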

For the OOM error, #47 implements dynamic memory allocation, so peak GPU memory usage is halved when you run bash tests/local_dcn_criteo.sh.

Maybe you haven't pulled the latest code yet?

lmohit95 commented on August 17, 2024

Thanks a lot for everything. The tests are working perfectly now. I was accessing the forked repo mentioned in the HET paper.
While running the python run_hetu.py --model dcn_criteo --all --val command, I am facing the following error:

I pulled the latest code and downloaded the Criteo dataset by running load_data.py.

Hankpipi commented on August 17, 2024

@lmohit95, thanks for your feedback, and sorry for my mistake.
There were indeed still some errors in the dataset processing, and I have fixed them in #54.
Please pull my code and run python load_data.py again before running run_hetu.py --model dcn_criteo --all --val.

lmohit95 commented on August 17, 2024

Thank you. Everything works now. I just wanted to clarify something regarding run_hetu.py --model dcn_criteo --all --val. This command trains and tests HET on the Criteo dataset, right?

The paper mentions that the training process can take hours (Fig 6), but in my case the training runs for a total of 10 epochs with far less overall runtime.

Hsword commented on August 17, 2024

It seems like you are running in local execution mode rather than distributed training. That's why it's much faster.
Besides, note that test_auc is reported every 1/10 epoch, as described in Hetu/examples/ctr/run_hetu.py, line 190 in acae42a:

    help="num of epochs, each train 1/10 data")
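To make that help string concrete: if each reported "epoch" covers 1/10 of the data, then 10 reported epochs amount to a single full pass. A small sketch of the arithmetic follows. The function name full_passes is hypothetical, and the 1/10 fraction is taken from the help text quoted above.

```python
from fractions import Fraction


def full_passes(num_epochs: int, fraction_per_epoch: Fraction = Fraction(1, 10)) -> Fraction:
    """Return how many complete passes over the dataset num_epochs covers.

    Hypothetical helper: fraction_per_epoch defaults to the 1/10 stated in
    the help text of run_hetu.py.
    """
    if num_epochs < 0:
        raise ValueError("num_epochs must be non-negative")
    return num_epochs * fraction_per_epoch


print(full_passes(10))  # ten reported epochs
print(full_passes(5))   # five reported epochs
```

So a run of 10 reported epochs is expected to be far shorter than 10 full passes over the Criteo dataset.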

lmohit95 commented on August 17, 2024

Got it. Thanks for all the help!!!

lmohit95 commented on August 17, 2024

Hello,
Thanks for all the help so far.
I am running HET on the Criteo dataset on a single GPU node in HYBRID mode, with HETU_VERSION = 'gpu'. I ran `bash examples/ctr/tests/hybrid_wdl_criteo.sh`, but I am getting the following error:

This is my configuration file:

shared :
  DMLC_PS_ROOT_URI : 127.0.0.1
  DMLC_PS_ROOT_PORT : 13100
  DMLC_NUM_WORKER : 2
  DMLC_NUM_SERVER : 1
  DMLC_PS_VAN_TYPE : p3
launch :
  worker : 2
  server : 1
  scheduler : true
nodes:
  - host: lmohit95
    servers: 1
    workers: 2
    chief: true
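One quick sanity check for a file like this is to confirm that the worker and server counts agree across the shared, launch, and nodes sections. The sketch below assumes those counts are meant to match (an assumption, not something confirmed in this thread) and copies the configuration into a Python dict to avoid needing a YAML parser. The function name check_counts is hypothetical.

```python
# The configuration above, transcribed by hand into a plain dict.
config = {
    "shared": {
        "DMLC_PS_ROOT_URI": "127.0.0.1",
        "DMLC_PS_ROOT_PORT": 13100,
        "DMLC_NUM_WORKER": 2,
        "DMLC_NUM_SERVER": 1,
        "DMLC_PS_VAN_TYPE": "p3",
    },
    "launch": {"worker": 2, "server": 1, "scheduler": True},
    "nodes": [{"host": "lmohit95", "servers": 1, "workers": 2, "chief": True}],
}


def check_counts(cfg):
    """Return a list of mismatch descriptions (empty if the counts agree).

    Assumption: DMLC_NUM_WORKER / DMLC_NUM_SERVER, the launch section, and
    the per-node totals are all supposed to describe the same numbers.
    """
    problems = []
    roles = (
        ("worker", "DMLC_NUM_WORKER", "workers"),
        ("server", "DMLC_NUM_SERVER", "servers"),
    )
    for role, env_key, node_key in roles:
        declared = cfg["shared"][env_key]
        launched = cfg["launch"][role]
        per_node = sum(node.get(node_key, 0) for node in cfg["nodes"])
        if not (declared == launched == per_node):
            problems.append(
                f"{role}: shared={declared}, launch={launched}, nodes={per_node}"
            )
    return problems


print(check_counts(config))
```

The counts in this configuration are consistent with each other, so a check like this would not by itself explain the error.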
