
Comments (13)

Hankpipi commented on August 17, 2024

Hi @lmohit95, this has been fixed in #52. Sorry for the delay.

from hetu.

lmohit95 commented on August 17, 2024

Thank you. load_data.py works perfectly now. The dataset is downloaded and processed.
However, when I run the bash tests/local_dcn_criteo.sh command, I get the following error. I tried changing the batch size at line 62 of run_hetu.py and rerunning the command, but I still get the same error.

image

I am able to run other recommendation models, such as Facebook's open-source DLRM, on the GPU, so I believe the CUDA setup is correct.


Hankpipi commented on August 17, 2024

@lmohit95, the Hetu main branch has been updated to enable dynamic memory allocation. Please pull the new code and try again.


lmohit95 commented on August 17, 2024

> @lmohit95, Hetu main branch has been updated, please try to pull the new code and try again.

Thank you. It works now. Sorry for asking a lot of questions. I am now facing this issue while training on the Criteo dataset.

#50 (comment)


Hankpipi commented on August 17, 2024

> > @lmohit95, Hetu main branch has been updated, please try to pull the new code and try again.
>
> Thank you. It works now. Sorry for asking a lot of questions. I am now facing this issue while training on the Criteo dataset.
>
> #50 (comment)

I mean that this problem was solved by #47, which was merged recently. Does it still happen after you pull those changes?


lmohit95 commented on August 17, 2024

I get the following error when I run bash tests/local_dcn_criteo.sh. I created a hetu_config.yaml file in the tmp folder and copied in the contents provided in README.MD. I am trying to run HET on a single GPU.

image

To avoid this error, I deliberately set file = None in the __init__ function of distribute.py. With that change, I run into the out-of-memory error instead.


Hankpipi commented on August 17, 2024

@lmohit95, you can also replace line 120 with the following code to avoid the first error:

    if args.comm is None:
        executor = ht.Executor(eval_nodes, ctx=ht.gpu(0), cstable_policy=args.cache,
                               bsp=args.bsp, cache_bound=args.bound, seed=123, log_path=executor_log_path)
    else:
        strategy = ht.dist.DataParallel(aggregate=args.comm)
        executor = ht.Executor(eval_nodes, dist_strategy=strategy, cstable_policy=args.cache,
                               bsp=args.bsp, cache_bound=args.bound, seed=123, log_path=executor_log_path)

For the OOM error, #47 implements dynamic memory allocation, so peak GPU memory usage will be halved when you run bash tests/local_dcn_criteo.sh.

Maybe you haven't pulled the latest code yet?
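
One way to check whether the pulled code actually lowers peak GPU memory is to sample usage while the test script runs. The sketch below is not part of Hetu. It assumes nvidia-smi is on the PATH and parses the output of its CSV query flags; the helper names parse_memory_csv and sample_gpu_memory are hypothetical.

```python
import subprocess


def parse_memory_csv(text):
    """Parse lines of 'used, total' (in MiB) into a list of (used, total) ints."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def sample_gpu_memory():
    """Query nvidia-smi once; returns one (used_mib, total_mib) tuple per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_csv(out)
```

Calling sample_gpu_memory() periodically while bash tests/local_dcn_criteo.sh runs, and keeping the maximum used value, gives a rough peak to compare before and after pulling.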


lmohit95 commented on August 17, 2024

Thanks a lot for everything. The tests are working perfectly now. I was accessing the forked repo mentioned in the HET paper.
When I run the python run_hetu.py --model dcn_criteo --all --val command, I get the following error:

image

I pulled the latest code and downloaded the Criteo dataset by running load_data.py.


Hankpipi commented on August 17, 2024

@lmohit95, thanks for your feedback, and sorry for my mistake.
There were indeed still some errors in the dataset processing, and I have fixed them in #54.
Please pull my code and run python load_data.py again before running run_hetu.py --model dcn_criteo --all --val.


lmohit95 commented on August 17, 2024

Thank you. Everything works now. I just wanted to clarify something regarding run_hetu.py --model dcn_criteo --all --val. This command trains and tests HET on the Criteo dataset, right?

The paper mentions that the training process can take hours (Fig 6),

image

but in my case the training runs for a total of 10 epochs with far less overall runtime.

image


Hsword commented on August 17, 2024

It seems you are running in local execution mode rather than distributed training. That's why it's much faster.
Also note that test_auc is reported every 1/10 of an epoch, as described in the argument's help text:

help="num of epochs, each train 1/10 data")
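
To make the 1/10 arithmetic concrete, here is a small sketch. It is not taken from run_hetu.py. It assumes each reported "epoch" covers one tenth of the training samples and that a final partial batch is dropped; the function names are hypothetical, and the sample count and batch size in the usage note are made-up illustrative numbers, not Criteo's real size or Hetu's defaults.

```python
def iterations_per_reported_epoch(num_samples, batch_size, parts=10):
    """Mini-batches in one reported epoch that covers 1/parts of the data.

    Assumes the final partial batch is dropped (floor division).
    """
    samples_per_reported_epoch = num_samples // parts
    return samples_per_reported_epoch // batch_size


def full_passes(num_reported_epochs, parts=10):
    """How many full passes over the data the reported epochs add up to."""
    return num_reported_epochs / parts
```

With 1,000,000 samples and a batch size of 128, iterations_per_reported_epoch(1_000_000, 128) gives 781 iterations per reported epoch, and full_passes(10) gives 1.0. Under these assumptions, 10 reported epochs amount to a single pass over the training data.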


lmohit95 commented on August 17, 2024

Got it. Thanks for all the help!!!


lmohit95 commented on August 17, 2024

Hello,
Thanks for all the help until now.
I am running HET on the Criteo dataset on a single GPU node in HYBRID mode, with HETU_VERSION = 'gpu'. I ran `bash examples/ctr/tests/hybrid_wdl_criteo.sh`, but I am getting the following error:

image

This is my configuration file:

shared :
  DMLC_PS_ROOT_URI : 127.0.0.1
  DMLC_PS_ROOT_PORT : 13100
  DMLC_NUM_WORKER : 2
  DMLC_NUM_SERVER : 1
  DMLC_PS_VAN_TYPE : p3
launch :
  worker : 2
  server : 1
  scheduler : true
nodes:
  - host: lmohit95
    servers: 1
    workers: 2
    chief: true
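
A quick consistency check on a configuration like this can rule out simple count mismatches before launching. The sketch below is not part of Hetu, and the rule it checks (that the worker and server counts under shared, launch, and nodes agree) is an assumption, not documented Hetu behavior. check_counts is a hypothetical helper, and the configuration is written as a plain dict to avoid a YAML dependency.

```python
# The relevant fields of the configuration above, as a plain dict.
config = {
    "shared": {"DMLC_NUM_WORKER": 2, "DMLC_NUM_SERVER": 1},
    "launch": {"worker": 2, "server": 1, "scheduler": True},
    "nodes": [{"host": "lmohit95", "servers": 1, "workers": 2, "chief": True}],
}


def check_counts(cfg):
    """Return a list of mismatch descriptions; an empty list means counts agree."""
    problems = []
    for role, shared_key, launch_key, node_key in [
        ("worker", "DMLC_NUM_WORKER", "worker", "workers"),
        ("server", "DMLC_NUM_SERVER", "server", "servers"),
    ]:
        shared = cfg["shared"][shared_key]
        launch = cfg["launch"][launch_key]
        per_node = sum(node.get(node_key, 0) for node in cfg["nodes"])
        if not (shared == launch == per_node):
            problems.append(
                f"{role}: shared={shared}, launch={launch}, nodes={per_node}"
            )
    return problems
```

For the configuration shown, check_counts(config) returns an empty list, so under this assumed rule the counts are at least mutually consistent.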

