Comments (6)
To confirm: did you fail to compile LLVM with libtensorflow-gpu-linux-x86_64-1.15.0.tar.gz, then use tensorflow-c-api to compile LLVM and successfully train the Fuchsia demo?
During training, the pipeline alternates between data collection (compiling the given corpus with the current TF model) and training (updating the TF model). The majority of the time is spent on data collection rather than on training (data collection/compiling is much more expensive than training in our pipeline), so it's not surprising that GPU usage is low during training. Therefore, the best way to accelerate the training process is to train on a beefy machine with higher data-collection parallelism, which cuts down the data collection time.
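The loop described above (collect data by compiling the corpus with the current policy, then update the policy) can be sketched as follows. This is a toy illustration of why collection parallelism is the lever that matters; the function names (`compile_module`, `update_policy`, `train`) are hypothetical stand-ins, not the real ml-compiler-opt API:

```python
# Toy sketch of the collect/train loop described above. All names are
# illustrative stand-ins, not the actual ml-compiler-opt API.
import time
from concurrent.futures import ThreadPoolExecutor

def compile_module(args):
    """Expensive step: stand-in for compiling one module with the
    current policy and measuring its size reward."""
    module_id, policy_version = args
    time.sleep(0.01)  # compilation dominates wall-clock time
    return (module_id, policy_version)

def update_policy(policy_version, examples):
    """Cheap step: stand-in for a training update on the collected
    examples; the real policy network is small, so CPU is enough."""
    return policy_version + 1

def train(corpus, iterations=2, workers=8):
    policy_version = 0
    for _ in range(iterations):
        # Data collection is parallelized across workers; more CPU
        # cores directly shorten this dominant phase.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            examples = list(pool.map(compile_module,
                                     [(m, policy_version) for m in corpus]))
        policy_version = update_policy(policy_version, examples)
    return policy_version
```

With this structure, adding workers (i.e., CPU cores) shortens the collection phase roughly proportionally, while the update step stays cheap regardless of the hardware.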
from ml-compiler-opt.
To confirm: did you fail to compile LLVM with libtensorflow-gpu-linux-x86_64-1.15.0.tar.gz, then use tensorflow-c-api to compile LLVM and successfully train the Fuchsia demo?
Yes, that's right, and I am using a cloud server with 60 vCPUs and a Tesla T4 16 GB GPU. Does that mean I need a more powerful CPU for this training?
Yes, a more powerful CPU (or more cores) would be far more helpful; you definitely don't need a powerful GPU for the training (the CPU can probably do the training pretty fast, since the model is quite lightweight).
from ml-compiler-opt.
Thanks for trying it out and providing the feedback! We are looking into this.
I see your comment under another issue saying the Fuchsia demo worked for you, so I'm wondering whether you managed to figure out this issue?
from ml-compiler-opt.
Hi, I didn't solve this issue, but I successfully trained the Fuchsia demo following the instructions. I am now migrating the demo to my own project, where the Python script takes a long time (maybe 12 h or so), and I found that GPU usage stays below 20%. Is there any way to accelerate the training process?
from ml-compiler-opt.
Thanks a lot for saving my day~
from ml-compiler-opt.
Related Issues (20)
- 【Question】How to use GPU training, just install tensorflow-gpu? will there be better performance if using a larger model? HOT 1
- 【Question】Why use llvm-size to calculate rewards? llvm also calculates size rewards? HOT 2
- 【Question】Can you open the code of ES algorithm? HOT 2
- 【Question】What parameters need to be passed in to compile the data set? -Oz -Xclang -fembed-bitcode=all? HOT 2
- How to train a model using bin's llvmbc and llvmcmd segments? I want to optimize directly using the executable program HOT 6
- Why can’t I use llvmbc and llvmcmd of executable programs?
- questions about feature log HOT 1
- Is it not very accurate to use the size reward of the entire file as the reward for each caller-callee feature, if the file is large and has a large number of caller-callee? HOT 1
- What does the size of sequence_examples depend on, and how to set its size? HOT 5
- Does llvm-15.04 support mlgo? What versions of tensorflow and other libraries are needed? HOT 1
- Why is the length of the reward limited to 3 or more? HOT 3
- How to know the effect of model inlining when training the model? HOT 1
- how to get model.tflite file from inlining-Oz-99f0063-v1.1.tar.gz HOT 1
- Why “-static” affects the test results of the model HOT 2
- why need to calculate reward_stat? I see llvm_trainer.train use reward from sequence_example.reward HOT 1
- Can I merge all the bc files into a total bc file for training?
- How to compile other dataset using llvm's thinlto flag?
- Where do I find pretrained models for MLGOPerf? HOT 4
- `--compile_task` flag missing HOT 2
- [non-issue] MLGO Questions HOT 10