Running `aten._foreach_addcdiv_.ScalarList` with `FakeTensor`s uses high CPU memory and is slow (pytorch, OPEN)

awgu commented on June 4, 2024

Running `aten._foreach_addcdiv_.ScalarList` with `FakeTensor`s uses high CPU memory and is slow

from pytorch.

Comments (7)

eellison commented on June 4, 2024

We could add it to torch logs. I don't want to add to warning spam, especially for something that is not actionable to users.

But in either case we should get to full meta coverage of foreach kernels. As issue #105105 shows, a number of foreach ops still lack a meta registration.

shunting314 commented on June 4, 2024

cc @mlazos ?

awgu commented on June 4, 2024

The fix should be to add a meta registration (#123463). I have confirmed locally that it fixes the issue. I am not familiar enough with the test infra to adjust the expected-failure tests, as it seems some change is needed for the sample inputs.

Something to discuss, though, is whether we can warn when the fallback happens. It looks like the op was silently being run on CPU despite being under FakeTensorMode.

awgu commented on June 4, 2024

From offline discussion: @janeyx99 will figure out how to land the PR with the appropriate testing changes.

janeyx99 commented on June 4, 2024

While the PR #123486 "fixes" this issue for this specific use case, it'd be good to confirm why this happened in the first place and whether the behavior is expected.

eellison commented on June 4, 2024

@janeyx99 yes, it is expected that if you don't have a meta registration we fall back to the eager implementation. This was originally for bootstrapping, before we had broad meta kernel support. @zou3519 changed the default to not fall back for non-aten kernels, but could not switch the default for aten kernels because of long-tail ops.

For foreach kernels that we run on parameters, IMO it would make sense to prioritize meta coverage, since the fallback can be memory intensive.

cc @mlazos
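To make the fallback described above concrete, here is a toy model in plain Python. It is not PyTorch's dispatcher: `FakeTensor`, `EAGER_KERNELS`, `META_KERNELS`, `dispatch_fake`, and the op name are all hypothetical stand-ins, and shapes must match exactly (no broadcasting). It only illustrates why a missing meta kernel costs real memory: the fallback has to materialize data for every metadata-only input before it can run the eager kernel.

```python
from dataclasses import dataclass

OP = "foreach_addcdiv_"   # hypothetical op name for this sketch
EAGER_KERNELS = {}        # op name -> kernel operating on real data
META_KERNELS = {}         # op name -> kernel operating on metadata only
allocated_elements = 0    # elements materialized by fallbacks so far


@dataclass
class FakeTensor:
    """Toy stand-in: carries a shape but no storage."""
    shape: tuple

    def numel(self):
        n = 1
        for d in self.shape:
            n *= d
        return n


def materialize(fake):
    """Allocate real data for a fake tensor and count the allocation."""
    global allocated_elements
    allocated_elements += fake.numel()
    return [1.0] * fake.numel()


def eager_addcdiv_(selfs, t1s, t2s, scalars):
    """In-place self += scalar * t1 / t2 for each tensor in the lists."""
    for s, a, b, c in zip(selfs, t1s, t2s, scalars):
        for i in range(len(s)):
            s[i] += c * a[i] / b[i]


def meta_addcdiv_(selfs, t1s, t2s, scalars):
    """Validate list lengths and shapes; touches no data."""
    assert len(selfs) == len(t1s) == len(t2s) == len(scalars)
    for s, a, b in zip(selfs, t1s, t2s):
        assert s.shape == a.shape == b.shape


def dispatch_fake(op, selfs, t1s, t2s, scalars):
    """Use the meta kernel if one is registered, else fall back to eager."""
    meta = META_KERNELS.get(op)
    if meta is not None:
        meta(selfs, t1s, t2s, scalars)
        return "meta"
    # Fallback: every fake input is turned into real data first.
    real = [[materialize(t) for t in group] for group in (selfs, t1s, t2s)]
    EAGER_KERNELS[op](*real, scalars)
    return "fallback"


def run_demo():
    """Run the op once without and once with a meta kernel."""
    global allocated_elements
    allocated_elements = 0
    EAGER_KERNELS[OP] = eager_addcdiv_
    META_KERNELS.pop(OP, None)
    fakes = [FakeTensor((64, 64)) for _ in range(3)]
    scalars = [0.1, 0.2, 0.3]
    first = dispatch_fake(OP, fakes, fakes, fakes, scalars)
    after_fallback = allocated_elements
    META_KERNELS[OP] = meta_addcdiv_
    second = dispatch_fake(OP, fakes, fakes, fakes, scalars)
    return first, after_fallback, second, allocated_elements
```

In this toy, the first call takes the fallback path and allocates storage for all three tensor lists, while the second call, made after the meta kernel is registered, allocates nothing further. With parameter-sized tensors the same pattern would scale with model size.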

awgu commented on June 4, 2024

@eellison What is your opinion on adding a warning when falling back to the eager implementation? Would it be too noisy?

For this particular case, debugging the CPU OOM was non-trivial: it was first seen on MAST, and we had to iteratively narrow down the culprit from the training loop to DTensor to the FakeTensorMode part of DTensor sharding propagation. It would have saved a lot of time if we had known that the fallback was materializing CPU tensors.
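One way to reconcile the two concerns in this thread (visibility of the fallback versus warning spam) is an opt-in log message emitted once per op. The sketch below uses only Python's standard `logging` module; the logger name and the `note_fallback` helper are hypothetical and are not part of PyTorch's logging infrastructure.

```python
import logging

# Hypothetical logger name; silent unless the user enables INFO for it.
log = logging.getLogger("fake_fallback_sketch")
_seen_ops = set()


def note_fallback(op_name):
    """Log the first fallback for each op; return True if a message was logged."""
    if op_name in _seen_ops:
        return False
    _seen_ops.add(op_name)
    log.info(
        "no meta kernel for %s; falling back to the eager kernel, "
        "which materializes real tensors",
        op_name,
    )
    return True
```

Because the message is logged at INFO level and deduplicated per op, it stays out of the default output but is available to someone chasing an unexplained memory spike.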
