Comments (6)
Thanks for reporting this. I can confirm that this issue can be reproduced on our side.
We have duplicated the same setting in Horovod's benchmark and see the same performance degradation with XLA when using Horovod. This means the issue is not specific to KungFu.
I have explored different settings for XLA. Only the JIT speeds up the training throughput of the ResNet-50 model. The options I tried are:
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1  # only this makes a difference
config.graph_options.optimizer_options.opt_level = tf.OptimizerOptions.L1
config.graph_options.optimizer_options.do_common_subexpression_elimination = True  # no change
config.graph_options.optimizer_options.do_constant_folding = True  # no change
config.graph_options.optimizer_options.do_function_inlining = True  # no change
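For context, a minimal sketch of how these options are wired into a TF 1.x session config (the session body is a placeholder, not the benchmark code itself):

```python
import tensorflow as tf  # TF 1.x API; use tf.compat.v1 under TF 2.x

# Build a session config with XLA JIT enabled.
# Only global_jit_level made a measurable difference in our runs.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

with tf.Session(config=config) as sess:
    pass  # run the benchmark graph here
```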
We are studying why XLA JIT causes this performance degradation when using multiple GPUs with Horovod and KungFu.
The test environment uses TensorFlow 1.13.2 and Horovod 0.16.0.
from kungfu.
Just tried TensorFlow 1.15.2 and Horovod 0.16.1. Still the same issue.
The Horovod and Nvidia teams found the same issue. The op-clustering feature in XLA JIT unfortunately prevents communication and computation operators from being overlapped. The same problem applies to KungFu, which is why both Horovod and KungFu lose performance with XLA JIT.
Here is the quoted reply from the Nvidia team (horovod/horovod#1673):
Due to clustering of ops by XLA, enabling XLA can cause Horovod ops to no longer overlap (or overlap less efficiently) with computation, causing degradation in scaling performance. The issue is that Horovod will only be informed of tensors needing processing between XLA clusters. With that being said, depending on the scale you are running at, the increase in performance provided by XLA may outweigh the loss in scalability, resulting in higher raw throughput (as seen in @vilmara's results where enabling XLA reduces scaling efficiency, but achieves a much higher throughput).
You can try limiting the XLA cluster size by setting the following environment variable
TF_XLA_FLAGS="--tf_xla_max_cluster_size=N"
where you set N to a moderately sized value like 500, 1000, or maybe more. Limiting the max cluster size can enable Horovod to overlap communication more, but may reduce raw performance, so you'll have to experiment to see if the tradeoff is worth it for your application.
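In practice, the flag from the quoted reply is set in the environment before launching the benchmark; a sketch (the script name and flags are illustrative placeholders, not part of this thread):

```shell
# Cap XLA cluster sizes so allreduce ops from Horovod/KungFu can still
# overlap with computation; tune N (500, 1000, ...) per application.
export TF_XLA_FLAGS="--tf_xla_max_cluster_size=1000"
python tf_cnn_benchmarks.py --model=resnet50 --batch_size=32
```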
Similar issues are reported by other Horovod users:
Similar issues are also reported by TensorFlow users when using XLA-JIT in large clusters:
Result of running the Horovod benchmark on 2 DGX machines:
# with `config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1`
RESULT: 224.546734 +-14.929252 {"framework":"horovod","version":"0.16.1","np":16,"bs":32,"model":"ResNet50"}
# without XLA
RESULT: 240.788208 +-6.951544 {"framework":"horovod","version":"0.16.1","np":16,"bs":32,"model":"ResNet50"}
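As a quick sanity check, the relative slowdown implied by the two RESULT lines above can be computed directly (plain Python; assuming the reported numbers are mean throughput, higher is better):

```python
# Mean throughput from the two RESULT lines above (np=16, bs=32, ResNet50).
with_xla = 224.546734     # global_jit_level = ON_1
without_xla = 240.788208  # XLA disabled

# Relative change from enabling XLA JIT; negative means a slowdown.
pct_change = (with_xla - without_xla) / without_xla * 100
print(f"XLA JIT changes 2-DGX throughput by {pct_change:.1f}%")  # about -6.7%
```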
Horovod 1 DGX result:
# with `config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1`
RESULT: 330.247851 +-8.340494 {"framework":"horovod","version":"0.16.1","np":8,"bs":32,"model":"ResNet50"}
# without XLA
RESULT: 292.453241 +-2.209386 {"framework":"horovod","version":"0.16.1","np":8,"bs":32,"model":"ResNet50"}
KungFu 1 DGX result:
without XLA JIT:
RESULT: 298.445066 +-3.038366 {"framework":"kungfu","np":8,"strategy":"BINARY_TREE_STAR","bs":32,"model":"ResNet50","xla":false,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}
with XLA JIT:
RESULT: 311.254795 +-3.357482 {"framework":"kungfu","np":8,"strategy":"BINARY_TREE_STAR","bs":32,"model":"ResNet50","xla":true,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}
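All the RESULT lines in this thread share one format (mean, +-stddev, then a JSON metadata tag), so they can be compared mechanically; a sketch, assuming that format holds:

```python
import json
import re

# Format: "RESULT: <mean> +-<stddev> <json metadata>"
RESULT_RE = re.compile(r"RESULT:\s*([\d.]+)\s*\+-([\d.]+)\s*(\{.*\})")

def parse_result(line):
    """Return (mean, stddev, metadata dict) for one RESULT line."""
    m = RESULT_RE.match(line.strip())
    mean, stddev, meta = m.groups()
    return float(mean), float(stddev), json.loads(meta)

line = ('RESULT: 311.254795 +-3.357482 {"framework":"kungfu","np":8,'
        '"strategy":"BINARY_TREE_STAR","bs":32,"model":"ResNet50",'
        '"xla":true,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}')
mean, stddev, meta = parse_result(line)
print(mean, meta["xla"])  # 311.254795 True
```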