Comments (6)

luomai commented on June 22, 2024

Thanks for reporting this. I can confirm that the issue can be reproduced on our side.

We duplicated the same settings in Horovod's benchmark and see the same performance degradation with XLA when using Horovod. This means the issue is not specific to KungFu.

I have explored different settings for XLA. Only the JIT option can speed up the training throughput of the ResNet-50 model. The options I tried are:

import tensorflow as tf  # TensorFlow 1.x API

config = tf.ConfigProto()  # session config passed to the training session
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1 # Only this makes a difference
config.graph_options.optimizer_options.opt_level = tf.OptimizerOptions.L1
config.graph_options.optimizer_options.do_common_subexpression_elimination = True # No change
config.graph_options.optimizer_options.do_constant_folding = True # No change
config.graph_options.optimizer_options.do_function_inlining = True # No change

We are investigating why XLA JIT causes the performance degradation when using multiple GPUs with Horovod and KungFu.

The test environment uses TensorFlow 1.13.2 and Horovod 0.16.0.

from kungfu.

luomai commented on June 22, 2024

Just tried TensorFlow 1.15.2 and Horovod 0.16.1. Still the same issue.

luomai commented on June 22, 2024

The Horovod and Nvidia teams found the same issue. The "cluster of ops" feature in XLA JIT unfortunately prevents communication and computation operators from overlapping. The same problem applies to KungFu, which is why both Horovod and KungFu lose performance with XLA JIT.

Here is the quoted reply from the Nvidia team (horovod/horovod#1673):

Due to clustering of ops by XLA, enabling XLA can cause Horovod ops to no longer overlap (or overlap less efficiently) with computation, causing degradation in scaling performance. The issue is that Horovod will only be informed of tensors needing processing between XLA clusters. With that being said, depending on the scale you are running at, the increase in performance provided by XLA may outweigh the loss in scalability, resulting in higher raw throughput (as seen in @vilmara's results where enabling XLA reduces scaling efficiency, but achieves a much higher throughput).

You can try limiting the XLA cluster size by setting the following environment variable TF_XLA_FLAGS="--tf_xla_max_cluster_size=N" where you set N to a moderately sized value like 500, 1000, or maybe more. Limiting the max cluster size can enable Horovod to overlap communication more, but may reduce raw performance so you'll have to experiment to see if the tradeoff is worth it for your application.
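As a minimal sketch of the suggestion above, the variable can also be set from Python before TensorFlow is imported. The helper name `with_max_cluster_size` is hypothetical and the value 500 is just one of the starting points mentioned in the reply; only the `TF_XLA_FLAGS` variable and the `--tf_xla_max_cluster_size` flag come from the quoted text.

```python
import os


def with_max_cluster_size(existing_flags, n):
    """Return a TF_XLA_FLAGS value with --tf_xla_max_cluster_size set to n.

    Hypothetical helper: keeps any other flags already present and
    replaces an earlier cluster-size setting instead of duplicating it.
    """
    if n <= 0:
        raise ValueError("cluster size must be positive")
    prefix = "--tf_xla_max_cluster_size="
    kept = [flag for flag in existing_flags.split() if not flag.startswith(prefix)]
    kept.append(prefix + str(n))
    return " ".join(kept)


# Set the variable before TensorFlow is imported, so that it is in place
# by the time XLA reads its flags.
os.environ["TF_XLA_FLAGS"] = with_max_cluster_size(
    os.environ.get("TF_XLA_FLAGS", ""), 500
)

# import tensorflow as tf  # import only after the variable is set
```

Whether a smaller cluster size recovers the lost overlap is workload dependent, so it is worth re-running the benchmark with a few values of N.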

Similar issues are reported by other Horovod users:

Similar issues are also reported by TensorFlow users when using XLA-JIT in large clusters:

lgarithm commented on June 22, 2024

Result of running horovod benchmark on 2 DGX:

# with `config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1`
RESULT: 224.546734 +-14.929252 {"framework":"horovod","version":"0.16.1","np":16,"bs":32,"model":"ResNet50"}

# without XLA
RESULT: 240.788208 +-6.951544 {"framework":"horovod","version":"0.16.1","np":16,"bs":32,"model":"ResNet50"}
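To put a number on the gap between the two lines above, here is a small sketch that parses the RESULT lines and computes the relative change. The line layout (`RESULT: <value> +-<spread> <json>`) is inferred from the output shown here, and the helper name `parse_result` is hypothetical; the unit of the reported value is not stated in the thread, so only the ratio is computed.

```python
import json
import re

# Layout inferred from the benchmark output: "RESULT: <value> +-<spread> <json>"
RESULT_RE = re.compile(r"^RESULT:\s+(\S+)\s+\+-\s*(\S+)\s+(\{.*\})$")


def parse_result(line):
    """Split one RESULT line into (value, spread, metadata dict)."""
    match = RESULT_RE.match(line.strip())
    if match is None:
        raise ValueError("not a RESULT line: " + line)
    return float(match.group(1)), float(match.group(2)), json.loads(match.group(3))


with_xla = parse_result(
    'RESULT: 224.546734 +-14.929252 '
    '{"framework":"horovod","version":"0.16.1","np":16,"bs":32,"model":"ResNet50"}'
)
without_xla = parse_result(
    'RESULT: 240.788208 +-6.951544 '
    '{"framework":"horovod","version":"0.16.1","np":16,"bs":32,"model":"ResNet50"}'
)

# Relative change of the reported value when XLA JIT is enabled.
change = (with_xla[0] - without_xla[0]) / without_xla[0]
print("np=%d: XLA JIT changes the reported value by %+.1f%%"
      % (with_xla[2]["np"], change * 100))
```

For these two lines the change comes out at roughly -6.7%.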

lgarithm commented on June 22, 2024

Horovod 1 DGX result:

# with `config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1`
RESULT: 330.247851 +-8.340494 {"framework":"horovod","version":"0.16.1","np":8,"bs":32,"model":"ResNet50"}

# without XLA
RESULT: 292.453241 +-2.209386 {"framework":"horovod","version":"0.16.1","np":8,"bs":32,"model":"ResNet50"}
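A rough way to judge whether the gaps in the Horovod numbers exceed run-to-run noise is to check whether the `value ± spread` ranges overlap. This is only a sketch: it assumes the `+-` figure is a symmetric spread around the reported value, which the thread does not state, and the helper names are hypothetical. The numbers are copied from the 1 DGX and 2 DGX Horovod results in this thread.

```python
def interval(value, spread):
    """Range covered by value +- spread (assumed symmetric)."""
    return (value - spread, value + spread)


def overlaps(a, b):
    """True if the two closed ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# (value, spread) pairs: first with XLA JIT, then without XLA.
runs = {
    "1 DGX (np=8)": ((330.247851, 8.340494), (292.453241, 2.209386)),
    "2 DGX (np=16)": ((224.546734, 14.929252), (240.788208, 6.951544)),
}

for name, (xla, no_xla) in runs.items():
    if overlaps(interval(*xla), interval(*no_xla)):
        verdict = "ranges overlap"
    else:
        verdict = "ranges do not overlap"
    print("%s: %s" % (name, verdict))
```

Under that assumption the 1 DGX ranges are clearly separated, while the 2 DGX ranges overlap, so the 2 DGX slowdown may need more runs to confirm.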

luomai commented on June 22, 2024

KungFu 1 DGX result:

with XLA JIT:

RESULT: 298.445066 +-3.038366 {"framework":"kungfu","np":8,"strategy":"BINARY_TREE_STAR","bs":32,"model":"ResNet50","xla":false,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}

without XLA JIT:

RESULT: 311.254795 +-3.357482 {"framework":"kungfu","np":8,"strategy":"BINARY_TREE_STAR","bs":32,"model":"ResNet50","xla":true,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}
