Comments (6)
Thanks for reporting this. I can confirm that this issue can be reproduced on our side.
We have duplicated the same setting in Horovod's benchmark and see the same performance degradation with XLA when using Horovod. This means the issue is not specific to KungFu.
I have explored different settings for XLA. Only the JIT speeds up the training throughput of the ResNet-50 model. The options I tried are:
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1  # only this makes a difference
config.graph_options.optimizer_options.opt_level = tf.OptimizerOptions.L1
config.graph_options.optimizer_options.do_common_subexpression_elimination = True  # no change
config.graph_options.optimizer_options.do_constant_folding = True  # no change
config.graph_options.optimizer_options.do_function_inlining = True  # no change
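For context, a minimal sketch of how these options are wired into a TF 1.x session config (the session body is a placeholder, not the benchmark code itself):

```python
import tensorflow as tf  # TF 1.x API; use tf.compat.v1 under TF 2.x

# Build a session config with XLA JIT enabled.
# Only global_jit_level made a measurable difference in our runs.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

with tf.Session(config=config) as sess:
    pass  # run the benchmark graph here
```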
We are studying why XLA JIT causes this performance degradation when using multiple GPUs with Horovod and KungFu.
The test environment uses TensorFlow 1.13.2 and Horovod 0.16.0.
from kungfu.
Just tried TensorFlow 1.15.2 and Horovod 0.16.1. Still the same issue.
The Horovod and Nvidia teams found the same issue. The op-clustering feature in XLA JIT unfortunately prevents communication and computation operators from being overlapped. The same problem applies to KungFu, which is why both Horovod and KungFu lose performance with XLA JIT.
Here is the quoted reply from the Nvidia team (horovod/horovod#1673):
Due to clustering of ops by XLA, enabling XLA can cause Horovod ops to no longer overlap (or overlap less efficiently) with computation, causing degradation in scaling performance. The issue is that Horovod will only be informed of tensors needing processing between XLA clusters. With that being said, depending on the scale you are running at, the increase in performance provided by XLA may outweigh the loss in scalability, resulting in higher raw throughput (as seen in @vilmara's results where enabling XLA reduces scaling efficiency, but achieves a much higher throughput).
You can try limiting the XLA cluster size by setting the following environment variable
TF_XLA_FLAGS="--tf_xla_max_cluster_size=N"
where you set N to a moderately sized value like 500, 1000, or maybe more. Limiting the max cluster size can enable Horovod to overlap communication more, but may reduce raw performance, so you'll have to experiment to see if the tradeoff is worth it for your application.
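In practice, the flag from the quoted reply is set in the environment before launching the benchmark; a sketch (the script name and flags are illustrative placeholders, not part of this thread):

```shell
# Cap XLA cluster sizes so allreduce ops from Horovod/KungFu can still
# overlap with computation; tune N (500, 1000, ...) per application.
export TF_XLA_FLAGS="--tf_xla_max_cluster_size=1000"
python tf_cnn_benchmarks.py --model=resnet50 --batch_size=32
```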
Similar issues are reported by other Horovod users:
Similar issues are also reported by TensorFlow users when using XLA-JIT in large clusters:
Result of running the Horovod benchmark on 2 DGX machines:
# with `config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1`
RESULT: 224.546734 +-14.929252 {"framework":"horovod","version":"0.16.1","np":16,"bs":32,"model":"ResNet50"}
# without XLA
RESULT: 240.788208 +-6.951544 {"framework":"horovod","version":"0.16.1","np":16,"bs":32,"model":"ResNet50"}
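As a quick sanity check, the relative slowdown implied by the two RESULT lines above can be computed directly (plain Python; assuming the reported numbers are mean throughput, higher is better):

```python
# Mean throughput from the two RESULT lines above (np=16, bs=32, ResNet50).
with_xla = 224.546734     # global_jit_level = ON_1
without_xla = 240.788208  # XLA disabled

# Relative change from enabling XLA JIT; negative means a slowdown.
pct_change = (with_xla - without_xla) / without_xla * 100
print(f"XLA JIT changes 2-DGX throughput by {pct_change:.1f}%")  # about -6.7%
```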
Horovod 1 DGX result:
# with `config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1`
RESULT: 330.247851 +-8.340494 {"framework":"horovod","version":"0.16.1","np":8,"bs":32,"model":"ResNet50"}
# without XLA
RESULT: 292.453241 +-2.209386 {"framework":"horovod","version":"0.16.1","np":8,"bs":32,"model":"ResNet50"}
KungFu 1 DGX result:
without XLA JIT:
RESULT: 298.445066 +-3.038366 {"framework":"kungfu","np":8,"strategy":"BINARY_TREE_STAR","bs":32,"model":"ResNet50","xla":false,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}
with XLA JIT:
RESULT: 311.254795 +-3.357482 {"framework":"kungfu","np":8,"strategy":"BINARY_TREE_STAR","bs":32,"model":"ResNet50","xla":true,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}
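All the RESULT lines in this thread share one format (mean, +-stddev, then a JSON metadata tag), so they can be compared mechanically; a sketch, assuming that format holds:

```python
import json
import re

# Format: "RESULT: <mean> +-<stddev> <json metadata>"
RESULT_RE = re.compile(r"RESULT:\s*([\d.]+)\s*\+-([\d.]+)\s*(\{.*\})")

def parse_result(line):
    """Return (mean, stddev, metadata dict) for one RESULT line."""
    m = RESULT_RE.match(line.strip())
    mean, stddev, meta = m.groups()
    return float(mean), float(stddev), json.loads(meta)

line = ('RESULT: 311.254795 +-3.357482 {"framework":"kungfu","np":8,'
        '"strategy":"BINARY_TREE_STAR","bs":32,"model":"ResNet50",'
        '"xla":true,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}')
mean, stddev, meta = parse_result(line)
print(mean, meta["xla"])  # 311.254795 True
```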