Hi @duncanriach, thanks for your reply. It's strange that you couldn't reproduce the non-determinism after removing the cross-entropy op; that never happened to me! It's fine to reduce the model size if it doesn't fit into memory, but note that you sometimes need to run a couple of times to see the non-determinism.
Anyway, thanks for your effort; looking forward to hearing from you.
from framework-reproducibility.
Hi @atebbifakhr,
My understanding is that XLA JIT compilation is not currently enabled by default in TensorFlow. I assume that you're not enabling XLA and therefore that, if there is in fact a source of non-determinism, it's not an XLA-originated op.
Can you tell me more about your model and settings? There remain various sources of non-determinism in TensorFlow which are not addressed by the patch.
Hi @duncanriach,
I'm using tensorflow-gpu==2.0.0, and my model is a Transformer for seq2seq. I noticed that the source of non-determinism is in `tf.nn.softmax_cross_entropy_with_logits`.
I decided to call `tf.nn.softmax_cross_entropy_with_logits` on the CPU to make my code deterministic. That works for the first computed gradients, but the following gradients are still non-deterministic. My guess is that `optimizer.apply_gradients()` is also non-deterministic.
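The CPU-placement workaround described above can be sketched roughly like this (the wrapper name, shapes, and tensor values are illustrative, not taken from the actual model):

```python
import tensorflow as tf

def cross_entropy_on_cpu(labels, logits):
    # Place the op on the CPU, whose kernel is deterministic.
    with tf.device("/CPU:0"):
        return tf.nn.softmax_cross_entropy_with_logits(labels=labels,
                                                       logits=logits)

# Illustrative inputs: batch of 2, "vocabulary" of 3.
labels = tf.one_hot([1, 0], depth=3)
logits = tf.constant([[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]])
loss = cross_entropy_on_cpu(labels, logits)  # shape (2,), one loss per example
```

In a real model, the same `tf.device` scope would wrap the loss computation inside the training step.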
Until now, I was unaware of non-determinism issues with `tf.nn.softmax_cross_entropy_with_logits`, but I have started digging into it, and will add it to a list of things to look at and potentially fix.
I have personally never seen `optimizer.apply_gradients()` operate non-deterministically on a GPU, and many folks are now doing deterministic deep learning with TensorFlow, which makes it even less likely to be an issue.
You've also said that the computed gradients are non-deterministic. If non-determinism is appearing in the computed gradients, then it is, by definition, being injected before the gradients are applied. Another op in your model may be injecting non-determinism in back-prop.
I recommend making sure that the examples being fed into the model are deterministic and that your trainable variables are initialized deterministically. Once you have confirmed that, it's possible to debug and locate the source of non-determinism in the model. Unfortunately, I have not had time to release the debugging tool yet, which makes it harder for others to debug.
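A minimal seeding sketch along those lines; `seed_everything` is a hypothetical helper name, and the TensorFlow-specific calls are shown as comments so the snippet stands alone:

```python
import os
import random

import numpy as np

def seed_everything(seed: int = 42) -> None:
    # Pin every RNG a training run typically touches.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    # With TensorFlow imported, also:
    # tf.random.set_seed(seed)
    # os.environ["TF_DETERMINISTIC_OPS"] = "1"  # GPU-op determinism, TF >= 2.1

seed_everything(123)
first_draw = np.random.rand(4)
seed_everything(123)
second_draw = np.random.rand(4)  # identical to first_draw
```

Deterministic variable initialization follows from seeding, provided the graph is constructed in the same order on every run.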
If you can provide me with a simple-as-possible, self-contained example that clearly demonstrates non-determinism, then I may be able to debug it relatively quickly and identify the source, or sources, of non-determinism in it. Self-contained means that all the files needed are provided, including training data or code that generates synthetic data. Simple-as-possible means that it's as simple as possible while still demonstrating the issue.
Also, I'm assuming that the seq2seq model you're using is Google's Seq2seq. Please confirm.
I prepared this notebook so that you can replicate the problem.
Actually, I'm using the OpenNMT-tf toolkit. However, the problem is not related to the toolkit. If you change `tf.nn.sparse_softmax_cross_entropy_with_logits` to something else, the code becomes deterministic.
Thanks for providing that code, @atebbifakhr! Nice and simple and self-contained. I love it. I have been able to reproduce the non-determinism, but not the determinism when the cross-entropy op is removed. It seems that the two pkl files generated in that case still differ. Perhaps I'm doing something wrong though.
Please will you run again and confirm that you're definitely seeing the pkl files matching when you remove the cross-entropy op?
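For the record, the pkl comparison can be done along these lines (the function name and file layout are assumptions, not taken from the notebook):

```python
import pickle

import numpy as np

def runs_match(path_a, path_b):
    # Load two pickled lists of weight arrays and compare bit-for-bit.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        weights_a, weights_b = pickle.load(fa), pickle.load(fb)
    return len(weights_a) == len(weights_b) and all(
        np.array_equal(a, b) for a, b in zip(weights_a, weights_b)
    )
```

Bit-for-bit equality (rather than `np.allclose`) is the right bar here, since run-to-run determinism means identical bits, not merely close values.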
In any case, this example is great because it gives me something specific to run and debug.
Hey, I'm running this locally so that I can instrument and debug it. My machine contains a 12GB TITAN V. I'm getting this error:
```
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[12544,32001] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
```
Are you familiar with this error and how to resolve it?
In the model, I reduced `num_units` from 512 down to 32 and `ffn_inner_dim` from 2048 down to 128 for both the encoder and the decoder. This resolved the problem. The machine backing my Colab notebook is an NVIDIA Tesla T4 with 16GB of GPU memory. I wonder if the model, as configured, fit into 16GB but not into 12GB.
Anyway, I am able to locally reproduce the non-determinism and also the determinism (without the cross-entropy op). I'm not sure why I could not reproduce the determinism on colab; possible operator error since the process is very manual.
Well done for isolating this source of non-determinism! Thank you.
I also want to acknowledge that all of the work that has gone into TensorFlow determinism so far made it so that it was possible to isolate a single op as a source of non-determinism without using the non-determinism debugging tool. This is because removing that one op reveals the underlying determinism that we now have.
I intend to instrument this model and confirm the non-determinism and also that the cross-entropy op is the only source. Then we can look at potential fixes or work-arounds.
Hi @duncanriach,
Any update on this issue? Could you confirm the non-determinism?
Hey @atebbifakhr, Sorry, I have not gotten to this yet. Will do as soon as I can and get back to you.
Hi @atebbifakhr, I looked into this more deeply. Removing `tf.nn.sparse_softmax_cross_entropy_with_logits` from the loss function only makes the gradients reproducible for the first step. They still go non-deterministic on the second step. The trainable variables actually go non-deterministic on the first step (somehow) regardless of whether `tf.nn.sparse_softmax_cross_entropy_with_logits` is in the loss function.
The fact that the gradients are deterministic for the first step but the trainable variables are not suggests that non-determinism is being introduced in the gradient update step. I hope to continue investigating soon.
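A lightweight way to localize this kind of divergence, short of a full debugging tool, is to checksum the trainable variables (or gradients) after every step in two otherwise-identical runs and find the first step at which the digests differ. A sketch, with NumPy arrays standing in for tensor values:

```python
import hashlib

import numpy as np

def tensor_digest(arrays):
    # Hash the raw bytes of every array; any bit-level divergence
    # between two runs changes the digest.
    h = hashlib.sha256()
    for a in arrays:
        h.update(np.ascontiguousarray(a).tobytes())
    return h.hexdigest()

# In a real run, arrays would be [v.numpy() for v in model.trainable_variables],
# logged once per training step and compared across runs.
```

The first step whose digests disagree bounds where the non-determinism is injected.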
Hi @atebbifakhr,
After further investigation, there seem to be two or three sources of non-determinism in this system.
- Confirmed that back-prop of `tf.nn.sparse_softmax_cross_entropy_with_logits` does inject non-determinism. Opened TensorFlow issue 38185.
- Discovered that `tf.keras.optimizers.Optimizer::apply_gradients` seems to inject non-determinism into the trainable state of the source and target inputters (instances of `WordEmbedder`) at the end of the first training step. This is mitigated by making the batch size smaller, but I don't know why. In the configuration that I am running, setting the batch size to 1 appears to make the state of the inputters deterministic at the end of the first step.
- Discovered that the source and target inputters also inject non-determinism in the forward path by making the samples applied to the model non-reproducible on the second step and onwards (when the state of the inputters is deterministic from the previous step).
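One plausible (though unconfirmed here) reason batch size matters: changing the batch shape changes the order in which floating-point values are reduced, and float32 addition is not associative, so differently-ordered reductions over the same values can yield different bits. A TensorFlow-free illustration:

```python
import numpy as np

# The same 1024 float32 values, summed in two different orders.
vals = np.float32([1e8, 1.0, -1e8, 1.0] * 256)

left_to_right = np.float32(0.0)
for v in vals:  # strict sequential accumulation
    left_to_right = np.float32(left_to_right + v)

# Pair up neighbors first, then sum the pair-sums.
pairwise = vals.reshape(-1, 2).sum(axis=1).sum(dtype=np.float32)

# The two results differ, even though the inputs are identical.
```

On a GPU, the reduction order inside a kernel can additionally vary from run to run when atomics are involved, which is the classic mechanism behind op-level non-determinism.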
There is more work to do on this issue, but I wanted to give you an interim update.
I've also added your name to the credits section of this repo in recognition of your effort in enabling me to reproduce and isolate the problems you've been seeing.
I updated my previous comment to include additional information that came from further investigation.