Hi, DJL's TensorFlow engine is depending on tensorflow-core-api' SNAPSHOT package. Our

I think may related to CI build failure: <a href="https://github.com/tensorflow/java/a

Looks like Linux builds are back up: <a class="issue-link js-issue-link" data-erro

It looks like we have a solution for Windows as well: <a class="issue-link js-issu

Failed to fetch latest snapshot for tensorflow-core-api linux gpu mkl about java HOT 29 CLOSED

tensorflow commented on May 21, 2024

Failed to fetch latest snapshot for tensorflow-core-api linux gpu mkl

from java.

Comments (29)

karllessard commented on May 21, 2024 2

The mac-os-mkl.jar is there, but the libjnimkldnn.dylib and libiomp5.dylib extra libraries are missing, is this intended

Yes it is intended, we don't distribute them in our artifacts, you need to have them installed on your machine or add dependencies to other JavaCPP artifacts to upload them via Maven/Gradle, like this one: https://github.com/bytedeco/javacpp-presets/tree/master/mkl-dnn. (@saudet correct me if I'm wrong)

from java.

saudet commented on May 21, 2024 1

The idea behind having MKL-DNN in a different package is because it's over 100 MB in size (all platforms) and these binaries are used by other libraries, with which they can be shared, for example, MXNet: https://github.com/bytedeco/javacpp-presets/blob/master/mxnet/platform/pom.xml#L33-L37

I talk about the need of a distribution that would allow us to share components like that in this post:
http://bytedeco.org/news/2019/01/11/importance-of-a-distribution/
This is going to be crucial for Java applications of machine learning to succeed in the enterprise.

from java.

karllessard commented on May 21, 2024 1

Still, that is kind of refreshing to see all these green checks on that page, thanks @saudet !

from java.

Craigacp commented on May 21, 2024 1

I patched the CMake build locally so it added things to the include & library paths if running on macOS, after running brew install libomp. I think their CI was already configured in some way I didn't quite understand.

from java.

karllessard commented on May 21, 2024 1

An update on this: I've added the --define build_with_mkl_dnn_v1_only=false flag to the bazel command line and now MacOS + MKL passes, looks like MKL-DNN v1 only works out with Linux.

I was planning to leave it that way if everything goes fine and then we can check how to enable MKL-DNN 1.x on all platforms or only on Linux.

from java.

karllessard commented on May 21, 2024 1

Ok, it's unfortunate but it looks like we cannot increase the timeout of a job on GitHub action on their hosted-runners beyond 6 hours, as dictated here. Setting the timeout-minutes parameter to a higher value didn't helped.

So this build (based on TF2.2) is the best we had got so far on this CI solution. So I'll go ahead with merging PR #44 and we will probably need to continue the discussion to find an alternative or a complement to it.

from java.

frankfliu commented on May 21, 2024

I think may related to CI build failure: https://github.com/tensorflow/java/actions
the snapshot is partially uploaded.

from java.

saudet commented on May 21, 2024

Yes, as per these threads mentioned at the last meeting, the builds on Linux are no longer reliable:

That is in addition to this issue preventing builds for Windows from succeeding:

https://github.community/t5/GitHub-Actions/Error-quot-The-paging-file-is-too-small-for-this-operation-to/m-p/40632

If you could intervene on behalf of Amazon via your GitHub support channel, please do.

from java.

saudet commented on May 21, 2024

Looks like Linux builds are back up:
actions/runner-images#709 (comment)
We're now getting 13 GB of free space after CUDA, etc:
https://github.com/tensorflow/java/runs/669404530

from java.

saudet commented on May 21, 2024

It looks like we have a solution for Windows as well:
actions/runner-images#785 (comment)
Let me try that right away...

from java.

saudet commented on May 21, 2024

The CPU builds for Windows will also now work, but not for CUDA, yet (see pull #54).

from java.

saudet commented on May 21, 2024

Ok, builds for Linux and Mac are back up:
https://github.com/tensorflow/java/runs/672994641
Windows is still a WIP though...

from java.

karllessard commented on May 21, 2024

@saudet , I've rebase the TF2.2.0 branch with your last changes it did not went that well: https://github.com/tensorflow/java/runs/674534254?check_suite_focus=true

From what I understand, the Mac build got interrupted because the Mac + MKL build failed first, seems like something have changed in TF2.2.0 that I need to look at. Same thing for the Windows build, who got interrupted because Windows + MKL build also failed first (that one was expected though).

So I'm wondering if we can prevent the builds from being interrupted if other builds for the same platform are failing, at least for the "vanilla" ones so we could deliver them since they are more stable than their MKL/GPU flavours?

from java.

karllessard commented on May 21, 2024

...at the same time, only Windows MKL/GPU artifacts are expected to fail now, so it's fine for other platforms to stop if something goes wrong...

Then maybe we just need to remove back the if ERRORLEVEL 1 exit /b condition check if there is no way to break the dependencies between the different builds for Windows (or we can simply disable the MKL/GPU build for Windows until we fix them)

from java.

saudet commented on May 21, 2024

It looks like that's just the way GitHub Actions works. I haven't found a way to disable that "feature".

from java.

saudet commented on May 21, 2024

The Windows builds do not fail, they just timeout. Removing an "if failed" clause that never gets executed won't change anything.

from java.

saudet commented on May 21, 2024

To work around this, make your PRs smaller. Making 1000 different changes in the same PR makes it difficult to debug anyway.

from java.

karllessard commented on May 21, 2024

To work around this, make your PRs smaller. Making 1000 different changes
in the same PR makes it difficult to debug anyway.

I’m not sure to understand that last part, what the length of a PR has to do with the success of a build? Btw, the PR I’m referring to looks larger than it is because we persist the generated files, there are actually just a few changes.

Going back to this issue, I suggest then that we disable MKL/GPU builds for Windows until we know we can build them successfully. Then we can really rely on the CI build check status, which we unfortunately got used to ignore. I can make that change in my actual PR.

from java.

karllessard commented on May 21, 2024

Or did we ever tried to increase that timeout value? https://help.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes

If not, I’ll give it a try

from java.

saudet commented on May 21, 2024

I’m not sure to understand that last part, what the length of a PR has to do with the success of a build? Btw, the PR I’m referring to looks larger than it is because we persist the generated files, there are actually just a few changes.

I mean like with pull #56. Do one "small" change (upgrade TensorFlow and nothing else), see if that works, merge it if it works, then go on with your other changes in another PR.

Or did we ever tried to increase that timeout value? https://help.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes

If not, I’ll give it a try

I didn't notice that, no. Must be new. Let's see if it allows us to run builds for longer than 6 hours. GitHub Actions seems to have taken it without error in pull #56...

from java.

karllessard commented on May 21, 2024

And it looks like we are not the only one having trouble to build TF2.2.0+MKL on Mac, these guys reported the same error that we hit in our CI build:

ERROR: /private/var/tmp/_bazel_runner/17ed05ace81230cf3c69807725c012c8/external/mkl_dnn_v1/BUILD.bazel:52:1: C++ compilation of rule '@mkl_dnn_v1//:mkl_dnn' failed (Exit 1)
In file included from external/mkl_dnn_v1/src/cpu/rnn/ref_postgemm_lstm.cpp:21:
external/mkl_dnn_v1/src/common/dnnl_thread.hpp:42:10: fatal error: 'omp.h' file not found
#include <omp.h>
         ^~~~~~~

Looks like we need to install more stuff on the Mac machines?

from java.

karllessard commented on May 21, 2024

So the missing file in MacOSX seems to be fixed now, I brew install libomp before building on the Mac machine. Still, the MKL build fail for Mac and Windows.

But I'm a bit concerned about this comment I found in TensorFlow configuration file:

# Please note that MKL on MacOS or windows is still not supported.
# If you would like to use a local MKL instead of downloading, please set the
# environment variable "TF_MKL_ROOT" every time before build.

So it sounds like we cannot assume that the MKL version coming with TF will work on those platforms for all TF releases and we need to install our own? Here is another related thread.

/cc @saudet

from java.

Craigacp commented on May 21, 2024

The OpenMP support on MacOS works fine if you have libomp installed via brew, and you add it's location to the header and linker paths (usually /usr/local/include and /usr/local/lib). I had to jump through these hoops a couple of times for onnx-runtime.

from java.

karllessard commented on May 21, 2024

~~Thanks @Craigacp , any tip on how to install it easily from the command line?~~

Oh sorry, I misread you, so you said that libomp should have done it, I'll check the include paths or maybe try to set this TF_MKL_ROOT env variable

from java.

karllessard commented on May 21, 2024

But what concerns me here is that the error I'm having is not with missing omp.h header anymore (that, just brew install it did the trick). The error is not that the code does not compile on Mac and Windows because of what seems an incompatibility on those platforms between the MKL library released with TF and the some kernels making use of it:

ERROR: /private/var/tmp/_bazel_runner/17ed05ace81230cf3c69807725c012c8/external/org_tensorflow/tensorflow/core/kernels/BUILD:7897:1: C++ compilation of rule '@org_tensorflow//tensorflow/core/kernels:mkl_conv_op' failed (Exit 1)
In file included from external/org_tensorflow/tensorflow/core/kernels/mkl_conv_ops.cc:19:
external/org_tensorflow/tensorflow/core/kernels/mkl_conv_ops.h:157:19: error: no viable overloaded '='
      *input_dims = mkldnn_sizes;
      ~~~~~~~~~~~ ^ ~~~~~~~~~~~~

Both platforms have the same error. So just to confirm @Craigacp , you are telling me that playing with include paths may fix this error as well or you were referring to the previous problem with omp.h?

from java.

Craigacp commented on May 21, 2024

Ah yeah, I meant header and linker issues, I didn't see that error.

from java.

karllessard commented on May 21, 2024

Yeah, sorry, my previous message was not very explicit (it's still the morning and I did not went through my first coffee yet...). So if you have any tip for that second problem, please let me know!

from java.

karllessard commented on May 21, 2024

Hi @roywei , we have changed a little bit the way we redeploy all the artifacts after building to normalize the timestamp associated to the last snapshots.

Please let us know if you notice any issue with fetching the artifacts again from Gradle (note also that MKL and GPU builds for Windows have been temporarily removed). If everything is fine, let us know as well so we can close safely this issue. Thanks!

from java.

karllessard commented on May 21, 2024

Ok, I think we can close this one now.

from java.

Failed to fetch latest snapshot for tensorflow-core-api linux gpu mkl about java HOT 29 CLOSED

Comments (29)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent