Comments (29)
The mac-os-mkl.jar is there, but the libjnimkldnn.dylib and libiomp5.dylib extra libraries are missing, is this intended
Yes it is intended, we don't distribute them in our artifacts, you need to have them installed on your machine or add dependencies to other JavaCPP artifacts to upload them via Maven/Gradle, like this one: https://github.com/bytedeco/javacpp-presets/tree/master/mkl-dnn. (@saudet correct me if I'm wrong)
from java.
The idea behind having MKL-DNN in a different package is because it's over 100 MB in size (all platforms) and these binaries are used by other libraries, with which they can be shared, for example, MXNet: https://github.com/bytedeco/javacpp-presets/blob/master/mxnet/platform/pom.xml#L33-L37
The same applies to most popular native libraries used by other native libraries such as OpenBLAS, OpenCV, FFmpeg, Arrow, HDF5, LLVM, MKL, CUDA/cuDNN/NCCL/TensorRT, etc:
https://github.com/bytedeco/javacpp-presets/tree/master/openblas
https://github.com/bytedeco/javacpp-presets/tree/master/opencv
https://github.com/bytedeco/javacpp-presets/tree/master/ffmpeg
https://github.com/bytedeco/javacpp-presets/tree/master/arrow
https://github.com/bytedeco/javacpp-presets/tree/master/hdf5
https://github.com/bytedeco/javacpp-presets/tree/master/llvm
https://github.com/bytedeco/javacpp-presets/tree/master/mkl
https://github.com/bytedeco/javacpp-presets/tree/master/cuda
https://github.com/bytedeco/javacpp-presets/tree/master/tensorrt
I talk about the need of a distribution that would allow us to share components like that in this post:
http://bytedeco.org/news/2019/01/11/importance-of-a-distribution/
This is going to be crucial for Java applications of machine learning to succeed in the enterprise.
from java.
Still, that is kind of refreshing to see all these green checks on that page, thanks @saudet !
from java.
I patched the CMake build locally so it added things to the include & library paths if running on macOS, after running brew install libomp
. I think their CI was already configured in some way I didn't quite understand.
from java.
An update on this: I've added the --define build_with_mkl_dnn_v1_only=false
flag to the bazel command line and now MacOS + MKL passes, looks like MKL-DNN v1 only works out with Linux.
I was planning to leave it that way if everything goes fine and then we can check how to enable MKL-DNN 1.x on all platforms or only on Linux.
from java.
Ok, it's unfortunate but it looks like we cannot increase the timeout of a job on GitHub action on their hosted-runners beyond 6 hours, as dictated here. Setting the timeout-minutes
parameter to a higher value didn't helped.
So this build (based on TF2.2) is the best we had got so far on this CI solution. So I'll go ahead with merging PR #44 and we will probably need to continue the discussion to find an alternative or a complement to it.
from java.
I think may related to CI build failure: https://github.com/tensorflow/java/actions
the snapshot is partially uploaded.
from java.
Yes, as per these threads mentioned at the last meeting, the builds on Linux are no longer reliable:
- https://github.community/t5/GitHub-Actions/Diminishing-disk-space-on-Ubuntu-images/m-p/54376
- actions/runner-images#709
That is in addition to this issue preventing builds for Windows from succeeding:
If you could intervene on behalf of Amazon via your GitHub support channel, please do.
from java.
Looks like Linux builds are back up:
actions/runner-images#709 (comment)
We're now getting 13 GB of free space after CUDA, etc:
https://github.com/tensorflow/java/runs/669404530
from java.
It looks like we have a solution for Windows as well:
actions/runner-images#785 (comment)
Let me try that right away...
from java.
The CPU builds for Windows will also now work, but not for CUDA, yet (see pull #54).
from java.
Ok, builds for Linux and Mac are back up:
https://github.com/tensorflow/java/runs/672994641
Windows is still a WIP though...
from java.
@saudet , I've rebase the TF2.2.0 branch with your last changes it did not went that well: https://github.com/tensorflow/java/runs/674534254?check_suite_focus=true
From what I understand, the Mac build got interrupted because the Mac + MKL build failed first, seems like something have changed in TF2.2.0 that I need to look at. Same thing for the Windows build, who got interrupted because Windows + MKL build also failed first (that one was expected though).
So I'm wondering if we can prevent the builds from being interrupted if other builds for the same platform are failing, at least for the "vanilla" ones so we could deliver them since they are more stable than their MKL/GPU flavours?
from java.
...at the same time, only Windows MKL/GPU artifacts are expected to fail now, so it's fine for other platforms to stop if something goes wrong...
Then maybe we just need to remove back the if ERRORLEVEL 1 exit /b
condition check if there is no way to break the dependencies between the different builds for Windows (or we can simply disable the MKL/GPU build for Windows until we fix them)
from java.
from java.
from java.
from java.
To work around this, make your PRs smaller. Making 1000 different changes
in the same PR makes it difficult to debug anyway.
I’m not sure to understand that last part, what the length of a PR has to do with the success of a build? Btw, the PR I’m referring to looks larger than it is because we persist the generated files, there are actually just a few changes.
Going back to this issue, I suggest then that we disable MKL/GPU builds for Windows until we know we can build them successfully. Then we can really rely on the CI build check status, which we unfortunately got used to ignore. I can make that change in my actual PR.
from java.
Or did we ever tried to increase that timeout value? https://help.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes
If not, I’ll give it a try
from java.
I’m not sure to understand that last part, what the length of a PR has to do with the success of a build? Btw, the PR I’m referring to looks larger than it is because we persist the generated files, there are actually just a few changes.
I mean like with pull #56. Do one "small" change (upgrade TensorFlow and nothing else), see if that works, merge it if it works, then go on with your other changes in another PR.
Or did we ever tried to increase that timeout value? https://help.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes
If not, I’ll give it a try
I didn't notice that, no. Must be new. Let's see if it allows us to run builds for longer than 6 hours. GitHub Actions seems to have taken it without error in pull #56...
from java.
And it looks like we are not the only one having trouble to build TF2.2.0+MKL on Mac, these guys reported the same error that we hit in our CI build:
ERROR: /private/var/tmp/_bazel_runner/17ed05ace81230cf3c69807725c012c8/external/mkl_dnn_v1/BUILD.bazel:52:1: C++ compilation of rule '@mkl_dnn_v1//:mkl_dnn' failed (Exit 1)
In file included from external/mkl_dnn_v1/src/cpu/rnn/ref_postgemm_lstm.cpp:21:
external/mkl_dnn_v1/src/common/dnnl_thread.hpp:42:10: fatal error: 'omp.h' file not found
#include <omp.h>
^~~~~~~
Looks like we need to install more stuff on the Mac machines?
from java.
So the missing file in MacOSX seems to be fixed now, I brew install libomp
before building on the Mac machine. Still, the MKL build fail for Mac and Windows.
But I'm a bit concerned about this comment I found in TensorFlow configuration file:
# Please note that MKL on MacOS or windows is still not supported.
# If you would like to use a local MKL instead of downloading, please set the
# environment variable "TF_MKL_ROOT" every time before build.
So it sounds like we cannot assume that the MKL version coming with TF will work on those platforms for all TF releases and we need to install our own? Here is another related thread.
/cc @saudet
from java.
The OpenMP support on MacOS works fine if you have libomp installed via brew, and you add it's location to the header and linker paths (usually /usr/local/include
and /usr/local/lib
). I had to jump through these hoops a couple of times for onnx-runtime.
from java.
Thanks @Craigacp , any tip on how to install it easily from the command line?
Oh sorry, I misread you, so you said that libomp
should have done it, I'll check the include paths or maybe try to set this TF_MKL_ROOT
env variable
from java.
But what concerns me here is that the error I'm having is not with missing omp.h
header anymore (that, just brew install
it did the trick). The error is not that the code does not compile on Mac and Windows because of what seems an incompatibility on those platforms between the MKL library released with TF and the some kernels making use of it:
ERROR: /private/var/tmp/_bazel_runner/17ed05ace81230cf3c69807725c012c8/external/org_tensorflow/tensorflow/core/kernels/BUILD:7897:1: C++ compilation of rule '@org_tensorflow//tensorflow/core/kernels:mkl_conv_op' failed (Exit 1)
In file included from external/org_tensorflow/tensorflow/core/kernels/mkl_conv_ops.cc:19:
external/org_tensorflow/tensorflow/core/kernels/mkl_conv_ops.h:157:19: error: no viable overloaded '='
*input_dims = mkldnn_sizes;
~~~~~~~~~~~ ^ ~~~~~~~~~~~~
Both platforms have the same error. So just to confirm @Craigacp , you are telling me that playing with include paths may fix this error as well or you were referring to the previous problem with omp.h
?
from java.
Ah yeah, I meant header and linker issues, I didn't see that error.
from java.
Yeah, sorry, my previous message was not very explicit (it's still the morning and I did not went through my first coffee yet...). So if you have any tip for that second problem, please let me know!
from java.
Hi @roywei , we have changed a little bit the way we redeploy all the artifacts after building to normalize the timestamp associated to the last snapshots.
Please let us know if you notice any issue with fetching the artifacts again from Gradle (note also that MKL and GPU builds for Windows have been temporarily removed). If everything is fine, let us know as well so we can close safely this issue. Thanks!
from java.
Ok, I think we can close this one now.
from java.
Related Issues (20)
- Use of tensorflow_decision_forests HOT 16
- Failure when a tensorflow variable has no gradients
- Custom gradient registration is broken in Windows
- How to deal with the cold start problem? HOT 6
- [THIS ISSUE WAS AN EQUIVOCATION] Tensorflow 0.4.0 docs say it should be compatible with JAva 8 but it is not HOT 1
- Issue with loading model from: tensorflow_decision_forests HOT 5
- protobuf-java 3.19.4 contains 3 high vulnerabilities HOT 2
- Complex Tensor Implementation Missing? HOT 5
- ivy dependency not working on windows or linux, native TF code not found on classpath HOT 5
- Could not load dynamic library 'xxxxx'; dlerror: xxxxx.dll not found HOT 5
- org.tensorflow.TensorFlowException: Can't parse /<modelPath>/<somePathToFolder>/saved_model.pb as binary proto - JDK 17 HOT 15
- Compiling from source, cuDNN version is not compatible? How can I change the cuDNN compile version? HOT 2
- SavedModelBundle Unable to Load Models with coo_sparse Encoded Input HOT 1
- Tensor type issue HOT 3
- Unable to build the project using 'mvn install ' command HOT 4
- how to use importGraphDef to load model.pb file? HOT 2
- Modular Java app can't create tensor object HOT 3
- Read/Write method of DataBuffer is against intuition HOT 1
- Distributing an Apple Silicon binary HOT 2
- Error when using tensorflow-text on tensorflow-core HOT 8
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from java.