Git Product home page Git Product logo

Comments (12)

samuelotter avatar samuelotter commented on May 17, 2024 2

It works now. Thanks for looking into this and fixing it!

from java.

samuelotter avatar samuelotter commented on May 17, 2024 1

It seems the tensorflow python libraries are built with GCC4 and is using the old ABI as well, and tensorflow-text works with that so I don't think the ABI is the problem here.

Through debugging I found the error is caused by different symbols being referenced at run time for the hash_code check at https://github.com/tensorflow/tensorflow/blob/2b96f3662bd776e277f86997659e61046b56c315/tensorflow/core/framework/resource_mgr.h#L721

In my example the hash_code in the resource handle is the address of the hash_bit symbol in libtensorflow.so.2 but the newly created TypeIndex hash_code resolves to the address of hash_bit symbol in libtensorflow_framework.so.2.

All libraries are loaded up front so naively I would have thought it would consistently resolve to the same hash_bit symbol but evidently that is not the case here.

from java.

saudet avatar saudet commented on May 17, 2024

It's probably because we're building against CentOS, which uses the old C++ ABI of GCC:
https://bugzilla.redhat.com/show_bug.cgi?id=1546704
libstdc++ is backward compatible, so we can run TensorFlow compiled that way on Ubuntu no problem, but anything that interacts with the C++ API also needs to be compiled that way.

from java.

samuelotter avatar samuelotter commented on May 17, 2024

It seems to come down to how bindings are set up when loading the shared objects.

Using LD_DEBUG=bindings when running the model via java it shows the op library binds to libtensorflow_framework instead of libtensorflow:

binding file /usr/local/lib/python3.6/dist-packages/tensorflow_text/python/ops/_wordpiece_tokenizer.so [0] to /root/.javacpp/cache/tensorflow-core-api-0.1.0-SNAPSHOT-linux-x86_64.jar/org/tensorflow/internal/c_api/linux-x86_64/libtensorflow_framework.so.2 [0]: normal symbol `_ZZN10tensorflow9TypeIndex4MakeINS_6lookup15LookupInterfaceEEES0_PKcE8hash_bit'

When running the model via python it doesn't bind to libtensorflow_framework but to _pywrap_tensorflow_internal.so (which basically serves the role of libtensorflow for python I guess):

binding file /usr/local/lib/python3.6/dist-packages/tensorflow_text/python/ops/_wordpiece_tokenizer.so [0] to /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so [0]: normal symbol `_ZZN10tensorflow9TypeIndex4MakeINS_6lookup15LookupInterfaceEEES0_PKcE8hash_bit'

One difference seems to be that the tensorflow libraries are loaded with RTLD_GLOBAL in python, which doesn't seem to be the case currently with the java libs. I tested this by appending '!' to the library names in the tensorflow javacpp preset properties and rebuilding but that didn't help (I also tried to load the libraries directly with Loader.loadGlobal just to be sure). So RTLD_GLOBAL is probably not the cause for this difference.

from java.

samuelotter avatar samuelotter commented on May 17, 2024

It seems like the difference is that libtensorflow only exports TF_* and TFE_* symbols globally while the python library exports many more symbols (including the hash_bit symbols). When modifying the build of libtensorflow to export the same symbols as the python shared library it works.

This seems to be more of a general issue with libtensorflow since I guess anyone using that directly might run into similar problems with custom ops and exporting all symbols doesn't feel like the best solution.

But until that is fixed upstream it is trivial to apply this change as a patch when building the java libraries:

diff --git a/tensorflow/c/version_script.lds b/tensorflow/c/version_script.lds
index c352a1440d..7716dcb231 100644
--- a/tensorflow/c/version_script.lds
+++ b/tensorflow/c/version_script.lds
@@ -1,8 +1,16 @@
 VERS_1.0 {
   # Export symbols in c_api.h.
   global:
+  *tensorflow*;
+    *toco*;
+    *perftools*gputools*;
+    *tf_*;
     *TF_*;
+    *Eager*;
     *TFE_*;
+    *nsync_*;
+    *stream_executor*;
+    *xla*;

   # Hide everything else.
   local:

I can put together a PR for it as well of course if you want

from java.

Craigacp avatar Craigacp commented on May 17, 2024

We should find out why those symbols aren't exported. I suspect it's because the python API is using things that aren't considered part of the C API. It is a little odd to me that other native libraries depend on those symbols. Maybe the modular tensorflow people know more?

from java.

karllessard avatar karllessard commented on May 17, 2024

While on the Java side, we restrict our development using the C API, we cannot prevent maintainers of these custom op libraries from trespassing it. So if we want to support them, it looks like we need to export these symbols. What are the side effects of this @samuelotter , does it increases significantly the TF binary size?

from java.

samuelotter avatar samuelotter commented on May 17, 2024

It shouldn't increase the size of the binaries in any material way and while I haven't built and tested all targets the ones I have built show no significant difference.

I'm not an expert on custom ops but it seems they are usually implemented in C++ directly towards the C++ API right now, which is not ABI stable. I'm not even sure the C API exposes everything you need to implement many custom ops right now. While there seems to be ongoing work to provide an ABI stable way of doing this by wrapping the C APIs that doesn't seem to be fully complete yet. I think chances are high that many custom ops would not work out of the box with libtensorflow right now which is unfortunate.

Working with this has made me think about how using custom ops with the java bindings could be improved long term. Getting custom ops to work with the java bindings won't be a great experience even if this problem is solved since most custom ops are primarily distributed as python packages you either have to also have python libraries installed and try to find the location of the custom ops library paths on your system, or do what I've done and extract the shared objects from the python wheels and repackage them in jar. Neither option is very elegant.

I guess the ideal scenario would be for popular custom op libraries to also be wrapped and distributed as java libraries. I'm not sure how to get there though, but a start may be to provide some sort of template project which could be used as a starting point for wrapping a custom op library to at least reduce the barrier somewhat.

Tensorflow Serving comes with a bunch of commonly used/popular custom ops supported out of the box, in a similar vein it might be worth considering maintaining packages for a few of the more popular ones (such as tensorflow-text and tensorflow-io which are also tensorflow projects but not in the core repo) under the umbrella of this project?

from java.

karllessard avatar karllessard commented on May 17, 2024

I guess the ideal scenario would be for popular custom op libraries to also be wrapped and distributed as java libraries.

I share your point of view here. Ideally, it should be distributed directly by the providers of these ops and we should make it easy for them to do it. Basically, we could just the providers define an API def of their op like those ones and let them use the same generator as we use to create concrete Java wrappers that could then be packaged with Maven.

There has been great progress to support custom ops from the C API with the addition of the Kernel endpoints. But probably these libraries were never updated to make use of them, could also be a contribution we do for them? This way, we won't have these symbols export issues. But I don't know how complex is this task.

from java.

saudet avatar saudet commented on May 17, 2024

@samuelotter BTW, we would also need to modify exported_symbols.lds for Mac and something else for this to work on Windows.

from java.

saudet avatar saudet commented on May 17, 2024

It looks like the tensorflow_cc target is still there and supported, for example, see tensorflow/tensorflow#41904.
Maybe we could try to use that instead of tensorflow?

from java.

karllessard avatar karllessard commented on May 17, 2024

@samuelotter , now that we have applied these changes, can you please validate that you can now successfully load tensorflow-text when using the latest snapshots of TF Java?

from java.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.