
turbotransformers's Issues

GPU benchmark: which onnxruntime versions are supported?

Hi! Following the comment in run_gpu_benchmark.sh, I installed onnxruntime with "pip install onnxruntime-gpu" and added "onnxruntime" to the script. At runtime I hit the same problem as in that other issue: libcublas.so.10 cannot be loaded. The answer there was to install the old onnxruntime 1.1.0. Which onnxruntime version were the benchmark results on the project front page compared against? And how can I get the latest onnxruntime running as the benchmark control group?

Also, after installing the old 1.1.0 version, running the GPU benchmark fails as well:

Warning: ATen was a removed experimental ops. In the future, we may directly reject this operator. Please update your model as soon as possible.
Warning: ATen was a removed experimental ops. In the future, we may directly reject this operator. Please update your model as soon as possible.
Traceback (most recent call last):
  File "gpu_benchmark.py", line 114, in <module>
    main()
  File "gpu_benchmark.py", line 108, in main
    benchmark_helper.onnxruntime_benchmark_creator('GPU')(**kwargs)
  File "/workspace/benchmark/benchmark_helper.py", line 114, in impl
    graph_optimization_level=onnxruntime.GraphOptimizationLevel.
  File "/opt/miniconda3/lib/python3.7/site-packages/onnxruntime/backend/backend.py", line 80, in prepare
    return cls.prepare(bin, device, **kwargs)
  File "/opt/miniconda3/lib/python3.7/site-packages/onnxruntime/backend/backend.py", line 69, in prepare
    inf = InferenceSession(model, options)
  File "/opt/miniconda3/lib/python3.7/site-packages/onnxruntime/capi/session.py", line 25, in __init__
    self._load_model(providers)
  File "/opt/miniconda3/lib/python3.7/site-packages/onnxruntime/capi/session.py", line 43, in _load_model
    self._sess.load_model(providers)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Fatal error: ATen is not a registered function/op

wccms / LAB_seq-malGAN: a question (quite urgent)

Hi, sorry for raising an unrelated question in this project, but I had no other way to contact you. I have been studying the code you committed to LAB_seq-malGAN. I am fairly familiar with PyTorch, but that version seems unfinished, so I dug into the TensorFlow version instead. I don't fully understand it, and I don't know why it runs so slowly: your candidate API dictionary has a hundred-odd entries, while mine has tens of thousands, and with it the code runs extremely slowly. Could you point me to the original paper so I can study it? And if the PyTorch version gets updated, I would really appreciate a commit. Many thanks.

What does variable-length support mean, and how does it compare with onnxruntime?

Could you explain a bit more about the variable-length support? Does it mean the runtime can accept inputs with different sequence lengths in a single session, e.g. [batch, 8], [batch, 32], and so on? Or can it actually support different sequence lengths within one input, e.g., for an input with batch size 2 like the one below, running sequence length 128 for the first row and 3 for the second for better performance? (A padding sketch follows the example.)
[
[1,2,3,4, ...,128],
[5,6,7],
]
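
For illustration, here is a minimal padding sketch in plain PyTorch showing the second interpretation: a ragged batch padded to its longest row, with a mask marking real tokens. This is only an assumption about how such inputs are usually prepared; turbo_transformers itself is not involved and the variable names are made up.

import torch

rows = [list(range(1, 129)),  # length 128
        [5, 6, 7]]            # length 3
max_len = max(len(r) for r in rows)  # 128
input_ids = torch.zeros(len(rows), max_len, dtype=torch.long)
attention_mask = torch.zeros(len(rows), max_len, dtype=torch.long)
for i, r in enumerate(rows):
    input_ids[i, :len(r)] = torch.tensor(r)
    attention_mask[i, :len(r)] = 1  # 1 = real token, 0 = padding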

Performance at large batch sizes

I tested with the Python API: batch_size=1 takes a few milliseconds, batch_size=120 takes nearly 100 ms, and batch_size=256 takes a little over 100 ms. Is this expected?

How do I build the bert_model_example binary?

The steps I ran are:

  1. cd example/cpp
  2. mkdir build
  3. cd build
  4. cmake ..
  5. make

Step 4 produces:
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
CMake Warning (dev) at CMakeLists.txt:22 (add_executable):
Policy CMP0003 should be set before this line. Add code such as

if(COMMAND cmake_policy)
  cmake_policy(SET CMP0003 NEW)
endif(COMMAND cmake_policy)

as early as possible but after the most recent call to
cmake_minimum_required or cmake_policy(VERSION). This warning appears
because target "bert_model_example" links to some libraries for which the
linker must search:

tt_npz_loader, tt_layers, tt_kernels

and other libraries with known full path:

/workspace/TurboTransformers/example/cpp/build/libbert_model.a

CMake is adding directories in the second list to the linker search path in
case they are needed to find libraries from the first list (for backwards
compatibility with CMake 2.4). Set policy CMP0003 to OLD or NEW to enable
or disable this behavior explicitly. Run "cmake --help-policy CMP0003" for
more information.
This warning is for project developers. Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /workspace/TurboTransformers/example/cpp/build

Step 5 produces:
Scanning dependencies of target bert_model
[ 25%] Building CXX object CMakeFiles/bert_model.dir/bert_model.cpp.o
In file included from /workspace/TurboTransformers/example/cpp/bert_model.cpp:14:0:
/workspace/TurboTransformers/example/cpp/bert_model.h:20:10: fatal error: dlpack/dlpack.h: No such file or directory
#include "dlpack/dlpack.h"
^~~~~~~~~~~~~~~~~
compilation terminated.
CMakeFiles/bert_model.dir/build.make:62: recipe for target 'CMakeFiles/bert_model.dir/bert_model.cpp.o' failed
make[2]: *** [CMakeFiles/bert_model.dir/bert_model.cpp.o] Error 1
CMakeFiles/Makefile2:72: recipe for target 'CMakeFiles/bert_model.dir/all' failed
make[1]: *** [CMakeFiles/bert_model.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

Compilation error

When I run bash tools/compile.sh $PWD -DWITH_GPU=ON /tmp/build, compilation fails with:
avx512fintrin.h(1761): error: identifier "__builtin_ia32_sqrtsd_round" is undefined. What could be the cause?

Intel MKL FATAL ERROR: Cannot load libmkl_avx512.so or libmkl_def.so.

(py36) [benchmark]# echo $LD_LIBRARY_PATH
/opt/intel/mkl/lib/intel64_lin

(py36) [benchmark]# ls $LD_LIBRARY_PATH
libmkl_avx2.so        libmkl_gnu_thread.so    libmkl_mc.so              libmkl_vml_avx512.so
libmkl_avx512_mic.so  libmkl_intel_ilp64.a    libmkl_rt.so              libmkl_vml_avx.so
libmkl_avx512.so      libmkl_intel_ilp64.so   libmkl_sequential.a       libmkl_vml_cmpt.so
libmkl_avx.so         libmkl_intel_lp64.a     libmkl_sequential.so      libmkl_vml_def.so
libmkl_core.a         libmkl_intel_lp64.so    libmkl_tbb_thread.a       libmkl_vml_mc2.so
libmkl_core.so        libmkl_intel_thread.a   libmkl_tbb_thread.so      libmkl_vml_mc3.so
libmkl_def.so         libmkl_intel_thread.so  libmkl_vml_avx2.so        libmkl_vml_mc.so
libmkl_gnu_thread.a   libmkl_mc3.so           libmkl_vml_avx512_mic.so  locale

What should I do to solve this problem? Thanks.
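
A quick diagnostic sketch (my own assumption, not an official fix): load the MKL libraries by hand with ctypes, which goes through the same dlopen path MKL uses; if these fail as well, the problem lies in the loader environment rather than in turbo_transformers. libmkl_rt.so is loaded first with RTLD_GLOBAL so the ISA-specific kernel libraries can resolve symbols from it.

import ctypes

# the single-dynamic-library interface; export its symbols globally
ctypes.CDLL("libmkl_rt.so", mode=ctypes.RTLD_GLOBAL)
for name in ("libmkl_avx512.so", "libmkl_def.so"):
    try:
        ctypes.CDLL(name)
        print(name, "loaded OK")
    except OSError as err:
        print(name, "failed to load:", err)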

Cannot find turbo_transformers.BertModel

This problem occurs both when running run_gpu_benchmark in the Docker environment and when running gpu_example from example/python: AttributeError: module 'turbo_transformers' has no attribute 'BertModel'. Could you advise? Thanks.

When will ALBERT be released?

Dear developers, I am working on optimizing ALBERT serving performance.
I see that ALBERT support is under development. When will it be released?

Question about the gpu_example results

After installing the GPU image, running gpu_example.py inside Docker produced the following problem:
[screenshot of the error]
I have worked around it by changing that part of the code as shown here:
[screenshot of the modified code]
First, I would like to ask whether this change is correct.
Second, the result I get is:
[screenshot of the output]
I am not sure what the 0.42 means and would appreciate an explanation.
Many thanks.

Benchmark fails to run

We are trying to reproduce the published performance numbers. We built the Docker image following the docs and the unit tests pass, but the model fails to load when the benchmark runs. What could be the reason? The error is:
model = transformers.BertModel.from_pretrained(model_id)
File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_utils.py", line 629, in from_pretrained
"Unable to load weights from pytorch checkpoint file. "
The BERT files we downloaded are https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json, together with bert-base-uncased-vocab.txt and bert-base-uncased-pytorch_model.bin.
Are there any special requirements on the model files, or is some preprocessing needed?
Thanks.
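
For what it's worth, a hedged sketch of what usually makes this work with the transformers library: from_pretrained accepts a local directory, but it expects the files inside to be named config.json, vocab.txt and pytorch_model.bin, without the "bert-base-uncased-" prefix they carry when downloaded one by one. The directory path below is hypothetical.

import os
import shutil
import transformers

model_dir = "./bert-base-uncased"  # hypothetical local directory
os.makedirs(model_dir, exist_ok=True)
# strip the "bert-base-uncased-" prefix so from_pretrained finds the files
for src, dst in [("bert-base-uncased-config.json", "config.json"),
                 ("bert-base-uncased-vocab.txt", "vocab.txt"),
                 ("bert-base-uncased-pytorch_model.bin", "pytorch_model.bin")]:
    shutil.copy(src, os.path.join(model_dir, dst))

model = transformers.BertModel.from_pretrained(model_dir)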

cmake crash with "Cannot find librt from"

cmake -G Ninja -DCMAKE_BUILD_TYPE=Release .. -DWITH_GPU=OFF

-- /include
-- Blas provider is mkl
-- pybind11 v2.4.dev4
-- OpenMP USED FLAGS
CMake Error at CMakeLists.txt:79 (message):
  Cannot find librt from

Thank you.

docker run fails for the GPU container

Sorry to bother you with another question. After running "sh tools/build_docker_gpu.sh $PWD", the docker run command below, which instantiates the freshly built image, fails. (Nothing in build_docker_gpu.sh or related files was changed after git clone.)
sudo docker run --gpus all --net=host --rm -it -v $PWD:/workspace -v /etc/passwd:/etc/passwd --name=gt_gpu_env ccr.ccs.tencentyun.com/mmspr/turbo_transformers:0.2.1-cuda10.0-cudnn7-devel-ubuntu18.04-gpu-dev

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\n\""": unknown.

The problem first appeared around noon today. Could it be caused by the recent upgrade from 0.2.0 to 0.2.1?

Release conda packages

Using the docker image is not easy for all developers. It would be better to release conda packages. The possible packages form this matrix:

Device \ ISA   AVX2   AVX   SSE4   ARMv7
CPU
CUDA

ModuleNotFoundError: No module named 'turbo_transformers.turbo_transformers_cxx'

I followed the steps given in the README and set everything up inside the docker.

In the given steps, pip install `find . -name *whl` did nothing for me because that search returned nothing. Other than that I followed everything.

After that, I tried cpu_example.py to generate embeddings. I just ran the code from the repo and got the following error:

Traceback (most recent call last):
  File "cpu_example.py", line 15, in <module>
    import turbo_transformers
  File "/workspace/TurboTransformers/turbo_transformers/python/turbo_transformers/__init__.py", line 14, in <module>
    from .layers import *
  File "/workspace/TurboTransformers/turbo_transformers/python/turbo_transformers/layers/__init__.py", line 14, in <module>
    from .modeling_bert import BertEmbeddings, BertIntermediate, BertOutput, BertAttention, BertLayer, SequencePool, \
  File "/workspace/TurboTransformers/turbo_transformers/python/turbo_transformers/layers/modeling_bert.py", line 18, in <module>
    import turbo_transformers.turbo_transformers_cxx as cxx
ModuleNotFoundError: No module named 'turbo_transformers.turbo_transformers_cxx' 

Steps I tried:

I saw some make files inside the repo, so I ran make; it set up a lot of things and recompiled much of the (mainly C++) source code.

Any help would be appreciated! The README looks interesting and I'm excited to see the results; once past this error I hope to play around with it for a while. Thanks.

Can't tell which function turbo_transformers.BertModel.from_torch actually calls

Here is the situation: while running gpu_benchmark.py, I saw that the turbo_transformers path calls turbo_transformers.BertModel.from_torch.

Reading the code, I concluded that the callee should be the from_torch method of class BertModel in TurboTransformers/turbo_transformers/python/turbo_transformers/layers/modeling_bert.py, so I added some print statements to that function. Yet when I execute bash run_gpu_benchmark.sh, none of those prints produce any output.

So which function does turbo_transformers.BertModel.from_torch actually call?
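
A generic way to settle this with plain Python introspection (nothing turbo-specific, just a sketch): ask the interpreter where the attribute is actually defined. If the reported path points into an installed site-packages copy rather than the source tree you edited, prints added to the source tree will never show up.

import inspect

import turbo_transformers

fn = turbo_transformers.BertModel.from_torch
# file and first line where from_torch is really defined
print(inspect.getsourcefile(fn), inspect.getsourcelines(fn)[1])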

Benchmark on a K80: Turbo's speedup over PyTorch is below expectations

I cd into benchmark and run bash run_gpu_benchmark.sh; the results are:

{"QPS": 141.9624342537054, "elapsed": 1.0566175537109375, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "turbo", "thread_num": 1}
{"QPS": 79.07311483207774, "elapsed": 1.896978515625, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "torch", "thread_num": 1}
{"QPS": 95.6774176067004, "elapsed": 1.56776806640625, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "turbo", "thread_num": 1}
{"QPS": 78.33956337296448, "elapsed": 1.9147413330078125, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "torch", "thread_num": 1}
{"QPS": 95.04851976317696, "elapsed": 1.578141357421875, "n": 150, "batch_size": 1, "seq_len": 30, "framework": "turbo", "thread_num": 1}
{"QPS": 78.68219615532972, "elapsed": 1.9064033203125, "n": 150, "batch_size": 1, "seq_len": 30, "framework": "torch", "thread_num": 1}

The above shows turbo vs. torch QPS at batch size 1 with seq_len 10, 20 and 30.

In the README's GPU M40 results, turbo's speedup over torch at seq_len 10, 20 and 30 is 3.31x, 3.17x and 2.88x respectively. On the K80, the speedups I measure are only 1.79x, 1.22x and 1.20x.

My question: why is the speedup so much less pronounced on a K80? Is turbo optimized for specific GPU models?

Unit tests fail

Test project /tmp/build
Start 1: tt_core_test
1/12 Test #1: tt_core_test ..................... Passed 0.01 sec
Start 2: tt_kernels_test
2/12 Test #2: tt_kernels_test .................. Passed 0.09 sec
Start 3: bert_attention_test
3/12 Test #3: bert_attention_test ..............***Failed 0.39 sec
date time ( uptime ) [ thread name/id ] file:line v|
2020-05-16 14:14:00.527 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-05-16 14:14:00.527 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /tmp/build/turbo_transformers/python
2020-05-16 14:14:00.527 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-05-16 14:14:00.527 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
Traceback (most recent call last):
  File "/home/liusiyang/TurboTransformers/turbo_transformers/python/tests/bert_attention_test.py", line 27, in <module>
    import turbo_transformers
  File "/tmp/build/turbo_transformers/python/pypackage/turbo_transformers/__init__.py", line 14, in <module>
    from .layers import *
  File "/tmp/build/turbo_transformers/python/pypackage/turbo_transformers/layers/__init__.py", line 14, in <module>
    from .modeling_bert import BertEmbeddings, BertIntermediate, BertOutput, BertAttention, BertLayer, SequencePool, \
  File "/tmp/build/turbo_transformers/python/pypackage/turbo_transformers/layers/modeling_bert.py", line 20, in <module>
    import torch
  File "/home/liusiyang/anaconda3/envs/python3.6/lib/python3.6/site-packages/torch/__init__.py", line 81, in <module>
    from torch._C import *
ImportError: numpy.core.multiarray failed to import
2020-05-16 14:14:00.800 ( 0.272s) [main thread ] loguru.cpp:489 INFO| atexit

(Tests 4 through 12 fail identically: each aborts while importing turbo_transformers, or torch directly, with the same error:

ImportError: numpy.core.multiarray failed to import

raised from "from torch._C import *" in torch/__init__.py.)

17% tests passed, 10 tests failed out of 12

Total Test time (real) = 3.43 sec

The following tests FAILED:
3 - bert_attention_test (Failed)
4 - bert_embedding_test (Failed)
5 - bert_encoder_test (Failed)
6 - bert_intermediate_test (Failed)
7 - bert_layer_test (Failed)
8 - bert_model_test (Failed)
9 - bert_output_test (Failed)
10 - bert_pooler_test (Failed)
11 - sequence_pool_test (Failed)
12 - tensor_conversion_test (Failed)
Errors while running CTest

GPU benchmark does not support PyTorch 1.5.0

Hi, I tried to use the latest PyTorch as the control group for the GPU benchmark, with the following configuration:
pytorch: 1.5.0
torchvision: 0.6.0
CUDA: 10.2
OS: Ubuntu 18.04
That is, the corresponding line in Dockerfile.gpu was changed to "conda install pytorch=1.5.0 torchvision=0.6.0 cudatoolkit=10.2 -c pytorch".

After building inside Docker, several test cases fail at test time:

Test project /tmp/build
Start 1: tt_core_test
1/12 Test #1: tt_core_test ..................... Passed 0.52 sec
Start 2: tt_kernels_test
2/12 Test #2: tt_kernels_test .................. Passed 29.18 sec
Start 3: bert_attention_test
3/12 Test #3: bert_attention_test ..............***Failed 4.50 sec
date time ( uptime ) [ thread name/id ] file:line v|
2020-06-08 13:10:51.358 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-06-08 13:10:51.358 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /tmp/build/turbo_transformers/python
2020-06-08 13:10:51.358 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-06-08 13:10:51.358 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
FFFFFFFFFFFFFFFFFFFFFFFBertAttention "(1,010)" CPU Torch QPS, 492.80298203436234, time, 0.002029208500061941
BertAttention "(1,010)" CPU Turbo QPS, 1082.363535550833, time, 0.0009239040000466048`

...

The following tests FAILED:
3 - bert_attention_test (Failed)
5 - bert_encoder_test (Failed)
6 - bert_intermediate_test (Failed)
7 - bert_layer_test (Failed)
8 - bert_model_test (Failed)
9 - bert_output_test (Failed)
10 - bert_pooler_test (Failed)

What is the highest PyTorch version the GPU benchmark currently supports? And which PyTorch version were the benchmark results on the project front page compared against?

Add a customized layer and compilation failed!

I compiled turbo, and running the benchmark fails with:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/turbo_transformers/layers/modeling_bert.py", line 16, in <module>
    import turbo_transformers.turbo_transformers_cxxd as cxx
ModuleNotFoundError: No module named 'turbo_transformers.turbo_transformers_cxxd'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "gpu_benchmark.py", line 114, in <module>
    main()
  File "gpu_benchmark.py", line 104, in main
    benchmark_turbo_transformers(**kwargs)
  File "gpu_benchmark.py", line 38, in benchmark_turbo_transformers
    import turbo_transformers
  File "/usr/local/lib/python3.6/dist-packages/turbo_transformers/__init__.py", line 14, in <module>
    from .layers import *
  File "/usr/local/lib/python3.6/dist-packages/turbo_transformers/layers/__init__.py", line 14, in <module>
    from .modeling_bert import BertEmbeddings, BertIntermediate, BertOutput, BertAttention, BertLayer, SequencePool, \
  File "/usr/local/lib/python3.6/dist-packages/turbo_transformers/layers/modeling_bert.py", line 18, in <module>
    import turbo_transformers.turbo_transformers_cxx as cxx
ImportError: /usr/local/lib/python3.6/dist-packages/turbo_transformers/turbo_transformers_cxx.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZNK18turbo_transformers6layers15DEEPFMEmbeddingclERKNS_4core6TensorEPS3_
What could be the cause?

Why doesn't the BERT model return the hidden states as an output?

hidden_cache = self.encoder(hidden_states=hidden_cache,
                            attention_mask=extended_attention_masks,
                            return_type=ReturnType.turbo_transformers,
                            output=hidden_cache)
self.seq_pool = SequencePool(PoolingMap[pooling_type])
output = self.seq_pool(input_tensor=hidden_cache,
                       return_type=return_type,
                       output_tensor=output)
return output

For my work I need to do further processing on the hidden states.

Is fp16 or fp32 the fairer baseline for the benchmark?

For the vast majority of layer computations, does TurboTransformers use 16-bit or 32-bit floats? In that earlier issue you mentioned "support for Tensor Cores", and I am not sure whether that means large parts of the computation are converted to fp16.
When I use onnxruntime's transformer optimization tool as the baseline (its official README says the tool applies the latest optimization techniques and is even faster than onnxruntime-gpu: "Some of the latest optimizations that have not yet been integrated into ONNX Runtime are available in this tool that tunes models for the best performance."), it is unclear whether its fp32 or fp16 mode makes the fairer comparison against TurboTransformers.
My measurements on a V100: Turbo reaches a 1.5x to 2.5x speedup over the fp32 onnxruntime transformer tools, but loses to the fp16 version at 0.7x to 0.8x.

Turbo slower than Torch on V100

Dear developers,

I am trying to reproduce the bert benchmarking result on my machine.

[screenshot]

I just ran bash run_gpu_benchmark.sh, but the QPS is much lower than the declared value. Once seq_len grows beyond 80, turbo becomes slower than torch.

[screenshot]

I installed TurboTransformers from source:

mkdir -p build && cd build
cmake .. -DWITH_GPU=ON
make -j 4
pip install `find . -name *whl`

ONNXRT cannot be applied to ALBERT

/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py:738: UserWarning: ONNX export failed on ATen operator einsum because torch.onnx.symbolic_opset9.einsum does not exist
.format(op_name, opset_version, op_name))
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/workspace/benchmark/benchmark_helper.py", line 89, in generate_onnx_model
torch.onnx.export(model=model, args=(input_ids, ), f=outf)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/init.py", line 168, in export
custom_opsets, enable_onnx_checker, use_external_data_format)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 69, in export
use_external_data_format=use_external_data_format)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 488, in _export
fixed_batch_size=fixed_batch_size)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 351, in _model_to_graph
fixed_batch_size=fixed_batch_size, params_dict=params_dict)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 154, in _optimize_graph
graph = torch._C._jit_pass_onnx(graph, operator_export_type)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/init.py", line 199, in _run_symbolic_function
return utils._run_symbolic_function(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 739, in _run_symbolic_function
op_fn = sym_registry.get_registered_op(op_name, '', opset_version)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/symbolic_registry.py", line 109, in get_registered_op
raise RuntimeError(msg)
RuntimeError: Exporting the operator einsum to ONNX opset version 9 is not supported. Support for this operator was added in version 12, try exporting with this version.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "cpu_benchmark.py", line 173, in
main()
File "cpu_benchmark.py", line 164, in main
benchmark_helper.onnxruntime_benchmark_creator('CPU')(**kwargs)
File "/workspace/benchmark/benchmark_helper.py", line 106, in impl
backend))
File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 261, in apply
return self.apply_async(func, args, kwds).get()
File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
RuntimeError: Exporting the operator einsum to ONNX opset version 9 is not supported. Support for this operator was added in version 12, try exporting with this version.
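
Following the error message's own suggestion, here is a self-contained sketch of exporting with opset_version=12, where einsum is supported; the toy module merely stands in for ALBERT's einsum-using attention.

import torch

class EinsumToy(torch.nn.Module):
    # stand-in for a layer that uses einsum internally, as ALBERT does
    def forward(self, x):
        return torch.einsum("bij,bjk->bik", x, x)

x = torch.randn(1, 4, 4)
# einsum export was added in ONNX opset 12, so request that opset explicitly
torch.onnx.export(EinsumToy(), (x,), "einsum_toy.onnx", opset_version=12)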

How do I make core::IsCompiledWithCUDA() return true?

I am using the C++ APIs. Following the instructions in https://github.com/Tencent/TurboTransformers/blob/master/example/cpp/README.md, I run ./bert_model_example bert.npz.

However, I found that core::IsCompiledWithCUDA() in example/cpp/bert_model_example.cpp always returns false, so the GPU cannot be used from the C++ API.

Grepping the codebase for this function returns:

example/cpp/bert_model_test.cpp:97: if (core::IsCompiledWithCUDA()) {
example/cpp/bert_model_test.cpp:106: if (core::IsCompiledWithCUDA()) {
example/cpp/bert_model_example.cpp:140: if (core::IsCompiledWithCUDA()) {
example/cpp/bert_model_example.cpp:149:// if (core::IsCompiledWithCUDA()) {
turbo_transformers/core/config.h:26:constexpr bool IsCompiledWithCUDA() {
turbo_transformers/python/pybind.cpp:52: m.def("is_compiled_with_cuda", &core::IsCompiledWithCUDA)

Installation environment

A newbie's comment: it works great once it is running, but isn't this installation process rather heavyweight?

Inference custom model

Hello! I have a PyTorch model that consists of a Hugging Face BERT model followed by several PyTorch nn layers (such as feed-forward and LSTM). Can I accelerate and run it (or at least the BERT part of it) using TurboTransformers?

And how can I take advantage of variable length support? Should I just pad my tensor up to the length of the longest element?

Build the service in Python or C++?

My current requirement: text arrives remotely, and I need to run a server on this machine that uses BERT to return the CLS vector (i.e. the pooled_output of the original BERT). Should I build this server on the cpp interface in example, or on the Python interface? What is the speed difference between the two?

Tensor data type mismatch

Running my own code, I get the following error:
RuntimeError: enforce error details::IsDataType<T>(t.dtype) at /TurboTransformers/turbo_transformers/core/tensor.h:302
data type mismatch, request l, actual (0,32)
Callstack 0
What might be the cause?

Slower than original torch model (batch_gemm kernel is slow on my CPU)

Hi, I just ran the example in the README and found the turbo model is slower than the original torch model. I pulled the latest CPU docker image and ran the sample script in the container.
I used the following command to start the container:

docker run -itd --name turbo_test thufeifeibear/turbo_transformers_cpu:latest
docker exec -it turbo_test bash

[screenshot]

About step 1 of the GPU installation

sh tools/build_docker_gpu.sh $PWD

My system is CentOS Linux release 7.6.1810 (Core)
CUDA version: 10.2
cuDNN version: 7.6.5
GPU: K80, driver version 440.33.01
PyTorch version: 1.5.0
Python version: 3.7.5

How should build_docker_gpu.sh be modified for this setup?

I currently changed it to

CUDA_VERSION=10.2
DOCKER_BASE=${CUDA_VERSION}-cudnn7-devel-CentOS7.6
PYTORCH_VERSION=1.5.0

and the build fails with

++ cat ../CMakeLists.txt
++ grep TURBO_TRANSFORMERS_VERSION
++ sed 's#set(TURBO_TRANSFORMERS_VERSION ##g'
++ sed 's#)##g'

+ VERSION=0.2.0
+ CUDA_VERSION=10.2
+ DOCKER_BASE=10.2-cudnn7-devel-CentOS7.6
+ PYTORCH_VERSION=1.5.0
+ sed s#IMAGE_BASE#nvidia/cuda:10.2-cudnn7-devel-CentOS7.6#g ./docker/Dockerfile_dev.gpu
+ sed s#CUDA_VERSION#10.2#g
+ sed s#PYTORCH_VERSION#1.5.0#g
+ docker build -t ccr.ccs.tencentyun.com/mmspr/turbo_transformers:0.2.0-cuda10.2-cudnn7-devel-CentOS7.6-gpu-dev -f Dockerfile.gpu .
Sending build context to Docker daemon 34.82kB
Step 1/6 : FROM nvidia/cuda:10.2-cudnn7-devel-CentOS7.6
manifest for nvidia/cuda:10.2-cudnn7-devel-CentOS7.6 not found

After installing the conda package, import fails: libmkl_intel_lp64.so is missing

Following step 4 of the tutorial:

sh tool/build_conda_package.sh
# The conda package will be in /workspace/dist/*.tar.bz2
# To use turbo_transformers in environments outside this container: python -m pip install your_root_path/dist/*.tar.bz2

After building the package (*.tar.bz2) on one machine and installing it on a server, importing it fails with:

>>> import turbo_transformers
Traceback (most recent call last):
  File "/home/work/anaconda3/envs/pytorch/lib/python3.7/site-packages/turbo_transformers/layers/modeling_bert.py", line 16, in <module>
    import turbo_transformers.turbo_transformers_cxxd as cxx
ModuleNotFoundError: No module named 'turbo_transformers.turbo_transformers_cxxd'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/work/anaconda3/envs/pytorch/lib/python3.7/site-packages/turbo_transformers/__init__.py", line 14, in <module>
    from .layers import *
  File "/home/work/anaconda3/envs/pytorch/lib/python3.7/site-packages/turbo_transformers/layers/__init__.py", line 14, in <module>
    from .modeling_bert import BertEmbeddings, BertIntermediate, BertOutput, BertAttention, BertLayer, SequencePool, \
  File "/home/work/anaconda3/envs/pytorch/lib/python3.7/site-packages/turbo_transformers/layers/modeling_bert.py", line 18, in <module>
    import turbo_transformers.turbo_transformers_cxx as cxx
ImportError: libmkl_intel_lp64.so: cannot open shared object file: No such file or directory

Turbo is slow when thread_num is set to the number of CPU cores

I find that in many cases, when thread_num equals the number of CPU cores, turbo becomes extremely slow, far behind torch and the other baselines; with a smaller thread_num everything is normal again. What is the reason? (A thread-capping sketch follows the log below.)
Machine: Azure VM, "8 Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz"
PyTorch version: 1.5.0

2020-06-11 00:43:28.694 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-06-11 00:43:28.694 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /workspace/benchmark
2020-06-11 00:43:28.694 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-06-11 00:43:28.694 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
{'model': 'bert-base-uncased', 'seq_len': 10, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 2.400054755041472, "elapsed": 62.498574119992554, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "turbo", "thread_num": 8}
2020-06-11 00:44:33.975 ( 65.280s) [main thread ] loguru.cpp:489 INFO| atexit
{'model': 'bert-base-uncased', 'seq_len': 10, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 33.7716449691695, "elapsed": 4.441595904994756, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "onnx_rt_MKL", "n_threads": 8}
{'model': 'bert-base-uncased', 'seq_len': 10, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 31.75250998386658, "elapsed": 4.724035991996061, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "onnx_rt_CPU", "n_threads": 8}
{'model': 'bert-base-uncased', 'seq_len': 10, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 19.904907790425327, "elapsed": 7.535829935979564, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "torch", "thread_num": 8}
{'model': 'bert-base-uncased', 'seq_len': 10, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 23.488386050582193, "elapsed": 6.386134819011204, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "torch_jit", "n_threads": 8}
date time ( uptime ) [ thread name/id ] file:line v|
2020-06-11 00:45:28.113 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-06-11 00:45:28.113 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /workspace/benchmark
2020-06-11 00:45:28.113 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-06-11 00:45:28.113 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
{'model': 'bert-base-uncased', 'seq_len': 20, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 2.3016662434532256, "elapsed": 65.17017852899153, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "turbo", "thread_num": 8}
2020-06-11 00:46:35.959 ( 67.845s) [main thread ] loguru.cpp:489 INFO| atexit
{'model': 'bert-base-uncased', 'seq_len': 20, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 22.960897563908322, "elapsed": 6.532845659996383, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "onnx_rt_MKL", "n_threads": 8}
{'model': 'bert-base-uncased', 'seq_len': 20, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 22.442801442538517, "elapsed": 6.683657580986619, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "onnx_rt_CPU", "n_threads": 8}
{'model': 'bert-base-uncased', 'seq_len': 20, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 12.459391223496187, "elapsed": 12.039111487014452, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "torch", "thread_num": 8}
{'model': 'bert-base-uncased', 'seq_len': 20, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 14.308623019358208, "elapsed": 10.483189039019635, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "torch_jit", "n_threads": 8}
date time ( uptime ) [ thread name/id ] file:line v|
2020-06-11 00:47:42.783 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-06-11 00:47:42.783 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /workspace/benchmark
2020-06-11 00:47:42.783 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-06-11 00:47:42.783 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
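
For reference, a hedged sketch of capping the thread count below the core count before the runtimes initialize. The value 4 is purely illustrative, and it is my assumption, not a documented guarantee, that turbo_transformers honors OMP_NUM_THREADS read at import time.

import os

# must be set before any OpenMP runtime initializes, hence before the imports
os.environ["OMP_NUM_THREADS"] = "4"

import torch

torch.set_num_threads(4)

import turbo_transformers  # assumption: picks up the OpenMP setting at import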

Refactor MultiheadAttention and other Layers

The logic inside the MultiheadAttention layer is now too complex for further development, and some bugs exist in the management of intermediate buffers.
Rewriting this code so that others can easily understand what Turbo is doing is the first priority.

GPT2

Is GPT2 supported, e.g. GPT2-Chinese? Could you provide a corresponding example? Thanks.

Dropout seems to be missing

In bert_attention.cpp, shouldn't there be a dropout between the dense layer and the LayerNorm? I don't seem to see one.
[screenshot]
