tencent / turbotransformers
A fast and user-friendly runtime for transformer inference (Bert, Albert, GPT2, Decoders, etc) on CPU and GPU.
License: Other
Is there a paper describing the optimization principles behind Turbo?
Hi! Following the comment in run_gpu_benchmark.sh, I installed ONNX Runtime with "pip install onnxruntime-gpu" and added "onnxruntime" to the benchmark. At runtime I hit the same problem as that issue: libcublas.so.10 cannot be loaded. The answer in that issue was to install the older onnxruntime 1.1.0. Which onnxruntime version were the benchmark results on the project homepage compared against? And how can I get the latest onnxruntime to run as the benchmark baseline?
Besides, after installing the old onnxruntime 1.1.0, the GPU benchmark also fails:
Warning: ATen was a removed experimental ops. In the future, we may directly reject this operator. Please update your model as soon as possible.
Warning: ATen was a removed experimental ops. In the future, we may directly reject this operator. Please update your model as soon as possible.
Traceback (most recent call last):
File "gpu_benchmark.py", line 114, in <module>
main()
File "gpu_benchmark.py", line 108, in main
benchmark_helper.onnxruntime_benchmark_creator('GPU')(**kwargs)
File "/workspace/benchmark/benchmark_helper.py", line 114, in impl
graph_optimization_level=onnxruntime.GraphOptimizationLevel.
File "/opt/miniconda3/lib/python3.7/site-packages/onnxruntime/backend/backend.py", line 80, in prepare
return cls.prepare(bin, device, **kwargs)
File "/opt/miniconda3/lib/python3.7/site-packages/onnxruntime/backend/backend.py", line 69, in prepare
inf = InferenceSession(model, options)
File "/opt/miniconda3/lib/python3.7/site-packages/onnxruntime/capi/session.py", line 25, in __init__
self._load_model(providers)
File "/opt/miniconda3/lib/python3.7/site-packages/onnxruntime/capi/session.py", line 43, in _load_model
self._sess.load_model(providers)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Fatal error: ATen is not a registered function/op
Hello, and sorry for raising an unrelated question in this project; I had no other way to contact you. I am studying the code you committed in LAB_seq-malGAN. I am familiar with PyTorch, but it looks like your PyTorch version was never finished, so I worked through the TensorFlow version instead. I don't fully understand it, and it runs very slowly: your candidate API vocabulary has only a hundred-odd entries, while mine has tens of thousands, and with that it becomes extremely slow. Could you point me to the original paper so I can study it? And if you ever update the PyTorch version, I would really appreciate a commit. Many thanks.
Could you explain a bit more about the variable-length support? Does it mean the runtime can accept inputs with different sequence lengths across calls in a single session, like [batch, 8], [batch, 32], etc.? Or can it actually handle different sequence lengths within one input, e.g., for a batch of size 2 like below, running sequence length 128 for the first row and 3 for the second for better performance?
[
[1,2,3,4, ...,128],
[5,6,7],
]
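To make the two readings concrete, here is a plain-Python sketch of the shapes involved (illustration only, not the Turbo API):

```python
# Reading 1: each request is rectangular, but seq_len differs across requests
# (a [2, 8] batch in one call, a [2, 32] batch in the next).
batch_8 = [[0] * 8 for _ in range(2)]    # shape [2, 8]
batch_32 = [[0] * 32 for _ in range(2)]  # shape [2, 32]

# Reading 2: ragged lengths within a single batch, as in the example above.
ragged = [list(range(1, 129)), [5, 6, 7]]
lengths = [len(row) for row in ragged]
print(lengths)  # [128, 3]
```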
Testing with the Python API: at batch_size=1 latency is a few milliseconds, at batch_size=120 it is close to 100 ms, and at batch_size=256 it is a bit over 100 ms. Is this expected?
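As a rough sanity check (with hypothetical figures standing in for "a few ms" and "a bit over 100 ms"), per-sequence cost is the more telling number:

```python
# Hypothetical latencies: "a few ms" taken as 5 ms, "close to 100 ms" as
# 100 ms, "a bit over 100 ms" as 120 ms.
latency_ms = {1: 5.0, 120: 100.0, 256: 120.0}

# Per-sequence cost drops sharply as the batch grows, i.e. batching amortizes
# fixed overheads; sub-linear growth of total latency is typical behavior.
per_seq = {bs: ms / bs for bs, ms in latency_ms.items()}
print(per_seq)
```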
The steps I ran are as follows.
The output of step 4 was:
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
CMake Warning (dev) at CMakeLists.txt:22 (add_executable):
Policy CMP0003 should be set before this line. Add code such as
if(COMMAND cmake_policy)
cmake_policy(SET CMP0003 NEW)
endif(COMMAND cmake_policy)
as early as possible but after the most recent call to
cmake_minimum_required or cmake_policy(VERSION). This warning appears
because target "bert_model_example" links to some libraries for which the
linker must search:
tt_npz_loader, tt_layers, tt_kernels
and other libraries with known full path:
/workspace/TurboTransformers/example/cpp/build/libbert_model.a
CMake is adding directories in the second list to the linker search path in
case they are needed to find libraries from the first list (for backwards
compatibility with CMake 2.4). Set policy CMP0003 to OLD or NEW to enable
or disable this behavior explicitly. Run "cmake --help-policy CMP0003" for
more information.
This warning is for project developers. Use -Wno-dev to suppress it.
-- Generating done
-- Build files have been written to: /workspace/TurboTransformers/example/cpp/build
The output of step 5 was:
Scanning dependencies of target bert_model
[ 25%] Building CXX object CMakeFiles/bert_model.dir/bert_model.cpp.o
In file included from /workspace/TurboTransformers/example/cpp/bert_model.cpp:14:0:
/workspace/TurboTransformers/example/cpp/bert_model.h:20:10: fatal error: dlpack/dlpack.h: No such file or directory
#include "dlpack/dlpack.h"
^~~~~~~~~~~~~~~~~
compilation terminated.
CMakeFiles/bert_model.dir/build.make:62: recipe for target 'CMakeFiles/bert_model.dir/bert_model.cpp.o' failed
make[2]: *** [CMakeFiles/bert_model.dir/bert_model.cpp.o] Error 1
CMakeFiles/Makefile2:72: recipe for target 'CMakeFiles/bert_model.dir/all' failed
make[1]: *** [CMakeFiles/bert_model.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
Running bash tools/compile.sh $PWD -DWITH_GPU=ON /tmp/build fails with:
avx512fintrin.h(1761): error: identifier "__builtin_ia32_sqrtsd_round" is undefined. What could be the cause?
(py36) [benchmark]# echo $LD_LIBRARY_PATH
/opt/intel/mkl/lib/intel64_lin
(py36) [benchmark]# ls $LD_LIBRARY_PATH
libmkl_avx2.so libmkl_gnu_thread.so libmkl_mc.so libmkl_vml_avx512.so
libmkl_avx512_mic.so libmkl_intel_ilp64.a libmkl_rt.so libmkl_vml_avx.so
libmkl_avx512.so libmkl_intel_ilp64.so libmkl_sequential.a libmkl_vml_cmpt.so
libmkl_avx.so libmkl_intel_lp64.a libmkl_sequential.so libmkl_vml_def.so
libmkl_core.a libmkl_intel_lp64.so libmkl_tbb_thread.a libmkl_vml_mc2.so
libmkl_core.so libmkl_intel_thread.a libmkl_tbb_thread.so libmkl_vml_mc3.so
libmkl_def.so libmkl_intel_thread.so libmkl_vml_avx2.so libmkl_vml_mc.so
libmkl_gnu_thread.a libmkl_mc3.so libmkl_vml_avx512_mic.so locale
What should I do to solve this problem? Thanks.
How can I convert a core::Tensor into a TensorFlow tensor?
Running run_gpu_benchmark.py in the docker environment, and also gpu_example under example/python, both hit this problem: AttributeError: module 'turbo_transformers' has no attribute 'BertModel'. Could you advise? Thanks.
Dear developers, I am working on optimizing albert serving performance.
I see that Albert support is under development. When will it be released?
Using FBGEMM to support CPU quantization.
We tried to reproduce the performance numbers. Following the docs, we built the docker image ourselves and the unit tests pass, but when running the benchmark the model fails to load. What could be the reason? The error is:
model = transformers.BertModel.from_pretrained(model_id)
File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_utils.py", line 629, in from_pretrained
"Unable to load weights from pytorch checkpoint file. "
The BERT model files we downloaded are https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json, plus bert-base-uncased-vocab.txt and bert-base-uncased-pytorch_model.bin.
Are there any special requirements on the model files, or is some preprocessing needed?
Thanks.
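One possible cause (an assumption on my part, not confirmed in the thread): when `transformers` loads from a local directory it expects the canonical filenames `config.json`, `vocab.txt` and `pytorch_model.bin`, while the S3 downloads above carry a `bert-base-uncased-` prefix. A sketch of the rename, using a throwaway directory with empty stand-in files:

```python
import os
import tempfile

# Hypothetical local model directory holding the S3 filenames quoted above.
d = tempfile.mkdtemp()
for name in ("bert-base-uncased-config.json",
             "bert-base-uncased-vocab.txt",
             "bert-base-uncased-pytorch_model.bin"):
    open(os.path.join(d, name), "w").close()

# Rename to the canonical names that from_pretrained(d) looks for.
renames = {"bert-base-uncased-config.json": "config.json",
           "bert-base-uncased-vocab.txt": "vocab.txt",
           "bert-base-uncased-pytorch_model.bin": "pytorch_model.bin"}
for old, new in renames.items():
    os.rename(os.path.join(d, old), os.path.join(d, new))

print(sorted(os.listdir(d)))  # ['config.json', 'pytorch_model.bin', 'vocab.txt']
```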
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release .. -DWITH_GPU=OFF
-- /include
-- Blas provider is mkl
-- pybind11 v2.4.dev4
-- OpenMP USED FLAGS
CMake Error at CMakeLists.txt:79 (message):
Cannot find librt from
thank you.
Sorry, one more question. After running "sh tools/build_docker_gpu.sh $PWD", the docker run command below, which instantiates the freshly built image, fails. (Nothing, including build_docker_gpu.sh, was modified after git clone.)
sudo docker run --gpus all --net=host --rm -it -v $PWD:/workspace -v /etc/passwd:/etc/passwd --name=gt_gpu_env ccr.ccs.tencentyun.com/mmspr/turbo_transformers:0.2.1-cuda10.0-cudnn7-devel-ubuntu18.04-gpu-dev
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\n\""": unknown.
This problem first appeared around noon today. Could it be caused by the just-released upgrade from 0.2.0 to 0.2.1?
Hello, I have a BERT model fine-tuned with TensorFlow (with my own task head). How can I use it in the way example/python shows? Right now it looks like only the pretrained models shipped with huggingface can be converted.
Using the docker image is not easy for all developers. It would be better to release conda packages. The possible packages are:

| Device\CPU | AVX2 | AVX | SSE4 | ARMv7 |
|---|---|---|---|---|
| CPU | | | | |
| CUDA | | | | |
I followed the steps given in the README and set up everything inside the docker.
Among those steps, ``pip install `find . -name *whl` `` did nothing for me, because the find search returned nothing. Other than that I followed everything.
After which, I tried cpu_example.py to generate embeddings. I just ran the code from the repo and got the following error:
Traceback (most recent call last):
File "cpu_example.py", line 15, in <module>
import turbo_transformers
File "/workspace/TurboTransformers/turbo_transformers/python/turbo_transformers/__init__.py", line 14, in <module>
from .layers import *
File "/workspace/TurboTransformers/turbo_transformers/python/turbo_transformers/layers/__init__.py", line 14, in <module>
from .modeling_bert import BertEmbeddings, BertIntermediate, BertOutput, BertAttention, BertLayer, SequencePool, \
File "/workspace/TurboTransformers/turbo_transformers/python/turbo_transformers/layers/modeling_bert.py", line 18, in <module>
import turbo_transformers.turbo_transformers_cxx as cxx
ModuleNotFoundError: No module named 'turbo_transformers.turbo_transformers_cxx'
Steps I tried: I saw some Makefiles inside the repo, so I ran make, which set up a lot of things and recompiled much of the (mainly C++) source code.
Any help would be appreciated! The README looks interesting and I am excited to see the results; once I get past this error I hope to play around with it for a while. Thanks.
Here is the situation. While running gpu_benchmark.py, I saw that the turbo_transformers path calls turbo_transformers.BertModel.from_torch.
Reading the code, I believed this resolves to the from_torch function of class BertModel in TurboTransformers/turbo_transformers/python/turbo_transformers/layers/modeling_bert.py, so I added some print statements there. But when I run bash run_gpu_benchmark.sh, none of those prints show up.
So which function does turbo_transformers.BertModel.from_torch actually call?
Same as the title.
I cd into benchmark and run bash run_gpu_benchmark.sh; the results are:
{"QPS": 141.9624342537054, "elapsed": 1.0566175537109375, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "turbo", "thread_num": 1}
{"QPS": 79.07311483207774, "elapsed": 1.896978515625, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "torch", "thread_num": 1}
{"QPS": 95.6774176067004, "elapsed": 1.56776806640625, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "turbo", "thread_num": 1}
{"QPS": 78.33956337296448, "elapsed": 1.9147413330078125, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "torch", "thread_num": 1}
{"QPS": 95.04851976317696, "elapsed": 1.578141357421875, "n": 150, "batch_size": 1, "seq_len": 30, "framework": "turbo", "thread_num": 1}
{"QPS": 78.68219615532972, "elapsed": 1.9064033203125, "n": 150, "batch_size": 1, "seq_len": 30, "framework": "torch", "thread_num": 1}
The above shows the turbo vs. torch QPS at batch size 1 and seq_len 10, 20, 30.
In the README's GPU M40 test, at seq_len 10/20/30 turbo's speedup over torch is 3.31/3.17/2.88. On a K80, the speedups become 1.79, 1.22 and 1.20 respectively.
My question: why is turbo's speedup over torch so much smaller on a K80? Is turbo optimized for specific GPU models?
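For reference, the K80 speedups quoted (1.79, 1.22, 1.20) follow directly from the QPS lines above, truncated to two decimals:

```python
import math

# QPS values copied verbatim from the K80 run above (turbo vs. torch,
# seq_len 10/20/30, batch size 1).
turbo = [141.9624342537054, 95.6774176067004, 95.04851976317696]
torch_qps = [79.07311483207774, 78.33956337296448, 78.68219615532972]

# Truncate (not round) to two decimals, matching how the figures are quoted.
speedup = [math.floor(t / p * 100) / 100 for t, p in zip(turbo, torch_qps)]
print(speedup)  # [1.79, 1.22, 1.2]
```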
Hi, I am very interested in your project. Do you need contributors, and how could I make my own contribution?
Test project /tmp/build
Start 1: tt_core_test
1/12 Test #1: tt_core_test ..................... Passed 0.01 sec
Start 2: tt_kernels_test
2/12 Test #2: tt_kernels_test .................. Passed 0.09 sec
Start 3: bert_attention_test
3/12 Test #3: bert_attention_test ..............***Failed 0.39 sec
date time ( uptime ) [ thread name/id ] file:line v|
2020-05-16 14:14:00.527 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-05-16 14:14:00.527 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /tmp/build/turbo_transformers/python
2020-05-16 14:14:00.527 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-05-16 14:14:00.527 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
Traceback (most recent call last):
File "/home/liusiyang/TurboTransformers/turbo_transformers/python/tests/bert_attention_test.py", line 27, in <module>
import turbo_transformers
File "/tmp/build/turbo_transformers/python/pypackage/turbo_transformers/__init__.py", line 14, in <module>
from .layers import *
File "/tmp/build/turbo_transformers/python/pypackage/turbo_transformers/layers/__init__.py", line 14, in <module>
from .modeling_bert import BertEmbeddings, BertIntermediate, BertOutput, BertAttention, BertLayer, SequencePool, \
File "/tmp/build/turbo_transformers/python/pypackage/turbo_transformers/layers/modeling_bert.py", line 20, in <module>
import torch
File "/home/liusiyang/anaconda3/envs/python3.6/lib/python3.6/site-packages/torch/__init__.py", line 81, in <module>
from torch._C import *
ImportError: numpy.core.multiarray failed to import
2020-05-16 14:14:00.800 ( 0.272s) [main thread ] loguru.cpp:489 INFO| atexit
Start 4 .. Start 12: the nine remaining tests (bert_embedding_test, bert_encoder_test, bert_intermediate_test, bert_layer_test, bert_model_test, bert_output_test, bert_pooler_test, sequence_pool_test, tensor_conversion_test) all fail with the same traceback, ending in:
ImportError: numpy.core.multiarray failed to import
17% tests passed, 10 tests failed out of 12
Total Test time (real) = 3.43 sec
The following tests FAILED:
3 - bert_attention_test (Failed)
4 - bert_embedding_test (Failed)
5 - bert_encoder_test (Failed)
6 - bert_intermediate_test (Failed)
7 - bert_layer_test (Failed)
8 - bert_model_test (Failed)
9 - bert_output_test (Failed)
10 - bert_pooler_test (Failed)
11 - sequence_pool_test (Failed)
12 - tensor_conversion_test (Failed)
Errors while running CTest
Hi,
Zhihu also has an optimized BERT project, CuBERT. Have you compared speed with it?
I want to run BERT on GPU with C++, and the default device is 0.
Is there any way to select another device with the C++ API?
Hello, I tried to use the latest PyTorch as the GPU benchmark baseline, with the following configuration:
pytorch: 1.5.0
torchvision: 0.6.0
CUDA: 10.2
OS: Ubuntu18.04
That is, the corresponding line in Dockerfile.gpu was changed to "conda install pytorch=1.5.0 torchvision=0.6.0 cudatoolkit=10.2 -c pytorch".
After building inside docker, several test cases fail at test time:
Test project /tmp/build
Start 1: tt_core_test
1/12 Test #1: tt_core_test ..................... Passed 0.52 sec
Start 2: tt_kernels_test
2/12 Test #2: tt_kernels_test .................. Passed 29.18 sec
Start 3: bert_attention_test
3/12 Test #3: bert_attention_test ..............***Failed 4.50 sec
date time ( uptime ) [ thread name/id ] file:line v|
2020-06-08 13:10:51.358 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-06-08 13:10:51.358 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /tmp/build/turbo_transformers/python
2020-06-08 13:10:51.358 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-06-08 13:10:51.358 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
FFFFFFFFFFFFFFFFFFFFFFF
BertAttention "(1,010)" CPU Torch QPS, 492.80298203436234, time, 0.002029208500061941
BertAttention "(1,010)" CPU Turbo QPS, 1082.363535550833, time, 0.0009239040000466048
...
The following tests FAILED:
3 - bert_attention_test (Failed)
5 - bert_encoder_test (Failed)
6 - bert_intermediate_test (Failed)
7 - bert_layer_test (Failed)
8 - bert_model_test (Failed)
9 - bert_output_test (Failed)
10 - bert_pooler_test (Failed)
What is the highest PyTorch version this GPU benchmark currently supports? And which PyTorch version were the benchmark results on the project homepage compared against?
After compiling turbo, running the benchmark fails:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/turbo_transformers/layers/modeling_bert.py", line 16, in <module>
import turbo_transformers.turbo_transformers_cxxd as cxx
ModuleNotFoundError: No module named 'turbo_transformers.turbo_transformers_cxxd'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "gpu_benchmark.py", line 114, in <module>
main()
File "gpu_benchmark.py", line 104, in main
benchmark_turbo_transformers(**kwargs)
File "gpu_benchmark.py", line 38, in benchmark_turbo_transformers
import turbo_transformers
File "/usr/local/lib/python3.6/dist-packages/turbo_transformers/__init__.py", line 14, in <module>
from .layers import *
File "/usr/local/lib/python3.6/dist-packages/turbo_transformers/layers/__init__.py", line 14, in <module>
from .modeling_bert import BertEmbeddings, BertIntermediate, BertOutput, BertAttention, BertLayer, SequencePool, \
File "/usr/local/lib/python3.6/dist-packages/turbo_transformers/layers/modeling_bert.py", line 18, in <module>
import turbo_transformers.turbo_transformers_cxx as cxx
ImportError: /usr/local/lib/python3.6/dist-packages/turbo_transformers/turbo_transformers_cxx.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZNK18turbo_transformers6layers15DEEPFMEmbeddingclERKNS_4core6TensorEPS3_
What could be the cause?
The homepage says "we provide ways to load huggingface/transformers pretrained models from both PyTorch and TensorFlow", but
https://github.com/Tencent/TurboTransformers/blob/master/example/python/README_cn.md
says "first we need a BERT model trained with huggingface", and there is no TensorFlow example anywhere.
For my work I need to do further processing on the hidden states.
For the bulk of the network's layer computations, does TurboTransformers use 16-bit or 32-bit floats? In this issue you mentioned that "Tensor Cores are supported"; I am not sure whether computation is largely cast to fp16.
When using onnxruntime's transformer optimization tool as a baseline (the official README says the tool applies the latest optimizations and is even faster than onnxruntime-gpu: "Some of the latest optimizations that have not yet been integrated into ONNX Runtime are available in this tool that tunes models for the best performance."), I am not sure whether its fp32 or fp16 variant is the fairer comparison against TurboTransformers.
On a V100, my results show Turbo achieving a 1.5x to 2.5x speedup over the fp32 onnxruntime transformer tool, but losing to the fp16 variant, i.e., a 0.7x to 0.8x ratio.
Dear developers,
I am trying to reproduce the bert benchmarking result on my machine.
I just run bash run_gpu_benchmark.sh
but the QPS is much lower than the declared value. When seq_len
becomes larger than 80, turbo becomes slower than torch.
I installed TurboTransformers from source
mkdir -p build && cd build
cmake .. -DWITH_GPU=ON
make -j 4
pip install `find . -name *whl`
/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py:738: UserWarning: ONNX export failed on ATen operator einsum because torch.onnx.symbolic_opset9.einsum does not exist
.format(op_name, opset_version, op_name))
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/workspace/benchmark/benchmark_helper.py", line 89, in generate_onnx_model
torch.onnx.export(model=model, args=(input_ids, ), f=outf)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/__init__.py", line 168, in export
custom_opsets, enable_onnx_checker, use_external_data_format)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 69, in export
use_external_data_format=use_external_data_format)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 488, in _export
fixed_batch_size=fixed_batch_size)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 351, in _model_to_graph
fixed_batch_size=fixed_batch_size, params_dict=params_dict)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 154, in _optimize_graph
graph = torch._C._jit_pass_onnx(graph, operator_export_type)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/__init__.py", line 199, in _run_symbolic_function
return utils._run_symbolic_function(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 739, in _run_symbolic_function
op_fn = sym_registry.get_registered_op(op_name, '', opset_version)
File "/opt/conda/lib/python3.7/site-packages/torch/onnx/symbolic_registry.py", line 109, in get_registered_op
raise RuntimeError(msg)
RuntimeError: Exporting the operator einsum to ONNX opset version 9 is not supported. Support for this operator was added in version 12, try exporting with this version.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "cpu_benchmark.py", line 173, in <module>
main()
File "cpu_benchmark.py", line 164, in main
benchmark_helper.onnxruntime_benchmark_creator('CPU')(**kwargs)
File "/workspace/benchmark/benchmark_helper.py", line 106, in impl
backend))
File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 261, in apply
return self.apply_async(func, args, kwds).get()
File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
RuntimeError: Exporting the operator einsum to ONNX opset version 9 is not supported. Support for this operator was added in version 12, try exporting with this version.
I am now using the C++ APIs. Following the instructions in https://github.com/Tencent/TurboTransformers/blob/master/example/cpp/README.md, I run ./bert_model_example bert.npz.
But I found that core::IsCompiledWithCUDA() in example/cpp/bert_model_example.cpp always returns false, so I cannot use the GPU from the C++ API.
Grepping the code base for this function returns:
example/cpp/bert_model_test.cpp:97: if (core::IsCompiledWithCUDA()) {
example/cpp/bert_model_test.cpp:106: if (core::IsCompiledWithCUDA()) {
example/cpp/bert_model_example.cpp:140: if (core::IsCompiledWithCUDA()) {
example/cpp/bert_model_example.cpp:149:// if (core::IsCompiledWithCUDA()) {
turbo_transformers/core/config.h:26:constexpr bool IsCompiledWithCUDA() {
turbo_transformers/python/pybind.cpp:52: m.def("is_compiled_with_cuda", &core::IsCompiledWithCUDA)
Turbo supports MKL and OpenBLAS for CPU GEMM operations. Neither of them is optimized for AMD CPUs. We look forward to supporting AMD BLIS:
https://github.com/amd/blis
https://developer.amd.com/amd-aocl/
A beginner's comment: it works great, but isn't the installation process a bit too heavyweight?
Hello! I have a PyTorch model that consists of a Hugging Face BERT model followed by several PyTorch nn layers (such as feed-forward and LSTM). Can I accelerate and run it (or at least the BERT part of it) using TurboTransformers?
And how can I take advantage of variable length support? Should I just pad my tensor up to the length of the longest element?
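On the padding question, a minimal framework-agnostic sketch (the token ids are made up for illustration): pad every row to the longest one and carry an attention mask so padded positions can be ignored:

```python
# Hypothetical batch of token-id sequences with different lengths.
batch = [[101, 7592, 2088, 102], [101, 2003, 102]]

max_len = max(len(seq) for seq in batch)
pad_id = 0  # [PAD] in the standard BERT vocab

# Pad each row to max_len; the mask marks real tokens (1) vs padding (0).
input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(input_ids)       # [[101, 7592, 2088, 102], [101, 2003, 102, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```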
My current need: a text arrives from a remote client, and I must run a server on my machine that uses BERT to return the CLS vector (i.e., the pooled_output of the original BERT). Should I build the server on the C++ interface from the examples, or on the Python interface? What is the speed difference between the two?
Running my own code fails with:
RuntimeError: enforce error details::IsDataType(t.dtype) at /TurboTransformers/turbo_transformers/core/tensor.h:302
data type mismatch, request l, actual (0,32)
Callstack 0
What could be the cause?
Hi, I just ran the example in the README and found the turbo model is slower than the original torch model. I pulled the latest CPU docker image and ran the sample script in the container.
I use the following command to start the container:
docker run -itd --name turbo_test thufeifeibear/turbo_transformers_cpu:latest
docker exec -it turbo_test bash
sh tools/build_docker_gpu.sh $PWD
How should build_docker_gpu.sh be modified?
I currently changed it to:
CUDA_VERSION=10.2
DOCKER_BASE=${CUDA_VERSION}-cudnn7-devel-CentOS7.6
PYTORCH_VERSION=1.5.0
It then fails with:
++ cat ../CMakeLists.txt
++ grep TURBO_TRANSFORMERS_VERSION
++ sed 's#set(TURBO_TRANSFORMERS_VERSION ##g'
++ sed 's#)##g'
GPU model:
GPU 0: GeForce RTX 2080 Ti
However I tune batch_size, vocab_size, or seq_len, it is slower than tensorflow.
Is this expected? What could be the reason?
Following tutorial step 4:
sh tool/build_conda_package.sh
# The conda package will be in /workspace/dist/*.tar.bz2
# When using turbo_transformers in other environments outside this container: python -m pip install your_root_path/dist/*.tar.bz2
After building the package (*.tar.bz2) on one machine and installing it on a server, import fails with:
>>> import turbo_transformers
Traceback (most recent call last):
File "/home/work/anaconda3/envs/pytorch/lib/python3.7/site-packages/turbo_transformers/layers/modeling_bert.py", line 16, in <module>
import turbo_transformers.turbo_transformers_cxxd as cxx
ModuleNotFoundError: No module named 'turbo_transformers.turbo_transformers_cxxd'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/work/anaconda3/envs/pytorch/lib/python3.7/site-packages/turbo_transformers/__init__.py", line 14, in <module>
from .layers import *
File "/home/work/anaconda3/envs/pytorch/lib/python3.7/site-packages/turbo_transformers/layers/__init__.py", line 14, in <module>
from .modeling_bert import BertEmbeddings, BertIntermediate, BertOutput, BertAttention, BertLayer, SequencePool, \
File "/home/work/anaconda3/envs/pytorch/lib/python3.7/site-packages/turbo_transformers/layers/modeling_bert.py", line 18, in <module>
import turbo_transformers.turbo_transformers_cxx as cxx
ImportError: libmkl_intel_lp64.so: cannot open shared object file: No such file or directory
I have found that in many cases, when thread_num equals the number of CPU cores, turbo becomes extremely slow, far behind torch and the other baselines; with thread_num set somewhat lower, performance is normal again. What causes this?
Machine: Azure VM, "8 Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz"
PyTorch version: 1.5.0
2020-06-11 00:43:28.694 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-06-11 00:43:28.694 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /workspace/benchmark
2020-06-11 00:43:28.694 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-06-11 00:43:28.694 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
{'model': 'bert-base-uncased', 'seq_len': 10, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 2.400054755041472, "elapsed": 62.498574119992554, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "turbo", "thread_num": 8}
2020-06-11 00:44:33.975 ( 65.280s) [main thread ] loguru.cpp:489 INFO| atexit
{'model': 'bert-base-uncased', 'seq_len': 10, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 33.7716449691695, "elapsed": 4.441595904994756, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "onnx_rt_MKL", "n_threads": 8}
{'model': 'bert-base-uncased', 'seq_len': 10, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 31.75250998386658, "elapsed": 4.724035991996061, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "onnx_rt_CPU", "n_threads": 8}
{'model': 'bert-base-uncased', 'seq_len': 10, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 19.904907790425327, "elapsed": 7.535829935979564, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "torch", "thread_num": 8}
{'model': 'bert-base-uncased', 'seq_len': 10, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 23.488386050582193, "elapsed": 6.386134819011204, "n": 150, "batch_size": 1, "seq_len": 10, "framework": "torch_jit", "n_threads": 8}
date time ( uptime ) [ thread name/id ] file:line v|
2020-06-11 00:45:28.113 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-06-11 00:45:28.113 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /workspace/benchmark
2020-06-11 00:45:28.113 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-06-11 00:45:28.113 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
{'model': 'bert-base-uncased', 'seq_len': 20, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 2.3016662434532256, "elapsed": 65.17017852899153, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "turbo", "thread_num": 8}
2020-06-11 00:46:35.959 ( 67.845s) [main thread ] loguru.cpp:489 INFO| atexit
{'model': 'bert-base-uncased', 'seq_len': 20, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 22.960897563908322, "elapsed": 6.532845659996383, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "onnx_rt_MKL", "n_threads": 8}
{'model': 'bert-base-uncased', 'seq_len': 20, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 22.442801442538517, "elapsed": 6.683657580986619, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "onnx_rt_CPU", "n_threads": 8}
{'model': 'bert-base-uncased', 'seq_len': 20, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 12.459391223496187, "elapsed": 12.039111487014452, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "torch", "thread_num": 8}
{'model': 'bert-base-uncased', 'seq_len': 20, 'batch_size': 1, 'n': 150, 'num_threads': 8}
{"QPS": 14.308623019358208, "elapsed": 10.483189039019635, "n": 150, "batch_size": 1, "seq_len": 20, "framework": "torch_jit", "n_threads": 8}
date time ( uptime ) [ thread name/id ] file:line v|
2020-06-11 00:47:42.783 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-06-11 00:47:42.783 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /workspace/benchmark
2020-06-11 00:47:42.783 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-06-11 00:47:42.783 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
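One likely culprit for the slowdown above is thread oversubscription: when the worker threads occupy every core, the Python driver thread and the OS compete with them, and on cloud VMs the advertised "cores" are often hyper-threads. A hedged sketch of a thread-count sweep (`turbo_transformers.set_num_threads` follows the README; treat the exact call as an assumption for your version):

```python
import os

logical_cores = os.cpu_count()
# Leave headroom: the driver thread and the OS also need a core, and
# hyper-threaded "cores" rarely help dense GEMM workloads.
candidates = sorted({1, 2, max(1, logical_cores // 2), max(1, logical_cores - 1)})

for n in candidates:
    # Cap OpenMP/MKL threads before the heavy libraries spin up their pools.
    os.environ["OMP_NUM_THREADS"] = str(n)
    # turbo_transformers.set_num_threads(n)  # per the README; re-run the benchmark here
    print("benchmark with thread_num =", n)
```

Picking the thread count with the best measured QPS, rather than defaulting to the core count, usually resolves this class of slowdown.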
The logic inside the MultiheadAttention layer is currently too complex for further development.
Moreover, some bugs exist in the management of intermediate results.
Rewriting this code so that others can easily understand what Turbo is doing is the first priority.
Is GPT2 supported, e.g. GPT2-Chinese? Could you provide a corresponding example? Thanks.