verisilicon / tim-vx Goto Github PK

View Code? Open in Web Editor NEW

219.0 219.0 84.0 74.13 MB

VeriSilicon Tensor Interface Module

License: Other

Starlark 0.20% C++ 12.86% C 86.63% Makefile 0.05% Shell 0.05% CMake 0.19% Python 0.02%

deep-learning neural-network tensorflow

tim-vx's People

Contributors

Stargazers

Watchers

Forkers

xiafeimao bug1989 devinxie cyberfire liyuenan2333 zongwuyang bowshotds zhouhengzheng hawk081 kaihongxia peakhillsgroup mfkiwl ayounes-synaptics deepware-ai 2019dj nightingalei antkillerfarm robert-kalmar janoslim tangyuan-liu zhengzhouheng chxin66 auuuux xiangjun0103 jp1010620676 geneyoung changerzz iveschow siqi-yang gigi9411 unsky stella0816 nullkooland hanhwi yan-wyb lileiigithub thezha zhupengyang zihaomu fengyuentau xuke537 paultcn hibillchou onepick shagergel yingshengbd gdh1995 ahqzy dongka sunshinemyson zjd1988 small-v harwardding meseraph zhangjianfei-2007 chenfeiyue-cfy wpgwen mercurychen faintnj wwtghx arouniversal leecheedoo wei8171023 vimers huanyucai frankdenuijl shijie-nv scuwq ddzhangmengwei starstylesky zhenglinx ttfleet xie-oritek pearshen anyj0527 syna-synap lix19937 mahiru-mahiru aaronrong nxp-upstream zhongzhuonan synaptics-synap itri-t300 hjessie

tim-vx's Issues

compile NBG failed

Hello,

I followed source code defined at graph_test.cc that use NBG,
However I failed to compile nbg file.
I wonder my creating nbg file code is wrong. Thank you.

Compile as binary and save.

size_t bin_size = 0;

graph->CompileToBinary(nullptr, &bin_size); // bin size A  
std::vector<char> nbg_buf(bin_size); 

this->graph->CompileToBinary(nbg_buf.data(), &bin_size);  // bin size B
std::string file_path = this->makeChunkPath(nbgdir); 

std::ofstream FILE(file_path, std::ofstream::binary);
for (auto byte : nbg_buf)
    FILE << byte;

// while bin size A and B is slightly different, I wonder it does not matter.

Read NBG and compile

// read file size
FILE* nbg = NULL;    
nbg = fopen(file_path.c_str(), "rb");
fseek(nbg, 0, SEEK_END);
long rst = ftell(nbg);

fseek(nbg, 0, SEEK_SET);

this->nbg_buf = std::vector<char>(rst);

while((read_size = fread(nbg_buf.data(), 1, 4096, nbg)) > 0)
{ }

std::vector<std::shared_ptr<tim::vx::Tensor>> iVXTensors;
std::vector<std::shared_ptr<tim::vx::Tensor>> oVXTensors;
// Create tensors...
...
...
//

//  make f auto nbg_node = this->graph->CreateOperation<tim::vx::ops::NBG>(
  (nbg_buf.data()), /*num_of_input*/ iVXTensors.size(), /*num_of_output*/ oVXTensors.size());

(*nbg_node).BindInputs(iVXTensors).BindOutputs(oVXTensors);

iVXTensors[0]->CopyDataToTensor(nbg_buf.data(), shape);

std::cout << "Compile start." << std::endl;
this->graph->Compile();
// This goes seg fault...

TVM RPC Error "PLS isn't existed" on Khadas VIM3 Pro (Amlogic A311D)

@sunshinemyson

I tried the VSI NPU as TVM target, ran the test_operations.py in TVM_FOLDER/tests/python/contrib/test_vsi_npu.
It had error "PLS isn't existed" on VIM3 Pro side. I found the previous issue , I cannot solve the problem by setting "VSIMULATOR_CONFIG=VIPNANOQI_PID0X88".
The following is my environment:

Environment variable (Host)

export VSIMULATOR_CONFIG=VIPNANOQI_PID0X88 # This PID is provided by khadas document.
export VIV_VX_DEBUG_LEVEL=1

Environment variable (VIM3 Pro)

export VIV_VX_DEBUG_LEVEL=1

Model: Khadas VIM3 Pro
SoC: Amlogic A311D with 5 TOPS Performance NPU

OS information:

khadas@Khadas:~$ uname -a
Linux Khadas 4.9.241 #18 SMP PREEMPT Fri Jun 25 14:18:34 CST 2021 aarch64 aarch64 aarch64 GNU/Linux

khadas@Khadas:~$ cat /etc/fenix-release
# PLEASE DO NOT EDIT THIS FILE
BOARD=VIM3
VENDOR=Amlogic
VERSION=1.0.7
ARCH=arm64
INITRD_ARCH=arm64
INSTALL_TYPE=EMMC
IMAGE_VERSION=V1.0.7-210625
################ GIT VERSION ################
UBOOT_GIT_VERSION=khadas-vims-v1.0.5-release
LINUX_GIT_VERSION=khadas-vims-v1.0.5-release-6-gc5aa6ab
FENIX_GIT_VERSION=v1.0.7
#############################################

NPU information:

khadas@Khadas:~$ dpkg -l | grep npu
ii  aml-npu                              6.4.4.3AAA-2                                                 arm64        Amlogic NPU libraries.
ii  evtest                               1:1.34-1                                                     arm64        utility to monitor Linux input device events
ii  libinput-bin                         1.15.5-1ubuntu0.2                                            arm64        input device management and event handling library - udev quirks
ii  libinput10:arm64                     1.15.5-1ubuntu0.2                                            arm64        input device management and event handling library - shared library
ii  libxi6:arm64                         2:1.7.10-0ubuntu1                                            arm64        X11 Input extension library

khadas@Khadas:~$ lsmod
Module                  Size  Used by
cpufreq_powersave      16384  0
cpufreq_userspace      16384  0
cpufreq_conservative    16384  0
cpufreq_ondemand       20480  0
iv009_isp_sensor      270336  0
iv009_isp_lens         69632  0
iv009_isp_iq          544768  0
galcore               462848  0
vpu                    49152  0
encoder                53248  0
amvdec_avs2           192512  0
amvdec_vp9            151552  0
amvdec_vc1             53248  0
amvdec_real            40960  0
amvdec_mmpeg4          32768  0
amvdec_mpeg4           53248  0
amvdec_mmpeg12         40960  0
amvdec_mpeg12          90112  0
amvdec_mmjpeg          28672  0
amvdec_mjpeg           36864  0
amvdec_h265           135168  0
amvdec_h264mvc         49152  0
amvdec_mh264          151552  0
amvdec_h264           118784  0
amvdec_avs             61440  0
stream_input          180224  10 amvdec_h265,amvdec_mh264,amvdec_h264mvc,amvdec_real,amvdec_vp9,amvdec_h264,amvdec_avs2,amvdec_mpeg12,amvdec_avs,amvdec_mmpeg12
decoder_common        176128  17 amvdec_h265,amvdec_mjpeg,amvdec_mh264,amvdec_mmpeg4,amvdec_h264mvc,amvdec_mmjpeg,amvdec_real,stream_input,amvdec_vp9,amvdec_h264,encoder,amvdec_avs2,amvdec_mpeg12,amvdec_avs,amvdec_vc1,amvdec_mmpeg12,amvdec_mpeg4
firmware               28672  18 amvdec_h265,amvdec_mjpeg,amvdec_mh264,amvdec_mmpeg4,amvdec_h264mvc,amvdec_mmjpeg,decoder_common,amvdec_real,stream_input,amvdec_vp9,amvdec_h264,encoder,amvdec_avs2,amvdec_mpeg12,amvdec_avs,amvdec_vc1,amvdec_mmpeg12,amvdec_mpeg4
media_clock            45056  12 amvdec_h265,amvdec_mh264,decoder_common,vpu,firmware,stream_input,amvdec_vp9,amvdec_h264,encoder,amvdec_avs2,amvdec_mpeg12,amvdec_avs
mali_kbase            475136  0
iv009_isp             540672  2
zram                   36864  4
dhd                  1404928  0
btrfs                1269760  0
xor                    20480  1 btrfs
raid6_pq              106496  1 btrfs

khadas@Khadas:~$ ls /dev/galcore
/dev/galcore

khadas@Khadas:~$ sudo dmesg | grep Gal
[   12.202405] Galcore version 6.4.4.3.310723AAA

TIM-VX Version:1.1.32

TVM Branch commit id: b822ec32702e2676dce1e430221e8efc05c98935

The output message after executing Unittest program of TIM-VX:

 khadas@Khadas:~/TIM-VX-1.1.32/install/bin$ ./unit_test 
 Running main() from /home/khadas/TIM-VX-1.1.32/_deps/googletest-src/googletest/src/gtest_main.cc
 [==========] Running 104 tests from 33 test suites.
 [----------] Global test environment set-up.
 [----------] 1 test from Context
 <Skip the PASS Items. >
 [----------] 1 test from Context (25 ms total)
 
 [----------] 2 tests from graph
 [ RUN      ] graph.gen_binary_graph_with_empty_graph
 E [_graph_optimization_convert_int8_to_uint8:792]CHECK STATUS(-1:A generic error code, used when no other describes the error.)
 E [vsi_nn_OptimizeGraph:827]CHECK STATUS(-1:A generic error code, used when no other describes the error.)
 [       OK ] graph.gen_binary_graph_with_empty_graph (3 ms)
 [ RUN      ] graph.gen_binary_graph_with_simple_add
 [       OK ] graph.gen_binary_graph_with_simple_add (8 ms)
 [----------] 2 tests from graph (11 ms total)
 
 [----------] 2 tests from Linear
 <Skip the PASS Items. >
 [----------] 2 tests from Linear (13 ms total)
 
 [----------] 3 tests from Conv1d
 <Skip the PASS Items. >
 [----------] 3 tests from Conv1d (22 ms total)
 
 [----------] 19 tests from Conv2d
 <Skip the PASS Items. >
 [----------] 19 tests from Conv2d (195 ms total)
 
 [----------] 2 tests from DeConv1d
 [ RUN      ] DeConv1d.no_bias_layout_whcn_depthwise_shape_3_2_1
 /home/khadas/TIM-VX-1.1.32/src/tim/vx/ops/deconv1d_test.cc:69: Failure
 Expected equality of these values:
   golden
     Which is: { 27, 81, 30, 9, 3, 21, 15, 27, 0, 0 }
   output_data
     Which is: { 48, 96, 57, 9, 3, 0, 0, 0, 0, 0 }
 Result mismatch
 [  FAILED  ] DeConv1d.no_bias_layout_whcn_depthwise_shape_3_2_1 (9 ms)
 <Skip the PASS Items. >
 [----------] 2 tests from DeConv1d (56 ms total)
 
 [----------] 2 tests from DeConv2d
 [ RUN      ] DeConv2d.shape_3_3_2_1_float_depthwise
 /home/khadas/TIM-VX-1.1.32/src/tim/vx/ops/deconv2d_test.cc:85: Failure
 Expected equality of these values:
   golden
     Which is: { 27, 72, 18, 24, 3, 81, 45, 90, 15, 21, 30, 26, 43, 22, 11, 9, 5, 25, 10, 14, 3, 2, 9, 4, 6, 21, 27, 52, 63, 7, 15, 6, ... }
   output_data
     Which is: { 48, 99, 70, 87, 10, 96, 51, 134, 29, 42, 57, 26, 168, 94, 33, 9, 5, 65, 26, 38, 3, 2, 81, 4, 22, 0, 0, 0, 0, 0, 0, 0, ... }
 Result mismatch
 [  FAILED  ] DeConv2d.shape_3_3_2_1_float_depthwise (9 ms)
 <Skip the PASS Items. >
 [----------] 2 tests from DeConv2d (18 ms total)
 
 [----------] 16 tests from DepthwiseConv
 <Skip the PASS Items. >
 [----------] 16 tests from DepthwiseConv (176 ms total)
 
 [----------] 3 tests from FloorDiv
 <Skip the PASS Items. >
 
 (10:0) : error : Error(0,10) : Cannot find the header file cl_viv_vx_ext.h.
 (255:0) : error : Error(0,255) : Cannot find the header file cl_viv_vx_ext.h.
 (27:0) : error : undefined identifier: 'COPY'
 (55:0) : error : undefined identifier: 'COPY'
 (257:0) : error : syntax error at 'VXC_512Bits'
 
 ERROR: Failed to compile vx shader. (error: FFFFFFFF)
 E [_gpu_register:476]Build program fail.
 E [vsi_nn_kernel_create_node:631]Register client kernel com.vivantecorp.extension.evis.floordiv_U8U8toU8_2D fail with -1.
 
 [       OK ] FloorDiv.shape_5_1_broadcast_uint8 (56 ms)
 [----------] 3 tests from FloorDiv (135 ms total)
 
 [----------] 3 tests from GroupedConv2d
 <Skip the PASS Items. >
 [----------] 3 tests from GroupedConv2d (29 ms total)
 
 [----------] 2 tests from InstanceNorm
 <Skip the PASS Items. >

 [----------] 2 tests from InstanceNorm (208 ms total)
 
 [----------] 2 tests from LayerNorm
 <Skip the PASS Items. >
 [----------] 2 tests from LayerNorm (117 ms total)
 
 [----------] 3 tests from LogSoftmax
 <Skip the PASS Items. >
 [ RUN      ] LogSoftmax.shape_3_6_1_uint8_axis_1
 
 (10:0) : error : Error(0,10) : Cannot find the header file cl_viv_vx_ext.h.
 (255:0) : error : Error(0,255) : Cannot find the header file cl_viv_vx_ext.h.
 (27:0) : error : undefined identifier: 'COPY'
 (55:0) : error : undefined identifier: 'COPY'
 (263:0) : error : syntax error at 'VXC_512Bits'
 
 ERROR: Failed to compile vx shader. (error: FFFFFFFF)
 E [_gpu_register:476]Build program fail.
 E [vsi_nn_kernel_create_node:631]Register client kernel com.vivantecorp.extension.evis.log_softmax_axis1_U8toU8_2D fail with -1.
 
 [       OK ] LogSoftmax.shape_3_6_1_uint8_axis_1 (70 ms)
 [----------] 3 tests from LogSoftmax (161 ms total)
 
 [----------] 3 tests from Matmul
 <Skip the PASS Items. >
 [ RUN      ] Matmul.shape_2_3_2_shape_2_3_2_uint8_transpose_a
 
 (10:0) : error : Error(0,10) : Cannot find the header file cl_viv_vx_ext.h.
 (255:0) : error : Error(0,255) : Cannot find the header file cl_viv_vx_ext.h.
 (27:0) : error : undefined identifier: 'COPY'
 (55:0) : error : undefined identifier: 'COPY'
 (261:0) : error : syntax error at 'VXC_512Bits'
 
 ERROR: Failed to compile vx shader. (error: FFFFFFFF)
 E [_gpu_register:476]Build program fail.
 E [vsi_nn_kernel_create_node:631]Register client kernel com.vivantecorp.extension.evis.gemm_transa_U8U8toU8 fail with -1.
 
 [       OK ] Matmul.shape_2_3_2_shape_2_3_2_uint8_transpose_a (30 ms)
 [----------] 3 tests from Matmul (113 ms total)
 
 [----------] 2 tests from MaxpoolWithArgmax
 <Skip the PASS Items. >
 [ RUN      ] MaxpoolWithArgmax.shape_4_4_1_uint8_kernel_2_stride_2
 
 (10:0) : error : Error(0,10) : Cannot find the header file cl_viv_vx_ext.h.
 (255:0) : error : Error(0,255) : Cannot find the header file cl_viv_vx_ext.h.
 (27:0) : error : undefined identifier: 'COPY'
 (55:0) : error : undefined identifier: 'COPY'
 (258:0) : error : syntax error at 'VXC_512Bits'
 
 ERROR: Failed to compile vx shader. (error: FFFFFFFF)
 E [_gpu_register:476]Build program fail.
 E [vsi_nn_kernel_create_node:631]Register client kernel com.vivantecorp.extension.evis.poolwithargmax_U8to_U8_U8_2D fail with -1.
 
 [       OK ] MaxpoolWithArgmax.shape_4_4_1_uint8_kernel_2_stride_2 (54 ms)
 [----------] 2 tests from MaxpoolWithArgmax (100 ms total)
 
 [----------] 2 tests from MaxUnpool2d
 <Skip the PASS Items. >
 [ RUN      ] MaxUnpool2d.shape_2_2_1_uint8_kernel_2_stride_2
 
(10:0) : error : Error(0,10) : Cannot find the header file cl_viv_vx_ext.h.
(256:0) : error : Error(0,256) : Cannot find the header file cl_viv_vx_ext.h.
(27:0) : error : undefined identifier: 'COPY'
(55:0) : error : undefined identifier: 'COPY'
(296:0) : error : undefined identifier: 'vxc_uchar8'
(296:0) : error : undefined identifier: 'vxc_uchar8'
(296:0) : error : undefined identifier: 'vxc_uchar16'
(296:0) : error : undefined identifier: 'vxc_uchar16'
(296:0) : error : undefined identifier: 'vxc_uchar16'
(296:0) : error : undefined identifier: 'vxc_uchar16'
(296:0) : error : undefined identifier: 'vxc_uchar16'
(296:0) : error : undefined identifier: 'vxc_uchar16'
(296:0) : error : undefined identifier: 'din'
(296:0) : error : undefined identifier: 'axisIn'
(296:0) : error : undefined identifier: 'dinExpand'
(296:0) : error : undefined identifier: 'axisInExpand'
(296:0) : error : undefined identifier: 'zpValue'
(296:0) : error : undefined identifier: 'constAxis'
(296:0) : error : undefined identifier: 'axisData'
(296:0) : error : undefined identifier: 'dout'
(296:0) : error : undefined identifier: 'dout'
(296:0) : error : undefined identifier: 'constAxis'
(296:0) : error : undefined identifier: 'axisData'
(296:0) : error : undefined identifier: 'dout'
(296:0) : error : undefined identifier: 'dout'
(308:0) : error : undefined identifier: 'vxc_uchar8'
(308:0) : error : undefined identifier: 'vxc_uchar8'
(308:0) : error : undefined identifier: 'vxc_uchar16'
(308:0) : error : undefined identifier: 'vxc_uchar16'
(308:0) : error : undefined identifier: 'vxc_uchar16'
(308:0) : error : undefined identifier: 'vxc_uchar16'
(308:0) : error : undefined identifier: 'vxc_uchar16'
(308:0) : error : undefined identifier: 'vxc_uchar16'
(308:0) : error : undefined identifier: 'din'
(308:0) : error : undefined identifier: 'axisIn'
(308:0) : error : undefined identifier: 'dinExpand'
(308:0) : error : undefined identifier: 'axisInExpand'
(308:0) : error : undefined identifier: 'zpValue'
(308:0) : error : undefined identifier: 'constAxis'
(308:0) : error : undefined identifier: 'axisData'
(308:0) : error : undefined identifier: 'dout'
(308:0) : error : undefined identifier: 'dout'
(308:0) : error : undefined identifier: 'constAxis'
(308:0) : error : undefined identifier: 'axisData'
(308:0) : error : undefined identifier: 'dout'
(308:0) : error : undefined identifier: 'dout'
(312:0) : error : syntax error at 'VXC_512Bits'

ERROR: Failed to compile vx shader. (error: FFFFFFFF)
E [_gpu_register:476]Build program fail.
E [vsi_nn_kernel_create_node:631]Register client kernel com.vivantecorp.extension.evis.upsample_U8_U8to_U8_SAME_2D fail with -1.

[       OK ] MaxUnpool2d.shape_2_2_1_uint8_kernel_2_stride_2 (60 ms)
[----------] 2 tests from MaxUnpool2d (108 ms total)

[----------] 2 tests from Moments
<Skip the PASS Items. >
[----------] 2 tests from Moments (100 ms total)

[----------] 1 test from Equal
[ RUN      ] Equal.shape_1_uint8

(1:0) : error : Error(0,1) : Cannot find the header file cl_viv_vx_ext.h.
(7:0) : error : syntax error at 'VXC_512Bits'

ERROR: Failed to compile vx shader. (error: FFFFFFFF)
E [_gpu_register:476]Build program fail.
E [vsi_nn_kernel_create_node:631]Register client kernel com.vivantecorp.extension.evis.equal_U8U8toBOOL8_2D fail with -1.

[       OK ] Equal.shape_1_uint8 (89 ms)
[----------] 1 test from Equal (89 ms total)

[----------] 1 test from NotEqual
<Skip the PASS Items. >
[----------] 1 test from NotEqual (66 ms total)

[----------] 1 test from Less
<Skip the PASS Items. >
[----------] 1 test from Less (64 ms total)

[----------] 1 test from GreaterOrEqual
<Skip the PASS Items. >
[----------] 1 test from GreaterOrEqual (63 ms total)

[----------] 1 test from Greater
<Skip the PASS Items. >
[----------] 1 test from Greater (63 ms total)

[----------] 1 test from LessOrEqual
<Skip the PASS Items. >
[----------] 1 test from LessOrEqual (63 ms total)

[----------] 2 tests from Reorg
<Skip the PASS Items. >
[----------] 2 tests from Reorg (10 ms total)

[----------] 3 tests from Resize1d
<Skip the PASS Items. >
[ RUN      ] Resize1d.shape_4_2_1_uint8_nearest_whcn

(10:0) : error : Error(0,10) : Cannot find the header file cl_viv_vx_ext.h.
(255:0) : error : Error(0,255) : Cannot find the header file cl_viv_vx_ext.h.
(27:0) : error : undefined identifier: 'COPY'
(55:0) : error : undefined identifier: 'COPY'
(257:0) : error : syntax error at 'VXC_512Bits'

ERROR: Failed to compile vx shader. (error: FFFFFFFF)
E [_gpu_register:476]Build program fail.
E [vsi_nn_kernel_create_node:631]Register client kernel com.vivantecorp.extension.evis.resize_1d_nearest_U8toU8_op fail with -1.

[       OK ] Resize1d.shape_4_2_1_uint8_nearest_whcn (37 ms)
[ RUN      ] Resize1d.shape_5_1_1_float_bilinear_align_corners_whcn

[       OK ] Resize1d.shape_5_1_1_float_bilinear_align_corners_whcn (32 ms)
[----------] 3 tests from Resize1d (98 ms total)

[----------] 2 tests from ScatterND
[ RUN      ] ScatterND.shape_4_4_4

[       OK ] ScatterND.shape_4_4_4 (41 ms)
[ RUN      ] ScatterND.shape_9

(10:0) : error : Error(0,10) : Cannot find the header file cl_viv_vx_ext.h.
(255:0) : error : Error(0,255) : Cannot find the header file cl_viv_vx_ext.h.
(27:0) : error : undefined identifier: 'COPY'
(55:0) : error : undefined identifier: 'COPY'
(257:0) : error : syntax error at 'VXC_512Bits'

ERROR: Failed to compile vx shader. (error: FFFFFFFF)
E [_gpu_register:476]Build program fail.
E [vsi_nn_kernel_create_node:631]Register client kernel com.vivantecorp.extension.evis.scatter_nd_U8toU8 fail with -1.

[       OK ] ScatterND.shape_9 (25 ms)
[----------] 2 tests from ScatterND (66 ms total)

[----------] 1 test from Floor
[ RUN      ] Floor.shape_5_1_fp32
[       OK ] Floor.shape_5_1_fp32 (5 ms)
[----------] 1 test from Floor (5 ms total)

[----------] 1 test from Cast
[ RUN      ] Cast.shape_5_1_fp32_to_int32

[       OK ] Cast.shape_5_1_fp32_to_int32 (35 ms)
[----------] 1 test from Cast (35 ms total)

[----------] 1 test from SpatialTransformer
[ RUN      ] SpatialTransformer.shape_1_3_3_1_u8
(10:0) : error : Error(0,10) : Cannot find the header file cl_viv_vx_ext.h.
(23:0) : error : undefined identifier: 'vxc_ushort8'
(26:0) : error : undefined identifier: 'src0'
(27:0) : error : undefined identifier: 'src1'
(29:0) : error : undefined identifier: 'dst'
(31:0) : error : undefined identifier: 'dst'

ERROR: Failed to compile vx shader. (error: FFFFFFFF)
E [vsi_nn_RegisterVXKernel:251][/home/khadas/TIM-VX-1.1.32/src/tim/vx/internal/src/libnnext/vsi_nn_vxkernel.c : 251] vxBuildProgram() Error!

E [vsi_nn_InitKernel:108]Add parameter 0 to kernel com.vivantecorp.extension.vxcTransform_setupThres_F16toF16 fail. with -12.
E [vsi_nn_InitKernel:121]Finalize kernel com.vivantecorp.extension.vxcTransform_setupThres_F16toF16 fail with -12.
E [vsi_nn_InitKernel:126]Remove kernel com.vivantecorp.extension.vxcTransform_setupThres_F16toF16 fail with -10.
E [vsi_nn_RegisterClientKernelAndNewNode:415]Register client kernel com.vivantecorp.extension.vxcTransform_setupThres_F16toF16 fail with -10.
E [compute_node:379]Create node[0] SPATIAL_TRANSFORMER fail
/home/khadas/TIM-VX-1.1.32/src/tim/vx/ops/spatial_transformer_test.cc:74: Failure
Expected equality of these values:
 values_golden
   Which is: { '\x2' (2), '\x3' (3), '\x2' (2), '\x2' (2), '\x3' (3), '\x2' (2), '\x2' (2), '\x3' (3), '\x2' (2) }
 output_values
   Which is: { '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0' }
[  FAILED  ] SpatialTransformer.shape_1_3_3_1_u8 (22 ms)
[----------] 1 test from SpatialTransformer (22 ms total)

[----------] 2 tests from Tile
[ RUN      ] Tile.shape_3_2_float_multiples_2_1

[       OK ] Tile.shape_3_2_float_multiples_2_1 (45 ms)
[ RUN      ] Tile.shape_3_2_1_int8_multiples_2_2_1

(1:0) : error : Error(0,1) : Cannot find the header file cl_viv_vx_ext.h.
(59:0) : error : undefined identifier: 'vxc_uchar8'
(59:0) : error : undefined identifier: 'src'
(59:0) : error : undefined identifier: 'src'
(59:0) : error : undefined identifier: 'src'
(60:0) : error : undefined identifier: 'vxc_uchar8'
(60:0) : error : undefined identifier: 'src'
(60:0) : error : undefined identifier: 'src'
(60:0) : error : undefined identifier: 'src'
(61:0) : error : undefined identifier: 'vxc_uchar8'
(61:0) : error : undefined identifier: 'src'
(61:0) : error : undefined identifier: 'src'
(61:0) : error : undefined identifier: 'src'
(62:0) : error : undefined identifier: 'vxc_uchar8'
(62:0) : error : undefined identifier: 'src'
(62:0) : error : undefined identifier: 'src'
(62:0) : error : undefined identifier: 'src'
(63:0) : error : undefined identifier: 'vxc_uchar8'
(63:0) : error : undefined identifier: 'src'
(63:0) : error : undefined identifier: 'src'
(63:0) : error : undefined identifier: 'src'
(64:0) : error : undefined identifier: 'vxc_uchar8'
(64:0) : error : undefined identifier: 'src'
(64:0) : error : undefined identifier: 'src'
(64:0) : error : undefined identifier: 'src'
(65:0) : error : undefined identifier: 'vxc_uchar8'
(65:0) : error : undefined identifier: 'src'
(65:0) : error : undefined identifier: 'src'
(65:0) : error : undefined identifier: 'src'
(66:0) : error : undefined identifier: 'vxc_uchar8'
(66:0) : error : undefined identifier: 'src'
(66:0) : error : undefined identifier: 'src'
(66:0) : error : undefined identifier: 'src'
(68:0) : error : undefined identifier: 'vxc_short8'
(68:0) : error : undefined identifier: 'src'
(68:0) : error : undefined identifier: 'src'
(68:0) : error : undefined identifier: 'src'
(69:0) : error : undefined identifier: 'vxc_short8'
(69:0) : error : undefined identifier: 'src'
(69:0) : error : undefined identifier: 'src'
(69:0) : error : undefined identifier: 'src'
(70:0) : error : undefined identifier: 'vxc_short8'
(70:0) : error : undefined identifier: 'src'
(70:0) : error : undefined identifier: 'src'
(70:0) : error : undefined identifier: 'src'
(71:0) : error : undefined identifier: 'vxc_short8'
(71:0) : error : undefined identifier: 'src'
(71:0) : error : undefined identifier: 'src'
(71:0) : error : undefined identifier: 'src'
(72:0) : error : undefined identifier: 'vxc_short8'
(72:0) : error : undefined identifier: 'src'
(72:0) : error : undefined identifier: 'src'
(72:0) : error : undefined identifier: 'src'
(73:0) : error : undefined identifier: 'vxc_short8'
(73:0) : error : undefined identifier: 'src'
(73:0) : error : undefined identifier: 'src'
(73:0) : error : undefined identifier: 'src'
(74:0) : error : undefined identifier: 'vxc_short8'
(74:0) : error : undefined identifier: 'src'
(74:0) : error : undefined identifier: 'src'
(74:0) : error : undefined identifier: 'src'
(75:0) : error : undefined identifier: 'vxc_short8'
(75:0) : error : undefined identifier: 'src'
(75:0) : error : undefined identifier: 'src'
(75:0) : error : undefined identifier: 'src'
(115:0) : error : undefined identifier: 'vxc_uchar8'
(115:0) : error : undefined identifier: 'src'
(115:0) : error : undefined identifier: 'src'
(115:0) : error : undefined identifier: 'src'
(116:0) : error : undefined identifier: 'vxc_uchar8'
(116:0) : error : undefined identifier: 'src'
(116:0) : error : undefined identifier: 'src'
(116:0) : error : undefined identifier: 'src'
(117:0) : error : undefined identifier: 'vxc_uchar8'
(117:0) : error : undefined identifier: 'src'
(117:0) : error : undefined identifier: 'src'
(117:0) : error : undefined identifier: 'src'
(118:0) : error : undefined identifier: 'vxc_uchar8'
(118:0) : error : undefined identifier: 'src'
(118:0) : error : undefined identifier: 'src'
(118:0) : error : undefined identifier: 'src'
(119:0) : error : undefined identifier: 'vxc_uchar8'
(119:0) : error : undefined identifier: 'src'
(119:0) : error : undefined identifier: 'src'
(119:0) : error : undefined identifier: 'src'
(120:0) : error : undefined identifier: 'vxc_uchar8'
(120:0) : error : undefined identifier: 'src'
(120:0) : error : undefined identifier: 'src'
(120:0) : error : undefined identifier: 'src'
(121:0) : error : undefined identifier: 'vxc_uchar8'
(121:0) : error : undefined identifier: 'src'
(121:0) : error : undefined identifier: 'src'
(121:0) : error : undefined identifier: 'src'
(122:0) : error : undefined identifier: 'vxc_uchar8'
(122:0) : error : undefined identifier: 'src'
(122:0) : error : undefined identifier: 'src'
(122:0) : error : undefined identifier: 'src'
(124:0) : error : undefined identifier: 'vxc_short8'
(124:0) : error : undefined identifier: 'src'
(124:0) : error : undefined identifier: 'src'

ERROR: Failed to compile vx shader. (error: FFFFFFFF)
E [_gpu_register:476]Build program fail.
E [vsi_nn_kernel_create_node:631]Register client kernel com.vivantecorp.extension.evis.tile_remain3_U8toU8_2D fail with -1.

[       OK ] Tile.shape_3_2_1_int8_multiples_2_2_1 (80 ms)
[----------] 2 tests from Tile (125 ms total)

[----------] 14 tests from TransposeConv2d
<Skip the PASS Items. >
[ RUN      ] TransposeConv2d.shape_4_4_1_1_int8_QuantizedPerChannelOneTest
Segmentation fault
khadas@Khadas:~/TIM-VX-1.1.32/install/bin$

The output message after executing TVM test_operations.py at X86 Host side:

python3 test_operations.py 
Testing QNN pattern                                       1. press any key and continue...
make MOD Done!

conv2d NHWC layout is not optimized for x86 with autotvm.
#[version = "0.0.5"]
def @main(%data: Tensor[(1, 56, 56, 32), int8], %weight: Tensor[(1, 1, 32, 64), int8], %add: Tensor[(64), int32]) {
  %0 = qnn.conv2d(%data, %weight, 0, 77, 0.023528f, 0.045283f, padding=[0, 0, 0, 0], channels=64, kernel_size=[1, 1], data_layout="NHWC", kernel_layout="HWIO", out_dtype="int32");
  %1 = nn.bias_add(%0, %add, axis=3);
  qnn.requantize(%1, 0.00106542f, 0, 0.0235285f, 0, out_dtype="int8")
}

get_ref_result
get_vsi_result
get_vsi_model:before relay.build

vsi_npu.py --> qnn.requantize

This is important----> name_node.value() == tvmgen_default_vsi_npu_0
GraphMakerImpl::Create
TensorMakerImpl::InferCall: vsi_npu.qnn_conv2d

VsiNpuModule::GetFunction: get_symbol
VsiNpuModule::GetFunction: return early
VsiNpuModule::GetFunction: get_const_vars
VsiNpuModule::GetFunction: return early
VsiNpuModule::GetFunction: get_const_vars
VsiNpuModule::GetFunction: return early
VsiNpuModule::SaveToBinary
SaveToBinary: nbg size = 15552
SaveToBinary: input size = 1
SaveToBinary: output size = 1
VsiNpuModule : SerializeTensorSpec
VsiNpuModule : SerializeTensorSpec2
VsiNpuModule : SerializeTensorSpec
VsiNpuModule : SerializeTensorSpec2
VsiNpuModule::SaveToBinary2
/tmp/tmpamfs6yew/model.so
model.so
{'data': <tvm.nd.NDArray shape=(1, 56, 56, 32), cpu(0)>
array([[[[1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         ...,
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1]],

        [[1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         ...,
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1]],

        [[1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         ...,
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1]],

        ...,

        [[1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         ...,
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1]],

        [[1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         ...,
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1]],

        [[1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         ...,
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1],
         [1, 1, 1, ..., 1, 1, 1]]]], dtype=int8)}
ref_out [[[[-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   ...
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]]

  [[-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   ...
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]]

  [[-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   ...
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]]

  ...

  [[-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   ...
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]]

  [[-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   ...
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]]

  [[-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   ...
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]]]]
vsi_out [[[[0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   ...
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]]

  [[0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   ...
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]]

  [[0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   ...
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]]

  ...

  [[0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   ...
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]]

  [[0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   ...
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]]

  [[0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   ...
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]
   [0 0 0 ... 0 0 0]]]]

Expected output: 
[[[[-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   ...
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]]

  [[-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   ...
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]]

  [[-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   ...
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]]

  ...

  [[-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   ...
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]]

  [[-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   ...
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]]

  [[-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   ...
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]
   [-128 -128 -128 ...  -67  -65  -64]]]]
Actual output: 

Not equal to tolerance rtol=0.001, atol=0.001

Mismatched elements: 200704 / 200704 (100%)
Max absolute difference: 127
Max relative difference: inf
 x: array([[[[-128, -128, -128, ...,  -67,  -65,  -64],
         [-128, -128, -128, ...,  -67,  -65,  -64],
         [-128, -128, -128, ...,  -67,  -65,  -64],...
 y: array([[[[0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0],...
FAIL

The output message after executing TVM test_operations.py at VIM3 Pro side:

python3 -m tvm.exec.rpc_server --host 0.0.0.0 --port=9090
INFO:root:If you are running ROCM/Metal, fork will cause compiler internal error. Try to launch with arg ```--no-fork```
INFO:RPCServer:bind to 0.0.0.0:9090
INFO:RPCServer:connection from ('XXX.XXX.XXX.XXX', 53076)
VsiNpuModule::LoadFromBinary
LoadFromBinary: nbg size = 15552
LoadFromBinary: input size = 1
LoadFromBinary: output size = 1
VsiNpuModule : DeSerializeTensorSpec
VsiNpuModule : DeSerializeTensorSpec2
VsiNpuModule : DeSerializeTensorSpec
VsiNpuModule : DeSerializeTensorSpec2
INFO:RPCServer:load_module /tmp/tmpa5luf_rw/model.so
VsiNpuModule::GetFunction: _lookup_linked_param
VsiNpuModule::GetFunction: return early
VsiNpuModule::GetFunction: _lookup_linked_param
VsiNpuModule::GetFunction: return early
VsiNpuModule::GetFunction: _lookup_linked_param
VsiNpuModule::GetFunction: return early
VsiNpuModule::GetFunction: _lookup_linked_param
VsiNpuModule::GetFunction: return early
VsiNpuModule::GetFunction: tvmgen_default_vsi_npu_0
[     1] PLS isn't existed
Process Graph: 2 ms or 2363 us
VsiNpuModule::GetFunction: size: 2
INFO:RPCServer:Finish serving ('XXX.XXX.XXX.XXX', 53076)

Test Functions Passed in test_operations.py

test_qnn_add()
test_float_add()
test_float_relu()
test_uint8_relu()
test_float_leaky_relu()
test_uint8_leaky_relu()
test_float_softmax()
test_float_reshape()
test_float_tranpose()
test_float_relu6()
test_uint8_relu6()
test_dequantize()
test_quantize()
test_uint8_avg_pool()
test_uint8_softmax()
test_uint8_reshape()
test_uint8_concatenation()
test_uint8_max_pool()
test_float_mean()?
test_uint8_argmax()
test_float_sigmoid()
test_uint8_sigmoid()
test_uint8_fullconnected()
test_uint8_argmin()
test_uint8_squeeze()
test_uint8_depthtospace()
test_qnn_sub()
test_qnn_multiply()
test_qnn_maximum()
test_qnn_minimum()
test_qnn_logical_and()
test_qnn_logical_or()
test_qnn_pad()
test_uint8_mean()
test_requantize()
test_uint8_transpose_conv2d_pattern()
test_uint8_transpose_conv2d_pattern2()
test_uint8_tanh()

Test Functions Failed in test_operations.py

test_float32_conv2d_permute() 
#vsi_out array elements value are all 0. ref_out!=vsi_out Mismatched elements: 100%
test_float32_depthwise_conv2d_permute() 
#vsi_out array elements value are all 0. ref_out!=vsi_out Mismatched elements: 100%
test_sample_model() 
#vsi_out array elements value are all 0. ref_out!=vsi_out Mismatched elements: 100%
test_float_avg_pool() 
#vsi_out array elements value are all 0. ref_out!=vsi_out Mismatched elements: 100%
test_float32_pattern() 
#ref_out!=vsi_out Mismatched elements: 100%
test_uint8_depthwiseconv2d_pattern() 
#ref_out!=vsi_out Mismatched elements: 515 / 864 (59.6%)
test_uint8_conv2d_pattern() 
#vsi_out array elements value are all 0. ref_out!=vsi_out Mismatched elements: 100%
test_uint8_resizeBilinear() 
#AttributeError: module 'tvm.relay.op.image' has no attribute 'resize'
#Because relay.op.image.resize was removed in the version
test_float_batch_norm() 
#std: :bad_alloc
test_uint8_resizeNear() 
#AttributeError: module 'tvm.relay.op.image' has no attribute 'resize'
#Because relay.op.image.resize was removed in the version

If you need more debug messages, please let me know.
Thanks.

Usage of vxFlushHandle and vxMapTensorPatch

Hello,

I tried to make tensor object with vxCreateTensorFromHandle2() to use host level memory block,
And I recently notice that not calling vxFlushHandle(), the data change is not correctly applied.
And I also see that vxMapTensorPatch() need to be called to access tensor data...
https://www.khronos.org/registry/OpenVX/specs/1.3/html/OpenVX_Specification_1_3.html#_vxmaptensorpatch
Ultimately, I want to create a handle at the host level and apply it every time I change it, and I wonder if it's okay to use only the vxFlushHandle.
Thank you!

Segmentation fault

Hi, I have A311D board with ubuntu on board (from khadas). My end goal is to compile tflite with vx-delegate support. What is the best way to do this?

Also Im trying to compile TIM-VX (libtim-vx.so) it seems Bazel building is broken, so I try CMake. CMake works fine: compiles and links all targets, but when I try to run unit tests (under gdb), I get segfault from *** _LoadStates() from libGAL.so what causes this?

Strange error on step of model validation

I have model, for example:

model = tf.keras.Sequential()
model.add(tf.keras.layers.Conv1D(256, 8, padding='same'))
model.add(tf.keras.layers.ReLU())
model.add(tf.keras.layers.Conv1D(256, 8, padding='same'))
model.add(tf.keras.layers.ReLU())
model.add(tf.keras.layers.Conv1D(256, 8, padding='same'))
model.add(tf.keras.layers.ReLU())
model.add(tf.keras.layers.Conv1D(256, 8, padding='same'))
model.add(tf.keras.layers.ReLU())
model.add(tf.keras.layers.Conv1D(256, 8, padding='same'))
model.add(tf.keras.layers.ReLU())
model.add(tf.keras.layers.Conv1D(256, 8, padding='same'))
model.add(tf.keras.layers.ReLU())
model.add(tf.keras.layers.Conv1D(64, 1))
model.add(tf.keras.layers.ReLU())
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(512))
model.add(tf.keras.layers.ReLU())
model.add(tf.keras.layers.Dense(512))
model.add(tf.keras.layers.ReLU())
model.add(tf.keras.layers.Dense(1))

input_shape = (1, 256, 40)
x = tf.random.normal(input_shape)
y = model(x)

But on device with A311D it's return errors, like that:
input_scale[0.000001000000] * weight_scale[0.000381198333] != bias_scale[0.000001000000]

Without dense layer on the end it's work good.

Can you help me solve this error please?

Code for quantization:

def representative_dataset():
    for _ in range(100):
        data = np.random.rand(1, 256, 40)
        yield [data.astype(np.float32)]

COOL_TFLITE_PATH = 'new_model/cool_tn.tflite'
converter = tf.lite.TFLiteConverter.from_saved_model(NEW_MODEL_PATH)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.experimental_new_converter = True
converter._experimental_disable_per_channel = True
converter.target_spec.supported_types = [tf.int8]
converter.inference_input_type = tf.int8 
converter.inference_output_type = tf.int8 
quantized_tflite_model = converter.convert()
with open(COOL_TFLITE_PATH, 'wb') as fout:
    fout.write(quantized_tflite_model)

is Conv2D with datalayout CWHN available?

Hello,
I tried to use Conv2D with data Layout CWHN using datalayout option,
and weights with WHIcOc Option,
And I wonder it is possible to run because the log shows me sth went wrong,, here's my source code,

auto conv = graph->CreateOperation<tim::vx::ops::Conv2d>(
  wTensor->shape[3], // weights
  pad, // padding
  std::array<uint32_t, 2>({wTensor->shape[0], wTensor->shape[1]}), // ksize
  std::array<uint32_t, 2>({layer->stride[0], layer->stride[1]}), // stride
  std::array<uint32_t, 2>({1, 1}), // dialation
  0, // multiplier 
  tim::vx::DataLayout::CWHN // inputLayout
);

and the log,

D [setup_node:441]Setup node id[0] uid[1] op[CONV2D]
D [validate_op_io_types:165]Validate [CONV2D]
D [print_tensor:147]in(0) : id[   0] vtl[0] const[0] shape[ 3, 192, 128, 1   ] fmt[u8 ] qnt[ASM zp=  0, scale=1.000000]
D [print_tensor:147]in(1) : id[  16] vtl[0] const[1] shape[ 3, 3, 3, 64      ] fmt[u8 ] qnt[ASM zp=  0, scale=1.000000]
D [print_tensor:147]in(2) : id[  17] vtl[0] const[1] shape[ 64               ] fmt[i32] qnt[ASM zp=  0, scale=1.000000]
D [print_tensor:147]out(0): id[   1] vtl[1] const[0] shape[ 3, 192, 64, 1    ] fmt[u8 ] qnt[ASM zp=  0, scale=1.000000]

expected output shape is, [64, 192, 128, 1],

additionally I wrote source code like below,

auto conv = graph->CreateOperation<tim::vx::ops::Conv2d>(
  wTensor->shape[3], // weights
  pad, // padding
  std::array<uint32_t, 2>({wTensor->shape[0], wTensor->shape[1]}), // ksize
  std::array<uint32_t, 2>({layer->stride[0], layer->stride[1]}), // stride
  std::array<uint32_t, 2>({1, 1}), // dialation
  0, // multiplier 
  tim::vx::DataLayout::CWHN, // inputLayout
  tim::vx::DataLayout::OcIcWH
);

it fails to compile graph and system goes panic.

Thank you!

Is compilation method of tim-vx based on tvm?

Hello,

As I know, after calling graph->compile(),
tim-vx invoke vxVerifyGraph() and internally calling compilation scheme,
And I wonder does the compilation method is based on tvm.

Thank you!!

TVM RPC test failed with message: "PLS isn't existed"

I don't know if it's the right place to ask questions about your TVM fork, but I cannot raise issues in that repo.

I followed the guide from README.VSI.md to build TVM (on host, using x86_64_linux simulation drivers provided here) and TVM runtime (on target, using vendor-provided VIP NPU drivers), and ran the tests in test_vsi_npu, but I got these results:

logs from TVM C++ RPC tool:

VsiNpuModule::LoadFromBinary
LoadFromBinary: nbg size = 593344
LoadFromBinary: input size = 1
LoadFromBinary: output size = 1
VsiNpuModule : DeSerializeTensorSpec
VsiNpuModule : DeSerializeTensorSpec2
VsiNpuModule : DeSerializeTensorSpec
VsiNpuModule : DeSerializeTensorSpec2
[22:31:35] /home/nullko/Documents/tvm-vsi_npu/apps/cpp_rpc/rpc_env.cc:130: Load module from /home/ubuntu/workspace/rpc/model.so ...
VsiNpuModule::GetFunction: _lookup_linked_param
VsiNpuModule::GetFunction: return early
VsiNpuModule::GetFunction: _lookup_linked_param
VsiNpuModule::GetFunction: return early
VsiNpuModule::GetFunction: _lookup_linked_param
VsiNpuModule::GetFunction: return early
VsiNpuModule::GetFunction: _lookup_linked_param
VsiNpuModule::GetFunction: return early
VsiNpuModule::GetFunction: tvmgen_default_vsi_npu_0
[     1] PLS isn't existed
E [compute_node:379]Create node[0] NBG fail
Process Graph: 0 ms or 0 us

It seemed that TVM is able to compile the NBG on the host, but the target runtime cannot execute it.
I wonder what caused the "PLS isn't existed" issue, is it because I didn't set some environment variables on the target platform?

Or maybe your tvm fork is still under development and is not ready to be used yet?

如何查看VIM3Pro 版子在运行模型的时候 NPU资源的使用情况呢，类似 nvidia-smi 这种。

如题。
谢谢

enable NN/TP use FP32 IO

NNRT已经支持NN/TP 支持FP32 IO. 可以用BF16 计算。

想透過 TIM-VX推論 Conv F32 op ，開啟 profiling 看到的 target 是在 SHD上，
並沒有發生 FP32 IO所說的優化行為。想確認
TIM-VX如何可以使用 NNE來推導 fp32 的conv。

cannot run lenet example

I'm trying to run sample/lenet in release1.1.34 package, getting errors as below:
`trtuser@c3b840ead8a7:/workspace/TIM-VX-1.1.34.fix/build/samples/lenet$ ./lenet
(190:0) : error : Error(0,190) : Cannot find the header file cl_viv_vx_ext.h.
(18:0) : error : undefined identifier: 'COPY'
(46:0) : error : undefined identifier: 'COPY'
(196:0) : error : syntax error at 'VXC_512Bits'

ERROR: Failed to compile vx shader. (error: FFFFFFFF)
Compile graph fail.
`

Is this something wrong with my environment?
I'm building/running this sample is a docker environment with gcc7.5:
trtuser@c3b840ead8a7:/workspace/TIM-VX-1.1.34.fix/build/samples/lenet$ uname -a
Linux c3b840ead8a7 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
trtuser@c3b840ead8a7:/workspace/TIM-VX-1.1.34.fix/build/samples/lenet$ gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Add a ops.h header to include all ops header

Could the timvx add a head file like ops.h to include all ops header?
It's very convenient for exporting timvx cases using other tools.

Thanks.

is tim-vx available to run graph with virtual tensors as i/o tensor?

Hello,

I am willing to modify tim-vx as the entire graph can be divided into subgraphs so that i can run the entire neural network seperated. The implementation of tim-vx shows that the input output tensor of the graph has a handle(memory allocated). According to my understanding of openVX, it seems that graph can have virtual tensor as input output, and I wonder if this is possible at the vsi_ level.

Thank you!

libOpenVX.so

When I make -j4 using tim-vx, I have encounter the following issue

Tengine/3rdparty/tim-vx/lib/x86_64/libOpenVX.so: file format not recognized; treating as linker script

Any idea how to solve it? Many thanks in advance for your answer.

insmod: ERROR: could not insert module galcore.ko: Invalid module format

A311D,Fenix 0.8.2 Ubuntu 18.04.3 LTS Linux 5.5.0-rc2
sudo insmod galcore.ko
insmod: ERROR: could not insert module galcore.ko: Invalid module format

Could not access tensor attribute such as scale, zp by tensor

In demo file: samples/lenet/lenet_asymu8.cc add line 315 to access scale of tensor [fc4_output]：
std::cout<< "fc4_output quantize:" << fc4_output->GetQuantization().Scales()[0] << std::endl;

Get error as follow when compile:

/wksp/github/TIM-VX/samples/lenet/lenet_asymu8.cc:315:78: error: passing ‘const tim::vx::Quantization’ as ‘this’ argument discards qualifiers [-fpermissive]
315 | std::cout<< "fc4_output quantize:" << fc4_output->GetQuantization().Scales()[0] << std::endl;
| ^
In file included from /wksp/github/TIM-VX/include/tim/vx/operation.h:28,
from /wksp/github/TIM-VX/samples/lenet/lenet_asymu8.cc:33:
/wksp/github/TIM-VX/include/tim/vx/tensor.h:63:23: note: in call to ‘std::vector& tim::vx::Quantization::Scales()’
63 | std::vector& Scales() { return this->scales_; }
| ^~~~~~
make[2]: *** [samples/lenet/CMakeFiles/lenet.dir/build.make:63: samples/lenet/CMakeFiles/lenet.dir/lenet_asymu8.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:289: samples/lenet/CMakeFiles/lenet.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

Minimal tflite with vx-delegate example

Hi, could you provide minimal example, for tflite inference with vx-delegate applied?
Im not sure how to create and apply vx-delegate to tflite model. Thanks!

Add interface to create handle tensor using user-allocated buffer

Right now the Graph::CreateTensor method can only copy the buffer over instead of using the buffer as the underlying I/O memory of the tensor.

However, the ovxlib did provide the API to do so, but which is not utilized by the wrapper class:

TIM-VX/src/tim/vx/tensor.cc

Lines 184 to 188 in 633075f

 if ((spec_.attr_ & TensorAttribute::INPUT) || 

 (spec_.attr_ & TensorAttribute::OUTPUT)) { 

 id_ = vsi_nn_AddTensorFromHandle(graph_->graph(), VSI_NN_TENSOR_ID_AUTO, 

 &attr, nullptr); 

 } else {

where a null pointer is passed.

Is it possible to pass the buffer pointer to the underlying interface to avoid unnecessary copy?

Handle tensor double free error on x86_64 simulator driver

I try to use the vsi_nn_AddTensorFromHandle to create a tensor using buffer allocated by cv::Mat, every time the program exits, there will be a double free error issued by OpenCV. It seems that the passed buffer is freed by the OpenVX driver (OpenVX driver is not supposed to free the buffer since it's a handle ) when the context is deinitialized, so when the OpenCV try to free the same buffer, it causes a double free error.

I only encountered this problem with the x86_64 simulator driver, when the program runs on the target device using vendor-provided NPU driver, everything works well.

Here is a short program to reproduce this error:

#include <iostream>
#include <opencv2/core.hpp>
#include <vsi_nn_pub.h>

int main(int argc, char* argv[]) {
    vsi_status err = VSI_SUCCESS;

    auto matIn = cv::Mat(4, 4, CV_32F);
    auto matOut = cv::Mat(4, 4, CV_32F);

    cv::randu(matIn, -1.0F, 1.0F);
    matOut.setTo(0.0F);

    auto context = vsi_nn_CreateContext();
    auto graph = vsi_nn_CreateGraph(context, 2, 1);
    vsi_nn_SetGraphInputs(graph, nullptr, 1);
    vsi_nn_SetGraphOutputs(graph, nullptr, 1);

    vsi_nn_tensor_attr_t attr = {};
    attr.dtype.fmt = VSI_NN_DIM_FMT_NCHW;
    attr.dim_num = 4;
    attr.size[0] = 4;
    attr.size[1] = 4;
    attr.size[2] = 1;
    attr.size[3] = 1;
    attr.dtype.vx_type = VSI_NN_TYPE_FLOAT32;
    attr.dtype.qnt_type = VSI_NN_QNT_TYPE_NONE;
    attr.is_const = 0;
    attr.vtl = 0;

    auto tensorIn = vsi_nn_AddTensorFromHandle(
        graph, VSI_NN_TENSOR_ID_AUTO, &attr, matIn.data);
    auto tensorOut = vsi_nn_AddTensorFromHandle(
        graph, VSI_NN_TENSOR_ID_AUTO, &attr, matOut.data);

    auto nodeReLU = vsi_nn_AddNode(graph, VSI_NN_OP_RELU, 1, 1, nullptr);
    nodeReLU->uid = 100;
    nodeReLU->input.tensors[0] = tensorIn;
    nodeReLU->output.tensors[0] = tensorOut;

    graph->input.tensors[0] = tensorIn;
    graph->output.tensors[0] = tensorOut;

    err = vsi_nn_SetupGraph(graph, vx_false_e);
    err = vsi_nn_VerifyGraph(graph);
    err = vsi_nn_rnn_RunGraph(graph);

    std::cout << "[Input]\n" << matIn << std::endl;
    std::cout << "[Output]\n" << matOut << std::endl;

    vsi_nn_ReleaseGraph(&graph);
    vsi_nn_ReleaseContext(&context);

    return err;
}

callstack:

raise (raise:49)
abort (abort:60)
__libc_message (__libc_message:173)
malloc_printerr (malloc_printerr:0)
_int_free (_int_free:455)
__libc_free (__libc_free:28)
cv::StdMatAllocator::deallocate(cv::UMatData*) const (cv::StdMatAllocator::deallocate(cv::UMatData*) const:17)
cv::Mat::~Mat() (cv::Mat::~Mat():26)
main (/home/nullko/Documents/tim-vx/samples/ncc/test_handle_tensor.cpp:54)
__libc_start_main (__libc_start_main:53)
_start (_start:13)

Does this software only support SOC for VeriSilicon's IP ?

Does this software only support SOC for VeriSilicon's IP ?
Or can it run as long as it supports the openvx API, for example TI TDA4 soc

vsi_npu tvm compilation issue

I am trying to cross-compile a pytorch model for imx8 with vsi_npu with below code but I am getting the errors noted below, I have also attached a working code for normal linux targets

compilation code:

mod, params = relay.frontend.from_pytorch(scripted_model, [(input_name, input_shape)])
target_string = "llvm  -mtriple=aarch64-linux-gnu"              # imx8 host-triple
kwargs = {"cc": "aarch64-linux-gnu-gcc", 'fcompile': False}
disabled_passes = ["AlterOpLayout"]                                     # same error with None
with tvm.transform.PassContext(opt_level=3, disabled_pass=disabled_passes):
        mod = vsi_npu.partition_for_vsi_npu(mod, params)           # runs fine
        lib = relay.build(mod, target_string, params=params)        # error #
lib.export_library(build_dir / 'deploy.so',  **kwargs)

error at compile step: (relay taken from quantized pytorch trace)

This is important----> name_node.value() == tvmgen_default_vsi_npu_593
GraphMakerImpl::Create
TensorMakerImpl::InferCall: qnn.quantize
E [GetMapedTensor:140]Tensor has not beed inserted in tensor map.
python: /code/tim-vx-lib/TIM-VX/src/tim/transform/layout_inference.cc:141: std::shared_ptr<tim::vx::Tensor> tim::transform::layout_inference_impl::LayoutInferContext::GetMapedTensor(const std::shared_ptr<tim::vx::Tensor>&) const: Assertion `false' failed.
Aborted (core dumped)

error at compile step: (relay taken from simple pytorch trace)

This is important----> name_node.value() == tvmgen_default_vsi_npu_57
GraphMakerImpl::Create
TensorMakerImpl::InferCall: nn.conv2d
TensorMakerImpl::InferCall: add
TensorMakerImpl::InferCall: image.resize2d
TensorMakerImpl::InferCall: add
TensorMakerImpl::InferCall: image.resize2d
TensorMakerImpl::InferCall: add
TensorMakerImpl::InferCall: image.resize2d
W [vsi_nn_SortGraphNode:1378]Unprocessed node 7
W [vsi_nn_SetupGraph:706]Sort graph nodes failure.
Fatal error: compile to binary failed

Working Code without vsi_npu: (quantized/normal/tuned/notune all cases)

mod, params = relay.frontend.from_pytorch(scripted_model, [(input_name, input_shape)])
target_string = "llvm -mtriple=x86_64-linux-gnu"              # linux host-triple
with tvm.transform.PassContext(opt_level=3, disabled_pass=None):
        lib = relay.build(mod, target_string, params=params)      # no error #
lib.export_library(build_dir / 'deploy.so')

I have installed using cmake instructions from https://github.com/VeriSilicon/tvm/blob/vsi_npu/README.VSI.md
Please guide me on the error

Elemwise not work well

Hello, I wrote a basic program to verify npu is working, however it seems not work.
Could you notice me what i missed with it? Thank you!
I am working on khadas vim3

int main(int argc, char** argv) {
  (void)argc, (void)argv;
  auto context = tim::vx::Context::Create();
  auto graph = context->CreateGraph();

  tim::vx::ShapeType input_shape({5});
  tim::vx::TensorSpec input_spec(tim::vx::DataType::FLOAT32, input_shape,
                                 tim::vx::TensorAttribute::INPUT);
  auto input = graph->CreateTensor(input_spec);

  tim::vx::ShapeType output_shape({5});
  tim::vx::TensorSpec output_spec(tim::vx::DataType::FLOAT32, output_shape,
                                  tim::vx::TensorAttribute::OUTPUT);
  auto output = graph->CreateTensor(output_spec);

  float temp_data[] = {1.0, 2.0, 3.0, 4.0, 5.0};

  tim::vx::ShapeType conv1_bias_shape({5});
  tim::vx::TensorSpec conv1_bias_spec(tim::vx::DataType::FLOAT32,
                                      conv1_bias_shape,
                                      tim::vx::TensorAttribute::CONSTANT);
  auto conv1_bias = graph->CreateTensor(conv1_bias_spec, temp_data);

  auto mul = graph->CreateOperation<tim::vx::ops::Multiply>();
  (*mul).BindInputs({input, conv1_bias}).BindOutputs({output});

  float* temp_input = (float*)malloc(sizeof(float) * 5);
  float* temp_output = (float*)malloc(sizeof(float) * 5);

  for (int i = 0; i < 5; i++) {
    temp_input[i] = 2;
  }

  for (int i = 0; i < 5; i++) {
    temp_output[i] = 0;
  }

  if (!input->CopyDataToTensor(temp_input, sizeof(float) * 5)) {
    std::cout << "copy to tensor failed" << std::endl;
  }
  if (!graph->Run()) {
    std::cout << "Failed to run" << std::endl;
  }

  if (!output->CopyDataFromTensor(temp_output)) {
    std::cout << "Copy from tensor failed" << std::endl;
  }

  printf("after output.\n");
  for (int i = 0; i < 5; i++) printf("%f,", temp_output[i]);

  return 0;
}

I expect 2, 4, 5, 8, 10 but it returns all 0 :(

is bitwise_xor op supported?

i want to know : is bitwise_xor op be supported ?

Programmablility of PPU in NPU of a311d(vim3)

Hello,
firstly, I'm always grateful for your kind answers.

As I know, vim3's npu contains PPU that programmable with openCL base, however I failed to find any usability of PPU in source code, and I wonder the PPU can be handled by openCL or other method.

Thank you!

Some queries about running tflite

What is the format of "TIM-VX/main/samples/lenet/lenet_asymu8_weights.h" this data, can it be considered as tflite file.
What is the best way to run tflite on verisilicon board, should we convert tflite to .nbg file and then use tim::lite to run that .nbg file.If yes, how to convert tflite to .nbg file?
Or using tflite library should we extract the weigths, and type of network ourseleves and using that information, should we build and run tflite using tim::vx.

Does TIM-VX utilizes NPU of vim3(A311D)?

Hello,
I'm going to use TIM-VX to use the VIM3 development board.
I have some questions because it's my first time dealing with applications related to ovxlib and npu.
I'm going to use TIM-VX for research purposes. The goal of the study is utilizing NPU effectively.
I wonder that,

if the NPU of VIM3 can be used through TIM-VX,
if it can be used, do all layers defined at TIM-VX can be run on NPU,
Is there any method to check to know that I am utilizing NPU when run time,
and how much cpu usage of tim-vx is.
TIM-VX seems like a very interesting subject.
Thank you for reading it!!

Tensors connection between graphs

Hello!

I wonder I can connect two graphs and share Tensors between.

for example,
Tensor1 --> Graph1 --> Tensor2 --> Graph2 --> Tensor3

In this case, Tensor1 would be INPUT type, and Tensor3 would be OUTPUT type,
As I know, when running graph1 solely, Tensor2 would be OUTPUT type, and need to call CopyDatafromTensor(),
and call CopyDatatoTensor() to run graph2..

And I want to share Tensor2 without copying overhead.
Could I get some advice of this situation.

Additionally I wonder could i utilize TRANSIENT type and wonder what it is for.
Thank you for your favorable answers.

deconv group parameter

Could TIMVX update the "deconv" ops with a "group" parameter so that it could match the Tengine deconv configuration?

Looking forward to your answer.

Best

android for arm 32 bit

Do you have any plans to support Android (arm 32 bit)

Is SpatialTransformer on rv1126 supported？

I try to run SpatialTransformer(mxnet) operator on rv1126 using Tengine, but the Status of "SpatialTransformer" operator in tim-vx is "InternalOnly", and no TIM-VX API implementation.So i can't find a way to add SpatialTransformer npu support to Tengine, i wondered that：

Is SpatialTransformer(mxnet) operator on rv1126 npu supported?
If it is supported, how can i add it to Tengine?

Is CWHN(NCHW) data layout natively supported by the hardware?

Hi, I've been working on my TVM BOYC(bring your own codegen) backend to adapt the TIM-VX (I found your TVM vsi_npu backend implementation failed to meet my needs, since the host-compiled NBG is not working on my target platform so I implement my backend that uses the JIT compilation on the target).

I wonder if Ops with CWHN layout can be directly offloaded to the NPU/GPU, or there will be layout conversions to WHCN happening under the hood? The document says Conv2d supports CWHN input layout, does that mean the NPU can process the data in CWHN without any memory reordering?

I noticed that there are layout inference utilities in TIM-VX, it did transform some Ops like Pool2d from CWHN to WHCN. I'd like to know if this transform is needed for all Ops with CWHN layout？If that's the case, I can implement the runtime adapter to support WHCN layout only, and transform the data layout in the codegen with TVM's ConvertLayout pass.

tf.transpose segfault

It seems, that if I try to run an INT8 graph that has tf.transpose op in it, I get segfault.

device: A311D
backend: TFLite with vx-delegate

bazel compile Error

i install bazel [lastest-version 4.2.1], and run "bazel build libtim-vx.so"

This is just a doubt

This is regarding a sample you provided in lenet_lite folder,
What is the format of the data that is present in lenet_lite_asymu8_executable.h file.
Is it .tflite file or .nbg type

请问这一套是否能搭建在rockchip的板子上呢？

比如rk1808或者rv1109

Does vim3 NPU support VXU?

Hello timvx developers
I am so grateful that I can easily use openVX and quickly create ai application using tim-vx on vim3.
And I knew that OpenVX support additional utility library called VXU which is available run simple operations such as tensorMultiplication only without graph compilation.
I wonder TIM-VX or vim3 NPU support VXU.

Thank you!

Dequantization implementation for VSI_NN_OP_CAST Node?

Currently, When cast a UINT8 | VSI_NN_QNT_TYPE_AFFINE_ASYMMETRIC tensor to FLOAT32 tensor using VSI_NN_OP_CAST node, it can only do output_value = (float32)input_value instead of output_value = (float32)(input_value - zero_point) * scale.

I noticed that there is a D_U8|Q_ASYM -> D_F32 op check in the /src/tim/vx/internal/src/ops/vsi_nn_op_cast.c source file, so why does this dequantization conversion is not available in the VSI_NN_OP_CAST Node?

Can I create multi Contexts and graphs?

Hello,

I am using TIM-VX for running multi NN, and I am using VIM3 board.
I'm referring to the sample code and I wonder what is the best solution for creating multiple nn in terms of performance?
I think about

creating single Context multi graph,
creating multi graph and context each.

In addition, I want to run NN with layerwisely,
thus I tried creating multi graphs for each layer.
I wonder is there the limit for # of graphs ( and context)..
I want to create graphs like 100 or more because I want to check the hidden activations,
I wonder this possibly done with tim-vx.

Thank you!

Graph verification errors

Retina face quant model, verification failure:

D [compute_node:375]Instance node[108] "RESHAPE" ...
D [compute_node:375]Instance node[109] "CONV2D" ...
D [compute_node:375]Instance node[110] "CONV2D" ...
D [compute_node:375]Instance node[111] "RESHAPE" ...
D [compute_node:375]Instance node[112] "CONV2D" ...
D [compute_node:375]Instance node[114] "RESHAPE" ...
D [compute_node:398]compute node finish with ret 0
E [vsi_nn_SetupGraph:752]after compute_node
E [vsi_nn_SetupGraph:759]now will vsi_nn_TrySetupCompleteSignalNode
E [vsi_nn_SetupGraph:764]now will vsi_nn_setup_binary_graph_inputs_outputs
E [vsi_nn_SetupGraph:777]vsi_nn_SetupGraph return with ret = 0
E [vsi_nn_VerifyGraph:795]vsi_nn_VerifyGraph return with ret = -1
Tengine Fatal: Pre-run subgraph(0) on TIMVX failed.
Tengine: Scheduler(sync) prerun failed.
Pre-run graph failed

root@localhost:/tmp# ldd ./tm_retinaface_timvx
        linux-vdso.so.1 (0xae88d000)
        libtengine-lite.so => /usr/lib/libtengine-lite.so (0xa6a0d000)
        libdl.so.2 => /lib/libdl.so.2 (0xa69fa000)
        libm.so.6 => /lib/libm.so.6 (0xa697b000)
        libCLC.so => /usr/lib/libCLC.so (0xa67c6000)
        libGAL.so => /usr/lib/libGAL.so (0xa6629000)
        libOpenVX.so.1.2 => /usr/lib/libOpenVX.so.1.2 (0xa641b000)
        libOpenVXU.so => /usr/lib/libOpenVXU.so (0xa6405000)
        libVSC.so => /usr/lib/libVSC.so (0xa55bc000)
        libArchModelSw.so => /usr/lib/libArchModelSw.so (0xa557b000)
        libNNArchPerf.so => /usr/lib/libNNArchPerf.so (0xa553a000)
        libstdc++.so.6 => /lib/libstdc++.so.6 (0xa53d1000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xa53a2000)
        libc.so.6 => /lib/libc.so.6 (0xa5260000)
        libpthread.so.0 => /lib/libpthread.so.0 (0xa5237000)
        libgomp.so.1 => /usr/lib/libgomp.so.1 (0xa51ff000)
        /lib/ld-linux-armhf.so.3 (0xa6ef8000)
        librt.so.1 => /lib/librt.so.1 (0xa51e8000)
        libVSC_Lite.so => /usr/lib/libVSC_Lite.so (0xa51c3000)

Any idea?

Is there the way to deal with graph verification errors?

Hello, I am using tim-vx for tensorflow lite delegate and tengine,
And I encountered graph verification fail error.
What can I try to do for verification error? what may I check for this error
Thank you!

How to use NNAPI for NPU?

Hi, I've learned that it's possible to use Tensorflow Lite NNAPI on vim3 Android device,

I have /system/lib/libneuralnetworks.so file in my Android 9 OS.
How to make sure NPU is used? I benchmarked my model and it seems NPU is not used during TFLite 8bit inference, because speed is 10x slower and there is no difference between channel and tensor quantised models.

also in dmesg after I run my bench:

[ 4907.441064] type=1400 audit(1293888104.720:419): avc: denied { read } for pid=7157 
comm="benchmar" path="/data/local/tmp/build/model/model.tflite" 
dev="mmcblk0p20" ino=261748 scontext=u:r:hal_neuralnetworks_default:s0 tcontext=u:object_r:shell_data_file:s0 
tclass=file permissive=1

Doesn't TIMVX support conv group=2?

depth2space init mode?

depth2space 初始化的mode是crd？还是dcr？

Where can I find the binding order of a specified `Op.BindInputs and Op.BindOutputs`?

Hi,

For Example, the BindInputs in Op of conv have three input tensor, {intput, weight, bias}.
And they must have the order of {intput, weight, bias}, and {intput, bias, weight} can not work.

  (*conv1)
      .BindInputs({input, conv1_weight, conv1_bias})
      .BindOutputs({conv1_output});

Where can I find the binding order of each Op supported by TIM-VX?

Best.

Is there a tool to convert pretrained NN model into executable binary？

I noticed that there is an example lenet_lite in which the NN graph is loaded from binary data directly without building it from cpp code descriptions.

So how did the executable NN binary get generated? Is this feature part of OpenVX extensions? Where can I find the tool to convert pretrained NN models (e.g., in ONNX format) into executable binary that can be used by TIM-VX?

can the repo build on target：rockchip 1109|1126|1808 ???

I see the Tengine, but the download-address is unreachable

API for per layer time (computation cost)

Does TIMVX has some exposed APIs to retrieve the computational cost information?

The usage scenario:

We have converted a HRNET and LiteHRNET and run using Tengine+TIMVX.

	HRNET	LiteHRNET
CPU	1600ms	400ms
NPU	33ms	80ms

We want to have a further diagnosis of which modules are actually the most time consuming ones.
Hence, I am writing to inquire the API for retrieving the computational cost (time-wise) if there is one in TIMVX.

Does lenet sample works fine?

Hello,
I am trying to use tim-vx on vim3, and I build lenet example and run on vim3.
This just success to start program but fail on graph->Compile() that returns -1.
Does It works fine on other environment?
Thank you.

StridedSlice Op doesn't support 5-dimension tensors.

I'm trying to use the StridedSlice Op to handle the slice operation in the YOLOv5 post-processing.

It involves the tensor of 5 dimensions, however, it seems that StridedSlice Op can only support up to 4 dimensions.

Here is some example code:

#include <algorithm>
#include <array>
#include <cstdio>
#include <memory>
#include <vector>

#include <tim/vx/context.h>
#include <tim/vx/graph.h>
#include <tim/vx/operation.h>
#include <tim/vx/ops/elementwise.h>
#include <tim/vx/ops/stridedslice.h>

/* Using tensors of 5 dimensions. */
static constexpr std::array<int, 5> BEGIN = {0, 0, 0, 0, 2};
static constexpr std::array<int, 5> END = {1, 3, 10, 10, 4};
static constexpr std::array<int, 5> STRIDES = {1, 1, 1, 1, 1};

static constexpr int MASK_BEGIN = 0b11110;
static constexpr int MASK_END = 0b11110;
static constexpr int MASK_SHRINK = 0b00000;

static constexpr std::array<size_t, 5> SHAPE_INPUT = {1, 3, 10, 10, 85};
static constexpr std::array<size_t, 5> SHAPE_OUTPUT = {1, 3, 10, 10, 2};
static constexpr size_t SLICE_AXIS = 4;

/* Using tensors of 4 dimensions. */
// static constexpr std::array<int, 4> BEGIN = {0, 0, 0, 2};
// static constexpr std::array<int, 4> END = {3, 10, 10, 4};
// static constexpr std::array<int, 4> STRIDES = {1, 1, 1, 1};

// static constexpr int MASK_BEGIN = 0b1110;
// static constexpr int MASK_END = 0b1110;
// static constexpr int MASK_SHRINK = 0b0000;

// static constexpr std::array<size_t, 4> SHAPE_INPUT = {3, 10, 10, 85};
// static constexpr std::array<size_t, 4> SHAPE_OUTPUT = {3, 10, 10, 2};
// static constexpr size_t SLICE_AXIS = 3;

static constexpr size_t LEN_DETECTION_FULL = 85;
static constexpr size_t LEN_DETECTION_SLICED = 2;
static constexpr size_t NUM_ELEMENTS_INPUT = 25500; // 1 * 3 * 10 * 10 * 85
static constexpr size_t NUM_ELEMENTS_OUTPUT = 600;  // 1 * 3 * 10 * 10 * 2
static constexpr size_t NUM_DETECTIONS = 300;       // 1 * 3 * 10 * 10

int main(int argc, char* argv[]) {
    auto context = tim::vx::Context::Create();
    auto graph = context->CreateGraph();

    tim::vx::ShapeType vxShapeInput;
    tim::vx::ShapeType vxShapeOutput;

    std::reverse_copy(SHAPE_INPUT.cbegin(),
                      SHAPE_INPUT.cend(),
                      std::back_inserter(vxShapeInput));
    std::reverse_copy(SHAPE_OUTPUT.cbegin(),
                      SHAPE_OUTPUT.cend(),
                      std::back_inserter(vxShapeOutput));

    // Create TIM-VX tensors.
    auto specInput = tim::vx::TensorSpec(tim::vx::DataType::FLOAT32,
                                         vxShapeInput,
                                         tim::vx::TensorAttribute::INPUT);

    auto specOutput = tim::vx::TensorSpec(tim::vx::DataType::FLOAT32,
                                          vxShapeOutput,
                                          tim::vx::TensorAttribute::OUTPUT);

    auto tensorInput = graph->CreateTensor(specInput);
    auto tensorOutput = graph->CreateTensor(specOutput);

    std::vector<int> begin;
    std::vector<int> end;
    std::vector<int> strides;

    std::reverse_copy(BEGIN.cbegin(), BEGIN.cend(), std::back_inserter(begin));
    std::reverse_copy(END.cbegin(), END.cend(), std::back_inserter(end));
    std::reverse_copy(
        STRIDES.cbegin(), STRIDES.cend(), std::back_inserter(strides));

    // Create StridedSlice Op.
    /* input: [1, 3, 10, 10, 85] -> slice(range=[..., 2:4], stride=1) -> output:
     * [1, 3, 10, 10, 2] */
    auto opStridedSlice = graph->CreateOperation<tim::vx::ops::StridedSlice>(
        begin, end, strides, MASK_BEGIN, MASK_END, MASK_SHRINK);

    opStridedSlice->BindInput(tensorInput);
    opStridedSlice->BindOutput(tensorOutput);

    // Compile graph.
    bool ret = false;
    ret = graph->Compile();
    if (!ret) {
        std::exit(1);
    }

    std::array<float, NUM_ELEMENTS_INPUT> bufferInput;
    std::array<float, NUM_ELEMENTS_OUTPUT> bufferOutput;

    // Prepare input tensor data.
    bufferInput.fill(0.0F);
    for (size_t k = 0; k < NUM_DETECTIONS; k++) {
        float* dataPtr = bufferInput.data() + k * LEN_DETECTION_FULL;
        for (size_t i = BEGIN[SLICE_AXIS]; i < END[SLICE_AXIS]; i++) {
            dataPtr[i] = static_cast<float>(i);
        }
    }

    // Run graph.
    ret = tensorInput->CopyDataToTensor(bufferInput.data());
    ret = graph->Run();
    ret = tensorOutput->CopyDataFromTensor(bufferOutput.data());

    // Print output tensor data.
    for (size_t k = 0; k < NUM_DETECTIONS; k++) {
        const float* dataPtr = bufferOutput.data() + k * LEN_DETECTION_SLICED;
        for (size_t i = 0; i < LEN_DETECTION_SLICED; i++) {
            std::printf("%.1F, ",
                        dataPtr[i]); // Expected to be [begin, end-1] per line.
        }
        std::printf("\n");
    }

    return static_cast<int>(!ret);
}

If I use 4-dimension tensors, it runs as expected, but when I use 5-dimension tensors, it failed to compile the graph and gives me the error:

Failed to initialize Kernel "vivante.nn.tensor.stride_slice" of Node 0x55555564c0e0 (status = -8)

How to profile network on vx-delegate?

It seems with my graph I get following numbers (on A311D):

./benchmark_model --graph=model.tflite --use_vxdelegate=true                  : Inference (avg): **191704**
./benchmark_model --graph=model.tflite --use_vxdelegate=false --num_threads=4 : Inference (avg): **48060.7**

the model is converted with converter.optimizations = [tf.lite.Optimize.DEFAULT] and with representative_dataset. Model size is around 10 Mb.

That's quite a difference, I expected much higher speedup. Is it possible Im doing something wrong? How to profile each op in the graph?

btw ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_vxdelegate=true is Inference (avg): 6424.51

could it be memory limit? my model memory footprint is around 120 mb

	if ((spec_.attr_ & TensorAttribute::INPUT) \|\|
	(spec_.attr_ & TensorAttribute::OUTPUT)) {
	id_ = vsi_nn_AddTensorFromHandle(graph_->graph(), VSI_NN_TENSOR_ID_AUTO,
	&attr, nullptr);
	} else {

verisilicon / tim-vx Goto Github PK

tim-vx's People

Contributors

Stargazers

Watchers

Forkers

tim-vx's Issues

Recommend Projects

Recommend Topics

Recommend Org