arm-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.

License: MIT License

Topics: neon opencl computer-vision arm armv7 armv8 aarch64 machine-learning simd android

computelibrary's Introduction

⚠ Deprecation notice (24.01 announcement): NCHW data format specific optimizations will gradually be removed from the code base in future releases. This means users are expected to translate NCHW models into NHWC in order to benefit from the optimizations.
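For readers planning that migration, here is a minimal sketch of translating an NCHW tensor to NHWC with the library itself (assuming the NEPermute function available in recent releases; shapes are illustrative):

    #include "arm_compute/runtime/NEON/functions/NEPermute.h"
    #include "arm_compute/runtime/Tensor.h"

    using namespace arm_compute;

    Tensor nchw, nhwc;
    // In the library's dimension order an NCHW tensor is stored as (W, H, C, N);
    // the permutation (2, 0, 1) rearranges it to (C, W, H, N), i.e. NHWC.
    nchw.allocator()->init(TensorInfo(TensorShape(224U, 224U, 3U, 1U), 1, DataType::F32));
    nhwc.allocator()->init(TensorInfo(TensorShape(3U, 224U, 224U, 1U), 1, DataType::F32));

    NEPermute permute;
    permute.configure(&nchw, &nhwc, PermutationVector(2U, 0U, 1U));

    nchw.allocator()->allocate();
    nhwc.allocator()->allocate();
    // ... fill nchw with the model's data, then:
    permute.run();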




Compute Library

The Compute Library is a collection of low-level machine learning functions optimized for Arm® Cortex®-A, Arm® Neoverse® and Arm® Mali™ GPU architectures.

The library provides superior performance to other open source alternatives and immediate support for new Arm® technologies, e.g. SVE2.

Key Features:

  • Open source software available under a permissive MIT license
  • Over 100 machine learning functions for CPU and GPU
  • Multiple convolution algorithms (GEMM, Winograd, FFT, direct and indirect GEMM)
  • Support for multiple data types: FP32, FP16, INT8, UINT8, BFLOAT16
  • Micro-architecture optimization for key ML primitives
  • Highly configurable build options enabling lightweight binaries
  • Advanced optimization techniques such as kernel fusion, Fast math enablement and texture utilization
  • Device- and workload-specific tuning using the OpenCL tuner and GEMM-optimized heuristics (see the sketch below)
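For illustration, a minimal sketch of enabling the OpenCL tuner mentioned above (assuming the CLTuner/CLScheduler API of recent releases):

    #include "arm_compute/runtime/CL/CLScheduler.h"
    #include "arm_compute/runtime/CL/CLTuner.h"

    // Hand the tuner to the scheduler before configuring any CL functions so that
    // kernel configurations are tuned for the device and workload at hand.
    arm_compute::CLTuner tuner;
    arm_compute::CLScheduler::get().default_init(&tuner);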

Repository    Link
Release       https://github.com/arm-software/ComputeLibrary
Development   https://review.mlplatform.org/#/admin/projects/ml/ComputeLibrary

Documentation

Note: The documentation includes the reference API, changelogs, build guide, contribution guide, errata, etc.


Pre-built binaries

All the binaries can be downloaded from here or from the tables below.


Platform Operating System Release archive (Download)
Raspberry Pi 4 Linux® 32bit
Raspberry Pi 4 Linux® 64bit
Odroid N2 Linux® 64bit
HiKey960 Linux® 64bit

Architecture Operating System Release archive (Download)
armv7 Linux®
arm64-v8a Android™
arm64-v8a Linux®
arm64-v8.2-a Android™
arm64-v8.2-a Linux®

Please refer to the following link for more pre-built binaries:

Pre-built binaries are generated with the following security and good-coding-practice related flags:

-Wall, -Wextra, -Wformat=2, -Winit-self, -Wstrict-overflow=2, -Wswitch-default, -Woverloaded-virtual, -Wformat-security, -Wctor-dtor-privacy, -Wsign-promo, -Weffc++, -pedantic, -fstack-protector-strong

Supported Architectures/Technologies

  • Arm® CPUs:

    • Arm® Cortex®-A processor family using Arm® Neon™ technology
    • Arm® Neoverse® processor family
    • Arm® Cortex®-R processor family with Armv8-R AArch64 architecture using Arm® Neon™ technology
    • Arm® Cortex®-X1 processor using Arm® Neon™ technology
  • Arm® Mali™ GPUs:

    • Arm® Mali™-G processor family
    • Arm® Mali™-T processor family
  • x86


Supported Systems

  • Android™
  • Bare Metal
  • Linux®
  • OpenBSD®
  • macOS®
  • Tizen™

Resources


Experimental builds

⚠ Important: Bazel and CMake builds are experimental, CPU-only builds; please see the documentation for more details.


How to contribute

Contributions to the Compute Library are more than welcome. If you are interested in contributing, please have a look at our how to contribute guidelines.

Developer Certificate of Origin (DCO)

Before the Compute Library accepts your contribution, you need to certify its origin and give us your permission. To manage this process we use the Developer Certificate of Origin (DCO) V1.1 (https://developercertificate.org/).

To indicate that you agree to the terms of the DCO, you "sign off" your contribution by adding a line with your name and e-mail address to every git commit message:

Signed-off-by: John Doe <[email protected]>

You must use your real name; no pseudonyms or anonymous contributions are accepted.

Public mailing list

For technical discussion, the ComputeLibrary project has a public mailing list: [email protected]. The list is open to anyone, inside or outside of Arm, to self-subscribe. To subscribe, please visit the following website: https://lists.linaro.org/mailman3/lists/acl-dev.lists.linaro.org/


License and Contributions

The software is provided under MIT license. Contributions to this project are accepted under the same license.

Other Projects

This project contains code from other projects as listed below. The original license text is included in those source files.

  • The OpenCL header library is licensed under Apache License, Version 2.0, which is a permissive license compatible with MIT license.

  • The half library is licensed under MIT license.

  • The libnpy library is licensed under MIT license.

  • The stb image library is either licensed under MIT license or is in Public Domain. It is used by this project under the terms of MIT license.


Trademarks and Copyrights

Android is a trademark of Google LLC.

Arm, Cortex, Mali and Neon are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere.

Bazel is a trademark of Google LLC., registered in the U.S. and other countries.

CMake is a trademark of Kitware, Inc., registered in the U.S. and other countries.

Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.

Mac and macOS are trademarks of Apple Inc., registered in the U.S. and other countries.

Tizen is a registered trademark of The Linux Foundation.

Windows® is a trademark of the Microsoft group of companies.


computelibrary's Issues

pure virtual method called

I built the project with the following command on a Raspberry Pi 2 Model B:

scons Werror=1 -j8 debug=1 asserts=1 arch=armv7a os=linux build=native opencl=0 neon=1

Then,

cd build
LD_LIBRARY_PATH=. ./neon_convolution

But I got output like this:

terminate called without an active exception
Aborted

I found that it crashed in conv3x3.run().

gcc/g++ version:

gcc version 4.8.4 (Raspbian 4.8.4-1)

What's wrong?
Thank you.

FC->RELU->FC sequence issue in Deep Learning

Hi,

As you already know, I am trying to implement AlexNet. I am almost at the end, but the sequence FC->RELU->FC is yielding wrong results for the last FC layer. Dimensions are mentioned below.

1st FC layer output, which is the input to ReLU:
LayerInputTensor.allocator()->init(TensorInfo(TensorShape((long unsigned int)4096), 1, DataType::F32));

Relu Layer:
input : Passing the above FCLayer tensor output as input to this layer
output: LayerTensor.allocator()->init(TensorInfo(TensorShape((long unsigned int)1,(long unsigned int)1,(long unsigned int)4096), 1, DataType::F32)); //(for generic approach width,height are placed to 1)

FCLayer:
input :Passing the above Relu Layer tensor output as input to this layer
output: FcTensor.allocator()->init(TensorInfo(TensorShape((long unsigned int)4096), 1, DataType::F32));

With this configuration run sequentially, we are facing functionality issues.
Additional info: after the ReLU configure() is done, the output tensor changed to 4096 * 4096 after padding. Not sure why the padding increased so much.
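A small sketch for inspecting the padding that configure() adds, using the has_padding()/padding() accessors on ITensorInfo (the same API family already used in this report):

    if(LayerTensor.info()->has_padding())
    {
        // PaddingSize holds the extra elements added on each border of the tensor
        const PaddingSize pad = LayerTensor.info()->padding();
        printf("padding left=%u right=%u top=%u bottom=%u\n", pad.left, pad.right, pad.top, pad.bottom);
    }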

Below is the standalone code for your reference:
int main()
{
// The src tensor which will contain the input image
Tensor LayerInputTensor;

// The weights and biases tensors should be initialized with the values inferred with the training
Tensor LayerTensor;
Tensor FcTensor;
Tensor FcLayerWgtTensor;   // added: used below but was never declared
Tensor FcLayerBiasTensor;  // added: used below but was never declared

NEActivationLayer ActLayer;
NEFullyConnectedLayer FcLayer;

int i;
float *input_buffer=(float*)malloc(4096*sizeof(float));
/* [Initialize tensors] */

FILE *fp;    
fp = fopen("relu_inp.bin","rb");
fread(input_buffer,4,4096,fp);
fclose(fp);

LayerInputTensor.allocator()->init(TensorInfo(TensorShape((long unsigned int)4096), 1, DataType::F32));
LayerTensor.allocator()->init(TensorInfo(TensorShape((long unsigned int)1,(long unsigned int)1,(long unsigned int)4096), 1, DataType::F32));


printf("*********Relu input :: size %d start %d\n",LayerInputTensor.info()->total_size(),LayerInputTensor.info()->offset_element_in_bytes(Coordinates(0, 0)));
printf("*********Relu output :: size %d start %d\n",LayerTensor.info()->total_size(),LayerTensor.info()->offset_element_in_bytes(Coordinates(0, 0)));

ActLayer.configure(&LayerInputTensor, &LayerTensor,ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU));

printf("#########Relu input :: size %d start %d\n",LayerInputTensor.info()->total_size(),LayerInputTensor.info()->offset_element_in_bytes(Coordinates(0, 0)));
printf("#########Relu output :: size %d start %d\n",LayerTensor.info()->total_size(),LayerTensor.info()->offset_element_in_bytes(Coordinates(0, 0)));


FcLayerWgtTensor.allocator()->init(TensorInfo(TensorShape((long unsigned int)4096,(long unsigned int)4096), 1, DataType::F32));
FcLayerBiasTensor.allocator()->init(TensorInfo(TensorShape((long unsigned int)4096), 1, DataType::F32));
FcTensor.allocator()->init(TensorInfo(TensorShape((long unsigned int)4096), 1, DataType::F32));

printf("*********FC input :: size %d start %d\n",LayerTensor.info()->total_size(),LayerTensor.info()->offset_element_in_bytes(Coordinates(0, 0)));
printf("*********FC output :: size %d start %d\n",FcTensor.info()->total_size(),FcTensor.info()->offset_element_in_bytes(Coordinates(0, 0)));

FcLayer.configure(&LayerTensor, &FcLayerWgtTensor, &FcLayerBiasTensor, &FcTensor);

printf("#########FC input :: size %d start %d\n",LayerTensor.info()->total_size(),LayerTensor.info()->offset_element_in_bytes(Coordinates(0, 0)));
printf("#########FC output :: size %d start %d\n",FcTensor.info()->total_size(),FcTensor.info()->offset_element_in_bytes(Coordinates(0, 0)));



/* Allocating the memories to Tensors */
LayerInputTensor.allocator()->allocate();
LayerTensor.allocator()->allocate();
FcTensor.allocator()->allocate();
FcLayerWgtTensor.allocator()->allocate();
FcLayerBiasTensor.allocator()->allocate();

std::copy_n((unsigned char*)input_buffer,LayerInputTensor.info()->total_size(),(unsigned char*)LayerInputTensor.buffer());
loadWeights("fc_w");
loadBias("fc_b");
printf("Loaded values\n");

ActLayer.run();


FcLayer.run();

 
fp = fopen("FcLayer.txt","w");
for(i=0;i<FcTensor.info()->total_size();i++)
{
   fprintf(fp,"%f\n",*(reinterpret_cast<float*>(FcTensor.ptr_to_element(Coordinates(i,0)))));
}

fclose(fp);
free(input_buffer);

}

Thanks,
G. Praveen.

Problem with GEMM (NEGEMMLowp function, 8-bit) - very low performance

Hello,
I tried to benchmark NEGEMMLowp (8-bit) for different matrix parameters on a Xiaomi Redmi Pro (10 cores). The performance is very low compared to Google's gemmlowp optimized for Arm NEON. Can anyone provide example code showing how to use the NEGEMMLowp function efficiently? I have provided my code below. Please let me know if there is an issue with the code, the GEMM function, and/or the optimization.

I ran "benchmark_meta_gemm.cc (21.8 ms)" and "benchmark.cc (9.95 ms)" from "https://github.com/google/gemmlowp" for matrix size M=3136, N=192, K=576, whereas it takes 2575.92 ms when NEGEMMLowp is used for the same matrix size.

double time() {
timespec t;
clock_gettime(CLOCK_REALTIME, &t);
return t.tv_sec + 1e-9 * t.tv_nsec;
}

void main_neon_gemm(int argc, const char **argv)
{

Tensor    src1, src2, out1;

constexpr uint16_t M[] = { 49, 128, 160, 196, 196, 196, 3136, 12544, 1000, 1000, 1000, 1000};
constexpr uint16_t N[] = { 256, 508, 832, 160, 204, 256, 192, 64, 1000, 1000, 1000, 1000};
constexpr uint16_t K[] = { 832, 1, 49, 528, 864, 1152, 576, 147, 1, 10, 100, 1000};

int pointer =0;

if (argc > 1) {
  pointer= atoi(argv[1]);
  std::cout << "Argv(int)=" << pointer <<"\n\n";
}
else {
    pointer=2;  
    std:: cout << "Argv(int) Default=" << pointer << "\n\n";
}

const TensorShape src1_shape(M[pointer],K[pointer]);
const TensorShape src2_shape(K[pointer],N[pointer]);
const TensorShape out1_shape(M[pointer],N[pointer]);

src1.allocator()->init(TensorInfo(src1_shape, Format::U8));
src2.allocator()->init(TensorInfo(src2_shape, Format::U8));
out1.allocator()->init(TensorInfo(out1_shape, Format::U8));

// Initialize just the dimensions and format of the temporary and destination images:
NEGEMMLowp gemm;

gemm.configure(&src2, &src1, &out1, 0, 0, 0, 1, 24);

// Now that the padding requirements are known we can allocate the images:
src1.allocator()->allocate();
src2.allocator()->allocate();
out1.allocator()->allocate();

const int avg = 3;
double totalTime = 0.0;
for ( int i=0; i<avg; i++)
{
    double st = time();
    gemm.run();
    double totalTemp = (time() - st);
    totalTime = totalTime + totalTemp;
}

double tT= totalTime*1000;
tT = tT/avg;
std::cout << "Matrix size= " << M[pointer] << " * " << N[pointer] << " * " << K[pointer] << "\t" << tT << "ms\n\n";

}

int main(int argc, const char **argv)
{
return test_helpers::run_example(argc, argv, main_neon_gemm);
}
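One knob worth checking in a benchmark like this (assuming the NEScheduler API of recent releases): the number of worker threads the NEON scheduler uses, since running single-threaded would account for a large part of such a gap.

    #include "arm_compute/runtime/NEON/NEScheduler.h"

    // Pin the scheduler to a chosen number of worker threads before calling run().
    arm_compute::NEScheduler::get().set_num_threads(4);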

Using ACL in Android Studio (cmake/ndk)

I imported libarm_compute.so into my project, but the app can't run.
Please help, thanks a lot.

gradle:
apply plugin: 'com.android.application'

android {
    signingConfigs {
        yuda {
            keyAlias 'yuda'
            keyPassword '000000'
            storeFile file('/home/user/Android/keys/Key.jks')
            storePassword '000000'
        }
    }
    compileSdkVersion 25
    buildToolsVersion "25.0.2"
    defaultConfig {
        applicationId "com.example.user.demo_acl"
        minSdkVersion 22
        targetSdkVersion 25
        versionCode 1
        versionName "1.0"
        testInstrumentationRunner "android.support.test.runner.AndroidJUnitRunner"
        externalNativeBuild {
            cmake {
                cppFlags "-std=c++11 -frtti -fexceptions"
                abiFilters 'arm64-v8a'
            }
        }
        sourceSets {
            main {
                jniLibs.srcDirs = ['libs']
            }
        }
    }
    buildTypes {
        release {
            minifyEnabled false
            proguardFiles getDefaultProguardFile('proguard-android.txt'), 'proguard-rules.pro'
            signingConfig signingConfigs.yuda
        }
    }
    externalNativeBuild {
        cmake {
            path "CMakeLists.txt"
        }
    }
}

dependencies {
    compile fileTree(include: ['*.jar'], dir: 'libs')
    androidTestCompile('com.android.support.test.espresso:espresso-core:2.2.2', {
        exclude group: 'com.android.support', module: 'support-annotations'
    })
    compile 'com.android.support:appcompat-v7:25.1.0'
    testCompile 'junit:junit:4.12'
}

CMakeLists.txt:

cmake_minimum_required(VERSION 3.4.1)

set(CMAKE_VERBOSE_MAKEFILE on)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=gnu++11")

set(pathToProject /home/user/Documents/MyApplication/app)

include_directories(${pathToProject}/src/main/cpp/include)

#compute:
add_library(lib_armcompute SHARED IMPORTED)
set_target_properties(lib_armcompute PROPERTIES IMPORTED_LOCATION ${pathToProject}/libs/${ANDROID_ABI}/libarm_compute.so)

#opencv:
add_library(lib_opencv SHARED IMPORTED )
set_target_properties(lib_opencv PROPERTIES IMPORTED_LOCATION ${pathToProject}/libs/${ANDROID_ABI}/libopencv_java3.so)

#opencl:
add_library(lib_opencl SHARED IMPORTED)
set_target_properties(lib_opencl PROPERTIES IMPORTED_LOCATION ${pathToProject}/libs/${ANDROID_ABI}/libOpenCL.so)

add_library(
native-lib
SHARED
src/main/cpp/native-lib.cpp )
find_library( log-lib log )
target_link_libraries( # Specifies the target library.
native-lib
# Links the target library to the log library
# included in the NDK.
${log-lib}
lib_opencv
# lib_opencl
lib_armcompute
)

log:
04-19 15:22:21.109 14420-14420/? E/AndroidRuntime: FATAL EXCEPTION: main
Process: com.example.user.demo_acl, PID: 14420
java.lang.UnsatisfiedLinkError: dlopen failed: library "../../../../libs/arm64-v8a/libarm_compute.so" not found
at java.lang.Runtime.loadLibrary0(Runtime.java:994)
at java.lang.System.loadLibrary(System.java:1533)
at com.example.user.demo_acl.MainActivity.(MainActivity.java:14)
at java.lang.Class.newInstance(Native Method)
at android.app.Instrumentation.newActivity(Instrumentation.java:1083)
at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2682)
at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2864)
at android.app.ActivityThread.-wrap12(ActivityThread.java)
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1567)
at android.os.Handler.dispatchMessage(Handler.java:105)
at android.os.Looper.loop(Looper.java:156)
at android.app.ActivityThread.main(ActivityThread.java:6524)
at java.lang.reflect.Method.invoke(Native Method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:941)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:831)

Issue with Softmaxlayer

Hi,
We are trying to use the softmax layer with input sizes 8, 10 and 16, but we are unable to get the desired output.
Below is the code for input size 10. Please point out if we are missing anything:

int main()
{
// The src tensor which will contain the input image
Tensor src;
Tensor dup_src;

// The weights and biases tensors should be initialized with the values inferred with the training
Tensor dst;

NESoftmaxLayer    SoftmaxLayer;
float input_buffer[10] = {-5.174491
                          ,1.911436
                          ,-9.226623
                          ,0.341730
                          ,-6.581529
                          ,16.006193
                          ,4.922042
                          ,-5.961459
                          ,2.368166
                          ,1.662867};

/* [Initialize tensors] */


src.allocator()->init(TensorInfo(TensorShape(64), 1, DataType::F32));
dup_src.allocator()->init(TensorInfo(TensorShape(10), 1, DataType::F32));
dst.allocator()->init(TensorInfo(TensorShape(10), 1, DataType::F32));

printf("SoftmaxLayer :: %d %d\n",dup_src.info()->total_size(),dst.info()->total_size());
SoftmaxLayer.configure(&dup_src, &dst);
printf("SoftmaxLayer :: %d %d\n",dup_src.info()->total_size(),dst.info()->total_size());

src.allocator()->allocate();
dup_src.allocator()->allocate();
dst.allocator()->allocate();
memcpy(src.buffer(),input_buffer,40);

if(src.info()->total_size() == dup_src.info()->total_size())
{
  std::copy_n((unsigned char*)src.buffer(), 10*sizeof(float), (unsigned char*)dup_src.buffer());
}
else if(src.info()->total_size() > dup_src.info()->total_size())
{
   printf("Entered the elseif %d %d %d\n",src.info()->offset_element_in_bytes(Coordinates(0, 0)),dup_src.info()->offset_element_in_bytes(Coordinates(0, 0)),10*sizeof(float));
   std::copy_n((unsigned char*)src.buffer(), 10*sizeof(float), (unsigned char*)dup_src.buffer()+dup_src.info()->offset_element_in_bytes(Coordinates(0, 0)));
}
else
{
   printf("WARNING ::: Padding added during the Configure reconsider the copying !!!\n");
   printf("%d %d %d %d \n",src.info()->total_size(),dup_src.info()->total_size(),src.info()->offset_element_in_bytes(Coordinates(0, 0)),dup_src.info()->offset_element_in_bytes(Coordinates(0, 0)));
   std::copy_n((unsigned char*)src.buffer() +  src.info()->offset_element_in_bytes(Coordinates(0, 0)), 10*sizeof(float), (unsigned char*)dup_src.buffer() + dup_src.info()->offset_element_in_bytes(Coordinates(0, 0)));
}

SoftmaxLayer.run();
float *out = (float*)(dst.buffer() + dst.info()->offset_element_in_bytes(Coordinates(0, 0)));
for(int i=0;i<10;i++)
{
   printf("%d : %f\n",i,out[i]);
}

}
Thanks,
G.Praveen.

Build failure on ppc64le

Reusing arch=x86, it builds okay for openSUSE Tumbleweed on ppc64 and s390x, but it fails for ppc64le due to an apparent redefinition of the bool type via altivec.h from include/CL/cl_platform.h.

GCC bug
Khronos bug
Mesa bug

The solution appears to be using -std=gnu++11 instead of -std=c++11 for this architecture. Relates to issue #19.

benchmark numbers

Hello

I am implementing a neural network (similar to AlexNet) on an Arm NEON based platform using the Arm Compute Library.
I wanted to compare the performance numbers with the expected numbers:
do you have any performance benchmark numbers (or execution times) for different layers of a neural network (convolution layer, fully connected, etc.)?

Thanks
kamrao

clconvolution example issues

Hi all,
I tried to run the cl_convolution.cpp example, but I ran into what looks like a bug.
If the CLImage format is U8, the resulting PPM image is fine,
but when the format is RGB888, it returns a wrong result containing stripes.
Has anyone met the same problem? Thanks.

Deep Neural Networks on ComputeLibrary

I found several data structures in the library that are components of deep neural networks. Examples are:

CLConvolutionLayer( )
CLActivationLayer()
CLPoolingLayer()

However, I could not find a way to either combine these layers to form a network, or to load a previously saved network (architecture and/or weights) into memory. Is this supported at the moment? If so, is there any sample code that I can use to get started?

Thanks in advance!
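For reference, since these early releases ship no graph or model loader, networks are assembled by hand: configure each function, allocate the tensors, then call run() in sequence. A minimal sketch with arbitrary shapes:

    #include "arm_compute/runtime/CL/CLScheduler.h"
    #include "arm_compute/runtime/CL/CLTensor.h"
    #include "arm_compute/runtime/CL/functions/CLActivationLayer.h"
    #include "arm_compute/runtime/CL/functions/CLPoolingLayer.h"

    using namespace arm_compute;

    int main()
    {
        CLScheduler::get().default_init();

        CLTensor in, mid, out;
        in.allocator()->init(TensorInfo(TensorShape(32U, 32U, 16U), 1, DataType::F32));
        mid.allocator()->init(TensorInfo(TensorShape(32U, 32U, 16U), 1, DataType::F32));
        out.allocator()->init(TensorInfo(TensorShape(16U, 16U, 16U), 1, DataType::F32));

        // Each function's output tensor is the next function's input: that is the "network".
        CLActivationLayer act;
        CLPoolingLayer pool;
        act.configure(&in, &mid, ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU));
        pool.configure(&mid, &out, PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)));

        in.allocator()->allocate();
        mid.allocator()->allocate();
        out.allocator()->allocate();

        // ... fill `in` with data (and weights tensors, for layers that have them), then:
        act.run();
        pool.run();
        CLScheduler::get().sync();
        return 0;
    }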

Obtaining libOpenCL.so

Hi,
I'm trying to use the library on a Mali GPU. I'm wondering what's the best way to obtain a real implementation of libOpenCL.so that is compatible with the OpenCL header files used in this library? I didn't find libOpenCL.so in /system on the device.

Alexnet implementation using the Compute Library

Hi,

In the optimized version of convolution for Arm NEON, it appears that the kernel size is restricted to 3x3, 5x5, 7x7 and 9x9 for U8 input. How do we go about other sizes such as 11x11 for F32 (needed for AlexNet)? Do you have any suggestions?
Are there any standalone test setups for checking individual deep learning kernels optimized for Arm NEON? If they are already available, could you please point us to the usage? Thanks in advance.

G.Praveen
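For what it's worth, the fixed 3x3/5x5/7x7/9x9 U8 functions are the computer-vision convolutions; the neural-network path is NEConvolutionLayer, which takes the kernel as a weights tensor of arbitrary size. A sketch of AlexNet's first (11x11, stride 4) convolution under that assumption:

    #include "arm_compute/runtime/NEON/functions/NEConvolutionLayer.h"
    #include "arm_compute/runtime/Tensor.h"

    using namespace arm_compute;

    Tensor src, weights, biases, dst;
    src.allocator()->init(TensorInfo(TensorShape(227U, 227U, 3U), 1, DataType::F32));
    weights.allocator()->init(TensorInfo(TensorShape(11U, 11U, 3U, 96U), 1, DataType::F32));
    biases.allocator()->init(TensorInfo(TensorShape(96U), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(55U, 55U, 96U), 1, DataType::F32));

    NEConvolutionLayer conv;
    conv.configure(&src, &weights, &biases, &dst, PadStrideInfo(4, 4, 0, 0)); // stride 4, no padding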

Absence of g++ results in "successful" build

If sconscript catches an exception trying to execute g++, it prints an error but does not Exit(1) as done elsewhere. That leads to scons returning success (0) despite not actually building anything, which is confusing in an unattended package build scenario such as OBS.

try:
    compiler_ver = subprocess.check_output( [env['CXX'] , "-dumpversion"] ).strip()
except OSError:
    print "ERROR: Compiler not found"
    compiler_ver = ""

if compiler_ver != "":
    ...

Group Convolution and batch processing

Hello

I am trying to implement a neural network (similar to AlexNet) using the convolution layer APIs for Arm NEON. Do the APIs in the Compute Library support grouped convolution and batch processing?
If so, can you clarify how to configure grouped convolution and batch processing using the API?
Thank you.

Kamrao

How to reshape a tensor?

Hi, I configured a DNN with ComputeLibrary.
I found that the fully connected layer only accepts a two-dimensional tensor,
so I need to reshape the three- (or four-) dimensional tensors into a two-dimensional tensor.

Is there an existing reshape method in this library?
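Assuming a release that ships NEReshapeLayer (worth checking against your version), flattening is a one-function job; the element counts just have to match:

    #include "arm_compute/runtime/NEON/functions/NEReshapeLayer.h"
    #include "arm_compute/runtime/Tensor.h"

    using namespace arm_compute;

    Tensor conv_out, fc_in;
    conv_out.allocator()->init(TensorInfo(TensorShape(6U, 6U, 256U), 1, DataType::F32));
    fc_in.allocator()->init(TensorInfo(TensorShape(6U * 6U * 256U), 1, DataType::F32));

    NEReshapeLayer reshape;
    reshape.configure(&conv_out, &fc_in);
    // ... allocate both tensors, fill conv_out, then:
    reshape.run();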

NDK-Build Error

When I use ndk-build for JNI programming, I get some errors when trying to build a shared library.

The error message is shown below:
[armeabi-v7a] Compile++ thumb: gesture_opencv <= Interface.cpp
jni/mtcnn/Interface.cpp: In function 'void Func()':
jni/mtcnn/Interface.cpp:103:50: error: expected primary-expression before '(' token
src.allocator()->init(arm_compute::TensorInfo(src_shape, 1, DataType::F32) );
^
jni/mtcnn/Interface.cpp:103:65: error: reference to 'DataType' is ambiguous
src.allocator()->init(arm_compute::TensorInfo(src_shape, 1, DataType::F32) );
^
In file included from jni/include/arm_compute/core/IKernel.h:27:0,
from jni/include/arm_compute/core/CPP/ICPPKernel.h:27,
from jni/include/arm_compute/core/NEON/INEKernel.h:27,
from jni/include/arm_compute/runtime/NEON/INESimpleFunction.h:27,
from jni/include/arm_compute/runtime/NEON/functions/NEAbsoluteDifference.h:27,
from jni/include/arm_compute/runtime/NEON/NEFunctions.h:28,
from jni/mtcnn/Interface.cpp:8:
jni/include/arm_compute/core/Types.h:59:12: note: candidates are: enum class arm_compute::DataType
enum class DataType
^
In file included from jni/include/opencv2/core.hpp:56:0,
from jni/include/opencv2/opencv.hpp:46,
from jni/mtcnn/Interface.cpp:3:
jni/include/opencv2/core/traits.hpp:106:30: note: template<typename _Tp> class cv::DataType
template<typename _Tp> class DataType
^
make: *** [obj/local/armeabi-v7a/objs/gesture_opencv/mtcnn/Interface.o] Error 1

My code is shown below:
void Func()
{

Tensor src;
constexpr unsigned int width_src_image = 32;
constexpr unsigned int height_src_image = 32;
constexpr unsigned int ifm_src_img = 3;
const TensorShape src_shape(width_src_image, height_src_image, ifm_src_img);
src.allocator()->init(arm_compute::TensorInfo(src_shape, 1, DataType::F32) );

}

My Application.mk is shown below:
APP_PLATFORM := android-14
APP_ABI := armeabi-v7a arm64-v8a
APP_STL :=gnustl_static
NDK_TOOLCHAIN_VERSION :=4.9
APP_CPPFLAGS += -std=c++11

What should I do?
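Since the compiler itself reports the clash between arm_compute::DataType and cv::DataType, fully qualifying the enum (and avoiding using-directives that pull in both namespaces) should resolve the ambiguity:

    // Qualify both the TensorInfo and the DataType enum so OpenCV's cv::DataType cannot be picked up
    src.allocator()->init(arm_compute::TensorInfo(src_shape, 1, arm_compute::DataType::F32));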

SCons ERROR: Compiler not found on Jetson TX1 when compiling 32bit ComputeLibrary

My brief installation steps are below:

  1. I installed SCons by downloading the prior stable release (2.5.0) from the download page (SCons Downloads: http://www.scons.org/pages/download.html).
  2. I ran git clone --recursive on this repo and then executed the command below:
ysh329@tegra-ubuntu:~/sdcard/code/ComputeLibrary$ scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=linux arch=armv7a
scons: Reading SConscript files ...
ERROR: Compiler not found
scons: done reading SConscript files.
scons: Building targets ...
scons: building associated VariantDir targets: build
scons: `.' is up to date.
scons: done building targets.

SCons showed "ERROR: Compiler not found"; however, the following steps look normal. Is there anything wrong with SCons, or does it not matter?

AArch64 build failure on gcc 4.8

For Linux, the library was successfully built and tested using the following Linaro GCC toolchain: gcc-linaro-arm-linux-gnueabihf-4.8-2014.02_linux and gcc-linaro-6.1.1-2016.08-x86_64_arm-linux-gnueabihf

While v17.03.1 builds fine on openSUSE Tumbleweed with our gcc 6.3.1 for both armv7a and arm64-v8a, it fails to build on openSUSE Leap 42.{1,2,3} (SLES 12 SP{1,2,3}) with our gcc 4.8.5 for arm64-v8a:

[  150s] g++ -o build/src/core/NEON/kernels/NEConvolutionKernel.o -c -D_GLIBCXX_USE_NANOSLEEP -Wno-deprecated-declarations -Wall -DARCH_ARM -Wextra -Wno-unused-parameter -pedantic -Wdisabled-optimization -Wformat=2 -Winit-self -Wmissing-include-dirs -Wstrict-overflow=2 -Wswitch-default -fpermissive -std=c++11 -Wno-vla -Woverloaded-virtual -Wctor-dtor-privacy -Wsign-promo -Weffc++ -Wno-format-nonliteral -Wno-overlength-strings -Wno-strict-overflow -Wlogical-op -Wnoexcept -Wstrict-null-sentinel -march=armv8-a -O3 -ftree-vectorize -Ibuild -I. -Iinclude src/core/NEON/kernels/NEConvolutionKernel.cpp
[  156s] In file included from src/core/NEON/kernels/NEConvolutionKernel.cpp:37:0:
[  156s] /usr/lib64/gcc/aarch64-suse-linux/4.8/include/arm_neon.h: In member function 'void arm_compute::NESeparableConvolutionHorKernel<matrix_size>::convolve(const arm_compute::Window&) [with OutputType = int; unsigned int matrix_size = 5u]':
[  156s] /usr/lib64/gcc/aarch64-suse-linux/4.8/include/arm_neon.h:12175:32: error: 'asm' operand requires impossible reload
[  156s]             : /* No clobbers */);
[  156s]                                 ^
[  156s] /usr/lib64/gcc/aarch64-suse-linux/4.8/include/arm_neon.h:9852:32: error: 'asm' operand requires impossible reload
[  156s]             : /* No clobbers */);
[  156s]                                 ^
[  156s] /usr/lib64/gcc/aarch64-suse-linux/4.8/include/arm_neon.h:9852:32: error: 'asm' operand requires impossible reload
[  156s]             : /* No clobbers */);
[  156s]                                 ^
[  156s] /usr/lib64/gcc/aarch64-suse-linux/4.8/include/arm_neon.h:9852:32: error: 'asm' operand requires impossible reload
[  156s]             : /* No clobbers */);
[  156s]                                 ^
[  156s] /usr/lib64/gcc/aarch64-suse-linux/4.8/include/arm_neon.h:9852:32: error: 'asm' operand requires impossible reload
[  156s]             : /* No clobbers */);
[  156s]                                 ^
[  156s] /usr/lib64/gcc/aarch64-suse-linux/4.8/include/arm_neon.h:12175:32: error: 'asm' operand requires impossible reload
[  156s]             : /* No clobbers */);
[  156s]                                 ^
[  156s] /usr/lib64/gcc/aarch64-suse-linux/4.8/include/arm_neon.h:9852:32: error: 'asm' operand requires impossible reload
[  156s]             : /* No clobbers */);
[  156s]                                 ^
[  156s] /usr/lib64/gcc/aarch64-suse-linux/4.8/include/arm_neon.h:9852:32: error: 'asm' operand requires impossible reload
[  156s]             : /* No clobbers */);
[  156s]                                 ^
[  156s] /usr/lib64/gcc/aarch64-suse-linux/4.8/include/arm_neon.h:9852:32: error: 'asm' operand requires impossible reload
[  156s]             : /* No clobbers */);
[  156s]                                 ^
[  156s] /usr/lib64/gcc/aarch64-suse-linux/4.8/include/arm_neon.h:9852:32: error: 'asm' operand requires impossible reload
[  156s]             : /* No clobbers */);
[  156s]                                 ^
[  162s] scons: *** [build/src/core/NEON/kernels/NEConvolutionKernel.o] Error 1

Which compilers did you build your 64-bit binaries with?

No soname set

The shared libraries libarm_compute_core.so and libarm_compute.so are not versioned. The version would need to be set in per-library linker flags and could initially be 0. The idea is to have libarm_compute.so.0, with libarm_compute.so as symlink for development.

The build_library function in sconscript may need to be extended to accommodate this in the static=False case?

Error when simply building the examples with test_helpers

MattZhz@ubuntu:~/libraries/ComputeLibrary-master/build$ aarch64-linux-android-g++ examples/cl_convolution.cpp -I. -Iinclude -std=c++11 -larm_compute-static -L. -o cl_convolution_aarch64 -static-libstdc++ -pie -lOpenCL
/tmp/cc5TU2e0.o: In function `main': cl_convolution.cpp:(.text+0x660): undefined reference to `test_helpers::run_example(int, char const**, void (&)(int, char const**))'
/tmp/cc5TU2e0.o: In function `test_helpers::PPMLoader::open(std::string const&)': cl_convolution.cpp:(.text._ZN12test_helpers9PPMLoader4openERKSs[_ZN12test_helpers9PPMLoader4openERKSs]+0x88): undefined reference to `test_helpers::parse_ppm_header(std::basic_ifstream<char, std::char_traits<char> >&)'

Directory structure shown by tree -L 2:
├── arm_compute
│   ├── core
│   └── runtime
├── cl_convolution
├── cl_events
├── examples
│   ├── cl_convolution.cpp
│   ├── cl_convolution.o
│   ├── cl_events.cpp
│   ├── cl_events.o
│   ├── neoncl_scale_median_gaussian.cpp
│   ├── neon_convolution.cpp
│   ├── neon_copy_objects.cpp
│   ├── neon_scale.cpp
│   └── test_helpers
├── include
│   └── CL
├── libarm_compute_core.so
├── libarm_compute_core-static.a
├── libarm_compute.so
├── libarm_compute-static.a
├── libOpenCL.so
├── src
│   ├── core
│   └── runtime
└── test_helpers
├── Utils.cpp
├── Utils.h
└── Utils.o

Please help me see if there is anything wrong.

Thanks,
MattZhz
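The missing symbols (test_helpers::run_example and test_helpers::parse_ppm_header) live in test_helpers/Utils.cpp, so that file most likely needs to be compiled and linked along with the example, e.g.:

aarch64-linux-android-g++ examples/cl_convolution.cpp test_helpers/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -L. -o cl_convolution_aarch64 -static-libstdc++ -pie -lOpenCL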

Maxpool padding issue

Hi,

I'm implementing AlexNet using the Arm Compute Library for NEON. For the last max pooling layer (13,13,256),
the output is not as expected compared to the other platforms (GPU).

For max pooling, the previous layer's output is a ReLU layer (actual output dimensions: 13,13,256; dimensions padded by the library: 16,13,256), i.e. there is row-wise padding of 3 floats (hence width 16). I'm feeding the same tensor to max pooling with the configuration below:

PoolingLayerInfo(PoolingType::MAX, 3,PadStrideInfo(2,2,0,0,DimensionRoundingType::FLOOR))

Can you check what the issue could be?
I'm attaching the input, output and reference output for max pooling at the Google Drive link below.

https://drive.google.com/open?id=0B8m_q7acVVbbTl9kbGJEUkVVN3M

Note: I'm using the latest version of armcompute.

Thanks,
G.Praveen.

Something wrong in update_window_and_padding

I have a piece of code like this:

NEActivationLayer activation;
activation.configure(input, output, activation_info);
activation.run();

I found that if the shape of the input is [14,14,192], it fails with output like this:

ERROR in run src/core/NEON/kernels/NEActivationLayerKernel.cpp:204: This kernel hasn't been configured. Resource temporarily unavailable

But if the shape of the input is [16,16,192], everything is OK.

I found that it crashed because of an assert which says that

    ARM_COMPUTE_ERROR_ON_LOC_MSG((kernel->window().x().start() == kernel->window().x().end()) && (kernel->window().x().end() == 0),
                                 function, file, line,
                                 "This kernel hasn't been configured.");

After tracing with gdb, I found that in ICPPSimpleKernel::configure(),
update_window_and_padding() overrode kernel->window().x().end() with zero, so the kernel looks as if it had not been configured.

Is this a bug?

Norm output Precision issues

Hi,
I'm trying to use the Compute Library to implement AlexNet.
For the 1st normalization layer we are seeing huge precision issues compared to the output from another platform (e.g. GPU) for a given input.
Download the input and outputs from the link below to test the norm layer: https://drive.google.com/open?id=0B8m_q7acVVbbQU5aMFhtblFSd1E. Below is the code I'm using with the Compute Library:

int main()
{
   // The src tensor which will contain the input image
    Tensor dup_src;

    // The weights and biases tensors should be initialized with the values inferred with the training
    Tensor dst;

    NENormalizationLayer    NormLayer;

    float *inp = (float *)malloc(55*55*96*sizeof(float));

    FILE *fp;    
    fp = fopen("norm_inp.bin","rb");
    fread(inp,4,55*55*96,fp);
    fclose(fp);
    
    /* [Initialize tensors] */
    dup_src.allocator()->init(TensorInfo(TensorShape(55*55*96), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(55*55*96), 1, DataType::F32));

    NormLayer.configure(&dup_src, &dst, NormalizationLayerInfo(NormType::CROSS_MAP,5, 2.0E-5, 0.75, 1.0));

    dup_src.allocator()->allocate();
    dst.allocator()->allocate();
    
    std::copy_n((unsigned char*)inp , 55*55*96*sizeof(float), (unsigned char*)dup_src.buffer() +  dup_src.info()->offset_element_in_bytes(Coordinates(0, 0)));

    NormLayer.run();
   
    float *out = (float*)(dst.buffer() + dst.info()->offset_element_in_bytes(Coordinates(0, 0)));    
    
    fp = fopen("norm_out.txt","w");
    
    for(int i=0;i<55*55*96;i++)
       fprintf(fp,"%f\n",out[i]);
    fclose(fp);
}

Can you check it and let us know whether this is expected or whether there is an inherent bug.

Thanks,
G.Praveen.
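One observation (not a confirmed diagnosis): the tensors are declared as flat 1D buffers of 55*55*96 elements, while CROSS_MAP normalization needs to know where the channel dimension is. Declaring the real 3D shape may be what was intended:

    // Hypothetical fix: describe the data as (width, height, channels) rather than one long vector
    dup_src.allocator()->init(TensorInfo(TensorShape(55U, 55U, 96U), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(55U, 55U, 96U), 1, DataType::F32));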

ERROR building the compute library

Executed the following command:
scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=android arch=arm64-v8a

Error:
scons: Reading SConscript files ...
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git [...] -- [...]'
ERROR: Compiler not found
scons: done reading SConscript files.
scons: Building targets ...
scons: building associated VariantDir targets: build
scons: `.' is up to date.
scons: done building targets.

How do I load weights of a pre-trained model?

With the recent update, and also your support, I can now build and run deep neural network models. How do I load weights, biases, etc. from a file so that I can use the network for classification?

FC Layer using the GEMM

Hi,

I am trying to implement the FC layer using GEMM; below are the dimensions and configuration I'm using. Please let me know if this is correct, since I'm facing the error below while running:
"what(): in Iterator ./arm_compute/core/Helpers.inl:211: Maximum number of dimensions expected 1 but dimension 1 is not empty"

NEGEMM FcLayer;

LayerTensor.allocator()->init(TensorInfo(TensorShape((long unsigned int)4096), 1, DataType::F32));
FcLayerWgtTensor.allocator()->init(TensorInfo(TensorShape((long unsigned int)4096,(long unsigned int)4096), 1, DataType::F32));
FcLayerBiasTensor.allocator()->init(TensorInfo(TensorShape((long unsigned int)4096), 1, DataType::F32));
FcTensor.allocator()->init(TensorInfo(TensorShape((long unsigned int)4096), 1, DataType::F32));

/*Done the allocations of the above */
.......................
........................

FcLayer.configure(&LayerTensor, &FcLayerWgtTensor, &FcLayerBiasTensor, &FcTensor,1.0,1.0);
FcLayer.run();

Does any transpose need to be done before calling run()?
Thanks.

Issue with Normalization layer

Hi,
When I try to use the normalization layer, the output comes out as expected, but at function exit I get the error below:

*** Error in `./norm_layer': double free or corruption (!prev): 0x000000001182d040 ***
Aborted

The code I'm using is below.

int main()
{
   // The src tensor which will contain the input image
    Tensor src;
    Tensor dup_src;

    // The weights and biases tensors should be initialized with the values inferred with the training
    Tensor dst;

    NENormalizationLayer    NormLayer;

    /* [Initialize tensors] */

   
    src.allocator()->init(TensorInfo(TensorShape(32000), 1, DataType::F32));
    dup_src.allocator()->init(TensorInfo(TensorShape(500), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(500), 1, DataType::F32));

    printf("MWNormLayer :: %d %d\n",dup_src.info()->total_size(),dst.info()->total_size());
    NormLayer.configure(&dup_src, &dst, NormalizationLayerInfo(NormType::CROSS_MAP,1,0.0001, 0.75, 1.0));
    printf("MWNormLayer :: %d %d\n",dup_src.info()->total_size(),dst.info()->total_size());

    src.allocator()->allocate();
    dup_src.allocator()->allocate();
    dst.allocator()->allocate();
    
    if(src.info()->total_size() == dup_src.info()->total_size())
    {
      std::copy_n((unsigned char*)src.buffer(), 500*sizeof(float), (unsigned char*)dup_src.buffer());
    }
    else if(src.info()->total_size() > dup_src.info()->total_size())
    {
       printf("Entered the elseif %d %d %d\n",src.info()->offset_element_in_bytes(Coordinates(0, 0)),dup_src.info()->offset_element_in_bytes(Coordinates(0, 0)),500*sizeof(float));
       std::copy_n((unsigned char*)src.buffer(), 500*sizeof(float), (unsigned char*)dup_src.buffer()+dup_src.info()->offset_element_in_bytes(Coordinates(0, 0)));
    }
    else
    {
       printf("WARNING ::: Padding added during the Configure reconsider the copying !!!\n");
       printf("%d %d %d %d \n",src.info()->total_size(),dup_src.info()->total_size(),src.info()->offset_element_in_bytes(Coordinates(0, 0)),dup_src.info()->offset_element_in_bytes(Coordinates(0, 0)));
       std::copy_n((unsigned char*)src.buffer() +  src.info()->offset_element_in_bytes(Coordinates(0, 0)), 500*sizeof(float), (unsigned char*)dup_src.buffer() + dup_src.info()->offset_element_in_bytes(Coordinates(0, 0)));
    }

    NormLayer.run();
} 

Can you please point out what is wrong with the code?

Thanks,
G. Praveen.

git required for building from source tarball

If the git executable is not installed on the build host, scons errors out:

...
[   49s]   File "/usr/lib/python2.7/site-packages/SCons/Script/SConscript.py", line 604:
[   49s]     return method(*args, **kw)
[   49s]   File "/usr/lib/python2.7/site-packages/SCons/Script/SConscript.py", line 541:
[   49s]     return _SConscript(self.fs, *files, **subst_kw)
[   49s]   File "/usr/lib/python2.7/site-packages/SCons/Script/SConscript.py", line 250:
[   49s]     exec _file_ in call_stack[-1].globals
[   49s]   File "/home/abuild/rpmbuild/BUILD/ComputeLibrary-17.03.1/sconscript", line 65:
[   49s]     git_hash = subprocess.check_output(["git", "rev-parse","HEAD"])
[   49s]   File "/usr/lib/python2.7/subprocess.py", line 212:
[   49s]     process = Popen(stdout=PIPE, *popenargs, **kwargs)
[   49s]   File "/usr/lib/python2.7/subprocess.py", line 390:
[   49s]     errread, errwrite)
[   49s]   File "/usr/lib/python2.7/subprocess.py", line 1024:
[   49s]     raise child_exception

That should not be necessary, since this is all that its presence results in:

[  164s] scons: Reading SConscript files ...
[  165s] fatal: Not a git repository (or any of the parent directories): .git
[  167s] scons: done reading SConscript files.
[  167s] scons: Building targets ...

I.e., git does not get any information from the tarball, so it is rather useless for release builds. Requiring the installation of unneeded packages is an inconvenience in an OBS build environment, as it adds to the build time for every package rebuild.

Probably this could be prevented by adding some except clause below?

git_hash = "unknown"
try:
    git_hash = subprocess.check_output(["git", "rev-parse", "HEAD"])
except (OSError, subprocess.CalledProcessError):
    # OSError covers a missing git binary; CalledProcessError covers a non-repo checkout
    pass

Benchmark result against the current TensorFlow runtime

From this post, the following is mentioned,

our experience optimizing machine learning frameworks such as Google TensorFlow.

Are there benchmark results against the default TensorFlow implementations (Eigen/gemmlowp)?

I guess there should be close integration with TensorFlow, for example implemented as a Device/Kernel. Will this integration work also be open sourced, or even integrated into the TensorFlow master branch?

Detect ISA features at runtime

Linux distros can't tell at build time whether the Armv7-A hardware a user will run on supports NEON (e.g., tegra20 did not). Therefore the library should drop the neon=1 build-time option and instead detect the availability or absence of NEON at initialization time.

The same probably applies to arch=arm64-v8a vs. arch=arm64-v8.2-a.

Why does extern ne10_foo(ne10_t bar) asm ("ne10_foo"); generate a compiler error?

I have downloaded and built the library, but I cannot successfully compile an application, even from the sample programs, because of an error I get in the NE10_*.h header files. The simplest way I can show the error is to try compiling in its home directory. The header files included by NE10.h are located one level above in ../inc, and the compiler seems to find them okay:

root@ls2088ardb:/home/chuck/ne10/projectNe10-Ne10-0c9f576/samples# gcc -std=c99 -I../inc NE10_sample_matrix_multiply.c
In file included from ../inc/NE10.h:175:0,
from NE10_sample_matrix_multiply.c:30:
../inc/NE10_math.h: In function 'ne10_addc_float_neon':
../inc/NE10_math.h:76:139: error: expected declaration specifiers before 'asm'
extern ne10_result_t ne10_addc_float_neon (ne10_float32_t * dst, ne10_float32_t * src, const ne10_float32_t cst, ne10_uint32_t count) asm ("ne10_addc_float_neon");

My question is:

What is asm in this function prototype? I'm thinking this line of code declares an external assembly language function named "ne10_addc_float_neon", but why is that name repeated after the asm (macro?)?

Is this supposed to work on the A72? The assembly instructions appear to be for Armv7 NEON.

I’m sure I have some simple fundamental mistake with the build or link.

Matthew DuPuy suggested I post for Joe Savage here.

Thanks
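On the asm question: asm ("name") after a declaration is GCC's asm-label extension; it binds the C declaration to that exact assembly symbol rather than any mangled or prefixed name, which is why the name is repeated. The asm keyword is a GNU extension that strict -std=c99 disables, which is most likely what produces the error here; building with -std=gnu99, or spelling it __asm__, avoids it:

    /* Same declaration, using the always-available __asm__ spelling of the asm-label extension */
    extern ne10_result_t ne10_addc_float_neon (ne10_float32_t * dst, ne10_float32_t * src,
                                               const ne10_float32_t cst, ne10_uint32_t count)
        __asm__ ("ne10_addc_float_neon");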

ndk-build compile error?

I get an error when I compile an example against the .so. The error message is shown below:

In file included from jni/test.cpp:25:
In file included from jni/arm_compute/core/Types.h:27:
In file included from jni/arm_compute/core/Coordinates.h:27:
In file included from jni/arm_compute/core/Dimensions.h:27:
jni/arm_compute/core/Error.h:123:2: error: expected expression
[[noreturn]] void error(const char *function, const char *file, const in...
^
jni/arm_compute/core/Error.h:123:14: error: expected unqualified-id
[[noreturn]] void error(const char *function, const char *file, const in...
^
In file included from jni/test.cpp:25:
In file included from jni/arm_compute/core/Types.h:27:
In file included from jni/arm_compute/core/Coordinates.h:27:
jni/arm_compute/core/Dimensions.h:29:10: fatal error: 'algorithm' file not found
#include <algorithm>
^
3 errors generated.

How to use NESoftmaxLayer

As in this code:

Tensor in,out;
in.allocator()->init(TensorInfo(10,1,Format::F32));
out.allocator()->init(*in.info());
printf("before softmax.configure\n");
printf("in.info()->total_size()=%d\n",in.info()->total_size());
printf("out.info()->total_size()=%d\n",out.info()->total_size());

NESoftmaxLayer softmax;
softmax.configure(&in,&out);

in.allocator()->allocate();
out.allocator()->allocate();

printf("after softmax.configure\n");
printf("in.info()->total_size()=%d\n",in.info()->total_size());
printf("out.info()->total_size()=%d\n",out.info()->total_size());

float dst[10];
for(int i=0;i<10;i++) dst[i]=i;
if(!in.info()->has_padding()){
        std::copy_n((char*)dst, in.info()->total_size(), in.buffer());
}

softmax.run();

In my environment the output is:

before softmax.configure
in.info()->total_size()=40
out.info()->total_size()=40

after softmax.configure
in.info()->total_size()=1800
out.info()->total_size()=1800

Is that normal?

Inconsistent output between ACL's NEConvolutionLayer and Caffe's ConvolutionLayer.

Hi, everyone. Recently I've been trying to replace Caffe's ConvolutionLayer with ACL's NEConvolutionLayer. I'm half done with it and the results are inspiring. However, although most of the time NEConvolutionLayer gives the same result as Caffe's, there are still some bad cases. Could anyone please help me out?

I've selected one test case that could reproduce the inconsistency in the forward process of a single layer. For simplicity, I've dumped the input data, weights and biases of the layer into text files. These files are generated by dumping the mutable_cpu_data of the corresponding blobs for the input data, weights and biases. I've written two handler source files to load these data for both Caffe and ACL. One of them is "caffe_main.cpp" which could be compiled and run directly. Another is "acl_snippet.cpp" which needs a little more refactoring to work properly.

And I've run them on my Android phone (Xiaomi MI 5). The result is as follows (output.txt for ACL; output_caffe.txt for Caffe):

[screenshot comparing output.txt and output_caffe.txt]

You can get all the data and source code in my file attachments; the two output files are also included:

demo.zip

Do we need to manually extend ITensor for floating point support?

I'm trying to integrate ComputeLibrary into my Caffe model. The first thing I've come up with is to replace Caffe's gemm function with ComputeLibrary's gemm function.

I've found that there's a NEGEMM class for floating point matrix multiplication. However, the input parameter refers to an ITensor class, which seems to require the underlying data type to be uint8_t.


Do we need to implement another Tensor class for floating point support? Or is there something wrong with my understanding, and can I bypass this problem in some other way?

P.S.: Does the "I" in "ITensor" indicate it is a tensor for integer matrices?
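As far as I can tell, the "I" prefix is the usual C++ convention for an interface (abstract base) class, not "integer": the uint8_t* returned by ITensor::buffer() is simply untyped byte storage, and the element type comes from the associated TensorInfo, so NEGEMM does operate on F32 tensors directly. A sketch of reading float data back (tensor here is any allocated F32 tensor), using the same offset helper seen elsewhere in these issues:

    // Reinterpret the raw byte buffer of an F32 tensor, skipping any leading padding
    float *data = reinterpret_cast<float *>(
        tensor.buffer() + tensor.info()->offset_element_in_bytes(arm_compute::Coordinates(0, 0)));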

Something wrong in building the library

CXX=clang++ CC=clang scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=android arch=armv7a

and then:

test_helpers/Utils.cpp:89:56: error: use of undeclared identifier 'errno'
<< "ERROR " << err.what() << " " << (errno ? strerror(errno) : "") << std::endl;
^
test_helpers/Utils.cpp:89:73: error: use of undeclared identifier 'errno'
<< "ERROR " << err.what() << " " << (errno ? strerror(errno) : "") << std::endl;
^
2 errors generated.
scons: *** [build/test_helpers/Utils.o] Error 1
scons: building terminated because of errors.

Please help.
Thanks a lot
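A likely cause, judging from the "undeclared identifier 'errno'" message alone: test_helpers/Utils.cpp uses errno and strerror without including the headers that declare them under this toolchain, so adding the includes at the top of that file may unblock the build:

    #include <cerrno>   // declares errno
    #include <cstring>  // declares strerror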
