
ruy's Issues

Question about relationship between row/cols and order

I am trying to understand the whole flow of ruy.
I am using the example below:

  const float lhs_data[] = {1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6};
  const float rhs_data[] = {1, 2, 3, 4, 5, 6};
  float dst_data[8];
  ruy::Matrix<float> lhs;
  ruy::MakeSimpleLayout(4, 3, ruy::Order::kRowMajor, lhs.mutable_layout());
  lhs.set_data(lhs_data);
  ruy::Matrix<float> rhs;
  ruy::MakeSimpleLayout(3, 2, ruy::Order::kColMajor, rhs.mutable_layout());
  rhs.set_data(rhs_data);
  ruy::Matrix<float> dst;
  ruy::MakeSimpleLayout(4, 2, ruy::Order::kColMajor, dst.mutable_layout());
  dst.set_data(dst_data);

I have questions about the relationship between rows/cols and order in my case.

I run on arm_v8, so the fixed kernel is 1x8, row-major:

  packed_matrix = {elem_ = {{data_type = {is_signed = true, is_floating_point = true, size = 4 '\004'},
        data = 0x0, sums_type = {is_signed = true, is_floating_point = true, size = 4 '\004'}, sums = 0x0,
        layout = {rows = 3, cols = 8, stride = 3, order = ruy::Order::kColMajor, kernel = {
            order = ruy::Order::kRowMajor, rows = 1 '\001', cols = 8 '\b'}}, zero_point = 0}, {data_type = {
          is_signed = true, is_floating_point = true, size = 4 '\004'}, data = 0x0, sums_type = {is_signed = true,
          is_floating_point = true, size = 4 '\004'}, sums = 0x0, layout = {rows = 3, cols = 8, stride = 3,
          order = ruy::Order::kColMajor, kernel = {order = ruy::Order::kRowMajor, rows = 1 '\001',
            cols = 8 '\b'}}, zero_point = 0}}}, is_prepacked = {elem_ = {false, false}},
  mul_params_bytes = "\300\206H", '\000' <repeats 11 times>
                rows  cols  stride  order
SRC LHS           3     4       3      C
SRC RHS           3     2       3      C
Dst               4     2       4      C
Packed LHS        3     8       3      C
Packed RHS        3     8       3      C
  1. Why is only the LHS transposed in the metadata?
  2. After RunPack completes, I am confused about order & rows/cols.
    I get packed_src[LHS] = {1,4,1,4,0,0,0,0,2,5,2,5,0,0,0,0,3,6,3,6,0,0,0,0} and the order is col-major.
    If it is really col-major, the matrix is
    1 4 0 5 0 0 3 0
    4 0 0 2 0 3 6 0
    1 0 2 5 0 6 0 0.
    However, I think it should be (row-major):
    1 4 1 4 0 0 0 0
    2 5 2 5 0 0 0 0
    3 6 3 6 0 0 0 0
    Why does packed_matrix[LHS] report col-major order?

How do I install this library on an arm_32 machine?

I've gone through the build process (bazel build :all) and it looks like it completed successfully. Now, though, how do I install the output of the build? I was expecting something like a .so file, but I don't see one, so I'm not sure how to install it. I don't see a bazel "install" option...

block_map compilation error (VS2019)

I'm seeing a strange compiler error when building the face_mesh_cpu example from mediapipe using VS2019.

ERROR: C:/users/will/_bazel_will/mvh33bjd/external/ruy/ruy/BUILD:295:11: Compiling ruy/block_map.cc failed: (Exit 2): cl.exe failed: error executing command C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/VC/Tools/MSVC/14.28.29910/bin/HostX64/x64/cl.exe /nologo /DCOMPILER_MSVC /DNOMINMAX /D_WIN32_WINNT=0x0601 /D_CRT_SECURE_NO_DEPRECATE ... (remaining 28 argument(s) skipped)
cl : Command line warning D9002 : ignoring unknown option '-O3'
external/ruy/ruy/block_map.cc(334): error C2059: syntax error: ')'
external/ruy/ruy/block_map.cc(334): error C2676: binary '==': 'const ruy::CpuCacheParams' does not define this operator or a conversion to a type acceptable to the predefined operator

At block_map.cc:334, if I remove the newline before cpu_cache_params and put the return statement all on one line, the code compiles fine. Why on earth would whitespace cause this compilation error to happen?

For reference, the command used to build the mediapipe example is bazel build -c opt --define MEDIAPIPE_DISABLE_GPU=1 --action_env PYTHON_BIN_PATH="[path to python3.exe]" //mediapipe/examples/desktop/face_mesh:face_mesh_cpu

How to statically link with ruy?

I've built the TensorFlow Lite C API as a static lib.
I need to link against its sub-dependencies too, one of which is ruy.
However, there are ~30 .a libs for ruy.
Do I need to link against all of them? What should I do?

How to check the best pack data's memory layout?

Taking matmul as an example, does ruy bind each kernel to a specific data format?
What I know so far is that the LHS is transposed to col-major, so LHS, RHS, and Dst are all col-major.
The LHS and RHS should then be packed into a specific memory layout to accelerate computation.
So what exactly is the memory layout for a specific kernel?

Why is the result zero?

I have two matrices, A = [[1,2],[3,4]] and B = [[1,3],[2,4]], but the result AB is the zero matrix [[0,0],[0,0]].
I don't know how ruy::Mul works. Is there any available information? Thanks!!

void ExampleMulInt8PerChannelQuantized(ruy::Context *context) {
  const std::int8_t lhs_data[] = {1, 2, 3, 4};
  const std::int8_t rhs_data[] = {1, 2, 3, 4};
  std::int8_t dst_data[4];

  ruy::Matrix<std::int8_t> lhs;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
  lhs.set_data(lhs_data);
  ruy::Matrix<std::int8_t> rhs;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
  rhs.set_data(rhs_data);
  ruy::Matrix<std::int8_t> dst;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
  dst.set_data(dst_data);

  ruy::MulParams<std::int32_t, std::int8_t> mul_params;
  ruy::Mul(lhs, rhs, mul_params, context, &dst);

  std::cout << "Example Mul, int8 quantized with per-channel multipliers\n";
  std::cout << "LHS:\n" << lhs;
  std::cout << "RHS:\n" << rhs;
  std::cout << "Result:\n" << dst << "\n";
}

Error Compiling, Impossible Assembly Constraints

Using pycoral's build.sh

Compile command:
/usr/bin/arm-linux-gnueabihf-gcc -fPIC -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer '-march=armv7-a' '-mfpu=neon-vfpv4' -g0 -O3 -DNDEBUG '-D_FORTIFY_SOURCE=2' -ffunction-sections -fdata-sections -funsafe-math-optimizations -ftree-vectorize '-std=c++14' -MD -MF bazel-out/armv7a-opt/bin/external/ruy/ruy/_objs/pack_arm/pack_arm.d '-frandom-seed=bazel-out/armv7a-opt/bin/external/ruy/ruy/_objs/pack_arm/pack_arm.o' -iquote external/ruy -iquote bazel-out/armv7a-opt/bin/external/ruy -iquote external/cpuinfo -iquote bazel-out/armv7a-opt/bin/external/cpuinfo -iquote external/clog -iquote bazel-out/armv7a-opt/bin/external/clog -Ibazel-out/armv7a-opt/bin/external/cpuinfo/_virtual_includes/cpuinfo -Ibazel-out/armv7a-opt/bin/external/clog/_virtual_includes/clog '-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION' '-ffp-contract=off' -Wall -Wextra -Wc++14-compat -Wundef '-mfpu=neon' -O3 -no-canonical-prefixes -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c external/ruy/ruy/pack_arm.cc -o bazel-out/armv7a-opt/bin/external/ruy/ruy/_objs/pack_arm/pack_arm.o)
Error:
external/ruy/ruy/pack_arm.cc:469:72: error: 'asm' operand has impossible constraints
  469 |   "q4", "q5", "q6", "q7", "q8", "q9", "q10", "q11", "q12", "q13");

AVX support on Apple MacOS

Hi,

I am using ruy as a dependency of TensorFlow, and their new quantized conv2d implementation relies on ruy::TrMul. The issue is that ruy does not enable X86_ENHANCEMENTS on macOS by default. When I tried forcing it with -DRUY_FORCE_ENABLE_X86_ENHANCEMENTS, it runs faster, but the output is wrong. A similar result can be observed if I run with the RUY_PATHS=0x20 environment variable (suggested by @talumbau in a tensorflow issue).

My computer is a MacBook Pro 2018 with an Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz (ark page), so it supports AVX2.

platform.h has a comment about disabling it on Apple, but I cannot access the comment referenced under b/138922878. My question is: is it possible to build tensorflow with AVX instructions enabled for the ruy backend on Apple?

CMake install missing

Please add support for cmake install.
This would allow easier packaging of this library with Conan.

about thread pool

Would you consider adding support for reentrant tasks?

Currently, tasks are not reentrant: in ThreadPool::ExecuteImpl, each task runs on exactly one thread:

void ThreadPool::ExecuteImpl(int task_count, int stride, Task* tasks);

Using a reentrant task with an atomic variable might give higher performance; the overhead is light because there is only one task. The code:
void ThreadPool::ExecuteReenterableImpl(int thread_count, Task* reenterable_task) {
  RUY_DCHECK_GE(thread_count, 1);

  // Case of 1 thread: just run the single task on the current thread.
  if (thread_count == 1) {
    reenterable_task->Run();
    return;
  }

  // Task #0 will be run on the current thread.
  CreateThreads(thread_count - 1);
  counter_to_decrement_when_ready_.Reset(thread_count - 1);
  for (int i = 1; i < thread_count; i++) {
    // All threads share the same task object; no `+ i * stride` offset needed.
    threads_[i - 1]->StartWork(reenterable_task);
  }

  // Execute task #0 immediately on the current thread.
  reenterable_task->Run();

  // Wait for the threads submitted above to finish.
  counter_to_decrement_when_ready_.Wait(spin_duration_);
}

Are there any documentations?

Hi. Is there any documentation for ruy somewhere?

I'm having trouble understanding how ruy works. I'm trying to compare the performance of different GEMM libraries (like ruy) on mobile devices using tflite, but I don't understand ruy well enough to swap it out.

Can you point me to any documentation or any guide for ruy?

Compilation fails with GCC on ARM CortexA72 because of asm impossible constraints

Trying to build Chromium with NEON on Raspberry pi 4 with Yocto and GCC (using mcpu=cortex-a7, mfpu=neon-vfpv4, mthumb), compilation of ruy fails:

arm-poky-linux-gnueabi-g++  -mthumb -mfpu=neon-vfpv4 -mfloat-abi=hard -mcpu=cortex-a7 -fstack-protector-strong   -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security -Wdate-time --sysroot=/home/dape/Development/rpi/poky-warrior/build/tmp/work/cortexa7t2h
f-neon-vfpv4-poky-linux-gnueabi/chromium-dev/97.0.4682.3-r0/recipe-sysroot -MMD -MF obj/third_party/ruy/ruy/pack_arm.o.d -DUSE_UDEV -DUSE_AURA=1 -DUSE_GLIB=1 -DUSE_NSS_CERTS=1 -DUSE_OZONE=1 -DUSE_X11=1 -DOFFICIAL_BUILD -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFI
LE64_SOURCE -DNO_UNWIND_TABLES -DNDEBUG -DNVALGRIND -DDYNAMIC_ANNOTATIONS_ENABLED=0 -I../chromium-97.0.4682.3 -Igen -I../chromium-97.0.4682.3/third_party/ruy/src -fno-ident -fno-strict-aliasing --param=ssp-buffer-size=4 -fstack-protector -fno-unwind-tables -fno-asynchrono
us-unwind-tables -fPIC -pipe -pthread -march=armv7ve -mfloat-abi=hard -mtune=generic-armv7-a -mfpu=neon -mthumb -O2 -fdata-sections -ffunction-sections -fno-omit-frame-pointer -g1 -fvisibility=hidden -Wno-inline-asm -Wno-psabi -Wno-unused-local-typedefs -Wno-maybe-uniniti
alized -Wno-deprecated-declarations -Wno-comments -Wno-packed-not-aligned -Wno-missing-field-initializers -Wno-unused-parameter -std=gnu++14 -fno-exceptions -fno-rtti -fvisibility-inlines-hidden -Wno-narrowing -Wno-class-memaccess   -feliminate-unused-debug-types -fmacro-
prefix-map=/home/dape/Development/rpi/poky-warrior/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/chromium-dev/97.0.4682.3-r0=/usr/src/debug/chromium-dev/97.0.4682.3-r0                      -fdebug-prefix-map=/home/dape/Development/rpi/poky-warrior/build/tmp/wo
rk/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/chromium-dev/97.0.4682.3-r0=/usr/src/debug/chromium-dev/97.0.4682.3-r0                      -fdebug-prefix-map=/home/dape/Development/rpi/poky-warrior/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/chromium-dev/97.0
.4682.3-r0/recipe-sysroot=                      -fdebug-prefix-map=/home/dape/Development/rpi/poky-warrior/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/chromium-dev/97.0.4682.3-r0/recipe-sysroot-native=  -fvisibility-inlines-hidden -c ../chromium-97.0.4682.3/
third_party/ruy/src/ruy/pack_arm.cc -o obj/third_party/ruy/ruy/pack_arm.o
../chromium-97.0.4682.3/third_party/ruy/src/ruy/pack_arm.cc: In function 'void ruy::Pack8bitColMajorForNeon4Cols(const ruy::PackParams8bit&)':
../chromium-97.0.4682.3/third_party/ruy/src/ruy/pack_arm.cc:264:3: error: 'asm' operand has impossible constraints
  264 |   asm volatile(
      |   ^~~
At global scope:

cmake_minimum_required is called after project()

According to the CMake documentation, cmake_minimum_required needs to be called before the first call to project() (see notes here: https://cmake.org/cmake/help/latest/command/cmake_minimum_required.html, and at the bottom of this page: https://cmake.org/cmake/help/latest/command/project.html)

This can cause problems if users wish to use CMake functionality like setting their own policy defaults, or if code inside CMake toolchain files (e.g. when cross-building) or using project code injection is used (https://cmake.org/cmake/help/latest/command/project.html#code-injection). Both are useful to set up C++ package managers to provide dependencies.

Documents about the design

Are there any documents about the design of ruy? They might help us understand the source code.
Thanks in advance.

Compilation error when Scalar is Eigen::half and not float

Since class Matrix is templatized

template <typename Scalar>
class Matrix final {
......
 private:
...........
  // The zero_point, i.e. which Scalar value is to be interpreted as zero.
  // When Scalar is floating-point, this must be 0.
  Scalar zero_point_ = 0;
};

I could have something like: Matrix<Eigen::half> myMatrix;

but then at compilation I get error: no viable conversion from 'int' to 'Eigen::half' on Scalar zero_point_ = 0; since the zero isn't templatized.

Integer 0 and float zero are interchangeable, so the above code works for float; but that's not the generic case for templates. I believe an implementation like Scalar zero_point_ = Scalar{0}; is more generic.

Should similar fixes be applied to other classes and parts of the code?

Question about performance comparison

Hi.
I was looking for a performance comparison between ruy and OpenBLAS and I came across this.
But when I benchmark ruy (for almost any shape, with single-thread execution, on a Raspberry Pi 4), my results are far behind the reported ones.
For example, for the 512x512x512 Int8 benchmark, I can only get ~10 GOPs, while the spreadsheet reports 40 GOPs.
I know the Raspberry Pi 4 CPU frequency maxes out at 1.5 GHz while the Pixel 4's max frequency is 2.84 GHz, but that does not justify the 30 GOPs gap.
So I thought it might be better to ask here.
How did you measure GOPs for ruy?
I calculate the GOPs with the formula ((2 * N * K * M * iterations) / time) / 10e+9, where time is the sum of the execution times of ruy::Mul over all iterations (I pack the RHS matrix beforehand).
Am I doing anything wrong?

Failed to build benchmark with GCC9.3.1

When I am trying to build benchmark with the command:

 bazel --output_user_root=$build_dir build -c dbg --copt=-march=native //ruy:benchmark_f32_f32_f32_f32

It fails with the error:

./ruy/test.h:166: error: undefined reference to 'std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >::basic_ostringstream()'
./ruy/test.h:166: error: undefined reference to 'std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >::basic_ostringstream()'
./ruy/test.h:2210: error: undefined reference to 'std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >::basic_ostringstream()'
collect2: error: ld returned 1 exit status

The above command works perfectly with GCC 8.3.
I upgraded to GCC 9.3 to use AVX512, as enabled in:

ruy/ruy/platform.h

Lines 117 to 121 in be065e4

#elif defined(__GNUC__) && (__GNUC__ >= 9)
// Enable on recent versions of GCC. Might be possible
// to relax this version requirement.
#define RUY_PLATFORM_X86_ENHANCEMENTS 1
#else

BTW: I am using bazel 3.1

performance data

Is there any performance data about ruy, compared with gemmlowp or other optimized libraries? I see that tflite uses both gemmlowp and ruy. Which is better in performance?

Build examples

I think there is a problem with building the example applications.
I've tried calling bazel build //ruy/example:example from different paths inside the project, but I always receive an error.
This is what I see when calling it from the example directory (screenshot omitted).

Eigen Matrix-vector multiplication shows better performance

I'm testing the performance of Eigen vs ruy on an Intel machine and on a Raspberry Pi, and in my benchmarking tests I consistently see Eigen perform much faster than ruy. Is there something I'm doing wrong in these benchmarks?

I'm pasting my test code below :

class RuyMultiplier {
public:
    RuyMultiplier(size_t stateSize, size_t outputSize, const std::vector<float>& weightData, int numThreads)
        : _weight(weightData) {
        context.set_max_num_threads(numThreads);
        ruy::MakeSimpleLayout(1, stateSize, ruy::Order::kColMajor, A.mutable_layout());
        ruy::MakeSimpleLayout(outputSize, stateSize, ruy::Order::kColMajor, B.mutable_layout());
        ruy::MakeSimpleLayout(1, outputSize, ruy::Order::kColMajor, C.mutable_layout());
        B.set_data(_weight.data());
        B.set_cache_policy(ruy::CachePolicy::kAlwaysCache);
    }

    void multiply(const float* state, std::vector<float>& output) {
        A.set_data(state);
        C.set_data(output.data());
        ruy::Mul(A, B, mul_params, &context, &C);
    }

private:
    std::vector<float> _weight;
    ruy::Matrix<float> A;
    ruy::Matrix<float> B;
    ruy::Matrix<float> C;
    ruy::MulParams<float, float> mul_params;
    ruy::Context context;
};

My function for benchmarking (using google benchmark) is:

static void RuyBenchmark(benchmark::State& state) {
    std::random_device random_device;
    auto rng = std::mt19937(random_device());
    auto f32rng = std::bind(std::uniform_real_distribution<float>(-1.0f, +1.0f), std::ref(rng));

    size_t inputSize = state.range(0);
    size_t outputSize = state.range(1);
    std::vector<float> weight(inputSize * outputSize);

    std::generate(weight.begin(), weight.end(), std::ref(f32rng));
    RuyMultiplier testMul(inputSize, outputSize, weight, 4);

    std::vector<float> input(inputSize, 0.0f);
    std::vector<float> output(outputSize);

    float sampleValue(.0f);
    for (auto _ : state) {
        std::generate(input.begin(), input.end(), std::ref(f32rng));
        testMul.multiply(input.data(), output);
        sampleValue = output[0]; 
    }
}

The Eigen setup is similar but I get much faster results. In the sample run here, my input vector has size 80 and the output vector has size 128 (so a 1x80x128 multiplication, or a 128x80x1 multiplication, depending on whether I treat it as row major or column major).

-----------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------------------
RuyBenchmark/I:80/H:128/process_time/real_time         62.8 us         62.9 us        10469
EigenBenchmark/I:80/H:128/process_time/real_time       1.66 us         1.66 us       416765

RUY won't compile

This repo currently contains compile-time errors.


  1. The following code results in a compile-time error:

child_to_add_to = node->children.emplace_back(new TreeView::Node).get();

According to https://en.cppreference.com/w/cpp/container/vector/emplace_back, emplace_back returns void until C++17, so you cannot call .get() on a void type.

There is no instruction in the repo saying the project requires C++17 support.


  2. The header string.h, which declares memcpy, is not included, which again results in a compile-time error for the following code:

memcpy(dst, &stack.id, sizeof(stack.id));


  3. The namespace for the profiler is no longer called profiling; the readme.md
    is not updated and still uses profiling. The code now reads:

namespace profiler {

Broken computation when running under nodejs on armv7

Fuller investigation has been documented on tensorflow/tensorflow#39509.

It looks like q7 was removed from the clobber list in tensorflow/tensorflow@2359c4e#diff-ca44636122d5fd4fe9600903ebf461b9L665.

I honestly don't know why it would expose this behavior only under NodeJS and only on ARMv7 platform, but re-instating q7 as in tensorflow/tensorflow#39951 fixes the issue.

Since q7 is clobbered everywhere else that q6-q15 are clobbered, and since there's no specific comment explaining the removal of q7 at this one place, is it possible it's just a slight typo and I have been lucky in finding it?

Performance benchmarks

Are there any reliable benchmark results comparing ruy with other GEMM libraries such as gemmlowp and Eigen? I am really interested in this (in the context of Tensorflow Lite performance), but the only tiny piece of information I found so far is a blog post in Tensorflow blog mentioning that TF Lite with ruy enabled outperforming regular TF Lite (Better CPU performance section) when inferring on a single CPU core.

Some discussion about memalign in SystemAlignedAlloc?

Following the TF issue, we found that the code here calls memalign in SystemAlignedAlloc without any assert. That is a little dangerous and really unfriendly for developers to debug: if a phone's Android produces no warning other than W/libc: memalign(64, 411042816) failed: returning null pointer, we will go crazy. BTW, memory alignment is indeed a good way to gain efficiency, but we may need to consider other approaches for very large aligned allocations. Looking forward to your reply, thanks.

Compile failure on armv7a

Hi, I'm trying to build PyCoral with Tensorflow 2.7.0 for arm7a and it fails as follows:

$ git clone https://github.com/oberluz/pycoral.git
$ git checkout 2_7_0
$ git submodule update --init --recursive 
$ DOCKER_CPUS="armv7a" ./scripts/build.sh --python_versions "310"
...
(cd /home/eyeot-demo/.cache/bazel/_bazel_eyeot-demo/eab0d61a99b6696edb3d2aff87b585e8/execroot/pycoral && \
  exec env - \
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin \
    PWD=/proc/self/cwd \
  /usr/bin/arm-linux-gnueabihf-gcc -fPIC -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer '-march=armv7-a' '-mfpu=neon-vfpv4' -g0 -O3 -DNDEBUG '-D_FORTIFY_SOURCE=2' -ffunction-sections -fdata-sections -funsafe-math-optimizations -ftree-vectorize '-std=c++17' -MD -MF bazel-out/armv7a-opt/bin/external/org_tensorflow/tensorflow/lite/_objs/minimal_logging/minimal_logging_default.d '-frandom-seed=bazel-out/armv7a-opt/bin/external/org_tensorflow/tensorflow/lite/_objs/minimal_logging/minimal_logging_default.o' -iquote external/org_tensorflow -iquote bazel-out/armv7a-opt/bin/external/org_tensorflow '-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION' '-ffp-contract=off' -Wall -DFARMHASH_NO_CXX_STRING -Wno-sign-compare -O3 -fno-exceptions -no-canonical-prefixes -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c external/org_tensorflow/tensorflow/lite/minimal_logging_default.cc -o bazel-out/armv7a-opt/bin/external/org_tensorflow/tensorflow/lite/_objs/minimal_logging/minimal_logging_default.o)
INFO: From Compiling tensorflow/lite/minimal_logging_default.cc:
external/org_tensorflow/tensorflow/lite/minimal_logging_default.cc:28: warning: ignoring '#pragma clang diagnostic' [-Wunknown-pragmas]
   28 | #pragma clang diagnostic push
      | 
external/org_tensorflow/tensorflow/lite/minimal_logging_default.cc:29: warning: ignoring '#pragma clang diagnostic' [-Wunknown-pragmas]
   29 | #pragma clang diagnostic ignored "-Wformat-nonliteral"
      | 
external/org_tensorflow/tensorflow/lite/minimal_logging_default.cc:31: warning: ignoring '#pragma clang diagnostic' [-Wunknown-pragmas]
   31 | #pragma clang diagnostic pop
      | 
SUBCOMMAND: # @org_tensorflow//tensorflow/lite:minimal_logging [action 'Linking external/org_tensorflow/tensorflow/lite/libminimal_logging.a', configuration: 11cae9684ea823b3911201bf8ead584031fd2c07b224432188557a50a44d29ae, execution platform: @local_execution_config_platform//:platform]
(cd /home/eyeot-demo/.cache/bazel/_bazel_eyeot-demo/eab0d61a99b6696edb3d2aff87b585e8/execroot/pycoral && \
  exec env - \
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin \
    PWD=/proc/self/cwd \
  /usr/bin/arm-linux-gnueabihf-ar @bazel-out/armv7a-opt/bin/external/org_tensorflow/tensorflow/lite/libminimal_logging.a-2.params)
SUBCOMMAND: # @com_google_absl//absl/time/internal/cctz:civil_time [action 'Compiling absl/time/internal/cctz/src/civil_time_detail.cc', configuration: 11cae9684ea823b3911201bf8ead584031fd2c07b224432188557a50a44d29ae, execution platform: @local_execution_config_platform//:platform]
(cd /home/eyeot-demo/.cache/bazel/_bazel_eyeot-demo/eab0d61a99b6696edb3d2aff87b585e8/execroot/pycoral && \
  exec env - \
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin \
    PWD=/proc/self/cwd \
  /usr/bin/arm-linux-gnueabihf-gcc -fPIC -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer '-march=armv7-a' '-mfpu=neon-vfpv4' -g0 -O3 -DNDEBUG '-D_FORTIFY_SOURCE=2' -ffunction-sections -fdata-sections -funsafe-math-optimizations -ftree-vectorize '-std=c++17' -MD -MF bazel-out/armv7a-opt/bin/external/com_google_absl/absl/time/internal/cctz/_objs/civil_time/civil_time_detail.d '-frandom-seed=bazel-out/armv7a-opt/bin/external/com_google_absl/absl/time/internal/cctz/_objs/civil_time/civil_time_detail.o' -iquote external/com_google_absl -iquote bazel-out/armv7a-opt/bin/external/com_google_absl '-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION' '-ffp-contract=off' -no-canonical-prefixes -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c external/com_google_absl/absl/time/internal/cctz/src/civil_time_detail.cc -o bazel-out/armv7a-opt/bin/external/com_google_absl/absl/time/internal/cctz/_objs/civil_time/civil_time_detail.o)
ERROR: /home/eyeot-demo/.cache/bazel/_bazel_eyeot-demo/eab0d61a99b6696edb3d2aff87b585e8/external/ruy/ruy/BUILD:585:11: Compiling ruy/pack_arm.cc failed: (Exit 1): arm-linux-gnueabihf-gcc failed: error executing command 
  (cd /home/eyeot-demo/.cache/bazel/_bazel_eyeot-demo/eab0d61a99b6696edb3d2aff87b585e8/sandbox/processwrapper-sandbox/375/execroot/pycoral && \
  exec env - \
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin \
    PWD=/proc/self/cwd \
  /usr/bin/arm-linux-gnueabihf-gcc -fPIC -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer '-march=armv7-a' '-mfpu=neon-vfpv4' -g0 -O3 -DNDEBUG '-D_FORTIFY_SOURCE=2' -ffunction-sections -fdata-sections -funsafe-math-optimizations -ftree-vectorize '-std=c++17' -MD -MF bazel-out/armv7a-opt/bin/external/ruy/ruy/_objs/pack_arm/pack_arm.d '-frandom-seed=bazel-out/armv7a-opt/bin/external/ruy/ruy/_objs/pack_arm/pack_arm.o' -iquote external/ruy -iquote bazel-out/armv7a-opt/bin/external/ruy -iquote external/cpuinfo -iquote bazel-out/armv7a-opt/bin/external/cpuinfo -iquote external/clog -iquote bazel-out/armv7a-opt/bin/external/clog -Ibazel-out/armv7a-opt/bin/external/cpuinfo/_virtual_includes/cpuinfo -Ibazel-out/armv7a-opt/bin/external/clog/_virtual_includes/clog '-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION' '-ffp-contract=off' -Wall -Wextra -Wc++14-compat -Wundef '-mfpu=neon' -O3 -no-canonical-prefixes -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c external/ruy/ruy/pack_arm.cc -o bazel-out/armv7a-opt/bin/external/ruy/ruy/_objs/pack_arm/pack_arm.o)
Execution platform: @local_execution_config_platform//:platform

Use --sandbox_debug to see verbose messages from the sandbox arm-linux-gnueabihf-gcc failed: error executing command 
  (cd /home/eyeot-demo/.cache/bazel/_bazel_eyeot-demo/eab0d61a99b6696edb3d2aff87b585e8/sandbox/processwrapper-sandbox/375/execroot/pycoral && \
  exec env - \
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin \
    PWD=/proc/self/cwd \
  /usr/bin/arm-linux-gnueabihf-gcc -fPIC -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer '-march=armv7-a' '-mfpu=neon-vfpv4' -g0 -O3 -DNDEBUG '-D_FORTIFY_SOURCE=2' -ffunction-sections -fdata-sections -funsafe-math-optimizations -ftree-vectorize '-std=c++17' -MD -MF bazel-out/armv7a-opt/bin/external/ruy/ruy/_objs/pack_arm/pack_arm.d '-frandom-seed=bazel-out/armv7a-opt/bin/external/ruy/ruy/_objs/pack_arm/pack_arm.o' -iquote external/ruy -iquote bazel-out/armv7a-opt/bin/external/ruy -iquote external/cpuinfo -iquote bazel-out/armv7a-opt/bin/external/cpuinfo -iquote external/clog -iquote bazel-out/armv7a-opt/bin/external/clog -Ibazel-out/armv7a-opt/bin/external/cpuinfo/_virtual_includes/cpuinfo -Ibazel-out/armv7a-opt/bin/external/clog/_virtual_includes/clog '-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION' '-ffp-contract=off' -Wall -Wextra -Wc++14-compat -Wundef '-mfpu=neon' -O3 -no-canonical-prefixes -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c external/ruy/ruy/pack_arm.cc -o bazel-out/armv7a-opt/bin/external/ruy/ruy/_objs/pack_arm/pack_arm.o)
Execution platform: @local_execution_config_platform//:platform

Use --sandbox_debug to see verbose messages from the sandbox
In file included from external/ruy/ruy/pack_arm.cc:16:
external/ruy/ruy/pack_arm.h:492:9: warning: multi-line comment [-Wcomment]
  492 | #endif  // (RUY_PLATFORM_NEON_64 || RUY_PLATFORM_NEON_32) && \
      |         ^
external/ruy/ruy/pack_arm.cc: In function 'void ruy::Pack8bitColMajorForNeon4Cols(const ruy::PackParams8bit&)':
external/ruy/ruy/pack_arm.cc:264:3: error: 'asm' operand has impossible constraints
  264 |   asm volatile(
      |   ^~~
Target //src:_pywrap_coral failed to build
INFO: Elapsed time: 234.588s, Critical Path: 88.61s
INFO: 576 processes: 204 internal, 372 processwrapper-sandbox.
FAILED: Build did NOT complete successfully
make: *** [Makefile:152: pybind] Error 1
make: Leaving directory '/workspace'

Building for other versions of Python (36, 37, 38, 39; see scripts/build.sh) works. But it fails for 310, which uses Ubuntu 22.04.

Any ideas?

Some question about PMU

Hi! Recently I have been focusing on performance profiling on Android and learned that the PMU can record useful information about caches, instructions, memory, and so on.
My question: do I need to compile the Linux kernel with CONFIG_HW_PERF_EVENTS=ON / CONFIG_ARM_SPE_PMU=ON to get the PMU to work?

Compile without warnings on reasonably recent Clang and GCC

Because the toolchains inside Google are set up to ignore many warnings, Google-owned projects tend to compile with many warnings for opensource users. Ruy is currently no exception. We should fix that, at least for some recent enough Clang and GCC versions, either by changing code or by adding warning-disabling flags to ruy_copts.

QuantizeMultiplier public API

I would like to determine which fixed-point multiplier and exponent to use for multiplication with a particular scale, and the QuantizeMultiplier function seems to be exactly what I need. But I noticed it is located in test.h and not somewhere better suited for public exposure.
What is the proper way for me to determine fixed-point multipliers and exponents? If it is to call QuantizeMultiplier, should QuantizeMultiplier be moved out of test.h or be given some public API?

+= output operation?

Maybe I'm missing something, but is there support for C += A*B as opposed to C = A*B? I'm trying to make a full sgemm replacement, which would mainly be useful for fine-tuning models on device.
