
idlf's Introduction

The Intel® Deep Learning Framework

This project has reached end of life and is not supported by its former maintainers.

idlf's People

Contributors

pmajchrzak


idlf's Issues

IDLF as a library

Since IDLF appears to include heavily optimized convolution/pooling/etc. operations, can we use IDLF as a library? If so, how can we call the convolution::forward and convolution::backward functions, i.e., what is the API?
Thanks.
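
Judging from the repository layout (demo/device, demo/primitives, device/gpu), the demos do not call convolution::forward directly; each device is built as a shared library and exposed through a versioned C interface. Below is a minimal host-side sketch of opening that interface, assuming the nn_device_load / nn_device_interface_open entry points from the repository's device API headers; the header paths and exact signatures are assumptions, not a verified API:

#include <cstdio>

// Header paths assumed from the IDLF source tree; treat the types and
// signatures below as assumptions based on the device API headers.
#include "device/api/nn_device_api.h"
#include "device/api/nn_device_interface_0.h"

int main() {
    // Each device (device_cpu, device_gpu) is a separate shared library;
    // loading it fills in a description of the interface versions it supports.
    nn_device_description_t description;
    if (nn_device_load(&description) != 0) {
        std::printf("failed to load device\n");
        return 1;
    }

    // Open interface version 0: a struct of function pointers for creating
    // a workflow of layers, compiling it into a workload, and executing it,
    // rather than direct convolution::forward / ::backward calls.
    nn_device_interface_0_t device_interface;
    if (nn_device_interface_open(0, &device_interface) != 0) {
        std::printf("failed to open device interface\n");
        return 1;
    }

    // ... build, compile, and execute a workflow via device_interface ...

    nn_device_interface_close(&device_interface);
    nn_device_unload();
    return 0;
}

The opened interface is a table of function pointers for building a workflow of layers, compiling it into a workload, and executing it, which is the closest supported path to using IDLF as a library.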

Installation instructions and documentation

I tried to install the software and I got the following errors in the Release stage (and in other stages as well):

FreeImage_LIBRARIES (ADVANCED)
    linked by target "tester" in directory /home/user_name/idlf/idlf/tester
    linked by target "demo_device" in directory /home/user_name/idlf/idlf/demo/device
    linked by target "demo_primitives" in directory /home/user_name/idlf/idlf/demo/primitives
OPENCL_INCLUDE_DIRS
   used as include directory in directory /home/user_name/idlf/idlf/device/gpu
OPENCL_LIBRARIES
    linked by target "device_gpu" in directory /home/user_name/idlf/idlf/device/gpu

I am trying to install on an Ubuntu machine with a single Intel processor. I also tried installing on a Linux server (with AMD processors), but I get the same errors. I have downloaded the FreeImage library, but I do not know what to do after that.

I will write up documentation after I get IDLF installed, but I have not managed to do that yet. Could you please tell me whether this software depends on specific hardware, and what software dependencies it has? I searched for documentation on other websites but did not find anything.
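
For what it is worth, the three unset variables in the error correspond to missing development packages: FreeImage_LIBRARIES needs the FreeImage development files, and OPENCL_INCLUDE_DIRS / OPENCL_LIBRARIES need an OpenCL SDK (headers plus an ICD loader). On Ubuntu these are typically libfreeimage-dev, opencl-headers, and ocl-icd-opencl-dev, though exact package names vary by release; alternatively, the paths can be passed straight to cmake as cache variables, e.g. -DFreeImage_LIBRARIES=/usr/lib/x86_64-linux-gnu/libfreeimage.so -DOPENCL_INCLUDE_DIRS=/usr/include -DOPENCL_LIBRARIES=/usr/lib/x86_64-linux-gnu/libOpenCL.so (illustrative paths; check your system).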

I also get the following later on:

cxx Flags: 
-- ULTS_WORKING_DIR ...................... /home/user_name/idlf/idlf/bin/tester/g_ult/gcc/DebugULT
-- ULTS_RUN_DIR ...................... /home/user_name/idlf/idlf/bin/tester/g_ult/gcc/DebugULT
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
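
This last block is not an error: it is CMake's standard FindThreads probe, which first checks whether pthread_create is available without extra libraries, then tries libpthreads, then libpthread. The intermediate "not found" lines are normal, and "Found Threads: TRUE" on the last line means the check succeeded.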

Questions about details in the code

Hi! I am studying the code now and have some questions about it. I hope you can give me some answers, but if it is inconvenient for you to answer, just ignore this and close the issue.
I am currently studying the convolution kernel. Consider a kernel responsible for computing a 4x3 output region from a 19x24 input feature map: it seems this kernel computes 16 different feature maps for this input region using 16 different filters. My question is: why do you divide the image across 16 SIMD lanes (30 floats per lane)? It doesn't make sense to me.
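
For reference, the kernel's own comments explain the split: the 19x24 input tile holds 19 × 24 = 456 floats, and 456 / 16 = 28.5, so each of the 16 SIMD lanes keeps 29 floats, padded to 30 to simplify the load loop. intel_sub_group_shuffle then lets every lane read any of the 456 values, so the whole tile is shared across the work-group.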
The related code is:

__attribute__((reqd_work_group_size(1, 1, SIMD_SIZE)))
kernel void
convolve_AlexNet_C1 (
    __global float* outputs,
    __global float* inputs,
    filter_qualifier float* weights,
    __global float* biases,
    unsigned int batch_output_offset)  //__global float *inputs, __global float* weights, __global float* outputs)
{
    uint oc  = get_global_id(0) * OUT_BLOCK_WIDTH;  // oc = Output Column
    uint or  = get_global_id(1) * OUT_BLOCK_HEIGHT; // or = Output Row
    uint fm  = get_global_id(2);                    // fm = Feature Map = od = Output Depth
    uint fmg = get_group_id(2);
    uint lid = get_local_id(2);
    //float w;
    // 19 x 24 = 456; 456 / 16 = 28.5, so 29 floats per lane for SIMD16, padded out to 30 to simplify the load loop.
    float in[30];   // this holds a 19x24 block of the input data, enough to compute 4x3 outputs, simd_shuffle is used so that all work-items have access to all 19x24 locations.
    float out[12]; // 4x3 block of outputs that is SIMD_SIZE deep (along the Feature Map dimension).
    for(int i=0;i<12;i++)
    { 
        // we need this address calculation for biases because we support views and batching
        out[i] = biases[(fm - get_global_offset(2)) % _OD];
    }

    uint in_addr;
    uint weight_addr = (fmg % (_OD/SIMD_SIZE)) * _ID * KERNEL * KERNEL * SIMD_SIZE + lid;

    uint input_batch_offset = (fm / _OD) * (_IH + IHPAD) * (_IW + IWPAD) * _ID;

    for(int kd = 0; kd < _ID; kd++)  // _ID = 3, RGB
    {
        in_addr = input_batch_offset + (kd + INPUT_START_Z) * (_IH + IHPAD) * (_IW + IWPAD) + (or*K_STRIDE + INPUT_START_Y) * (_IW + IWPAD) + (oc*K_STRIDE + INPUT_START_X)  + lid;

        // read 24x19 block into registers.
        // This is ugly, we really need to fix the programming model.
        for(uint reg = 0; reg < 30; reg+=3) {
            in[reg] = inputs[in_addr];// read 16 elements
            // might be better to adjust the addrs, then do single load.
            if(lid < 8) in[reg + 1] = inputs[in_addr + 16];// read 8 elements in lower portion, for total of 24 from input row.
            in_addr += (_IW + IWPAD);  // move to next row down
            if(lid >= 8) in[reg + 1] = inputs[in_addr - 8];  // read 8 elements into upper portion
            in[reg + 2] = inputs[in_addr + 8]; // read 16 elements
            in_addr += (_IW + IWPAD);  // move to next row down
        }

        float w[5];
        int w_idx=0;
        w[0] = weights[weight_addr]; weight_addr += SIMD_SIZE; // weights must be stored in just the right SIMD swizzled format for this to work, see host code for details.
        w[1] = weights[weight_addr]; weight_addr += SIMD_SIZE;
        w[2] = weights[weight_addr]; weight_addr += SIMD_SIZE;
        w[3] = weights[weight_addr]; weight_addr += SIMD_SIZE;
        w[4] = weights[weight_addr]; weight_addr += SIMD_SIZE;

        int kr = 0; // kr = Kernel Row
        LOOP(10, kr,  // LOOP is a macro that unrolls the loop.
        {
            int kc = 0; // kc = Kernel Column
            LOOP(KERNEL, kc,
            {
                for(int br=0; br<OUT_BLOCK_HEIGHT; br++) {
                    for(int bc=0; bc<OUT_BLOCK_WIDTH; bc++) {
                        //if we fix the programming model, then we could use a nice simple 2d array: val = in[br * K_STRIDE + kr][bc * K_STRIDE + kc];
                        float val = intel_sub_group_shuffle( in[(((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) / SIMD_SIZE], (((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) & (SIMD_SIZE - 1));
                        out[br * OUT_BLOCK_WIDTH + bc] = mad(w[w_idx%5], val, out[br * OUT_BLOCK_WIDTH + bc]);
                    }
                }

                w[w_idx%5] = weights[weight_addr]; weight_addr += SIMD_SIZE;
                w_idx++;

            });
        });

        // last kr loop split in two parts
        int kc = 0; // kc = Kernel Column
        LOOP(6, kc,
        {
            for(int br=0; br<OUT_BLOCK_HEIGHT; br++) {
                for(int bc=0; bc<OUT_BLOCK_WIDTH; bc++) {
                    //if we fix the programming model, then we could use a nice simple 2d array: val = in[br * K_STRIDE + kr][bc * K_STRIDE + kc];
                    float val = intel_sub_group_shuffle( in[(((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) / SIMD_SIZE], (((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) & (SIMD_SIZE - 1));
                    out[br * OUT_BLOCK_WIDTH + bc] = mad(w[w_idx%5], val, out[br * OUT_BLOCK_WIDTH + bc]);
                }
            }

            w[w_idx%5] = weights[weight_addr]; weight_addr += SIMD_SIZE;
            w_idx++;
         });

        // last 5 kc loops don't prefetch weights
        LOOP(5, kc,
        {
            for(int br=0; br<OUT_BLOCK_HEIGHT; br++) {
                for(int bc=0; bc<OUT_BLOCK_WIDTH; bc++) {
                    //if we fix the programming model, then we could use a nice simple 2d array: val = in[br * K_STRIDE + kr][bc * K_STRIDE + kc];
                    float val = intel_sub_group_shuffle( in[(((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) / SIMD_SIZE], (((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) & (SIMD_SIZE - 1));
                    out[br * OUT_BLOCK_WIDTH + bc] = mad(w[w_idx%5], val, out[br * OUT_BLOCK_WIDTH + bc]);
                }
            }
            w_idx++;
         });

    }

    // write the 4x3 (and 16 feature maps deep) output tile to memory
    uint out_addr = OUT_BUFF_OFFSET + fm * (_OW + OWPAD) * (_OH + OHPAD); // out_addr indexes into start of 16 feature maps.
    out_addr += or * (_OW + OWPAD) + oc;  // offset for the 4x3 block that this workitem is working on;
#ifndef WRITE_PADDED_VALUES
    if(get_global_id(0) != (get_global_size(0)-1) &&
        get_global_id(1) != (get_global_size(1)-1)  )
    {
#endif
        for(uint r = 0; r < OUT_BLOCK_HEIGHT; r++) {
            for(uint c = 0; c < OUT_BLOCK_WIDTH; c++) {
            // this does a scattered write to 16 different feature maps, so that data within one map is contiguous, thus ready for input to next layer.
              outputs[out_addr + r * (_OW + OWPAD) + c] = activation_function(out[r * OUT_BLOCK_WIDTH + c]);
            }
        }
#ifndef WRITE_PADDED_VALUES
    }else if ( get_global_id(1) != (get_global_size(1)-1) )
    {
        for(uint r = 0; r < OUT_BLOCK_HEIGHT; r++) {
            for(uint c = 0; c < LAST_BLOCK_WIDTH; c++) {
                  outputs[out_addr + r * (_OW + OWPAD) + c] = activation_function(out[r * OUT_BLOCK_WIDTH + c]);
            }
        }
    }
    else if ( get_global_id(0) != (get_global_size(0)-1) )
    {
        for(uint r = 0; r < LAST_BLOCK_HEIGHT; r++) {
            for(uint c = 0; c < OUT_BLOCK_WIDTH; c++) {
                  outputs[out_addr + r * (_OW + OWPAD) + c] = activation_function(out[r * OUT_BLOCK_WIDTH + c]);
            }
        }
    }
    else
    {
     for(uint r = 0; r < LAST_BLOCK_HEIGHT; r++) {
        for(uint c = 0; c < LAST_BLOCK_WIDTH; c++) {
                outputs[out_addr + r * (_OW + OWPAD) + c] = activation_function(out[r * OUT_BLOCK_WIDTH + c]);
            }
        }
    }
#endif //#ifndef WRITE_PADDED_VALUES
}


Also, why are there two sub-convolution layers in AlexNet conv layers 2, 4, and 5? Does this implementation help improve efficiency, and why?
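
For context, the two sub-convolutions mirror the original AlexNet design: layers 2, 4, and 5 were split into two groups so the model could be trained across two GPUs, and the grouping also halves the number of weights and multiply-accumulates in those layers. An implementation that preserves the split stays compatible with the published pretrained weights.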

error: regex_error

When I try to run the demo using the command below:
./visual_cloud_demo --config=gpu_caffenet.config
I got this error:
error: regex_error
My gpu_caffenet.config is as follows:
--model=caffenet_float
--device=device_gpu
--batch=32
--input=/home/images/
What does 'regex_error' mean?
Thanks!
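
The message suggests an uncaught C++ std::regex_error escaping from the demo's option parsing. One likely cause: libstdc++ before GCC 4.9 shipped an incomplete <regex> in which constructing almost any pattern throws std::regex_error at runtime, so building the demo with an older g++ can produce exactly this failure. A minimal sketch of that failure mode, using a hypothetical option pattern rather than whatever the demo actually compiles:

#include <iostream>
#include <regex>
#include <string>

int main() {
    try {
        // A pattern of the kind a config parser might use to split
        // "--key=value" lines; hypothetical, for illustration only.
        std::regex option_re("--([a-z_]+)=(.*)");
        std::smatch m;
        std::string line = "--model=caffenet_float";
        if (std::regex_match(line, m, option_re))
            std::cout << m[1] << " -> " << m[2] << "\n";
    } catch (const std::regex_error& e) {
        // On libstdc++ older than GCC 4.9, even constructing the regex
        // above throws, and a top-level handler printing the exception
        // name would report just "regex_error".
        std::cerr << "error: regex_error (" << e.what() << ")\n";
    }
    return 0;
}

If that is the cause, rebuilding with GCC 4.9 or newer should make the error go away.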
