intel / idlf
Intel® Deep Learning Framework
License: BSD 3-Clause "New" or "Revised" License
Since IDLF has presumably done a lot of optimization for convolution/pooling/etc. operations, can we use IDLF as a library?
If yes, how can we call the convolution::forward and convolution::backward functions, i.e., what is the API?
Thanks.
Is it possible to train a new CNN using this framework? I didn't see any instructions on this.
I tried to install the software and I got the following errors in the Release
stage (and in other stages as well):
FreeImage_LIBRARIES (ADVANCED)
linked by target "tester" in directory /home/user_name/idlf/idlf/tester
linked by target "demo_device" in directory /home/user_name/idlf/idlf/demo/device
linked by target "demo_primitives" in directory /home/user_name/idlf/idlf/demo/primitives
OPENCL_INCLUDE_DIRS
used as include directory in directory /home/user_name/idlf/idlf/device/gpu
OPENCL_LIBRARIES
linked by target "device_gpu" in directory /home/user_name/idlf/idlf/device/gpu
I am trying to install on an Ubuntu machine with a single Intel processor. I have also tried installing on a Linux server (with AMD processors), but I get the same errors. I have downloaded the FreeImage library, but I do not know what to do after that.
I will provide documentation after I install IDLF, but I have not managed to do that yet. Could you please tell me whether this software depends on specific hardware, and what software dependencies it has? I searched for documentation on other websites but did not manage to find anything.
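For reference, these CMake messages mean the FreeImage and OpenCL variables were never resolved, so the corresponding development files were not found. A setup sketch for an Ubuntu machine follows; the apt package names are standard Ubuntu ones, and the library paths shown are assumptions for a typical x86-64 install, not paths taken from IDLF's documentation:

```shell
# Install FreeImage development headers/libraries (assumed Ubuntu package name).
sudo apt-get install libfreeimage-dev

# Install OpenCL headers and the ICD loader (assumed Ubuntu package names).
sudo apt-get install opencl-headers ocl-icd-opencl-dev

# Re-run CMake so it can locate the libraries; the variables named in the
# error output can also be set explicitly (example paths, adjust to your system):
cmake .. -DFreeImage_LIBRARIES=/usr/lib/x86_64-linux-gnu/libfreeimage.so \
         -DOPENCL_INCLUDE_DIRS=/usr/include \
         -DOPENCL_LIBRARIES=/usr/lib/x86_64-linux-gnu/libOpenCL.so
```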
I also get the following later on:
cxx Flags:
-- ULTS_WORKING_DIR ...................... /home/user_name/idlf/idlf/bin/tester/g_ult/gcc/DebugULT
-- ULTS_RUN_DIR ...................... /home/user_name/idlf/idlf/bin/tester/g_ult/gcc/DebugULT
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
Hi~ I am studying the code now and have some questions about it. I hope you can give me some answers, but if it's inconvenient for you to answer, just ignore this and close the issue.
I am currently studying the convolution kernel. Consider the kernel responsible for computing a 4x3 output region from a 19x24 input feature map: it seems to compute 16 different feature maps for this input region using 16 different filters. My question is: why do you divide the image across 16 SIMD lanes (30 floats per lane)? It doesn't make any sense to me.
The related code is:
__attribute__((reqd_work_group_size(1, 1, SIMD_SIZE)))
kernel void
convolve_AlexNet_C1 (
__global float* outputs,
__global float* inputs,
filter_qualifier float* weights,
__global float* biases,
unsigned int batch_output_offset) //__global float *inputs, __global float* weights, __global float* outputs)
{
uint oc = get_global_id(0) * OUT_BLOCK_WIDTH; // oc = Output Column
uint or = get_global_id(1) * OUT_BLOCK_HEIGHT; // or = Output Row
uint fm = get_global_id(2); // fm = Feature Map = od = Output Depth
uint fmg = get_group_id(2);
uint lid = get_local_id(2);
//float w;
// 19 x 24 = 456; 456 / 16 = 29 floats per lane for SIMD16, but padding out to 30 to simplify the load loop.
float in[30]; // this holds a 19x24 block of the input data, enough to compute 4x3 outputs, simd_shuffle is used so that all work-items have access to all 19x24 locations.
float out[12]; // 4x3 block of outputs that is SIMD_SIZE deep (along the Feature Map dimension).
for(int i=0;i<12;i++)
{
// we need this address calculation for biases because we support views and batching
out[i] = biases[(fm - get_global_offset(2)) % _OD];
}
uint in_addr;
uint weight_addr = (fmg % (_OD/SIMD_SIZE)) * _ID * KERNEL * KERNEL * SIMD_SIZE + lid;
uint input_batch_offset = (fm / _OD) * (_IH + IHPAD) * (_IW + IWPAD) * _ID;
for(int kd = 0; kd < _ID; kd++) // _ID = 3, RGB
{
in_addr = input_batch_offset + (kd + INPUT_START_Z) * (_IH + IHPAD) * (_IW + IWPAD) + (or*K_STRIDE + INPUT_START_Y) * (_IW + IWPAD) + (oc*K_STRIDE + INPUT_START_X) + lid;
// read 19x24 block into registers.
// This is ugly, we really need to fix the programming model.
for(uint reg = 0; reg < 30; reg+=3) {
in[reg] = inputs[in_addr];// read 16 elements
// might be better to adjust the addrs, then do single load.
if(lid < 8) in[reg + 1] = inputs[in_addr + 16];// read 8 elements in lower portion, for total of 24 from input row.
in_addr += (_IW + IWPAD); // move to next row down
if(lid >= 8) in[reg + 1] = inputs[in_addr - 8]; // read 8 elements into upper portion
in[reg + 2] = inputs[in_addr + 8]; // read 16 elements
in_addr += (_IW + IWPAD); // move to next row down
}
float w[5];
int w_idx=0;
w[0] = weights[weight_addr]; weight_addr += SIMD_SIZE; // weights must be stored in just the right SIMD swizzled format for this to work, see host code for details.
w[1] = weights[weight_addr]; weight_addr += SIMD_SIZE;
w[2] = weights[weight_addr]; weight_addr += SIMD_SIZE;
w[3] = weights[weight_addr]; weight_addr += SIMD_SIZE;
w[4] = weights[weight_addr]; weight_addr += SIMD_SIZE;
int kr = 0; // kr = Kernel Row
LOOP(10, kr, // LOOP is a macro that unrolls the loop.
{
int kc = 0; // kc = Kernel Column
LOOP(KERNEL, kc,
{
for(int br=0; br<OUT_BLOCK_HEIGHT; br++) {
for(int bc=0; bc<OUT_BLOCK_WIDTH; bc++) {
//if we fix the programming model, then we could use a nice simple 2d array: val = in[br * K_STRIDE + kr][bc * K_STRIDE + kc];
float val = intel_sub_group_shuffle( in[(((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) / SIMD_SIZE], (((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) & (SIMD_SIZE - 1));
out[br * OUT_BLOCK_WIDTH + bc] = mad(w[w_idx%5], val, out[br * OUT_BLOCK_WIDTH + bc]);
}
}
w[w_idx%5] = weights[weight_addr]; weight_addr += SIMD_SIZE;
w_idx++;
});
});
// last kr loop split in two parts
int kc = 0; // kc = Kernel Column
LOOP(6, kc,
{
for(int br=0; br<OUT_BLOCK_HEIGHT; br++) {
for(int bc=0; bc<OUT_BLOCK_WIDTH; bc++) {
//if we fix the programming model, then we could use a nice simple 2d array: val = in[br * K_STRIDE + kr][bc * K_STRIDE + kc];
float val = intel_sub_group_shuffle( in[(((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) / SIMD_SIZE], (((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) & (SIMD_SIZE - 1));
out[br * OUT_BLOCK_WIDTH + bc] = mad(w[w_idx%5], val, out[br * OUT_BLOCK_WIDTH + bc]);
}
}
w[w_idx%5] = weights[weight_addr]; weight_addr += SIMD_SIZE;
w_idx++;
});
// last 5 kc loops don't prefetch weights
LOOP(5, kc,
{
for(int br=0; br<OUT_BLOCK_HEIGHT; br++) {
for(int bc=0; bc<OUT_BLOCK_WIDTH; bc++) {
//if we fix the programming model, then we could use a nice simple 2d array: val = in[br * K_STRIDE + kr][bc * K_STRIDE + kc];
float val = intel_sub_group_shuffle( in[(((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) / SIMD_SIZE], (((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) & (SIMD_SIZE - 1));
out[br * OUT_BLOCK_WIDTH + bc] = mad(w[w_idx%5], val, out[br * OUT_BLOCK_WIDTH + bc]);
}
}
w_idx++;
});
}
// write the 4x3 (and 16 feature maps deep) output tile to memory
uint out_addr = OUT_BUFF_OFFSET + fm * (_OW + OWPAD) * (_OH + OHPAD); // out_addr indexes into start of 16 feature maps.
out_addr += or * (_OW + OWPAD) + oc; // offset for the 4x3 block that this workitem is working on;
#ifndef WRITE_PADDED_VALUES
if(get_global_id(0) != (get_global_size(0)-1) &&
get_global_id(1) != (get_global_size(1)-1) )
{
#endif
for(uint r = 0; r < OUT_BLOCK_HEIGHT; r++) {
for(uint c = 0; c < OUT_BLOCK_WIDTH; c++) {
// this does a scattered write to 16 different feature maps, so that data within one map is contiguous, thus ready for input to next layer.
outputs[out_addr + r * (_OW + OWPAD) + c] = activation_function(out[r * OUT_BLOCK_WIDTH + c]);
}
}
#ifndef WRITE_PADDED_VALUES
}else if ( get_global_id(1) != (get_global_size(1)-1) )
{
for(uint r = 0; r < OUT_BLOCK_HEIGHT; r++) {
for(uint c = 0; c < LAST_BLOCK_WIDTH; c++) {
outputs[out_addr + r * (_OW + OWPAD) + c] = activation_function(out[r * OUT_BLOCK_WIDTH + c]);
}
}
}
else if ( get_global_id(0) != (get_global_size(0)-1) )
{
for(uint r = 0; r < LAST_BLOCK_HEIGHT; r++) {
for(uint c = 0; c < OUT_BLOCK_WIDTH; c++) {
outputs[out_addr + r * (_OW + OWPAD) + c] = activation_function(out[r * OUT_BLOCK_WIDTH + c]);
}
}
}
else
{
for(uint r = 0; r < LAST_BLOCK_HEIGHT; r++) {
for(uint c = 0; c < LAST_BLOCK_WIDTH; c++) {
outputs[out_addr + r * (_OW + OWPAD) + c] = activation_function(out[r * OUT_BLOCK_WIDTH + c]);
}
}
}
#endif //#ifndef WRITE_PADDED_VALUES
}
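The lane layout in the question can be reproduced with plain index arithmetic. The 19x24 tile holds 456 floats, and dividing it across 16 SIMD lanes needs ceil(456/16) = 29 private registers per lane, padded to 30 so the load loop (three reads per two input rows) stays simple. This is a sketch of the index math only, not IDLF host code; the names `locate` and `regs_per_lane` are mine:

```cpp
constexpr int SIMD_SIZE = 16;
constexpr int TILE_H = 19, TILE_W = 24;  // input tile read by each work-group

struct Loc { int reg; int lane; };

// Which private register and which SIMD lane hold element `flat` of the
// flattened 19x24 tile. This mirrors the kernel's in[idx / SIMD_SIZE]
// combined with intel_sub_group_shuffle(..., idx & (SIMD_SIZE - 1)).
inline Loc locate(int flat) {
    return { flat / SIMD_SIZE, flat & (SIMD_SIZE - 1) };
}

// ceil(456 / 16) = 29 registers per lane; the kernel pads this to 30.
inline int regs_per_lane() {
    return (TILE_H * TILE_W + SIMD_SIZE - 1) / SIMD_SIZE;
}
```

In other words, the split onto 16 lanes is not a spatial decomposition of the image: every lane computes a different output feature map, and the shuffle lets all 16 work-items read the whole shared 19x24 tile without going back to global memory.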
Also, why are there two sub-convolution layers in AlexNet conv layers 2, 4, and 5? Does this implementation help to improve efficiency, and why?
When I try to run the demo using the command below:
./visual_cloud_demo --config=gpu_caffenet.config
I got this error:
error: regex_error
My gpu_caffenet.config is as follows:
--model=caffenet_float
--device=device_gpu
--batch=32
--input=/home/images/
What does 'regex_error' mean?
Thanks!
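For context: std::regex_error is the exception thrown by the C++ `<regex>` library, which the demo plausibly uses to parse options like `--model=caffenet_float` (an assumption; the demo's actual parsing code isn't shown here). One well-known cause on older Ubuntu systems is that libstdc++ before GCC 4.9 shipped an incomplete `<regex>`, so constructing even a valid std::regex throws regex_error at runtime. The sketch below uses a hypothetical option pattern of my own, not the demo's:

```cpp
#include <regex>
#include <string>

// Hypothetical "--key=value" matcher of the kind a demo's option parser
// might use; the actual pattern in the demo may differ.
inline bool looks_like_option(const std::string& arg) {
    static const std::regex opt("--[A-Za-z_]+=.*");
    return std::regex_match(arg, opt);
}

// std::regex throws std::regex_error when a pattern cannot be compiled.
// With libstdc++ from GCC < 4.9 this also happens for valid patterns,
// which would surface as exactly the "error: regex_error" message above.
inline bool pattern_throws(const std::string& pattern) {
    try {
        std::regex re(pattern);
        return false;
    } catch (const std::regex_error&) {
        return true;
    }
}
```

So the first thing to check is the compiler version used to build the demo: `g++ --version` reporting anything below 4.9 would explain the error regardless of the config file's contents.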