intel / idlf
Intel® Deep Learning Framework
License: BSD 3-Clause "New" or "Revised" License
Since IDLF has presumably done a lot of optimization for convolution/pooling/etc. operations, can we use IDLF as a library?
If yes, how can we call the convolution::forward and convolution::backward functions, i.e., what is the API?
Thanks.
Is it possible to train a new CNN using this framework? I didn't see any instructions on this.
I tried to install the software and I got the following errors in the Release
stage (and in other stages as well):
FreeImage_LIBRARIES (ADVANCED)
linked by target "tester" in directory /home/user_name/idlf/idlf/tester
linked by target "demo_device" in directory /home/user_name/idlf/idlf/demo/device
linked by target "demo_primitives" in directory /home/user_name/idlf/idlf/demo/primitives
OPENCL_INCLUDE_DIRS
used as include directory in directory /home/user_name/idlf/idlf/device/gpu
OPENCL_LIBRARIES
linked by target "device_gpu" in directory /home/user_name/idlf/idlf/device/gpu
I am trying to install on an Ubuntu machine with a single Intel processor. I have also tried installing on a Linux server (with AMD processors), but I get the same errors. I have downloaded the FreeImage library, but I do not know what to do after that.
I will provide documentation after I install IDLF, but I have not managed to do that yet. Could you please tell me whether this software depends on specific hardware, and what software dependencies it has? I searched for documentation on other websites but did not manage to find anything.
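For reference, these CMake messages mean the FreeImage and OpenCL variables were never resolved, so the corresponding development files were not found. A setup sketch for an Ubuntu machine follows; the apt package names are standard Ubuntu ones, and the library paths shown are assumptions for a typical x86-64 install, not paths taken from IDLF's documentation:

```shell
# Install FreeImage development headers/libraries (assumed Ubuntu package name).
sudo apt-get install libfreeimage-dev

# Install OpenCL headers and the ICD loader (assumed Ubuntu package names).
sudo apt-get install opencl-headers ocl-icd-opencl-dev

# Re-run CMake so it can locate the libraries; the variables named in the
# error output can also be set explicitly (example paths, adjust to your system):
cmake .. -DFreeImage_LIBRARIES=/usr/lib/x86_64-linux-gnu/libfreeimage.so \
         -DOPENCL_INCLUDE_DIRS=/usr/include \
         -DOPENCL_LIBRARIES=/usr/lib/x86_64-linux-gnu/libOpenCL.so
```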
I also get the following later on:
cxx Flags:
-- ULTS_WORKING_DIR ...................... /home/user_name/idlf/idlf/bin/tester/g_ult/gcc/DebugULT
-- ULTS_RUN_DIR ...................... /home/user_name/idlf/idlf/bin/tester/g_ult/gcc/DebugULT
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
Hi~ I am studying the code now and have some questions about it. I hope you can give me some answers, but if it's inconvenient for you to answer, just ignore this and close the issue.
I am currently studying the convolution kernel. Consider the kernel responsible for computing a 4x3 output region from a 19x24 input feature map: it seems to compute 16 different feature maps for this input region using 16 different filters. My question is: why do you divide the image across 16 SIMD lanes (30 floats per lane)? It doesn't make any sense to me.
The related code is:
__attribute__((reqd_work_group_size(1, 1, SIMD_SIZE)))
kernel void
convolve_AlexNet_C1 (
__global float* outputs,
__global float* inputs,
filter_qualifier float* weights,
__global float* biases,
unsigned int batch_output_offset) //__global float *inputs, __global float* weights, __global float* outputs)
{
uint oc = get_global_id(0) * OUT_BLOCK_WIDTH; // oc = Output Column
uint or = get_global_id(1) * OUT_BLOCK_HEIGHT; // or = Output Row
uint fm = get_global_id(2); // fm = Feature Map = od = Output Depth
uint fmg = get_group_id(2);
uint lid = get_local_id(2);
//float w;
// 19 x 24 = 456; 456 / 16 = 29 floats per lane for SIMD16, but padding out to 30 to simplify the load loop.
float in[30]; // this holds a 19x24 block of the input data, enough to compute 4x3 outputs, simd_shuffle is used so that all work-items have access to all 19x24 locations.
float out[12]; // 4x3 block of outputs that is SIMD_SIZE deep (along the Feature Map dimension).
for(int i=0;i<12;i++)
{
// we need this address calculation for biases because we support views and batching
out[i] = biases[(fm - get_global_offset(2)) % _OD];
}
uint in_addr;
uint weight_addr = (fmg % (_OD/SIMD_SIZE)) * _ID * KERNEL * KERNEL * SIMD_SIZE + lid;
uint input_batch_offset = (fm / _OD) * (_IH + IHPAD) * (_IW + IWPAD) * _ID;
for(int kd = 0; kd < _ID; kd++) // _ID = 3, RGB
{
in_addr = input_batch_offset + (kd + INPUT_START_Z) * (_IH + IHPAD) * (_IW + IWPAD) + (or*K_STRIDE + INPUT_START_Y) * (_IW + IWPAD) + (oc*K_STRIDE + INPUT_START_X) + lid;
// read 19x24 block into registers.
// This is ugly, we really need to fix the programming model.
for(uint reg = 0; reg < 30; reg+=3) {
in[reg] = inputs[in_addr];// read 16 elements
// might be better to adjust the addrs, then do single load.
if(lid < 8) in[reg + 1] = inputs[in_addr + 16];// read 8 elements in lower portion, for total of 24 from input row.
in_addr += (_IW + IWPAD); // move to next row down
if(lid >= 8) in[reg + 1] = inputs[in_addr - 8]; // read 8 elements into upper portion
in[reg + 2] = inputs[in_addr + 8]; // read 16 elements
in_addr += (_IW + IWPAD); // move to next row down
}
float w[5];
int w_idx=0;
w[0] = weights[weight_addr]; weight_addr += SIMD_SIZE; // weights must be stored in just the right SIMD swizzled format for this to work, see host code for details.
w[1] = weights[weight_addr]; weight_addr += SIMD_SIZE;
w[2] = weights[weight_addr]; weight_addr += SIMD_SIZE;
w[3] = weights[weight_addr]; weight_addr += SIMD_SIZE;
w[4] = weights[weight_addr]; weight_addr += SIMD_SIZE;
int kr = 0; // kr = Kernel Row
LOOP(10, kr, // LOOP is a macro that unrolls the loop.
{
int kc = 0; // kc = Kernel Column
LOOP(KERNEL, kc,
{
for(int br=0; br<OUT_BLOCK_HEIGHT; br++) {
for(int bc=0; bc<OUT_BLOCK_WIDTH; bc++) {
//if we fix the programming model, then we could use a nice simple 2d array: val = in[br * K_STRIDE + kr][bc * K_STRIDE + kc];
float val = intel_sub_group_shuffle( in[(((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) / SIMD_SIZE], (((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) & (SIMD_SIZE - 1));
out[br * OUT_BLOCK_WIDTH + bc] = mad(w[w_idx%5], val, out[br * OUT_BLOCK_WIDTH + bc]);
}
}
w[w_idx%5] = weights[weight_addr]; weight_addr += SIMD_SIZE;
w_idx++;
});
});
// last kr loop split in two parts
int kc = 0; // kc = Kernel Column
LOOP(6, kc,
{
for(int br=0; br<OUT_BLOCK_HEIGHT; br++) {
for(int bc=0; bc<OUT_BLOCK_WIDTH; bc++) {
//if we fix the programming model, then we could use a nice simple 2d array: val = in[br * K_STRIDE + kr][bc * K_STRIDE + kc];
float val = intel_sub_group_shuffle( in[(((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) / SIMD_SIZE], (((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) & (SIMD_SIZE - 1));
out[br * OUT_BLOCK_WIDTH + bc] = mad(w[w_idx%5], val, out[br * OUT_BLOCK_WIDTH + bc]);
}
}
w[w_idx%5] = weights[weight_addr]; weight_addr += SIMD_SIZE;
w_idx++;
});
// last 5 kc loops don't prefetch weights
LOOP(5, kc,
{
for(int br=0; br<OUT_BLOCK_HEIGHT; br++) {
for(int bc=0; bc<OUT_BLOCK_WIDTH; bc++) {
//if we fix the programming model, then we could use a nice simple 2d array: val = in[br * K_STRIDE + kr][bc * K_STRIDE + kc];
float val = intel_sub_group_shuffle( in[(((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) / SIMD_SIZE], (((br*K_STRIDE+kr)*24)+(bc * K_STRIDE + kc)) & (SIMD_SIZE - 1));
out[br * OUT_BLOCK_WIDTH + bc] = mad(w[w_idx%5], val, out[br * OUT_BLOCK_WIDTH + bc]);
}
}
w_idx++;
});
}
// write the 4x3 (and 16 feature maps deep) output tile to memory
uint out_addr = OUT_BUFF_OFFSET + fm * (_OW + OWPAD) * (_OH + OHPAD); // out_addr indexes into start of 16 feature maps.
out_addr += or * (_OW + OWPAD) + oc; // offset for the 4x3 block that this workitem is working on;
#ifndef WRITE_PADDED_VALUES
if(get_global_id(0) != (get_global_size(0)-1) &&
get_global_id(1) != (get_global_size(1)-1) )
{
#endif
for(uint r = 0; r < OUT_BLOCK_HEIGHT; r++) {
for(uint c = 0; c < OUT_BLOCK_WIDTH; c++) {
// this does a scattered write to 16 different feature maps, so that data within one map is contiguous, thus ready for input to next layer.
outputs[out_addr + r * (_OW + OWPAD) + c] = activation_function(out[r * OUT_BLOCK_WIDTH + c]);
}
}
#ifndef WRITE_PADDED_VALUES
}else if ( get_global_id(1) != (get_global_size(1)-1) )
{
for(uint r = 0; r < OUT_BLOCK_HEIGHT; r++) {
for(uint c = 0; c < LAST_BLOCK_WIDTH; c++) {
outputs[out_addr + r * (_OW + OWPAD) + c] = activation_function(out[r * OUT_BLOCK_WIDTH + c]);
}
}
}
else if ( get_global_id(0) != (get_global_size(0)-1) )
{
for(uint r = 0; r < LAST_BLOCK_HEIGHT; r++) {
for(uint c = 0; c < OUT_BLOCK_WIDTH; c++) {
outputs[out_addr + r * (_OW + OWPAD) + c] = activation_function(out[r * OUT_BLOCK_WIDTH + c]);
}
}
}
else
{
for(uint r = 0; r < LAST_BLOCK_HEIGHT; r++) {
for(uint c = 0; c < LAST_BLOCK_WIDTH; c++) {
outputs[out_addr + r * (_OW + OWPAD) + c] = activation_function(out[r * OUT_BLOCK_WIDTH + c]);
}
}
}
#endif //#ifndef WRITE_PADDED_VALUES
}
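The lane layout in the question can be reproduced with plain index arithmetic. The 19x24 tile holds 456 floats, and dividing it across 16 SIMD lanes needs ceil(456/16) = 29 private registers per lane, padded to 30 so the load loop (three reads per two input rows) stays simple. This is a sketch of the index math only, not IDLF host code; the names `locate` and `regs_per_lane` are mine:

```cpp
constexpr int SIMD_SIZE = 16;
constexpr int TILE_H = 19, TILE_W = 24;  // input tile read by each work-group

struct Loc { int reg; int lane; };

// Which private register and which SIMD lane hold element `flat` of the
// flattened 19x24 tile. This mirrors the kernel's in[idx / SIMD_SIZE]
// combined with intel_sub_group_shuffle(..., idx & (SIMD_SIZE - 1)).
inline Loc locate(int flat) {
    return { flat / SIMD_SIZE, flat & (SIMD_SIZE - 1) };
}

// ceil(456 / 16) = 29 registers per lane; the kernel pads this to 30.
inline int regs_per_lane() {
    return (TILE_H * TILE_W + SIMD_SIZE - 1) / SIMD_SIZE;
}
```

In other words, the split onto 16 lanes is not a spatial decomposition of the image: every lane computes a different output feature map, and the shuffle lets all 16 work-items read the whole shared 19x24 tile without going back to global memory.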
Also, why are there two sub-convolution layers in AlexNet conv layers 2, 4, and 5? Does this implementation help to improve efficiency, and why?
When I try to run the demo using the command below:
./visual_cloud_demo --config=gpu_caffenet.config
I got this error:
error: regex_error
My gpu_caffenet.config is as follows:
--model=caffenet_float
--device=device_gpu
--batch=32
--input=/home/images/
What does 'regex_error' mean?
Thanks!
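For context: std::regex_error is the exception thrown by the C++ `<regex>` library, which the demo plausibly uses to parse options like `--model=caffenet_float` (an assumption; the demo's actual parsing code isn't shown here). One well-known cause on older Ubuntu systems is that libstdc++ before GCC 4.9 shipped an incomplete `<regex>`, so constructing even a valid std::regex throws regex_error at runtime. The sketch below uses a hypothetical option pattern of my own, not the demo's:

```cpp
#include <regex>
#include <string>

// Hypothetical "--key=value" matcher of the kind a demo's option parser
// might use; the actual pattern in the demo may differ.
inline bool looks_like_option(const std::string& arg) {
    static const std::regex opt("--[A-Za-z_]+=.*");
    return std::regex_match(arg, opt);
}

// std::regex throws std::regex_error when a pattern cannot be compiled.
// With libstdc++ from GCC < 4.9 this also happens for valid patterns,
// which would surface as exactly the "error: regex_error" message above.
inline bool pattern_throws(const std::string& pattern) {
    try {
        std::regex re(pattern);
        return false;
    } catch (const std::regex_error&) {
        return true;
    }
}
```

So the first thing to check is the compiler version used to build the demo: `g++ --version` reporting anything below 4.9 would explain the error regardless of the config file's contents.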