
pipecnn's People

Contributors

dezengzang, doonny, nafest, timgates42, xuke225


pipecnn's Issues

Compiler Error: Unrecognized function call: mult_add_fix8bx4

Hi,
I am using the Intel SDK for OpenCL on an Arria 10 to run PipeCNN (AlexNet). I get the above error when I compile the kernel with this command:
$ aoc device/conv_pipe.cl -o bin_fpga/conv_pipe.aocx --board bdw_fpga_v1.0 -v -g
aoc: Environment checks are completed successfully.
You are now compiling the full flow!!
aoc: Selected target board bdw_fpga_v1.0
aoc: Running OpenCL parser....
In file included from :11140:
:2:30: warning: ISO C99 requires whitespace after the macro name
#define ACL_BOARD_bdw_fpga_v1.0 1
^
:3:31: warning: ISO C99 requires whitespace after the macro name
#define AOCL_BOARD_bdw_fpga_v1.0 1
^
2 warnings generated.
aoc: OpenCL parser completed successfully.
aoc: Compiling....
Compiler Error: Unrecognized function call: mult_add_fix8bx4
Error: Optimizer FAILED.
Refer to conv_pipe/conv_pipe.log for details.

When I run the same design on the emulator, it runs fine and gives the expected output.

Why is the compiler unable to recognise the function mult_add_fix8bx4? Should it be compiled separately?

Thanks,
Akash
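
One possible cause, offered as an assumption rather than a confirmed diagnosis: mult_add_fix8bx4 looks like a function supplied by a separate RTL/OpenCL library rather than by the .cl source itself, which would explain why the emulator accepts the code while the hardware compiler's optimizer cannot resolve the call. If so, the library has to be linked explicitly. A sketch of such a compile command, with the directory and library names (device/RTL, rtl_lib.aoclib) assumed:

$ aoc device/conv_pipe.cl -o bin_fpga/conv_pipe.aocx \
      -I device/RTL -L device/RTL -l rtl_lib.aoclib \
      --board bdw_fpga_v1.0 -v -g

The -I/-L/-l flags are standard aoc options for including and linking OpenCL libraries.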

Inference result not displayed.

Hi, how do I print
"The inference result is n02123045 tabby, tabby cat (the prob is 56.00)." at the end, as shown in your documentation?
Thanks in advance.
Regards,
Ganda.

Floating point and fixed point

Hi Prof. @doonny, I noticed the current version uses fixed-point MACs. How does the resource utilization compare with floating-point MACs?
Also, what is the pipe depth?
#define PIPE_DEPTH 6

Can't find arm32 lib when make host for de1soc

When compiling the de1soc host, the ARM libraries can't be found. Maybe modify Makefile lines 73-74:
$(shell aocl compile-config) --> $(shell aocl compile-config --arm)
$(shell aocl link-config) --> $(shell aocl link-config --arm)
Environment: Windows 64-bit, Quartus 16.1, compiled with the SoC EDS 16.1 Command Shell.
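
For concreteness, a minimal sketch of the corrected Makefile lines (the variable names are assumed; the substance is only the added --arm flag, which makes aocl emit ARM cross-compile and link settings):

# Makefile, around lines 73-74: host build for the ARM HPS
CXXFLAGS += $(shell aocl compile-config --arm)
LDFLAGS  += $(shell aocl link-config --arm)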

Theoretical explanation of the host code.

I am trying to understand the host and FPGA code, but I am unable to connect it with the sources I have available.
Could you suggest some reading on this topic for me and other viewers? Can you provide any link on which you based your code?

Cannot find CONV_GP_SIZE_X

Hi,

I am trying to get this project running on a Cyclone-V SEA5. The configuration you've used is V=8, L=8, GP_X=7, where GP_X is CONV_GP_SIZE_X. But I cannot find CONV_GP_SIZE_X anywhere in the code. Could you tell me where I can set this variable?

Thank you in advance!

Why LRN in software

Is there any particular reason for implementing LRN on the CPU rather than on the FPGA (and gaining the acceleration benefits of the FPGA)?

BSP for DE1-soc

Hi sir,
According to the description, Intel's OpenCL SDK v16.1 is used in this project.
We would like to test-run the program on the Terasic DE1-SoC board,
but we found that only the BSP for Altera SDK OpenCL 16.0 is provided on the official webpage.
Is that the one you used in this project?
And does it work fine with Intel's OpenCL SDK v16.1?

(screenshot attached)

Thank you

vgg16 results wrong

The default network in the code is AlexNet, which I have successfully built and run on a Xilinx KCU1500 board. However, when I modified some code and attempted to change the network to VGG-16, the project built successfully but produced wrong results. I am a beginner in CNNs and OpenCL, and I could use some guidance on how to change the code from the default AlexNet to VGG-16; I can't find any documentation on this in the readme or the user instructions.

This is the AlexNet result, which seems correct:

***************************************************
PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs 
***************************************************

61063552 total weights read 
154587 bytes image read 
1024 total output reference read 


Platform: Xilinx
Using 1 device(s)
  Device 0: xilinx:kcu1500:4ddr-xpr:4.0
Device OpenCL Version: OpenCL 1.0
Device Max Compute Units: 1
Device Max WorkGroup Size: 4096
Device Max WorkItem Size: 4096
Device Global Memory Size: 16384 MBytes
Device Local Memory Size: 16 KBytes
Device Max Clock Freq: 500 Mhz

Loading kernel/binary from file conv.xclbin
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy

Executing Layer 1:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching single work-item kernel Pooling

Launching kernel MemWr with local size: 1, 1, 16  (global size: 27, 27, 96)

Launching kernel lrn with local size: 1, 1, 24  (global size: 27, 27, 24)

Executing Layer 2:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching single work-item kernel Pooling

Launching kernel MemWr with local size: 1, 1, 16  (global size: 13, 13, 256)

Launching kernel lrn with local size: 1, 1, 64  (global size: 13, 13, 64)

Executing Layer 3:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 13, 13, 384)

Executing Layer 4:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 13, 13, 384)

Executing Layer 5:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching single work-item kernel Pooling

Launching kernel MemWr with local size: 1, 1, 16  (global size: 6, 6, 256)

Executing Layer 6:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 1, 1, 4096)

Executing Layer 7:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 1, 1, 4096)

Executing Layer 8:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 1, 1, 1024)

Copyed all batched results from fc_2 buffers.

Done !!!


-------------------

Performance Summary

Total runtime: 1.050996s 

Kernel runtime summary:
  Layer-1:
    MemRd: 59.144 ms
    Conv : 58.641 ms
    Pool : 58.187 ms
    MemWr: 56.728 ms
    Lrn  : 381.921 ms
  Layer-2:
    MemRd: 81.765 ms
    Conv : 81.385 ms
    Pool : 80.876 ms
    MemWr: 80.793 ms
    Lrn  : 336.314 ms
  Layer-3:
    MemRd: 18446744071709.031 ms
    Conv : 51.617 ms
    Pool : 0.000 ms
    MemWr: 51.168 ms
    Lrn  : 0.000 ms
  Layer-4:
    MemRd: 18446744071656.164 ms
    Conv : 18446744071656.809 ms
    Pool : 0.000 ms
    MemWr: 39.138 ms
    Lrn  : 0.000 ms
  Layer-5:
    MemRd: 26.615 ms
    Conv : 26.061 ms
    Pool : 26.632 ms
    MemWr: 25.660 ms
    Lrn  : 0.000 ms
  Layer-6:
    MemRd: 27.584 ms
    Conv : 27.147 ms
    Pool : 0.000 ms
    MemWr: 26.590 ms
    Lrn  : 0.000 ms
  Layer-7:
    MemRd: 18446744071562.098 ms
    Conv : 18446744071561.719 ms
    Pool : 0.000 ms
    MemWr: 11.994 ms
    Lrn  : 0.000 ms
  Layer-8:
    MemRd: 3.911 ms
    Conv : 3.548 ms
    Pool : 0.000 ms
    MemWr: 4.082 ms
    Lrn  : 0.000 ms

Total kernel runtime 36893488147419.102 ms 
Batch size = 1, average process time per batch: 36893488147419.102 ms 

Start verifying results ...
Selected item = 0 from the combined batch results in fc buffers

Check Pass !!!

The inference result is n02123045 tabby, tabby ca   (the prob is 56.00)

And this is the VGG-16 result, which is obviously wrong:

***************************************************
PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs 
***************************************************

138455872 total weights read 
150528 bytes image read 
1024 total output reference read 


Platform: Xilinx
Using 1 device(s)
  Device 0: xilinx:kcu1500:4ddr-xpr:4.0
Device OpenCL Version: OpenCL 1.0
Device Max Compute Units: 1
Device Max WorkGroup Size: 4096
Device Max WorkItem Size: 4096
Device Global Memory Size: 16384 MBytes
Device Local Memory Size: 16 KBytes
Device Max Clock Freq: 500 Mhz

Loading kernel/binary from file conv.xclbin
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy
WARNING: unaligned host pointer detected, this leads to extra memcpy

Executing Layer 1:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 224, 224, 64)

Executing Layer 2:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching single work-item kernel Pooling

Launching kernel MemWr with local size: 1, 1, 16  (global size: 112, 112, 64)

Executing Layer 3:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 112, 112, 128)

Executing Layer 4:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching single work-item kernel Pooling

Launching kernel MemWr with local size: 1, 1, 16  (global size: 56, 56, 128)

Executing Layer 5:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 56, 56, 256)

Executing Layer 6:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 56, 56, 256)

Executing Layer 7:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching single work-item kernel Pooling

Launching kernel MemWr with local size: 1, 1, 16  (global size: 28, 28, 256)

Executing Layer 8:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 28, 28, 512)

Executing Layer 9:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 28, 28, 512)

Executing Layer 10:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching single work-item kernel Pooling

Launching kernel MemWr with local size: 1, 1, 16  (global size: 14, 14, 512)

Executing Layer 11:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 14, 14, 512)

Executing Layer 12:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 14, 14, 512)

Executing Layer 13:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching single work-item kernel Pooling

Launching kernel MemWr with local size: 1, 1, 16  (global size: 7, 7, 512)

Executing Layer 14:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 1, 1, 4096)

Executing Layer 15:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 1, 1, 4096)

Executing Layer 16:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16  (global size: 1, 1, 1024)

Copyed all batched results from fc_2 buffers.

Done !!!


-------------------

Performance Summary

Total runtime: 6.911966s 

Kernel runtime summary:
  Layer-1:
    MemRd: 131.136 ms
    Conv : 130.630 ms
    Pool : 0.000 ms
    MemWr: 128.416 ms
    Lrn  : 0.000 ms
  Layer-2:
    MemRd: 18446744067806.332 ms
    Conv : 18446744067805.941 ms
    Pool : 18446744067805.469 ms
    MemWr: 840.861 ms
    Lrn  : 0.000 ms
  Layer-3:
    MemRd: 435.174 ms
    Conv : 434.800 ms
    Pool : 0.000 ms
    MemWr: 435.343 ms
    Lrn  : 0.000 ms
  Layer-4:
    MemRd: 18446744066528.754 ms
    Conv : 821.526 ms
    Pool : 821.065 ms
    MemWr: 820.978 ms
    Lrn  : 0.000 ms
  Layer-5:
    MemRd: 409.369 ms
    Conv : 409.873 ms
    Pool : 0.000 ms
    MemWr: 409.000 ms
    Lrn  : 0.000 ms
  Layer-6:
    MemRd: 18446744065296.562 ms
    Conv : 807.734 ms
    Pool : 0.000 ms
    MemWr: 807.327 ms
    Lrn  : 0.000 ms
  Layer-7:
    MemRd: 802.170 ms
    Conv : 802.702 ms
    Pool : 802.189 ms
    MemWr: 801.713 ms
    Lrn  : 0.000 ms
  Layer-8:
    MemRd: 18446744063685.164 ms
    Conv : 18446744063684.770 ms
    Pool : 0.000 ms
    MemWr: 388.292 ms
    Lrn  : 0.000 ms
  Layer-9:
    MemRd: 775.116 ms
    Conv : 774.742 ms
    Pool : 0.000 ms
    MemWr: 775.259 ms
    Lrn  : 0.000 ms
  Layer-10:
    MemRd: 775.115 ms
    Conv : 774.698 ms
    Pool : 774.239 ms
    MemWr: 774.151 ms
    Lrn  : 0.000 ms
  Layer-11:
    MemRd: 183.018 ms
    Conv : 182.621 ms
    Pool : 0.000 ms
    MemWr: 182.134 ms
    Lrn  : 0.000 ms
  Layer-12:
    MemRd: 183.014 ms
    Conv : 182.632 ms
    Pool : 0.000 ms
    MemWr: 182.164 ms
    Lrn  : 0.000 ms
  Layer-13:
    MemRd: 182.661 ms
    Conv : 182.243 ms
    Pool : 181.786 ms
    MemWr: 181.401 ms
    Lrn  : 0.000 ms
  Layer-14:
    MemRd: 18446744061195.703 ms
    Conv : 80.474 ms
    Pool : 0.000 ms
    MemWr: 86.381 ms
    Lrn  : 0.000 ms
  Layer-15:
    MemRd: 13.703 ms
    Conv : 14.216 ms
    Pool : 0.000 ms
    MemWr: 13.308 ms
    Lrn  : 0.000 ms
  Layer-16:
    MemRd: 18446744061093.504 ms
    Conv : 18446744061093.090 ms
    Pool : 0.000 ms
    MemWr: 3.383 ms
    Lrn  : 0.000 ms

Total kernel runtime 55340232221128.656 ms 
Batch size = 1, average process time per batch: 55340232221128.656 ms 

Start verifying results ...
Selected item = 0 from the combined batch results in fc buffers
Item=0 is wrong (result=-3.000000, golden_ref=-6.000000)
Item=1 is wrong (result=0.000000, golden_ref=3.000000)
Item=2 is wrong (result=-4.000000, golden_ref=-8.000000)
Item=3 is wrong (result=-4.000000, golden_ref=-9.000000)
Item=4 is wrong (result=-1.000000, golden_ref=-5.000000)
Item=5 is wrong (result=-4.000000, golden_ref=-1.000000)
Item=6 is wrong (result=-3.000000, golden_ref=-12.000000)
Item=7 is wrong (result=2.000000, golden_ref=7.000000)
Item=8 is wrong (result=-1.000000, golden_ref=10.000000)
Totally 946 Wrong Results

a problem about running the project in sdaccel gui

I put image.dat, weights.dat and fc8.dat in the data folder, then built and ran the project in CPU emulation mode. The build finishes successfully. However, when I run the exe file, it finishes very quickly and no errors occur. The console output is very short:

***************************************************
PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs 
***************************************************

61063552 total weights read 
154587 bytes image read 

I'm using the SDAccel 2017.2 GUI mode. Why is the output log so short? I don't see any output files created after I run the project.

DTYPE and MACTYPE

Hi, thank you for your project.
What kind of difference should there be between DTYPE and MACTYPE? In your example they are char (8 bit) and int (32 bit).
If I want to use short (16 bit) as DTYPE, which MACTYPE should I choose?
long (64 bit)? Or do I need to round and truncate immediately after every multiplication?
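
For context, a minimal sketch of how the two types usually relate, assuming MACTYPE only has to hold the running sum of products without overflowing (the widths are illustrative, not the project's prescription):

typedef short DTYPE;    /* 16-bit operands (weights, activations) */
typedef int   MACTYPE;  /* accumulator for the products           */

/* A 16x16-bit product needs up to 32 bits; summing K such products
   needs roughly 32 + ceil(log2(K)) bits, so either widen MACTYPE or
   truncate/round inside the loop when overflow is possible. */
MACTYPE mac(const DTYPE *a, const DTYPE *b, int k) {
    MACTYPE sum = 0;
    for (int i = 0; i < k; i++)
        sum += (MACTYPE)a[i] * (MACTYPE)b[i];  /* widen before multiplying */
    return sum;
}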

How to handle data is not multiple of VEC_SIZE?

Q1:
How is data handled when it is not a multiple of VEC_SIZE?
Take AlexNet for example:
conv1 has 11x11x3, which can't be divided by VEC_SIZE = 4.
So in the MAC operation a1xb1+a2xb2+a3xb3+a4xb4, the last group wouldn't have a3 and a4; are these automatically assigned 0?

Q2:
It also looks like data_vec reads bottom linearly in chunks of VEC_SIZE?
The PipeCNN paper describes the weights as divided into groups of VEC_SIZE along the Z direction.
E.g., for a 3x3x4 weight I should have VEC_SIZE (4) groups of weights along Z, each with 3x3x1 = 9 values:
0, 1, 2, 3, ... , 8
9,10,11,12, ... 17
18,...
27,...

But in the algorithm, you group data into data_vec linearly:
{0, +1, +2, +3}, {+4, +5, +6, +7}, ..., +35
How does this divide the weights along the Z direction?

for(unsigned short win_itm_z=0; win_itm_z<weight_dim3/VEC_SIZE; win_itm_z++){
	for(unsigned char  win_itm_y=0; win_itm_y<win_size_y; win_itm_y++){
		for(unsigned char  win_itm_x=0; win_itm_x<win_size_x; win_itm_x++){
			feature_idx_dim1 = win_itm_x;
			feature_idx_dim2 = win_itm_y;
			feature_idx_dim3 = win_itm_z;
			if(xy is at correct location){	
				data_vec = bottom[data_offset*data_dim1xdim2 + feature_idx_dim3*data_dim1xdim2 + (feature_idx_dim2-padding)*data_dim1 + (feature_idx_dim1-padding)];
			}
			else{
				#pragma unroll
				for(unsigned char vv=0; vv<VEC_SIZE; vv++){
					data_vec.data[vv] = CZERO;
				}
			}	
			// start from using buffer[0]
			win_buffer[0][win_itm_z*win_size_y*win_size_x + win_itm_y*win_size_x + win_itm_x] = data_vec;
		}
	}
}
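
Regarding Q1, one common approach (a sketch under the assumption that the host pads the buffers, not a statement of what PipeCNN actually does) is to zero-pad the channel dimension up to a multiple of VEC_SIZE when the data and weights are prepared, so the surplus MAC lanes contribute a*0 = 0:

/* Hypothetical host-side padding; dim3 is the true channel depth. */
int padded_dim3 = ((dim3 + VEC_SIZE - 1) / VEC_SIZE) * VEC_SIZE;
for (int z = 0; z < padded_dim3; z++)
    for (int i = 0; i < dim1 * dim2; i++)
        padded[z * dim1 * dim2 + i] =
            (z < dim3) ? weights[z * dim1 * dim2 + i] : 0;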

Can you explain why did you use "accum_piped"

Hi, Dong
I have one more question.
Can you explain why you used "accum_piped" in this code?
Why is PIPE_DEPTH = 6?

for(unsigned char ll=0; ll<LANE_NUM; ll++){

	lane_accum[ll] = (MASK_ACCUM & accum_piped[ll][PIPE_DEPTH-1]) + (MASK_MULT & mac(mac_data.lane[ll], mac_weight.lane[ll]));

	// Shift the pipelined registers backwards
	#pragma unroll
	for(unsigned int p=PIPE_DEPTH-1; p>0; p--){
		accum_piped[ll][p] = MASK_ACCUM & accum_piped[ll][p-1];
	}

	// update the first copy
	accum_piped[ll][0] = MASK_ACCUM & lane_accum[ll];

}
Thank you
Best regards
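
For readers with the same question: keeping PIPE_DEPTH partial sums in a shift register is the standard way to break the loop-carried dependency on a single accumulator, which lets the compiler pipeline the MAC loop with an initiation interval of 1; the depth (6 here) should match the latency of the accumulate path. Such a scheme needs a final reduction after the main loop, sketched below (modeled on the pattern above, not copied from the project):

/* Collapse the PIPE_DEPTH interleaved partial sums into one result. */
MACTYPE accum = 0;
#pragma unroll
for (unsigned char p = 0; p < PIPE_DEPTH; p++)
    accum += accum_piped[ll][p];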

Xilinx SDSoc: build error

Hi!
I encountered the following problems when I build Debug for my project on a zcu102
(system configuration: A53 OpenCL Linux;
runtime: OpenCL),
with SDx 2017.4 (SDSoC available, SDAccel not available).

I also found that pipe.cl already existed before running pipe_gen.py, though I still ran the script with arguments (16 8).
This was my first time using the SDx kit; thank you for your help!
---- error part ---- (the warning part has been marked in bold text)

21:19:17**** Incremental Build of configuration Debug for project pcnn ****

make -j40 incremental
/opt/Xilinx/SDX/SDK/2017.4/gnu/aarch64/lin/aarch64-linux/bin/aarch64-linux-gnu-g++ -DSDX_PLATFORM=zcu102 -D__USE_XOPEN2K8 -I/opt/Xilinx/SDX/SDx/2017.4/runtime/include/1_2/ -I/opt/Xilinx/SDX/Vivado/2017.4/include/ -O2 -g -Wall -c -fmessage-length=0 -std=c++14 -DSDX_PLATFORM=zcu102 -D__USE_XOPEN2K8 -I/opt/Xilinx/SDX/SDx/2017.4/runtime/include/1_2/ -I/opt/Xilinx/SDX/Vivado/2017.4/include/ -O2 -g -Wall -c -fmessage-length=0 -o "src/common/ocl_util.o" "../src/common/ocl_util.cpp"
/opt/Xilinx/SDX/SDK/2017.4/gnu/aarch64/lin/aarch64-linux/bin/aarch64-linux-gnu-g++ -DSDX_PLATFORM=zcu102 -D__USE_XOPEN2K8 -I/opt/Xilinx/SDX/SDx/2017.4/runtime/include/1_2/ -I/opt/Xilinx/SDX/Vivado/2017.4/include/ -O2 -g -Wall -c -fmessage-length=0 -std=c++14 -DSDX_PLATFORM=zcu102 -D__USE_XOPEN2K8 -I/opt/Xilinx/SDX/SDx/2017.4/runtime/include/1_2/ -I/opt/Xilinx/SDX/Vivado/2017.4/include/ -O2 -g -Wall -c -fmessage-length=0 -o "src/common/timer.o" "../src/common/timer.cpp"
/opt/Xilinx/SDX/SDK/2017.4/gnu/aarch64/lin/aarch64-linux/bin/aarch64-linux-gnu-g++ -DSDX_PLATFORM=zcu102 -D__USE_XOPEN2K8 -I/opt/Xilinx/SDX/SDx/2017.4/runtime/include/1_2/ -I/opt/Xilinx/SDX/Vivado/2017.4/include/ -O2 -g -Wall -c -fmessage-length=0 -std=c++14 -DSDX_PLATFORM=zcu102 -D__USE_XOPEN2K8 -I/opt/Xilinx/SDX/SDx/2017.4/runtime/include/1_2/ -I/opt/Xilinx/SDX/Vivado/2017.4/include/ -O2 -g -Wall -c -fmessage-length=0 -o "src/project/host/main.o" "../src/project/host/main.cpp"

../src/project/host/main.cpp:21:22: fatal error: ocl_util.h: No such file or directory

#include "ocl_util.h"
^
compilation terminated.

**make: *** [src/project/host/main.o] Error 1

make: *** Waiting for unfinished jobs....**
../src/common/ocl_util.cpp: In function ‘_cl_program* ocl_util::createProgramFromFile(cl_context, const char*, _cl_device_id* const*, unsigned int)’:
../src/common/ocl_util.cpp:410:22: warning: ignoring attributes on template argument ‘cl_int {aka int}’ [-Wignored-attributes]
scoped_array<cl_int> binary_status(num_devices);
^

21:19:18 Build Finished (took 940ms)

Regards!

Compile error on Intel FPGA SDK for OpenCL.

This source is supposed to compile with the Intel FPGA SDK for OpenCL, but I am getting the following errors for an Arria 10. I am using the Intel(R) FPGA SDK for OpenCL(TM) 64-Bit Offline Compiler,
Version 17.0.2 Build 297.

Error:

error: function 'read_channel_altera' is not supported by the Intel(R) FPGA SDK for OpenCL(TM), and no user definition is provided

error: function 'write_channel_altera' is not supported by the Intel(R) FPGA SDK for OpenCL(TM), and no user definition is provided
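
A hedged pointer rather than a confirmed fix: newer SDK releases renamed the Altera channel built-ins to Intel-named equivalents, so sources written against the old names no longer resolve. Aliasing the legacy names is a minimal workaround:

/* Enable the channels extension and map the old names onto the new ones. */
#pragma OPENCL EXTENSION cl_intel_channels : enable
#define read_channel_altera(ch)        read_channel_intel(ch)
#define write_channel_altera(ch, val)  write_channel_intel(ch, val)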

Board Performance

Regarding the statement that PipeCNN has been tested on the following boards:

  • Terasic's DE5-net (Stratix-V A7 FPGA)
  • Terasic's DE5a-net (Arria-10 1150 FPGA)
  • Terasic's DE1-soc (Cyclone-V SEA5 FPGA)
  • Terasic's DE10-standard (Cyclone-V SXC6 FPGA)
  • Xilinx's KCU1500 (XCKU115 FPGA)

May I know where I can get the performance and cost information for these boards, as only the performance of the DE5-net is listed in the paper?
Thank You

OpenCL runtime error

Dear doonny,

I'm very interested in testing your PipeCNN on my Zynq UltraScale+ ZCU102.
I have compiled the source code with Xilinx SDSoC v2017.1 using the zcu102_es1_ocl platform; before launching PipeCNN I issued these commands:

cd /mnt
cp libxilinxopencl.so /usr/lib
export XILINX_OPENCL=/mnt

(libxilinxopencl.so is the opencl library for aarch64), then for launching the CNN:

./PipeCNN.elf conv.aocx

and finally the output is:

61063552 total weights read
154587 bytes image read
1024 total output reference read

ERROR: No device found
ERROR: CL_DEVICE_NOT_FOUND

Could you give me some help?

Thanks in advance

How to reduce RAMB utilization

Dear doonny,
I'm trying to test the PipeCNN framework on some Xilinx FPGA embedded boards to take power measurements. At the moment I would like to compile the framework for the Digilent ZedBoard, but the synthesized design is too large for this FPGA; the XOCC compiler returns this error:

297 RAMB18 and RAMB36/FIFO required but only 280

Could you give me some hints for reducing the BRAM utilization?

Thank You

weights file does not exist

Hello, after building, running ./run.exe conv.aocx reports an error saying the weights file cannot be found. The model files you provided are already in the data directory, and I also tried changing the path in main.cpp to an absolute path, but the file still cannot be found.
Hoping for your reply, thank you.

Build failing for Xilinx flow (hardware accelerator integration stage)

When I build, I get the following errors during the hardware accelerator integration stage.

INFO: [XOCC 60-251]   Hardware accelerator integration...

===>The following messages were generated while processing /PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.sim/sim_1/behav :
ERROR: [XOCC 10-426] cannot find port pool_ch15_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:736]
ERROR: [XOCC 10-426] cannot find port pool_ch15_TVALID on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:735]
ERROR: [XOCC 10-426] cannot find port pool_ch15_TDATA on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:734]
ERROR: [XOCC 10-426] cannot find port pool_ch14_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:733]
ERROR: [XOCC 10-426] cannot find port pool_ch14_TVALID on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:732]
ERROR: [XOCC 10-426] cannot find port pool_ch14_TDATA on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:731]
ERROR: [XOCC 10-426] cannot find port pool_ch13_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:730]
ERROR: [XOCC 10-426] cannot find port pool_ch13_TVALID on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:729]
ERROR: [XOCC 10-426] cannot find port pool_ch13_TDATA on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:728]
ERROR: [XOCC 10-426] cannot find port pool_ch12_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:727]
ERROR: [XOCC 10-426] cannot find port pool_ch12_TVALID on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:726]
ERROR: [XOCC 10-426] cannot find port pool_ch12_TDATA on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:725]
ERROR: [XOCC 10-426] cannot find port pool_ch11_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:724]
ERROR: [XOCC 10-426] cannot find port pool_ch11_TVALID on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:723]
ERROR: [XOCC 10-426] cannot find port pool_ch11_TDATA on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:722]
ERROR: [XOCC 10-426] cannot find port pool_ch10_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:721]
ERROR: [XOCC 10-426] cannot find port pool_ch10_TVALID on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:720]
ERROR: [XOCC 10-426] cannot find port pool_ch10_TDATA on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:719]
ERROR: [XOCC 10-426] cannot find port pool_ch9_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:718]
ERROR: [XOCC 43-3322] Static elaboration of top level Verilog design unit(s) in library work failed
ERROR: [XOCC 60-399] vivado failed, please see log file for detail: '/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/vivado.log'
ERROR: [XOCC 60-626] Kernel link failed to complete
ERROR: [XOCC 60-703] Failed to finish linking
make: *** [pipecnn.xclbin] Error 1

21:27:33 Build Finished (took 10m:11s.280ms)

Can not get the correct result

Dear Prof. Wang,

We have tried to run your code on an Altera DE5a_net_e1 FPGA board; unfortunately, we cannot get the correct result. The result is random every time: sometimes it is a fox, sometimes Cardigan or Pomeranian. Could you please help me figure out what went wrong? Thank you so much.

Best regards!

[root@dhcp70 project]# ./run.exe conv.aocx


PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs


61063552 total weights read

Loading picture ./data/picture/cat.jpg .....

1024 total output reference read

Platform: Intel(R) FPGA SDK for OpenCL(TM)
Using 1 device(s)
Device 0: de5a_net_e1 : Arria 10 Reference Platform (aclde5a_net_e10)
Device OpenCL Version: OpenCL 1.0 Intel(R) FPGA SDK for OpenCL(TM), Version 16.1
Device Max Compute Units: 1
Device Max WorkGroup Size: 2147483647
Device Max WorkItem Size: 2147483647
Device Global Memory Size: 8192 MBytes
Device Local Memory Size: 16 KBytes
Device Max Clock Freq: 1000 Mhz

Loading kernel/binary from file conv.aocx
Reprogramming device [0] with handle 1

Executing Layer 1:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching single work-item kernel Pooling

Launching kernel MemWr with local size: 1, 1, 16 (global size: 27, 27, 96)

Launching kernel lrn with local size: 1, 1, 24 (global size: 27, 27, 24)

Executing Layer 2:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching single work-item kernel Pooling

Launching kernel MemWr with local size: 1, 1, 16 (global size: 13, 13, 256)

Launching kernel lrn with local size: 1, 1, 64 (global size: 13, 13, 64)

Executing Layer 3:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16 (global size: 13, 13, 384)

Executing Layer 4:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16 (global size: 13, 13, 384)

Executing Layer 5:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching single work-item kernel Pooling

Launching kernel MemWr with local size: 1, 1, 16 (global size: 6, 6, 256)

Executing Layer 6:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16 (global size: 1, 1, 4096)

Executing Layer 7:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16 (global size: 1, 1, 4096)

Executing Layer 8:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 16 (global size: 1, 1, 1024)

Copyed all batched results from fc_2 buffers.

Done !!!


Performance Summary

Total runtime: 0.057614s

Kernel runtime summary:
Layer-1:
MemRd: 8.850 ms
Conv : 8.819 ms
Pool : 8.813 ms
MemWr: 8.794 ms
Lrn : 0.643 ms
Layer-2:
MemRd: 14.013 ms
Conv : 13.992 ms
Pool : 13.987 ms
MemWr: 13.969 ms
Lrn : 0.243 ms
Layer-3:
MemRd: 9.407 ms
Conv : 9.365 ms
Pool : 0.000 ms
MemWr: 9.360 ms
Lrn : 0.000 ms
Layer-4:
MemRd: 7.080 ms
Conv : 7.057 ms
Pool : 0.000 ms
MemWr: 7.044 ms
Lrn : 0.000 ms
Layer-5:
MemRd: 4.782 ms
Conv : 4.751 ms
Pool : 4.748 ms
MemWr: 4.735 ms
Lrn : 0.000 ms
Layer-6:
MemRd: 2.583 ms
Conv : 2.547 ms
Pool : 0.000 ms
MemWr: 2.551 ms
Lrn : 0.000 ms
Layer-7:
MemRd: 1.223 ms
Conv : 1.199 ms
Pool : 0.000 ms
MemWr: 1.193 ms
Lrn : 0.000 ms
Layer-8:
MemRd: 0.331 ms
Conv : 0.286 ms
Pool : 0.000 ms
MemWr: 0.290 ms
Lrn : 0.000 ms

Total kernel runtime 48.018 ms
Batch size = 1, average process time per batch: 48.018 ms

Start verifying results ...
Selected item = 0 from the combined batch results in fc buffers

The inference result is n02119022 red fox, Vulpes vulpe (the prob is 70.00)

Memory bank mismatch

I am facing the following error when executing PipeCNN on AWS F1.


Loading kernel/binary from file cnnf1_pythonpipe2.awsxclbin
ERROR: ERROR: Memory bank specified for kernel instance "memRead_1" of kernel "memRead" for argument index 21 does not match the physical connectivity from the binary.
Bank specified on host side is "M01_AXI" while bank from the binary is "M00_AXI".

ERROR: clSetKernelArg() for kernel "memRead", argument index 21.

ERROR: CL_INVALID_MEM_OBJECT 
Location: ../src/host/main.cpp:730
Failed to set argument 21 kernel memRd

Was I supposed to set any parameter?
(the full output was attached to the original issue)

platform error

The following error occurs in SDAccel (SDK v2017.4) after executing the makefile under the project folder.
I think the solution is to change 'xilinx:kcu1500:4ddr-xpr:4.0' to 'xilinx_vcu1525_dynamic_5_0',
but I am not sure whether PipeCNN's environment and behavior remain valid.

* Error log ----------------------------------------------
ERROR: [XOCC 60-705] No device was found that matches 'xilinx:kcu1500:4ddr-xpr:4.0'. The supported devices are:
xilinx_vcu1525_dynamic_5_0
xilinx_kcu1500_dynamic_5_0

ERROR: [XOCC 60-587] Failed to add a device: specified platform xilinx:kcu1500:4ddr-xpr:4.0 is not found
Makefile:151: recipe for target 'conv.xclbin' failed
make: *** [conv.xclbin] Error 1
------------------------------------------------------------

MemRead

Hi Professor @doonny, I read your paper and I'm still confused about how MemRd works. Could you give some pointers for understanding this kernel (memRd)? Thank you in advance.

error: Channel support is not enabled

Hi, I compiled the project with the Intel OpenCL SDK for FPGA 17.1 and get the following error:
In file included from /home/wangjf/PipeCNN/project/__all_sources.cl:2:
PipeCNN/project/device/conv_pipe.cl:75:24: error: Channel support is not enabled
channel channel_vec data_ch __attribute__((depth(0)));

Anyone got an idea? Thanks in advance!!
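
One hedged suggestion for the same message: channel support must be enabled by an extension pragma before the first channel declaration, so check that the source (or a compiler flag) enables it. With SDK 17.1 the Intel-named extension applies:

#pragma OPENCL EXTENSION cl_intel_channels : enable

channel channel_vec data_ch __attribute__((depth(0)));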

SDAccel compilation

conv_pipe.cl depends heavily on Altera-specific extensions (write_channel_altera, read_channel_altera), so it can't be compiled for use with Xilinx FPGAs.

ERROR: CL_COMPILER_NOT_AVAILABLE

Hello! I compiled the project in emulator mode and get the following error when I run "run.exe conv.aocx" with the AlexNet data:

run.exe conv.aocx
***************************************************
PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs
***************************************************

61063552 total weights read
154587 bytes image read
1024 total output reference read


Platform: Intel(R) FPGA SDK for OpenCL(TM)
Using 1 device(s)
  Device 0: EmulatorDevice : Emulated Device
Device OpenCL Version: OpenCL 1.0 Intel(R) FPGA SDK for OpenCL(TM), Version 17.0
Device Max Compute Units: 1
Device Max WorkGroup Size: 2147483647
Device Max WorkItem Size: 2147483647
Device Global Memory Size: 2048 MBytes
Device Local Memory Size: 16 KBytes
Device Max Clock Freq: 1000 Mhz

Loading kernel/binary from file conv.aocx
ERROR: CL_COMPILER_NOT_AVAILABLE
Location: ../common/ocl_util.cpp:429
Failed to build program with source

Environment: Windows 10, MinGW-w64, Arria 10 board, Intel OpenCL SDK for FPGA 17.0, MSVC 12.0

Anyone got an idea? Thanks in advance!!

Fixed-point model numbers

Dear Prof. Wang,
For the quantized parameters, you used an (n, m) pair to denote the precision. For example, in the VGG-16 first layer you used (8,7), (8,0), (8,-2) to denote frac_w, frac_input and frac_output, and for the last FC layer you used (8,2), (8,2), (4,7). Are there any rules or constraints on how to decide these numbers, or can any numbers be used? Can I change them to other numbers? If I change the fraction numbers and convert a new model, will it still work? Thank you so much.
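
To make the notation concrete, here is a minimal sketch of (n, m) fixed-point quantization, assuming the convention that an n-bit signed integer q represents the real value q * 2^(-m) (the exact convention used by the model converter may differ):

#include <math.h>

/* Quantize x to an n-bit signed integer with m fractional bits,
   i.e. x ~= q * 2^(-m). The return type limits this sketch to n <= 8. */
signed char quantize(float x, int n, int m) {
    long q  = lroundf(x * exp2f((float)m)); /* scale by 2^m          */
    long lo = -(1L << (n - 1));             /* saturate to the n-bit */
    long hi =  (1L << (n - 1)) - 1;         /* signed range          */
    if (q < lo) q = lo;
    if (q > hi) q = hi;
    return (signed char)q;
}

Under this reading, a larger m gives finer resolution but a smaller representable range, which is why the per-layer (frac_w, frac_input, frac_output) values must track each layer's dynamic range.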

maxPool

The current maxpool has a problem when the kernel runs with the following parameters:
- size_x and size_y are odd,
- and stride = 1.
In that case maxpool unloads less data than needed.

make: *** [conv.aocx] Error 1

Hi, I compiled the project for the de10-standard, but the fitter reports that there are not enough LABs. What's the solution?
aoc: First stage compilation completed successfully.
Compiling for FPGA. This process may take a long time, please be patient.
Error (170012): Fitter requires 4243 LABs to implement the design, but the device contains only 4191 LABs
Error: Cannot fit kernel(s) on device
Makefile:135: recipe for target 'conv.aocx' failed
make: *** [conv.aocx] Error 1

Thanks

HW Configuration for DE1SOC to reduce "Logic utilization"

Dear Prof. Wang,

I use a De1-SoC and I changed

VEC_SIZE  = 8
LANE_NUM = 8
CONV_GP_SIZE_X = 7, as per the user instructions,

and
PLATFORM = arm32 and FLOW = hw.

Its Estimated Resource Usage Summary then shows
Logic utilization = 111%

and I got the following error, even though the device's resources are not fully used up:
kernel cannot fit into device

My question is: is there any other way to reduce "Logic utilization", apart from reducing LANE_NUM?

Thank You.

make: aocl: command not found

Dear Prof. Wang,
I installed Xilinx's SDAccel, but it reports "make: aocl: command not found". Can this OpenCL application run on SDAccel?

Error message by SDAccel 2017.4

  1. The platform needs to be renamed to xilinx_kcu1500_dynamic_5_0 for SDAccel 2017.4.

  2. A lot of warnings like this "device/conv_pipe_xilinx.cl:680:708: warning: double precision constant requires cl_khr_fp64, casting to single precision"

  3. Finally an error message:
    ERROR: [XOCC 60-896] For unified platforms, please use -c or -l
    ERROR: [XOCC 60-598] Kernel build setup failed to complete
    ERROR: [XOCC 60-702] Failed to finish compilation and linking
    Makefile:142: recipe for target 'conv.xclbin' failed
    make: *** [conv.xclbin] Error 1

HW Configuration for DE1SOC

Hi, according to the instructions, the best result for the AlexNet model on the De1-SoC is 149 ms.
However, the best result I got is only around 450 ms, with the following hw configuration:

  • VEC_SIZE = 8
  • LANE_NUM = 4
  • CONV_GP_SIZE_X = 7

I did try to increase LANE_NUM to 8, but I got the following error even though the device's resources are not fully used up:

"kernel cannot fit into device"

Could you kindly share with us the appropriate hw configuration for VEC_SIZE, LANE_NUM, and CONV_GP_SIZE_X?

Thank you

What should I change in the code when I want to run CPU-Emulation for VGG16 in sdaccel

I uncommented the code for VGG16 and commented out the code for AlexNet in layer_config.h and main.cpp.
The following is what I changed in main.cpp:

// AlexNet
// Original problem size
// File size is in num of DTYPE numbers
//#define IMAGE_FILE_SIZE   (227*227*3)
////#define WEIGHTS_FILE_SIZE 60965224 //fc8-1000
//#define WEIGHTS_FILE_SIZE 61063552  //fc8-1024
//#define LAYER_NUM         8
//#define CONV_NUM          5
//const char *weight_file_path = "./data/data_alex/weights.dat";
//const char *input_file_path = "./data/data_alex/image.dat";
//const char *ref_file_path = "./data/data_alex/fc8.dat";
//const char *dump_file_path = "./result_dump.txt";


// VGG16
// Original problem size
// File size is in num of DTYPE numbers
#define IMAGE_FILE_SIZE   (224*224*3)
#define WEIGHTS_FILE_SIZE 138455872  //fc8-1024
#define LAYER_NUM         16
#define CONV_NUM          13

const char *weight_file_path = "./data/data_vgg/weights.dat";
const char *input_file_path = "./data/data_vgg/image.dat";
const char *ref_file_path = "./data/data_vgg/fc8.dat";
const char *dump_file_path = "./result_dump.txt";

The following is what I changed in layer_config.h:

// Test with batch=1
// Alexnet Configuration
/*
unsigned layer_config[][NUM_CONFIG_ITEM] = {{0,
							227, 227, 3, 11, 11, 3, 96, 96,
							0,
							55, 55, 96, 4, 0, 0, 1,
							1, 27, 27, 96, 3, 2,
							1,
							1},//Layer-1
							{0,
							27, 27, 96, 5, 5, 48, 256, 256,
							0,
							27, 27, 256, 1, 2, 1, 1,
							1, 13, 13, 256, 3, 2,
							1,
							1},//Layer-2
							{0,
							13, 13, 256, 3, 3, 256, 384, 384,
							0,
							13, 13, 384, 1, 1, 0, 1,
							0, 13, 13, 384, 0, 0,
							0,
							1},//Layer-3
							{0,
							13, 13, 384, 3, 3, 192, 384, 384,
							1,
							13, 13, 384, 1, 1, 1, 1,
							0, 13, 13, 384, 0, 0,
							0,
							0},//Layer-4
							{0,
							13, 13, 384, 3, 3, 192, 256, 256,
							0,
							13, 13, 256, 1, 1, 1, 1,
							1, 6, 6, 256, 3, 2,
							0,
							2},//Layer-5  Note: for last conv layer, outputs are write to fc buffer
							{1,
							6, 6, 256, 6, 6, 256, 4096, 4096,  // Note: The input size (dim1/dim2) is the combined data size (batched)
							2,
							1, 1, 4096, 6, 0, 0, 1,
							0, 1, 1, 4096, 0, 0,
							0,
							3},//Layer-6 fc
							{1,
							1, 1, 4096, 1, 1, 4096, 4096, 4096,
							3,
							1, 1, 4096, 1, 0, 0, 1,
							0, 1, 1, 4096, 0, 0,
							0,
							2},//Layer-7 fc
							{1,
							1, 1, 4096, 1, 1, 4096, 1024, 1024,
							2,
							1, 1, 1024, 1, 0, 0, 0,
							0, 1, 1, 1024, 0, 0,
							0,
							3}//Layer-8 fc
							};

char precision_config[][3] ={{8,  0, -4},//Layer-1
							{ 8,  0, -2},//Layer-2
							{ 8,  0, -1},//Layer-3
							{ 8, -1, -1},//Layer-4
							{ 8, -1, -1},//Layer-5
							{11, -1,  0},//Layer-6
							{10,  0,  2},//Layer-7
							{10,  2,  2}//Layer-8
							};

unsigned input_config[5] = {227, 227, 3, 1}; //original image size(dim1, dim2, dim3), batch size

//unsigned output_config[3] = {27, 27, 96};//Layer-1
//unsigned output_config[3] = {55, 55, 96};//Layer-1

//unsigned output_config[3] = {13, 13, 256};//Layer-2

//unsigned output_config[3] = {6, 6, 256};//Layer-5

//unsigned output_config[3] = {1, 1, 4096};//Layer-6

unsigned output_config[3] = {1, 1, 1024};//Layer-8  Note: only one result is extracted and verified

*/


// Test with batch=1
// VGG-16 Configuration
unsigned layer_config[][NUM_CONFIG_ITEM] = {{0,
							224, 224, 3, 3, 3, 3, 64, 64,
							0,
							224, 224, 64, 1, 1, 0, 1,
							0, 224, 224, 64, 0, 0,
							0,
							1},//Layer-1 (conv1_1)
							{0,
							224, 224, 64, 3, 3, 64, 64, 64,
							1,
							224, 224, 64, 1, 1, 0, 1,
							1, 112, 112, 64, 2, 2,
							0,
							0},//Layer-2 (conv1_2)
							{0,
							112, 112, 64, 3, 3, 64, 128, 128,
							0,
							112, 112, 128, 1, 1, 0, 1,
							0, 112, 112, 128, 0, 0,
							0,
							1},//Layer-3 (conv2_1)
							{0,
							112, 112, 128, 3, 3, 128, 128, 128,
							1,
							112, 112, 128, 1, 1, 0, 1,
							1, 56, 56, 128, 2, 2,
							0,
							0},//Layer-4 (conv2_2)
							{0,
							56, 56, 128, 3, 3, 128, 256, 256,
							0,
							56, 56, 256, 1, 1, 0, 1,
							0, 56, 56, 256, 0, 0,
							0,
							1},//Layer-5 (conv3_1)
							{0,
							56, 56, 256, 3, 3, 256, 256, 256,
							1,
							56, 56, 256, 1, 1, 0, 1,
							0, 56, 56, 256, 0, 0,
							0,
							0},//Layer-6 (conv3_2)
							{0,
							56, 56, 256, 3, 3, 256, 256, 256,
							0,
							56, 56, 256, 1, 1, 0, 1,
							1, 28, 28, 256, 2, 2,
							0,
							1},//Layer-7 (conv3_3)
							{0,
							28, 28, 256, 3, 3, 256, 512, 512,
							1,
							28, 28, 512, 1, 1, 0, 1,
							0, 28, 28, 512, 0, 0,
							0,
							0},//Layer-8  (conv4_1)
							{0,
							28, 28, 512, 3, 3, 512, 512, 512,
							0,
							28, 28, 512, 1, 1, 0, 1,
							0, 28, 28, 512, 0, 0,
							0,
							1},//Layer-9  (conv4_2)
							{0,
							28, 28, 512, 3, 3, 512, 512, 512,
							1,
							28, 28, 512, 1, 1, 0, 1,
							1, 14, 14, 512, 2, 2,
							0,
							0},//Layer-10 (conv4_3)
							{0,
							14, 14, 512, 3, 3, 512, 512, 512,
							0,
							14, 14, 512, 1, 1, 0, 1,
							0, 14, 14, 512, 0, 0,
							0,
							1},//Layer-11  (conv5_1)
							{0,
							14, 14, 512, 3, 3, 512, 512, 512,
							1,
							14, 14, 512, 1, 1, 0, 1,
							0, 14, 14, 512, 0, 0,
							0,
							0},//Layer-12  (conv5_2)
							{0,
							14, 14, 512, 3, 3, 512, 512, 512,
							0,
							14, 14, 512, 1, 1, 0, 1,
							1, 7, 7, 512, 2, 2,
							0,
							2},//Layer-13  (conv5_3)    Note: for last conv layer, outputs are write to fc buffer
							{1,
							7, 7, 512, 7, 7, 512, 4096, 4096,
							2,
							1, 1, 4096, 7, 0, 0, 1,
							0, 1, 1, 4096, 0, 0,
							0,
							3},//Layer-14  (fc6)							
							{1,
							1, 1, 4096, 1, 1, 4096, 4096, 4096,
							3,
							1, 1, 4096, 1, 0, 0, 1,
							0, 1, 1, 4096, 0, 0,
							0,
							2},//Layer-15  (fc7)
							{1,
							1, 1, 4096, 1, 1, 4096, 1024, 1024,
							2,
							1, 1, 1024, 1, 0, 0, 0,
							0, 1, 1, 1024, 0, 0,
							0,
							3}//Layer-16  (fc8)		
							};

char precision_config[][3] ={{7,  0, -2},//Layer-1
							{ 8, -2, -5},//Layer-2
							{ 8, -5, -5},//Layer-3
							{ 8, -5, -6},//Layer-4
							{ 7, -6, -7},//Layer-5
							{ 8, -7, -7},//Layer-6
							{ 8, -7, -7},//Layer-7
							{ 8, -7, -6},//Layer-8
							{ 8, -6, -5},//Layer-9
							{ 8, -5, -5},//Layer-10
							{ 9, -5, -4},//Layer-11
							{ 9, -4, -3},//Layer-12
							{ 8, -3, -2},//Layer-13
							{ 8, -2,  0},//Layer-14
							{ 7,  0,  2},//Layer-15
							{ 7,  2,  2}//Layer-16
							};

unsigned input_config[4] = {224, 224, 3, 1};

//unsigned output_config[3] = {224, 224, 64};//Layer-1

//unsigned output_config[3] = {56, 56, 128};//Layer-4(pool2)

//unsigned output_config[3] = {28, 28, 256};//Layer-7(pool3)

//unsigned output_config[3] = {28, 28, 512};//Layer-8(relu4_1)

//unsigned output_config[3] = {28, 28, 512};//Layer-9(relu4_2)

//unsigned output_config[3] = {14, 14, 512};//Layer-10(pool4)

//unsigned output_config[3] = {7, 7, 512};//Layer-13(pool5)

//unsigned output_config[3] = {1, 1, 4096};//Layer-14

unsigned output_config[3] = {1, 1, 1024};//Layer-16

I compiled the project successfully in CPU-emulation mode in the SDAccel GUI. However, when I run the project, this error occurs:

***************************************************
PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs 
***************************************************
Error: required win_buffer size is 3456, configured size is 2304 
Allocate memory for data and weights failed !!!

How can I solve this problem? What else should I change in the code?
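
A hedged pointer for this error: the message suggests the on-chip window buffer is sized by a compile-time constant that still reflects AlexNet's largest layer, so it also needs enlarging for VGG-16. Assuming the constant lives in the kernel parameter header (e.g. device/hw_param.cl) under a name like WIN_BUF_SIZE, the change would look like:

/* device/hw_param.cl -- file and macro name are assumptions */
#define WIN_BUF_SIZE 3456   /* was 2304; must cover the size the host reports */

The kernel binary then has to be rebuilt, since the buffer is instantiated in hardware.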

Xilinx flow

When #define XILINX is set, the build generates this error:

‘write_event’ was not declared in this scope
                             0 /* flags, 0 means from host*/,0, NULL,&write_event[i]);
                                                                      ^~~~~~~~~~~
../src/host/main.cpp:466:71: error: ‘write_event’ was not declared in this scope
                              0 /* flags, 0 means from host*/,0, NULL,&write_event[i]);
                                                                       ^~~~~~~~~~~
../src/host/main.cpp: In function ‘int prepare()’:
../src/host/main.cpp:1414:5: warning: this ‘else’ clause does not guard... [-Wmisleading-indentation]
     else
     ^~~~
../src/host/main.cpp:1418:2: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘else’
  for(unsigned n = 0; n<layer_config[0][data_n]/VEC_SIZE; n++){
  ^~~
make: *** [src/host/main.o] Error 1
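
A hedged fix: under the XILINX path, main.cpp uses write_event without a declaration, so declaring the event array alongside the other cl_event objects should clear the error (the name exists in the source; the size below is an assumption and must cover every index i used):

/* Hypothetical declaration near the other cl_event objects in main.cpp */
cl_event write_event[MAX_BUF_NUM];  /* MAX_BUF_NUM is a placeholder size */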

Compile error with alteracl.lib using mingw-w64 on Windows 10

Hello! I am trying to compile the code with the Intel OpenCL FPGA SDK 17.0 and an Arria 10 board on Windows 10 using mingw-w64. I get an error when the makefile runs a command like:

g++ ./host/main.o ../common/ocl_util.o ../common/timer.o -o run.exe -LC:/intelFPGA_pro/17.0/hld/board/a10_ref/windows64/lib -LC:/intelFPGA_pro/17.0/hld/host/windows64/lib -laltera_a10_ref_mmd -lalteracl -lacl_emulator_kernel_rt -lpkg_editor -llibelf -lacl_hostxml

and the link fails with errors like:

C:/intelFPGA_pro/17.0/hld/host/windows64/lib/alteracl.lib(d:/SJ/nightly/17.0/290/w64/acds/hld/obj/windows64/acl/acl_program.obj).text[l_build_from_source_in_dir]+0xa2): undefined reference to `__imp__wassert'
C:/intelFPGA_pro/17.0/hld/host/windows64/lib/alteracl.lib(d:/SJ/nightly/17.0/290/w64/acds/hld/obj/windows64/acl/acl_program.obj).text[l_load_binary_pkg]+0xb36): undefined reference to `__security_check_cookie'
C:/intelFPGA_pro/17.0/hld/host/windows64/lib/alteracl.lib(d:/SJ/nightly/17.0/290/w64/acds/hld/obj/windows64/acl/acl_program.obj).xdata[$unwind$l_compute_hash]+0x10): undefined reference to `__GSHandlerCheck'

(the full log is long and truncated; it just repeats these three kinds of undefined reference.)

Anyone has an idea? Thanks in advance!!

Logic Optimization failing when compiling on AWS F1 for xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0

maxpooling

for(unsigned int k=0; k<input_num; k++){
	if(pool_size==3)
		row_pool_reg[ll] = pool_max(line_buf_1[ll][line_buf_ptr], line_buf_0[ll][line_buf_ptr]);
	else // pool_size==2
		row_pool_reg[ll] = line_buf_0[ll][line_buf_ptr];

	pool_reg[ll][0] = pool_max(row_pool_reg[ll], conv_ch_out.lane[ll]);

	// Max pooling among columns
	// with previous row-pooling results stored in shift-registers
	if(pool_size==3)
		col_pool_reg[ll] = pool_max(pool_reg[ll][1], pool_reg[ll][2]);
	else // pool_size==2
		col_pool_reg[ll] = pool_reg[ll][1];

	pool_final.lane[ll] = pool_max(col_pool_reg[ll], pool_reg[ll][0]);

	// Update line buffer
	line_buf_1[ll][line_buf_ptr] = line_buf_0[ll][line_buf_ptr];
	line_buf_0[ll][line_buf_ptr] = conv_ch_out.lane[ll];

Hi Prof. @doonny, can you explain how you make this work? Is this not square max pooling?
Can I use pool_size == 2? If pool_size = 2, line_buf_1 seems redundant.


Makefile:129: recipe for target 'run.exe' failed

Hello! I compiled the project and got the following error:
lcf@lcf-9020:~/work/PipeCNN-master/project$ make
g++ ./host/main.o ../common/ocl_util.o ../common/timer.o -o run.exe -L/home/lcf/intelFPGA/16.1/hld/board/de10_standard/arm32/lib -L/home/lcf/intelFPGA/16.1/hld/host/arm32/lib -L/home/lcf/intelFPGA/16.1/hld/host/linux64/lib -Wl,--no-as-needed -lalteracl -lalterammdpcie -lstdc++ -lelf
/usr/bin/ld: skipping incompatible /home/lcf/intelFPGA/16.1/hld/board/de10_standard/arm32/lib/libalteracl.so when searching for -lalteracl
/usr/bin/ld: skipping incompatible /home/lcf/intelFPGA/16.1/hld/host/arm32/lib/libalteracl.so when searching for -lalteracl
/usr/bin/ld: skipping incompatible /home/lcf/intelFPGA/16.1/hld/board/de10_standard/arm32/lib/libalterammdpcie.so when searching for -lalterammdpcie
/usr/bin/ld: skipping incompatible /home/lcf/intelFPGA/16.1/hld/host/arm32/lib/libalterammdpcie.so when searching for -lalterammdpcie
/usr/bin/ld: cannot find -lalterammdpcie
/usr/bin/ld: skipping incompatible /home/lcf/intelFPGA/16.1/hld/board/de10_standard/arm32/lib/libelf.so when searching for -lelf
/usr/bin/ld: skipping incompatible /home/lcf/intelFPGA/16.1/hld/host/arm32/lib/libelf.so when searching for -lelf
collect2: error: ld returned 1 exit status
Makefile:129: recipe for target 'run.exe' failed
make: *** [run.exe] Error 1

Anyone got an idea? Thanks in advance!!

AWS F1 Build failing

Linking failed in Emulation-HW for AWS F1.
Hardware Platform: xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0

(screenshot of the error attached)

MaxPool Problem

I'm sorry, I was in too much of a hurry with this issue.
The problem occurs when padding (= 1) is added on the right and bottom sides
to get a result with sizes size_x = 13, size_y = 13.
More information:
input: size_x = 13, size_y = 13, pool_size = 2, pool_stride = 1, depth = 512
result: size_x = 13, size_y = 13, depth = 512
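
For reference, a small helper that makes the arithmetic behind this report explicit (the standard pooling-size formula, not code from the project):

/* Pooling output size: floor((in + pad - pool) / stride) + 1.
   With in = 13, pool = 2, stride = 1 and pad = 0 this gives 12, so a
   13x13 result indeed requires one extra row/column of padding. */
int pool_out_size(int in, int pad, int pool, int stride) {
    return (in + pad - pool) / stride + 1;
}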

Xilinx Flow - Run time hang

Hello Prof. Wang,

I'm trying to run PipeCNN on an Alpha Data 7v3 FPGA [xilinx_adm-pcie-7v3_1ddr_3_0].
I'm using SDx 2017.2, and software emulation runs properly, giving correct results.
When the hardware is built, timing is not met on some paths and the tool reduces the clock to 170.3 MHz (from the original 200 MHz).
But when I run the generated binary conv.xclbin on the FPGA, it hangs at run time:

Executing Layer 1:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching single work-item kernel Pooling

Launching kernel MemWr with local size: 1, 1, 16  (global size: 27, 27, 96)

Launching kernel lrn with local size: 1, 1, 24  (global size: 27, 27, 24)

Could you please help me in figuring out the issue?
Thanks in advance

Regards

Best Parameter Setting on KCU1500 Board?

Hello, Prof. Wang.

I'm trying to run on a KCU1500 board.
What is the best parameter setting for VEC_SIZE, LANE_NUM, and CONV_GP_SIZE_X?

When I tried VEC_SIZE=8, LANE_NUM=16, an error occurred (screenshot attached).

Thank you.
