doc's Issues

Where are the compilation tools (parser and compiler)?

The compilation phase is responsible for converting (i.e., compiling) a deep neural network into a sequence of hardware layers that are optimized for a given NVDLA configuration. Having a compiled network optimized for a specific hardware configuration improves performance by reducing model size, load time, and run time. Compilation is a two-step process consisting of parsing and compiling.
But where are these tools?
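For illustration, the two-step flow described above can be sketched roughly as below. This is purely hypothetical code: none of these function names are actual NVDLA APIs, and the real tools are not part of this documentation repository.

```python
# Hypothetical sketch of the two-phase compilation flow described above.
# The function names here are illustrative only, not actual NVDLA APIs.

def parse(network_definition: str) -> list[str]:
    """Phase 1 (parsing): read a trained model into a framework-neutral
    list of layers. Here the "model" is just a comma-separated string."""
    return [layer.strip() for layer in network_definition.split(",")]

def compile_network(layers: list[str], hw_config: str) -> list[tuple[str, str]]:
    """Phase 2 (compiling): map each parsed layer onto a hardware layer
    optimized for one specific NVDLA configuration."""
    return [(layer, f"hw-layer@{hw_config}") for layer in layers]

loadable = compile_network(parse("conv1, relu1, pool1"), "nv_small")
print(loadable[0])  # ('conv1', 'hw-layer@nv_small')
```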

More details about the register definition

Thanks so much for sharing NVDLA.
I read the register description carefully and found only a few brief notes on the key points. What should I do if I want to integrate NVDLA into my design (the integration)?
Where can I find details about the "loadable image"?

General Questions on NVDLA

We have the following inquiries about NVDLA. Would appreciate feedback.

  1. How do we access the full training infrastructure when building our own ConvNet model?

  2. Can the NVDLA parser (compilation tool) read Caffe2, TensorFlow, and PyTorch models?

  3. It seems that running parallel multiple independent layers is not feasible for “Headless Implementation”. Is “Fused (Pipelined) operation” possible in this environment?

  4. If there is a companion microcontroller, does KMD run on this companion microcontroller (and UMD run on the main system CPU)?

  5. If the companion microcontroller runs on RTOS (say, Nucleus), do we need to reprogram the entire Kernel Mode Driver that is currently accessing Linux Kernel?

  6. Can NVDLA’s 2D convolution support “1x1 convolution” (for ResNet implementation)?

  7. If we need a softmax activation in the output layer, is there a way to implement it on the DLA (from the application program)?
    Are there any KMD (or UMD) APIs available for this?

  8. In the following paragraph from http://nvdla.org, does "multiple DLA devices" mean multiple DLA IP cores, or multiple independent layers in one DLA core?

    “Runtime driver supports submitting inference jobs to multiple DLA devices”

  9. Should the application program be written in C/C++ rather than Java?

  10. How does the application program submit an inference job to KMD?

  11. Does UMD also include Linux Kernel Driver?

  12. What are the portability layers supposed to do in UMD and KMD, respectively? Is the licensee supposed to implement these?

  13. Is KMD stack mainly composed of the firmware (scheduler) and Linux Kernel Driver?

  14. Does the KMD stack have system calls for power management in the Linux kernel?

  15. Suppose that runtime environment is going to be running on Android (instead of Linux), what do we need to do in this case?

    1. Modify the Linux kernel driver in KMD into an Android kernel driver (IPC, power management, etc.)?
    2. If necessary, also modify the Linux kernel driver in UMD into an Android kernel driver?
    3. If necessary, move some system calls/functionality from KMD to UMD (due to discrepancies between the Linux and Android kernels)?
    4. Does UMD need to be connected to the Android HAL?
    5. Does the application program need to go through the Java framework on Android?
    6. What else do we need to modify?

Build fails on Ubuntu with "sw_vers not found"

I'm getting the error:

    adding /home/travis/build/nvdla/doc/.env to cache
    creating directory /home/travis/build/nvdla/doc/.env
    sh: 1: sw_vers: not found

on build #16 for pull request #5. I see that appium/appium#1580
had the same error when iOS checks were done on Ubuntu. Could this be the same problem? I'm running Ubuntu.

Or am I doing something wrong?

Regarding the compilation of LaTeX files

I am using PDFTeXify (from the CTeX distribution) on Windows to compile the LaTeX files, and the SVG figures prevent compilation from succeeding (with errors like "Unknown graphics extension: .svg").

Have you encountered the same problem?
How did you fix it?
Thank you!
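For what it's worth, a common general-purpose workaround (standard LaTeX practice, not something from the NVDLA docs) is to pre-convert each SVG to PDF and include the PDF instead, since pdflatex cannot read SVG natively:

```latex
% pdflatex has no native SVG support. One common workaround is to
% pre-convert each figure outside LaTeX, e.g. with Inkscape:
%   inkscape figure.svg --export-filename=figure.pdf
\documentclass{article}
\usepackage{graphicx}
\begin{document}
% Omit the extension so the graphics driver picks the converted .pdf:
\includegraphics[width=\linewidth]{figure}
\end{document}
```

Alternatively, the `svg` package can automate the conversion, but it requires Inkscape on the PATH and compiling with `--shell-escape`.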

Length of a Stripe Operation

The NVDLA unit description (http://nvdla.org/hw/v1/ias/unit_description.html) mentions an upper limit length of 32 for a Stripe Operation:
"The upper limit is 32 due to buffer size in the accumulator"
However, this seems to contradict the buffer size as mentioned in the "Convolution Accumulator" chapter. Let me explain why:

Every atomic operation produces 16 partial sums (see the "Atomic Operation" chapter), so a maximum-sized stripe operation yields 32 × 16 = 512 elements in total.
Each of these elements is stored as an INT48 value (when using INT16 in the previous steps) in the assembly SRAM group (see Table 49).
This amounts to 512 × 6 B = 3 KiB.

According to the "Convolution Accumulator" chapter, however, the buffer size is 96 B × 128 = 12 KiB.
So, in theory, the length of a stripe operation could be 128 instead of 32.

Is there any reason why this is not the case or are the calculations wrong?
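The arithmetic above can be checked in a few lines (the constants are the figures quoted from the documentation, not independently verified):

```python
# Re-deriving the numbers from the post; constants are quoted from the
# cited NVDLA documentation chapters, not independently verified.
ATOMS_PER_STRIPE = 32        # documented upper limit of a stripe operation
PARTIAL_SUMS_PER_ATOM = 16   # per the "Atomic Operation" chapter
BYTES_PER_ELEMENT = 6        # INT48 partial sum for INT16 inputs

stripe_bytes = ATOMS_PER_STRIPE * PARTIAL_SUMS_PER_ATOM * BYTES_PER_ELEMENT
print(stripe_bytes)          # 3072 bytes = 3 KiB

buffer_bytes = 96 * 128      # "Convolution Accumulator": 96 B x 128 = 12 KiB
print(buffer_bytes)          # 12288 bytes = 12 KiB

# If the accumulator buffer were the only constraint, the stripe length
# could be four times larger:
print(buffer_bytes // (PARTIAL_SUMS_PER_ATOM * BYTES_PER_ELEMENT))  # 128
```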

Power estimation

Is it possible to measure power consumption in the virtual simulator? We're just starting out and aren't sure if the NVDLA framework works for our project.

Question about data format document Fig17

I have a question about Figure 17 in the latest data format document, regarding the Winograd weight data format: why does the figure assume a 5x5 kernel (5x5 x 48 bytes) with a stride of 2? As I understand it, the DLA currently only supports Winograd for kernel size 3x3 with stride 1, so I don't think it can support the Winograd optimization in that situation.
http://nvdla.org/_images/format_channel_extension_and_conversion_for_wingorad.svg

Pooling error for large images

Hello,

while running YOLO on NVDLA, I realized that something is currently hard-coded somewhere, either in the HW implementation or in KMD. When I pool a 448x448 image, the result is completely wrong, but when I reduce the image size to half (224x224), the pooling engine works fine. From the output, the pooling result essentially contains a duplicate of the original image, starting slightly to the right of the center.

Pooling result:

pool1_nvdla

Original image:

000000000139_224x224

I suspect the problem starts at the 256th pixel, because I have a feeling that uint8_t is used somewhere in the code, and it is not large enough to index a 448x448 image. Has anyone encountered the same problem, or does anyone know whether there is indeed a hard-coded limit? I am very curious about this.
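The suspected wrap-around is easy to illustrate. This is only a guess at the cause, not a confirmed diagnosis; numpy's uint8 here stands in for a hypothetical uint8_t index in the C code:

```python
# Illustrating the suspected uint8_t wrap-around (a guess about the bug,
# not a confirmed diagnosis): an 8-bit index cannot address column 448.
import numpy as np

width = 448
cols = np.arange(width, dtype=np.uint16)
wrapped = cols.astype(np.uint8)  # simulates storing the index in a uint8_t

print(wrapped[224])  # 224 -- still fine at half resolution (224x224)
print(wrapped[256])  # 0   -- wraps, so pixels past 255 alias back to the left
print(wrapped[447])  # 191 -- column 447 would read column 191's data
```

This would match the observed symptom: 224x224 images index correctly, while 448x448 images show duplicated content offset from the middle.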

Best
Tim

Address Values Wrong in Table

I think the addresses in hwarch.rst do not agree with the code. Table 10 looks mostly correct, except for a couple of end addresses, but Tables 11-30 have an extra 0x3000 added.

Added as pull request.
