
Explanation

In the image above, IN(x) denotes the InstanceNorm operation applied to tensor x. It helps to keep in mind that the InstanceNorm operation has two parameters, gamma and beta.

IN -> PostScale can be expressed as follows (scale constant = a).

  • IN(x) × a = (gamma × a) × ((x - mu) / sigma) + (beta × a)
  • That is, by rewriting gamma -> gamma × a and beta -> beta × a, we obtain the result of the PostScale operation. In other words, IN(x) × a = IN'(x), where IN'(x) is the IN operation whose parameters gamma and beta have been multiplied by a. Since the transformed IN' still has the form of an InstanceNorm operation, the two operations can be fused (see the sketch below).
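A minimal NumPy sketch of this identity (the instance_norm helper, the shapes, and the scale constant are illustrative, not from the original note):

```python
# Numerical check of the IN -> PostScale fusion above.
import numpy as np

def instance_norm(x, gamma, beta, eps=1e-5):
    # Normalize per instance and per channel over the spatial axes (H, W).
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4, 8, 8))       # NCHW tensor
gamma = rng.standard_normal((1, 4, 1, 1))
beta = rng.standard_normal((1, 4, 1, 1))
a = 2.5                                     # PostScale constant

post_scaled = instance_norm(x, gamma, beta) * a     # IN -> PostScale
fused = instance_norm(x, gamma * a, beta * a)       # IN' with scaled params
assert np.allclose(post_scaled, fused)              # identical results
```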

PreScale -> IN, on the other hand, cannot be expressed in InstanceNorm form once expanded. Let's try (scale constant = a).

  • IN(ax) = gamma × ((ax - mu') / sigma') + beta
  • mu' is the mean of ax, and sigma' is the standard deviation of ax combined with epsilon, i.e., sigma' = sqrt(Var(ax) + epsilon).
  • mu' (the mean of ax) = a × mu.
  • sigma' = sqrt(Var(ax) + epsilon) = sqrt((a × std(x))^2 + epsilon).
  • Because of the epsilon inside sigma', sigma' cannot be written in terms of sigma (a × sigma = sqrt(a^2 × Var(x) + a^2 × epsilon), while sigma' = sqrt(a^2 × Var(x) + epsilon)), so IN(ax) cannot be rewritten as an InstanceNorm with modified parameters. Therefore fusion is not possible (see the numerical check below).
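A numerical check of the epsilon obstruction, continuing the sketch above (again illustrative only):

```python
# Why PreScale -> IN cannot be fused: sigma' = sqrt(a^2 * Var(x) + eps),
# whereas rescaling sigma gives a * sigma = sqrt(a^2 * Var(x) + a^2 * eps).
# The eps term breaks the equality (it only vanishes as eps -> 0).
eps = 1e-5
var = x.var(axis=(2, 3), keepdims=True)
sigma = np.sqrt(var + eps)                   # sigma used by IN(x)
sigma_prime = np.sqrt((a ** 2) * var + eps)  # the actual sigma' of a*x
assert not np.allclose(sigma_prime, a * sigma, rtol=0, atol=1e-9)
```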

CGO 2023 response draft

Full comments

CGO 2023 Paper #36 Reviews and Comments

Paper #36: Pin or Fuse? Exploiting Scratchpad Memory to Reduce Off-chip Data Transfer in DNN Accelerators

Review #36A

Overall merit

I'd rather not accept it

Reviewer expertise

Knowledgeable

Would this paper be a good fit for the "Tools and Practical Experience" category?

No

Reasons the paper should be accepted

New heuristic for memory allocation in ML models for layer fusion. It decides whether to pin certain tensors in memory until their lifetime ends.
It's a simplification of the register allocation problem as it only considers contiguous lifetimes. There's no spilling/reload.
Nevertheless, it's an interesting point in the design space to restrict liveness ranges to contiguous ranges for certain tensors (the pinned ones).

Reasons the paper should be rejected

The DP formulation doesn't take memory constraints into consideration; the produced solution may not work.
The original problem is NP-hard and simplifications are useful, but it seems the proposed simplification may produce solutions that aren't valid.

The DP formulation isn't well explained.

Questions for the authors

The DP formulation doesn't take memory limits into consideration, right? What happens if the solution doesn't fit in memory? Can that happen? If not, why not?

In Figure 11, what are MV1, MV2, EFN, etc.? The benchmarks are not explained.

It is totally non-obvious to me why the right-hand side of Figure 13 is best. Could you please explain why? Does it consume less memory? It doesn't seem like it would. Lines 1015-1016 don't make sense to me.

Other comments for the authors

Algorithm 1 is not useful. The space would be better used to describe the DP formula (which is equivalent to Algorithm 1). Right now the DP formulation isn't explained. What are v, g, and p?

The font used for ifm/ofm isn't good. You should use something like \mathsf or similar, so the characters aren't so spaced out.

Review #36B

Overall merit

I will champion it

Reviewer expertise

Knowledgeable

Would this paper be a good fit for the "Tools and Practical Experience" category?

No

Reasons the paper should be accepted

The paper proposes an algorithm for selecting both which convolution layers to fuse together into groups and which groups' results to keep in the global buffer, a tradeoff that improves performance over considering only the former.

The algorithm seems efficient and effective as demonstrated on a commercial AI accelerator using simulation on a handful of CNNs.

Reasons the paper should be rejected

The motivation suggests a simple extension to the existing "fuse only" option which deserves comparison: after optimizing for fusion only, check which groups' results can be kept in the global buffer rather than being evicted needlessly.

The cost/benefit/extensibility of various simplifying design decisions is worth elaborating, e.g.: allowing a single ofm per group; considering only chains of convolutions for fusion, without concat layers; if an ofm of a group is used by multiple subsequent groups, retaining it in the global buffer for some but spilling for others, rather than the "all or nothing" approach; how batching, or tensors with dynamic shapes in general, would affect the algorithm.

The approach is restricted to sufficiently small "lightweight" models where (at least some) layers can fully fit in the global buffer. It is unclear how extensible/scalable it may be for other models.

Questions for the authors

"Channel expansion" - typically the output channels are subject to tiling but input channels are not, so by fusing multiple layers we lose the ability to tile along the output channels of all intermediate layers, and can only tile the output channels of the last/group-ofm - collecting them for next group. Does this "increase the amount of computation"?

"Spatial expansion" - overlap between adjacent tiles is handled by rebringing and recomputing, rather than trying to retain already brought and computed values, right?

In the motivating example of Section 2.3, both algorithms (should) decide to fuse the last two layers of the six being considered; the only distinction is which, if any, of the five ofms are retained in the global buffer rather than spilled to off-chip memory? ("... while other four are evicted ..." - should be five, unless the 6th layer produces an ofm of the network?) This can alternatively be achieved by a more selective "best-effort" spilling pass following the baseline fusion-only algorithm, which spills only if needed (cf. classic Briggs vs. Chaitin), so "joint optimization is necessary to find a global solution" arguably does not hold for this case. How about using Figure 13 as the motivating example instead?

Do Equations (1) and (2) account for potential pipelining between adjacent groups?

In Equation (2), would it be better to place the max inside the sum over tiles, e.g., to account for some tiles being load-bound while others are compute-bound? The current form can serve as a lower bound (cf. the Resource-MII of modulo scheduling).
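For context, a sketch of the inequality behind this suggestion (the load/compute decomposition is an assumption about the shape of Equation (2), not taken from the paper):

```latex
% Assumed current form: overlap resolved once over the whole group
L = \max\left(\sum_{t} \mathrm{load}_t,\ \sum_{t} \mathrm{compute}_t\right)
% Suggested refinement: resolve the max per tile
L' = \sum_{t} \max\left(\mathrm{load}_t,\ \mathrm{compute}_t\right)
% Since \max\left(\sum_t a_t, \sum_t b_t\right) \le \sum_t \max(a_t, b_t),
% the current form L is a lower bound on the per-tile-resolved latency L'.
```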

Figure 7: double-buffering allows loading the second ifm in parallel with the first compute_1, and loading the third ifm only needs to wait for the first compute_1 to finish using the first ifm, rather than for the first compute_2 to finish? Is double-buffering of the ofm profitable?
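A schematic of the double-buffered issue order this question has in mind (a sketch only: the stage names follow the figure's vocabulary, and the two-slot ifm buffer is an assumption):

```python
# Double-buffered schedule over the tiles of a group: with two ifm slots,
# load(i + 1) only has to wait until compute_1(i - 1) is done reading the
# slot it reuses -- not until compute_2 finishes.
def run_group(tiles, load, compute_1, compute_2):
    load(tiles[0], slot=0)                         # prologue: fill slot 0
    for i, tile in enumerate(tiles):
        if i + 1 < len(tiles):
            load(tiles[i + 1], slot=(i + 1) % 2)   # intended to overlap compute_1(i)
        compute_1(tile, slot=i % 2)                # frees slot i % 2 when done
        compute_2(tile)                            # does not touch ifm slots
```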

Equation (4) provides a lower bound on the actual memory allocation because it ignores fragmentation - and only because of that?

"Among feasible tile sizes, we pick the one with the minimum latency" - but selection process stops as soon as the first feasible tile size is found?

In Algorithm 1, how is the graph topologically sorted - are all sortings examined to select the best one, considering "for v in inputs(V_curr)"?

Other comments for the authors

The equation for DP(v,g,p) is hard to read; perhaps using intermediate equations for each of the three cases would help.

Table 1: it would be good to also reason about the overall "arithmetic density" of each DNN, which is independent of cycles, and provide some statistics about kernel sizes.

Section 2.2: "... fused layers are executed in a tiled manner ..." - so are unfused layers, if needed; the distinction is in the order in which the tiles are traversed.

Reusing the weights brought into the global buffer for a given layer across all width/height tiles of a group, instead of rebringing/rematerializing them, can be considered a form of "pinning".

Figure 3: are the units of the y-axis the total number of multiply-and-accumulate instructions?

Figure 6: indicate that this pseudocode is subject to double-buffering and pipelining.

Subsection 3.1.1: "... graph G = (V,E) ..." - a DAG.

"DP" - explain what it stands for when first used. "... to find the optimal ..." - this is still a heuristic in many aspects, so claims for optimality need to be toned down.

"Cost functions for a layer group with multiple inputs can be derived similarly" - and have been?

E_{opt} simply records the ordering associated with F_opt, P_opt, L_min? Better to define it so explicitly.

Memory allocation is indeed NP-hard in the general case - for arbitrary interference graphs; but there are cases which are solvable efficiently, e.g., "Comparability Graph Coloring for Optimizing Utilization of Software-Managed Stream Register Files for Stream Processors", TACO 2012.

Figure 11: show averages, including the claimed 25% and 50% decreases in off-chip transfers.

Table 3: "PinRatio ignores layer groups that produce model outputs where pinning cannot be applied" - because they are too large? Would be good to include statistics about these cases.

Section 4.5: clarify which of the two ADD layers is being addressed in the text.

Review #36C

Overall merit

I'm OK with it

Reviewer expertise

Some familiarity

Would this paper be a good fit for the "Tools and Practical Experience" category?

No

Reasons the paper should be accepted

  1. This paper proposes a novel DP algorithm to find the best combination of pinning and fusion for performance under on-chip memory constraints. The idea of considering both fusion and the data communication between fused layers for overall performance is somewhat novel. The experimental results are good.
  2. This paper is easy to follow. The issues to address as well as the problem formulation are presented well.

Reasons the paper should be rejected

  1. I am curious about the estimation of the memory footprint. This work uses the sum of ifm and ofm to estimate the memory footprint of a layer (or an operator). But may the implementation of a layer (or an operator) use extra global buffer space for temporary storage, which could push the maximum usage above the sum of ifm and ofm?
  2. This work claims that it can compensate for the computation increase from fusion by pinning. But this is somewhat confusing, since the former relates to computation duplication, while the latter relates to reducing memory transfers rather than deduplicating computation. Could the authors clarify this?

Questions for the authors

  1. I am curious about the estimation of the memory footprint. This work uses the sum of ifm and ofm to estimate the memory footprint of a layer (or an operator). But may the implementation of a layer (or an operator) use extra global buffer space for temporary storage, which could push the maximum usage above the sum of ifm and ofm?
  2. This work claims that it can compensate for the computation increase from fusion by pinning. But this is somewhat confusing, since the former relates to computation duplication, while the latter relates to reducing memory transfers rather than deduplicating computation. Could the authors clarify this?
  3. Are there any challenges in generalizing the proposed work to DNNs other than CNNs?

Other comments for the authors

I expect more details on the benchmarks and also on the commercial compiler in the camera-ready version, if this paper is accepted.

Review #36D

Overall merit

I'm OK with it

Reviewer expertise

Knowledgeable

Would this paper be a good fit for the "Tools and Practical Experience" category?

No

Reasons the paper should be accepted

  • A timely and important problem has been targeted
  • The paper is presented and motivated well
  • Reasonable approach

Reasons the paper should be rejected

  • The idea is similar to cache locking
  • Unclear notion of optimality
  • The experimental setup is not well explained

Questions for the authors

  1. What is the size of the targeted neural network in the evaluation?
  2. How does the idea of pinning differ from cache locking, at a higher level of abstraction?
  3. I wonder what happens when the feature map is too big to fit in the global buffer at all. In this case, does the algorithm revert to a fusion-only mode?

Other comments for the authors

  • The paper targets an important and timely problem. The performance of on-chip AI has become increasingly important. To this end, the idea of pinning sounds reasonable, and it is reasonable to have a combined optimization involving pinning and fusion.

  • One of my concerns is that the idea of pinning, at a certain level of abstraction, is very similar to cache locking (where cache lines/ways/sets can be selectively locked). I understand that the targeted application is different. However, I think it is worth explaining how the general idea of pinning relates to cache locking.

  • I am a bit confused about whether the proposed solution is actually optimal. The separation of fusion and pinning (the so-called pinning-aware fusion) does not seem to return an optimal solution. While this is fine, I think the paper should clearly discuss the optimality of the solution. If the proposed solution is indeed optimal, then a proof of optimality should be included. If it is not, then I would like to know how far the solution is from the optimal one.

  • The equation below Figure 9 is an optimization problem, but it is not clear how the optimization problem is solved. Is it solved optimally or using heuristics?

  • I am wondering about the scale of the experiments. Typically, it would be nice to mention the scale and size of the networks targeted in the paper.

  • Section 3.1.2 is a bit vague in my opinion and is not sufficiently grounded. In particular, it is not clear to me whether the estimated cost and memory footprint are good enough. It would be better if the evaluation included experiments showing the accuracy of the estimates instead of just the end result.

Review #36E

Overall merit

I will champion it

Reviewer expertise

Knowledgeable

Would this paper be a good fit for the "Tools and Practical Experience" category?

No

Reasons the paper should be accepted

  • The paper introduces a new compiler approach for managing SPM by combining pinning and fusion to reduce off-chip data transfer in DNN accelerators.
  • The overall DP formulation makes sense to me.
  • The paper is written well
  • The proposed approach is adequately evaluated, demonstrating its performance advantages over the prior art.

Reasons the paper should be rejected

Nothing.

Questions for the authors

Q1. Have the authors attempted to solve the problem optimally using ILP?

Q2. How can the proposed SPM algorithm be improved if the solutions found are still far from the optimal solutions? In other words, the current improvement in inference latency over the prior art is 15%. How much further can we go if we improve your approach in this direction?

Q3. What are the limitations of the cost functions revealed by the experimental evaluation?

Other comments for the authors

I enjoyed reading the paper. The idea of solving one optimization problem for both fusion and pinning is great, as both compete for the limited SPM resource. It is well motivated and solved quite nicely using DP, although some heuristics-based iterations are needed.

I have only two comments on the paper.

First, there was previously a lot of work on developing compiler-based SPM algorithms for traditional array-dominated C programs, such as:

  • Xuejun Yang et al., Comparability Graph Coloring for Optimizing Utilization of Stream Register Files in Stream Processors, PPoPP'09.
  • Lian Li et al., Memory Coloring: A Compiler Approach for Scratchpad Memory Management, PACT'05.
  • Sumesh Udayakumaran et al., Dynamic Allocation for Scratch-Pad Memory Using Compile-Time Decisions, ACM TECS 2006.
  • Isabelle Puaut and Christophe Pais, Scratchpad Memories vs Locked Caches in Hard Real-Time Systems: A Quantitative Comparison, DATE'07.
  • S. Steinke, L. Wehmeyer, Bo-Sik Lee and P. Marwedel, Assigning Program and Data Objects to Scratchpad for Energy Reduction, DATE'02.

In particular, these earlier SPM algorithms are able to solve both scheduling and allocation together so that a given allocation decision guarantees all the considered data can be allocated to the given SPM. However, the proposed approach requires iterations since both scheduling and allocation are solved separately. While this paper focuses on DNN models, it would make sense if the authors could place their work in this related context.

Second, have the authors considered linearizing all the constraints by solving all the sub-problems together, including tile size selection, scheduling, and allocation, as one ILP problem? See, e.g.:

  • Amit Pabalkar, Aviral Shrivastava, Arun Kannan and Jongeun Lee, SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories, HiPC'08.
  • V. Suhendra, T. Mitra, A. Roychoudhury and Ting Chen, WCET Centric Data Allocation to Scratchpad Memory, RTSS'05.
  • Lian Li et al., Compiler-Directed Scratchpad Memory Management via Graph Coloring, ACM TACO, 2009.

Perhaps better solutions can be found this way. In lines 845--850, the authors stated:

If the footprint of the best result exceeds the global buffer size, we repeat the pinning-aware fusion algorithm with 2% less global buffer size, so that the final footprint becomes lower. Our benchmark models were successfully compiled in 1.5 retries on average.
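As a sketch, the quoted retry loop could look like the following (pinning_aware_fusion and footprint are hypothetical names standing in for the paper's components, not its actual API):

```python
# Hypothetical sketch of the quoted retry loop: shrink the budget handed to
# the pinning-aware fusion algorithm by 2% until the resulting footprint
# fits in the real global buffer.
def compile_with_retries(model, buffer_size):
    budget = buffer_size
    while True:
        plan = pinning_aware_fusion(model, budget)   # joint fusion/pinning DP
        if footprint(plan) <= buffer_size:           # checked against real size
            return plan
        budget *= 0.98                               # retry with 2% less budget
```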

Could the authors provide some insights on the gap between the "optimal" solution found and the real optimal solution, both in theory and in practice?
