Comments (10)

wangyuyue commented on August 27, 2024

Hi Kevin,
Thanks for your question.
Take T[0, 0, 0, 1] as an example, it gives the condition that T[floor(k/8), floor(c/8), oy, k%8 + c%8 + ox] = T[0, 0, 0, 1].
I think under this condition k=0, c=1, and ox=oy=0 is a valid solution, and thus we can infer that PE[0, 1] is also used.

from tenet.

wangyuyue commented on August 27, 2024

For the question

It assumes you can do many MACs in a PE per cycle.

Yes, your guess is correct here. If we don't map a variable to either the timestamp or the space, then it means we assume this variable is free and compute instances with all its possible values execute in a single cycle.
I know this kind of assumption is implicit, easy to overlook, and error-prone. One solution is to add the maximum number of concurrent MACs per PE somewhere in the code and add an assertion that checks no more instances than this limit are executed concurrently. You are welcome to contribute a pull request.

KelvinYang0320 commented on August 27, 2024

Take T[0, 0, 0, 1] as an example, it gives the condition that T[floor(k/8), floor(c/8), oy, k%8 + c%8 + ox] = T[0, 0, 0, 1].
I think under this condition k=0, c=1, and ox=oy=0 is a valid solution, and thus we can infer that PE[0, 1] is also used.

@wangyuyue Thank you for the reply. You are right.
However, there are still 9 MACs (9 instances) on PE[0, 0] at T[0, 0, 0, 1], right?

wangyuyue commented on August 27, 2024

Take T[0, 0, 0, 1] as an example, it gives the condition that T[floor(k/8), floor(c/8), oy, k%8 + c%8 + ox] = T[0, 0, 0, 1].
I think under this condition k=0, c=1, and ox=oy=0 is a valid solution, and thus we can infer that PE[0, 1] is also used.

@wangyuyue Thank you for the reply. You are right. However, there are still 9 MACs (9 instances) on PE[0, 0] at T[0, 0, 0, 1], right?

Yes.

KelvinYang0320 commented on August 27, 2024

If we don't map a variable to either the timestamp or the space, then it means we assume this variable is free and compute instances with all its possible values execute in a single cycle.

@wangyuyue Do you mean I will get computation latency = 1 cycle even if there are 9 MACs in PE[0, 0] at T[0, 0, 0, 1] in this example?

But when I replace 0<=rx<3 and 0<=ry<3 with 0<=rx<1 and 0<=ry<1 in ./dataflow_example/conv.s, the computation latency I get is nine times smaller than with 0<=rx<3 and 0<=ry<3.

  • 0<=rx<1 and 0<=ry<1: 1 MAC in a PE per cycle
  • 0<=rx<3 and 0<=ry<3: 3*3 = 9 MACs in a PE per cycle

wangyuyue commented on August 27, 2024

Sorry, my previous explanation was wrong.
As the timestamp is used to express the execution order of compute instances, the smallest timestamp interval doesn't necessarily correspond to a physical clock cycle. The compute latency is calculated as (total #compute instances) / (total # active PE). So changing the timestamp mapping doesn't really impact the compute latency.
If we really want to model concurrent MACs in a PE, just change the compute latency formula to (total #compute instances) / (total # active PE * PE_parallel_factor).
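
A minimal sketch of the two formulas above, written in Python for illustration only. The function name, the parameter names, and the sample numbers are assumptions made for this example and do not come from the TENET code base.

```python
def compute_latency(num_instances, num_active_pe, pe_parallel_factor=1):
    """Coarse compute latency: work is assumed to be spread evenly over PEs.

    pe_parallel_factor is the assumed number of MACs one PE can issue
    per cycle (1 reproduces the formula without the extra factor).
    """
    return num_instances / (num_active_pe * pe_parallel_factor)


# Made-up workload: 1152 compute instances on 64 active PEs.
print(compute_latency(1152, 64))      # one MAC per PE per cycle
print(compute_latency(1152, 64, 9))   # nine MACs per PE per cycle
```

With these sample numbers the first call gives 18.0 and the second 2.0. Note that neither call depends on how instances are assigned to timestamps.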

KelvinYang0320 commented on August 27, 2024

@wangyuyue
Thank you for the explanation! 😄

The compute latency is calculated as (total #compute instances) / (total # active PE).

Could you share the permalink for this feature in the TENET code?

I have a question about computation latency = (total #compute instances) / (total # active PE).

For simplicity, I replace ./dataflow_example/conv.s with

2 1
{S[k,c,ox,oy,rx,ry]:0<=k<2 and 0<=c<3 and 0<=ox<3 and 0<=oy<3 and 0<=rx<2 and 0<=ry<2}
{S[k,c,ox,oy,rx,ry]->I[c,ox+rx,oy+ry]}
{S[k,c,ox,oy,rx,ry]->W[k,c,rx,ry]}
{S[k,c,ox,oy,rx,ry]->O[k,ox,oy]}

, replace ./dataflow_example/pe_array.p with

{PE[i,j]:0<=i<2 and 0<=j<2}
{PE[i,j]->PE[i+1,j]; PE[i,j]->PE[i,j+1]}

128 1024 64 4

and replace ./dataflow_example/KC_systolic_dataflow.m with

{S[k,c,ox,oy,rx,ry]->PE[k%2,c%2]}
{S[k,c,ox,oy,rx,ry]->T[floor(k/8),floor(c/8),oy,ox+k%8+c%8]}

After running the command

bin/tenet --m ./dataflow_example/KC_systolic_dataflow.m --p ./dataflow_example/pe_array.p --s ./dataflow_example/conv.s --o test.csv --all

At T[0, 0, 0, 2], all instances in PE[0, 0]:

k  c  ox  oy  (rx, ry)
0  0  2   0   (0, 0), (0, 1), (1, 0), (1, 1)
0  2  0   0   (0, 0), (0, 1), (1, 0), (1, 1)

At T[0, 0, 0, 2], all instances in PE[0, 1]:

k  c  ox  oy  (rx, ry)
0  1  1   0   (0, 0), (0, 1), (1, 0), (1, 1)

At T[0, 0, 0, 2], all instances in PE[1, 0]:

k  c  ox  oy  (rx, ry)
1  0  1   0   (0, 0), (0, 1), (1, 0), (1, 1)

At T[0, 0, 0, 2], all instances in PE[1, 1]:

k  c  ox  oy  (rx, ry)
1  1  0   0   (0, 0), (0, 1), (1, 0), (1, 1)

At T[0, 0, 0, 2], there are $4\times2 + 4 + 4 + 4 = 20$ instances and $4$ active PEs, right?
(total #compute instances) / (total # active PE) = 5 cycles, but we actually need $4 \times 2 = 8$ cycles (PE[0,0] needs 8 cycles to perform 8 MACs).

Did I miss something? 😕
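
The per-PE counts above can be reproduced by brute-force enumeration of the iteration domain. The following Python sketch is a standalone illustration, not TENET code: the helper names (`pe_of`, `time_of`, `instances_per_pe`) are hypothetical, and the bounds and mappings are copied by hand from the modified conv.s and KC_systolic_dataflow.m shown earlier in this comment.

```python
from collections import Counter
from itertools import product

# Upper bounds of the iteration domain, in the order k, c, ox, oy, rx, ry.
BOUNDS = {"k": 2, "c": 3, "ox": 3, "oy": 3, "rx": 2, "ry": 2}


def pe_of(k, c, ox, oy, rx, ry):
    """Space mapping S -> PE[k%2, c%2]."""
    return (k % 2, c % 2)


def time_of(k, c, ox, oy, rx, ry):
    """Time mapping S -> T[floor(k/8), floor(c/8), oy, ox + k%8 + c%8]."""
    return (k // 8, c // 8, oy, ox + k % 8 + c % 8)


def instances_per_pe(timestamp):
    """Count the compute instances each PE receives at one timestamp."""
    counts = Counter()
    for point in product(*(range(n) for n in BOUNDS.values())):
        if time_of(*point) == timestamp:
            counts[pe_of(*point)] += 1
    return counts


counts = instances_per_pe((0, 0, 0, 2))
total = sum(counts.values())
print(dict(sorted(counts.items())))  # instances per active PE
print(total / len(counts))           # average load per active PE
print(max(counts.values()))          # load of the busiest PE
```

For T[0, 0, 0, 2] this prints 8 instances for PE (0, 0) and 4 for each of the other three PEs, an average of 5.0, and a maximum of 8. If each PE performs one MAC per cycle, the maximum is what bounds the time spent in this timestamp.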

Gnaiqing commented on August 27, 2024

Hi Kelvin,

The permalink for computation delay is here: Dataflow::GetComputationDelay().
I think you are right that the required cycles will be 8 for that timestamp. Our current estimation is relatively coarse-grained: it assumes all active PEs have similar workloads, so when the PEs' workloads differ significantly, the estimated computation delay becomes less accurate. If this is the case one may enumerate the workload of every PE to get more accurate estimation, however this will take longer runtime as well.

KelvinYang0320 commented on August 27, 2024

@Gnaiqing @wangyuyue I further simplified the previous example by replacing 0<=oy<3 with 0<=oy<1 and reran TENET.

Timestamp    MACs  Active PEs  Actual cycles
T[0,0,0,0]   4     1           4
T[0,0,0,1]   12    3           4
T[0,0,0,2]   20    4           8
T[0,0,0,3]   20    4           8
T[0,0,0,4]   12    3           4
T[0,0,0,5]   4     1           4

The actual total latency is 32 cycles. However, the computation latency reported by TENET is 18 (GetMacNum/dsize, with GetMacNum=72 and dsize=4). The relative error is about 44%, which is not negligible.

You have also written a function GetAverageActivePENum, but it is only used to report the average number of active PEs. If the computation latency is computed as GetMacNum/GetAverageActivePENum, the result is 27 cycles (GetMacNum=72, GetAverageActivePENum=2.67). The relative error drops to about 16%, which is smaller but still not negligible. Is there any reason for using GetPENum instead of GetAverageActivePENum to compute the computation latency?
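
The three latency figures discussed above can be cross-checked with a short standalone script. This is only a sketch in Python under two assumptions: each PE performs one MAC per cycle, and timestamps execute one after another. The variable names are hypothetical and none of this is taken from the TENET sources.

```python
from collections import Counter
from itertools import product

# Domain bounds in the order k, c, ox, oy, rx, ry (with 0<=oy<1).
BOUNDS = (2, 3, 3, 1, 2, 2)

# Number of compute instances per (timestamp, PE) pair.
per_slot = Counter()
for k, c, ox, oy, rx, ry in product(*(range(n) for n in BOUNDS)):
    pe = (k % 2, c % 2)
    t = (k // 8, c // 8, oy, ox + k % 8 + c % 8)
    per_slot[(t, pe)] += 1

timestamps = sorted({t for t, _ in per_slot})
all_pes = {pe for _, pe in per_slot}
mac_num = sum(per_slot.values())

# For every timestamp, the list of per-PE loads of the active PEs.
loads = {t: [n for (t2, _), n in per_slot.items() if t2 == t]
         for t in timestamps}

# Estimate 1: total MACs divided by the number of PEs ever used.
by_pe_num = mac_num / len(all_pes)

# Estimate 2: total MACs divided by the average number of active PEs.
avg_active = sum(len(v) for v in loads.values()) / len(timestamps)
by_avg_active = mac_num / avg_active

# Estimate 3: per-PE enumeration, the busiest PE bounds each timestamp.
by_enumeration = sum(max(v) for v in loads.values())

print(mac_num, by_pe_num, round(by_avg_active, 2), by_enumeration)
```

For this small example the script reports 72 MACs and latencies of 18.0, 27.0, and 32 cycles for the three estimates. The enumeration visits every compute instance, so its cost grows with the size of the iteration domain, which is the runtime trade-off mentioned earlier in the thread.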

  • TENET/src/dataflow.cpp

    Lines 336 to 339 in 8a3b3bc

    Dataflow::GetComputationDelay()
    {
        return GetMacNumPerPE();
    }
  • TENET/src/dataflow.cpp

    Lines 287 to 294 in 8a3b3bc

    Dataflow::GetMacNumPerPE(int mac_per_instance)
    {
        int mac_num = GetMacNum(mac_per_instance);
        // use GetSpaceDomain instead of pe.Getdomain() here in case some pes
        // are idle
        int dsize = GetPENum();
        return mac_num / dsize;
    }

one may enumerate the workload of every PE to get more accurate estimation, however this will take longer runtime as well.

Could you add this feature, or explain in detail how to implement it? Thank you. 😄

wangyuyue commented on August 27, 2024

Hi Kelvin,

I've merged your pull request. Thanks for your contribution! If there are no further questions, I will close this issue.
