
Comments (9)

lolz0r avatar lolz0r commented on June 9, 2024

I don't have specific numbers for you, but I do know that raw GFLOPs for an operation are not the only consideration in overall computational speed. Things like code branches, non-sequential memory reads, etc. can greatly slow down a program running on a GPU even when the overall GFLOPs are lower.
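A toy illustration of this point (NumPy on CPU, not pyscatwave code, but the same principle applies on GPU): two reductions with identical FLOP counts can differ in speed purely because of memory-access pattern.

```python
import time
import numpy as np

# Two sums over the same data, differing only in traversal order.
a = np.random.rand(2048, 2048)

t0 = time.perf_counter()
row_sum = a.sum(axis=1)        # walks contiguous rows (C order)
t_rows = time.perf_counter() - t0

t0 = time.perf_counter()
col_sum = a.T.sum(axis=1)      # walks strided columns of the original array
t_cols = time.perf_counter() - t0

# Identical FLOP counts; wall-clock times typically differ.
print(f"contiguous: {t_rows:.4f}s, strided: {t_cols:.4f}s")
```

On a GPU the effect is amplified further by uncoalesced reads and warp divergence.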

from pyscatwave.

lolz0r avatar lolz0r commented on June 9, 2024

Also, you want to double-check that you are using the .cuda() variant of the scattering transform.


aesadde avatar aesadde commented on June 9, 2024

@lolz0r thanks for the reply.

I am using cuda() and also transferring all the tensors to the GPU before I start comparing speeds of both networks.
The network is pretty simple, a feedforward network with 30+ convolutions.

I was expecting that by replacing the feature extractor (I'm using Inception v3) with the scattering network I'd see some speed-up; I was not expecting a 200x slowdown.

One thing I noticed is that there are multiple copies of the S transform in the scattering code. Is there a possibility that these copies are moving data between CPU and GPU?


lolz0r avatar lolz0r commented on June 9, 2024

Just for fun, what are your M, N, and J values set to?


aesadde avatar aesadde commented on June 9, 2024

M=224, N=224, J=3
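For reference, a back-of-the-envelope sketch of the output size those parameters imply, assuming L = 8 orientations (pyscatwave's usual default) and the standard second-order scattering coefficient count; both are assumptions, not stated in the thread:

```python
# Hypothetical sizing sketch; L = 8 orientations is an assumed default.
M, N, J, L = 224, 224, 3, 8

# Order-0 + order-1 + order-2 scattering coefficients.
n_coeffs = 1 + L * J + (L ** 2) * (J * (J - 1) // 2)   # 1 + 24 + 192 = 217

# Spatial resolution after J dyadic subsamplings.
out_h, out_w = M // 2 ** J, N // 2 ** J                # 28 x 28

print(n_coeffs, out_h, out_w)
```

So each 224x224 image yields on the order of 217 channels at 28x28 resolution per input channel, which helps put the timing numbers in context.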

I've noticed that periodize and modulus take most of the time, since their first pass is not yet cached. If I run the scattering net twice, the second run is about 1.5x faster.
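The warm-up effect described above can be mimicked in miniature with a cache (a toy analogy only: for CUDA the one-time cost is kernel compilation/loading, not Python-level memoization):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_setup(kernel_name):
    # Stand-in for the one-time cost of loading/compiling a CUDA kernel.
    return sum(i * i for i in range(100_000))

expensive_setup("modulus")   # first pass: pays the full cost
expensive_setup("modulus")   # second pass: served from the cache

info = expensive_setup.cache_info()
print(info.misses, info.hits)  # 1 miss (first call), 1 hit (second call)
```

This is why benchmarks should discard the first call before measuring.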


edouardoyallon avatar edouardoyallon commented on June 9, 2024

Well, maybe you could do a proper timing, but I'm not shocked if the scattering takes about 1s for a batch of 256 large RGB images. The implementation is definitely not optimal, though still faster than CPU implementations; the copies are due to the use of buffers. We're open to ideas to speed up the software while keeping memory use reasonable. In your particular case, the padding size could be a bit large. Depending on the application, it is possible to obtain a speed-up w.r.t. your current pipeline.

I'm a bit surprised that the modulus kernel is slow, but not that periodize's is. I think any expert in CUDA could drastically optimize our software, and I'd be pretty happy and open to any suggestions 👍


aesadde avatar aesadde commented on June 9, 2024

@edouardoyallon Thanks for your reply.

Actually, my test uses only a single RGB input of size (1, 3, 224, 224), and after a minor optimization it went from 2s to 0.5s. I'm using a GTX 1080.

Also, is there a reason why most fft and cdgmm calls don't use inplace=True?
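The trade-off behind inplace=True can be sketched with NumPy as a stand-in for the tensor operations being discussed (the variable names are illustrative):

```python
import numpy as np

a = np.ones((4, 4), dtype=np.complex64)
b = np.full((4, 4), 2 + 0j, dtype=np.complex64)

# Out-of-place: allocates a fresh buffer, leaving `a` intact for later reuse.
c = a * b

# In-place: saves the allocation but destroys the input operand.
np.multiply(a, b, out=a)

print(np.allclose(a, c))  # True: same values, different allocation strategy
```

An op can only safely run in place when its input is not needed again downstream, which is the distinction made in the reply below about fft.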

I'm understanding the code better and will see if I can manage something; I'll submit a PR in that case.


edouardoyallon avatar edouardoyallon commented on June 9, 2024

So there is definitely an issue in your testing procedure, as a batch of size (256, 3, 224, 224) should take about 1s on GPU (this timing sounds more like CPU?).
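A minimal timing harness along those lines (pure-Python sketch; the benchmark helper and its parameters are illustrative, and on GPU you would additionally call torch.cuda.synchronize() before reading the clock, since CUDA launches are asynchronous):

```python
import time

def benchmark(fn, *args, warmup=3, iters=10):
    # Warm-up runs absorb one-time costs (kernel compilation, caches)
    # so they don't pollute the measurement.
    for _ in range(warmup):
        fn(*args)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters

avg = benchmark(lambda n: sum(range(n)), 100_000)
print(f"{avg * 1e6:.1f} us per call")
```

Averaging over many iterations after warm-up also smooths out the small-batch jitter mentioned below.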

In this version of the code, we explicitly removed the buffers to let PyTorch decide the allocations (so that it is optimal). When the fft is not in place, it means the result will be used later. Imho, the code to optimize is here: https://github.com/edouardoyallon/pyscatwave/blob/master/scatwave/utils.py (each subroutine). Furthermore, if you don't use large batches, your timings could be a bit skewed.


aesadde avatar aesadde commented on June 9, 2024

True. I just ran a quick test with size (64, 3, 224, 224): scattering runs in 0.33 seconds on the second pass, while the first call takes 1.2s, probably because the modulus kernel is not yet cached at that point.

Still, this is slower than my whole original network, which is 22 GFLOPs, and yes, I'm running everything on GPU.

Thanks for the help! :)

