Comments (10)
> Does it mean that `torch.stack` and `torch.clone` are not currently supported, but if they are supported in the future, then it should work equally fast without skipping the main thread?

Yes, `torch.stack` and `torch.clone` are currently not supported. There is also a different issue in this case: these functions are used outside of the optimized model (i.e., the code is outside of the model optimized by the `torch.compile` / `torch.jit.trace` functions).
> As for auto-casting to fp16 - it looks like magic ) Indeed it works 2x faster. I assumed PyTorch did not support fp16 on CPU, as mentioned in #152; I actually tried it myself before I found that issue. So is setting `AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*"` a kind of workaround to bypass PyTorch's limitations? I'm not sure yet if I will use this mode because of the potential precision loss - I have to check how my model performs, but it is great to know that it actually works!
Since x86 CPUs didn't support FP16 before the very recent AVX-512 FP16 extension, it seems that nobody really cared about CPU support for FP16 in PyTorch. PyTorch's FP16 support on CPU is currently very limited, but there is some work going on on PyTorch's master branch, e.g.:
pytorch/pytorch#98493
pytorch/pytorch#98819
And yes, `AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*"` is a kind of workaround for this issue - it brings up FP16 support in Ampere Optimized PyTorch even when the framework doesn't support it.
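If you want to quantify the precision loss before committing to this mode, a minimal sketch along these lines may help (hypothetical: the stand-in model and file names are mine, and it assumes you run inside the Ampere Optimized PyTorch container where the `aio` backend is available - run once as-is and once with the filter set):

```python
# precision_check.py -- hypothetical sketch: run twice, once as-is and once
# with AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*" set, then compare the saved
# outputs to estimate the fp16 precision loss on your own model.
import os
import torch

# Stand-in model (assumption) -- substitute your real model here.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.ReLU(),
    torch.nn.Flatten(),
).eval()
model = torch.compile(model, backend='aio', options={'modelname': 'check'})

torch.manual_seed(0)
data = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    out = model(data)

tag = 'fp16' if os.environ.get('AIO_IMPLICIT_FP16_TRANSFORM_FILTER') else 'fp32'
torch.save(out, f'out_{tag}.pt')

if os.path.exists('out_fp32.pt') and os.path.exists('out_fp16.pt'):
    ref = torch.load('out_fp32.pt')
    low = torch.load('out_fp16.pt')
    print('max abs diff:', (ref - low).abs().max().item())
```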
Thanks for reaching out!
You will need to play with the `AIO_SKIP_MASTER_THREAD` env variable (possible values are 0 and 1; default is 0) to get the best performance, e.g.:
```
AIO_SKIP_MASTER_THREAD=1 AIO_NUM_THREADS=2 numactl -C 0-1 python3 example.py

x = data                          # Latency: ~250 ms
x = torch.stack([data[0]])        # Latency: ~100 ms
x = torch.randn(1, 3, 224, 224)   # Latency: ~400 ms
x = data.clone()                  # Latency: ~100 ms
```

```
AIO_SKIP_MASTER_THREAD=0 AIO_NUM_THREADS=2 numactl -C 0-1 python3 example.py

x = data                          # Latency: ~100 ms
x = torch.stack([data[0]])        # Latency: ~320 ms
x = torch.randn(1, 3, 224, 224)   # Latency: ~110 ms
x = data.clone()                  # Latency: ~320 ms
```
As you can see, ~100 ms latency is possible on your 2-threaded Ampere Altra VM in each case listed in your example, provided the proper value of the env variable is set.
FYI, we are working on a solution that relieves the user of the need to adjust this parameter.
For the time being please refer to: https://ampereaidevelopus.s3.amazonaws.com/releases/1.7.0/Ampere+Optimized+PyTorch+Documentation+v1.7.0.pdf
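For reference, example.py had roughly this shape (a sketch, not the original script - the actual model isn't shown in this thread, so a torchvision ResNet-50 stands in; exactly one of the four `x = ...` variants is active per run):

```python
# Hypothetical reconstruction of example.py: a compiled model fed via one of
# four input-preparation variants, timed per inference.
import time
import torch
import torchvision

model = torchvision.models.resnet50().eval()  # stand-in (assumption)
model = torch.compile(model, backend='aio', options={'modelname': 'example'})

data = torch.rand(1, 3, 224, 224)
n_warmup, n = 5, 100
with torch.no_grad():
    for i in range(n + n_warmup):
        if i == n_warmup:
            start = time.time()
        # Exactly one of the four variants is active per run:
        x = data
        # x = torch.stack([data[0]])
        # x = torch.randn(1, 3, 224, 224)
        # x = data.clone()
        model(x)

print(f'Latency: ~{(time.time() - start) / n * 1000:.0f} ms')
```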
Btw, you will get even better performance by auto-casting to fp16:
```
AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*" AIO_SKIP_MASTER_THREAD=0 AIO_NUM_THREADS=2 numactl -C 0-1 python3 test2.py

Latency: 50 ms
```
:)
> Not related, but do you have plans to release your packages to be used outside Docker, or so that they can be used in my own custom container?
Please contact us at [email protected] and we should be able to get you a working .deb installer.
Hi,
a few things to mention:
- My results were measured with a model containing the `scaled_dot_product_attention` op, which was not handled by our kernels, just by the regular PyTorch implementation. So there is some problem, but I don't really know where yet. I was also testing our newest implementations, which may affect the performance.
- Indeed, we don't support `scaled_dot_product_attention` yet, so it may be beneficial to turn it off in the PyTorch implementation. We have started some effort to support it, but I wouldn't expect it to be available this year. A model without this op should be handled entirely by our kernels; overall, I think it may still be better than the PyTorch implementation of `scaled_dot_product_attention`.
- The difference you saw in the logs means that our software chose a different implementation of that operation for the bigger size; it should likely be better.
Thanks a lot for the fast and helpful reply )
Indeed, `AIO_SKIP_MASTER_THREAD=1` helped. For some reason it now works equally fast for all 4 cases (~100-110 ms). I don't mind, but it is a bit confusing why it differs from your results.
The documentation says:

> If the model contains nodes not supported by Ampere Optimized Pytorch we recommend setting following environmental variable: AIO_SKIP_MASTER_THREAD=1

Does it mean that `torch.stack` and `torch.clone` are not currently supported, but if they are supported in the future, then it should work equally fast without skipping the main thread?
As for auto-casting to fp16 - it looks like magic ) Indeed it works 2x faster. I assumed PyTorch did not support fp16 on CPU, as mentioned in #152; I actually tried it myself before I found that issue. So is setting `AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*"` a kind of workaround to bypass PyTorch's limitations? I'm not sure yet if I will use this mode because of the potential precision loss - I have to check how my model performs, but it is great to know that it actually works!
I'm closing the issue, as you've already answered, but would appreciate another reply )
Not related, but do you have plans to release your packages to be used outside Docker, or so that they can be used in my own custom container?
Hi, it's me again 🙈.
It seems like a similar issue, though a bit different. This time it depends on the input size. For some (smaller) inputs it works fast, but after some threshold it suddenly slows down 2-3x. Here I'm using the vision transformer from the timm library. It basically reshapes an image from a 2D grid of patches into a 1D sequence and runs a fairly basic transformer on it. So for img_size=110 and patch_size=10 the sequence length will be 11 * 11 = 121, and if you increase img_size to 120, the sequence length will be 12 * 12 = 144.
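(For concreteness, note that the ViT also prepends a class token, which is why the tensor shapes in the logs show sequence lengths of 122 and 145 - the patch count plus one. A quick illustration of my own:)

```python
# Patch-grid arithmetic for the ViT below: (img_size // patch_size)**2 patches,
# plus one class token prepended by the model.
patch_size = 10
for img_size in (110, 120):
    n_patches = (img_size // patch_size) ** 2
    print(img_size, n_patches, n_patches + 1)  # 110 -> 121, 122; 120 -> 144, 145
```

The full repro script follows.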
```python
import torch
import timm
import time

# img_size = 110  # latency: 10ms
img_size = 120  # latency: 33ms

model = timm.models.VisionTransformer(
    img_size=img_size,
    patch_size=10,
    embed_dim=128,
    num_heads=8,
    depth=12
)
model.eval()
model = torch.compile(model, backend='aio', options={'modelname': 'vit'})

data = torch.rand(1, 3, img_size, img_size)

n_warmup = 5
n = 100
with torch.no_grad():
    for i in range(n + n_warmup):
        if i == n_warmup:
            start = time.time()
        model(data)

duration = time.time() - start
latency = duration / n * 1000
cps = 1000 / latency
print(f'Latency: {round(latency)}ms, rate: {round(cps)} per second')
```
With `AIO_SKIP_MASTER_THREAD=1` it works a bit faster, though there is still the same slowdown when changing the input size.
Should I provide logs, or maybe you have ideas what could be wrong without them?
Hi, I've tried to run your script with:

```
AIO_SKIP_MASTER_THREAD=1 AIO_NUM_THREADS=4 OMP_NUM_THREADS=4 python test.py
```

For img_size = 110 I'm getting:
Latency: 13ms, rate: 77 per second
and for img_size = 120:
Latency: 15ms, rate: 66 per second
There is some difference, but not that big.
How do you run the script? How many threads are you using?
Thanks for looking into this. I'm using 8 threads. I'm attaching debug logs; I hope they contain all the necessary information.
Also, I'm testing on your previous Docker container version (amperecomputingai/pytorch:1.7.0); I see that there is a newer one already. I'm not sure whether it makes a difference, but I can try testing on it.
Comparing the logs side by side, I don't see any major difference besides slightly larger tensor shapes, which match my calculations (sequence length 122 for image_size=110 and 145 for image_size=120), but I believe that shouldn't affect performance this much. The only difference besides shapes that I see is this:
Could it be related to cache sizes somehow? It doesn't strictly depend on image_size; for example, if I change embed_dim, the drop occurs at a different image_size threshold.
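(If cache sizes are the suspect, a quick way to inspect them on Linux - my own sketch, assuming the standard sysfs layout:)

```python
# Print CPU cache sizes from sysfs (standard Linux layout assumed).
from pathlib import Path

for idx in sorted(Path('/sys/devices/system/cpu/cpu0/cache').glob('index*')):
    level = (idx / 'level').read_text().strip()
    ctype = (idx / 'type').read_text().strip()
    size = (idx / 'size').read_text().strip()
    print(f'L{level} {ctype}: {size}')
```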
Hmm... looks like I found the reason.
There are the following lines in the Attention module implementation:
```python
if self.fused_attn:
    x = F.scaled_dot_product_attention(
        q, k, v,
        dropout_p=self.attn_drop.p if self.training else 0.,
    )
else:
    q = q * self.scale
    attn = q @ k.transpose(-2, -1)
    attn = attn.softmax(dim=-1)
    attn = self.attn_drop(attn)
    x = attn @ v
```
I actually wanted to ask if you have optimizations for transformers/attention. PyTorch has this `scaled_dot_product_attention` method, which I guess is an optimized version of the code in the else branch. It is possible to choose which attention to use via an env variable, `TIMM_FUSED_ATTN=0/1`, but if it is not set explicitly, timm checks whether the `scaled_dot_product_attention` method is available and, if so, uses it.
If I disable fused attention explicitly by setting `TIMM_FUSED_ATTN=0`, then it actually works a bit faster and there is no performance drop. So I guess `scaled_dot_product_attention` is just not implemented in Ampere, and plain matrix multiplication should be used.
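For reference, a minimal sketch of the workaround (my assumption is that timm reads `TIMM_FUSED_ATTN` at import time, so it must be set before the import; equivalently, prefix the shell command with `TIMM_FUSED_ATTN=0`):

```python
import os
# Assumption: timm reads TIMM_FUSED_ATTN when it is imported, so set it first.
# '0' forces the unfused (matmul + softmax) attention path.
os.environ['TIMM_FUSED_ATTN'] = '0'

import torch
import timm

model = timm.models.VisionTransformer(
    img_size=120, patch_size=10, embed_dim=128, num_heads=8, depth=12
).eval()
model = torch.compile(model, backend='aio', options={'modelname': 'vit'})
```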
Thanks for the clarification!
I've actually tried it on a completely new VM with the latest Docker image (1.8.0), and it looks like there is still the same issue for me when `scaled_dot_product_attention` is used. I tried all combinations of TIMM_FUSED_ATTN × AIO_SKIP_MASTER_THREAD × image_size.
As you can see below, if `TIMM_FUSED_ATTN=1`, then for any `AIO_SKIP_MASTER_THREAD` value there is a perf drop when increasing img_size from 110 to 120.
I'm not sure if this gives any clue, but when the perf drop occurs I see a similar increase in red/kernel CPU usage, as if `AIO_SKIP_MASTER_THREAD=1` were not set, although without it the red portion is even larger. For image_size=110, however, the CPU bars are completely green.
This is not critical for me at the moment, as there is a workaround which is even faster for now, but if at some point you get an idea of what could be wrong, I'd be interested to know.