Export a non-streaming ONNX model from a streaming PyTorch model (icefall issue, closed, 6 comments)

1215thebqtic commented on June 23, 2024

export a non-stream onnx model from a streaming pytorch model

from icefall.

Comments (6)

MicKot commented on June 23, 2024 1

Setting chunk-size=-1 and left-context-frames=-1 does not make the model non-streaming. It only gives the model full context; the model is still a streaming model, i.e. the convolutions are still causal. If you export it with the non-streaming script, you lose the ability to use the cache, which is the whole point of training a streaming model. In my testing, chunk-size=512 and left-context-frames=512 gives the best WER (not surprising, given that context is king), if WER is what you care about, while keeping the ability to stream (just not in real time).
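
To make the chunk-size / left-context-frames idea concrete, here is a minimal sketch of a chunked attention mask in plain Python. It is an illustration under assumptions, not icefall's masking code: the function name `chunk_attention_mask` is hypothetical, and treating a negative value as "unlimited" follows the way -1 is used in this thread. It only models attention visibility; it says nothing about the convolutions, which stay causal as noted above.

```python
from typing import List


def chunk_attention_mask(
    num_frames: int, chunk_size: int, left_context: int
) -> List[List[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j.

    Hypothetical helper for illustration only.
    chunk_size < 0   -> every frame sees every other frame (full context).
    left_context < 0 -> every frame sees all frames before its chunk.
    """
    mask = []
    for i in range(num_frames):
        if chunk_size < 0:
            mask.append([True] * num_frames)
            continue
        # Frames are grouped into consecutive chunks of chunk_size frames.
        chunk_start = (i // chunk_size) * chunk_size
        chunk_end = min(chunk_start + chunk_size, num_frames)
        # Visible history starts left_context frames before the chunk.
        first = 0 if left_context < 0 else max(0, chunk_start - left_context)
        mask.append([first <= j < chunk_end for j in range(num_frames)])
    return mask


if __name__ == "__main__":
    for row in chunk_attention_mask(num_frames=4, chunk_size=2, left_context=1):
        print("".join("x" if visible else "." for visible in row))
```

With a large chunk size and left context the mask approaches the full-context case, which is consistent with the WER observation above, while the model can still be run chunk by chunk.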


JinZr commented on June 23, 2024

Hi, I'm not very familiar with the ONNX export scripts, but I believe they are not designed to export a streaming model directly as a non-streaming one; perhaps the conversion breaks some of the attention masks. Anyway, if you need a non-streaming model, you should train a non-streaming model in the first place. Best, Jin

…

On Apr 2, 2024, at 17:26, 1215thebqtic ***@***.***> wrote:

Hi, I'm trying to export a non-streaming ONNX model from a streaming PyTorch zipformer2 model. Training a non-streaming zipformer2 model from scratch takes a long time, so I decided to use "--chunk-size -1 --left-context-frames -1" to get a non-streaming model. The streaming model was trained with causal=1.

The script I used to export the non-streaming ONNX model from the streaming PyTorch model:

    ./zipformer/export-onnx.py \
      --tokens $tokenfile \
      --use-averaged-model 0 \
      --epoch 99 \
      --avg 1 \
      --exp-dir zipformer/exp_L_causal_context_2 \
      --num-encoder-layers "2,2,3,4,3,2" \
      --downsampling-factor "1,2,4,8,4,2" \
      --feedforward-dim "512,768,1024,1536,1024,768" \
      --num-heads "4,4,4,8,4,4" \
      --encoder-dim "192,256,384,512,384,256" \
      --query-head-dim 32 \
      --value-head-dim 12 \
      --pos-head-dim 4 \
      --pos-dim 48 \
      --encoder-unmasked-dim "192,192,256,256,256,192" \
      --cnn-module-kernel "31,31,15,15,15,31" \
      --decoder-dim 512 \
      --joiner-dim 512 \
      --causal True \
      --chunk-size -1 \
      --left-context-frames -1

When I use the following command to decode with the ONNX model:

    ./zipformer/onnx_pretrained.py \
      --encoder-model-filename $repo/encoder-epoch-99-avg-1.onnx \
      --decoder-model-filename $repo/decoder-epoch-99-avg-1.onnx \
      --joiner-model-filename $repo/joiner-epoch-99-avg-1.onnx \
      --tokens $tokenfile \
      icefall-asr-zipformer-streaming-wenetspeech-20230615/test_wavs/DEV_T0000000001.wav

an error occurred: broadcasting_error.PNG (view on web) <https://github.com/k2-fsa/icefall/assets/11812181/35281c26-7db7-4c76-80dc-b2558e11e0d3>

The failing node in Netron: onnx_node.PNG (view on web) <https://github.com/k2-fsa/icefall/assets/11812181/895b6969-4665-4539-8f98-d9b199c33f35>

According to Netron and the zipformer code, I think it is caused by the broadcasting in https://github.com/k2-fsa/icefall/blob/6cbddaa8e32ec5bc5c2fcc60a6d2409c7f5c7b11/egs/librispeech/ASR/zipformer/scaling.py#L671 : x_chunk's shape is (batch_size, num_channels, chunk_size), and chunk_scale's shape is (num_channels, chunk_size). I noticed that streaming_forward has the same code (https://github.com/k2-fsa/icefall/blob/6cbddaa8e32ec5bc5c2fcc60a6d2409c7f5c7b11/egs/librispeech/ASR/zipformer/scaling.py#L730), but there are no errors when exporting the streaming ONNX model. After I deleted this line of code, the waves could be decoded successfully, but the WERs on my test set differ slightly: 5.89 (PyTorch) versus 5.61 (ONNX) (PyTorch decoding script: ./zipformer/pretrained.py; ONNX decoding script: ./zipformer/onnx_pretrained.py).

My questions are:

Why does the broadcasting in non-streaming mode lead to ONNX errors, while there are no errors in the streaming ONNX model?

How can I change this line of code to avoid the error and make the WER match the PyTorch one?

Thanks!

View it on GitHub <#1576>.
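
The shape question in the quoted message can be checked by hand. The sketch below implements the standard right-aligned broadcasting rule (the one NumPy and PyTorch use) in plain Python, so it needs no dependencies. The helper name `broadcast_shape` and the concrete sizes (384 channels, 16 and 200 frames) are assumptions chosen for illustration; they are not taken from the model in the thread.

```python
from typing import Tuple


def broadcast_shape(a: Tuple[int, ...], b: Tuple[int, ...]) -> Tuple[int, ...]:
    """Return the broadcast shape of a and b, or raise ValueError.

    Hypothetical helper: shapes are right-aligned, and two dimensions are
    compatible when they are equal or one of them is 1.
    """
    ndim = max(len(a), len(b))
    # Left-pad the shorter shape with 1s so both have the same rank.
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)
    out = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            out.append(da)
        elif da == 1:
            out.append(db)
        else:
            raise ValueError(f"cannot broadcast {a} with {b}: {da} vs {db}")
    return tuple(out)


if __name__ == "__main__":
    # x_chunk: (batch_size, num_channels, chunk_size); chunk_scale: (num_channels, chunk_size)
    print(broadcast_shape((1, 384, 16), (384, 16)))  # time lengths agree
    try:
        # A scale whose time length was fixed at export time meets a longer input.
        broadcast_shape((1, 384, 200), (384, 16))
    except ValueError as err:
        print("error:", err)
```

The multiplication itself is legal whenever the two time dimensions agree. A mismatch like the second call is what one would expect if the time length of chunk_scale were frozen at export time while x_chunk grows with the input.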


1215thebqtic commented on June 23, 2024

Hi Jin, thanks for your reply! More than 100k hours of data were used to train the model, and training would take about 20 days because of limited GPUs. With the streaming model, the non-streaming decoding options (--chunk-size=-1, --left-context-frames=-1) give a relative 20%~30% better WER than the streaming decoding options (--chunk-size=32, --left-context-frames=128), so I decided to export the non-streaming model.


csukuangfj commented on June 23, 2024

I'm looking into this.


1215thebqtic commented on June 23, 2024

"I'm looking into this."

Hello, I found that the error is caused by an if/else statement in this function:

icefall/egs/librispeech/ASR/zipformer/scaling.py

Line 681 in 6cbddaa

def _get_chunk_scale(self, chunk_size: int):

The input used when exporting to ONNX is:

icefall/egs/librispeech/ASR/zipformer/export-onnx.py

Line 297 in 6cbddaa

x = torch.zeros(1, 100, 80, dtype=torch.float32)

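So the export takes the if branch of the if/else above. When I tested the exported model, the utterances were all a dozen or so seconds long, so it reported the dimension mismatch error. If the dummy input is changed to x = torch.zeros(1, 1000, 80, dtype=torch.float32), the else branch is taken instead, there is no error, and recognition works normally. However, I don't know how to merge or split this if/else so that it works for both short and long utterances.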


csukuangfj commented on June 23, 2024

Replied in the Next-gen Kaldi WeChat group.

The fix is

diff --git a/egs/librispeech/ASR/zipformer/scaling_converter.py b/egs/librispeech/ASR/zipformer/scaling_converter.py
index 76622fa1..346db55e 100644
--- a/egs/librispeech/ASR/zipformer/scaling_converter.py
+++ b/egs/librispeech/ASR/zipformer/scaling_converter.py
@@ -36,7 +36,7 @@ from scaling import (
     SwooshROnnx,
     Whiten,
 )
-from zipformer import CompactRelPositionalEncoding
+from zipformer import CompactRelPositionalEncoding, ChunkCausalDepthwiseConv1d


 # Copied from https://pytorch.org/docs/1.9.0/_modules/torch/nn/modules/module.html#Module.get_submodule  # noqa
@@ -93,6 +93,10 @@ def convert_scaled_to_non_scaled(
             # the input changes, so we have to use torch.jit.script()
             # to replace torch.jit.trace()
             d[name] = torch.jit.script(m)
+        elif is_onnx and isinstance(m, ChunkCausalDepthwiseConv1d):
+            # to export a zipformer model that is trained with --causal=1
+            # but to export it with --chunk-size=-1 and --left-chunk-size=-1
+            d[name] = torch.jit.script(m)

     for k, v in d.items():
         if "." in k:
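
The effect of this fix can be reproduced on a toy module. The sketch below assumes PyTorch is installed and uses a hypothetical `ToyChunkScale` module, a much simplified stand-in and not the icefall `ChunkCausalDepthwiseConv1d`. Its forward branches on the input length, in the way described earlier in the thread. Tracing with a short dummy input records only the branch that was taken, while `torch.jit.script` keeps both branches.

```python
import warnings

import torch


class ToyChunkScale(torch.nn.Module):
    """Hypothetical module whose forward branches on the input length."""

    def __init__(self, kernel_size: int = 4) -> None:
        super().__init__()
        self.kernel_size = kernel_size
        self.edge = torch.nn.Parameter(torch.ones(1, kernel_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (1, time)
        t = x.shape[-1]
        if t < self.kernel_size:
            # Short input: crop the learned edge scale.
            scale = self.edge[:, :t]
        else:
            # Long input: pad the edge scale with zeros up to the input length.
            pad = torch.zeros([1, t - self.kernel_size])
            scale = torch.cat([self.edge, pad], dim=-1)
        return x * (1.0 + scale)


def call_fails(module: torch.nn.Module, x: torch.Tensor) -> bool:
    """Return True when calling module(x) raises a RuntimeError."""
    try:
        module(x)
    except RuntimeError:
        return True
    return False


module = ToyChunkScale()
short_input = torch.ones(1, 2)  # takes the `if` branch during tracing
long_input = torch.ones(1, 8)   # needs the `else` branch

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # tracing warns about the data-dependent branch
    traced = torch.jit.trace(module, short_input)
scripted = torch.jit.script(module)

if __name__ == "__main__":
    print("traced fails on long input:", call_fails(traced, long_input))
    print("scripted output shape:", tuple(scripted(long_input).shape))
```

This only demonstrates the trace-versus-script difference in general; how the ONNX exporter handles the scripted submodule is outside the sketch.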
