Comments (6)
Very interesting papers, especially the CIF-T. Will read them after the long weekend.
My thought was pretty much the same, very hard to beat RNNT without sacrificing something (speed or quality).
AEDs, despite being slow and having hallucination issues, enable lots of sophisticated capabilities that RNNT alone can probably handle, but with poor results. So imo it makes sense to trade off some of the benefits of RNNT for those new capabilities.
CIF, to me at least, just falls in the middle of the niche already dominated by CTC and RNNT, namely efficient and highly accurate ASR. AEDs fill another niche: advanced capabilities that aren't efficient with monotonic alignment learning losses.
from nemo.
@titu1994 Thank you very much for your detailed answer!
Yes, I also had doubts about how high-quality Paraformer could actually be.
In our team we already use Fast Conformer (CTC or HAT), xlarge for knowledge distillation and medium for production. Its speed is quite good compared to the regular Conformer, and the quality practically does not drop.
But in search of improvements, it would be interesting to try alternative architectures. One of them is Paraformer (because it is NAR), and probably even its CIF.
There are also several interesting articles on CIF that could perhaps lead to interesting or better results:
- https://arxiv.org/pdf/1905.11235.pdf - standard article (CIF: CONTINUOUS INTEGRATE-AND-FIRE FOR END-TO-END SPEECH RECOGNITION) (https://github.com/MingLunHan/CIF-PyTorch)
- https://github.com/MingLunHan/CIF-ColDec (add selective context to improve decoding result)
- https://github.com/MingLunHan/CIF-HieraDist (transfer of knowledge from a PLM model to an ASR model, at the linguistic and acoustic level) (in this article Branchformer shows a good summary - you can take a look at it)
- https://arxiv.org/pdf/2307.14132.pdf (CIF-T: A NOVEL CIF-BASED TRANSDUCER ARCHITECTURE FOR AUTOMATIC SPEECH RECOGNITION) (good article with interesting additions) (here, by the way, paraformer does not show very good results)
> NeMo support for AED models will come soon* (no release date for the time being).

That would be very good!
I've read this paper before, so some of my comments are below.
- MWER is a pain to train with and implement, plus there's no public efficient implementation as far as I can see. I'd rather stick to RNNT loss or even CTC.
- Continuous Integrate and Fire is a novel concept but has not gained much traction in the 4 years since its paper in 2019. I've personally experimented a bit with it in some branch, and found that while it works fine, its WER is inferior to RNNT. Maybe there are new variants of CIF that surpass RNNT.
- NeMo support for AED models will come soon* (no release date for the time being).
- The paper's core contribution seems to be RTF at comparable accuracy to AR models. This is very good, but on RTF, Fast Conformer already does very well (0.0003x on long audio files, i.e. ~3687 times faster than real time). So encoder-level optimizations already surpass this paper with CTC. With RNNT, that RTF is something like 0.009, which is on par with this paper. I'm somewhat confident that RNNT WER would be competitive with this model as well, while also supporting long audio inference and other tasks like speech translation.
So it seems Paraformer would be a middle ground between CTC and RNNT in terms of both accuracy and speed.
Still, I need to reread the paper to see if I'm missing something crucial.
Just a note, these are just my personal comments; the team will need to discuss whether the model will be added to NeMo or not.
If you're up for it, we will gladly welcome contributions to add this model too!
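For anyone comparing the RTF numbers quoted in this comment, here is a minimal sketch of the conversion between real-time factor and speed-up. It assumes the usual definition RTF = processing time / audio duration; the helper names are hypothetical and not part of NeMo.

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent decoding divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup_from_rtf(rtf_value: float) -> float:
    """How many times faster than real time a system with this RTF runs."""
    if rtf_value <= 0:
        raise ValueError("rtf_value must be positive")
    return 1.0 / rtf_value


def rtf_from_speedup(speedup: float) -> float:
    """Inverse conversion: RTF implied by a given speed-up."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


if __name__ == "__main__":
    # Figures quoted in the thread (not re-measured here).
    ctc_speedup = speedup_from_rtf(0.0003)   # about 3333x
    rnnt_speedup = speedup_from_rtf(0.009)   # about 111x
    implied_rtf = rtf_from_speedup(3687)     # about 0.00027
    print(f"CTC:  {ctc_speedup:.0f}x real time")
    print(f"RNNT: {rnnt_speedup:.0f}x real time")
    print(f"RTF implied by 3687x: {implied_rtf:.5f}")
```

Under that definition, the quoted 0.0003 looks like a rounded value of roughly 0.00027, which is what a 3687x speed-up implies, and the RNNT figure of 0.009 corresponds to about 111x real time.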
What about AIF?
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.