Comments (6)
from icefall.
@JinZr Thanks so much for the prompt answer, really appreciate it. Could you kindly help with some follow-up questions (pardon my lack of knowledge of Kaldi and ASR in general):
- just to confirm, for the folder structure, do you mean the following?
data_dir/
├─ train/
│ ├─ text
│ ├─ wav.scp
│ ├─ utt2spk
│ ├─ spk2utt
├─ test/
│ ├─ text
│ ├─ wav.scp
│ ├─ utt2spk
│ ├─ spk2utt
- Since I don't care about the speaker info in my training, can both files utt2spk and spk2utt just contain one line per utterance of the form utt_id utt_id?
- For wav.scp and text, do the contents below look right to you?
wav.scp contains the following:
data_001 /absolute_file_path_to_001.wav
data_002 /absolute_file_path_to_002.wav
...
text contains the following:
data_001 (Chinese text with words separated, i.e. after word segmentation)
...
In addition, may I ask some general questions about ASR training:
- I wonder if you have any advice on how to prepare the contents of text most efficiently? It involves first performing ASR on each audio file, but given that the current ASR model doesn't perform well on these data, it may need quite a bit of human proofreading. Also, how should we perform word segmentation: by using jieba, or do you have a better recommendation?
- For my custom audio data which I plan to use for training, at the beginning of most audio files there are about 10 to 30 seconds of silence or music (a caller ring-back tone) playing before the person starts talking. Will such music or silence negatively impact the training results? Do you think it's necessary to remove these parts before training?
Thanks again for your help.
from icefall.
@JinZr Thanks a lot! Your answers really help.
1. Regarding 4, got it; I originally thought word segmentation was needed here. For the text part, should the transcriptions contain spaces and punctuation? For example, in text, should I prepare the data to look like the following:
data_001 你好。对的。哦,我现在不太方便。好的我知道了。
or should I prepare it as follows:
data_001 你好对的哦我现在不太方便好的我知道了
?
2. Regarding transcriptions for training, another thing I'd like your thoughts on: in our phone-conversation scenario, some users tend to think as they speak and, as a result, delay certain words and characters. For example, instead of saying "这个没问题" at a regular speed, a user may say "这个——没问题", where it takes the user 1 second or more to finish saying "个" (the sound is continuous; I call it 拖音, prolonging the pronunciation of a sound). Do you think it's possible to train the ASR model to output a special character like "——" so that it can detect 拖音? I'm thinking of adding a special character in the transcriptions in text to catch it, but I'm not sure if it's a good idea. Any thoughts?
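Whichever convention turns out to be right, the two candidate formats differ only in a normalization step. Here is a minimal sketch of stripping punctuation from the transcripts while optionally protecting a "——" 拖音 marker (the function name, placeholder token, and punctuation set are my own assumptions, not anything icefall prescribes):

```python
import re

# ASCII plus common full-width CJK punctuation, and whitespace.
_PUNCT = re.compile(r"[—，。！？、；：,.!?;:\"'“”‘’（）()《》…~～\s]+")

def normalize_transcript(text, keep_prolong_marker=False):
    """Remove punctuation from a transcript line.  If requested,
    the "——" prolongation marker survives via a placeholder token."""
    marker = "<PROLONG>"
    if keep_prolong_marker:
        text = text.replace("——", marker)
    text = _PUNCT.sub("", text)
    if keep_prolong_marker:
        text = text.replace(marker, "——")
    return text
```

Keeping the normalization in one function makes it easy to regenerate text under either convention once the maintainers confirm which one the recipe expects.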
Thanks again.
@JinZr Thanks. Two more questions:
1. We currently have a few hundred audio files for training (not many). How do you suggest we divide the data into training and test sets? I'm thinking of using most or even all of them for training, and few or none of them for the test set.
2. For ASR training, what's the ideal length of each audio clip? Is 20 seconds OK?
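On question 1, if even a small held-out set is wanted for sanity-checking WER, a deterministic split keeps the two sets stable across reruns. A minimal sketch (the function name and the 5% default are my own choices, not an icefall convention):

```python
import random

def split_utt_ids(utt_ids, test_fraction=0.05, seed=42):
    """Deterministically split utterance ids into (train, test) lists.
    Sorting before shuffling makes the split reproducible regardless
    of the order the ids were collected in."""
    ids = sorted(utt_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    return sorted(ids[n_test:]), sorted(ids[:n_test])
```

With a few hundred files, 5% still leaves essentially all the data for training while giving a small fixed test set to track progress on.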