
Comments (6)

JinZr commented on July 21, 2024

from icefall.

JinZr commented on July 21, 2024

JinZr commented on July 21, 2024

daocunyang commented on July 21, 2024

@JinZr Thanks so much for the prompt answer; I really appreciate it. Could you kindly help with some follow-up questions (pardon my lack of knowledge of Kaldi and ASR in general):

  1. Just to confirm, for the folder structure, do you mean the following?
data_dir/
├─ train/
│  ├─ text
│  ├─ wav.scp
│  ├─ utt2spk
│  ├─ spk2utt
├─ test/
│  ├─ text
│  ├─ wav.scp
│  ├─ utt2spk
│  ├─ spk2utt
  2. Since I don't care about speaker info in my training, can utt2spk and spk2utt both just contain utt_id utt_id (each of the two files containing just a single line, utt_id utt_id)?

  3. For wav.scp and text, do the contents below look right to you:

wav.scp contains the following: 

data_001 /absolute_file_path_to_001.wav
data_002 /absolute_file_path_to_002.wav
...

text contains the following:

data_001 (Chinese text with the words separated, i.e. word-segmented data)
...
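
The layout described in questions 1 to 3 could be generated with something like the sketch below. It is only an illustration, not the maintainers' recipe: the function name `write_kaldi_dir` and the input dictionary are hypothetical, and it assumes each utterance is treated as its own speaker, so utt2spk and spk2utt each get one `utt_id utt_id` line per utterance. Keys are written in sorted order, which Kaldi's data-directory tooling expects.

```python
from pathlib import Path


def write_kaldi_dir(out_dir, utterances):
    """Write wav.scp, text, utt2spk and spk2utt into out_dir.

    utterances: dict mapping utt_id -> (absolute wav path, transcript).
    Assumption: no speaker information, so every utterance is its own speaker.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    lines = {"wav.scp": [], "text": [], "utt2spk": [], "spk2utt": []}
    for utt_id in sorted(utterances):  # Kaldi expects sorted keys
        wav_path, transcript = utterances[utt_id]
        lines["wav.scp"].append(f"{utt_id} {wav_path}")
        lines["text"].append(f"{utt_id} {transcript}")
        # Speaker id == utterance id, one line per utterance in both files.
        lines["utt2spk"].append(f"{utt_id} {utt_id}")
        lines["spk2utt"].append(f"{utt_id} {utt_id}")

    for name, rows in lines.items():
        (out / name).write_text("".join(row + "\n" for row in rows), encoding="utf-8")
```

Calling `write_kaldi_dir("data_dir/train", {"data_001": ("/absolute_file_path_to_001.wav", "...")})` would produce the four files under `data_dir/train/`; the same call with a different directory covers `test/`.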

In addition, may I ask some general questions about ASR training:

  4. Do you have any advice on how to prepare the contents of text most efficiently? It involves first running ASR on each audio file, but since the current ASR model doesn't perform well on this data, it may need quite a lot of human proofreading. Also, how should we perform word segmentation: with jieba, or do you have a better recommendation?

  5. In most of the custom audio files I plan to use for training, there are about 10 to 30 seconds of silence or ring-back music (彩铃) at the beginning, before the person starts talking. Will such music or silence negatively impact the training results? Do you think it's necessary to remove these parts before training?

Thanks again for your help.


daocunyang commented on July 21, 2024

@JinZr Thanks a lot! Your answers really help.

1. Regarding 4: got it. I originally thought word segmentation was needed here. For the text file, should the transcriptions contain spaces and punctuation? For example, should I prepare the data in text like this:

data_001 你好。对的。哦,我现在不太方便。好的我知道了。
or like this?
data_001 你好对的哦我现在不太方便好的我知道了

2. Regarding transcription for training, I'd also like your thoughts on another issue. In our phone-conversation scenario, some users think as they speak and, as a result, draw out certain words and characters. For example, instead of saying "这个没问题" at a regular speed, a user may say "这个——没问题", taking 1 second or more to finish saying "个". The sound is continuous; I call it 拖音 (prolonging the pronunciation of a sound). Do you think it's possible to train the ASR model to output a special character like "——" so that it can detect 拖音? I'm thinking of adding a special character to the transcriptions in text to capture it, but I'm not sure it's a good idea. Any thoughts?
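
For reference, the second form in question 1 can be derived from the first mechanically. The sketch below is only an illustration and does not settle which form the recipe wants: the helper names are hypothetical, and it assumes the transcripts are Chinese-only, so dropping every whitespace and punctuation character is safe (that would not hold for mixed Chinese/English text). The optional `keep` set shows how a marker character, such as the one proposed in question 2, could be exempted.

```python
import unicodedata


def normalize_transcript(transcript, keep=frozenset()):
    """Drop whitespace and Unicode punctuation, except characters listed in keep."""
    return "".join(
        ch
        for ch in transcript
        if ch in keep
        or not (ch.isspace() or unicodedata.category(ch).startswith("P"))
    )


def normalize_text_line(line, keep=frozenset()):
    """Normalize one 'utt_id transcript' line; the utterance id is left untouched."""
    utt_id, _, transcript = line.rstrip("\n").partition(" ")
    return f"{utt_id} {normalize_transcript(transcript, keep)}"
```

The utterance id is split off first because ids such as `data_001` contain an underscore, which Unicode also classifies as punctuation.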

Thanks again.


daocunyang commented on July 21, 2024

@JinZr Thanks. Two more questions:
1. We currently have only a few hundred audio files for training. How do you suggest we divide the data into training and test sets? I'm thinking of using most or probably all of them for training, and few or even none of them for the test set.

2. For ASR training, what's the ideal length of each audio file? Is 20 seconds OK?
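
If a small held-out set is kept after all, a reproducible split over utterance ids might look like the sketch below. The function name and the 10% default are assumptions chosen purely for illustration, not a ratio recommended anywhere in this thread.

```python
import random


def split_utt_ids(utt_ids, test_fraction=0.1, seed=0):
    """Return (train_ids, test_ids), both sorted, from a seeded shuffle.

    test_fraction and seed are illustrative defaults, not recommendations.
    """
    ids = sorted(utt_ids)              # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)   # deterministic shuffle for a given seed
    n_test = max(1, round(len(ids) * test_fraction)) if ids else 0
    return sorted(ids[n_test:]), sorted(ids[:n_test])
```

The two returned lists can then be used to decide which utterances go into `train/` and which into `test/`.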

