Comments (3)
The dynamic program, as described in the paper and as implemented in this repo, is not designed for unknown audio in the middle of the text. The alignment algorithm will therefore probably stretch the words of the transcript before and after the unknown part across it. In the end, most of the audio will be aligned correctly, except for the part around the unknown segment. However, such incorrect segments can easily be detected by looking at the confidence score provided by the network.
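As a minimal sketch of that filtering step: the snippet below assumes each aligned segment is available as a `(start_sec, end_sec, confidence)` tuple, where the confidence is a log-space score and higher (closer to 0) means more reliable. The tuple layout, the helper name `split_by_confidence` and the threshold of `-1.5` are assumptions for illustration, not values taken from the paper or the repo; a sensible threshold would have to be tuned on your own data.

```python
from typing import List, Tuple

# Hypothetical segment layout: (start_sec, end_sec, confidence)
Segment = Tuple[float, float, float]


def split_by_confidence(
    segments: List[Segment], threshold: float = -1.5
) -> Tuple[List[Segment], List[Segment]]:
    """Separate segments into (kept, flagged) by their confidence score.

    Segments scoring below `threshold` are flagged as likely misaligned,
    for example because they were stretched across unknown audio.
    """
    kept = [seg for seg in segments if seg[2] >= threshold]
    flagged = [seg for seg in segments if seg[2] < threshold]
    return kept, flagged


# Toy data: the middle segment spans an unknown part and scores poorly.
segments = [(0.0, 4.2, -0.3), (4.2, 19.8, -4.7), (19.8, 23.1, -0.5)]
kept, flagged = split_by_confidence(segments)
```

The flagged segments can then be dropped from the training set or inspected manually.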
Additionally, it should be possible to extend the dynamic program to support unknown segments in the middle of the audio. One could, for example, allow the algorithm to skip parts of the text or the audio if this leads to a higher average probability across the whole alignment in the end. However, this needs to be designed carefully, otherwise the algorithm might just skip the whole audio.
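To make the idea concrete, here is a toy sketch, not the algorithm from the paper or this repo. It is a plain Viterbi alignment over a simplified frame-by-token log-probability matrix, with an extra "gap" state before, between and after tokens that absorbs unknown audio at a fixed cost per frame. The function name `align_with_gaps`, the input layout and the `skip_logp` parameter are all assumptions for illustration. It only covers skipping audio, not text, and it sums scores rather than averaging them. Every token must still be emitted for at least one frame, which is one simple guard against skipping everything.

```python
import math
from typing import List, Optional


def align_with_gaps(
    logp: List[List[float]], skip_logp: float
) -> List[Optional[int]]:
    """Toy Viterbi alignment that may mark frames as skipped.

    logp[t][j] is the log-probability that frame t belongs to token j.
    skip_logp is the fixed log-score a frame receives when it is skipped.
    Returns, for each frame, the token index or None for a skipped frame.
    """
    num_frames, num_tokens = len(logp), len(logp[0])
    # States: even = gap, odd = token (state s emits token s // 2).
    num_states = 2 * num_tokens + 1
    neg_inf = -math.inf

    def emit(t: int, s: int) -> float:
        return logp[t][s // 2] if s % 2 == 1 else skip_logp

    score = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    score[0][0] = emit(0, 0)  # start in the leading gap ...
    score[0][1] = emit(0, 1)  # ... or directly on the first token

    for t in range(1, num_frames):
        for s in range(num_states):
            preds = [s]                 # stay in the same state
            if s >= 1:
                preds.append(s - 1)     # advance by one state
            if s % 2 == 1 and s >= 2:
                preds.append(s - 2)     # token to token, no gap between
            best = max(preds, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == neg_inf:
                continue
            score[t][s] = score[t - 1][best] + emit(t, s)
            back[t][s] = best

    # The path must end on the last token or in the trailing gap.
    s = max((num_states - 1, num_states - 2),
            key=lambda p: score[num_frames - 1][p])
    if score[num_frames - 1][s] == neg_inf:
        raise ValueError("fewer frames than tokens, no valid alignment")

    path: List[Optional[int]] = []
    for t in range(num_frames - 1, -1, -1):
        path.append(s // 2 if s % 2 == 1 else None)
        s = back[t][s]
    path.reverse()
    return path


# Toy input: frames 2 and 3 match neither of the two tokens well.
toy_logp = [
    [-0.1, -5.0], [-0.1, -5.0],
    [-4.0, -4.0], [-4.0, -4.0],
    [-5.0, -0.1], [-5.0, -0.1],
]
toy_path = align_with_gaps(toy_logp, skip_logp=-2.0)
# toy_path == [0, 0, None, None, 1, 1]
```

The choice of `skip_logp` is the delicate part: if skipping is too cheap, almost all frames end up in the gap state, which is exactly the failure mode mentioned above.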
from ctc_segmentation.
Thanks for being interested in our work!
Our main motivation for this tool was to align publicly available data in an utterance-wise fashion, so that we can use it for supervised ASR training.
Usually, data from e.g. librivox.de consists of long audio files (~1h) together with one long transcript without any alignments.
What makes the automatic alignment particularly challenging is that the speaker often introduces himself/herself and the book at the beginning and the end of every audio file.
As these parts are not contained in the transcript, many forced alignment algorithms fail.
In our evaluation we tried to simulate such situations by prepending/appending audio to our test data.
We have therefore not looked into using this technique for completely unlabelled data, and I am not sure how well it would work, but it sounds like a good idea and might be a promising direction for future work.
Thanks, that makes sense and was more or less what I expected. What would be your expectation regarding utterance-internal misalignments in that case? For example, did you look at stitching any of these 'incorrect' segments directly into the middle of the utterances? Do you have any reason to think it would be worse or better than the performance you saw here with appended/prepended segments? Thanks again for sharing this work and the great implementation.