
Comments (3)

cornerfarmer commented on August 16, 2024

The dynamic program, as described in the paper and as implemented in this repo, is not designed for unknown audio in the middle of the text. The alignment algorithm will therefore probably try to stretch the words of the transcript before and after the unknown part across it. In the end, most of the audio will be aligned correctly, except for the part around the unknown segment. However, such incorrect segments are easy to detect by looking at the confidence score provided by the network.
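A minimal sketch of that filtering step, assuming each aligned segment is available as a plain `(start_sec, end_sec, confidence)` tuple where a higher confidence is better. The helper name `split_by_confidence`, the example values, and the threshold of `-1.5` are illustrative assumptions, not values taken from the repo:

```python
from typing import List, Tuple

# Hypothetical segment layout: (start_sec, end_sec, confidence).
Segment = Tuple[float, float, float]


def split_by_confidence(
    segments: List[Segment], threshold: float = -1.5
) -> Tuple[List[Segment], List[Segment]]:
    """Separate segments into (kept, suspect) by their confidence score.

    The threshold is an arbitrary placeholder and would need tuning
    on real alignments.
    """
    kept, suspect = [], []
    for seg in segments:
        if seg[2] >= threshold:
            kept.append(seg)
        else:
            suspect.append(seg)
    return kept, suspect


# Made-up example: the middle segment was stretched over unknown audio.
segments = [(0.0, 2.1, -0.3), (2.1, 9.8, -4.7), (9.8, 12.0, -0.5)]
kept, suspect = split_by_confidence(segments)
```

Segments that end up in `suspect` could then be dropped or inspected manually before using the data for training.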

Additionally, it should be possible to extend the dynamic program to support unknown segments in the middle of the audio. One could, for example, allow the algorithm to skip parts of the text or the audio if this leads to a higher average probability across the whole alignment. However, this needs to be designed carefully; otherwise the algorithm might just skip the whole audio.
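To illustrate the idea and the pitfall, here is a toy sketch. It is not the dynamic program from the paper or the repo: it is a simplified monotonic Viterbi alignment over a made-up matrix of per-frame token log-probabilities, where every frame may instead be marked as skipped at a fixed cost `skip_logp`. It also maximises the summed log-probability rather than an average. The function name, the matrix, and the skip cost are all assumptions for illustration:

```python
import numpy as np


def align_with_skips(logp: np.ndarray, skip_logp: float):
    """Toy monotonic alignment of T frames to N tokens with optional skips.

    logp[t, n] is the log-probability that frame t belongs to token n.
    A frame may be "skipped" at a fixed cost of skip_logp instead.
    Returns (total_score, labels) with labels[t] = token index or -1.
    """
    T, N = logp.shape
    if T < N:
        raise ValueError("need at least as many frames as tokens")
    # Per frame and token: either emit the token or pay the skip cost.
    frame_score = np.maximum(logp, skip_logp)
    score = np.full((T, N), -np.inf)
    advanced = np.zeros((T, N), dtype=bool)  # True if we moved to a new token
    score[0, 0] = frame_score[0, 0]
    for t in range(1, T):
        for n in range(N):
            stay = score[t - 1, n]
            move = score[t - 1, n - 1] if n > 0 else -np.inf
            if move > stay:
                score[t, n] = move + frame_score[t, n]
                advanced[t, n] = True
            else:
                score[t, n] = stay + frame_score[t, n]
    # Backtrack from the last token at the last frame.
    labels = [0] * T
    n = N - 1
    for t in range(T - 1, -1, -1):
        labels[t] = n if logp[t, n] >= skip_logp else -1
        if advanced[t, n]:
            n -= 1
    return float(score[T - 1, N - 1]), labels


# Two tokens, six frames; frames 2 and 3 match neither token (unknown audio).
probs = np.array([[0.9, 0.05], [0.9, 0.05], [0.01, 0.01],
                  [0.01, 0.01], [0.05, 0.9], [0.05, 0.9]])
logp = np.log(probs)

_, labels = align_with_skips(logp, skip_logp=np.log(0.1))
# A skip cost of 0.0 makes skipping free, so every frame gets skipped.
_, labels_free_skip = align_with_skips(logp, skip_logp=0.0)
```

With the moderate skip cost, only the two unknown frames are marked as skipped. With a free skip, the whole audio is skipped, which is the failure mode mentioned above.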

from ctc_segmentation.

cornerfarmer commented on August 16, 2024

Thanks for being interested in our work!

Our main motivation for this tool was to align publicly available data utterance by utterance, so that it can be used for supervised ASR training.
Usually, data from e.g. librivox.de consists of long audio files (~1h) together with one long transcript without any alignments.
What makes automatic alignment particularly challenging is that the speaker often introduces himself/herself and the book at the beginning and the end of every audio file.
As these parts are not contained in the transcript, many forced alignment algorithms fail.
In our evaluation we tried to simulate such situations by prepending/appending audio to our test data.
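A rough sketch of how such a test case could be constructed, assuming all clips are mono waveforms stored as 1-D NumPy arrays at the same sample rate. The helper name and the placeholder arrays are hypothetical and not taken from the evaluation code:

```python
import numpy as np


def add_unrelated_audio(utterance: np.ndarray, intro: np.ndarray, outro: np.ndarray):
    """Surround an utterance with audio that is not covered by the transcript.

    Returns the padded waveform and the sample offset at which the
    original utterance now starts.
    """
    padded = np.concatenate([intro, utterance, outro])
    return padded, len(intro)


# Placeholder waveforms; real data would be loaded from audio files.
intro = np.zeros(3)
utterance = np.ones(5)
outro = np.zeros(2)
padded, offset = add_unrelated_audio(utterance, intro, outro)
```

The returned offset gives the ground-truth start of the transcribed part, against which an alignment can be compared.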

We have therefore not looked into using this technique for completely unlabelled data, and I am not sure how this would work out, but it sounds like a good idea and might be promising future work.


AdolfVonKleist commented on August 16, 2024

Thanks, that makes sense and was more or less what I expected. What would be your expectation regarding utterance-internal misalignments in that case? For example, did you look at stitching any of these 'incorrect' segments directly into the middle of the utterances? Do you have any reason to think it would be worse or better than the performance you saw here when working with appended/prepended segments? Thanks again for sharing this work and the great implementation.

