georgid / alignmentduration

Lyrics-to-audio alignment system. Based on machine learning algorithms: hidden Markov models with Viterbi forced alignment. The alignment is explicitly aware of the durations of musical notes. The phonetic models are classified with an MLP deep neural network.

Home Page: http://mtg.upf.edu/node/3751

License: GNU Affero General Public License v3.0

Python 95.29% Shell 1.04% MATLAB 0.50% C 1.89% TeX 1.28%

python htk lyrics duration decoding deep-learning hidden-markov-model alignment synchronization mfcc

alignmentduration's Introduction

AlignmentDuration

Tool for aligning lyrics to audio automatically using a phonetic recognizer with hidden Markov models. Viterbi decoding with explicit durations of reference syllables can be toggled on with the parameter WITH_DURATIONS.

Built from scratch. Alternatively, the tool can be used as a wrapper around htk (which may be faster) by setting the parameter DECODE_WITH_HTK.

If you are using this work please cite http://mtg.upf.edu/node/3751

NOTE: A version building upon this research is built by Voice Magix. It features

latest deep-learning enabled acoustic model
English language lyrics parser and normalizer
runtime speed optimization
option to run on recordings with diverse types of background instruments
reduced external package dependencies

If interested in using it write to info at voicemagix dot com

Folder Structure

example: example/test sound and annotation files
scripts: helper scripts for running the code (including on an HPC cluster)
src: main source code
- align: main alignment logic
- hmm: hidden Markov model alignment
- for_makam: Makam-specific logic (see music traditions below)
- models_makam: acoustic model for Turkish
- models_jingju: acoustic model for Jingju Mandarin
- for_jingju: jingju-specific logic (see music traditions below)
- onsets: logic for note-onset-aware alignment (ISMIR 2016)
- parse: logic for parsing lyrics files
- smstools: modifications to the https://github.com/MTG/sms-tools
- utilsLyrics: any utility scripts
test: test scripts (could be used in CI)
thrash: code that has to be reviewed and deleted, left for the sake of completeness

LICENSE

AlignmentDuration is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation (FSF), either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/

For more details see COPYING.txt

BUILD INSTRUCTIONS

NOTE: Python 3 is neither supported nor tested

git clone https://github.com/georgid/AlignmentDuration.git; sudo apt-get install python-dev python-setuptools python-numpy

pip install -r requirements; python setup.py install

essentia.
pdnn needed if OBS_MODEL is MLP or MLP_fuzzy

cd ..; git clone https://github.com/yajiemiao/pdnn

Theano also needs to be installed

htk needed if either MFCC_HTK or DECODE_WITH_HTK is set to 1
htkModelParser needed if on Turkish makam and OBS_MODEL is GMM

git clone https://github.com/georgid/htkModelParser.git; cd htkModelParser; sudo pip install -r requirements; python setup.py install

scikit-learn (branch with fixed underflow issues) needed if on Turkish makam and OBS_MODEL is GMM

git clone https://github.com/georgid/scikit-learn; sudo apt-get install python-scipy; python setup.py install

git clone https://github.com/georgid/makam_acapella needed if using MLP_fuzzy model
Evaluation (optional if evaluation of accuracy needed)

cd ..; git clone https://github.com/georgid/AlignmentEvaluation.git
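
The conditional dependencies listed in the build instructions above can be summarized in a small helper. This is only a sketch: the function name and the `tradition` argument are hypothetical, the other argument names mirror the parameters OBS_MODEL, MFCC_HTK and DECODE_WITH_HTK, and treating essentia as unconditional and Theano as tied to pdnn are assumptions read from the order of the list.

```python
def optional_dependencies(obs_model, mfcc_htk=0, decode_with_htk=0, tradition=None):
    """Return the packages the build instructions list for a configuration.

    `tradition` ("makam" or "jingju") is a hypothetical argument; the other
    arguments mirror OBS_MODEL, MFCC_HTK and DECODE_WITH_HTK.
    """
    needed = ["essentia"]  # assumption: listed without a condition
    if obs_model in ("MLP", "MLP_fuzzy"):
        needed += ["pdnn", "Theano"]  # assumption: Theano goes together with pdnn
    if obs_model == "MLP_fuzzy":
        needed.append("makam_acapella")
    if mfcc_htk == 1 or decode_with_htk == 1:
        needed.append("htk")
    if tradition == "makam" and obs_model == "GMM":
        needed += ["htkModelParser", "scikit-learn (georgid fork)"]
    return needed


print(optional_dependencies("GMM", decode_with_htk=1, tradition="makam"))
```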

Citation

Georgi Dzhambazov, Knowledge-based Probabilistic Modeling for Tracking Lyrics in Music Audio Signals, PhD thesis (thesis materials, companion page)

USAGE on different music traditions

jingju (Beijing Opera) : Chinese

python AlignmentDuration/jingju/runWithParamsAll.py 2 0 /JingjuSingingAnnotation-master/lyrics2audio/results/3folds/ 3 0

to test: python AlignmentDuration/test/testLyricsAlign.py

with method testLyricsAlign_mandarin_pop

Turkish Makam music: Turkish

You need to provide the MusicBrainz ID (MBID) of the recording. This requirement could be removed on demand.

call as a method from an aggregator API:

install https://github.com/MTG/pycompmusic; python pycompmusic/compmusic/extractors/makam/lyricsalign.py

or locally:

python src/for_makam/lyricalign_local.py (on the noteOnsets branch, see https://github.com/georgid/AlignmentDuration/blob/noteOnsets/src/for_makam/lyricalign_local.py)

to test: python AlignmentDuration/test/testLyricsAlign.py

with method testLyricsAlignMakam

English

Write to georgi.dzhambazov at upf dot edu or info at voicemagix dot com if you would like to use the English language model. It is not included here for licensing issues.

Evaluation

Use the evalAccuracy script. A score of 100% means perfect alignment. Values above 80% are usually acceptable to human listeners.

The default evaluation level is set at word boundaries
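
As an illustration of what such a percentage can mean, the sketch below computes the share of the annotated duration in which each detected word overlaps its annotated counterpart. This is an assumption about the metric, not the code of the evalAccuracy script; the function name and the (start, end, word) tuple format are hypothetical.

```python
def alignment_accuracy(reference, detected):
    """Percentage of the reference duration that is correctly aligned.

    Both arguments are lists of (start_sec, end_sec, word) tuples holding the
    same word sequence, as is the case for forced alignment.
    """
    if len(reference) != len(detected):
        raise ValueError("reference and detected must contain the same words")
    total = sum(end - start for start, end, _ in reference)
    if total <= 0:
        raise ValueError("reference has no duration")
    correct = 0.0
    for (ref_start, ref_end, _), (det_start, det_end, _) in zip(reference, detected):
        # overlap between the annotated and the detected interval of the same word
        overlap = min(ref_end, det_end) - max(ref_start, det_start)
        correct += max(0.0, overlap)
    return 100.0 * correct / total


reference = [(0.0, 1.0, "bir"), (1.0, 2.0, "ihtimal")]
detected = [(0.0, 0.5, "bir"), (0.5, 2.0, "ihtimal")]
print(alignment_accuracy(reference, detected))
```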

BUILD INSTRUCTIONS ON server kora.s.upf.edu

git clone https://github.com/georgid/AlignmentDuration.git; cd AlignmentDuration; git checkout for_pycompmusic

cd /homedtic/georgid/test2/AlignmentDuration; source /homedtic/georgid/env/bin/activate; python setup.py install

to test: python /homedtic/georgid/test2/AlignmentDuration/test/testLyricsAlign.py

on server: git pull https://github.com/MTG/pycompmusic; run /srv/dunya/env/src/pycompmusic/compmusic/extractors/makam/lyricsalign.py with recording MBID: 727cff89-392f-4d15-926d-63b2697d7f3f

alignmentduration's Issues

deprecate for-segments branch. test on makam. integrate in mir_eval

AlignmentEvaluation has two branches:

https://github.com/georgid/AlignmentEvaluation/tree/for-audio-segments used for makam data, which is annotated per audio segment (sections of 10-20 seconds)
http://compmusic.upf.edu/turkish-makam-acapella-sections-dataset
https://github.com/georgid/AlignmentEvaluation/tree/master for jingju data

REDUCE CODE: check if we can replace some of the code in SymbTrPrser.syllable2LyricsOneSection with code from LyricsParsing.expandlyrics2WordList.

Likewise, LyricsWithModels.LyricsWithModels.printWordsAndStates and LyricsWithModels.LyricsWithModels.printWordsAndStatesAndDurations can reuse code from LyricsParsing.expandlyrics2WordList.

merge SectionLinkMakam.loadSmallAudioFragment and SectionLinkMakam.loadSmallAudioFragmentOracle

there is kimseye-specific code in
align.LyricsAligner.LyricsAligner.alignLyricsSection

remove it.

show lyrics player on recordings with only singing voice

Disable the link to the lyric player for all recordings that are not in the list of recording MBIDs with singing voice present.

- put reading pitch as input in lyricsalign not in align.FeatureExtractor.loadMFCCs

missing words from dictionary


move method to LyricsWithModels URGENT!

Decoder.Decoder.duration2numFrameDuration

have the same special token for silence in all phoneme sets. Now it is REST for Mandarin and '' for English

This is the SIL_TOKEN item
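
One way to get a single silence token is to map the per-language symbols onto one constant when the phoneme sequence is built. The sketch below is hypothetical: the value "sil", the dictionary and the function name are assumptions; only REST (Mandarin) and '' (English) come from the note above.

```python
SIL_TOKEN = "sil"  # hypothetical value for the shared silence token

# silence symbols currently used by each phoneme set, as noted above
LEGACY_SILENCE_TOKENS = {"mandarin": "REST", "english": ""}


def unify_silence(phonemes, language):
    """Replace the language-specific silence symbol with SIL_TOKEN."""
    legacy = LEGACY_SILENCE_TOKENS[language]
    return [SIL_TOKEN if phoneme == legacy else phoneme for phoneme in phonemes]


print(unify_silence(["REST", "n", "i"], "mandarin"))
```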

replace logic for taking intervals of onsets with mir_eval.adjust_intervals

onsets.OnsetDetector.OnsetDetector.parseNoteOnsetsGrTruth
and
replace also parse.TextGrid_Parsing._findBeginEndIndices

optimize: compute the observation probability of a given feature vector once for each phoneme in the alphabet, not for each phoneme in phonemeNetwork
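
A minimal sketch of this optimization, assuming a hypothetical scoring callable `score_fn(phoneme, feature_vector)`: each distinct phoneme is scored once, and the network positions only look up the cached value. The function and argument names are illustrative and not taken from the code base.

```python
def observation_probs(feature_vector, alphabet, phoneme_network, score_fn):
    """Score each phoneme of the alphabet once, then index by network position."""
    # one evaluation per distinct phoneme in the alphabet
    prob_by_phoneme = {phoneme: score_fn(phoneme, feature_vector) for phoneme in alphabet}
    # repeated phonemes in the network reuse the cached value
    return [prob_by_phoneme[phoneme] for phoneme in phoneme_network]


calls = []

def fake_score(phoneme, feature_vector):
    # stand-in for the acoustic model; records how often it is called
    calls.append(phoneme)
    return len(phoneme)

result = observation_probs([0.1, 0.2], ["a", "sil"], ["sil", "a", "a", "sil"], fake_score)
print(result, len(calls))
```

The saving grows with the number of repeated phonemes in the network, since the cost per frame then depends on the alphabet size instead of the network length.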

make sure finalTs of referenceScore duration < actual duration of recording

EXAMPLE : /Users/joro/Documents/Phd/UPF//ISTANBUL//barbaros/02_Gel_9_nakarat2.scoreDeviation

this is a problem for the accuracy evaluation.
For now there is a WORKAROUND in AccuracyEvaluator.calcCorrect:
currEndDetected = finallTsAnno
logging.warn("currEndDetected > finallTsAnno")

REASON: Munir nurretim shortens some notes?

don't trim the file,

so eliminate the RecordingSegmenter step,
do alignment for each section in a loop:

recompute timestamps from the whole-recording mfc to the given section (input: section.json)
align
recompute timestamps for the whole recording
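
The two timestamp conversions in this loop can be sketched as follows. The function names, the frame rate argument and the (start, end, token) tuple format are assumptions made for illustration.

```python
def section_frame_range(section_start_sec, section_end_sec, frames_per_sec):
    """Frame indices of a section inside the features of the whole recording."""
    first = int(round(section_start_sec * frames_per_sec))
    last = int(round(section_end_sec * frames_per_sec))
    return first, last


def to_recording_time(section_tokens, section_start_sec):
    """Shift section-relative (start, end, token) tuples to recording time."""
    return [(start + section_start_sec, end + section_start_sec, token)
            for start, end, token in section_tokens]


first, last = section_frame_range(10.0, 20.0, 100)
aligned = to_recording_time([(0.5, 1.0, "gel")], 10.0)
print(first, last, aligned)
```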

clean code in function doitOneChunk.decodeAudioChunk()

TODO: could be done more easily with this code, and check the last method in Word

grTruthWordList = testT(decoder.lyricsWithModels)

the loops of some methods from LyricsWithModels use same code.

e.g. printWordsAndStatesAndDurations and phoneteme2StatesNetwork.
Put this repeating logic as a helper method in LyricParsing

with_section_annotations = 0, sectionLinks do not work

the sectionLink object has no section object assigned.
See in method
makam.MakamRecording.MakamRecording._loadsectionTimeStampsLinks
have a look for an example at:
makam.MakamRecording.MakamRecording._loadsectionTimeStampsAnno()

This fails in align.LyricsAligner.LyricsAligner.alignRecording:
if not hasattr(currSectionLink, 'section') or currSectionLink.section == None:

redundant Lyric() constructor call

this is dangerous

try to remove

all loggers to be the same: e.g. the one of Decoder

Move this file to dir for_english

/Users/joro/workspace/AlignmentDuration/src/for_makam/state_str2int_METU

detectedTokenList with flag DETECTION_TOKEN_LEVEL = 'words' holds Word objects and not strings, so json.dump() fails in LyricsAligner
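
A possible fix is to pass a `default` function to the json module so that Word objects are written as their text. In this sketch the Word class is a hypothetical stand-in with a `text` attribute, and the [start, end, Word] layout of the token list is an assumption.

```python
import json


class Word(object):
    """Hypothetical stand-in for the project's Word class."""

    def __init__(self, text):
        self.text = text


def encode_token(obj):
    """Called by json only for objects it cannot serialize itself."""
    if isinstance(obj, Word):
        return obj.text
    raise TypeError("cannot serialize object of type %s" % type(obj).__name__)


detected_token_list = [[0.0, 0.42, Word("bir")], [0.42, 1.1, Word("ihtimal")]]
serialized = json.dumps(detected_token_list, default=encode_token)
print(serialized)
```

The same `default=encode_token` argument works with json.dump() when writing to a file.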

Reduce code by refining LyricsWithModels

LyricsWithModels is not needed for the neural network, so the base class is used for the DNN; add a padded-silence method. As a result:

LyricsWithModelsBase is used for DNN. LyricsWithModelsCNN._linkTomodels and
LyricsWithModelsBase._linkTomodels do not make sense.

maybe remove LyricsWithModels in general.

then reduce the if statement in SectionLink.loadSmallAudioFragment()

reduce dependency on htk and scikit learn

make sure MFCC extraction with essentia matches the damp model:

add pre-emphasis (or recreate the model without pre-emphasis)
add cepstral mean normalization
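
For reference, the two steps named above can be sketched in plain Python. The coefficient 0.97 is a commonly used pre-emphasis value and an assumption here, and leaving the first sample unchanged is one of several conventions; neither is confirmed to match the damp model.

```python
def preemphasize(samples, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is kept as is."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - coeff * samples[n - 1])
    return out


def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames from every frame."""
    if not frames:
        return []
    num_frames = float(len(frames))
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


print(preemphasize([1.0, 1.0, 1.0], coeff=0.5))
print(cepstral_mean_normalize([[1.0, 2.0], [3.0, 4.0]]))
```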

don't use scikit-learn at all; keep the LyricsWIthModelsGMM class for Chinese.

test on Jingju

refine and commit LyricAligner

-LyricsAligner. withAnnotations and withLinks have two similar loops, think how to put it in one loop.

test and update pycompmusic.lyricsAlign

improve organization of MusicXML Parser: use ScoreSection class in MusicXMLParser

It would be nice if the field sectionLyrics in the MusicXML Parser were of type ScoreSection.
Further, MakamScore should have a field score in the same way symbTrParser has a field score of type MakamScore.

remove underscores from lyrics. e.g. bir ihtimal

test WITH_SECTION_ANNO = False with new MLP model

https://github.com/georgid/AlignmentDuration/blob/noteOnsets/src/align/LyricsAligner.py#L142

test from for_makam/lyrics_align WITH_DURATIONS = True
first

get rid of loading htk models, and get rid of class LyricsWithModelsHTK (see issue #40)
select recordings with no second verse
let andres run on dunya
optionally divide into good and bad quality

give section link number as arg instead of TextGridTs and so fromTextGrid

in lines starting at
if ParametersAlgo.WITH_ORACLE_PHONEMES: # oracle phonemes

https://github.com/georgid/AlignmentDuration/blob/noteOnsets/align/LyricsAligner.py#L253

remove redundant code that is already in AlignmentStep

make SymbTrParser and maybe MakamScore inherit from AlignmentStep.*

when withPaddedSilence, consider creating dummy Word with one phoneme sp

in align.LyricsWithModels.LyricsWithModels._linkToModels

This way there is no need to handle the phoneme sp specially. It is currently inserted as a Phoneme in the phoneme network in LyricsWithModels but does not appear in the list of words in Lyrics

means, covars, weights

means, covars, weights, should not be in the constructor of _ContinuousHMM, because they are observation probabilities.

more precise duration distr. function.

No round() at
Decoder.Decoder.duration2numFrameDuration
but instead put this as mean value and round() duration times

Add Mandarin Char to mandarinSyllable class

Store the character, not only the pinyin, in the mandarinSyllable class. See the init of the mandarinSyllable class. Split the Chinese characters in this method and pass them as an argument.
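
A rough sketch of the idea, with a hypothetical class and helper (the constructor of the real mandarinSyllable class may look different): each syllable keeps both its pinyin and the character it came from.

```python
# -*- coding: utf-8 -*-


class MandarinSyllable(object):
    """Hypothetical syllable that stores the character next to its pinyin."""

    def __init__(self, pinyin, char):
        self.pinyin = pinyin
        self.char = char


def syllables_from_line(chars, pinyin_syllables):
    """Pair each character of a lyrics line with its pinyin syllable."""
    chars = list(chars)  # split the line into single characters
    if len(chars) != len(pinyin_syllables):
        raise ValueError("expected one pinyin syllable per character")
    return [MandarinSyllable(pinyin, char)
            for char, pinyin in zip(chars, pinyin_syllables)]


syllables = syllables_from_line(u"你好", ["ni", "hao"])
print([(s.char, s.pinyin) for s in syllables])
```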

saving the phrasesAligned is not OK for HTK-based system

in function detectedAlignedfileName = mlf2TabFormat(detectedWordList, URIrecordingNoExt, tokenLevelAlignedSuffix)

concatenate textGrid data.

Concatenate TextGrid annotation files for the segmented files
into one-per-recording TextGrid annotation automatically:

install TextGridTools version 1.4.1 (either through GitHub or pip install --upgrade tgt). If you use pip, please make sure that it is really version 1.4.1 that is installed — if you get an older version, try again.
QUESTION:
Each big audio file to which the concatenated TextGrid corresponds starts with n seconds of silence, so I need to insert n seconds of silence at the beginning and then concatenate the TextGrids so that the first one starts at timestamp = n.

To do this I tried this code:

import os
import tgt
from tgt.util import shift_boundaries

# pathInput and pathOut are assumed to be defined earlier
shiftTime = 51.354230

tiers_ = []
os.chdir(pathInput)
tgtURI = '/Users/joro/Documents/Phd/UPF/ISTANBUL/goekhan/02_Kimseye_2_zemin.TextGrid'

tg = tgt.read_textgrid(tgtURI)

tier = tg.get_tier_by_name('words')
tierShifted = shift_boundaries(tier, shiftTime,0)

# a single tier is passed to add_tiers here, which leads to the error below
tg.add_tiers(tierShifted)

tgOutURI = pathOut + 'Kimseye.TextGrig'
tgt.write_to_file(tg, tgOutURI)

However I get this error:

in ()
21
22 tgOutURI = pathOut + 'Kimseye.TextGrig'
---> 23 tgt.write_to_file(tg, tgOutURI)

/usr/local/lib/python2.7/site-packages/tgt/io.pyc in write_to_file(textgrid, filename, format, encoding, **kwargs)
390 with codecs.open(filename, 'w', encoding) as f:
391 if format in _EXPORT_FORMATS:
--> 392 f.write(_EXPORT_FORMATS[format](textgrid, **kwargs))
393 else:
394 raise Exception('Unknown output format: {0}'.format(format))

/usr/local/lib/python2.7/site-packages/tgt/io.pyc in export_to_short_textgrid(textgrid)
241 textgrid_corrected = correct_start_end_times_and_fill_gaps(textgrid)
242 for tier in textgrid_corrected:
--> 243 result += ['"' + tier.tier_type() + '"',
244 '"' + escape_text(tier.name) + '"',
245 tier.start_time, tier.end_time, len(tier)]

AttributeError: 'Interval' object has no attribute 'tier_type'

RESPONSE:

The problem you encountered is easily solved. To add the shifted tier, you did the following:

tg.add_tiers(tierShifted)

This method, however, expects a list of tiers, not a single tier. You have to use one of the following instead:

tg.add_tier(tierShifted)

tg.add_tiers([tierShifted])
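
The concatenation itself, independent of the tgt library, amounts to shifting every interval by a running offset. The sketch below uses plain (start, end, text) tuples; the function name and the input format are hypothetical, and it assumes the segments follow each other without gaps after the leading silence.

```python
def concatenate_segments(segments, leading_silence_sec, silence_label=""):
    """Merge per-segment intervals into one per-recording interval list.

    `segments` is a list of (segment_duration_sec, intervals) pairs, where the
    intervals are (start, end, text) tuples relative to the segment start.
    """
    merged = []
    if leading_silence_sec > 0:
        merged.append((0.0, leading_silence_sec, silence_label))
    offset = leading_silence_sec
    for duration, intervals in segments:
        for start, end, text in intervals:
            merged.append((start + offset, end + offset, text))
        offset += duration  # the next segment starts where this one ends
    return merged


segments = [
    (2.0, [(0.0, 1.0, "a"), (1.0, 2.0, "b")]),
    (1.5, [(0.0, 1.5, "c")]),
]
merged = concatenate_segments(segments, 10.0)
print(merged)
```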

remove hard-coded logic for discriminating btw two duration distributions

when WITH_SHORT_PAUSES = 1

we get the error: last state for word SAZ is not sp. Sorry - not implemented.

The problem is that I removed sp from SAZ, so that it is now sil instead of sil sp.

Merge the two Phonetizer() - delete the one from AlignmentStep

in line 117 of AlignmentStep.Aligner() there is a different number of arguments. The problem is triggered by outputHTKPhoneAlignedURI = Aligner.alignOnechunk(MODEL_URI, URIrecordingWav, lyrics, URIrecordingAnno, '/tmp/', withSynthesis)
in AlignmentDuration.alignOneChunk()

Duration result > maxDuration

happens when t < MAX_DUR.
probable reason:
phi Star stays more states than currMaxDur.

in LyricsWithModels assign a small deviation to phonemes and the one given as a parameter to vowels

_phonemes2stateNetwork(self):

for MTG/HMM

Check if Path makes sense. Put just the backtracking logic in Path

reduce dependency on htkmfc

make sure MFCC extraction with essentia matches the damp model:

add pre-emphasis (or recreate the model without pre-emphasis)
add cepstral mean normalization

make consistent. phoneme2states

for getRefDurations : expandlyrics2WordList
for decoded done with other code.
unify these two.

solve together with HARD CODED bug

integrate resynthesis from sms-tools to AlignmentDuration

check steps from TODO

repetition of _constructHMMNetworkParameters in DurationHMM and HMM

_constructHMMNetworkParameters is in 3 different places. Leave it only in ContinousHMM or Decoder.

adapt the viterbi search to lyricsWithModels

instead of adapting the lyricsWithModels class to guyz's implementation,
adapt the viterbi search to lyricsWithModels

Methods to change: parse lyricsWithModels and construct params for guyZ
=Decoder.Decoder._constructHMMNetworkParameters

=Decoder.Decoder.path2ResultWordList - uses stateNetwork indices

merge _constructTimeStampsForToken and _constructTimeStampsForTokenDetected

in file LyricsParsing

optimize code for expansion to syllables

The code for expansion of silence word could be reduced here

Doesn't install if cython not available

If cython is not available, this package fails setup.py install with this error:

    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/mnt/compmusic/itri/jenkins/jobs/Dunya/workspace/env/src/alignment-duration/setup.py", line 62, in <module>
        cmdclass = {'build_ext': build_ext},
    NameError: name 'build_ext' is not defined

This is preventing dunya from building and running tests, and so we have currently removed it as a dependency
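
The traceback suggests that build_ext is only defined when the Cython import succeeds. A common guard, sketched here under that assumption, tries Cython first and falls back to the setuptools build_ext. The helper name is hypothetical and the actual setup.py may need a different fallback, for example building from pre-generated C sources.

```python
import importlib


def resolve_build_ext(candidates=("Cython.Distutils", "setuptools.command.build_ext")):
    """Return the first importable build_ext class, or None if none is found."""
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # this provider is not installed, try the next one
        build_ext = getattr(module, "build_ext", None)
        if build_ext is not None:
            return build_ext
    return None


build_ext = resolve_build_ext()
# pass cmdclass to setup(); it stays empty if no build_ext is available
cmdclass = {"build_ext": build_ext} if build_ext is not None else {}
print(cmdclass)
```

With this guard, `cmdclass = {'build_ext': build_ext}` no longer raises a NameError on machines without Cython.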