google / language-resources

Datasets and tools for basic natural language processing.

License: Apache License 2.0

Python 26.59% C++ 40.34% C 0.63% Shell 4.63% Makefile 0.18% Java 8.68% Ruby 0.01% Dockerfile 0.79% Starlark 18.16%

language-resources's Issues

Add space to Tamil-to-IPA rules

FYI, the Tamil-to-IPA rules are now in Unicode CLDR, with one small modification: http://unicode.org/cldr/trac/changeset/13170

Interestingly, changing this rule didn’t affect the unit tests.

SLR72 is about Colombian Spanish not "Columbian"

"Columbia" and "Columbian" refer to locations in North America. The country is Colombia. The cited paper gets the spelling right, but this repo and the SLR website have misspelled it.

Using Ossian and Merlin, can't synthesise voice

Error in creating labels from festival utts

I manually placed my utts created in festival into labels/cmu_us_my_voice/festival/utts.

When I try to make the labels, a SIOD error appears and the tmp file is not created. The following error is printed under each file name:

SIOD ERROR: unbound variable : eof
gawk: fatal: cannot open file ./full_context_labels/labels//tmp' for reading (No such file or directory)
gawk: fatal: cannot open file ./full_context_labels/labels//tmp' for reading (No such file or directory)

Can someone please help me with this issue?

Error while trying to build prompts for Tamil language

In my lexicon file, words are in the following format:
("அ" nil (((a) 1)))
I am trying to build Tamil voices using Festival.
When I build prompts with "./bin/do_build $PARALLEL build_prompts", I get the error "ta_1566 PROMPTS
LEXICON: Word ஆண்டாள் (plus features) not found in lexicon" for every sentence in the txt.done.data file.
All of those words are in my lexicon.scm file, though. How can I get past this error?

How to convert FestVox voices to Flite?

Hello. I want to convert FestVox voices to Flite (for the Thai language, #25). Can you recommend how to do this?

Why 'markup' pattern in .tsv files

I see many .tsv files containing text like the following, for example in Sundanese:
CARDINAL_MARKUP cardinal|integer:-1| mineus hiji

The .grm files also have corresponding grammar definitions to parse these patterns in the .tsv files.
Where are patterns like "cardinal|integer:-1|" useful in real-world text normalisation?
Why don't the .grm files just export rules that parse real-world input such as "-1" instead of markup?
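To make the shape of the markup in the question concrete, here is a minimal Python sketch that splits a string like "cardinal|integer:-1|" into a class name and its fields. The helper name `parse_markup` is hypothetical, and the reading of the format (a class name followed by `name:value` fields, each terminated by `|`) is an assumption inferred only from the single example above, not taken from the repository's grammars.

```python
from typing import Dict, Tuple


def parse_markup(markup: str) -> Tuple[str, Dict[str, str]]:
    """Split a markup string into (class_name, fields).

    Assumed layout (inferred from one example): "<class>|<name>:<value>|...".
    """
    # Drop the empty piece produced by the trailing "|".
    parts = [part for part in markup.split("|") if part]
    class_name = parts[0]
    fields = {}
    for part in parts[1:]:
        # Split on the first ":" only, so values such as "-1" stay intact.
        name, _, value = part.partition(":")
        fields[name] = value
    return class_name, fields


class_name, fields = parse_markup("cardinal|integer:-1|")
print(class_name, fields)
```

Read this way, the markup carries an already classified token (a cardinal whose integer value is -1), which is a different input from the raw string "-1".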

Detection of font encoding

Hi,

Thanks for this library. I am wondering whether there is a function to check if the input is Zawgyi- or Unicode-encoded.

Need English text normalization resources

Really nice work and a strong baseline for text normalization!

I was looking for a tool to do English text normalization and found Sparrowhawk, which should solve my problem. However, the documentation in the Sparrowhawk repo only includes the en_toy example.

Could you please provide better English grm resources for Sparrowhawk text normalization?

Thanks a lot !

unicode.fst model

Please let me know how I can build the unicode.fst model. Many thanks.

How to build a phoneset for a language using phonology.json

I'm trying to build a phoneset for the Chinese language, but I'm stuck on how to use apply_phonology.py. I tried running it and ran into a problem with this regex:

language-resources/festival_utils/apply_phonology.py

Line 39 in 9a7c2c6

INST_LANG_VOX = re.compile(r'.*/([^_]+_[^_]+)_([^_]+)_phoneset.scm$')

Here is what I get when I try to build the necessary files:

$ python apply_phonology.py cantonese_phonology.json cantonese/data/
Traceback (most recent call last):
  File "apply_phonology.py", line 774, in <module>
    main(sys.argv)
  File "apply_phonology.py", line 739, in main
    assert len(phoneset_paths) == 1
AssertionError
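As a rough aid for debugging, the Python sketch below shows which file names the quoted `INST_LANG_VOX` pattern accepts. The regex is copied from the issue; the two paths and the `describe` helper are hypothetical. How `phoneset_paths` is collected is not shown in the issue, so the link between this pattern and the failing assertion is an assumption.

```python
import re

# Pattern copied verbatim from the issue (apply_phonology.py, line 39 at 9a7c2c6).
INST_LANG_VOX = re.compile(r'.*/([^_]+_[^_]+)_([^_]+)_phoneset.scm$')


def describe(path):
    """Return the two captured groups if the path matches, else None."""
    match = INST_LANG_VOX.match(path)
    return match.groups() if match else None


# Hypothetical paths, for illustration only.
matching = "cantonese/data/festvox/goog_yue_unison_phoneset.scm"
non_matching = "cantonese/data/phoneset.scm"

print(describe(matching))      # two groups: "goog_yue" and "unison"
print(describe(non_matching))  # None, since the name lacks the three "_"-separated parts
```

If the assertion does depend on this pattern, the data directory would need to contain exactly one file named in the `<a>_<b>_<c>_phoneset.scm` form.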

How to build phonology.json (consonant, vowel, tone mark) with IPA?

Can you give me some advice?

I can't train the Thai language.

I am building a Thai text-to-speech voice and ran into an issue when running: $ ./festival_utils/build_festvox_voice.sh ${WAV_FOLDER} th ${OUTPUT_VOICE_FOLDER} > log.txt

Log : https://gist.github.com/wannaphongcom/aa60b952f6913364b3f55f823ec297a8

How did you generate 'universal_depot.far' file?

I just wanted to understand how the 'universal_depot.far' file was created.
Where are the .grm files for all the FSTs in the 'universal_depot.far' file?

Thank you,
Surendra

Error in Training a Bangla Voice

Hi,
I was trying to build a Bangla Festival voice according to this guideline. But I got an error in the training phase. The error messages are quite long, but the first one was this:
xargs: <path to festival>/build/festvox/src/ehmm/bin/FeatureExtraction: No such file or directory

Can anyone point out the reasons for this error and/or ways to solve it?

Thanks in advance.

Duration of speech recordings in datasets?

Are the total durations of the speech recordings in these datasets available anywhere? I'd love to know that without downloading them and figuring it out for myself, if possible.

Speaker-level annotation in the speech datasets

In the crowdsourced high-quality multi-speaker speech data sets, I see there are line_index_{gender}.tsv files.

In the tsv files, I can see the following data:

guf_03209_00443170675	જગતમાં કોઈ જીવ ન્નસ જંગમ છે અને કોઈ સ્થાવર.
guf_01414_00718082800	ગામમાં ભગવાન સ્વામિનારાયણનું મંદિર છે

Here, what are 03209 and 00443170675 in the first sample? Are they the speaker ID and the utterance ID?
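For anyone who wants to inspect these identifiers, here is a small Python sketch that splits one line of such a file into its parts. The helper name `split_line` and the neutral keys `field_2` and `field_3` are hypothetical. Whether the middle field is a speaker ID is exactly the open question above, so the code does not assume it; it only assumes a tab between the ID and the text and two underscores inside the ID.

```python
def split_line(line):
    """Split '<id>\\t<text>' and break the id on its underscores."""
    utt_id, _, text = line.rstrip("\n").partition("\t")
    # Assumes exactly three "_"-separated parts, as in the samples above.
    prefix, field_2, field_3 = utt_id.split("_")
    return {"prefix": prefix, "field_2": field_2, "field_3": field_3, "text": text}


sample = "guf_01414_00718082800\tગામમાં ભગવાન સ્વામિનારાયણનું મંદિર છે"
row = split_line(sample)
print(row["prefix"], row["field_2"], row["field_3"])
```

Grouping rows by `field_2` and counting them would be one way to test the speaker-ID hypothesis against the data.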

Merlin integration with festival for Bengali language.

@pasindud, how did you train the TTS voice without Merlin, as described in (https://github.com/googlei18n/language-resources/blob/master/bn/festvox/README.md)?

I don't understand how to integrate Merlin with Festival. I have also looked at (https://github.com/googlei18n/language-resources/blob/master/bn/merlin/README.md).

Thanks in advance.
