
language-resources's Introduction

Language Resources and Tools

Datasets and scripts for basic natural language and speech processing.

This is not an official Google product.

Natural Languages

Directory Language Available
af Afrikaans
bn Bengali / Bangla
hi_ur Hindi & Urdu
is Icelandic
jv Javanese
km Khmer
lo Lao
my Burmese / Myanmar
ne Nepali
si Sinhala
su Sundanese
xh Xhosa
zu Zulu

Tools

We include a few tools for working with the natural language datasets. The tools are written in C++ and Python and are built with Bazel. To compile and use them, install a recent version of Bazel (release 0.4.5 or later is required).

Opensourced Audio Data

Resource Link
Sinhala TTS recordings (~3K) https://www.openslr.org/30/
TTS recordings for four South African languages (af, st, tn, xh) https://www.openslr.org/32/
Large Javanese ASR training data set (~185K) https://www.openslr.org/35/
Large Sundanese ASR training data set (~220K) https://www.openslr.org/36/
High quality TTS data for Bengali languages https://www.openslr.org/37/
High quality TTS data for Javanese https://www.openslr.org/41/
High quality TTS data for Khmer https://www.openslr.org/42/
High quality TTS data for Nepali https://www.openslr.org/43/
High quality TTS data for Sundanese https://www.openslr.org/44/
Large Sinhala ASR training data set https://www.openslr.org/52/
Large Bengali ASR training data set https://www.openslr.org/53/
Large Nepali ASR training data set https://www.openslr.org/54/
Crowdsourced high-quality Argentinian Spanish speech data set https://www.openslr.org/61/
Crowdsourced high-quality Malayalam multi-speaker speech data set https://www.openslr.org/63/
Crowdsourced high-quality Marathi multi-speaker speech data set https://www.openslr.org/64/
Crowdsourced high-quality Tamil multi-speaker speech data set https://www.openslr.org/65/
Crowdsourced high-quality Telugu multi-speaker speech data set https://www.openslr.org/66/
Data set which contains recordings of Catalan https://www.openslr.org/69
Crowdsourced high-quality Nigerian English speech data set https://www.openslr.org/70
Crowdsourced high-quality Chilean Spanish speech data set https://www.openslr.org/71
Crowdsourced high-quality Colombian Spanish speech data set https://www.openslr.org/72
Crowdsourced high-quality Peruvian Spanish speech data set https://www.openslr.org/73
Crowdsourced high-quality Puerto Rico Spanish speech data set https://www.openslr.org/74
Crowdsourced high-quality Venezuelan Spanish speech data set https://www.openslr.org/75
Crowdsourced high-quality Basque speech data set https://www.openslr.org/76
Crowdsourced high-quality Galician speech data set https://www.openslr.org/77
Crowdsourced high-quality Gujarati multi-speaker speech data set https://www.openslr.org/78
Crowdsourced high-quality Kannada multi-speaker speech data set https://www.openslr.org/79
Crowdsourced high-quality Burmese speech data set https://www.openslr.org/80
Data set which contains male and female recordings of English from various dialects of the UK and Ireland. https://www.openslr.org/83
Crowdsourced high-quality Yoruba speech data set https://www.openslr.org/86

Other reading resources

SLTU 2016 Tutorial - https://sites.google.com/site/sltututorial/overview

Publications

License

Unless otherwise noted, all original files are licensed under an Apache License, Version 2.0.

Where specifically noted, some datasets are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

The directory third_party/ contains third-party works, which we are including under the respective licenses of the upstream projects. See third_party/README.md for further details.

language-resources's People

Contributors

agutkin, brawer, cibu, iamchathu, jimregan, mjansche, oddurk, pasindud, rajan-sust, rwsproat, twattanavekin, wannaphong

language-resources's Issues

Duration of speech recordings in datasets?

Are the total durations of the speech recordings in these datasets available anywhere? I'd love to know that without downloading them and figuring it out for myself, if possible.
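Until published totals are available, the duration can be measured from a local copy of a dataset. The following is a minimal sketch, assuming the recordings are uncompressed WAV files unpacked under a single directory; the function names and the directory layout are hypothetical and not part of this repository.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of one WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()


def total_duration_seconds(directory):
    """Sum the durations of all .wav files found under `directory`."""
    return sum(
        wav_duration_seconds(p) for p in sorted(Path(directory).rglob("*.wav"))
    )
```

Calling `total_duration_seconds("path/to/unpacked/dataset") / 3600` would give the total in hours. This still requires downloading the data, so it does not replace published figures.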

Error in Training a Bangla Voice

Hi,
I was trying to build a Bangla Festival voice according to this guideline. But I got an error in the training phase. The error messages are quite long, but the first one was this:
xargs: <path to festival>/build/festvox/src/ehmm/bin/FeatureExtraction: No such file or directory

Can anyone point out the reasons for this error and/or ways to solve it?

Thanks in advance.

unicode.fst model

Please let me know how I can build the unicode.fst model. Many thanks.

Detection of font encoding

Hi,

Thanks for this library. I am wondering if there is a function to check whether the input uses Zawgyi or Unicode encoding.

Why 'markup' pattern in .tsv files

I see many .tsv files containing text like the following, for example in Sundanese:
CARDINAL_MARKUP cardinal|integer:-1| mineus hiji

The .grm files also have corresponding grammar definitions to parse these patterns in the .tsv files.
Where are patterns such as "cardinal|integer:-1|" useful in real-world text normalisation?
Why don't the .grm files just export rules to parse real-world examples like "-1" instead of markup?

speaker level annotation in the speech datasets

In the crowdsourced high-quality multi-speaker speech data sets, I see there are line_index_{gender}.tsv files.

In the tsv files, I can see the following data:

guf_03209_00443170675	જગતમાં કોઈ જીવ ન્નસ જંગમ છે અને કોઈ સ્થાવર.
guf_01414_00718082800	ગામમાં ભગવાન સ્વામિનારાયણનું મંદિર છે

Here, what are 03209 and 00443170675 for the first sample? Are they speaker and utterance id?
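For exploring the question, the identifiers can be split on underscores and the utterances grouped by the middle field. This is a minimal sketch, assuming each line holds an identifier and a transcript separated by a tab, as in the sample above. The function names are hypothetical, and the sketch does not confirm what the fields mean; it only makes the structure easier to inspect.

```python
from collections import Counter


def split_utterance_id(utterance_id):
    """Split an ID such as 'guf_03209_00443170675' into its three fields."""
    prefix, middle, last = utterance_id.split("_")
    return prefix, middle, last


def parse_line_index(lines):
    """Yield ((prefix, middle, last), transcript) for each non-empty line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        utterance_id, transcript = line.split("\t", 1)
        yield split_utterance_id(utterance_id), transcript


def count_by_middle_field(lines):
    """Count utterances per middle field (possibly a speaker ID; unconfirmed)."""
    return Counter(fields[1] for fields, _ in parse_line_index(lines))
```

If the middle field repeats across many lines while the last field is unique, that would be consistent with a speaker ID followed by an utterance ID.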

Error in creating labels from festival utts

I manually placed my utts created in festival into labels/cmu_us_my_voice/festival/utts.

When I try to make the labels, a SIOD error occurs and the tmp file is not created. The following error appears under each file name:

SIOD ERROR: unbound variable : eof
gawk: fatal: cannot open file ./full_context_labels/labels//tmp' for reading (No such file or directory)
gawk: fatal: cannot open file ./full_context_labels/labels//tmp' for reading (No such file or directory)

Can someone please help me with this issue?

How to build a phoneset for a language using phonology.json

I'm trying to build a phoneset for Chinese, but I'm stuck on how to use apply_phonology.py. I tried running it and ran into issues with this regex:

INST_LANG_VOX = re.compile(r'.*/([^_]+_[^_]+)_([^_]+)_phoneset.scm$')

Here is what I get when I try to build the necessary files.

$ python apply_phonology.py cantonese_phonology.json cantonese/data/
Traceback (most recent call last):
  File "apply_phonology.py", line 774, in <module>
    main(sys.argv)
  File "apply_phonology.py", line 739, in main
    assert len(phoneset_paths) == 1
AssertionError
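The assertion states that exactly one phoneset path is expected. How apply_phonology.py collects `phoneset_paths` is not shown here, so the following is only a sketch of which file names the quoted pattern accepts. The example paths are hypothetical.

```python
import re

# Pattern quoted in the issue above.
INST_LANG_VOX = re.compile(r'.*/([^_]+_[^_]+)_([^_]+)_phoneset.scm$')


def matching_phonesets(paths):
    """Return the paths whose file name fits the INST_LANG_VOX pattern."""
    return [p for p in paths if INST_LANG_VOX.match(p)]


if __name__ == "__main__":
    # Hypothetical candidates, for illustration only.
    candidates = [
        "cantonese/data/festvox/goog_yue_unison_phoneset.scm",
        "cantonese/data/yue_phoneset.scm",
        "cantonese/data/phoneset.scm",
    ]
    for path in matching_phonesets(candidates):
        print(path, INST_LANG_VOX.match(path).groups())
```

The pattern needs a directory separator followed by a name of the form `<a>_<b>_<c>_phoneset.scm`. If the assertion is checked against files matched this way, it would fail when the data directory contains no such file, or more than one.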

Error while trying to build prompts for Tamil language

In my lexicon file, words are in the following format:
("அ" nil (((a) 1)))
I'm trying to build Tamil voices using Festival.
But when I try to build prompts using ./bin/do_build $PARALLEL build_prompts, it gives errors like the following for all the sentences in the txt.done.data file:
ta_1566 PROMPTS
LEXICON: Word ஆண்டாள் (plus features) not found in lexicon
But all those words are in my lexicon.scm file. How can I get past this error?

Need en text normalization resources

Really nice work and strong baseline of text normalization!

I am looking for a tool to do English text normalization and found Sparrowhawk, which could solve my problem. But the documentation in the Sparrowhawk repo only includes an en_toy example.

Could you please provide better English .grm resources for Sparrowhawk text normalization?

Thanks a lot!
