cuny-cl / wikipron Goto Github PK

View Code? Open in Web Editor NEW

308.0 18.0 69.0 171.88 MB

Massively multilingual pronunciation mining

License: Apache License 2.0

Python 99.55% Shell 0.45%

g2p pronunciation scraped-data python-api phonology phonetics speech nlp computational-linguistics linguistics

wikipron's People

Contributors

Stargazers

Watchers

wikipron's Issues

[may] Split into Latin and Arabic files

Malay has a decent number of Arabic spellings.

Specifying the "script" component in languages.json and then regenerating ought to take care of it.

    "script": {
        "latn": "Latin",
        "ara": "Arabic"
    }

[chi] Chinese support

During the big scrape, wikipron worked on scraping Chinese for about 24 hours and the tsvs generated only contained a few entries.

Chinese entries on Wiktionary appear to have IPA information in <li> elements within a collapsible <div>. As mentioned by Jackson here, these entries often contain IPA transcriptions for various Chinese languages and dialects within those languages. Mandarin IPA transcriptions are often given in Sinological IPA, which may contain unofficial(?) IPA symbols.

These entries are perhaps causing problems for _LI_SELECTOR_TEMPLATE in config.py because of intervening <small> elements between the <sup> and <span class="IPA"> that we want to target?

(Certain Chinese entries will simply link to other entries.)

[ice] semi-duplicate entries

We noticed some duplicate entries in Icelandic such that one has the segmented pronunciation and one doesn't (example). We don't yet know where this comes from but it only seems to affect words that begin with þ, and none of these examples seem to have ended up in our Icelandic sample for the paper.

Smoke test timeouts

For an example, see #100. I think these are all false positives...timeout should be increased substantially.

[scraping] duplicate entries

For various reasons we get true duplicate entries (i.e., same graphemic form, same phonemic form) in the big scrape. For example I I don't think this is exactly a bug; this could arise if our processing of the graphemic and/or phonemic form causes a duplicate pair to be formed.

Example: phonemic Icelandic has 77 duplicates.

Two possible solutions:

We keep a set of the pairs as we go. If we find a pair already in the set, we don't print it.
Perhaps we just keep the last form and don't repeat it---this assumes the data is already sorted, which is not obviously correct.
We just apply sort | uniq to every TSV file at the end of the big scrape (can turn this into a simple Bash script and add it to the README.

I think (3) is but could be persuaded.

[khm] Khmer / [tha] Thai / [chi] Chinese support

Both [khm] Khmer and [tha] Thai pull no data in phonetic or phonemic mode.

Both languages list their transcriptions within tables.

Khmer example
Thai example

Remove the "require dialect label" option

From https://github.com/kylebgorman/wikipron/issues/66#issuecomment-544214656 and subsequent discussion, we should remove the "require dialect label" option. We would lose the ability to do the following, neither of which I think is practically useful now:

~~One wants only entries with some dialect label, but doesn't care which dialects~~ (edit: I lied -- this is not possible, because when "require dialect label" is used, "dialect" must also be specified: https://github.com/kylebgorman/wikipron/blob/abecf0341268a9426e489226599ae8881c02aaf3/wikipron/config.py#L198-L202)
One wants a particular set of dialects (by specifying the "dialect" option) but doesn't want entries that lack a dialect label -- this seems strange, as entries without a dialect label (should) mean the pronunciation applies to all (or most major) dialects.

[geo] inconsistencies

The Wikipedia transcription guidelines say that the high front phoneme is /i/ but many transcriptions have /ɪ/ instead. We should fix this upstream (e.g., on Wiktionary itself) and rescrape.

See issue for more context.

[scraping] Segment IPA symbols

We have found that if we construct models over Unicode codepoints, simple models will predict impossible IPA symbol combinations. (Most IPA diacritics do not exist in combining forms.) One solution @m-sean experimented with is as follows. If a codepoint is

a Unicode combining character
in a list of IPA "modifier" characters (e.g., ʷ)

do not split; otherwise split between codepoints. Sample logic here: https://github.com/yeonju123/pair_ngram_model/blob/master/build_sym.py#L13

I propose that, as a feature enhancement, we apply this logic in WikiPron, putting spaces between "symbols" so defined); this could also be made optional and triggered by flag.

(Then we will have to modify the baseline model tools to handle data chunked in this fashion.)

[tha] Thai support

Thai IPA transcriptions are within tables. Example

This is related to the issue in #70 regarding Khmer in that the solution offered for scraping Khmer (creating a new selector template) looks like it would work for scraping Thai.

Thai IPA transcriptions also include tones. Does this present a problem? Are there pre- or post-processing steps that we are considering for handling languages that use IPA tones?

[cym] Add dialect labels

It looks like there are "North Wales" and "South Wales" entries.

Language generator not honoring UTF8

Running codes.py turns the following section in languages.js:

            "hanoi": "Hà Nội",
            "hcmc": "Hồ Chí Minh City",
            "hue": "Huế",
            "tc": "Vinh, Thanh Chương",
            "ht":"Hà Tĩnh"

into

            "hanoi": "H\u00e0 N\u1ed9i",
            "hcmc": "H\u1ed3 Ch\u00ed Minh City",
            "hue": "Hu\u1ebf",
            "tc": "Vinh, Thanh Ch\u01b0\u01a1ng",
            "ht": "H\u00e0 T\u0129nh"

[lat] Macrons not scraped

The Latin "big scrape" files are missing macrons (vertical bars above vowels indicating contrastive length) on the grapheme side. This is because, in Wiktionary backend, Latin words WITHOUT macrons are used as headwords. Example: https://en.wiktionary.org/wiki/malus#Etymology_2_2

Note that that page does pair the macroned form mālus with thhe pronunciation maːlus like we want it to. It seems to me that we need to switch from using the backend headword to the word rendered on the page.

(As it happens, this is exactly the same issue that we discuss for the CoNLL-SIGMORPHON 2017 shared task in http://wellformedness.com/papers/gorman-etal-2019.pdf, which also scraped data from Wiktionary.)

[aze] Azerbaijani has three 3 scripts

Our Azerbaijani data - phonetic, phonemic - contains Latin, Cyrillic, and Arabic scripts.

lat.py does not casefold

While working on Japanese I noticed that lat.py in the extract directory does not call config.casefold on the extracted words. I checked our most recent Latin data and indeed, despite casefold being set to true in languages.json, the data is not all lower case.

This should be an easy fix but I'm submitting the issue because we will need to run Latin again.

mssing phoneme value

Wiktionary sometimes does not provide IPAs, in which case, only graphemes are scraped.

[scraping] remove small TSVs

I propose that the big scrape, instead of removing empty files, removes any file with less than, say, 100 entries. It is relatively clear to me such files will never be any use for modeling.

I would entertain suggestions that the threshold should be 1k instead.

Languages with less than 100 entries should be automatically removed.

Languages with less than 100 entries after remove_duplicates.sh should be removed. We should perhaps move the logic which removes small files out of scrape.py and into generate_summary.py.

Incorporate whitelisting into `languages.json`

Once #152 is merged we should merge the whitelist functionality into languages.json. Presumably it should simply be the case that for any language or dialect if a whitelist value is specified, we load it and generate open two TSV files, one for the total data, one for the filtered data. This would also in theory make whitelist.py obsolete.

I would suggest that we actually call the unfiltered one a special name or put it somewhere different.

[jpn] Japanese support

Bringing over the discussion on Japanese from #62...

Which script(s) should we target? Head words of Japanese entries on Wiktionary could be in Kana, Kanji, or a mix of both? Do we filter or not? Or do we want some sort of post-processing like what's being planned for Serbo-Croatian (#62)?
In parallel with the WIkiPron package, there's this prior work by Alan/Kyle for scraping Japanese specifically. Maybe incorporate this into WikiPron somehow?

@lfashby @kylebgorman Please edit or comment as you see fit.

Big scrape directory reorg

Taking this online after discussion with @lfashby...

One annoying thing about the big scrape scripts is that the TSV directory has a huge list of files only then followed by the README table below. One way to work around this is as follows:

wikipron/languages/wikipron/README.md, which details how to re-run the big scrape, should be moved to the src subdirectory, which doesn't have a README yet
wikipron/languages/wikipron/tsv/README.md, the autogenerated language table, should be moved to wikipron/languages/wikipron instead.

Since this will screw with the paths we should wait until after the next submission deadline to do it...

[khb] Lü support

From Wiktionary:

Lü is a Southwestern Tai language spoken mainly in Yunnan province, China.

Lü has 301 entries on Wiktionary but we were only able to scrape one. This look to be another problem with our _LI_SELECTOR_TEMPLATE. It seems all entries but the one we scraped have title set as title="Appendix:Lü pronunciation (page does not exist)". Here is an example. The sole entry we scraped has title set as title="wikipedia:Lü phonology"

I'd imagine fixing this is a very low priority, but I figured it is something we should be aware of.

[khm] Khmer support

Khmer IPA transcriptions are within tables. Example

An obvious, although perhaps not ideal, solution would simply be to create a new selector template in config.py that targets <tr> or <td> elements containing a span[@class = "IPA"]. You'd then have to have some logic in the yield_phn function that checks if the language is Khmer or Thai and switches the selector template.

[afr] Afrikaans support

Afrikaans pulls no data in both phonemic and phonetic modes.

Dialect support in extraction functions

As mentioned in #146, certain of our extraction functions do not support dialects.

Currently, dialects are handled by the _get_pron_xpath_selector method in config.py, the results of which are stored in pron_xpath_selector. _yield_phn in our default extraction function makes use of config.pron_xpath_selector, as does Japanese (because its extraction function is used for targeting orthographic entries).

Our Chinese, Khmer, Latin, and Thai extraction functions do not use the default pron selector and skip over _yield_phn by calling yield_pron directly - meaning dialect information is ignored.

[pan] Split into Gurmukhi and Shahmukhī (Arabic).

This lexicon really has to come in two parts.

[kxd] Brunei (Malay) not in languagecodes.json

Trying to scrape for Brunei (ISO 639 name; Brunei Malay is the Wiktionary name) led to the following error.

File ".../wikipron/config.py", line 128, in _get_language
language = iso639.to_name(key).split(";")[0]
File ".../lib/python3.7/site-packages/iso639/init.py", line 115, in to_name
raise NonExistentLanguageError('Language does not exist.')
iso639.NonExistentLanguageError: Language does not exist.

Brunei Malay's code is "kxd" which is ISO 639-3 only, meaning it isn't covered by the iso639 package we are using.

For the time being I'll just go ahead and add "kxd": "Brunei Malay" myself to languagecodes.py. Perhaps it should be added in a separate pull request from the one I'll eventually submit with all the new data?
I'm not sure how this happened, shouldn't this have been caught by test_languagecodes.py? (Unless someone added 228 entries to Brunei Malay since test_languagecodes.py was last run.)

[eng, spa]: dialect specifications

In the current checked-in languages.json there is no dialect specification for either English or Spanish, whereas the earlier Bash script had one (and we want one).

For eng: --dialect="US | General American"
For spa: --dialect="Latin America"

[scraping] Support all config options

The scraping should support all config options including cut-off date, and then next big scrape run should use, IDK, 11/1/2019 as a cutoff date?

[eng] UK phonetic tilde in transcriptions

Our English UK phonetic data has ~ separating alternate pronunciations similar to - which we previously removed.

Ex: dance d æ n s ~ d a n s

`output` has no effect in API usage

Lucas has reported the following in https://github.com/kylebgorman/wikipron/issues/68#issuecomment-544209039:

Specifying an output config option prompts Wikipron to create a file that doesn't get written to (when using Wikipron from python).

This is a bug, as it means the output option (for specifying whether scraped data is in a text file or in stdout) has inconsistent behavior being CLI and API usage:

CLI: operational as documented
API: no effect

Two options:

Make the exposed scrape function call use the output option somehow to match the CLI behavior, i.e., if output is used, pipe the results to the specified text file and return `None, but if not used, then just yield the <word, pron> pairs (= current behavior).
Remove the output option completely. For CLI usage, if a user wants data in a text file, they could redirect stdout by >, like $ wikipron fra > fra.tsv (which we would document in the readme). From this perspective, the output option seems unnecessary.

Relatedly, I'm pretty open to improving WikiPron by making breaking changes at this early stage, as I'm suggesting another breaking change now, see https://github.com/kylebgorman/wikipron/issues/66#issuecomment-544214656). For this current ticket, I have a slight preference for just removing the output option, since this would mean keeping the scrape function behave in only one way (i.e., for always yielding the <word, pron> pairs, and not having an alternative behavior of returning None).

Single-language "big scrape" functionality

Often we want to test changes to one or a small number of languages by running a partial "big scrape" and it would be nice to make this easier to do than making copies of languages.json. Maybe something like:

./scrape.py --restriction=lit

Would run the scrape for Lithuanian (all dialects and scripts) but would use the appropriate metadata in languages.json, and

./scrape.py

would run it for all languages in languages.json.

Lingering TSV files

There are lingering TSV files in wikipron/languages/wikipron/ made obsolete by the Big Scrape.

Add logging to whitelister

Once #152 is in we should add some kind of logging to the whitelister so we can get a list of "bad" words and the phones that go with them.

Perhaps it should also produce a some kind of structured report (JSON?) instead of just logging.

Potential problem in _parse_combining_modifiers()

I started the second big scrape and while scraping for phonetic data from Albanian, Wikipron threw an error, the last line of which I'll reproduce below:

File ".../wikipron/config.py", line 73, in _parse_combining_modifiers
last_char = chars.pop()
IndexError: pop from empty list

The final line in the Albanian phonetic tsv is herë h ɛː ɾ meaning the scrape likely failed on this entry which contains what looks like word initial aspiration.

I guess for words like the one that caused this error we would want to combine with next char ʰi d r ɔ ɟ ɛ n?

[jpn] Japanese support

Running the big scrape has revealed that the Japanese Wiktionary community decided to change the format of Japanese entries.

Our Japanese extraction function no longer works and the Japanese data gathered in this run of the big scrape needs to be discarded. We'll need to re-write the extraction function at some point, though for the time being I believe they are still ironing out the details of how the Japanese entry pages should look.

Use `pkg_resources` for version number

Use "#5" here: https://packaging.python.org/guides/single-sourcing-package-version/

[quc] K'iche' support

K'iche' has 117 words with IPA pronunciations but

wikipron quc | wc -l

only gives us 65 entries. I believe this is because many of the entries have whitespace in the graphemic side and that we just need to enable

    "no_skip_spaces_word": true,
    "no_skip_spaces_pron": true,

for this language.

Handle the tie bar for affricates in IPA segmentation

IPA segmentation doesn't appear to correctly handle the tie bar for affricates. For instance, from https://github.com/kylebgorman/wikipron/pull/76#discussion_r339324885, I saw the following for Brazilian Portuguese:

adição a d͡ ʒ i s ɐ̃ w

Instead, we would want the following for this example:

adição a d͡ʒ i s ɐ̃ w

Use positive, not negative flags

Flags should be renamed to be positive statements rather than negative ones, if possible.

[yue] Cantonese support

[yue] is ISO 693-3 only, but should be handled by languagecodes.py in the same way that [nci] for Classical Nahuatl and [hbs] for Serbo-Croatian are.

Nonetheless, [yue] pulls nothing in phonetic and phonemic modes and does not even seem to begin scraping.

[vie] Enhancements

I notice a few obvious problems with Vietnamese:

As far as I can tell, nearly all Vietnamese words have three phonetic pronunciations: "Hà Nội", "Huế, and "Hồ Chí Minh City". We should just add dialects to languages.json
As is well-known, the Vietnamese Roman orthography puts a space between every syllable. So our data set is really only monosyllabic Vietnamese words! We could solve this by adding a flag (--no-skip-space) here, or we could make a language specific-extractor, I suppose.

[lat] Dialect specifications

Add "Classical", "Ecclesiastical", and "Vulgar" specifications for Latin.

Add a --no-tone flag

Thai (#71) was handled by #90, but we'd still like a --no-tone flag to optionally remove the IPA tones.

Serbo-Croatian pitch accents

Somewhat like Latin (#80) Serbo-Croatian headwords largely don't match the word given on the page. The headwords are not given with accent diacritics.
Cyrillic example
Latin example

Here's a quote from the Serbo-Croatian phonology page:

Accent diacritics are not used in the ordinary orthography, but only in the linguistic or language-learning literature (e.g. dictionaries, orthography and grammar books). However, there are very few minimal pairs where an error in accent can lead to misunderstanding.

Is it important to us to grab the words with accents, or can we get away with just grabbing the headword? (The people putting together the Serbo-Croatian Wiktionary IPA category elected to omit accents from the headword.)

[mlt] Blacklist entries in Arabic script?

Maltese lexicon has one entry in Arabic script. I suspect it's ended there from an etymology section somewhere. Is there a way to blacklist scripts during scraping?

[bul] inconsistencies

As reported here there are some inconsistencies with /l/ and the dental stops. As [discussed here](https://en.wiktionary.org/wiki/Wiktionary:Information_desk/2020/April#Performing_bulk_edits, there is a pronunciation module and pron template for Bulgarian on Wiktionary; we may be just bulk-migrate Bulgarian to that template and re-scrape.

[hbs] Serbo-Croatian support

Wiktionary only lists Serbo-Croatian and does not have separate categories for Serbian, Croatian, Montenegrin, and Bosnian - the languages that the ISO 693-3 [hbs] code encapsulates.

Wikipron can handle scraping if given [hbs] but the resulting tsv files may not be of much use. [hbs] tsv files are therefore not included in pull request #61 .

Add a whitelist README

:Add note to contributors of whitelists about what to do. Here is the basic guidance:

Use the standard fork and pull approach.
Make a list of all phones or phonemes, in descending-frequency order, using the appropriate file in data/wikipron/tsv.
Remove any rare, bizarre, and non-native sounds.
For phonemic whitelists, annotate reasonably-frequent allophones with a comment. This is optional for phonetic whitelists.
Run ./postprocess && ./generate_summary.py.
git add the whitelist, the newly-filtered TSV files, and the README and summary TSVs modified, then commit, push, and file the PR. This will also allow the reviewer to quickly see how many forms are removed by whitelisting.

Comment style is Python-like: phone/phoneme, two spaces, #, space, then a title-cased expression with period at end. E.g., t̪ # Allophone of /t/.

[eng] dashed pronunciations

The English data contains roughly 300-400 entries where the pronunciation is given incompletely, with a dash at the beginning, middle, or end of the entry (usually indicating that we're supposed to fill in the rest of the pronunciation ourselves). Some examples:

arabica ə - ɹ ă b ĭ - k ə
observer    ɒ b -
fornicatory - t o ʊ ɹ i

I would propose that we should remove all these punctuations and can do so on a language-independent basis.

cuny-cl / wikipron Goto Github PK

wikipron's People

Contributors

Stargazers

Watchers

Forkers

wikipron's Issues

Recommend Projects

Recommend Topics

Recommend Org