jbrry / irish-bert Goto Github PK

View Code? Open in Web Editor NEW

9.0 9.0 0.0 26.25 MB

Repository to store helper scripts for creating an Irish BERT model.

License: Other

Shell 19.53% Perl 3.17% Python 77.30%

irish-bert's People

Contributors

Stargazers

Watchers

irish-bert's Issues

feature request: download handlers to skip existing files

Running python scripts/download_handler.py --datasets conll17 NCI again, I get error messages

unxz: data/ga/conll17/Irish/ga-common_crawl-000.conllu: File exists
bzip2: Output file data/ga/conll17/raw/ga-common_crawl-000.txt.bz2 already exists.

and the process seems to take as long as in the initial run. It would be great if the download handler detect that nothing needs to be done.

How is the NCI sampled?

Do the sentences follow on from each other? How big are the passages that are sampled? Do we have document/passage delimiter information?

This could affect the Next Sentence Prediction task in BERT.

Populate unused vocabulary entries of our mBERT-based models

Issue #33 points out that there are 99 unused entries in the mBERT vocabulary intended for users to add task-specific vocabulary entries for fine-tuning. We could use the entries to improve the vocabulary's coverage of Irish without having to train from scratch. However, to not put stones in the way of users of our models who want to use unused entries for their own tasks, we should not use all 99 entries.

A way to choose the entries to add would be to induce new vocabularies for a clean Irish corpus, reducing the size until the number of new entries, i.e. entries that are not in the mBERT vocabulary, is less than or equal to the number of entries we want to add, say 49.

NCI: \x sequences

Issue #4 reports 7 occurrences of \x\x13 but no other use of backslashes as escape characters with special meaning.

Create a ga_BERT model which does continued pre-training on Irish Tweets or is trained from scratch with twitter data

Lauren mentioned that she will be using parser-bootstapping to annotate the Irish Twitter UD treebank.

The current data used in ga_BERT might not be that suitable for parsing tweets. It might be a good idea to create a ga_BERT model tailored to Irish twitter data. This could either be:

ga_BERT (with all data) with continued pre-training on a corpus of Irish tweets.
ga_tweeBERT (or some other name) that is all the data used in ga_BERT + ga twitter data. This is initialised from scratch so the vocab contains code-switched tokens, acronyms, slang etc.

For reference, see: BERTweet

NCI: all-caps text

We noticed some all-caps text, mostly headings.

How frequent is this?

It can be argued that these cases should be kept as is to allow BERT to learn to produce useful representations for all-caps text as all-caps text may also occur at test / production time.

Robustness to missing accents, all-caps text and other deviations from well-edited text

To make our BERT model more robust to deviations from well-edited text typically found in real-world input, we could augment the training corpus with synthetic text derived from the current corpus by removing some accents from characters in a way to mimic social media content, putting text into all-caps, removing punctuation and/or spaces, using short forms used in text messages in Irish and inserting common spelling errors.

NCI: & html entities in text

Issue #4 reports various issues around ampersands in the NCI.

NCI: inconsistent <s> and tags

Issue #4 reports:

I found a missing <s> tag. Our new extractor script should use any of <s>, , <doc> and <file> (and the respective closing tags) as trigger for a sentence boundary.
Glue tags <g/> indicating that there was no space between the neighbouring tokens are not used.
No occurrences of < or > outside tags.
Number of  equals number of <s>, i.e.  are useless here.
Some  and </s> are missing.

Try adding Hiberno-English corpora and English side of parallel corpora

Preliminary results with less strict language filtering may mean that including English corpora helps.

html code fragments in NCI

investigate frequency of code fragments like color= and filter / clean up if worthwhile

NCI: unescaped & in doc attribute

Issue #4 reports that doc id="itgm0022", doc id="icgm1042" and doc id="iwx00055" have unescaped & in the value of attributes. XML parser not happy.

Character encoding problems in the NCI

Some characters in the NCI are not properly encoded. This affects characters in otherwise ok sentences, or whole blocks of text.

NCI: tokens may contain spaces

Issue #4 reports: Tokens in tab-separated columns may contain space characters.

This caused early versions of our extractor to miss some tokens.

Why are long sentences removed?

Issue #39 (comment) discovered that the pre-processing pipeline removes lines with more than 100 tokens. Why? Is there a problem feeding very long sentences into BERT?

Original version of NCI

Get the original version of the NCI from Meghan

Investigate sentpiece vocabulary conversion

#41 (comment) links to vocabulary conversion code that converts from sentpiece to wordpiece format. However, this conversion does not catch cases where the sentpiece boundary is inside a sentpiece token, e.g. [Hell] [o▁Wor] [ld] [.] instead of [Hello] [▁Wor] [ld] [.]. Note this is not a normal underscore but U+2581.

Do we have such cases in our vocabulary?
When used as a wordpiece vocabulary, the special U+2581 symbol is not used. Doesn't this mean that such vocabulary entries are unused entries and could be removed or renamed to additional [unused%d] entries?

Decoding in text files (character reference entities)

Thanks to Joachim for pointing this issue out and providing the command line.
There are decoding issues with some (approx 65,000) characters in plaintext files.
Searching through text files for regex [&][#0-9a-zA-Z]+[;] yields the counts and matched strings listed below.
(RegEx reminder [&][#0-9a-zA-Z]+[;] is any string beginning with & followed by one or more # or alphanumeric and ending in ; )

Note that some/many/all of these strings may be in files that are on the exclude list of Irish_Data/gdrive_filelist.csv so could potentially be ignored. Assuming the strings should be replaced by the correct character in the first instance, further investigation and action is required.

39 &#000;
125 &#124;
  1 &#233;
  1 &#250;
  8 &#38;
1854 &#91;
1828 &#93;
176 &#x2018;
174 &#x2019;
  1 &#x201C;
  1 &#x201D;
  1 &#xe1;
  2 &Dodgers;
  1 &aacute;
26840 &amp;
15743 &apos;
  4 &c;
 85 &gt;
 59 &lt;
 35 &nbsp;
18510 &quot;

find Irish_Data -type f | fgrep -v .tmx | xargs grep -h -o -E "[&][#0-9a-zA-Z]+[;]" | sort | uniq -c

39 �
125 |
1 é
1 ú
8 &
1854 [
1828 ]
176 ‘
174 ’
1 “
1 ”
1 á
2 &Dodgers;
1 á
26840 &
15743 '
4 &c;
85 >
59 <
35
18510 "

Concatenate output of different tokenisers

Rather than having to tell users of our BERT models what tokeniser to use it would be nice to be robust to the choice of tokenisers. Robustness is likely to improve by combining data obtained with different tokenisers, preferably the most popular ones.

To some extend we are doing this already:

The NCI is tokenised with a different tool than the other gdrive files, which are processed by udpipe trained on IDT+EWT.
CoNLL'17 data is tokenised with udpipe trained on IDT only.

readme: what bucketsize should be used?

Section "Steps for Downloading pre-training Corpora" gives the reader freedom to chose the bucket size as they see fit and from recent discussion I understood we need multiple buckets for the next sentence prediction objective. However, Section "Steps for Filtering Corpora" says one must have only 1 bucket.

Improve sentence splitter for tokenised text

The heuristic in split_tokenised_text_into_sentences.py is too simplistic:

Full-stops in quoted text such as in ' Is cuid den searmanas é . ' ar sise . should not count as split point.
35704 "sentences" containing just the single quote character are produced.
Enumerations such as 1. seem to be tokenised as two tokens in the NCI. These should not be split.
Roman enumerations are rare, e.g. 21 cases of IV or iv at the start of a sentence.

Suggestion:

Recursively split at the best split point as long as there is a sufficiently good split point. ✔️
Reject split points that would result in a half
- not containing any letters ✔️
- with the first letter being a lowercase letter ✔️ (+ exception for sub-enumeration e.g. (a))
- only containing a Roman number (in addition to the full-stop) ✔️
Reject the following split points:
- DR . (Dr. is tokenised correctly.) ✔️
- Prof . (Prof. does not occur.) ✔️
- nDr . (seems to be an inflected form of Dr.; always following an) ✔️ + Iml .
All else being equal, preferring a split point balancing the lengths of the halves ✔️

Include en_ewt in the tokenizer's training data

Understanding the process behind data anonymisation

Some corpora (e.g. Roinn na Gaeltachta) have anonymised versions available. In this situation, we have made the decision to train on the anonymised version.
However we still need to understand what the process of anonymisation does to the data, so that we can understand whether whole sentences have been deleted (could affect Next Sentence Prediction task) or whether emails, names have been masked with special tokens, or simply deleted.

Include unused entries in vocabulary of "from scratch" models

As discussed in issue #33, having a few unused entries in the vocabulary is a great idea to make it easier for users of a model to add extra tokens for fine-tuning. We should do this as well when training our final "from scratch" models. Multilingual BERT provides 99 such entries. We should use the same number of entries and use the same ["[unused%d]" %i for i in range(99)] format.

Try adding synthethic Irish text

We could augment the BERT training data with English text, or text in other languages, machine translated to Irish and/or with automatic paraphrases of Irish text.

Is their previous work adding synthetic text in the target language to the BERT training data, such as output from a machine translation model?

Provide up-to-date pre-processed text files

The file Irish_Data > processed_ga_files_for_BERT_runs > train.txt mentioned in issue #32 is severely out of date and there is no documentation what setting was used. Please update and add a readme. Given that BERT requires multiple input files for its next sentence objective, it will also be better for reproducibility do provide these individual files, e.g. as a .tgz.

Merge subcorpus-specific wordpiece vocabularies

When training on Irish, English and possibly other languages, Hung et al. (2020) "Improving Multilingual Models with Language-Clustered Vocabularies" suggest to create wordpiece vocabularies for clusters of related languages and then use the union of these vocabularies as the final vocabulary used during BERT training and prediction. For us this could mean to split the data into (1) clearly English only, (2) clearly Irish only and (3) all other text, train 3 vocabularies and merge them.

NCI: large <s> elements

Issue #4 reports: The content inside some <s> elements is huge and spans many sentences. The longest element has 65094 tokens. The 100th longest has 5153 tokens.

Assuming the tokeniser used to tokenise the text is good, sentence-ending punctuation appearing as a separate token is a reliable indicator of a sentence boundary. In case of quotation at the end of a sentence, the boundary may have to be moved after a quote. A more tricky case are quotations of full sentences within a sentence.

For BERT, however, we should be ok with some wrong boundaries as these will only be visible to BERT if a pre-processing filter removes a sentence next to a wrong boundary.

Leak of non-public OSCAR download URL

Somebody added a download script with a URL for oscar unshuffled that should not become public. TODO: Ask OSCAR people can they invalidate the URL. If not, we must be careful to never make this repo public. Only releases without code history can be made, after the URL has been removed from the code base.

Switch to 2.7 of IDT

At the moment we are using 2.5 of IDT, switch to 2.7

NCI: character replaced by space

Issue #4 reports: There are cases of special characters replaced by spaces, e.g. in G idhlig.

How frequent is this issue?

Detection probably is different to related issue #17 as removing the space will often produce unknown or rare character n-grams.

Try to find a filtering sweet spot

With very aggressive filtering, we don't see improvements over the unfiltered results:

https://docs.google.com/spreadsheets/d/1ssKM8xQZSTED_-mhVsmhercU9zmMxYxHmxB06wZM-wY/edit#gid=1677680531

See what happens when we keep more, e.g. sentences containing titles in English.

Is confidential training data sufficiently protected?

Carlini et al. (2020) suggest that training data can be recovered from modern LMs. Furthermore, there are "membership inference" methods that can check whether a given text fragment was part of the training data. Do such methods also work for ga_BERT? If yes, are our data providers ok with this?

It may also make our resource paper stronger to include such an analysis.

References:

Language filtering for NCI?

Lauren's annotation of a sample of 1000 <s> segments from the .vert file, i.e. not yet split into sentences according to sentence final punctuation, indicates that about 1% of the NCI is English and about 0.6% is code-switching. 1.4% cannot be annotated out of context.

If we still want to try applying a language filter, we can choose between Ailbhe's hand-crafted filter and the machine-learning based filter in our current BERT pipeline. These could be tested using the sample annotated by Lauren.

NCI: unexpected hyphens in words

Issue #4 reports Some tokens contain unexpected hyphens, e.g. Massa-chusetts. Probably a problem with conversion from PDF.

Wagner et al. (2007) Section 5.1.3 propose to "create three candidate substitutions (deletion, space, normal hyphen) and vote based on the frequency of the respective tokens and bigrams in [a reference corpus]".

Questions:

Does this concern normal hyphens, soft-hyphens or both?
How frequent is the issue?

What is filtered out?

What kind of material does the filter remove from the NCI? Take a random sample of the 886823 sentences and look for patterns.

NCI: extract text from doc title attribute

Issue #4 reports: Some of the <doc> tags have a title attribute that has Irish text not part of the document itself. We could add this text as a separate sentence before the first sentence to get even more data. Same could be done with the author attribute whenever the pubdate field is not empty and medium one of "book" and "newspaper".

Our extraction script can include these with --title and --author. A restriction to particular media types or pubdates is not implemented yet.

Increase weight of clean corpora such as NCI

When combining NCI with common crawl, paracrawl, OSCAR and other noisy corpora, it may be beneficial to give more weight to clean corpora, e.g. by concatenating multiple copies.

Try adding parallel text or dictionary content with template texts

https://arxiv.org/abs/2010.08275 found via https://twitter.com/hila_gonen/status/1318465935104245760 suggests that BERT can translate via simple text prompts (gap completion). This means that it learns the necessary connections and may mean that BERT's knowledge of words (and their translations) can be improved by feeding sentences containing statements about translation equivalences into BERT at training time. Same may work for phrases and complete sentences.

https://twitter.com/Wjrgo/status/1602919018507403266

NCI: empty sentences in vert file

Issue #4 reports: There are empty sentences.

Our extractor skips these: https://github.com/jbrry/Irish-BERT/blob/master/scripts/extract_text_from_nci_vert.py

Can we move gdrive_filelist.csv to the repo?

It would be handy for issue #35 to have gdrive_filelist.csv in this repo rather than in cloud storage. Can the list of filenames be published or is the list a secret?

Run some configurations with RoBERTa instead of BERT

Several changes, including removal of NSP task

This issue is about training a ga-roberta model on our data. For evaluating existing multilingual roberta models in our tasks, see issue #68 .

What happens with sentences greater than 128 tokens in length with BERT

What vector representation does the BERT model provide if the sequence length is over 128 tokens? Over 512 tokens?

Look at some long sentences in ga_idt.

NCI: use of combining diaeresis character

Issue #4 reports: The unicode combining diaeresis character occurs 18 times. When slicing and recombining character sequences, care must be taken not to separate it from its preceding character, or at least not let it end up at the start of a token, not to fail strict unicode encoding checks.

Inspect how BERT tokenization affects tokens which are composed of characters and punctuation

Since inconsistencies in the tokenisation are hard to avoid when working with corpora from different sources, it may help the final model to force tokens like "etc." to be split into two word pieces by removing vocabulary entries X+PUNCT if X is in the vocabulary and replacing X+PUNCT with X if not before we train BERT, in particular if the user's tokeniser splits more aggressively than our tokenisers.

5 runs per model - different random seed per model

Experiment with latest version of paracrawl (7.1)

We aren't using paracrawl at the moment, but it might be worth looking at the latest version (7.1)

NCI: words sometimes split into small pieces

Issue #4 reports: There are cases of words split into small pieces, e.g. T UA R A SC Á I L B H L I A N TÚ I L A N O M B UD S MA N 1 9 9 7.

How frequent is this issue? Are there any tools we could use to automatically detect and fix such cases?

An idea for detecting the errors may be to scan in a window of say 5 tokens for a surge in OOV rate that does not go hand-in-hand with a high rate of unknown character n-grams after removal of all spaces.

An idea for fixing the errors may be to synthesise a parallel corpus of text with this error automatically inserted and the original text and then train

an MT system to translate from a mixture of split words and normal words to normal words
a sequence tagger to tag each space whether it should be removed

Handling of new emoji and other OOVs

There will be a lot of characters that are not in the word piece vocabulary, especially when we limit the building of the vocabulary to the most clean sources and move to very different domains such as social media content that can use a rich and expanding set of emojis.

How are OOV characters handled when they occur in the input at training time?
Will the embedding table be expanded to include the new character?
Are the rarest characters mapped to a special word piece <UNK> to learn how to handle new characters that appear at test time?
If not, what other strategy is used to handle new characters at test time? For example, a possibility is to replace them with [MASK] to pretend one cannot see them. (At pre-training time, it probably would be a good idea to exclude such tokens from the loss.)

NCI: missing boundary between headings and first paragraph

Issue #4 reports: Looking at the first 100 lines, it seems that all-caps headings and the first sentence of a section are not separated. However, re-doing the sentence splitting without the extra signals from markup in the original documents probably would produce an overall worse segmentation.

jbrry / irish-bert Goto Github PK

irish-bert's People

Contributors

Stargazers

Watchers

irish-bert's Issues

Recommend Projects

Recommend Topics

Recommend Org