Comments (6)
Good point. I checked the vocab.txt files produced by wiki-bert-pipeline and I can confirm that they follow the same approach:
head -n 106 vocab.txt
[unused..]
[unused98]
[unused99]
[UNK]
[CLS]
[SEP]
[MASK]
a
from irish-bert.
As discussed in issue #33, having a few unused entries in the vocabulary is a great idea to make it easier for users of a model to add extra tokens for fine-tuning. We should do this as well when training our final "from scratch" models. Multilingual BERT provides 99 such entries. We should use the same number of entries and use the same
["[unused%d]" %i for i in range(99)]
format.
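For reference, a quick sketch of what that expression produces (plain Python, nothing project-specific assumed):

```python
# range(99) yields 0..98, so this comprehension produces exactly
# 99 entries, [unused0] through [unused98].
entries = ["[unused%d]" % i for i in range(99)]
print(len(entries))              # 99
print(entries[0], entries[-1])   # [unused0] [unused98]
```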
Just to note a tiny boundary issue there. The stop argument of the range() function is exclusive, so to cover 0-99 it would have to be ["[unused%d]" %i for i in range(100)]
Also note that the first token is typically [PAD], followed by ["[unused%d]" %i for i in range(100)], followed by [UNK] [CLS] [SEP] [MASK] etc.
Sorry if that seems pedantic to point out; I had the opposite in R recently since its c() function is inclusive :^)
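To make the exclusive-stop behaviour concrete, a tiny demo (standard Python, no project code involved):

```python
# Python's range(stop) excludes stop: range(100) covers 0..99
# (100 values), while range(99) stops at 98 (99 values).
print(list(range(3)))         # [0, 1, 2]
print(len(list(range(100))))  # 100
print(len(list(range(99))))   # 99
```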
@alanagiasi Your code would produce 100 entries, one more than previous work quoted in #33 (comment).
As to the placement of [PAD] and the other special tokens, I agree it is best to follow what the existing tools do, as otherwise it may cause problems with other software, including software that we are not currently using.
As to confusion when moving between programming languages, I like to write code in a way that makes things clear even when the reader doesn't know the language-specific details. For example, instead of the above I would write something like
number_of_entries = 99
list_of_entries = []
for entry_index in range(number_of_entries):
    list_of_entries.append('[unused%d]' %entry_index)
where
- the use of _index implies it starts at 0
- number_of_ makes clear how many items are created
- the two facts above together make clear that the last entry will be '[unused98]'
- the use of list_ makes clear that [] is an empty list here
- initialisation of the list and use of .append() make clear that we are creating a list and in what order the elements will appear

(To somebody with mathematical training, Python's list comprehensions look like a set definition, i.e. one would expect the order of items to be undefined and duplicate elements to be discarded, which is, by the way, the right data structure for a vocabulary.)
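For anyone skimming: the explicit loop above and the earlier one-line comprehension build the same list. A small sanity check (plain Python, no project code assumed):

```python
# Explicit loop, as in the comment above.
number_of_entries = 99
list_of_entries = []
for entry_index in range(number_of_entries):
    list_of_entries.append('[unused%d]' % entry_index)

# The comprehension produces the same entries in the same order;
# a set comprehension would discard order (and duplicates).
assert list_of_entries == ['[unused%d]' % i for i in range(99)]
assert list_of_entries[-1] == '[unused98]'
```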
@jowagner yes, it produces 100 entries. I double checked the vocabulary file on Google Drive and it has 100 entries, i.e. unused0 to unused99. Thanks for the comment you linked earlier; I double checked mBERT (the HuggingFace implementation) and it has 99 entries, i.e. unused1 to unused99, which agrees with the Chau et al. (2020) paper James cited.
@jbrry How was the vocab.txt file on Google Drive generated? Do you happen to know if there are options to specify the number of 'unused' tokens etc.?
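A quick way to verify such counts without scrolling through the file by hand. This is only a sketch: it assumes one token per line, as in BERT-style vocab files, and the vocab list below is illustrative, not the actual Google Drive file:

```python
import re

def count_unused(tokens):
    # Count tokens that match [unusedN] exactly.
    return sum(1 for t in tokens if re.fullmatch(r'\[unused\d+\]', t.strip()))

# Illustrative vocab head laid out like the one discussed above:
# [PAD], unused0..unused99, then the special tokens.
vocab = ['[PAD]'] + ['[unused%d]' % i for i in range(100)] \
        + ['[UNK]', '[CLS]', '[SEP]', '[MASK]']
print(count_unused(vocab))  # 100
```

With a real file, the same function can be applied to open('vocab.txt').readlines().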
@alanagiasi, yes, the script used to create the vocabulary in wiki-bert-pipeline can be seen here. It populates the unused entries as well as the padding and special tokens: https://github.com/spyysalo/sent2wordpiece/blob/47ba44e4bb4faa50bc617a7da93987f94a934d3f/sent2wordpiece.py
Ok, so the answer is yes, we have those unused entries in our "from scratch" models. Nothing to do. Closing.