Comments (6)
I have added a folder to the Irish Data folder on the drive. It has 2 text files of tweets (26493 tweets in total) and a README to describe how they were gathered.
from irish-bert.
Created https://github.com/jbrry/Irish-UD-Parsing/issues/9 on supporting unsupervised domain adaptation in our parser.
Observations:
- line.decode('utf-8') for line in binaryfile.readlines() reports no errors.
- Rich in special characters, including many emojis and symbols. Asian and other scripts are present, but only a small number of letters from each alphabet.
- Not yet tokenised. Many cases of an apostrophe being used as a single quote.
- Inconsistent HTML-like encoding: many occurrences of &amp;, &lt; and &gt;, but also 37 occurrences of & on its own. Two numeric character references for '.
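The checks above could be reproduced roughly as follows. This is a minimal sketch, not the script that was actually used: the function names and the regular expression are assumptions made for illustration, and only the strict UTF-8 decode comes from the comment itself.

```python
import html
import re
from collections import Counter

# Named or numeric character references, e.g. &amp; &lt; &#39; &#x27;
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def decode_lines(raw):
    """Decode each line strictly as UTF-8; raises UnicodeDecodeError on bad bytes."""
    return [line.decode("utf-8") for line in raw.splitlines()]


def count_references(lines):
    """Count character references and ampersands that are not part of one."""
    counts = Counter()
    bare_ampersands = 0
    for line in lines:
        matches = ENTITY_RE.findall(line)
        counts.update(matches)
        # Every match contains exactly one "&", so the remainder are bare.
        bare_ampersands += line.count("&") - len(matches)
    return counts, bare_ampersands


def unescape_lines(lines):
    """Replace character references with the characters they stand for."""
    return [html.unescape(line) for line in lines]


if __name__ == "__main__":
    sample = b"a &amp; b &lt;3\nR & D it&#39;s\n"  # made-up sample, not corpus data
    lines = decode_lines(sample)
    print(count_references(lines))
    print(unescape_lines(lines))
```

Note that html.unescape follows HTML5 rules and also converts a few legacy named references that lack the trailing semicolon, so a bare & directly followed by such a name would be changed as well. It may be worth inspecting the bare ampersands before unescaping the whole corpus.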
@jbrry Is the folder Lauren added to the Irish Folder part of our current pipeline? I don't see it mentioned in Sec 2 of our paper. It should not be hidden under IMT, especially if Dowling et al. (2018, 2020) do not describe it.
I believe we decided to exclude it from the model pretraining data but to include an experiment where we use it as fine-tuning data. I'm not sure I have that in writing anywhere, but I remember the general consensus being that adding it to the pipeline wasn't urgent, and all Twitter-related files were deliberately marked with a 0 in our gdrive_filelist.csv.
Confirmed in a fresh copy of gdrive_filelist.csv and replaced the paper todo with a green note.
$ grep -i tw gdrive_filelist.csv
0,data/ga/gdrive/Tweets/Lauren_twitter_corpus.txt
0,data/ga/gdrive/Tweets/README.md
0,data/ga/gdrive/Tweets/Teresa_twitter_corpus.txt
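The same check can be scripted. This is a minimal sketch that assumes every row of gdrive_filelist.csv has the two-column form flag,path seen in the grep output, and that a flag of 0 means the file is left out of the pipeline. The function name is hypothetical, and unlike grep the keyword is matched against the path only.

```python
import csv
import io


def excluded_paths(csv_text, keyword=""):
    """Return paths flagged "0" whose path contains keyword (case-insensitive)."""
    rows = csv.reader(io.StringIO(csv_text))
    return [
        row[1]
        for row in rows
        if len(row) >= 2
        and row[0].strip() == "0"
        and keyword.lower() in row[1].lower()
    ]


if __name__ == "__main__":
    # Assumes the file list is in the current directory.
    with open("gdrive_filelist.csv", encoding="utf-8") as f:
        for path in excluded_paths(f.read(), keyword="tw"):
            print(path)
```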