Lauren mentioned that she will be using parser-bootstrapping to annotate the Irish Twit

Create a ga_BERT model which does continued pre-training on Irish Tweets or is trained from scratch with twitter data

jbrry commented on June 12, 2024

Create a ga_BERT model which does continued pre-training on Irish Tweets or is trained from scratch with twitter data

from irish-bert.

Comments (6)

laurenCassidy commented on June 12, 2024

I have added a folder to the Irish Data folder on the drive. It has 2 text files of tweets (26493 tweets in total) and a README to describe how they were gathered.

jowagner commented on June 12, 2024

Created https://github.com/jbrry/Irish-UD-Parsing/issues/9 on supporting unsupervised domain adaptation in our parser.

jowagner commented on June 12, 2024

Observations:

- `line.decode('utf-8') for line in binaryfile.readlines()` reports no errors.
- Rich in special characters, including many emojis and symbols. Asian and other scripts are present, but only a small number of letters from each alphabet.
- Not yet tokenised. Many cases of an apostrophe being used as a single quote.
- Inconsistent HTML-like encoding: many occurrences of `&amp;`, `&lt;` and `&gt;`, but also 37 occurrences of `&` on its own. Two numeric character references that decode to `'`.
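The checks behind these observations can be sketched roughly as follows. This is a minimal illustration, not the script that was actually run: the function names `inspect_lines` and `normalise` and the two regular expressions are assumptions made for the example, and the counts reported above were not produced with this code.

```python
import html
import re

# Character references such as &amp; &lt; &#39; &#x27; (assumed pattern)
_REF = r"(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);"
ENTITY_RE = re.compile("&" + _REF)
# An ampersand that does not start a character reference
BARE_AMP_RE = re.compile("&(?!" + _REF + ")")


def inspect_lines(raw_lines):
    """Return (decode_errors, entity_count, bare_ampersand_count).

    raw_lines is an iterable of bytes objects, e.g. the result of
    readlines() on a file opened in binary mode.
    """
    decode_errors = 0
    entities = 0
    bare = 0
    for raw in raw_lines:
        try:
            line = raw.decode("utf-8")
        except UnicodeDecodeError:
            decode_errors += 1
            continue
        entities += len(ENTITY_RE.findall(line))
        bare += len(BARE_AMP_RE.findall(line))
    return decode_errors, entities, bare


def normalise(line):
    """Decode HTML character references so the text is encoded consistently."""
    return html.unescape(line)
```

For a real file, `inspect_lines(open(path, "rb").readlines())` would give the three counts, where `path` is whichever tweet file is being checked. Whether the tweets should be normalised with `html.unescape` before pre-training is a separate decision that this sketch does not settle.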

jowagner commented on June 12, 2024

@jbrry Is the folder Lauren added to the Irish Folder part of our current pipeline? I don't see it mentioned in Sec 2 of our paper. It should not be hidden under IMT, especially if Dowling et al. (2018, 2020) do not describe it.

jbrry commented on June 12, 2024

I believe we decided to exclude it from the model pretraining data but to include an experiment where we use it as fine-tuning data. I'm not sure if I have that in writing anywhere, but I remember the general consensus being that it wasn't urgent to add it to the pipeline, and all Twitter-related files were deliberately marked with a 0 in our gdrive_filelist.csv.

jowagner commented on June 12, 2024

Confirmed in a fresh copy of gdrive_filelist.csv and replaced paper todo with a green note.

$ grep -i tw gdrive_filelist.csv 
0,data/ga/gdrive/Tweets/Lauren_twitter_corpus.txt
0,data/ga/gdrive/Tweets/README.md
0,data/ga/gdrive/Tweets/Teresa_twitter_corpus.txt
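The same check can be sketched in Python for anyone who prefers it to `grep`. This is only an illustration under stated assumptions: the thread shows rows of the form `flag,path` and says the Twitter files are marked with `0`; that any other flag value means "included" is an assumption here, and the function name `split_by_flag` and the `1,...` sample row are hypothetical.

```python
import csv
import io


def split_by_flag(csv_text):
    """Split 'flag,path' rows into (included, excluded) path lists.

    Assumption: a flag of "0" marks a file as excluded from the pipeline;
    every other flag value is treated as included.
    """
    included, excluded = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        flag, path = row[0].strip(), row[1].strip()
        if flag == "0":
            excluded.append(path)
        else:
            included.append(path)
    return included, excluded


# The three Tweets rows are from the grep output above;
# the first row is a hypothetical example of a non-zero flag.
sample = """1,data/ga/example/hypothetical_corpus.txt
0,data/ga/gdrive/Tweets/Lauren_twitter_corpus.txt
0,data/ga/gdrive/Tweets/README.md
0,data/ga/gdrive/Tweets/Teresa_twitter_corpus.txt
"""
included, excluded = split_by_flag(sample)
```

Filtering `excluded` for paths containing `Tweets` then gives the files that are held out of pre-training.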
