Comments (6)

laurenCassidy commented on June 12, 2024

I have added a folder to the Irish Data folder on the drive. It has 2 text files of tweets (26493 tweets in total) and a README to describe how they were gathered.

from irish-bert.

jowagner commented on June 12, 2024

Created https://github.com/jbrry/Irish-UD-Parsing/issues/9 on supporting unsupervised domain adaptation in our parser.

jowagner commented on June 12, 2024

Observations:

  • line.decode('utf-8') for line in binaryfile.readlines() reports no errors.
  • Rich in special characters, including many emojis and symbols. Asian and other scripts are present, but only a small number of letters from each alphabet.
  • Not yet tokenised. Many cases of an apostrophe being used as a single quote.
  • Inconsistent HTML-like encoding: many occurrences of &amp;, &lt; and &gt;, but also 37 occurrences of & on its own. Two numeric character references for the apostrophe (').
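
The checks above could be reproduced roughly as follows. This is a minimal Python sketch, not the script that was actually run: the function names check_utf8 and count_entities and the regular expression ENTITY_RE are assumptions made for illustration.

```python
import html
import re
from collections import Counter

# Matches named (&amp;), decimal (&#39;) and hexadecimal (&#x27;) character references.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def check_utf8(raw_lines):
    """Return the 1-based numbers of the byte lines that fail to decode as UTF-8."""
    bad = []
    for number, line in enumerate(raw_lines, start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


def count_entities(text):
    """Count character references and ampersands that are not part of one."""
    entities = Counter(ENTITY_RE.findall(text))
    # Remove every recognised reference, then count the ampersands left over.
    bare_ampersands = ENTITY_RE.sub("", text).count("&")
    return entities, bare_ampersands


def unescape(text):
    """Decode character references with the standard library."""
    return html.unescape(text)
```

To apply it to a corpus file, the lines would be read in binary mode for check_utf8 (for example from open(path, 'rb')), and the decoded text passed to count_entities. Whether a single pass of unescape is the right normalisation for the mixed encoding is a separate decision.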

jowagner commented on June 12, 2024

@jbrry Is the folder Lauren added to the Irish Folder part of our current pipeline? I don't see it mentioned in Sec 2 of our paper. It should not be hidden under IMT, especially if Dowling et al. (2018, 2020) do not describe it.

jbrry commented on June 12, 2024

I believe we decided to exclude it from the model pretraining data but to include an experiment where we use it as fine-tuning data. I'm not sure I have that in writing anywhere, but I remember the general consensus being that it wasn't urgent to add it to the pipeline, and all Twitter-related files were deliberately marked with a 0 in our gdrive_filelist.csv.

jowagner commented on June 12, 2024

Confirmed in a fresh copy of gdrive_filelist.csv and replaced paper todo with a green note.

$ grep -i tw gdrive_filelist.csv 
0,data/ga/gdrive/Tweets/Lauren_twitter_corpus.txt
0,data/ga/gdrive/Tweets/README.md
0,data/ga/gdrive/Tweets/Teresa_twitter_corpus.txt
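
The same check could be scripted. The sketch below is only an illustration and is not code from the repository: the helper name split_filelist is hypothetical, and it assumes each row is flag,path and that any flag other than 0 means the file is included.

```python
import csv


def split_filelist(lines):
    """Split 'flag,path' rows into (included, excluded) path lists.

    Assumption: a flag of "0" marks a file as excluded from the pipeline;
    any other flag is treated as included.
    """
    included, excluded = [], []
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        flag, path = row[0].strip(), row[1].strip()
        if flag == "0":
            excluded.append(path)
        else:
            included.append(path)
    return included, excluded


def twitter_paths(paths):
    """Keep paths containing 'tw', case-insensitively, like grep -i tw."""
    return [p for p in paths if "tw" in p.lower()]
```

Calling split_filelist on the open file and then twitter_paths on the included list should return an empty list if all Twitter files are excluded.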
