irish-word-frequency's Introduction

Irish Word Frequency List

This is a list of approximately 6,500 Irish lemmas (= "words") ordered by frequency of use. It has been extracted from the Irish half of the New Corpus for Ireland and then cleaned up by cross-checking against a large-coverage lexicon (Kevin Scannell's Líonra Séimeantach na Gaeilge) and removing lemmas that don't occur in this lexicon. The result is a list which is "clean" in the sense that it doesn't contain any punctuation, personal names, English words or other noise.

License

Available under the Open Database License.

Format

This is a plain-text tab-delimited file encoded in UTF-8 with Windows-style line breaks.

  • Column 1 is the lemma's rank: number 1 means the most frequently used lemma, number 2 the second most frequent, and so on.
  • Column 2 is the lemma itself.
  • Column 3 is the lemma's corpus frequency: a number that says how many times it occurs in the entire corpus.
  • Column 4 is the lemma's window size: on average, the lemma occurs once in every stretch of this many words.
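
As a rough illustration of the format described above, here is a minimal Python sketch that parses the four columns. The sample rows are made up (they are not taken from the real file), the `Entry` type and `parse_frequency_list` function are hypothetical names, and reading the window size as a float is an assumption, since the description does not say whether it is a whole number.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    rank: int          # column 1: position in the list, 1 = most frequent
    lemma: str         # column 2: the lemma itself
    frequency: int     # column 3: occurrences in the whole corpus
    window: float      # column 4: average number of words per occurrence


def parse_frequency_list(text: str) -> list[Entry]:
    """Parse tab-delimited rows with the four columns described above."""
    entries = []
    # splitlines() copes with Windows-style "\r\n" line breaks.
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing one
        rank, lemma, frequency, window = line.split("\t")
        entries.append(Entry(int(rank), lemma, int(frequency), float(window)))
    return entries


# Made-up sample data for illustration only.
sample = "1\tlemma_a\t1000\t30\r\n2\tlemma_b\t500\t60\r\n"
entries = parse_frequency_list(sample)
for entry in entries:
    print(entry.rank, entry.lemma, entry.frequency, entry.window)
```

To read the real file, pass the result of `open(path, encoding="utf-8").read()` to the function, where `path` is wherever you saved the list.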

irish-word-frequency's People

Contributors

michmech

irish-word-frequency's Issues

`cál` too high up?

Sorry just wanted to register a further issue although I know this is an old repository.
I'm wondering why cál is so high up the list as 'kale/cabbage' doesn't seem to merit such a high position.

Anyhow probably time I dived into creating a similar word frequency list myself from the source texts as then I'll be able to investigate myself!

téarmach

You should not have replied to the other issues, you are just encouraging me :D

I came across this one this week.

téarmach (a1, "terminal") is in https://www.teanglann.ie/ but there is nothing in https://www.focloir.ie/

A search in the corpus at http://corpas.focloir.ie/ reveals only téarma and ghearrtéarmach

I guess the answer is that téarmaí (terms) has been incorrectly lemmatized to the adjective rather than the noun.

bigrams?

Hi,
I'm interested in the script / methodology used to construct this list.

Specifically, 'coinne' comes up quite high in the frequency list, but I imagine that's because of its use in phrases such as 'i gcoinne' (against), 'gan choinne' (unexpectedly) & 'os coinne' (in front of/opposite).

From a language learning pov, I'd like to learn these phraselets separately, so my idea is to allow bigrams alongside high frequency words. E.g. given the corpus frequency for 'coinne' as 8507, maybe the above 3 phrases have (say) frequencies of 4000, 3000, and 1000, in which case they would appear in the top 6,500 list and bump the plain 'coinne' version off the list (it would now have a frequency score of 507 after subtracting the bigram frequencies).
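
To make the arithmetic concrete, here is a minimal Python sketch of the idea. The phrase counts are the guessed ("say") figures from this issue, not measured corpus values, and `redistribute` is a hypothetical helper name; this is not how the list was actually built.

```python
def redistribute(lemma: str, lemma_frequency: int,
                 phrase_frequencies: dict[str, int]) -> dict[str, int]:
    """Split a lemma's count into its phrases plus a standalone remainder."""
    remainder = lemma_frequency - sum(phrase_frequencies.values())
    if remainder < 0:
        raise ValueError("phrase counts exceed the lemma's total count")
    result = dict(phrase_frequencies)
    result[lemma] = remainder  # what is left for the plain lemma
    return result


# Guessed phrase frequencies from the issue text, for illustration only.
counts = redistribute(
    "coinne",
    8507,
    {"i gcoinne": 4000, "gan choinne": 3000, "os coinne": 1000},
)

# Re-rank the phrases and the leftover lemma by frequency, highest first.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for entry, frequency in ranked:
    print(entry, frequency)
```

With these guessed numbers the plain 'coinne' is left with 8507 − 8000 = 507 and sorts below all three phrases.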

Is the source code for how this list was created available?

With thanks!

proper name removal?

Was wondering why 'dobhar' was appearing so high up in the list and, after puzzling over the dictionary entries on focloir & teanglann, I remembered that Gaoth Dobhair would likely be a common Gaeltacht placename mentioned in the source texts. Just want to mention it as an issue in case others use this repository, and to add a query as to whether proper names were correctly identified? (I know Gaillimh is in the list and kept capitalized, which is fine.)
