Gaelic Resources

A list of computational resources for Gaelic.

This list grew out of https://github.com/RichardLitt/endangered-languages, my list of open-source resources for low-resource languages. Going forward, I'm particularly interested in Gaelic.

Tools

Kevin Scannell has a repository with data files and scripts for building Scottish Gaelic spell checkers. It was started through the Crúbadán project and is GPL-licensed. The hunspell-gd repo is likely derived from it.

Corpora

A representative, tagged corpus of Scottish Gaelic, divided into 8 registers (4 spoken, 4 written) of approximately 10k words each. The corpus is presented as individual txt files.

The corpus was hand-tagged by Lamb, Arbuthnot and Naismith and separately verified by them. It uses the Brown format tag separators ('/': e.g. 'agus/Cc') and an annotation scheme derived from the Irish PAROLE tagset (see Uí Dhonnchadha, E. and van Genabith, J. 2006. A Part-of-Speech tagger for Irish using finite state morphology and constraint grammar disambiguation. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 2241-2244.).
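As a minimal sketch of how the Brown-style separator described above could be read, the snippet below splits whitespace-separated `word/TAG` tokens into pairs. The function name `parse_brown` is hypothetical, and the assumption that tokens are separated by whitespace is mine, not something the corpus documentation confirms here. Only the `agus/Cc` example comes from the description above.

```python
from typing import List, Tuple


def parse_brown(text: str, sep: str = "/") -> List[Tuple[str, str]]:
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs.

    Assumption: tokens are separated by whitespace and the tag follows
    the last separator in each token.
    """
    pairs = []
    for token in text.split():
        # Split on the LAST separator so a word containing '/' stays intact.
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator at all: keep the token and leave the tag empty.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


print(parse_brown("agus/Cc"))
```

Splitting on the last separator rather than the first is a design choice that keeps any slash inside the word itself from being mistaken for the tag boundary.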

The annotation scheme is described in a PDF included with the data: Lamb, W. and Naismith, S. (2014) Scottish Gaelic Part-of-Speech Annotation Guidelines.

This work was funded by Bòrd na Gàidhlig and the Carnegie Trust for the Universities of Scotland.

Corpas na Gàidhlig is a constituent project of DASG. It was founded in 2008 with the following aims:

- to create a comprehensive electronic corpus of Scottish Gaelic texts for students and researchers of Scottish Gaelic language, literature and culture
- to provide the textual basis for the interuniversity project Faclair na Gàidhlig (‘Dictionary of the Scottish Gaelic Language’) upon which the future historical dictionary will be based
- to provide a resource which will facilitate corpus planning and corpus development technology for Gaelic

The first phase of Corpas na Gàidhlig aims to digitise 337 texts from all periods of Gaelic literature and to include a wide variety of genres, including poetry, prose, song, and folklore. These texts (listed below) have been prioritised in order to provide part of the textual basis for the interuniversity dictionary project, Faclair na Gàidhlig. It is envisaged that, as Corpas na Gàidhlig progresses, a broad range of other texts will be added and, in time, that speech will also be represented by text and sound files. In the long term, the Corpus will be used to update the dictionary.

To date over 19 million words, mostly Gaelic, have been captured.

The 337 texts to be digitised as part of Phase 1 are listed here (if the appropriate permissions are received).

Corpus contents:

- conversation.txt: an informal conversation
- lecture.txt: a university lecture on philosophy
- sermon.txt: a sermon from a Church of Scotland communion service
- service.txt: a second sermon
- talk.txt: an informal educational/historical/religious talk

All files are encoded in UTF-8 format.
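A rough sketch of how these files might be loaded and summarised follows. It assumes the files sit in the current directory (`corpus_dir` is a placeholder) and that each file holds whitespace-separated `word/TAG` tokens. The helper name `tag_counts` is hypothetical. Files that are not found are skipped.

```python
from collections import Counter
from pathlib import Path


def tag_counts(path, sep="/"):
    """Count POS tags in one UTF-8 file of whitespace-separated word/TAG tokens."""
    counts = Counter()
    # The corpus files are stated to be UTF-8, so decode explicitly.
    text = Path(path).read_text(encoding="utf-8")
    for token in text.split():
        _, found, tag = token.rpartition(sep)
        if found:  # skip tokens that carry no tag
            counts[tag] += 1
    return counts


# File names as listed above; the directory is an assumption.
corpus_dir = Path(".")
for name in ["conversation.txt", "lecture.txt", "sermon.txt",
             "service.txt", "talk.txt"]:
    path = corpus_dir / name
    if path.exists():
        # Show the five most frequent tags in each file that is present.
        print(name, tag_counts(path).most_common(5))
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform default, which matters for Gaelic text with accented vowels.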

Contribute

Please add stuff!

License

The Unlicense
