Multilingual NLP

This list of free and open-source NLP resources, and pointers to language-specific directories of resources, was originally created for a presentation at UCLA on teaching multilingual digital humanities, on May 15, 2019.

This is not a directory but a moderately-opinionated, potentially one-time list of resources that might be of use to digital humanities folks working with languages other than English. That said, if you have suggestions, you can make a pull request.

* indicates resources I've tried out, ^ indicates resources I've created.

Language-agnostic tools & methods

These tools and methods are not tied to any particular language. The caveat is that words have to be separated by a space (and what a "word" is may vary language-to-language, and not all languages put spaces between languages). A further caveat is that highly-inflected languages (e.g. languages with a lot of grammatical cases, like Latin, Russian, or Finnish) may perform poorly without lemmatization (using the "dictionary form" of words, versus whatever inflected form is actually present in the text), especially for smaller text corpora.

Voyant
Lexos
Topic modeling - like Mallet; if you use the topic modeling tool for a GUI-based interface, be sure to go into the "optional settings" and remove the text in the "Tokenize with regular expression" field. Also, your text files must be saved as UTF-8 otherwise it won't work.
Word vectors - Ryan Heuser has a nice set of blog posts introducing word vectors for literary analysis, and you can adapt this Jupyter notebook for Russian text cleaning & word vectors to most languages (more generalized word vector notebook / tutorial coming this summer!)
Word counting, keyword-in-context, and similar approaches (many of which are included in Voyant and Lexos, but you could also write Python or R code)

Modern languages

If you're comfortable working with Python, the Polyglot library provides language detection for 196 languages, tokenization in 165 languages, named entity recognition in 40 languages, part-of-speech tagging in 16 languages, sentiment analysis in 136 languages, and morphological analysis in 135 languages. It can also manage text in multiple languages at once. If you're working a lot with one particular language, it's probably best to find more language-specific tools, but as a better-than-nothing option for highly underresourced languages, it's an option.

A few other general thoughts & notes:

Be very wary of stopword lists. Make sure you have someone who can read the language review it before you pick it up and use it, or worst case, start running it through Google Translate. Stopword lists often include all sorts of words that only count as "stopwords" in the domain they're being used for, and you might inadvertently be exclusing, for instance, all words about computers. The longer the stopword list, the more suspicious you should be.
For very underresourced languages (endangered languages, languages with very small speaker groups, especially languages with unique writing systems) you may find scholarly articles about NLP, but in most cases, whatever proof-of-concept is presented in the paper is a long way from being usable, and odds aren't great that it will get there.

Arabic

Arabic has to be segmented (clitic segmentation) before it can be used well with language-agnostic tools. The Stanford Word Segmenter supports Arabic; usage should be similar to the Chinese segmenter tutorials.

NLP resource directory: Github has a tag for Arabic NLP including pointers to repos for sentiment analysis for tweets, reviews, and standard Arabic, named entity recognition, etc.
Part-of-speech tagger: Stanford NLP has a part-of-speech tagger for Arabic
Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis (standard and Egyptian Arabic), transliteration, and sentiment analysis (standard and Egyptian Arabic)

Armenian

I stumbled onto Armenian recently while looking at full-text PDFs in HathiTrust. The OCR for all the Armenian books I came across was Latin or Greek jibberish, though I was able to get (what looked to me, playing match-the-squiggles) reasonable OCR out of Tesseract. I had a nice exchange with HathiTrust about it, suggesting that I report the errors I came across. In the meantime, though, plan to re-OCR the text if you're getting Armenian from HathiTrust.

Named-entity recognition: training data for Armenian NER using Wikipedia
Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Armenian

Chinese

Chinese needs to be segmented (spaces artificially inserted between words) before it can be used with language-agnostic tools. Stanford NLP Group has a Chinese segmenter. Michelle Fullwood has written a tutorial on using the segmenter.

Tutorial: ^Chinese part-of-speech tagging with Stanford NLP using Stanford NLP tools
Tutorial: ^Chinese named-entity recognition with Stanford NLP using Stanford NLP tools
Python: *xpinyin (for Mandarin) and python-jyutping (for Cantonese), for transliterating Chinese into a phonetic representation, with or without tones. (Example: ^Taiwanese rap analyzer Jupyter notebook for identifying lines of Taiwanese rap lyrics that include repeated tones.)
Python: *PyCantonese: includes jyutping converter/search, stopwords, and part-of-speech tagging for Cantonese.
Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis (Chinese and Gan Chinese), transliteration, and sentiment analysis (Chinese and Gan Chinese)
Platform: MARKUS: reading and markup platform with automatic tagging and identification of Chinese personal and place names, time references, bureaucratic offices, and Buddhist terms.
Platform: DocuSky: collaboration platform supporting XML markup and database creation for Chinese.
Platform: Ten Thousand Rooms Project (廣廈千萬間項目): platform for pre-modern textual studies focusing on Chinese.

Dutch

Python: a fork of Sequence Tagging (NER using TensorFlow) has models for Dutch named entity recognition
OCR: Tesseract 4.0 has training data for Dutch

French

French is partly supported by Stanford Core NLP, so the instructions for doing part-of-speech tagging should be almost identical to other languages that can use that software. Stanford Core NLP doesn't support French named-entity recognition, but there are other tools you can use like OpenNER.

Tutorial (with modifications): ^Part-of-speech tagging with Stanford NLP: this is the German tutorial, but in step 3, replace german-hgc.tagger with french.tagger in the code that you run. You can also use a Universal Dependencies-based tagger (also described in the German tutorial) by replacing german-hgc.tagger with french-ud.tagger. The standard French tagger uses tags from the French treebank.
Named-entity recognition: OpenNER supports French
Python: SpaCy offers POS tags, dependency parse and named entities for French based on a news corpus
CamemBERT language model: for part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI).
Flaubert: word embeddings compatible with Hugging Face's Transformers library
Python: the Polyglot library supports language detection, part-of-speech tagging, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for French

German

There is a large community of DH folks doing text analysis on German under the "Digital Humanities im deutschsprachigen Raum" organization. Projects include QuaDramA – Quantitative Drama Analytics and Rhythmicalizer. A digital tool to identify free verse prosody.

Tutorial: ^German part-of-speech tagging with Stanford NLP
Tutorial: ^German named-entity recognition with Stanford NLP
Book & code: Andrew Piper's Enumerations: Data and Literary Study (Chicago 2018) includes numerous German examples. The data and code from the book are available on Github
Jupyter notebooks: Example code for doing German NLP with different packages
Directory: NLP resources and tools for German
Python: SpaCy offers POS tags, dependency parse and named entities for German based on a news corpus
Python: a fork of Sequence Tagging (NER using TensorFlow) has models for modern and historical German named entity recognition
Python: the Polyglot library supports language detection, part-of-speech tagging, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for German
Blog post: Common pitfalls with the pre-processing of German text for NLP - geared towards commercial applications, but provides a useful overview and comparison of different part-of-speech taggers, stopword lists, compound splitting, etc.

Hebrew

I've recently been working on a Hebrew NLP project, and should have more experience with these tools soon. Because Hebrew is a right-to-left language, I've noticed a few challenges, including file-renaming when the file names include both Hebrew and Latin characters. You may also have to navigate the right-to-left mark Unicode character when processing the text.

Directory: Hebrew NLP resources
Topic modeling: LemLDA: an LDA Package for Hebrew - you'll probably need to run the rule-based Hebrew tokenizer (below) on your text before trying it with this tool-- punctuation like parentheses breaks it.
Python: *rule-based Hebrew tokenizer - I've had some problems with this (Mac, Python 3.7) with regard to successfully saving the output file, but I've stuck the core functions in a Jupyter notebook and added my own input/output code, and it's worked well.
Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Hebrew

Hindi

Jupyter notebooks: Tokenizer and classifier (positive/neutral/negative) for Hindi based on Wikipedia, movie reviews, BBC news
Jupyter notebook: Training Hindi word embeddings using Wikipedia data
Python: Tokenizer and stemmer for Hindi
Python: Hindi dependency parser
Tutorial/Python: Hindi part-of-speech tagging using NLTK
Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis (Hindi and Fiji Hindi), transliteration, and sentiment analysis for Hindi

Indonesian

Directory: Indonesian NLP resources
Directory: Bahasa Indonesia Natural Language Processing
Python: the Polyglot library supports language detection, part-of-speech tagging, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Indonesian

Italian

The major tool available for Italian is *Tint, which is based on (and depends on) Stanford NLP, but not all of the features work well. If you try one output format and it doesn't work, try another. (I can vouch for the .conll format.)

Tutorial: ^Italian part-of-speech tagging with Tint
Tutorial: ^Italian named-entity recognition with Tint
Python: a fork of Sequence Tagging (NER using TensorFlow) has models for Italian named entity recognition
Python: SpaCy offers POS tags, dependency parse and named entities for Italian based on a news corpus
Python: the Polyglot library supports language detection, part-of-speech tagging, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Italian

Japanese

Japanese has to be segmented before it can be used with language-agnostic tools, though Japanese segmentation is built into Voyant in theory (your mileage may vary, it just crashed for me when I tried it with a small corpus).

The most commonly used tool for Japanese text processing is MeCab, which provides segmentation and part-of-speech tagging. There are options for using it with Python, with Python on Mac and with R, but it depends on a library in C++ that may be a problem to get running. (I failed to get any version of MeCab working on a Mac, but I've seen others using it successfully on Windows.) A number of the people I've worked with haven't been very happy with the quality of its segmentation, and have preferred RakutenMA, which is what I've used.

Directory: Japanese text analysis by Molly Des Jardin
Jupyter notebook: Japanese segmentation with RakutenMA - doesn't work yet on Windows due to Unicode issues
Tutorial: Japanese part-of-speech tagging with RakutenMA - uses the demo web interface for RakutenMA
Tutorial: Japanese named-entity recognition with Apache OpenNLP
Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Indonesian

Korean

Python: KoNLPy: Korean NLP in Python, includes part-of-speech tagging, corpora, dictionaries
R: KoNLP, part-of-speech tagging
Directory: Awesome-Korean-NLP, a curated directory of resources, hasn't been updated in about two years
Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Korean

Mongolian

Directory: Mongolian NLP - includes named-entity recognition, data sets (e.g. with personal and clan names)
Python: the Polyglot library supports language detection, morphological analysis, and sentiment analysis for Mongolian

Portuguese

Portuguese is comapratively underresourced for text analysis relative to other colonial languages. While there's materials for training named-entity recognition for Portuguese, you need larger-than-laptop compute to train it. I mean to get back to it as an excuse to learn how to use our local high-performance computing cluster.

Tutorial: ^Brazilian Portuguese part-of-speech tagger
Incomplete tutorial: ^Portuguese named-entity recognition, based on this tutorial by André Pires and using materials from his master's thesis
Python: SpaCy offers POS tags, dependency parse and named entities for Portuguese based on a news corpus
Tutorial: Portuguese examples for Natural Language Processing with Python (NLTK)
Python: the Polyglot library supports language detection, part-of-speech tagging, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Portuguese

Russian

*MyStem from Yandex (Russia's equivalent to Google) is the best NLP toolkit for Russian, and can be downloaded as standalone code. There's a wrapper for Python with PyMyStem3.

Because Russian is highly inflected (i.e. a word can appear in many forms depending on how it's used in a sentence), and each word form is treated as a separate "word" for language-agnostic tools and methods, you may get better results by lemmatizing Russian text before using it with these tools. MyStem can do this, and Python code for doing it is included in the Russian text cleaning & word vectors Jupyter notebook.

Tutorial: ^Russian part-of-speech tagger
Tutorial: ^Russian named-entity recognition, uses Natasha Python module
Jupyter notebook: ^Russian text cleaning & word vectors
Python: *Natasha module for named-entity recognition
Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Russian

Spanish

Tutorial: ^Spanish part-of-speech tagging with Stanford NLP
Tutorial: ^Spanish named-entity recognition with Stanford NLP
Python: SpaCy offers POS tags, dependency parse and named entities for Spanish based on a news corpus
Directory: Spanish NLP
OCR: Tesseract 4.0 has training data for Spanish and Old Spanish
Python: the Polyglot library supports language detection, part-of-speech tagging, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Spanish

Tagalog

Part-of-speech tagger: Filipino tagger for use with Stanford NLP tagger
Sentiment analysis: Sentiment analysis for Filipino tweets
Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Tagalog

Thai

Directory: Thai NLP
Directory: Github has a Thai NLP tag
Python: PyThaiNLP - transliteration, tokenizing, part-of-speech tagger, collation, and other features.
R: thainltk: Thai National Language Toolkit
Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Thai

Tibetan

Python: Pybo Tibetan tokenizer
Transliteration: Wylie converter, a wrapper for an existing Perl tool
Transliteration: Tibetan Phonetics Engine, transliteration based on different schemes / dialects
Part-of-speech tagger: Universal Dependencies Part-of-Speech Tagger for Tibetan
Python: the Polyglot library supports language detection, morphological analysis, and sentiment analysis for Tibetan

Vitenamese

VnCoreNLP: A Vietnamese natural language processing toolkit (Java) - provides word segmentation, POS tagging, named entity recognition (NER) and dependency parsing
Directory: Github has a Vietnamese NLP tag
Sentiment analysis: Vietnamese sentiment analysis for tweets
Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Vietnamese

Welsh

Python: CyTag - text segmenter, sentence splitter, tokeniser, part-of-speech tagger
There's a few papers (e.g. Towards a Welsh Semantic Annotation System) talking about work on CySemTagger (a Welsh semantic annotation tool), but there doesn't seem to be a usable version yet

Yiddish

The Yiddish Book Center has thousands of scanned PDFs of books in Yiddish, but without OCR. To get plain text versions of the books (OCR'd using Jochre), you can create an account here.

Python: the Polyglot library supports language detection, morphological analysis, transliteration, and sentiment analysis for Yiddish

Historical languages

NLP tools for historical languages make the most sense when the language is attested in many (thousands+) documents, and the documents haven't already received a lot of scholarly attention for manual markup and analysis. Akkadian cuneiform tablets continue to be unearthed, and many of those found in archaeological digs over the last century have not yet been published in any usable form. In contrast, there have been far fewer new discoveries of texts in Old Church Slavonic, and the known manuscripts have already been thoroughly marked up by experts. As such, Akkadian is a better target for developing NLP than Old Church Slavonic.

Classical Languages Toolkit (multilingual)

CLTK provides NLP for the languages of Ancient, Classical, and Medieval Eurasia. While Greek, Latin, Akkadian, and the Germanic languages are the most complete, there is also some support for Arabic, Chinese, Ancient Egyptian, Ottoman Turkish, and various classical languages of India. Read the documentation for more information about the extent and nature of the library's coverage.

Jupyter notebooks: CLTK offers Jupyter notebook tutorials for how to use its functionality

Latin

Linked data: LiLa - Linking Latin is developing a linked data knowledge base for Latin
Texts: Digital Latin Library publishes critical editions of Latin texts, and facilitates finding texts online that are written in Latin
Texts: Perseus Digital Library, a longstanding digital humanities project, with texts in Greek, Arabic, and English along with Latin

Coptic

Tool suite: Coptic Scriptorium has multiple tools for annotating, converting, and processing Coptic, including normalizers, a loan-word tagger, lemmatizer, universal dependency treebank, part-of-speech tagger, etc.

Other language families & groups

Languages of Africa

Kathleen Siminyu has been working on developing NLP resources for languages of Africa, and posting update on LinkedIn. A February 2019 post describes work on a Luganda-Kinyarwanda translation model based on word vector embeddings. There's also a collaborative project underway to develop translation models for African languages.

Directory: Ethopian speech and text corpora
Directory: text corpora for languages of Nigeria

Indigenous languages of the Americas

"Challenges of language technologies for the indigenous languages of the Americas" (Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, & Ivan Meza) in Proceedings of the 27th International Conference on Computational Linguistics, 2018) has an excellent overview of the current state of NLP for a variety of indigenous languages of the Americas.

The authors also maintain an updated directory of NLP resources for indigenous languages of the Americas.

Cherokee OCR: Tesseract 4.0 has trained data for Cherokee
Inuktitut OCR: Tesseract 4.0 has trained data for Inuktitut

thenightweburnedchrome / nlp-resources Goto Github PK

nlp-resources's Introduction