
uniparser-grammar-meadow-mari's Introduction

Meadow Mari morphological analyzer

This is a rule-based morphological analyzer for Meadow Mari (mhr; Uralic > Mari). It is based on a formalized description of literary Meadow Mari morphology, which also includes a number of dialectal elements, and uses uniparser-morph for parsing. It performs full morphological analysis of Meadow Mari words (lemmatization, POS tagging, grammatical tagging, glossing).

How to use

Python package

The analyzer is available as a Python package. If you want to analyze Meadow Mari texts in Python, install the module:

pip3 install uniparser-meadow-mari

Import the module and create an instance of the MeadowMariAnalyzer class. Set mode='strict' if you are going to process text in standard orthography, or mode='nodiacritics' if you expect some words to lack diacritics (which often happens in social media). After that, you can either parse tokens or lists of tokens with analyze_words(), or parse a frequency list with analyze_wordlist(). Here is a simple example:

from uniparser_meadow_mari import MeadowMariAnalyzer
a = MeadowMariAnalyzer(mode='strict')

analyses = a.analyze_words('Морфологийыште')
# The parser is initialized before first use, so expect
# some delay here (usually several seconds)

# You will get a list of Wordform objects
# The analysis attributes are stored in its properties
# as string values, e.g.:
for ana in analyses:
    print(ana.wf, ana.lemma, ana.gramm, ana.gloss)

# You can also pass lists (even nested lists) and specify
# output format ('xml' or 'json')
# If you pass a list, you will get a list of analyses
# with the same structure
analyses = a.analyze_words([['А'], ['Мый', 'тыйым', 'йӧратем', '.']],
                           format='xml')
analyses = a.analyze_words(['Морфологийыште', [['А'], ['Мый', 'тыйым', 'йӧратем', '.']]],
                           format='json')

Refer to the uniparser-morph documentation for the full list of options.
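
The attributes used in the example above (wf, lemma, gramm, gloss) can be collected into plain dictionaries for further processing. The following is only a minimal sketch: the helper name analyses_to_rows is hypothetical and not part of the package, and it assumes nothing about Wordform objects beyond those four attributes. A stand-in object with placeholder values is used so that the sketch runs without the analyzer installed.

```python
from types import SimpleNamespace

# Attribute names taken from the example above
FIELDS = ("wf", "lemma", "gramm", "gloss")


def analyses_to_rows(analyses):
    """Turn a flat list of Wordform-like objects into a list of dicts.

    Hypothetical helper; a missing attribute becomes an empty string.
    """
    return [{field: getattr(ana, field, "") for field in FIELDS}
            for ana in analyses]


# Stand-in for one analysis; the values are placeholders,
# not real analyzer output
fake = SimpleNamespace(wf="word", lemma="lemma", gramm="N,sg", gloss="STEM")
rows = analyses_to_rows([fake])
print(rows)
```

With the real analyzer, the list returned by a.analyze_words() for a single token could be passed in place of the stand-in list.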

Disambiguation

Apart from the analyzer, this repository contains a very small set of Constraint Grammar rules that can be used for partial disambiguation of analyzed Meadow Mari texts, as well as for assigning the nonposs tag to all nominal forms without possessive affixes. If you want to use them, set disambiguate=True when calling analyze_words():

analyses = a.analyze_words(['Мый', 'тыйым', 'йӧратем'], disambiguate=True)

In order for this to work, you have to install the cg3 executable separately. On Ubuntu/Debian, you can use apt-get:

sudo apt-get install cg3

On Windows, download the binary and add the path to the PATH environment variable. See the documentation for other options.
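
Before requesting disambiguation, it may help to check that the cg3 executable is actually reachable through PATH. The sketch below uses only the standard library (shutil.which). The helper name find_cg3 is hypothetical, and the assumption that the executable is called cg3 on every platform comes from the text above rather than from testing each installation method.

```python
import shutil


def find_cg3(executable="cg3"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


cg3_path = find_cg3()
# Only request disambiguation when the executable was found
use_disambiguation = cg3_path is not None
if use_disambiguation:
    print("cg3 found at", cg3_path)
else:
    print("cg3 not found on PATH; disambiguation would not work")
```

The resulting flag could then be passed on, for example as disambiguate=use_disambiguation.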

Note that each time you call analyze_words() with disambiguate=True, the CG grammar is loaded and compiled from scratch, which slows the analysis down further. If you are analyzing a large text, pass the entire text in a single function call rather than sentence by sentence.
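
The single-call approach can be sketched as follows. This is an illustration under stated assumptions, not part of the package: analyze_whole_text is a hypothetical wrapper, and fake_analyze_words is a stand-in for a.analyze_words that only records how often it is called, so that the sketch runs without the analyzer or cg3 installed.

```python
def analyze_whole_text(analyze_words, sentences, **options):
    """Analyze all sentences with a single call to analyze_words.

    analyze_words: a callable, e.g. the analyze_words method of an analyzer
    sentences: an iterable of token lists, one list per sentence
    """
    return analyze_words(list(sentences), **options)


# Stand-in for a.analyze_words: it records each call and returns
# a nested list with the same structure as the input
calls = []


def fake_analyze_words(words, disambiguate=False):
    calls.append(disambiguate)
    return [[f"<{token}>" for token in sentence] for sentence in words]


sentences = [['А'], ['Мый', 'тыйым', 'йӧратем', '.']]
result = analyze_whole_text(fake_analyze_words, sentences, disambiguate=True)
print(len(calls), "call(s) for", len(sentences), "sentences")
```

With the real analyzer, the call would be analyze_whole_text(a.analyze_words, sentences, disambiguate=True), so the grammar is compiled once for the whole text instead of once per sentence.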

Word lists

Alternatively, you can use a preprocessed word list. The wordlists directory contains a list of words from a 2.6-million-word Meadow Mari corpus (wordlist_main.csv), a list of analyzed tokens (wordlist_analyzed.txt; each line contains all possible analyses for one word in XML format), and a list of tokens the parser could not analyze (wordlist_unanalyzed.txt). The recall of the analyzer on the standard corpus texts is about 91%.
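
A figure of this kind can be computed from a frequency list and a set of unanalyzed word forms. The sketch below shows one possible token-based calculation. It is an assumption that the figure is counted by tokens rather than by distinct word forms, the function name token_coverage is hypothetical, and the data is kept in memory with toy values because the exact layout of the word list files is not described here.

```python
def token_coverage(freqs, unanalyzed):
    """Share of corpus tokens whose word form got at least one analysis.

    freqs: dict mapping word form -> corpus frequency
    unanalyzed: set of word forms the parser could not analyze
    """
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    missed = sum(count for word, count in freqs.items() if word in unanalyzed)
    return (total - missed) / total


# Toy data, not taken from the actual word lists
freqs = {"a": 6, "b": 3, "c": 1}
unanalyzed = {"c"}
coverage = token_coverage(freqs, unanalyzed)
print(f"{coverage:.0%}")
```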

Description format

The description is written in the uniparser-morph format and consists of a description of the inflection (paradigms.txt), a grammatical dictionary (mhr_lexemes_XXX.txt files), a list of rules that annotate combinations of lexemes and grammatical values with additional Russian translations (lex_rules.txt), and a short list of analyses that should be avoided (bad_analyses.txt). The dictionary contains descriptions of individual lexemes, each of which is accompanied by information about its stem, its part-of-speech tag and some other grammatical/borrowing information, its inflectional type (paradigm), and its Russian translation. See more about the format in the uniparser-morph documentation.

uniparser-grammar-meadow-mari's People

Contributors

timarkh

Forkers

ankan2013

uniparser-grammar-meadow-mari's Issues

"ден" or "дене"

In many cases the comitative "дене" sounds like the conjunction "ден" (and vice versa). Ideally, both variants should always be offered...

N-P.2SG-P.3SG

йӱк-ет-ше ‘your voice’ - voice-poss.2sg-poss.3sg

йӱк is in mhr_lexemes_N.txt, but this analysis is not offered

POSTP-P.3SG

пелен-же POSTP-P.3SG

Пелен is also in mhr_lexemes_unchangeable.txt, but it is not analyzed

STEM-man

тол-ман (stem-suffix)

(also found in the words ыштыман and пурман)

Filing this issue right away. Once Aigul says how she wants it glossed, I will add that here.

PRO-P.3SG

тый-же - you-poss.3sg

тый is in mhr_lexemes_N.txt, but this analysis is not offered

Words with "="

Words with "=" are, for some reason, still being parsed

STEM-INF-DAT

Тол-аш-лан
толаш is in mhr_lexemes_V.txt, but it is not analyzed
