eddieantonio / fst-lookup


Do lookups on an FST in Python!

License: MIT License

Python 80.57% Makefile 2.99% C 15.93% Shell 0.51%

fst lookup foma transducer

fst-lookup's Introduction

FST Lookup

Implements lookup for Foma finite state transducers.

Supports Python 3.5 and up.

Install

pip install fst-lookup

Usage

Import the library, and load an FST from a file:

Hint: Test this module by downloading the eat FST!

>>> from fst_lookup import FST
>>> fst = FST.from_file('eat.fomabin')

Assumed format of the FSTs

fst_lookup assumes that the lower label corresponds to the surface form, while the upper label corresponds to the lemma plus its linguistic tags and features. For example, your LEXC will look something like this (note what is on each side of the colon):

Multichar_Symbols +N +Sg +Pl
Lexicon Root
    cow+N+Sg:cow #;
    cow+N+Pl:cows #;
    goose+N+Sg:goose #;
    goose+N+Pl:geese #;
    sheep+N+Sg:sheep #;
    sheep+N+Pl:sheep #;

If your FST has labels on the opposite sides (that is, the upper label corresponds to the surface form and the lower label corresponds to the lemma and linguistic tags), then instantiate the FST by providing the labels="invert" keyword argument:

fst = FST.from_file('eat-inverted.fomabin', labels="invert")

Hint: FSTs originating from the HFST suite are often inverted, so try loading the FST inverted first if .generate() or .analyze() aren't working correctly!
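
The label convention can be made concrete with a toy model. The sketch below is not how fst_lookup works internally: it represents the transducer as a plain list of (upper, lower) string pairs taken from the LEXC example, and the names RELATION, analyze, generate, and invert are hypothetical helpers for illustration only.

```python
# Toy model of the label convention; NOT fst_lookup's implementation.
# Each pair is (upper, lower), mirroring "upper:lower" in the LEXC example.
RELATION = [
    ("cow+N+Sg", "cow"),
    ("cow+N+Pl", "cows"),
    ("goose+N+Sg", "goose"),
    ("goose+N+Pl", "geese"),
    ("sheep+N+Sg", "sheep"),
    ("sheep+N+Pl", "sheep"),
]


def analyze(relation, surface_form):
    """Lower side (surface form) -> upper side (lemma and tags)."""
    return [upper for upper, lower in relation if lower == surface_form]


def generate(relation, analysis):
    """Upper side (lemma and tags) -> lower side (surface form)."""
    return [lower for upper, lower in relation if upper == analysis]


def invert(relation):
    """Swap the two sides, as they would appear in an inverted FST."""
    return [(lower, upper) for upper, lower in relation]


print(analyze(RELATION, "sheep"))        # both readings of "sheep"
print(generate(RELATION, "goose+N+Pl"))

# With the sides swapped, the same calls run in the "wrong" direction.
inverted = invert(RELATION)
print(analyze(inverted, "geese"))        # finds nothing
print(analyze(inverted, "goose+N+Pl"))   # behaves like generate()
```

If analyze() only succeeds when given an analysis string, or generate() only returns an empty list, the toy model suggests the sides are swapped and labels="invert" is worth trying.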

Analyze a word form

To analyze a form (take a word form, and get its linguistic analyses), call the analyze() method:

def analyze(self, surface_form: str) -> Iterator[Analysis]

This will yield all possible linguistic analyses produced by the FST.

An analysis is a tuple of strings. The strings are either linguistic tags, or the lemma (base form of the word).

FST.analyze() is a generator, so you must call list() to get a list.

>>> list(sorted(fst.analyze('eats')))
[('eat', '+N', '+Mass'),
 ('eat', '+V', '+3P', '+Sg')]
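
Since an analysis mixes the lemma and the tags in one tuple, a small helper can separate them. This is a minimal sketch, assuming that tags always start with "+" as they do in the output above; split_analysis is a hypothetical name, not part of fst_lookup, and other FSTs may mark tags differently.

```python
from typing import List, Tuple


def split_analysis(analysis: Tuple[str, ...]) -> Tuple[str, List[str]]:
    """Separate the lemma from the tags in one analysis tuple.

    Assumption: tags start with "+", as in the README example.
    Parts that are not tags are joined together to form the lemma.
    """
    lemma_parts = [part for part in analysis if not part.startswith("+")]
    tags = [part for part in analysis if part.startswith("+")]
    return "".join(lemma_parts), tags


lemma, tags = split_analysis(("eat", "+V", "+3P", "+Sg"))
print(lemma)  # eat
print(tags)   # ['+V', '+3P', '+Sg']
```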

Generate a word form

To generate a form (take a linguistic analysis, and get its concrete word forms), call the generate() function:

def generate(self, analysis: str) -> Iterator[str]

FST.generate() is a Python generator, so you must call list() to get a list.

>>> list(fst.generate('eat+V+Past'))
['ate']
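
Because analyze() yields tuples while generate() takes a single string, feeding one into the other needs a small conversion. The sketch below assumes that concatenating the parts of a tuple gives a valid input for generate(), which matches the 'eat+V+Past' form shown above; to_generate_input is a hypothetical helper, and whether a given analysis generates anything still depends on the FST.

```python
from typing import Iterable


def to_generate_input(analysis: Iterable[str]) -> str:
    """Concatenate the parts of an analysis tuple into a single string."""
    return "".join(analysis)


print(to_generate_input(("eat", "+V", "+Past")))  # eat+V+Past
```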

Contributing

If you plan to contribute code, it is recommended you use Poetry. Fork and clone this repository, then install development dependencies by typing:

poetry install

Then, do all your development within a virtual environment, managed by Poetry:

poetry shell

Type-checking

This project uses mypy to check static types. To invoke it on this package, type the following:

mypy -p fst_lookup

Running tests

To run this project's tests, we use pytest:

poetry run pytest

C Extension

Building the C extension is handled in build.py.

To disable building the C extension, add the following line to .env:

export FST_LOOKUP_BUILD_EXT=False

(by default, this is True).

To enable debugging flags while working on the C extension, add the following line to .env:

export FST_LOOKUP_DEBUG=TRUE

(by default, this is False).

Fixtures

If you are creating or modifying existing test fixtures (i.e., mostly pre-built FSTs used for testing), you will need the following dependencies:

GNU make
Foma

Fixtures are stored in tests/data/. Here, you will use make to compile all pre-built FSTs from source:

make

License

Licensed under the MIT license.


fst-lookup's Issues

multichar symbols being split up

For some reason, some multichar symbols are split up in the analysis. These are the results I have received:

{('asotamawêw', '+V', '+TA', '+Ind', '+Prs', '+3Sg', '+12Pl', 'O'), ('IV', '+Num', '+Rom', '+'), 
('asênêw', '+V', '+TA', '+Ind', '+Prs', '+3Sg', '+12Pl', 'O'), 
('wanisimêw', '+V', '+TA', '+Ind', '+Prs', '+3Sg', '+12Pl', 'O'), 
('î', '+Ipc'), ('awa', '+Pron', '+Dem', '+Dist', '+A', 'N', '+Pl'), 
('wanisimêw', '+V', '+TA', '+Ind', '+Fut', '+Int', '+3Sg', '+12Pl', 'O'), 
('PV/tita+', 'kitimâkinawêw', '+V', '+TA', '+Cnj', '+Prs', '+3Pl', '+12Pl', 'O'),
 ('kitimâkêyimêw', '+V', '+TA', '+Fut', '+Cond', '+3Sg', '+12Pl', 'O'), 
('kitimâkinawêw', '+V', '+TA', '+Fut', '+Cond', '+4Sg/Pl', '+12Pl', 'O'), 
('wîhtamawêw', '+V', '+TA', '+Ind', '+Prs', '+3Pl', '+12Pl', 'O'), 
('PV/ta+', 'pâhpihêw', '+V', '+TA', '+Cnj', '+Prs', '+3Pl', '+12Pl', 'O'),
 ('nisitohtawêw', '+V', '+TA', '+Ind', '+Prs', '+3Sg', '+12Pl', 'O'), 
('itêw', '+V', '+TA', '+Ind', '+Prs', '+X', '+12Pl', 'O'), ('IX', '+Num', '+Rom', '+'), 
('X', '+Num', '+Rom', '+'), 
('ôma', '+Pron', '+Dem', '+Med', '+I', 'N', '+Pl'), 
('awa', '+Pron', '+Dem', '+Dist', '+A', 'N', '+Sg'), 
('XI', '+Num', '+Rom', '+'), 
('wîcihêw', '+V', '+TA', '+Ind', '+Fut', '+Int', '+X', '+12Pl', 'O'), 
('ôma', '+Pron', '+Dem', '+Dist', '+I', 'N', '+Pl'), 
('PV/e+', 'kakwêcimêw', '+V', '+TA', '+Cnj', '+Prs', '+3Pl', '+12Pl', 'O'), 
('V', '+Num', '+Rom', '+'), 
('itêyimêw', '+V', '+TA', '+Fut', '+Cond', '+3Sg', '+12Pl', 'O'), 
('VIII', '+Num', '+Rom', '+'), 
('â', '+Ipc'), 
('mêscihêw', '+V', '+TA', '+Ind', '+Prs', '+3Pl', '+12Pl', 'O'), 
('VI', '+Num', '+Rom', '+'), 
('XIV', '+Num', '+Rom', '+'), 
('ispayihêw', '+V', '+TA', '+Ind', '+Prs', '+3Sg', '+12Pl', 'O'), 
('asawâpamêw', '+V', '+TA', '+Ind', '+Prs', '+3Sg', '+12Pl', 'O'), 
('nakatêw', '+V', '+TA', '+Ind', '+Fut', '+Int', '+3Sg', '+12Pl', 'O'), 
('awiyak', '+Pron', '+Ind', 'ef', '+A', 'N', '+Pl'), 
('XIII', '+Num', '+Rom', '+'), 
('asotamawêw', '+V', '+TA', '+Ind', '+Prt', '+3Sg', '+12Pl', 'O'), 
('VII', '+Num', '+Rom', '+'), 
('sâkôcihêw', '+V', '+TA', '+Fut', '+Cond', '+3Sg', '+12Pl', 'O'), 
('awa', '+Pron', '+Dem', '+Med', '+A', 'N', '+Pl'), 
('XII', '+Num', '+Rom', '+')}

These can be broken down into 3 main types:

Roman numerals with an empty + analysis
12PlO being split into 12Pl and O
Demonstrative sequences IN and AN being split into single characters
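
The three types above can be spotted mechanically. The following is a rough diagnostic sketch, not part of fst_lookup: suspicious_parts is a hypothetical helper, and it relies on the assumption that once the first "+Tag" has appeared, every later part should also be a tag of the form "+Something".

```python
from typing import List, Tuple


def suspicious_parts(analysis: Tuple[str, ...]) -> List[str]:
    """Return parts that look like fragments of a split multichar symbol.

    Heuristic (an assumption): after the first "+Tag", every later part
    should be a "+" followed by at least one more character.
    """
    seen_tag = False
    flagged = []
    for part in analysis:
        is_tag = part.startswith("+") and len(part) > 1
        if seen_tag and not is_tag:
            flagged.append(part)
        seen_tag = seen_tag or is_tag
    return flagged


# Three analyses copied from the results above, one per type.
print(suspicious_parts(("IV", "+Num", "+Rom", "+")))
print(suspicious_parts(("itêw", "+V", "+TA", "+Ind", "+Prs", "+X", "+12Pl", "O")))
print(suspicious_parts(("awa", "+Pron", "+Dem", "+Dist", "+A", "N", "+Pl")))
```

Prefix parts such as 'PV/tita+' come before the first tag, so this heuristic leaves them alone.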

This renders your package literally unplayable 4/10

Changing how non-analyses work

Could you make it so that non-analyzed strings return a ? instead of an empty list?
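
Until the library changes, the requested behaviour can be approximated with a wrapper. This is a sketch under assumptions: analyze_or_unknown is a hypothetical name, the placeholder is returned as a one-element tuple so the result keeps the same shape as a normal analysis, and fake_analyze merely stands in for fst.analyze.

```python
from typing import Callable, Iterable, List, Tuple

Analysis = Tuple[str, ...]


def analyze_or_unknown(
    analyze: Callable[[str], Iterable[Analysis]],
    surface_form: str,
    unknown: str = "?",
) -> List[Analysis]:
    """Return all analyses, or one placeholder analysis if there are none."""
    analyses = list(analyze(surface_form))
    if analyses:
        return analyses
    return [(unknown,)]


def fake_analyze(word: str) -> Iterable[Analysis]:
    """Stand-in for fst.analyze that knows a single word."""
    known = {"eats": [("eat", "+V", "+3P", "+Sg")]}
    return iter(known.get(word, []))


print(analyze_or_unknown(fake_analyze, "eats"))
print(analyze_or_unknown(fake_analyze, "xyzzy"))
```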

Love,

Atticus

.generate() does not work on crk-normative-generator.fomabin

I'm using the latest cree fomabin

fst = FST('crk-normative-generator.fomabin')
list(fst.generate('nîskâw+V+II+Cnj+Prs+3Sg'))

>>> []

fst = FST('crk-normative-generator.fomabin')
list(fst.analyze('nîskâw+V+II+Cnj+Prs+3Sg'))

>>> [('nîskâk',)]

I'm confused:
generate should work, analyze shouldn't, but it's the other way around here.

Also a small inconsistency: readme.md mentions word generation returns type Generator[str], while right here it gives Generator[Tuple[str]]

Feature request: random_upper() and random_lower()

FST.random_upper() — transduces randomly, outputting the upper side.
FST.random_lower() — transduces randomly, outputting the lower side.
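
One way such methods could work is a random walk over the arcs of the transducer. The sketch below uses a toy dictionary-based transducer rather than fst_lookup's own data structures; Arc, random_path, random_upper, random_lower, and the toy network are all hypothetical and only illustrate the idea.

```python
import random
from typing import Dict, List, Optional, Set, Tuple

# (upper symbol, lower symbol, next state); "" stands for epsilon.
Arc = Tuple[str, str, int]


def random_path(arcs: Dict[int, List[Arc]], accepting: Set[int],
                rng: random.Random, start: int = 0) -> List[Arc]:
    """Take a random walk from the start state to an accepting state.

    Assumes every state can reach an accepting state; otherwise the walk
    may fail or never end.
    """
    state = start
    path: List[Arc] = []
    while True:
        choices: List[Optional[Arc]] = [*arcs.get(state, [])]
        if state in accepting:
            choices.append(None)  # None means "stop here"
        arc = rng.choice(choices)
        if arc is None:
            return path
        path.append(arc)
        state = arc[2]


def random_upper(arcs, accepting, rng):
    """Transduce randomly, outputting the upper side."""
    return "".join(upper for upper, _, _ in random_path(arcs, accepting, rng))


def random_lower(arcs, accepting, rng):
    """Transduce randomly, outputting the lower side."""
    return "".join(lower for _, lower, _ in random_path(arcs, accepting, rng))


# Toy network for cow+N+Sg:cow and cow+N+Pl:cows.
ARCS = {
    0: [("c", "c", 1)],
    1: [("o", "o", 2)],
    2: [("w", "w", 3)],
    3: [("+N", "", 4)],
    4: [("+Sg", "", 5), ("+Pl", "s", 5)],
}
ACCEPTING = {5}

rng = random.Random(0)
print(random_upper(ARCS, ACCEPTING, rng))
print(random_lower(ARCS, ACCEPTING, rng))
```

Choosing uniformly among arcs does not sample uniformly among strings; that would need path counts or weights, which this sketch leaves out.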

Lookup+alignments

It would be great if we could get alignments out of the FST while doing lookup.

For example, given the Senchothen word "SȻÁĆEL", the IPA would be something like "ʃkʷet͡ʃəl". Our speech-text aligner is working from IPA, so we have time alignments to the IPA characters, but we need to project this back to the orthographic word for presentation to the reader/listener. (E.g., we need to know that a timespan annotation over the kʷ corresponds to the Ȼ.)

In the general case it won't necessarily be possible to align every letter to every letter, because (e.g.) sometimes there will be rewrite rules that rewrite these two letters to these three letters and there's no recoverable fact-of-the-matter which correspond to which. But that's fine, so long as we can reconstruct what is recoverable, it'll be a big help.
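
If lookup exposed the sequence of (upper, lower) symbol pairs along the matched path, the recoverable alignment would follow from running offsets. The sketch below assumes such a path is available, which fst_lookup does not currently provide; align is a hypothetical helper and the example pairs are illustrative, not taken from a real FST.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open (start, end) offsets into a string


def align(path: List[Tuple[str, str]]) -> List[Tuple[Span, Span]]:
    """Turn a path of (upper, lower) symbol pairs into aligned spans.

    Offsets count Python characters (code points). An empty string on
    either side yields an empty span at the current position.
    """
    alignment = []
    upper_pos = lower_pos = 0
    for upper, lower in path:
        upper_span = (upper_pos, upper_pos + len(upper))
        lower_span = (lower_pos, lower_pos + len(lower))
        alignment.append((upper_span, lower_span))
        upper_pos, lower_pos = upper_span[1], lower_span[1]
    return alignment


# Illustrative pairs only; the real per-arc symbols depend on the FST.
path = [("S", "ʃ"), ("Ȼ", "kʷ")]
for (u_start, u_end), (l_start, l_end) in align(path):
    print("SȻ"[u_start:u_end], "->", "ʃkʷ"[l_start:l_end])
```

Rewrite rules that map several letters to several letters at once would still surface as a single coarse span pair, which matches the limitation described above.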

Remove chainmap

Salvage part of #3 by reimplementing removing the chain map — it should be unnecessary.

Improper concatenation of lemma in .analyze()

With the Plains Cree analyzer

Analyzing: pimitâskosin

Returns:

[('pimitâskosi', 'n', '+V', '+AI', '+Ind', '+Prs', '+3Sg')]

It should return:

[('pimitâskosin', '+V', '+AI', '+Ind', '+Prs', '+3Sg')]
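
The expected grouping can be described as: join every run of ordinary symbols into one string, and keep each multichar symbol as its own element. The sketch below illustrates that rule only; it is not fst_lookup's implementation and says nothing about the actual cause of the bug, and group_symbols and its multichar argument are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def group_symbols(symbols: Iterable[str], multichar: Set[str]) -> Tuple[str, ...]:
    """Join runs of ordinary symbols; keep multichar symbols separate."""
    parts: List[str] = []
    run: List[str] = []  # ordinary symbols seen since the last multichar symbol
    for symbol in symbols:
        if symbol in multichar:
            if run:
                parts.append("".join(run))
                run = []
            parts.append(symbol)
        else:
            run.append(symbol)
    if run:
        parts.append("".join(run))
    return tuple(parts)


TAGS = {"+V", "+AI", "+Ind", "+Prs", "+3Sg"}
symbols = list("pimitâskosin") + ["+V", "+AI", "+Ind", "+Prs", "+3Sg"]
print(group_symbols(symbols, TAGS))
```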

Analyze does not give same results as 'up' in foma.

I have a complicated agglutinative language with vowel elision and other alternations. For word roots and simple affixes, fst-lookup seems to work, but for most others it returns an empty list, whereas foma returns an analysis. Examples from foma:

apply up> grukchi
N gruko.NROOT.1.NALN.KN-chi.UNPSSSD

apply up> grukoya
N gruko.NROOT.1.NALN.KN-ya.{ESS/ALL/ABL/PRP}
N g.PSS2P-gruko.NROOT.1.NALN.KN-ya.{ESS/ALL/ABL/PRP}

fst_lookup gets the first case (grukchi), but returns an empty list for the second case (grukoya) and for many more complicated cases.