kawu / concraft-pl

A morphosyntactic tagger for Polish based on conditional random fields

Home Page: http://zil.ipipan.waw.pl/Concraft

License: BSD 2-Clause "Simplified" License

Shell 0.94% Haskell 94.70% Python 4.03% Dhall 0.33%

concraft-pl's Issues

Improve format-related error messages

Concraft should inform the user about the potential errors in the formatting of the input file, and remind the user how many columns it requires.
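A minimal sketch of what such a check could look like, written in Python for brevity rather than in Concraft's own Haskell. It assumes tab-separated columns and blank lines between graphs; the function name `check_columns` and the `expected` parameter are hypothetical, since the issue does not state the required column count.

```python
def check_columns(lines, expected):
    """Return human-readable error messages for rows with a wrong column count.

    `expected` is supplied by the caller; the issue does not name the number.
    """
    errors = []
    for number, line in enumerate(lines, start=1):
        row = line.rstrip("\n")
        if not row.strip():
            continue  # assumption: blank lines only separate graphs
        found = len(row.split("\t"))
        if found != expected:
            errors.append(
                f"line {number}: expected {expected} tab-separated columns, found {found}"
            )
    return errors
```

Reporting the line number together with the expected and actual counts addresses both halves of the request: it points at the formatting error and reminds the user how many columns are required.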

Show the guessed interpretations for the unknown words

Output should contain not only the chosen MSD interpretation of unknown words, but also the guessed interpretations.

Server mode: tagging analysed data

Concraft is putting EOS tags in inappropriate places

Especially in places where segmentation-level ambiguities are involved.

Probable cause:

Concraft tries to assign EOS markers to segmentation-ambiguous edges.
Some of these edges are not chosen, and their total probability mass is close to 0.
Concraft nevertheless considers EOS markers for such highly improbable edges, and some of them get marked with EOS. Because Concraft relies on probabilities when deciding whether to mark a given edge with EOS, it is not surprising that this fails for edges with total probability close to 0.
Improbable or not, if an edge is marked with EOS, Concraft uses it to cut the sentence DAG in two.

Add `reana` mode

Perhaps reanalysis should be made available as a separate concraft-pl mode. It would make it possible to reanalyze a corpus on one machine (with Maca available) and then run the training on a different machine.

Check that input graphs are DAGs

Concraft does not check if the input graphs are actually DAGs.
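One way such a check could work is a topological sort (Kahn's algorithm): if not every node can be removed in topological order, the graph contains a cycle. The sketch below is in Python and assumes the graph is given as (source, target) node pairs, as in the first two columns of the DAG listings elsewhere on this page; `is_dag` is a hypothetical name, not part of Concraft.

```python
from collections import defaultdict, deque


def is_dag(edges):
    """Return True if the directed graph given as (source, target) pairs is acyclic."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1
        nodes.update((source, target))

    # Start from nodes without incoming edges and peel the graph layer by layer.
    queue = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Nodes on a cycle never reach indegree 0, so they are never visited.
    return visited == len(nodes)
```

The check runs in time linear in the number of edges, so it could be applied to every input graph before tagging.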

Model pre-trained on the National Corpus of Polish and Morfeusz version

The National Corpus of Polish is not compatible with the "current version of Morfeusz SGJP (i.e., the version from September 1st 2018 or newer)". For example, NCP uses the tag qub, whereas Morfeusz uses the tag part. So how was the model trained?

Remove column with meta-information

Input DAG format

Is there a tool for converting Morfeusz output to the input DAG format? In other words: how can the pre-trained Concraft-pl 2.0 model be used with plain text?

BTW example files do not work:

$ concraft-pl tag DasModel-2019-10-08.gz -i example/test.dag -o output.dag
concraft-pl: parseRule: input too long in tag ppron3:pl:acc:f:ter:neut:praep
CallStack (from HasCallStack):
  error, called at ./Data/Tagset/Positional.hs:125:22 in tagset-positional-0.3.1-LwkfvYfoWWCIFQIVumc6gj:Data.Tagset.Positional
$ concraft-pl tag DasModel-2019-10-08.gz -i example/train.dag -o output.dag
concraft-pl: parseRule: no value for acm attribute in tag num:pl:acc:m2:ncol
CallStack (from HasCallStack):
  error, called at ./Data/Tagset/Positional.hs:118:27 in tagset-positional-0.3.1-LwkfvYfoWWCIFQIVumc6gj:Data.Tagset.Positional

Training option: improve existing model

Perhaps it should be possible to train an existing model on a new corpus?

It looks like the layered CRF already allows retraining the model. Similar functionality could probably be implemented (in a similar way) for the regular, first-order constrained CRF.

Thread blocked indefinitely in an MVar operation

On GHC 7.4.1 you may get the following error when running Concraft in the multi-threaded mode:

ThreadId X: thread blocked indefinitely in an MVar operation

It is probably related to a bug in the GHC compiler: http://ghc.haskell.org/trac/ghc/ticket/5943.

README: how to install Concraft-pl globally?

Currently, the README does not mention that Concraft-pl is installed under ~/.cabal/bin by default.

Add `trim` (or `prune`) as a separate concraft-pl mode

It will allow the user to trim the model without the need to perform the training.

It also makes sense to implement visualization of the model parameters at the level of concraft-pl. This has already been done within the concraft disambiguation library and should probably be moved here.

Concraft should not randomly choose the lemma

More precisely, Concraft should not randomly choose the lemma when several interpretations with the same (disambiguated) tag exist.

Old version (which handled the issue correctly):
2 3 M M brev:pun 1.000 disamb
2 3 M mianownik brev:pun 1.000 disamb
2 3 M miasto brev:pun 1.000 disamb
2 3 M morze brev:pun 1.000 disamb
2 3 M męski brev:pun 1.000 disamb

DAG-based version:
2 3 M M brev:pun 1.0000 disamb
2 3 M mianownik brev:pun 1.0000
2 3 M miasto brev:pun 1.0000
2 3 M morze brev:pun 1.0000
2 3 M męski brev:pun 1.0000
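
The behaviour of the old version can be sketched as follows: once a tag has been chosen for an edge, every interpretation carrying that tag is marked, whatever its lemma. This is an illustrative Python sketch, not Concraft's code; the row layout (start, end, orth, lemma, tag) and the name `mark_disamb` are assumptions.

```python
def mark_disamb(rows, chosen_tag):
    """Append a 'disamb' marker to every row whose tag equals the chosen tag.

    Rows are (start, end, orth, lemma, tag) tuples; the lemma plays no role
    in the decision, so no lemma is picked at random.
    """
    return [row + ("disamb",) if row[4] == chosen_tag else row for row in rows]


rows = [
    (2, 3, "M", "M", "brev:pun"),
    (2, 3, "M", "mianownik", "brev:pun"),
    (2, 3, "M", "miasto", "brev:pun"),
]
marked = mark_disamb(rows, "brev:pun")
```

Here all three rows end up marked, which matches the output of the old version shown above.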

Add option to tag sentence using the guessing model only

It would make it easier to perform external validation of the guesser, for example. In addition, some users may want to use just the guesser in their applications.

Morfeusz2 - Concraft version compatibility pairing

Hello,

thank you for contributing to this tool :) Is it possible to get a list of pairs of compatible Concraft and Morfeusz2 versions? I still have trouble matching the Concraft model version, the Concraft version and the Morfeusz2 version.

With one combination I got a "10 columns instead of 11" error, so I checked out an earlier version of Concraft, but now I am getting a "too long input on nwok" error (the error messages are not quoted literally). I would try further, but rebuilding Concraft is a long process, and I think information about compatibility would be very useful. Maybe consider repo tags and a note in the README? Or did I miss some information?

Cheers
Tom

Improve API

Every important thing should be imported from the main module. Right now, there are some core types which have to be imported from submodules.

At the same time, there are overlapping names in different modules (e.g. tag).

Add data type which includes both the maca pool and the model.

Concraft hangs on input with non-printable characters

Apparently Maca ignores (i.e. does not include in its output) some special characters (e.g. '\x200b'). Since many of these characters are not spaces (notably non-printable characters), the current algorithm for reading Maca output expects more characters than it receives, and the process freezes.
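
A possible workaround, sketched in Python, is to drop such characters from the text before it is handed to the analyser, so that both sides see the same character sequence. This is an assumption rather than the project's fix: the sketch only removes Unicode format characters (general category Cf, which includes U+200B), and Maca may skip other characters as well. The name `strip_format_chars` is hypothetical.

```python
import unicodedata


def strip_format_chars(text):
    """Remove Unicode format characters (category Cf), e.g. U+200B ZERO WIDTH SPACE."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
```

Ordinary Polish letters and whitespace are left untouched, so the cleaned text can still be aligned with the analyser output.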

Pre-trained models

Hello,

first of all, great job with Concraft v. 2.0! :-)

Are there by any chance any pre-trained models available? Or, alternatively, is there an easy way to generate them using NKJP?

Thanks in advance for help!

Update the example to make it compatible with the pre-trained model

Pre-trained model missing

Pre-trained model (mentioned in the README) missing.
Url: http://mozart.ipipan.waw.pl/~wkieras/DasModel-2019-10-08.gz
Response: 404 Not Found

Automatically download the default model

A default model for the Polish language could be downloaded automatically.

.plain input format

New version doesn't allow for input already preprocessed with MACA.

It would be reasonable to have this option, as the user may not have MACA or Morfeusz installed, or may use a different version of Morfeusz.

Tagger crash

Ubuntu 12.04.1 LTS
Concraft-pl 0.2.1

sudo apt-get install haskell-platform
cabal update
cabal install concraft-pl
wget "http://zil.ipipan.waw.pl/Concraft?action=AttachFile&do=get&target=nkjp-model-0.2.gz" -O nkjp-model-0.2.gz
concraft-pl tag nkjp-model-0.2.gz < polish.txt > output

Output

concraft-pl: fd:9: hClose: resource vanished (Broken pipe)
concraft-pl: thread blocked indefinitely in an MVar operation

polish.txt

Any idea what's wrong?

New model: no distinction between certain tags

Tagging extract (with -p maxprobs option):

4       5       kraśniejsze     krasny  adj:pl:nom:f:com                        1.0000                  disamb
4       5       kraśniejsze     krasny  adj:pl:nom:m2:com                       1.0000
4       5       kraśniejsze     krasny  adj:pl:nom:m3:com                       1.0000
4       5       kraśniejsze     krasny  adj:pl:nom:n:com                        1.0000

These tags are different and should therefore not all receive the maximum probability of 1.0.

DAG: do not differentiate between the sentence type of the tool (-pl) and that of the underlying concraft library

At the moment, almost the same type is used to represent sentences in concraft-pl and in the underlying concraft library. The only difference is that there are spaces assigned to DAG nodes in concraft-pl. Perhaps this could be avoided.

Incorrect paragraph segmentation

An example where the paragraph-level segmentation doesn't work as expected:

... od odtwarzaczy MP3 i zabawek po roboty przemysłowe.

 Komputer od typowego kalkulatora ...

The problem seems to be in the middle, empty line, which contains some whitespace characters.
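
A minimal Python sketch of a splitter that treats whitespace-only lines as paragraph breaks, which is the behaviour the example above seems to call for. The function name `split_paragraphs` is hypothetical and the code is not taken from Concraft.

```python
def split_paragraphs(text):
    """Split text into paragraphs on lines that are empty or whitespace-only."""
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank or whitespace-only line closes the current paragraph.
            paragraphs.append("\n".join(current))
            current = []
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs
```

Testing each line with `strip()` instead of comparing it to the empty string is what makes a line holding only spaces or tabs count as a separator.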

Add model analysis mode

Move the model analysis functionality here from the core library.

Improve sentence text recovery

In order to perform reanalysis of the input data, an original textual representation of each sentence in the input is determined. This recovery procedure should be more precise.

Tag with probabilities

Add an option which will make concraft-pl annotate input tags with probabilities.

As a result, output will be presented in an extended version of the plain format (incompatible with Corpus2 tools).

Idea: limit possible tags based on shape

It may be a good idea to limit possible interpretations of OOV words on the basis of the shape. For example, given word x with shape sh(x), limit the set of x's possible interpretations to a set of tags assigned to words in the training corpus with shape sh(x).
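
The idea can be sketched in Python as follows. The shape function used here (collapsing runs of upper-case, lower-case, digit and other characters) is only one possible definition of sh(x) and is an assumption, as are the names `shape`, `tags_by_shape` and `limit_tags`; the fallback to the unrestricted set when the shape is unseen or the intersection is empty is also a design choice of this sketch.

```python
from collections import defaultdict


def shape(word):
    """Coarse word shape: 'u' upper, 'l' lower, 'd' digit, 'x' other; runs collapsed."""
    out = []
    for ch in word:
        if ch.isupper():
            code = "u"
        elif ch.islower():
            code = "l"
        elif ch.isdigit():
            code = "d"
        else:
            code = "x"
        if not out or out[-1] != code:
            out.append(code)
    return "".join(out)


def tags_by_shape(corpus):
    """Collect, for each shape, the tags seen with it in (word, tag) training pairs."""
    table = defaultdict(set)
    for word, tag in corpus:
        table[shape(word)].add(tag)
    return table


def limit_tags(word, candidates, table):
    """Restrict the candidate tags of an OOV word to those seen with its shape."""
    allowed = table.get(shape(word), set())
    restricted = set(candidates) & allowed
    return restricted or set(candidates)  # fall back if nothing survives
```

With this definition, a capitalised unknown word is only offered tags that capitalised words received in the training corpus.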

Add column with segment-related values

Such values (e.g., information about the preceding space) should be then rewritten from input to output.

Explain how to convert NKJP to the plain format

Currently, there is no information on the subject in the README file.

Rewrite orthographical forms as lemmas in case of unknown words

Consider using websockets library for client/server communication

In particular, if websockets simplify things, you could also try to make connections between a client and a server a little safer (e.g. by checking whether the client is compatible with the server).

Slight tagging quality regression

It looks like the current version of the tagger gives slightly lower cross-validation results (as reported by Adam) than those reported in the COLING paper.

This is probably because the original cross-validation was performed with a slightly different observation schema, in which 4 consecutive orthographic forms were taken into account, while version 0.4.1 (and probably 0.3.X and 0.2.X as well) uses only 3 consecutive orthographic forms (for the previous, the current and the next word).

It should be noted, though, that the reported regression in cross-validation results is 91.1187 vs 91.0377 (a difference of about 0.08), which is not very significant, and that employing the fourth orthographic form in the observation schema would significantly increase the number of model features. On the other hand, it may be better to use a larger observation schema during training and then use the pruning method to get rid of useless features.

Observation schema configuration

It should be possible to supply Concraft with configuration files which would describe the observation schema, for example. Right now, it is not possible to change the schema without changing the code.

Info about the influence of the Maca and Morfeusz tools

It should be mentioned in the README (and also on the homepage) that the same versions of Maca and Morfeusz should be used for training and tagging.

Performance regression with Maca in the background

It seems that Maca needs a considerable amount of time to analyze big paragraphs, which makes Concraft choke (at least that's what it looks like). Somehow Concraft with the --noana option is much faster than the one which uses Maca in the background.

Use guessing estimations as a priori estimations for the disambiguation model

It may decrease the number of iterations needed to find proper disambiguation parameters. It may also render more disambiguation parameters superfluous, which would be a good thing, because it would make model trimming more effective.

nkjp-tagset.cfg file

Suggested by Adam:

Regarding the suggested invocation concraft-pl train config/nkjp-tagset.cfg train.plain:

It may be worth adding to the instructions that the .cfg file may have to be fetched manually from the repo (when installing via cabal) or, even better, adding it to some shared location so that the tagger can find it on its own.

Add "Maca not present" error message

The current error message:

concraft-pl: fd:9: hClose: resource vanished (Broken pipe)
concraft-pl: thread blocked indefinitely in an MVar operation

is not very informative and may confuse Concraft users (see #4).