concraft-pl's People

Contributors

kawu, tindzk

Forkers

tindzk krzynio

concraft-pl's Issues

Improve format-related error messages

Concraft should inform the user about the potential errors in the formatting of the input file, and remind the user how many columns it requires.
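A check of this kind could look roughly like the Python sketch below. It is only an illustration: it assumes tab-separated input in which blank lines act as separators, and both `check_columns` and the `expected` parameter are hypothetical names, not part of Concraft.

```python
def check_columns(lines, expected):
    """Return one message per line whose tab-separated column count differs from `expected`."""
    errors = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        if not stripped.strip():
            # Assumption: blank lines are separators, not data rows.
            continue
        found = len(stripped.split("\t"))
        if found != expected:
            errors.append(
                f"line {number}: expected {expected} tab-separated columns, found {found}"
            )
    return errors
```

Reporting the line number together with the expected and actual column counts tells the user both where the problem is and what the format requires.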

Concraft is putting EOS tags in inappropriate places

Especially in places where segmentation-level ambiguities are involved.

Probable cause:

  • Concraft tries to assign EOS markers to segmentation-ambiguous edges.
  • Some of these edges are not chosen, and their total probability mass is around 0.
  • Concraft nevertheless considers EOS markers for such highly improbable edges, and some of them get marked with EOS. Concraft relies on probabilities when deciding whether to mark a given edge with EOS, so it is not surprising that this does not work for edges with a total probability around 0.
  • Improbable or not, if an edge is marked with EOS, Concraft uses it to cut the sentence DAG in two.
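One possible remedy, sketched in Python under stated assumptions: ignore EOS marks on edges whose total probability mass falls below a threshold. The edge representation (dicts with `prob` and `eos` keys), the function name `filter_eos` and the `min_mass` value are all hypothetical and do not come from Concraft.

```python
def filter_eos(edges, min_mass=0.01):
    """Return copies of `edges` with EOS cleared where the edge's probability mass is below `min_mass`.

    Each edge is assumed to be a dict with a 'prob' float (total probability
    mass of the edge) and an 'eos' bool (whether the edge was marked as EOS).
    """
    return [
        {**edge, "eos": edge["eos"] and edge["prob"] >= min_mass}
        for edge in edges
    ]
```

With such a filter in place, only edges that are plausibly part of the chosen path could cut the sentence DAG.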

Add `reana` mode

Perhaps reanalysis should be made available as a separate concraft-pl mode. This would make it possible to reanalyze a corpus on one machine (with Maca available) and then run the training on a different machine.

Input DAG format

Is there a tool that converts Morfeusz output to the input DAG format? In other words: how can a pre-trained Concraft-pl 2.0 model be used with plain text?

BTW, the example files do not work:

$ concraft-pl tag DasModel-2019-10-08.gz -i example/test.dag -o output.dag
concraft-pl: parseRule: input too long in tag ppron3:pl:acc:f:ter:neut:praep
CallStack (from HasCallStack):
  error, called at ./Data/Tagset/Positional.hs:125:22 in tagset-positional-0.3.1-LwkfvYfoWWCIFQIVumc6gj:Data.Tagset.Positional
$ concraft-pl tag DasModel-2019-10-08.gz -i example/train.dag -o output.dag
concraft-pl: parseRule: no value for acm attribute in tag num:pl:acc:m2:ncol
CallStack (from HasCallStack):
  error, called at ./Data/Tagset/Positional.hs:118:27 in tagset-positional-0.3.1-LwkfvYfoWWCIFQIVumc6gj:Data.Tagset.Positional

Training option: improve existing model

Perhaps it should be possible to train an existing model on a new corpus?

It looks like the layered CRF already allows retraining the model. Similar functionality could probably be implemented (in a similar way) for the regular, first-order constrained CRF.

Add `trim` (or `prune`) as a separate concraft-pl mode

This would allow the user to trim the model without having to run the training.

It also makes sense to implement visualization of the model parameters at the level of concraft-pl. This has already been done within the concraft disambiguation library and should probably be moved here.

Concraft should not randomly choose the lemma

More precisely, Concraft should not randomly choose the lemma when several interpretations with the same (disambiguated) tag exist.

Old version (which handled the issue correctly):
2 3 M M brev:pun 1.000 disamb
2 3 M mianownik brev:pun 1.000 disamb
2 3 M miasto brev:pun 1.000 disamb
2 3 M morze brev:pun 1.000 disamb
2 3 M męski brev:pun 1.000 disamb

DAG-based version:
2 3 M M brev:pun 1.0000 disamb
2 3 M mianownik brev:pun 1.0000
2 3 M miasto brev:pun 1.0000
2 3 M morze brev:pun 1.0000
2 3 M męski brev:pun 1.0000
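
One way to avoid an arbitrary lemma choice is to mark every interpretation that carries the winning tag, as the old output above does. A minimal Python sketch, assuming each interpretation of an edge is a (lemma, tag, probability) triple; `mark_disamb` and its tie-break rule are hypothetical and not taken from Concraft's code.

```python
def mark_disamb(interps):
    """Append a disamb flag to each (lemma, tag, prob) triple of a single edge.

    All interpretations sharing the best tag are flagged, so no lemma is
    picked at random.
    """
    best_prob = max(prob for _, _, prob in interps)
    # Tie-break on the tag string so the result does not depend on input order.
    best_tag = min(tag for _, tag, prob in interps if prob == best_prob)
    return [(lemma, tag, prob, tag == best_tag) for lemma, tag, prob in interps]
```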

Morfeusz2 - Concraft version compatibility pairing

Hello,

Thank you for contributing to this tool :) Is it possible to get a list of compatible version pairs of Concraft and Morfeusz2? I still have problems matching the Concraft model version, the Concraft version and the Morfeusz2 version.

With one combination I got a "10 columns instead of 11" error, so I checked out an earlier version of Concraft, but now I am getting a "too long input on nwok" error (the error messages are not literal). I would try further, but rebuilding Concraft is a long process, and I think information about compatibility could be very useful. Maybe consider repo tags and a note in the README? Or did I miss some information?

Cheers
Tom

Improve API

Every important thing should be imported from the main module. Right now, there are some core types which have to be imported from submodules.

At the same time, there are overlapping names in different modules (e.g. tag).

Add a data type that includes both the Maca pool and the model.

Concraft hangs on input with non-printable characters

Apparently Maca ignores (i.e. does not include in its output) some special characters (e.g. '\x200b'). Since many of these characters are not spaces (notably non-printable characters), the current algorithm for reading Maca output expects more characters than it receives, and the process freezes.
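
A possible workaround is to strip such characters from the text before it is sent to Maca and before the output is aligned with it. The Python sketch below assumes that the problematic characters fall into the Unicode categories Cc (control) and Cf (format, which includes U+200B); which characters Maca actually drops is not specified here, and `strip_invisible` is a hypothetical helper.

```python
import unicodedata


def strip_invisible(text):
    """Drop control (Cc) and format (Cf) characters while keeping whitespace such as newlines and tabs."""
    return "".join(
        ch
        for ch in text
        if ch.isspace() or unicodedata.category(ch) not in ("Cc", "Cf")
    )
```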

Pre-trained models

Hello,

first of all, great job with Concraft v. 2.0! :-)

Are there by any chance any pre-trained models available? Or, alternatively, is there an easy way to generate them using NKJP?

Thanks in advance for help!

.plain input format

The new version doesn't allow input that has already been preprocessed with MACA.

It would be reasonable to have this option, as the user may not have MACA or Morfeusz installed, or may use a different version of Morfeusz.

Tagger crash

Ubuntu 12.04.1 LTS
Concraft-pl 0.2.1

sudo apt-get install haskell-platform
cabal update
cabal install concraft-pl
wget "http://zil.ipipan.waw.pl/Concraft?action=AttachFile&do=get&target=nkjp-model-0.2.gz" -O nkjp-model-0.2.gz
concraft-pl tag nkjp-model-0.2.gz < polish.txt > output

Output

concraft-pl: fd:9: hClose: resource vanished (Broken pipe)
concraft-pl: thread blocked indefinitely in an MVar operation

polish.txt

Any idea what's wrong?

New model: no distinction between certain tags

Tagging extract (with -p maxprobs option):

4       5       kraśniejsze     krasny  adj:pl:nom:f:com                        1.0000                  disamb
4       5       kraśniejsze     krasny  adj:pl:nom:m2:com                       1.0000
4       5       kraśniejsze     krasny  adj:pl:nom:m3:com                       1.0000
4       5       kraśniejsze     krasny  adj:pl:nom:n:com                        1.0000

These tags are different and therefore should not all receive the maximum probability of 1.0.

Incorrect paragraph segmentation

An example where the paragraph-level segmentation doesn't work as expected:

... od odtwarzaczy MP3 i zabawek po roboty przemysłowe.

 Komputer od typowego kalkulatora ...

The problem seems to be the middle line, which looks empty but contains some whitespace characters.
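
A minimal Python sketch of paragraph splitting that treats whitespace-only lines as paragraph breaks. It is an illustration only; `split_paragraphs` is a hypothetical name and this is not Concraft's actual segmentation code.

```python
import re


def split_paragraphs(text):
    """Split text on lines that are empty or contain only whitespace."""
    parts = re.split(r"\n\s*\n", text)
    # Trim surrounding whitespace and drop empty fragments.
    return [part.strip() for part in parts if part.strip()]
```

The pattern `\n\s*\n` matches a line break, any amount of whitespace, and another line break, so a separator line holding only spaces or tabs is handled the same way as a truly empty one.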

Improve sentence text recovery

In order to reanalyze the input data, the original textual representation of each input sentence has to be recovered. This recovery procedure should be more precise.

Tag with probabilities

Add an option that makes concraft-pl annotate input tags with probabilities.

As a result, the output will be presented in an extended version of the plain format (incompatible with Corpus2 tools).

Idea: limit possible tags based on shape

It may be a good idea to limit possible interpretations of OOV words on the basis of the shape. For example, given word x with shape sh(x), limit the set of x's possible interpretations to a set of tags assigned to words in the training corpus with shape sh(x).
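
The idea can be sketched in Python as follows. The shape function (upper, lower, digit, other, with repeated classes merged) is only one possible definition, and the fallback to the unrestricted set when the shape is unseen or the intersection is empty is an added assumption; all names here are hypothetical.

```python
from collections import defaultdict


def shape(word):
    """Map a word to a coarse shape: 'u' upper, 'l' lower, 'd' digit, 'x' other, with runs merged."""
    out = []
    for ch in word:
        if ch.isupper():
            cls = "u"
        elif ch.islower():
            cls = "l"
        elif ch.isdigit():
            cls = "d"
        else:
            cls = "x"
        if not out or out[-1] != cls:
            out.append(cls)
    return "".join(out)


def tags_by_shape(corpus):
    """Collect, for every shape, the set of tags seen in (word, tag) training pairs."""
    table = defaultdict(set)
    for word, tag in corpus:
        table[shape(word)].add(tag)
    return table


def restrict(word, candidates, table):
    """Limit an OOV word's candidate tags to those seen with the same shape."""
    allowed = table.get(shape(word))
    if not allowed:
        return set(candidates)  # unseen shape: leave candidates unrestricted
    limited = set(candidates) & allowed
    return limited or set(candidates)  # assumption: never return an empty set
```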

Slight tagging quality regression

It looks like the current version of the tagger gives slightly lower cross-validation results (as reported by Adam) than those reported in the COLING paper.

This is probably related to the fact that the original cross-validation was performed with a slightly different observation schema, where 4 consecutive orthographic forms were taken into account, while the 0.4.1 version (and probably 0.3.X and 0.2.X as well) uses only 3 consecutive orthographic forms (for the previous, the current and the next word).

It should be noted, though, that the reported regression in cross-validation results is 91.1187 vs 91.0377, which is not very significant, and that employing the fourth orthographic form in the observation schema would significantly increase the number of model features. On the other hand, it may be better to use a larger observation schema during training and then use the pruning method to get rid of useless features.

Observation schema configuration

It should be possible to supply Concraft with configuration files which would describe the observation schema, for example. Right now, it is not possible to change the schema without changing the code.

Performance regression with Maca in the background

It seems that Maca needs a considerable amount of time to analyze big paragraphs, which makes Concraft choke (at least that's what it looks like). Somehow Concraft with the --noana option is much faster than the one which uses Maca in the background.

nkjp-tagset.cfg file

Suggested by Adam:

  1. Suggested invocation concraft-pl train config/nkjp-tagset.cfg train.plain:

It may be worth adding to the instructions that the .cfg file may have to be fetched manually from the repo (when installing via cabal) or, even better, adding it to some shared directory so that the tagger can find it by itself.

Add "Maca not present" error message

The current error message:

concraft-pl: fd:9: hClose: resource vanished (Broken pipe)
concraft-pl: thread blocked indefinitely in an MVar operation

is not very informative and may confuse Concraft users (see #4).
