charsplit's Introduction

CharSplit - An ngram-based compound splitter for German

Splits a German compound into its body and head, e.g.

Autobahnraststätte -> Autobahn - Raststätte

An implementation of the method described in the appendix of the thesis:

Tuggener, Don (2016). Incremental Coreference Resolution for German. University of Zurich, Faculty of Arts.

TL;DR: The method calculates the probabilities of ngrams occurring at the beginning, at the end, and in the middle of words, and identifies the most likely position for a split.
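
As a rough sketch of the idea (with hypothetical probability tables, not the library's actual code), every split position can be scored and the candidates ranked:

# Sketch: rank the split positions of a word by an n-gram-based score.
# p_prefix, p_suffix, p_infix are hypothetical dicts mapping n-grams
# (here simply candidate substrings) to probabilities.
def rank_splits(word, p_prefix, p_suffix, p_infix):
    candidates = []
    for n in range(1, len(word)):
        prefix, suffix = word[:n].lower(), word[n:].lower()
        infixes = [word[i:i + 3].lower() for i in range(1, len(word) - 3)]
        score = (p_prefix.get(prefix, 0.0) + p_suffix.get(suffix, 0.0)
                 - min((p_infix.get(g, 0.0) for g in infixes), default=0.0))
        candidates.append((score, word[:n], word[n:].capitalize()))
    return sorted(candidates, reverse=True)  # best split first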

The method achieves ~95% accuracy for head detection on the GermaNet compound test set.

A model trained on 1 million German nouns from Wikipedia is provided.

Usage

Train a new model:

training.py --input_file --output_file

from the command line, where input_file contains one word (noun) per line and output_file is a JSON file with the computed n-gram probabilities.
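
For example, with hypothetical file names:

python training.py --input_file german_nouns.txt --output_file my_ngram_probs.json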

Compound splitting

In Python

>>> from charsplit import Splitter
>>> splitter = Splitter()
>>> splitter.split_compound("Autobahnraststätte")

returns a list of all possible splits, ranked by their score, e.g.

[(0.7945872450631273, 'Autobahn', 'Raststätte'), 
(-0.7143290887876655, 'Auto', 'Bahnraststätte'), 
(-1.1132332878581173, 'Autobahnrast', 'Stätte'), ...]

By default, Splitter uses the data from the file charsplit/ngram_probs.json. If you retrained the model, you may specify a custom file with

>>> splitter = Splitter(ngram_path=<json_data_file_with_ngram_probs>)


charsplit's Issues

Score is not the same as presented in the paper

I tried the same code with the same compound string, but the score is -0.65. Why is that? E.g., "Autobahnraststätte" returns:
[[-0.6561174724342663, 'Autobahn', 'Raststätte'],
[-0.719082070992539, 'Autobahnrast', 'Stätte'],
[-2.0207606162242056, 'Auto', 'Bahnraststätte'],
[-2.0883545770567786, 'Autobahnrasts', 'Tätte'],
[-2.116115029842648, 'Autobahnrastst', 'Ätte'],
[-2.1366906474820144, 'Autobahnras', 'Tstätte'],
[-2.155172413793103, 'Autobahnra', 'Ststätte'],
[-2.1557478368356, 'Autobahnr', 'Aststätte'],
[-2.237077877325982, 'Aut', 'Obahnraststätte'],
[-2.458303592671901, 'Autob', 'Ahnraststätte'],
[-2.709178455383891, 'Autobahnraststä', 'Tte'],
[-2.785514345696291, 'Autobah', 'Nraststätte'],
[-3, 'Autoba', 'Hnraststätte']]

How to filter which words to split and which not?

This compound splitter is amazing, thank you.
The problem I have with it is that it tends to split even words that it really shouldn't; for example, the word "Präsident" is split into "Prasid" and "ent".

This flaw makes it impossible for me to use in my NLP project.

I was wondering if there is a way to control which words are split and which are not.

This would be really helpful, thank you.
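
A possible workaround (not a built-in feature of CharSplit) is to accept a split only if its top score clears a threshold and both parts look like plausible words:

# Sketch of a score-based filter; THRESHOLD and MIN_PART_LEN are
# hypothetical values that would need tuning on your own data.
from charsplit import Splitter

THRESHOLD = 0.5
MIN_PART_LEN = 4

def maybe_split(word, splitter):
    score, head, tail = splitter.split_compound(word)[0]  # best candidate
    if score >= THRESHOLD and min(len(head), len(tail)) >= MIN_PART_LEN:
        return [head, tail]
    return [word]  # leave words like "Präsident" intact

splitter = Splitter()
print(maybe_split("Autobahnraststätte", splitter))  # expected: ['Autobahn', 'Raststätte']
print(maybe_split("Präsident", splitter))           # expected: ['Präsident']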

Problem using "-" in the string

Thank you very much for this great compound splitter. It works really well. Unfortunately, it has some problems with hyphens.

char_split.split_compound("Kraftfahrzeug-Haftpflichtversicherung")
returns [[1.0, 'Kraftfahrzeug-Haftpflichtversicherung', 'Haftpflichtversicherung']]

The whole word is returned as the first part of the result. Do you know how to fix this issue?
Thanks in advance :)
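
Until this is fixed, one workaround is to split on hyphens first and run the splitter on each segment separately (a sketch; the 0.0 score cutoff is an assumption):

# Sketch: pre-split on hyphens, then apply CharSplit per segment.
from charsplit import Splitter

def split_hyphenated(word, splitter, cutoff=0.0):
    parts = []
    for segment in word.split("-"):
        score, left, right = splitter.split_compound(segment)[0]
        parts.extend([left, right] if score > cutoff else [segment])
    return parts

splitter = Splitter()
print(split_hyphenated("Kraftfahrzeug-Haftpflichtversicherung", splitter))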

Split algorithm does not work as intended

Thank you for sharing the code. Your compound splitter works very well, though perhaps not as intended. For each split position n, instead of

score(n) = max p(prefix) + max p(suffix) − min p(infix),

what is actually computed by the code is

score(n) = p(word[:n]) + p(word[n:]) - min p(infix).

That is, the code iterates over all infix ngrams, but only uses the full prefix and suffix for the other two probabilities. I simplified the code to reflect this and removed some redundant steps: https://github.com/kldtz/CharSplit. Now the GermaNet evaluation runs in about half of the original time on my machine (yielding the same result).

I also tried using all prefix and suffix ngrams as described in the appendix of your thesis, but this has a negative effect on the performance for the GermaNet evaluation set.
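
In code, the difference between the two variants looks roughly like this (a sketch with hypothetical probability tables p_prefix, p_suffix, p_infix; k is the infix n-gram length):

# Thesis appendix: max over all prefix/suffix n-grams, min over infixes.
# Expects 1 <= n < len(word).
def score_described(word, n, p_prefix, p_suffix, p_infix, k=3):
    prefixes = [word[:j].lower() for j in range(1, n + 1)]
    suffixes = [word[j:].lower() for j in range(n, len(word))]
    infixes = [word[i:i + k].lower() for i in range(1, len(word) - k)]
    return (max(p_prefix.get(g, 0.0) for g in prefixes)
            + max(p_suffix.get(g, 0.0) for g in suffixes)
            - min((p_infix.get(g, 0.0) for g in infixes), default=0.0))

# Actual code (per this issue): only the full prefix and suffix are used.
def score_implemented(word, n, p_prefix, p_suffix, p_infix, k=3):
    infixes = [word[i:i + k].lower() for i in range(1, len(word) - k)]
    return (p_prefix.get(word[:n].lower(), 0.0)
            + p_suffix.get(word[n:].lower(), 0.0)
            - min((p_infix.get(g, 0.0) for g in infixes), default=0.0))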

Possible PyPI package?

Thanks for sharing your great repo.

Would it be possible for you to make a PyPI package available for this repo?

Training data

Is it possible to find (or generate) the training data to experiment with the algorithm?
