charsplit's Introduction

CharSplit - An ngram-based compound splitter for German

Splits a German compound into its body and head, e.g.

Autobahnraststätte -> Autobahn - Raststätte

An implementation of the method described in the appendix of the thesis:

Tuggener, Don (2016). Incremental Coreference Resolution for German. University of Zurich, Faculty of Arts.

TL;DR: The method calculates the probabilities of ngrams occurring at the beginning, at the end, and in the middle of words, and identifies the most likely position for a split.
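
As a rough sketch of the idea (with hypothetical probability tables, not the library's actual code), every split position can be scored and the candidates ranked:

# Sketch: rank the split positions of a word by an n-gram-based score.
# p_prefix, p_suffix, p_infix are hypothetical dicts mapping n-grams
# (here simply candidate substrings) to probabilities.
def rank_splits(word, p_prefix, p_suffix, p_infix):
    candidates = []
    for n in range(1, len(word)):
        prefix, suffix = word[:n].lower(), word[n:].lower()
        infixes = [word[i:i + 3].lower() for i in range(1, len(word) - 3)]
        score = (p_prefix.get(prefix, 0.0) + p_suffix.get(suffix, 0.0)
                 - min((p_infix.get(g, 0.0) for g in infixes), default=0.0))
        candidates.append((score, word[:n], word[n:].capitalize()))
    return sorted(candidates, reverse=True)  # best split first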

The method achieves ~95% accuracy for head detection on the GermaNet compound test set.

A model trained on 1 million German nouns from Wikipedia is provided.

Usage

Train a new model:

training.py --input_file --output_file

from the command line, where input_file contains one word (noun) per line and output_file is a JSON file with the computed n-gram probabilities.
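
For example, with hypothetical file names:

python training.py --input_file german_nouns.txt --output_file my_ngram_probs.json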

Compound splitting

In Python

>>> from charsplit import Splitter
>>> splitter = Splitter()
>>> splitter.split_compound("Autobahnraststätte")

returns a list of all possible splits, ranked by their score, e.g.

[(0.7945872450631273, 'Autobahn', 'Raststätte'), 
(-0.7143290887876655, 'Auto', 'Bahnraststätte'), 
(-1.1132332878581173, 'Autobahnrast', 'Stätte'), ...]

By default, Splitter uses the data from the file charsplit/ngram_probs.json. If you retrained the model, you may specify a custom file with

>>> splitter = Splitter(ngram_path=<json_data_file_with_ngram_probs>)


charsplit's Issues

Score is not the same as presented in the paper

I tried the same code with the same compound string, but the score is -0.65. Why is that? E.g., "Autobahnraststätte" returns:
[[-0.6561174724342663, 'Autobahn', 'Raststätte'],
[-0.719082070992539, 'Autobahnrast', 'Stätte'],
[-2.0207606162242056, 'Auto', 'Bahnraststätte'],
[-2.0883545770567786, 'Autobahnrasts', 'Tätte'],
[-2.116115029842648, 'Autobahnrastst', 'Ätte'],
[-2.1366906474820144, 'Autobahnras', 'Tstätte'],
[-2.155172413793103, 'Autobahnra', 'Ststätte'],
[-2.1557478368356, 'Autobahnr', 'Aststätte'],
[-2.237077877325982, 'Aut', 'Obahnraststätte'],
[-2.458303592671901, 'Autob', 'Ahnraststätte'],
[-2.709178455383891, 'Autobahnraststä', 'Tte'],
[-2.785514345696291, 'Autobah', 'Nraststätte'],
[-3, 'Autoba', 'Hnraststätte']]

How to filter which words to split and which not?

This compound splitter is amazing, thank you.
The problem I have with it is that it tends to split even words that it really shouldn't; for example, the word "Präsident" is split into "Prasid" and "ent".

This flaw makes it impossible for me to use in my NLP project.

I was wondering if there is a way to control which words are split and which are not.

This would be really helpful, thank you.
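
A possible workaround (not a built-in feature of CharSplit) is to accept a split only if its top score clears a threshold and both parts look like plausible words:

# Sketch of a score-based filter; THRESHOLD and MIN_PART_LEN are
# hypothetical values that would need tuning on your own data.
from charsplit import Splitter

THRESHOLD = 0.5
MIN_PART_LEN = 4

def maybe_split(word, splitter):
    score, head, tail = splitter.split_compound(word)[0]  # best candidate
    if score >= THRESHOLD and min(len(head), len(tail)) >= MIN_PART_LEN:
        return [head, tail]
    return [word]  # leave words like "Präsident" intact

splitter = Splitter()
print(maybe_split("Autobahnraststätte", splitter))  # expected: ['Autobahn', 'Raststätte']
print(maybe_split("Präsident", splitter))           # expected: ['Präsident']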

Problem using "-" in the string

Thank you very much for this great compound splitter. It works really well. Unfortunately, it has some problems with hyphens.

char_split.split_compound("Kraftfahrzeug-Haftpflichtversicherung")
returns [[1.0, 'Kraftfahrzeug-Haftpflichtversicherung', 'Haftpflichtversicherung']]

The whole word is returned as the first part of the result. Do you know how to fix this issue?
Thanks in advance :)
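
Until this is fixed, one workaround is to split on hyphens first and run the splitter on each segment separately (a sketch; the 0.0 score cutoff is an assumption):

# Sketch: pre-split on hyphens, then apply CharSplit per segment.
from charsplit import Splitter

def split_hyphenated(word, splitter, cutoff=0.0):
    parts = []
    for segment in word.split("-"):
        score, left, right = splitter.split_compound(segment)[0]
        parts.extend([left, right] if score > cutoff else [segment])
    return parts

splitter = Splitter()
print(split_hyphenated("Kraftfahrzeug-Haftpflichtversicherung", splitter))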

Split algorithm does not work as intended

Thank you for sharing the code. Your compound splitter works very well, though perhaps not as intended. For each split position n, instead of

score(n) = max p(prefix) + max p(suffix) − min p(infix),

what is actually computed by the code is

score(n) = p(word[:n]) + p(word[n:]) - min p(infix).

That is, the code iterates over all infix ngrams, but only uses the full prefix and suffix for the other two probabilities. I simplified the code to reflect this and removed some redundant steps: https://github.com/kldtz/CharSplit. Now the GermaNet evaluation runs in about half of the original time on my machine (yielding the same result).

I also tried using all prefix and suffix ngrams as described in the appendix of your thesis, but this has a negative effect on the performance for the GermaNet evaluation set.
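
In code, the difference between the two variants looks roughly like this (a sketch with hypothetical probability tables p_prefix, p_suffix, p_infix; k is the infix n-gram length):

# Thesis appendix: max over all prefix/suffix n-grams, min over infixes.
# Expects 1 <= n < len(word).
def score_described(word, n, p_prefix, p_suffix, p_infix, k=3):
    prefixes = [word[:j].lower() for j in range(1, n + 1)]
    suffixes = [word[j:].lower() for j in range(n, len(word))]
    infixes = [word[i:i + k].lower() for i in range(1, len(word) - k)]
    return (max(p_prefix.get(g, 0.0) for g in prefixes)
            + max(p_suffix.get(g, 0.0) for g in suffixes)
            - min((p_infix.get(g, 0.0) for g in infixes), default=0.0))

# Actual code (per this issue): only the full prefix and suffix are used.
def score_implemented(word, n, p_prefix, p_suffix, p_infix, k=3):
    infixes = [word[i:i + k].lower() for i in range(1, len(word) - k)]
    return (p_prefix.get(word[:n].lower(), 0.0)
            + p_suffix.get(word[n:].lower(), 0.0)
            - min((p_infix.get(g, 0.0) for g in infixes), default=0.0))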

Possible PyPI package?

Thanks for sharing your great repo.

Would it be possible for you to make a PyPI package available for this repo?

Training data

Is it possible to find (or generate) the training data to experiment with the algorithm?
