
HypheNN-de

A neural network that hyphenates German words. Following B. Fritzke and C. Nasahl, "A neural network that learns to do hyphenation", it uses a window of 8 characters and determines whether the word can be hyphenated after position 4 of the current window. Currently the network achieves a success rate of 99.2 %.

The input window is one-hot encoded over the character set, and the network outputs a single value: the probability that a hyphenation at the current position is valid.
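
To make the windowing concrete, here is a minimal sketch of how the 8-character windows could be extracted and one-hot encoded. The alphabet, the padding character, and the all-zero encoding for padding positions are assumptions for illustration, not the repository's actual encoding:

import numpy as np

# Hypothetical alphabet; the repository derives its character set from
# the training data, so this list is an assumption.
ALPHABET = "abcdefghijklmnopqrstuvwxyzäöüß"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}

WINDOW = 8  # window size from the paper
SPLIT = 4   # each window asks: is a hyphen valid after this position?

def encode_windows(word):
    """Return one flattened one-hot vector per candidate hyphenation gap."""
    word = word.lower()
    # Pad the word so windows near its boundaries are well defined;
    # padding characters encode as all-zero rows (a convention assumed here).
    padded = "." * SPLIT + word + "." * (WINDOW - SPLIT)
    vectors = []
    for gap in range(1, len(word)):        # gap g = split after character g
        window = padded[gap:gap + WINDOW]  # 4 characters left, 4 right of the gap
        one_hot = np.zeros((WINDOW, len(ALPHABET)))
        for pos, ch in enumerate(window):
            if ch in CHAR_INDEX:
                one_hot[pos, CHAR_INDEX[ch]] = 1.0
        vectors.append(one_hot.flatten())  # input size: WINDOW * |ALPHABET|
    return np.array(vectors)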

Dependencies

  • Python 3 with TensorFlow and Keras installed
  • Rust

Download & prepare data

  1. Download a Wiktionary dump from https://dumps.wikimedia.org/dewiktionary/latest/ (look for dewiktionary-{timestamp}-pages-articles.xml.bz2) and unpack it
  2. Compile and run prepare_data:
    • $ cd prepare_data
    • $ cargo run --release ../data/dewiktionary-*-pages-articles.xml ../wordlist.txt

Notes: A fair amount of post-processing happens to clean up the data. The whole process may work for other languages, but the data cleanup will probably need some adaptation. Also note that the preprocessing randomizes the order of entries.

Train

To train the network, run:

$ python train.py
Using TensorFlow backend.
Building model...
Done

Training model...
2017-05-13 21:58:40.012420: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] OS X does not support NUMA - returning NUMA node zero
2017-05-13 21:58:40.012582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties: 
name: Quadro K2000
major: 3 minor: 0 memoryClockRate (GHz) 0.954
pciBusID 0000:02:00.0
Total memory: 2.00GiB
Free memory: 1001.96MiB
2017-05-13 21:58:40.012595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0 
2017-05-13 21:58:40.012600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0:   Y 
2017-05-13 21:58:40.012610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K2000, pci bus id: 0000:02:00.0)
Epoch 1/50
1253797/1253797 [==============================] - 7s - loss: 0.0450 - acc: 0.9468       
Epoch 2/50
1253797/1253797 [==============================] - 7s - loss: 0.0218 - acc: 0.9753       
...
Epoch 48/50
1253797/1253797 [==============================] - 7s - loss: 0.0076 - acc: 0.9920      
Epoch 49/50
1253797/1253797 [==============================] - 7s - loss: 0.0076 - acc: 0.9919       
Epoch 50/50
1253797/1253797 [==============================] - 7s - loss: 0.0076 - acc: 0.9920      
Done
Time: 0:06:05.55

The network weights are stored in data/model.h5.
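
The README doesn't spell out the architecture. As rough orientation, a minimal Keras sketch consistent with the description above (a flattened one-hot window as input, a single sigmoid probability as output, and mean squared error as the loss, matching the values validate.py reports) might look like the following; the hidden layer size and the optimizer are assumptions:

from tensorflow import keras
from tensorflow.keras import layers

WINDOW = 8
ALPHABET_SIZE = 30  # assumed size of the one-hot character set

# Feed-forward network: flattened one-hot window in, probability out.
model = keras.Sequential([
    keras.Input(shape=(WINDOW * ALPHABET_SIZE,)),
    layers.Dense(128, activation="relu"),   # hidden size is an assumption
    layers.Dense(1, activation="sigmoid"),  # P(hyphen valid after window position 4)
])

# validate.py reports mean squared error, so MSE is the natural loss choice here.
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["accuracy"])

# X_train, y_train would be the encoded windows and 0/1 hyphenation labels
# built from wordlist.txt (see the encoding sketch above):
# model.fit(X_train, y_train, epochs=50)
# model.save("data/model.h5")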

Validate

$ python validate.py
Using TensorFlow backend.
Building model...
Done

2017-05-14 00:50:01.683992: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] OS X does not support NUMA - returning NUMA node zero
2017-05-14 00:50:01.684174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties: 
name: Quadro K2000
major: 3 minor: 0 memoryClockRate (GHz) 0.954
pciBusID 0000:02:00.0
Total memory: 2.00GiB
Free memory: 720.18MiB
2017-05-14 00:50:01.684190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0 
2017-05-14 00:50:01.684195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0:   Y 
2017-05-14 00:50:01.684206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K2000, pci bus id: 0000:02:00.0)
Validating model...
1253536/1253797 [============================>.] - ETA: 0s   
Done
Result: [0.0073155245152076989, 0.99240706430147785]
Time: 0:00:40.17

Note: The first value is the mean squared error, the second value is the achieved accuracy.
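
Conceptually, validation is a single Keras evaluate call on held-out windows. A minimal sketch, assuming the model was saved as a whole to data/model.h5 and that X_val and y_val (hypothetical names) hold the encoded windows and their 0/1 labels:

from tensorflow import keras

# Rebuild the network with the weights stored by train.py.
model = keras.models.load_model("data/model.h5")

# X_val, y_val: held-out encoded windows and labels, built the same way
# as the training data (see the encoding sketch above):
# loss, accuracy = model.evaluate(X_val, y_val)
# print([loss, accuracy])  # corresponds to the [mse, accuracy] pair above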

Run

$ python predict.py Silbentrennung
Using TensorFlow backend.
Building model...
Done

2017-05-14 00:49:08.660959: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] OS X does not support NUMA - returning NUMA node zero
2017-05-14 00:49:08.661122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties: 
name: Quadro K2000
major: 3 minor: 0 memoryClockRate (GHz) 0.954
pciBusID 0000:02:00.0
Total memory: 2.00GiB
Free memory: 738.18MiB
2017-05-14 00:49:08.661136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0 
2017-05-14 00:49:08.661141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0:   Y 
2017-05-14 00:49:08.661152: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K2000, pci bus id: 0000:02:00.0)
Input: Silbentrennung
Hyphenation: Sil·ben·tren·nung
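
predict.py's internals aren't reproduced in this README. Conceptually, the word is scanned gap by gap, and a middle dot is inserted wherever the network's output exceeds a threshold. A sketch, reusing the hypothetical encode_windows helper from above and assuming a 0.5 threshold:

def hyphenate(model, word, threshold=0.5):
    """Insert '·' after every gap the network scores above the threshold."""
    probs = model.predict(encode_windows(word)).ravel()  # one score per gap
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        # probs[i] scores the gap after character i (no gap after the last one)
        if i < len(word) - 1 and probs[i] > threshold:
            out.append("·")
    return "".join(out)

# hyphenate(model, "Silbentrennung") -> "Sil·ben·tren·nung"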

TensorFlow build notes

Setup: See https://gist.github.com/Mistobaan/dd32287eeb6859c6668d

Configure:

PYTHON_BIN_PATH=$(which python) CUDA_TOOLKIT_PATH="/usr/local/cuda" CUDNN_INSTALL_PATH="/usr/local/cuda" TF_UNOFFICIAL_SETTING=1 TF_NEED_CUDA=1 TF_CUDA_COMPUTE_CAPABILITIES="3.0" TF_CUDNN_VERSION="6" TF_CUDA_VERSION="8.0" TF_CUDA_VERSION_TOOLKIT=8.0 ./configure

Note: Use defaults everywhere.

Build:

bazel build -c opt --copt=-mavx --copt=-msse4.2 --config=cuda //tensorflow/tools/pip_package:build_pip_package

If you encounter problems with Library not loaded: @rpath/libcudart.8.0.dylib, follow http://stackoverflow.com/a/40007947/997063.

If you encounter problems regarding -lgomp, replace -lgomp with -L/usr/local/Cellar/llvm/4.0.0/lib/libiomp5.dylib in third_party/gpus/cuda/BUILD.tpl, or comment out that line.

References

  • B. Fritzke and C. Nasahl, "A neural network that learns to do hyphenation"

Issues

Extracting enwiktionary

Hi Markus,

I would like to extract the syllabified words from the English Wiktionary.

However, I can't figure out how I'm supposed to compile the Cargo project.

Can you help me out?

Thanks.

Not able to process data

I'm not able to prepare the data because I don't know anything about Rust. In other words, this command returns an error:

$ cargo run --release ../data/dewiktionary-*-pages-articles.xml ../wordlist.txt

Is there any other way to achieve this? Or can you share the wordlist.txt file?

Weird anomaly with a certain word

Hi, first of all, thank you for sharing this on GitHub @msiemens :)

I'm having issues with the hyphenation of the word "Geo", and geo-like words in general. In my dataset I prepared all words starting with "Geo", "Leo", etc. like this: "Ge·o...", "Le·o...".

After training, the output for "Geo" looks like this:

Input: Geo
Hyphenation: Ge·oo

The same happens for Leo, Feo, Meo, etc.:
Input: Leo
Hyphenation: Le·oo
....
....

For some reason it adds an additional "o" at the end; the hyphenation itself is correct, though. Does anyone else have this issue, or do you have any idea why it does this? Help would be appreciated!

It works fine with other words, for example:

Input: Donaudampfschifffahrtsgesellschaftskapitänswitwe
Hyphenation: Do·nau·dampf·schiff·fahrts·ge·sell·schafts·ka·pi·täns·wit·we

Input: Silbentrennung
Hyphenation: Sil·ben·tren·nung

PS: I'm new to DL and NNs.

Greetings
Andi

No progress in training

Hi,

first of all, thanks for updating HypheNN-de to TensorFlow 2.13!

I'm having difficulties reproducing your training results (Result: [0.0073155245152076989, 0.99240706430147785]) and am not sure where to look for the error.

Here are the steps I took for training:

  1. I downloaded the latest Wiktionary dump from the link provided in the readme (dewiktionary-latest-pages-articles.xml.bz2 / 20-Jul-2023 17:03).
  2. I ran cargo run --release ../wiki-articles/dewiktionary-latest-pages-articles.xml ../wordlist.txt, which finished with no errors and the following output:
cargo run --release ../wiki-articles/dewiktionary-latest-pages-articles.xml ../wordlist.txt
warning: use of deprecated trait `std::ascii::AsciiExt`: use inherent methods instead
 --> src/main.rs:7:17
  |
7 | use std::ascii::AsciiExt;
  |                 ^^^^^^^^
  |
  = note: `#[warn(deprecated)]` on by default

warning: unused import: `std::ascii::AsciiExt`
 --> src/main.rs:7:5
  |
7 | use std::ascii::AsciiExt;
  |     ^^^^^^^^^^^^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

warning: `prepare_data` (bin "prepare_data") generated 2 warnings
    Finished release [optimized] target(s) in 0.03s
     Running `target/release/prepare_data ../wiki-articles/dewiktionary-latest-pages-articles.xml ../wordlist.txt`
Processed 1000 words
Processed 2000 words
Processed 3000 words
Processed 4000 words
Processed 5000 words
...
Processed 652000 words
Processed 653000 words
Processed 654000 words
Processed 655000 words
Randomizing line order...
Writing output file...
Done!
  3. I started the training with python train.py and realized (if I'm not mistaken) that you hardcoded a word limit for training in dataset.py, which is set to TRAINING_SET = 150000. That is a lot less than the 655000 available words generated by your Rust script.
  4. I therefore changed the 150000 limit to 90 % of the available training data (655000 words) in dataset.py: TRAINING_SET = int(len(words) * 0.9), which results in approx. 589500 words.
  5. When starting the training via python train.py, everything works fine, but the training itself doesn't make much progress. Here is the training output:
python train.py                                
Building model...
2023-07-25 08:40:12.192699: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1
2023-07-25 08:40:12.192721: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2023-07-25 08:40:12.192725: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
2023-07-25 08:40:12.193053: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-07-25 08:40:12.193339: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Done

Preparing training data...
Processed 589500 entries (100 %)
Storing data...
Done

Training model...
Epoch 1/50
2023-07-25 08:40:24.986655: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
10489/10489 [==============================] - 44s 4ms/step - loss: 0.0511 - accuracy: 0.9386
Epoch 2/50
10489/10489 [==============================] - 46s 4ms/step - loss: 0.0493 - accuracy: 0.9416
Epoch 3/50
10489/10489 [==============================] - 45s 4ms/step - loss: 0.0494 - accuracy: 0.9417
Epoch 4/50
10489/10489 [==============================] - 45s 4ms/step - loss: 0.0496 - accuracy: 0.9416
...
Epoch 45/50
10489/10489 [==============================] - 44s 4ms/step - loss: 0.0532 - accuracy: 0.9419
Epoch 46/50
10489/10489 [==============================] - 44s 4ms/step - loss: 0.0533 - accuracy: 0.9419
Epoch 47/50
10489/10489 [==============================] - 43s 4ms/step - loss: 0.0533 - accuracy: 0.9420
Epoch 48/50
10489/10489 [==============================] - 44s 4ms/step - loss: 0.0537 - accuracy: 0.9418
Epoch 49/50
10489/10489 [==============================] - 44s 4ms/step - loss: 0.0537 - accuracy: 0.9419
Epoch 50/50
10489/10489 [==============================] - 44s 4ms/step - loss: 0.0536 - accuracy: 0.9420
Done
Time: 0:36:48.59

Maybe you're able to point me in the right direction, as I'm at a bit of a loss here. I noticed that the training output in your readme works with a lot more data: 1253797/1253797. Did you increase the training data limit when running this training?

Your model is also a lot more accurate: 0.99240706430147785. I'm trying to reproduce your results (although you worked with an older Wiktionary data set) and would be really happy if you could help.

Thanks!

Training issue

When training, the loss quickly drops below 7.1927e-06 (after the first epoch). When I finished training and predicted the output for some verbs, I noticed that the output for every character tends towards 0: the network never decides to split at a syllable boundary. The data is read properly. What could this be caused by?
