The code in this project trains and tests a recurrent neural network (RNN) that tries to guess the grammatical gender of German nouns.
To run it, first download the translation data from dict.cc at https://www1.dict.cc/translation_file_request.php?l=e
Then run split_german_train_test.py with the path to the file you downloaded, the output directory and, optionally, the percentage of data to reserve for the testing set (int). This script filters out all words that are not nouns, that are plural, or that contain characters other than a-z, ä, ö, ü and ß (this excludes apostrophes, hyphens, dots, etc.). It also removes all words that end with another word, i.e. compound words. This is necessary because 1) the gender of a compound word is determined only by its last word and 2) we don't want words with the same ending to be split between the training and testing sets. This step takes a while to complete.
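The two filters described above can be sketched roughly as follows. This is only an illustration, not the code from split_german_train_test.py: the function names has_only_allowed_chars and drop_compounds are hypothetical, and lowercasing words before comparing them is an assumption.

```python
import re

# Only a-z plus the German letters ä, ö, ü and ß are allowed.
ALLOWED = re.compile(r"[a-zäöüß]+")

def has_only_allowed_chars(word):
    # Assumption: words are lowercased before the character check.
    return ALLOWED.fullmatch(word.lower()) is not None

def drop_compounds(words):
    """Drop every word that ends with another word from the same list."""
    vocab = {w.lower() for w in words}
    kept = []
    for word in words:
        lower = word.lower()
        # Check every proper suffix, e.g. "haustür" -> "austür", ..., "tür", "ür", "r".
        is_compound = any(lower[i:] in vocab for i in range(1, len(lower)))
        if not is_compound:
            kept.append(word)
    return kept

candidates = ["Tür", "Haustür", "Haus", "E-Mail"]
clean = [w for w in candidates if has_only_allowed_chars(w)]
print(drop_compounds(clean))
```

Here "E-Mail" is rejected because of the hyphen, and "Haustür" is dropped because it ends with "Tür". Checking every suffix of every word against the whole vocabulary is one plausible reason why this step is slow on a large dictionary file.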
After you have generated the training and testing sets, run german_article_guesser.py, which trains an RNN on the words in the training file and tests it on the words in the testing file.
You can adjust the parameter max_word_length (default: 20) inside german_article_guesser.py. Words shorter than this value are padded on the right with '-' to reach the chosen length; longer words are skipped.
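As a minimal sketch of this padding rule (the helper pad_word is hypothetical and not taken from german_article_guesser.py; only max_word_length and the '-' padding character come from the description above):

```python
def pad_word(word, max_word_length=20, pad_char="-"):
    """Right-pad a word to max_word_length, or return None if it is too long."""
    if len(word) > max_word_length:
        return None  # longer words are skipped
    return word.ljust(max_word_length, pad_char)

words = ["haus", "tür", "donaudampfschifffahrtsgesellschaft"]
padded = [p for p in (pad_word(w) for w in words) if p is not None]
print(padded)
```

With the default length of 20, "haus" and "tür" are padded with '-' and the last word is skipped because it is longer than 20 characters.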
With the default parameters, I reached an accuracy of ~85% on the testing set.
You can test the trained network on a file or on a list of words with test_words.py.