A small script to train the Punkt sentence tokenizer. Training is an important step in making use of the tokenizer. A sample text file (preferably UTF-8) is provided as input, and a model is created from it via unsupervised machine learning. The model is saved as a .pickle file, which can then be used to tokenize sentences for the language concerned. The redistributable is updated along with the core script every time changes are made. For older files, refer to the other branches.
chaitradangat / punkttrainer
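The workflow described here (read a sample text file, train a model without supervision, save it as a .pickle file, then reuse it) can be sketched roughly as follows. This is a minimal illustration assuming NLTK's `PunktTrainer` and `PunktSentenceTokenizer` classes, not necessarily how the script in this repository is written; the function names `train_punkt` and `load_punkt` and the file paths are hypothetical.

```python
import pickle

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer


def train_punkt(corpus_path, model_path, encoding="utf-8"):
    """Train a Punkt model on a plain-text corpus and pickle the tokenizer.

    Hypothetical helper for illustration; not taken from the repository.
    """
    # Read the whole sample corpus (UTF-8 is assumed by default).
    with open(corpus_path, encoding=encoding) as f:
        text = f.read()

    # Unsupervised training: Punkt collects statistics on abbreviations,
    # collocations and sentence starters from the raw text.
    trainer = PunktTrainer()
    trainer.train(text, finalize=False)
    trainer.finalize_training()

    # Build a tokenizer from the learned parameters and save it to disk.
    tokenizer = PunktSentenceTokenizer(trainer.get_params())
    with open(model_path, "wb") as f:
        pickle.dump(tokenizer, f)
    return tokenizer


def load_punkt(model_path):
    """Load a previously pickled tokenizer (only load files you trust)."""
    with open(model_path, "rb") as f:
        return pickle.load(f)
```

A possible usage, with placeholder file names: call `train_punkt("sample.txt", "mylang.pickle")` once, then `load_punkt("mylang.pickle").tokenize(some_text)` wherever sentences need to be split. The quality of the splits depends on the size and representativeness of the sample text.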