
Core NLP algorithms for Vietnamese

Expectation:

We want to contribute knowledge and source code toward an active Vietnamese Natural Language Processing community. Beginners should not have to rebuild basic components from scratch: they can use this code directly in their own systems, or as a starting point for research on improving performance.

Modules

1. Tokenizer

Function: converts text into tokens. Example:

from tokenization.dict_models import LongMatchingTokenizer
lm_tokenizer = LongMatchingTokenizer()
tokens = lm_tokenizer.tokenize("Thuế thu nhập cá nhân")

The output is a list: ['Thuế_thu_nhập', 'cá_nhân']

Example 2: using the CRF-based tokenizer

from tokenization.crf_tokenizer import CrfTokenizer

crf_tokenizer_obj = CrfTokenizer()
crf_tokenizer_obj.train('data/tokenized/samples/training')
# Note: if you have already trained the model, load it from its path instead of training again:
# crf_tokenizer_obj = CrfTokenizer(model_path='models/pretrained_tokenizer.crfsuite')

test_sent = "Thuế thu nhập cá nhân"
tokenized_sent = crf_tokenizer_obj.get_tokenized(test_sent)
print(tokenized_sent)

The output is a list: ['Thuế_thu_nhập', 'cá_nhân']

2. Word Embedding

Download raw text from websites such as Wikipedia or Vietnamese news sites. Any crawling method will do; here is the one we provide:

from word_embedding.utils import download_html
url_path = "https://dantri.com.vn/su-kien/anh-huong-bao....htm"
output_path = "data/word_embedding/real/html/html_data.txt"
download_html(url_path, output_path, should_clean=True)

Next, clean and tokenize the text files. We provide a helper that cleans every file in a directory:

from tokenization.crf_tokenizer import CrfTokenizer
from word_embedding.utils import clean_files_from_dir

input_dir = 'data/word_embedding/real/html'
output_dir = 'data/word_embedding/real/training'
crf_config_root_path = "tokenization/"
crf_model_path = "models/pretrained_tokenizer.crfsuite"
tokenizer = CrfTokenizer(config_root_path=crf_config_root_path, model_path=crf_model_path)
clean_files_from_dir(input_dir, output_dir, should_tokenize=True, tokenizer=tokenizer)

Training

from word_embedding.word2vec_gensim import train

data_path = "data/word_embedding/samples/training"
model_path = "models/word2vec.model"
train(data_path=data_path, model_path=model_path)

Test

from word_embedding.word2vec_gensim import test
model_path = "models/word2vec.model"
test_word = "thu_nhập"
test(model_path=model_path, word=test_word)

Use pre-trained word2vec model

model_path = "models/pretrained_word2vec.bin"
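
The path above only identifies the file. A minimal sketch of actually loading and querying it (assuming, as in the classification demo below, that the Word2Vec class re-exported by word_embedding.word2vec_gensim is gensim's, and that the file is a gensim-saved model):

from word_embedding.word2vec_gensim import Word2Vec

model_path = "models/pretrained_word2vec.bin"
model = Word2Vec.load(model_path)           # load the gensim-saved model
print(model.wv.most_similar("thu_nhập"))    # nearest neighbours of a token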

Other pre-trained word2vec models:

  • https://github.com/sonvx/word2vecVN
  • https://github.com/Kyubyong/wordvectors
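
Note that third-party models such as those above are often distributed in the original word2vec binary format rather than as gensim-saved models. A hedged sketch for that case, using gensim's KeyedVectors (the file name is a placeholder; use the one you actually downloaded):

from gensim.models import KeyedVectors

# 'downloaded_model.bin' is a hypothetical name for the downloaded file
wv = KeyedVectors.load_word2vec_format("downloaded_model.bin", binary=True)
print(wv.most_similar("thu_nhập"))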

3. Text classification

(Run: python demo.py)

from tokenization.crf_tokenizer import CrfTokenizer
from word_embedding.word2vec_gensim import Word2Vec
from text_classification.short_text_classifiers import BiDirectionalLSTMClassifier, load_synonym_dict

# Please give the correct paths
word2vec_model = Word2Vec.load('models/pretrained_word2vec.bin')

tokenizer = CrfTokenizer(config_root_path='tokenization/',
                         model_path='models/pretrained_tokenizer.crfsuite')
sym_dict = load_synonym_dict('data/sentiment/synonym.txt')
keras_text_classifier = BiDirectionalLSTMClassifier(tokenizer=tokenizer, word2vec=word2vec_model.wv,
                                                    model_path='models/sentiment_model.h5',
                                                    max_length=10, n_epochs=10,
                                                    sym_dict=sym_dict)
X, y = keras_text_classifier.load_data(['data/sentiment/samples/positive.txt',
                                        'data/sentiment/samples/negative.txt'],
                                       load_method=keras_text_classifier.load_data_from_file)

keras_text_classifier.train(X, y)
label_dict = {0: 'tích cực', 1: 'tiêu cực'}  # 0 = positive, 1 = negative
# Test sentences: 'So bad', 'So good', 'really boring movie', 'so silly'
test_sentences = ['Dở thế', 'Hay thế', 'phim chán thật', 'nhảm quá']
labels = keras_text_classifier.classify(test_sentences, label_dict=label_dict)
print(labels)  # Output: ['tiêu cực', 'tích cực', 'tiêu cực', 'tiêu cực'] (negative, positive, negative, negative)

Requirements

Please check requirements.txt.

For the tokenization module:

  • python-crfsuite
  • sklearn-crfsuite (used by the pre-trained model by default, but optional; feel free to use python-crfsuite instead)

For the word_embedding module:

  • tokenization module
  • gensim

For the text_classification module:

  • tokenization module
  • word_embedding module
  • keras

FYI: sklearn-crfsuite is a wrapper around python-crfsuite that offers a scikit-learn-style API, including grid search for hyperparameter optimization.
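
As an illustration of that API, here is a minimal, self-contained sketch of grid-searching the CRF regularization strengths on toy B/I segmentation labels (it follows the sklearn-crfsuite tutorial pattern; the repository's real feature templates live in the tokenization module, and exact behaviour may vary across sklearn versions):

import sklearn_crfsuite
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn_crfsuite.metrics import flat_f1_score

# Toy data: one feature dict per syllable, one B/I segmentation label each.
X = [[{"w": "Thuế"}, {"w": "thu"}, {"w": "nhập"}, {"w": "cá"}, {"w": "nhân"}]] * 12
y = [["B", "I", "I", "B", "I"]] * 12

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
params = {"c1": [0.01, 0.1], "c2": [0.01, 0.1]}  # L1/L2 regularization strengths
scorer = make_scorer(flat_f1_score, average="weighted")
search = GridSearchCV(crf, params, scoring=scorer, cv=3)
search.fit(X, y)
print(search.best_params_)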
