Kumparan's NLP Services

nlp-id is a collection of modules that provide Natural Language Processing functions for Bahasa Indonesia. This repository contains all source code related to the NLP services.

Installation

To install nlp-id, use the following command:

$ pip install nlp-id     
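
To confirm that the installation succeeded, one option is to query the package metadata from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration and is not part of nlp-id.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


nlp_id_version = installed_version("nlp-id")
if nlp_id_version is None:
    print("nlp-id is not installed; run: pip install nlp-id")
else:
    print(f"nlp-id {nlp_id_version} is installed")
```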

Usage

This section describes how to use the lemmatizer, tokenizer, POS tagger, and stopword modules.

Lemmatizer

The lemmatizer returns the root word of every word in a sentence.

from nlp_id.lemmatizer import Lemmatizer 
lemmatizer = Lemmatizer() 
lemmatizer.lemmatize('Saya sedang mencoba') 
# saya sedang coba 

Tokenizer

The tokenizer converts text into tokens such as words, punctuation, numbers, dates, emails, and URLs. This repository provides two tokenizers: a standard tokenizer and a phrase tokenizer. The standard tokenizer splits the text into separate tokens where every word token is a single word.

from nlp_id.tokenizer import Tokenizer 
tokenizer = Tokenizer() 
tokenizer.tokenize('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
# ['Lionel', 'Messi', 'pergi', 'ke', 'pasar', 'di', 'daerah', 'Jakarta', 'Pusat', '.']

The phrase tokenizer tokenizes the text into separate tokens where the word tokens are phrases (single or multi-word tokens).

from nlp_id.tokenizer import PhraseTokenizer 
tokenizer = PhraseTokenizer() 
tokenizer.tokenize('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
# ['Lionel Messi', 'pergi', 'ke', 'pasar', 'di', 'daerah', 'Jakarta Pusat', '.']
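
The difference between the two tokenizers can be seen by comparing their outputs. The sketch below does not call nlp-id; it hard-codes the two sample outputs shown above and uses two illustrative helpers (`multi_word_tokens` and `words_in`, both hypothetical names) to pick out the multi-word phrases and to check that the phrase tokens cover the same words as the standard tokens.

```python
# Sample outputs copied from the tokenizer examples in this section.
standard_tokens = ['Lionel', 'Messi', 'pergi', 'ke', 'pasar', 'di', 'daerah',
                   'Jakarta', 'Pusat', '.']
phrase_tokens = ['Lionel Messi', 'pergi', 'ke', 'pasar', 'di', 'daerah',
                 'Jakarta Pusat', '.']


def multi_word_tokens(tokens):
    """Return only the tokens that contain more than one whitespace-separated word."""
    return [token for token in tokens if len(token.split()) > 1]


def words_in(tokens):
    """Split every token on whitespace and return the flat list of words."""
    return [word for token in tokens for word in token.split()]


print(multi_word_tokens(phrase_tokens))
# ['Lionel Messi', 'Jakarta Pusat']
print(words_in(phrase_tokens) == standard_tokens)
# True
```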

POS Tagger

The POS tagger assigns a part-of-speech tag to each token in a text. This repository provides two POS taggers: a standard POS tagger and a phrase POS tagger. The standard POS tagger works on single-word tokens, while the phrase POS tagger works on phrases (single or multi-word tokens).

from nlp_id.postag import PosTag
postagger = PosTag() 
postagger.get_pos_tag('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
# [('Lionel', 'NNP'), ('Messi', 'NNP'), ('pergi', 'VB'), ('ke', 'IN'), ('pasar', 'NN'), ('di', 'IN'), ('daerah', 'NN'),
#  ('Jakarta', 'NNP'), ('Pusat', 'NNP'), ('.', 'SYM')]

postagger.get_phrase_tag('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
# [('Lionel Messi', 'NP'), ('pergi', 'VP'), ('ke', 'IN'), ('pasar', 'NN'), ('di', 'IN'), ('daerah', 'NN'),
#  ('Jakarta Pusat', 'NP'), ('.', 'SYM')]
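
Both taggers return a list of (token, tag) pairs, so ordinary list operations are enough to work with the result. The following sketch hard-codes the two sample outputs from this section rather than calling nlp-id; the helper `tokens_with_tag` is a hypothetical name used only for illustration.

```python
from collections import Counter

# Sample outputs copied from the POS tagger examples in this section.
word_tagged = [('Lionel', 'NNP'), ('Messi', 'NNP'), ('pergi', 'VB'), ('ke', 'IN'),
               ('pasar', 'NN'), ('di', 'IN'), ('daerah', 'NN'),
               ('Jakarta', 'NNP'), ('Pusat', 'NNP'), ('.', 'SYM')]
phrase_tagged = [('Lionel Messi', 'NP'), ('pergi', 'VP'), ('ke', 'IN'),
                 ('pasar', 'NN'), ('di', 'IN'), ('daerah', 'NN'),
                 ('Jakarta Pusat', 'NP'), ('.', 'SYM')]


def tokens_with_tag(tagged_tokens, wanted_tags):
    """Return the tokens whose tag is in wanted_tags, preserving order."""
    wanted = set(wanted_tags)
    return [token for token, tag in tagged_tokens if tag in wanted]


print(tokens_with_tag(word_tagged, {'NNP'}))
# ['Lionel', 'Messi', 'Jakarta', 'Pusat']
print(tokens_with_tag(phrase_tagged, {'NP'}))
# ['Lionel Messi', 'Jakarta Pusat']

# Count how often each tag occurs in the word-level output.
tag_counts = Counter(tag for _, tag in word_tagged)
print(tag_counts['NNP'], tag_counts['IN'])
# 4 2
```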

Description of tagset used for POS Tagger:

| No. | Tag | Description | Example |
|---|---|---|---|
| 1 | ADV | Adverb. Includes adverbs, modals, and auxiliary verbs. | sangat, hanya, justru, boleh, harus, mesti |
| 2 | CC | Coordinating conjunction. A coordinating conjunction links two or more syntactically equivalent parts of a sentence. It can link independent clauses, phrases, or words. | dan, tetapi, atau |
| 3 | DT | Determiner/article. A grammatical unit which limits the potential referent of a noun phrase, whose basic role is to mark noun phrases as either definite or indefinite. | para, sang, si |
| 4 | FW | Foreign word. A word which comes from a foreign language and is not yet included in the Indonesian dictionary. | online, e-commerce |
| 5 | IN | Preposition. A preposition links a word or phrase with the constituent in front of the preposition, forming a prepositional phrase. | dalam, dengan, di, ke |
| 6 | JJ | Adjective. Adjectives are words which describe, modify, or specify some properties of the head noun of the phrase. | bersih, panjang, jauh, marah |
| 7 | NEG | Negation. | tidak, belum, jangan |
| 8 | NN | Noun. Nouns are words which refer to a human, animal, thing, concept, or understanding. | meja, kursi, monyet, perkumpulan |
| 9 | NNP | Proper noun. A proper noun is the specific name of a person, thing, place, event, etc. | Indonesia, Jakarta, Piala Dunia, Idul Fitri, Jokowi |
| 10 | NUM | Number. Includes cardinal and ordinal numbers. | 9876, 2019, 0,5, empat |
| 11 | PR | Pronoun. Includes personal pronouns and demonstrative pronouns. | saya, kami, kita, kalian, ini, itu |
| 12 | RP | Particle. A particle which confirms interrogative, imperative, or declarative sentences. | pun, lah, kah |
| 13 | SC | Subordinating conjunction. A subordinating conjunction links two or more clauses, one of which is a subordinate clause. | sejak, jika, seandainya, dengan, bahwa, yang |
| 14 | SYM | Symbols and punctuation. | +, %, @ |
| 15 | UH | Interjection. An interjection expresses a feeling or state of mind and has no syntactic relation with other words. | ayo, mari, aduh |
| 16 | VB | Verb. Includes transitive verbs, intransitive verbs, active verbs, passive verbs, and copulas. | tertidur, bekerja, membaca |
| 17 | WH | Question words. | siapa, apa, kapan, bagaimana |
| 18 | ADJP | Adjective phrase. A group of words headed by an adjective that describes a noun or a pronoun. | sangat tinggi |
| 19 | DP | Date phrase. A date written with whitespace. | 1 Januari 2020 |
| 20 | NP | Noun phrase. A phrase that has a noun (or indefinite pronoun) as its head. | Jakarta Pusat, Lionel Messi |
| 21 | NUMP | Number phrase. | 10 juta |
| 22 | VP | Verb phrase. A syntactic unit composed of at least one verb and its dependents. | tidak makan |
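
When post-processing tagger output, it can help to keep the tagset in a lookup table. The sketch below transcribes the tag names from the table above into a dictionary and attaches a readable name to each tagged token. The names `TAG_NAMES`, `PHRASE_TAGS`, and `describe` are hypothetical and not part of nlp-id, and treating rows 18 to 22 as the phrase-level tags is an assumption based on their descriptions.

```python
# Tag names transcribed from the tagset table (rows 1-22).
TAG_NAMES = {
    'ADV': 'Adverb', 'CC': 'Coordinating conjunction', 'DT': 'Determiner/article',
    'FW': 'Foreign word', 'IN': 'Preposition', 'JJ': 'Adjective',
    'NEG': 'Negation', 'NN': 'Noun', 'NNP': 'Proper noun', 'NUM': 'Number',
    'PR': 'Pronoun', 'RP': 'Particle', 'SC': 'Subordinating conjunction',
    'SYM': 'Symbols and punctuation', 'UH': 'Interjection', 'VB': 'Verb',
    'WH': 'Question word', 'ADJP': 'Adjective phrase', 'DP': 'Date phrase',
    'NP': 'Noun phrase', 'NUMP': 'Number phrase', 'VP': 'Verb phrase',
}

# Assumption: rows 18-22 of the table are the tags used for multi-word phrases.
PHRASE_TAGS = {'ADJP', 'DP', 'NP', 'NUMP', 'VP'}


def describe(tagged_tokens):
    """Attach a readable tag name to every (token, tag) pair."""
    return [(token, tag, TAG_NAMES.get(tag, 'unknown tag'))
            for token, tag in tagged_tokens]


sample = [('Jakarta Pusat', 'NP'), ('pergi', 'VP'), ('ke', 'IN')]
for token, tag, name in describe(sample):
    level = 'phrase-level' if tag in PHRASE_TAGS else 'word-level'
    print(f"{token}: {tag} ({name}, {level})")
```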

Stopword

nlp-id also provides a list of Indonesian stopwords.

from nlp_id.stopword import StopWord 
stopword = StopWord() 
stopword.get_stopword() 
# [{list_of_nlp_id_stopword}]    

Stopword removal removes every Indonesian stopword from the given text.

from nlp_id.stopword import StopWord 
text = "Lionel Messi pergi Ke pasar di area Jakarta Pusat" # single sentence
stopword = StopWord() 
stopword.remove_stopword(text)
# Lionel Messi pergi pasar area Jakarta Pusat  

paragraph = "Lionel Messi pergi Ke pasar di area Jakarta Pusat itu. Sedangkan Cristiano Ronaldo ke pasar Di area Jakarta Selatan. Dan mereka tidak bertemu begini-begitu."
stopword.remove_stopword(paragraph)
# Lionel Messi pergi pasar area Jakarta Pusat. Cristiano Ronaldo pasar area Jakarta Selatan. bertemu.
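
To illustrate the idea behind stopword removal, here is a minimal pure-Python sketch. It is not nlp-id's implementation: the stopword set is a tiny hypothetical sample rather than the library's list, the function name `remove_stopwords_sketch` is made up, and punctuation attached to words is not handled. The match is case-insensitive, in line with the example above where "Ke" is removed.

```python
# Hypothetical, deliberately tiny stopword set; nlp-id ships its own full list.
SAMPLE_STOPWORDS = {'ke', 'di', 'itu'}


def remove_stopwords_sketch(text, stopwords=SAMPLE_STOPWORDS):
    """Drop whitespace-separated words whose lowercase form is a stopword."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return ' '.join(kept)


text = "Lionel Messi pergi Ke pasar di area Jakarta Pusat"
print(remove_stopwords_sketch(text))
# Lionel Messi pergi pasar area Jakarta Pusat
```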

Contributors

zavliju, frandyeddy