Natural Language Processing (Data Analysis on Text Data)

This repository performs data analysis on text data using the natural language processing techniques listed below.

1. Tokenization

Tokenization is the procedure of splitting text into a sequence of meaningful fragments, such as words or punctuation marks. These pieces are called tokens.
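
As a minimal sketch of word-level tokenization, the snippet below uses only Python's standard `re` module. The `tokenize` helper and the sample title are made up for illustration and are not taken from this repository or the dataset; libraries such as NLTK or spaCy offer more robust tokenizers.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    # \w+ matches a run of letters, digits, or underscores; the optional
    # apostrophe group keeps contractions such as "don't" in one token.
    return re.findall(r"\w+(?:'\w+)?", text.lower())

title = "Learn Python: Don't Skip the Basics!"  # hypothetical course title
tokens = tokenize(title)
print(tokens)
# ['learn', 'python', "don't", 'skip', 'the', 'basics']
```

Punctuation is discarded here; whether to keep it as separate tokens depends on the analysis that follows.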

2. Stop words Removal

Stop words are words that are commonly filtered out before processing natural-language text. They are the most common words in a language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. Examples of stop words in English are “the”, “a”, “an”, “so”, and “what”.

Removing stop words reduces the size of the dataset and puts more emphasis on the words that carry the information needed for detailed analysis.
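
A minimal sketch of stop word filtering, assuming the text has already been tokenized. The `STOP_WORDS` set is a tiny hand-picked list for illustration only; real projects usually take a fuller list from a library such as NLTK.

```python
# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "so", "what", "and", "of", "to", "in", "for"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving order."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["the", "complete", "guide", "to", "web", "development", "for", "beginners"]
filtered = remove_stop_words(tokens)
print(filtered)
# ['complete', 'guide', 'web', 'development', 'beginners']
```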

3. Stemming

Stemming is one of the most common operations in text preprocessing. It reduces a word to its stem or root by removing part of the word, typically a suffix. Different stemming algorithms apply different rules to decide where to cut the word, and the resulting stem is not guaranteed to be a valid word.
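
To illustrate the idea of rule-based suffix stripping, here is a deliberately naive stemmer. It is a toy, not the Porter or Snowball algorithm, and the `SUFFIXES` list and `simple_stem` helper are assumptions made up for this sketch.

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ed", "es", "s")

def simple_stem(word, min_stem_len=3):
    """Strip one known suffix, provided the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["designing", "designed", "designs", "courses", "studies"]:
    print(w, "->", simple_stem(w))
# designing -> design
# designed -> design
# designs -> design
# courses -> cours
# studies -> studi
```

The last two results show that a stem such as "cours" or "studi" need not be a dictionary word.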

4. Lemmatization

Lemmatization is like stemming in that it reduces inflected words to a base form, but it differs in making sure that the resulting root word (called the lemma) belongs to the language.

Because it relies on a vocabulary lookup and usually needs the word's part of speech, lemmatization is generally slower than stemming.
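
The sketch below shows why the part of speech matters, using a tiny hand-written lookup table. `LEMMA_TABLE` and `lemmatize` are hypothetical names for illustration; a real lemmatizer (for example, one backed by WordNet) uses a full dictionary and morphological rules.

```python
# Hypothetical mini-dictionary mapping (word, part of speech) to a lemma.
LEMMA_TABLE = {
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
    ("better", "adj"): "good",
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
    ("courses", "noun"): "course",
}

def lemmatize(word, pos):
    """Look up the lemma for a word and part of speech.

    Words missing from the table are returned lowercased and unchanged.
    """
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lemmatize("meeting", "verb"))  # meet
print(lemmatize("meeting", "noun"))  # meeting
print(lemmatize("better", "adj"))    # good
```

Unlike a stemmer, every value returned from the table is a valid word, and the same surface form can map to different lemmas depending on its part of speech.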

5. TF-IDF (Word Cloud)

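TF-IDF (term frequency-inverse document frequency) is a statistical measure of how relevant a word is to a document within a collection of documents. The score increases with how often the word appears in the document and decreases with the number of documents that contain the word.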
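
A minimal sketch of one common TF-IDF variant, where the term frequency is the count divided by the document length and the inverse document frequency is log(N / df). Libraries such as scikit-learn use smoothed variants, so their numbers will differ. The `tf_idf` helper and the sample documents are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores for a list of tokenized, non-empty documents.

    Returns one dict per document, mapping each term to its score.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    ["python", "web", "course"],
    ["guitar", "course"],
    ["python", "data", "course"],
]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

In this sample, "course" appears in every document, so its score is zero everywhere, while "guitar" appears in only one document and scores highest there. Weights like these can be used, for instance, to size the words in a word cloud.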

6. N-Grams

An n-gram is a contiguous sequence of n items in the text. Here the items are words, but depending on the use case they could also be letters, syllables, or, in the case of speech, phonemes.
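
A minimal sketch of extracting word n-grams by sliding a window of length n over a token list. The `ngrams` helper and the sample tokens are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-item tuples from a list of tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A list shorter than n yields no n-grams (the range is empty).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["learn", "web", "development", "fast"]
print(ngrams(tokens, 2))
# [('learn', 'web'), ('web', 'development'), ('development', 'fast')]
print(ngrams(tokens, 3))
# [('learn', 'web', 'development'), ('web', 'development', 'fast')]
```

Passing a string instead of a list of words yields character n-grams with the same function.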

7. VSM (Vector Space Model)

VSM is an algebraic model that represents text documents as vectors of term weights. It is used in information retrieval, NLP, and text mining.

Representing documents in a VSM is called "vectorizing text". The resulting representation captures how many documents contain a term and which terms are important in each document.
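
A minimal sketch of the idea, assuming raw term counts as the vector weights (TF-IDF weights are a common alternative). Each document becomes a vector over a shared vocabulary, and cosine similarity compares two documents. The `vectorize` and `cosine_similarity` helpers and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def vectorize(docs):
    """Turn tokenized documents into term-count vectors over a shared vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    ["python", "web", "course"],
    ["python", "data", "course"],
    ["guitar", "lessons"],
]
vocab, vectors = vectorize(docs)
print(vocab)
print(cosine_similarity(vectors[0], vectors[1]))  # two shared terms
print(cosine_similarity(vectors[0], vectors[2]))  # no shared terms
```

The first two documents share two of their three terms, so their similarity is well above zero, while the third shares no terms with the first and scores 0.0.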

Dataset Source

Author: Larxel

URL: www.kaggle.com/datasets/andrewmvd/udemy-courses

Dataset Description:

This dataset contains records of courses from 4 subjects (Business Finance, Graphic Design, Musical Instruments, and Web Development) taken from Udemy. Udemy is a massive open online course (MOOC) platform that offers both free and paid courses. Anybody can create a course, a business model that has allowed Udemy to host hundreds of thousands of courses. This version modifies column names, removes empty columns, and aggregates everything into a single CSV file for ease of use.
