Performing data analysis on text data.
Tokenization is the procedure of splitting text into a set of meaningful fragments. These pieces are called tokens.
Stop words are the words which are commonly filtered out before processing a natural language. These are the most common words in any language (like articles, prepositions, pronouns, conjunctions, etc) and does not add much information to the text. Examples of a few stop words in English are “the”, “a”, “an”, “so”, “what”.
Removal of stop words certainly reduces the dataset size and gives more emphasis to the crucial information for detailed analysis.
Stemming is one of the most common data pre-processing operations we do in text preprocessing. Stemming is the process of removing a part of a word or reducing a word to its stem or root. We use a few algorithms to decide how to chop a word off.
Lemmatization is like stemming in reducing inflected words to their word stem but differs in the way that it makes sure the root word (also called as lemma) belongs to the language.
As a result, this one is generally slower than stemming process. Since lemmatization requires the part of speech, it is a less efficient approach than stemming.
TF-IDF is a statistical measure that evaluates how related a word is to a document in a set of documents.
An n-gram is a contiguous sequence of n items in the text. In our case, we will be dealing with words being the item, but depending on the use case, it could be even letters, syllables, or sometimes in the case of speech, phonemes.
VSM is a statistical model for representing text information for Information Retrieval, NLP, Text Mining.
Representing documents in VSM is called "vectorizing text" contains the following information: how many documents contain a term, and what are important terms each document has.
Author: Larxel
URL: www.kaggle.com/datasets/andrewmvd/udemy-courses
This dataset contains records of courses from 4 subjects (Business Finance, Graphic Design, Musical Instruments and Web Development) taken from Udemy. Udemy is a massive online open course (MOOC) platform that offers both free and paid courses. Anybody can create a course, a business model by which allowed Udemy to have hundreds of thousands of courses. This version modifies column names, removes empty columns, and aggregates everything into a single csv file for ease of use.