Natural Language Processing (Data Analysis on Text Data)

Performing data analysis on text data.

1. Tokenization

Tokenization is the process of splitting text into meaningful fragments, such as words or sentences. These pieces are called tokens.
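
As a rough illustration, word-level tokenization can be sketched with a regular expression from the standard library. The `tokenize` function and the sample sentence are hypothetical and are not taken from this repository, whose own code may well use an NLP library instead.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens.

    A token here is a run of letters or digits, optionally followed by an
    apostrophe and more letters (so "it's" stays in one piece).
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

# Hypothetical sample text, in the style of a course title.
tokens = tokenize("Learn Python: it's easy, fast & fun!")
print(tokens)  # ['learn', 'python', "it's", 'easy', 'fast', 'fun']
```

Punctuation and symbols are dropped in this sketch; a real tokenizer usually has to decide case by case whether to keep them.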

2. Stop words Removal

Stop words are words that are commonly filtered out before processing natural-language text. They are the most common words in a language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. Examples of stop words in English are “the”, “a”, “an”, “so”, and “what”.

Removing stop words reduces the dataset size and lets the analysis focus on the words that carry the most information.
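
A minimal sketch of the idea, assuming the text has already been tokenized. The `STOP_WORDS` set below is a small hand-picked sample for illustration only; real stop word lists are much longer.

```python
# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"the", "a", "an", "so", "what", "to", "of", "in", "and", "is", "for"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["the", "complete", "guide", "to", "investing", "in", "the", "stock", "market"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['complete', 'guide', 'investing', 'stock', 'market']
```

Here nine tokens shrink to five, and the remaining ones are the content-bearing words.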

3. Stemming

Stemming is one of the most common text preprocessing operations. It reduces a word to its stem or root by removing part of the word, usually a suffix. Different stemming algorithms, such as the Porter stemmer, apply different rules to decide how much of a word to chop off.
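
To make the "chopping" concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm or any other published stemmer; the suffix list and the `simple_stem` name are assumptions made for illustration.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "ly", "s")

def simple_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["trading", "courses", "played", "design"]
stems = [simple_stem(w) for w in words]
print(stems)  # ['trad', 'cours', 'play', 'design']
```

Note that "trad" and "cours" are not dictionary words, which is typical of stemming: the result only has to be a consistent key for related word forms.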

4. Lemmatization

Lemmatization is similar to stemming in that it reduces inflected words to a base form, but it differs in making sure that the resulting root word (called the lemma) is an actual word of the language.

As a result, lemmatization is generally slower than stemming: it relies on a vocabulary lookup and needs the part of speech of each word to pick the correct lemma.
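
The role of the part of speech can be sketched with a toy lookup table. The `LEMMAS` dictionary and the `lemmatize` function are hypothetical; a real lemmatizer draws on a full lexical database rather than a handful of entries.

```python
# Hypothetical lemma table keyed by (word, part of speech).
LEMMAS = {
    ("better", "adj"): "good",
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
    ("running", "verb"): "run",
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
    ("was", "verb"): "be",
}

def lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(lemmatize("meeting", "noun"))  # meeting
print(lemmatize("meeting", "verb"))  # meet
print(lemmatize("better", "adj"))    # good
```

The same surface form "meeting" maps to two different lemmas depending on its part of speech, which a suffix-stripping stemmer cannot distinguish.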

5. TF-IDF (Word Cloud)

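TF-IDF (term frequency-inverse document frequency) is a statistical measure of how relevant a word is to a document within a collection of documents. The score rises the more often the word appears in the document and falls the more documents in the collection contain it.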
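
One common variant of the measure multiplies the relative term frequency by log(N / df), where N is the number of documents and df is the number of documents containing the term. The sketch below assumes that variant; libraries often apply extra smoothing or normalization, so their numbers will differ. The `tf_idf` function and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    tf  = count of term in doc / number of tokens in doc
    idf = log(number of docs / number of docs containing term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical tokenized course titles.
docs = [
    ["python", "for", "beginners"],
    ["guitar", "for", "beginners"],
    ["python", "web", "development"],
]
scores = tf_idf(docs)
for term, score in sorted(scores[1].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

In the second document, "guitar" scores highest because it appears in no other document, while "for" and "beginners" are shared with the first document and score lower.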

6. N-Grams

An n-gram is a contiguous sequence of n items from a text. Here the items are words, but depending on the use case they could also be letters, syllables, or, in the case of speech, phonemes.
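
Word n-grams can be produced by sliding a window of size n over the token list. The `ngrams` helper and the sample tokens below are illustrative assumptions, not code from this repository.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["learn", "web", "development", "fast"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
# [('learn', 'web'), ('web', 'development'), ('development', 'fast')]
print(trigrams)
# [('learn', 'web', 'development'), ('web', 'development', 'fast')]
```

If n is larger than the number of tokens, the function returns an empty list.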

7. VSM (Vector Space Model)

VSM is an algebraic model that represents text documents as vectors, used in information retrieval, NLP, and text mining.

Representing documents in VSM is called "vectorizing text". The resulting representation captures how many documents contain a term and which terms are important in each document.
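
A minimal sketch of vectorizing text, assuming raw term counts as the vector weights (TF-IDF weights are a common alternative). Each document becomes a vector over a shared vocabulary, and cosine similarity compares two vectors. The `vectorize` and `cosine` helpers and the sample documents are hypothetical.

```python
import math
from collections import Counter

def vectorize(docs):
    """Map tokenized documents to term-count vectors over a shared vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [
    ["python", "web", "development"],
    ["web", "design", "basics"],
    ["guitar", "basics"],
]
vocab, vectors = vectorize(docs)
print(vocab)       # ['basics', 'design', 'development', 'guitar', 'python', 'web']
print(vectors[0])  # [0, 0, 1, 0, 1, 1]
print(round(cosine(vectors[0], vectors[1]), 3))  # 0.333
print(cosine(vectors[0], vectors[2]))            # 0.0
```

The first two documents share the term "web" and so have a non-zero similarity, while the first and third share no terms and score zero.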

Dataset Source

Author: Larxel

URL: www.kaggle.com/datasets/andrewmvd/udemy-courses

Dataset Description:

This dataset contains records of courses from 4 subjects (Business Finance, Graphic Design, Musical Instruments and Web Development) taken from Udemy. Udemy is a massive open online course (MOOC) platform that offers both free and paid courses. Anybody can create a course, a business model that has allowed Udemy to host hundreds of thousands of courses. This version modifies column names, removes empty columns, and aggregates everything into a single CSV file for ease of use.
