
rafaykhattak / toxiscan


ToxiScan is a text analysis tool that leverages the power of Natural Language Toolkit (NLTK) and the Naive Bayes classifier to determine the presence of toxicity in textual data.

Home Page: https://rafaykhattak-toxiscan-app-40qikl.streamlit.app/

Python 100.00%
machine-learning naive-bayes-classifier nltk supervised-learning text-classification text-preprocessing


ToxiScan

ToxiScan is a text analysis tool for detecting toxicity in textual data. It combines the Natural Language Toolkit (NLTK), scikit-learn's TfidfVectorizer, and a Naive Bayes classifier to predict whether a given text is toxic or non-toxic. A simple user interface built with Streamlit makes toxicity analysis easily accessible to users.


Key Features

  • Toxicity Detection: ToxiScan uses a Naive Bayes classifier, trained on a dataset of labeled toxic and non-toxic text, to predict whether a given text is toxic.
  • Text Preprocessing: ToxiScan uses NLTK, a natural language processing library, to prepare the input text for analysis. Preprocessing covers tokenization, part-of-speech tagging, lemmatization, and stopword removal.
  • Feature Extraction: TfidfVectorizer transforms the preprocessed text into numerical TF-IDF feature vectors, which the Naive Bayes classifier uses to make predictions.
  • Model Evaluation: To assess the classifier, ToxiScan uses roc_auc_score and roc_curve, which show how well the model separates toxic from non-toxic text across decision thresholds.
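
The feature extraction, classification, and evaluation steps above can be sketched with scikit-learn as follows. This is a minimal illustration and not the project's actual training code: the toy texts and labels are invented, MultinomialNB is assumed as the Naive Bayes variant, and the vectorizer's built-in English stopword list stands in for the NLTK preprocessing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; 1 = toxic, 0 = non-toxic.
train_texts = [
    "you are a stupid idiot",
    "i hate you so much",
    "shut up you fool",
    "what a pathetic loser",
    "have a wonderful day",
    "thank you for your help",
    "this tutorial is great",
    "see you at the meeting",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF turns each text into a numeric vector; Naive Bayes classifies it.
# MultinomialNB is an assumption; the README only says "Naive Bayes".
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

# Held-out toy examples, also invented.
test_texts = ["you stupid loser", "thank you, great day"]
test_labels = [1, 0]

# Column 1 of predict_proba is P(toxic), because classes_ is sorted as [0, 1].
toxic_prob = model.predict_proba(test_texts)[:, 1]

# ROC AUC summarizes ranking quality; roc_curve gives the points to plot.
auc = roc_auc_score(test_labels, toxic_prob)
fpr, tpr, thresholds = roc_curve(test_labels, toxic_prob)
print(f"ROC AUC on the toy held-out texts: {auc:.2f}")
```

On real data the held-out set would come from a train/test split of the labeled dataset, and the ROC curve would be plotted from fpr and tpr.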

Training Data

The training data for ToxiScan comes from Kaggle, specifically the "Toxic Tweets Dataset" created by ASHWIN U IYER. The dataset is a collection of tweets labeled as toxic or non-toxic, which serve as training examples for the Naive Bayes classifier. From these examples the model learns patterns and features indicative of toxicity in text inputs.

Installation

To run ToxiScan on your local machine, follow these steps:

  1. Clone the repository:
       git clone https://github.com/<username>/<repository>.git
       cd <repository>
  2. Install the required dependencies:
       pip install -r requirements.txt
  3. Launch the ToxiScan application:
       streamlit run toxiscan.py
  4. Access ToxiScan in your web browser:
       http://localhost:8501

Usage

  1. Input Text: Enter the text you want to analyze for toxicity in the provided text input box.
  2. Analyze: Click the "Analyze" button to trigger the toxicity prediction process.
  3. Result: ToxiScan will display the prediction result, indicating whether the text is classified as toxic or non-toxic.

Dependencies

ToxiScan utilizes the following libraries and resources:

  • NLTK - Natural Language Toolkit for text preprocessing.
  • Scikit-learn - Machine learning library for feature extraction and classification.
  • Streamlit - Framework for building interactive web applications.
