fake_news_detection

A machine learning model built to predict whether a piece of text in Portuguese is real or fake.

This project was built in two weeks as the final project for the 'Programming for Biosciences' (CFB017) undergraduate course at the Federal University of Rio de Janeiro, taught by Professor Pedro Henrique Monteiro Torres.

I took this opportunity to learn about natural language processing (NLP) in machine learning, an area I'm very interested in but have no experience with.

Usage

This project was built with Python 3.9.7 and therefore may not work with older versions of Python.
All the necessary third-party packages are listed in the 'requirements.txt' file. Use 'pip install -r requirements.txt' to install them.

The model can be utilized through the 'main.py' file.
The input can be given directly through the command line or using one or more '.txt' files.
All the input files must have the '.txt' extension or an error will be raised.

Command line:
python main.py -t "your text here"

Text files:
python main.py -f ./your_text.txt
python main.py -f ./*.txt
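
A minimal sketch of how the two input modes above could be parsed with the standard library's argparse. This is not the project's actual 'main.py': the long option names, the function names, and the choice to make '-t' and '-f' mutually exclusive are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two input modes: direct text or .txt files."""
    parser = argparse.ArgumentParser(
        description="Classify Portuguese text as real or fake."
    )
    # Assumption: exactly one of the two modes must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-t", "--text", help="text given directly on the command line")
    group.add_argument("-f", "--files", nargs="+", help="one or more .txt files")
    return parser


def collect_texts(args: argparse.Namespace) -> list[str]:
    """Return the texts to classify, rejecting files without a .txt extension."""
    if args.text is not None:
        return [args.text]
    texts = []
    for name in args.files:
        path = Path(name)
        if path.suffix.lower() != ".txt":
            raise ValueError(f"input file must have a .txt extension: {name}")
        texts.append(path.read_text(encoding="utf-8"))
    return texts
```

For example, `collect_texts(build_parser().parse_args(["-t", "your text here"]))` returns a one-element list with the given text. With `-f ./*.txt` the shell expands the pattern before Python sees it, which is why `nargs="+"` accepts several file names.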

Making the project

Planning

To create such a model there are two major factors to be considered:

  • The media that contains the 'fake news' (e.g. text, images, videos, etc).
  • The vehicle for the dissemination of the 'fake news' (e.g. a specific social media platform, newspapers, etc.). In this model I have chosen to use text disseminated through digital news articles.

Creating a training dataset

The training dataset was collected by web scraping the following websites:

Building the model

I chose to use the "Bag-of-words" representation. The text was processed by a pipeline with the following NLP methods:

The result of this pipeline is then given as input to an ML estimator. I compared the accuracy of 5 estimators, using GridSearchCV to find the best parameter configuration for each:

The best result, an accuracy of 85.7%, was achieved with the SGDClassifier and the parameters alpha=0.01, loss='log', random_state=17.
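
To illustrate the "Bag-of-words" representation mentioned above, here is a minimal standard-library sketch. It is not the project's pipeline: the tiny stopword list, the regex tokenizer, and the function names are all hypothetical, chosen only to show how texts become count vectors that an estimator can consume.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (hypothetical; a real pipeline would use a
# complete Portuguese stopword list).
STOPWORDS = {"a", "o", "de", "que", "e", "do", "da", "em", "um", "uma", "os", "as", "para"}


def tokenize(text: str) -> list[str]:
    """Normalize to lowercase and split the text into word tokens."""
    return re.findall(r"\w+", text.lower())


def bag_of_words(text: str) -> Counter:
    """Count the tokens of a text, ignoring stopwords."""
    return Counter(token for token in tokenize(text) if token not in STOPWORDS)


def vectorize(texts: list[str]) -> tuple[list[str], list[list[int]]]:
    """Map each text to a count vector over the shared, sorted vocabulary."""
    bags = [bag_of_words(text) for text in texts]
    vocabulary = sorted(set().union(*bags))
    return vocabulary, [[bag[word] for word in vocabulary] for bag in bags]


vocabulary, vectors = vectorize(
    ["O governo anunciou a vacina", "A vacina falsa do governo"]
)
# vocabulary → ['anunciou', 'falsa', 'governo', 'vacina']
# vectors    → [[1, 0, 1, 1], [0, 1, 1, 1]]
```

Word order is discarded: each text is reduced to how often each vocabulary word occurs, which is the defining trade-off of this representation.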

Further improvements

This project is not novel, and similar projects can be found, such as FakeCheck and the Fake.Br corpus.
As this was my first foray into NLP and an assignment for one of my undergraduate classes, I chose to build both the training dataset and the model from scratch. Here are some considerations that could improve the accuracy of the model.

  • Utilizing a better training dataset

    The training dataset used was fairly small, as I had a limited amount of time to collect, clean and manually classify each text. Having a bigger or more refined dataset would probably increase the accuracy of the model.

    The dataset could also be more diverse and better balanced. Some of the sources had only 'real' or only 'fake' texts. Since different sources have different writing styles, the model may learn to recognize the source rather than the veracity of the text, which is a form of overfitting. This could be resolved by collecting the same number of 'real' and 'fake' texts from each source.

  • Using other preprocessing techniques

    In this model's pipeline, the only preprocessing applied to the input is stopword removal, tokenization and normalization. I tested stemming with RSLP, but it resulted in a lower accuracy. Still, there are other techniques that could be used, such as lemmatization, which I have not tested.

  • Selecting more features from the text

    This model only uses the 'Bag-of-words' representation, but there are other ways to select features with NLP, such as POS tags, semantic classes, pauses, emotiveness, etc.

  • Testing other classification estimators

    As I'm new to the field of NLP and ML, I only tested five different estimators with a limited number of parameter permutations. There may be other estimators, such as neural networks, or combinations of parameters that yield a model with a higher accuracy.
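
As an illustration of the dataset balancing idea from the first consideration above, the following sketch downsamples a dataset so that every source contributes the same number of 'real' and 'fake' texts. The (source, label, text) tuple layout, the function name and the seed are assumptions made for this example, not the project's actual data format.

```python
import random
from collections import defaultdict


def balance_by_source(samples, seed=17):
    """Downsample so each source contributes equally many 'real' and 'fake' texts.

    `samples` is a list of (source, label, text) tuples (a hypothetical layout).
    Sources that lack one of the two labels are dropped entirely.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    grouped = defaultdict(lambda: defaultdict(list))
    for source, label, text in samples:
        grouped[source][label].append((source, label, text))

    balanced = []
    for by_label in grouped.values():
        real = by_label.get("real", [])
        fake = by_label.get("fake", [])
        keep = min(len(real), len(fake))  # size of the smaller class
        balanced.extend(rng.sample(real, keep))
        balanced.extend(rng.sample(fake, keep))
    return balanced
```

Downsampling shrinks an already small dataset, so in practice collecting more texts for the under-represented class of each source may be preferable when time allows.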
