malay-fake-news-classification

Malay Fake News Classification using:
1. CNN [3]
2. BiLSTM [4]
3. C-LSTM [5]
4. RCNN [6]
5. FT-BERT [7]
6. BERTCNN (a method unique to this project that feeds the sequence output of the last BERT layer into CNN layers).
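The BERTCNN idea in item 6 can be illustrated with a minimal pure-Python sketch. It assumes a Kim-style head [3] (filter widths 3, 4 and 5, ReLU, max-over-time pooling, then a dense softmax layer); these choices, the function names, and the random stand-in for the BERT sequence output are illustrative assumptions, not the project's confirmed architecture.

```python
import math
import random


def conv_relu_maxpool(seq_output, kernel):
    """Slide one filter over the token axis, apply ReLU, then max-pool over time."""
    width = len(kernel)
    pooled = 0.0  # ReLU floor: the pooled value can never be below 0
    for start in range(len(seq_output) - width + 1):
        window = seq_output[start:start + width]
        activation = sum(
            w * x
            for kernel_row, token_vec in zip(kernel, window)
            for w, x in zip(kernel_row, token_vec)
        )
        pooled = max(pooled, activation)
    return pooled


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bert_cnn_head(seq_output, kernels, dense_weights):
    """Per-token vectors -> one pooled feature per filter -> class probabilities."""
    features = [conv_relu_maxpool(seq_output, k) for k in kernels]
    logits = [sum(w * f for w, f in zip(row, features)) for row in dense_weights]
    return softmax(logits)


rng = random.Random(0)
SEQ_LEN, HIDDEN = 128, 8  # hidden size shrunk for the demo (BERT-base uses 768)

# Random stand-in for the last BERT layer's sequence output: SEQ_LEN x HIDDEN.
seq_output = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(SEQ_LEN)]

# One filter per width; a real model would use many filters per width.
kernels = [
    [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(width)]
    for width in (3, 4, 5)
]
dense_weights = [[rng.uniform(-1, 1) for _ in kernels] for _ in range(2)]

probs = bert_cnn_head(seq_output, kernels, dense_weights)  # two class probabilities
```

The point of the sketch is the data flow: the CNN consumes one vector per token rather than a single pooled sentence vector, so the filters can pick up local n-gram patterns in the contextual embeddings.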

The preprocessed Word2Vec files that models 1-4 depend on can be obtained from:
https://www.dropbox.com/s/pm9rrynspp16det/malay_word2vec.zip?dl=0
See my "malay-word2vec-tsne" repo for how they were preprocessed.

This project produced a filtered Malay fake news dataset, malaya_fake_news_preprocessed_dataframe.pkl, which can be downloaded
via the link in malay-fake-news-dataset.txt or from
https://www.dropbox.com/s/i5yx6e426m8frgs/malaya_fake_news_preprocessed_dataframe.pkl?dl=0.
News articles from the original dataset [1] that cannot be correctly classified by all the models are treated as outliers and filtered out.

The following Python code loads the dataset and, in a notebook or interactive session, displays it:

import pandas as pd
df_allnews_unpickled = pd.read_pickle("./malaya_fake_news_preprocessed_dataframe.pkl")
df_allnews_unpickled

Column descriptions:

  • news: Original news articles, minimally cleaned: lowercased, with a space added around specific symbols and before "hb" and "th", e.g. 4th/13hb -> 4 th/ 13 hb.
  • tokens: Words tokenized from the news column, with numbers changed from digits to ordinal spellings. (See image above)
  • rejoined: Sentences rejoined from the tokens column. Mostly used for the BERT models, as they have their own tokenizer.
  • length: Length of each article, counted in tokens.
  • label: Class label. 1 for real news, 0 for fake news.
  • real: One-hot encoding column for real news.
  • fake: One-hot encoding column for fake news.
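How the columns relate to one another can be sketched in a few lines of Python. This is an approximation under stated assumptions: whitespace tokenization, the regular expression, and the helper names `clean_minimally` and `build_row` are hypothetical, and the symbol spacing and number-spelling steps are omitted.

```python
import re


def clean_minimally(text):
    """Lowercase and split a trailing 'hb' or 'th' off the digits before it."""
    text = text.lower()
    # "4th" -> "4 th", "13hb" -> "13 hb"
    return re.sub(r"(\d+)(hb|th)\b", r"\1 \2", text)


def build_row(raw_news, label):
    """Derive every column described above from one raw article and its label."""
    news = clean_minimally(raw_news)
    tokens = news.split()  # assumption: plain whitespace tokenization
    return {
        "news": news,
        "tokens": tokens,
        "rejoined": " ".join(tokens),
        "length": len(tokens),
        "label": label,             # 1 = real, 0 = fake
        "real": int(label == 1),    # one-hot column for real news
        "fake": int(label == 0),    # one-hot column for fake news
    }


row = build_row("Mesyuarat pada 13hb Mac", label=1)
```

Here `row["length"]` counts tokens after cleaning, so splitting "13hb" into two tokens increases the length by one.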
Further information regarding this dataset can be found in the following table.


The following experiments/modifications were carried out before filtering the outliers, in order to obtain the best results and dataset:

  • Normal: All fake news articles originally from [1] are considered.
  • <1000: Only news articles with fewer than 1000 words are considered, because very few articles are longer.
  • Trunc128: All news articles are truncated to a maximum sequence length of 128 (the standard for BERT models in this project).
  • Summarized: News articles with more than 200 words are first summarized using TF-IDF scores and a Hopfield network, then truncated to a sequence length of 128. The summarization method can be found in my "article-summarization" project.
  • Filtered: All news articles that cannot be classified by all models are considered outliers and removed from the original dataset.
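Three of the variants above (<1000, Trunc128 and Filtered) can be sketched as simple list operations. The function names and toy data are hypothetical. Two assumptions are worth flagging: truncation is applied to word tokens here, whereas BERT truncates its own subword tokens, and the outlier rule is read as "drop an article when no model predicts its label correctly". If the intended rule is stricter, replace `any` with `all`. The Summarized variant is not sketched because its method lives in a separate project.

```python
MAX_WORDS = 1000    # threshold for the "<1000" variant
MAX_SEQ_LEN = 128   # sequence length for the "Trunc128" variant


def keep_under_1000(articles):
    """<1000: keep only articles with fewer than MAX_WORDS tokens."""
    return [a for a in articles if a["length"] < MAX_WORDS]


def truncate_128(articles):
    """Trunc128: cut every article to at most MAX_SEQ_LEN tokens."""
    out = []
    for a in articles:
        tokens = a["tokens"][:MAX_SEQ_LEN]
        out.append({**a, "tokens": tokens, "length": len(tokens)})
    return out


def filter_outliers(articles, predictions):
    """Filtered: keep an article if at least one model gets its label right.

    predictions maps a model name to a list of predicted labels aligned
    with articles. This reading of the outlier rule is an assumption.
    """
    return [
        a for i, a in enumerate(articles)
        if any(preds[i] == a["label"] for preds in predictions.values())
    ]


def make_article(n_words, label):
    """Build a toy article of n_words identical tokens."""
    return {"tokens": ["kata"] * n_words, "length": n_words, "label": label}


articles = [make_article(50, 1), make_article(1200, 0), make_article(300, 0)]
predictions = {"cnn": [1, 1, 1], "bilstm": [1, 0, 1]}  # made-up model outputs

short_only = keep_under_1000(articles)
truncated = truncate_128(articles)
filtered = filter_outliers(articles, predictions)
```

In this toy run the third article is removed by `filter_outliers` because both made-up models predict 1 while its label is 0.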


Disclaimer: The "how-to" files may display some outdated results, but the process and methodology they show are accurate.

The work done in this project is part of the following publication:
"A Benchmark Evaluation Study for Malay Fake News Classification Using Neural Network Architectures"
Published in Kazan Digital Week 2020. Methodical and Informational Science Journal, Vestnik NTsBZhD(4), pp. 5-13, 2020.
https://ncbgd.tatarstan.ru/rus/file/pub/pub_2610566.pdf
http://www.vestnikncbgd.ru/index.php?id=1&lang=en
https://kazandigitalweek.com/

The original dataset, toolkit and pre-trained BERT model are provided by:
[1] Zolkepli, Husein. “Malay-Dataset.” Github-huseinzol05/Malay-Dataset: Text corpus for Bahasa Malaysia. https://github.com/huseinzol05/Malay-Dataset
[2] Zolkepli, Husein. “Malaya.” Github-huseinzol05/Malaya: Natural-Language-Toolkit for Bahasa Malaysia. https://github.com/huseinzol05/Malaya

The chosen model architectures for this project are applications of the following papers:
[3] Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).
[4] Nowak, Jakub, Ahmet Taspinar, and Rafał Scherer. "LSTM recurrent neural networks for short text and sentiment classification." In International Conference on Artificial Intelligence and Soft Computing, pp. 553-562. Springer, Cham, 2017.
[5] Zhou, Chunting, Chonglin Sun, Zhiyuan Liu, and Francis Lau. "A C-LSTM neural network for text classification." arXiv preprint arXiv:1511.08630 (2015).
[6] Lai, Siwei, Liheng Xu, Kang Liu, and Jun Zhao. "Recurrent convolutional neural networks for text classification." In Twenty-ninth AAAI conference on artificial intelligence. 2015.
[7] Devlin, Jacob. “Github-google-research/bert: TensorFlow code and pre-trained models for BERT.” Github.com. https://github.com/google-research/bert (accessed March 09, 2020).

Contributors: asyrafazlan
