
bphigh_at_dravidian_lang_tech_acl-2022's Introduction

BpHigh@TamilNLP-ACL2022: Effects of Data Augmentation on Indic-Transformer based classifier for Abusive Comments Detection in Tamil

This repository contains the Kaggle Notebook kernels used for model training, other experimental kernels, and the kernels used for data augmentation.


Background

  • The shared task on Abusive Comment Detection in Tamil at ACL 2022 is a comment classification problem, more specifically a multi-class text classification problem over comments in native Tamil script and Tamil-English code-mixed text.

  • Given a YouTube comment, the systems submitted by the participants should classify it into one of the abusive categories.

  • The participants were provided with training, development, and test datasets in Tamil and Tamil-English.

  • The dataset is labelled with the following classes: Homophobia, Misandry, Counter-speech, Misogyny, Xenophobia, Transphobic, and Hope speech.

  • Each row of the dataset contains the comment text and the label assigned to that comment.

Contributors

  • Bhavish Pahwa 🏄‍♂️ (GitHub)

Methodology

  • We build a classifier that uses the MuRIL Transformer as its embedding layer (with all layers frozen) and attach a classifier head made of convolution layers followed by dense layers. The final dense output layer uses softmax activation to produce the predictions.

  • We use two data augmentation approaches to improve our model performance.

  • We define an equation that determines how many augmented samples to generate for each class, so that our augmentation approaches produce a balanced version of the original shared task dataset.

  • We use the NlpAug library, which provides methods for word-level augmentation using contextual models as well as non-contextual word embeddings such as Word2vec, fastText, and GloVe.

Classifier Structure

Equation used for deciding the number of samples to augment for each class

$$M(i) = \left\lfloor \frac{\max_{j \in L} N(j)}{N(i)} \right\rfloor, \quad i \in L$$

  • The equation above defines the multiplier value M used while generating the augmented sentences. M(i) is the factor by which the number of occurrences of label i should change, N(i) is the number of occurrences of label i (also called the value count of the label), and L is the set of class labels.

  • In words, the multiplier value M(i) for label i equals the floor division of the value count of the most frequent label by the value count of label i.
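
The multiplier rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the kernels: the function name `augmentation_multipliers` and the toy label counts are assumptions made up for this example.

```python
from collections import Counter


def augmentation_multipliers(labels):
    """Return {label: M(label)}, where M(i) = max_j N(j) // N(i)."""
    counts = Counter(labels)           # N(i): value count of each label
    max_count = max(counts.values())   # value count of the most frequent label
    return {label: max_count // n for label, n in counts.items()}


# Made-up label distribution, for illustration only (not the shared task data).
toy_labels = ["Misandry"] * 10 + ["Misogyny"] * 3 + ["Xenophobia"] * 5
multipliers = augmentation_multipliers(toy_labels)
print(multipliers)
```

With these toy counts, the most frequent label gets a multiplier of 1 (10 // 10), while the rarer labels get 3 (10 // 3) and 2 (10 // 5). Because of the floor division, the resulting class sizes are close to balanced rather than exactly equal.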

Data Augmentation Approaches Used

  • We use the MuRIL Transformer again, this time as a "Contextual Word Embedding Augmenter", to generate word-level augmented sentences. We then train our classifier on the resulting balanced version of the training dataset.

  • We use the IndicNLP tokenizer for Indian languages to pre-process the input sentences and the Tamil fastText model from the IndicNLP suite as a "Word Embeddings Augmenter" to generate word-level augmented sentences. We then train our classifier on the resulting balanced version of the training dataset.
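
Both approaches share the same balancing loop and differ only in the augmenter. The sketch below shows one way that loop might look; it is not taken from the kernels. The names `balance_with_augmentation` and `toy_augment` are hypothetical, and the choice to generate M(i) - 1 new sentences per original sentence (so each class grows to roughly M(i) times its original size) is an assumption about how the multiplier is applied. In practice the `augment` callable would wrap an NlpAug word-level augmenter instead of the stand-in used here.

```python
from collections import Counter


def balance_with_augmentation(rows, augment):
    """Return rows plus augmented copies so that classes are roughly balanced.

    rows:    list of (text, label) pairs
    augment: callable mapping a text to one augmented text
    """
    counts = Counter(label for _, label in rows)
    max_count = max(counts.values())
    balanced = list(rows)
    for text, label in rows:
        multiplier = max_count // counts[label]  # M(i) for this row's label
        # Assumption: M(i) - 1 new sentences per original sentence.
        balanced.extend((augment(text), label) for _ in range(multiplier - 1))
    return balanced


def toy_augment(text):
    # Stand-in for a real word-level augmenter; it only tags the text.
    return text + " (augmented)"


# Made-up rows, for illustration only.
toy_rows = (
    [("comment a", "Misandry")] * 4
    + [("comment b", "Misogyny")] * 2
    + [("comment c", "Xenophobia")]
)
balanced_rows = balance_with_augmentation(toy_rows, toy_augment)
balanced_counts = Counter(label for _, label in balanced_rows)
print(balanced_counts)
```

Passing the augmenter as a callable keeps the balancing logic identical for the contextual (MuRIL) and non-contextual (fastText) approaches.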

Results

