bp-high / bphigh_at_dravidian_lang_tech_acl-2022

Kernels used for model training using Kaggle Notebooks and other experimental kernels as well as Kernels used for data augmentation purposes.

License: MIT License

Jupyter Notebook 100.00%

BpHigh@TamilNLP-ACL2022: Effects of Data Augmentation on Indic-Transformer based classifier for Abusive Comments Detection in Tamil

`Background`

The shared task on abusive comment detection in Tamil at ACL 2022 is a multi-class text classification problem over comments written in native Tamil script and in Tamil-English code-mixed text.
Given a YouTube comment, a participating system should classify it into one of the task's categories.
Participants were provided with training, development, and test datasets in Tamil and Tamil-English.
The dataset is tagged with the following classes: Homophobia, Misandry, Counter-speech, Misogyny, Xenophobia, Transphobic, and Hope speech.
Each row of the dataset contains the comment text and the label assigned to that comment.

`Contributors`

Bhavish Pahwa 🏄‍♂️ (GitHub)

`Methodology`

We build a classifier that uses the MuRIL Transformer as its embedding layer (all layers frozen) and attach a classifier head made of convolution layers followed by dense layers. The final dense layer uses a softmax activation, which produces the class predictions.
We use two data augmentation approaches to improve model performance.
We define an equation that determines how many augmented samples to generate per class, so that the augmentation approaches yield a balanced version of the original shared task dataset.
We rely on the NlpAug library, which provides methods for word-level augmentation using contextual models as well as non-contextual word embeddings such as Word2vec, fastText, and GloVe.

`Equation used for deciding the number of samples to augment for each class`

$$M_i = \left\lfloor \frac{\max_{j \in L} N_j}{N_i} \right\rfloor$$

The above equation gives the multiplier value M used while generating the augmented sentences. M is the factor by which the number of occurrences of a label should change, N is the number of occurrences of a label (its value count), and L is the set of class labels.
In words, the multiplier M(i) for label i equals the floor division of the value count of the most frequent label by the value count of label i.
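The rule described above can be illustrated with a minimal sketch in plain Python. The function name `augmentation_multipliers` and the toy label list are assumptions made for illustration; they are not taken from the repository's notebooks or from the shared-task data.

```python
from collections import Counter


def augmentation_multipliers(labels):
    """Return M(i) = floor(max_j N(j) / N(i)) for every label i."""
    counts = Counter(labels)          # N(i): value count of each label
    max_count = max(counts.values())  # value count of the most frequent label
    return {label: max_count // n for label, n in counts.items()}


# Hypothetical toy label column, not the shared-task data.
toy_labels = ["Misogyny"] * 2 + ["Xenophobia"] * 5 + ["Homophobia"] * 10
multipliers = augmentation_multipliers(toy_labels)
print(multipliers)  # {'Misogyny': 5, 'Xenophobia': 2, 'Homophobia': 1}
```

The most frequent label always gets a multiplier of 1, so it is left unchanged, while rarer labels get proportionally larger multipliers.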

`Data Augmentation Approaches Used`

We use the MuRIL Transformer again, this time as a "Contextual Word Embedding Augmenter", to generate word-level augmented sentences. We then train our classifier on the resulting balanced version of the training dataset.
We use the IndicNLP tokenizer for Indian languages to pre-process the input sentences and the Tamil fastText model from the IndicNLP suite as a "Word Embeddings Augmenter" to generate word-level augmented sentences. We then train our classifier on the resulting balanced version of the training dataset.
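How a balanced training set might be assembled from either augmenter can be sketched as follows. This is an illustration under stated assumptions, not the notebooks' actual code: `balance_with_augmentation` and `placeholder_augment` are hypothetical names, the augmenter is any callable that returns `n` variants of a text (in practice this could wrap an NlpAug augmenter), and the sketch assumes that a multiplier M(i) means each original sample of label i receives M(i) - 1 augmented variants.

```python
from collections import Counter


def balance_with_augmentation(rows, augment):
    """Balance (text, label) rows using the floor-division multiplier.

    `augment` is any callable (text, n) -> list of n augmented texts.
    Assumption: a multiplier M(i) adds M(i) - 1 variants per original row.
    """
    counts = Counter(label for _, label in rows)
    max_count = max(counts.values())
    balanced = list(rows)  # keep every original row
    for text, label in rows:
        extra = max_count // counts[label] - 1
        if extra > 0:
            balanced.extend((new_text, label) for new_text in augment(text, extra))
    return balanced


def placeholder_augment(text, n):
    """Stand-in augmenter that only tags the text; a real one would rewrite words."""
    return [f"{text} [aug {k}]" for k in range(1, n + 1)]


# Hypothetical toy rows, not the shared-task data.
toy_rows = (
    [(f"comment {k}", "Misandry") for k in range(4)]
    + [(f"comment {k}", "Xenophobia") for k in range(4, 6)]
    + [("comment 6", "Misogyny")]
)
balanced_rows = balance_with_augmentation(toy_rows, placeholder_augment)
balanced_counts = Counter(label for _, label in balanced_rows)
print(balanced_counts)
```

Because the multiplier uses floor division, classes whose counts do not divide the maximum count evenly end up close to, but not exactly at, the size of the largest class.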

`Results`
