Hinglish NLP

Welcome! This repository contains NLP resources for Hinglish.

Hinglish is a portmanteau of Hindi and English. It is the code-mixed (and code-switched) mode of communication used by bilinguals fluent in both Hindi and English. In this repository, you'll find NLP resources developed or adapted for Hinglish data:

  • Trained NLP models for Hinglish
  • Effective algorithms for various tasks in Hinglish
  • Data used for training
  • Other Hinglish data assets

Data Directory Structure

Here's how the data directory is structured. Some data files are not present in the GitHub repo because they are not final yet or are too large.

data
├── README.md
├── assets
│   ├── README.md
│   ├── eng_vocab
│   ├── hindi_chars
│   ├── stop_hindi
│   └── stop_hinglish
├── normalization
│   ├── README.md
│   ├── train_manual_all.csv
│   ├── train_released_1.csv
│   └── train_synthetic.csv
└── transliteration
    ├── README.md
    ├── en_hi_rel.csv
    └── en_hi_wiki.csv

The README in each folder explains in detail what each csv/txt file is and how it was created. Where a dataset was derived from other published datasets, the citations can also be found there.
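To get a quick look at whichever CSV files are present locally, something like the following sketch may help. It is not part of the repository: the function name `preview_csvs` is hypothetical, and it assumes the files are UTF-8 encoded and comma-separated. It makes no assumption about column names, which are documented in the per-folder READMEs.

```python
import csv
from pathlib import Path


def preview_csvs(data_dir, n_rows=3):
    """Return {relative path: first n_rows rows} for every CSV under data_dir.

    Hypothetical helper: assumes UTF-8, comma-separated files.
    """
    root = Path(data_dir)
    previews = {}
    for path in sorted(root.rglob("*.csv")):
        rows = []
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.reader(handle):
                rows.append(row)
                if len(rows) >= n_rows:
                    break
        # Key by the path relative to data_dir, e.g. "transliteration/en_hi_wiki.csv".
        previews[path.relative_to(root).as_posix()] = rows
    return previews
```

Calling `preview_csvs("data")` from the repository root would then list only the files that actually exist in your checkout, which is useful given that some of them are not shipped with the repo.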

Data Generation

All the data generation scripts and code are in the datagen directory. The scripts must be run from inside that directory.

  1. To extract the dataset from the JSON dump of Wikidata, run the scripts in the following order, after changing the paths and filenames in them to match your setup.

    cd datagen
    python wikidata2.py
    python wiki_trans_align.py
    python wiki_trans_filter.py

    Note: These scripts won't give you the final dataset; you'll have to process their output manually. The process is described in more detail in the blog entry Wikidata for Transliteration Pairs.
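If you prefer to drive the three steps from a single Python entry point, the order above can be sketched as follows. This wrapper is an illustration, not something shipped in the repository: `run_in_order` and its `runner` parameter are hypothetical names, and it assumes the scripts take no command-line arguments (paths are edited inside the scripts, as described above).

```python
import subprocess
import sys
from pathlib import Path

# Script names and order are taken from the steps listed above.
SCRIPTS = ["wikidata2.py", "wiki_trans_align.py", "wiki_trans_filter.py"]


def run_in_order(scripts=SCRIPTS, workdir="datagen", runner=subprocess.run):
    """Run each script from inside workdir, stopping at the first failure."""
    completed = []
    for script in scripts:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a later step never runs on the output of a failed one.
        runner([sys.executable, script], cwd=str(Path(workdir)), check=True)
        completed.append(script)
    return completed

# Example (run from the repository root):
#     run_in_order()
```

Stopping at the first failure matters here because each script consumes the output of the previous one. The manual post-processing step still has to follow.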

Blog Posts

Below is the list of blog posts I wrote to explain different parts of the work in this repository, along with other concepts around Hinglish, transliteration and NLP.

  1. Intro to Hinglish and transliteration: https://trigonaminima.github.io/2018/06/hinglish-and-transliteration/
  2. What started this all: (Mis)adventures of Building a Chat Bot
  3. Intro to WX notation, which I expect to use in the final model: Understanding WX notation
  4. Intuition behind the components of Seq2Seq models: Seq2Seq Components
  5. Training data creation for transliteration from Wikidata: Wikidata for Transliteration Pairs
