manuel-silvan / simple-language-recognition-in-european-parliament-proceedings-parallel-corpus

Python 100.00%
classification-algorithm language-recognition natural-language-processing from-scratch

Simple-Language-recognition-in-European-Parliament-Proceedings-Parallel-Corpus

The European Parliament Proceedings Parallel Corpus (1996-2011) (https://www.statmt.org/europarl/) is a well-known dataset for Natural Language Processing tasks. It contains proceedings of the European Parliament in 21 European languages. This project uses only 6 of them (German, French, Spanish, Italian, Polish and English): it extracts, preprocesses, cleans and normalizes the data, and then trains a few simple classifiers on it that can tell which language a sentence is written in.

This was originally a small project I did at university, and now I'm trying to formalize it so that other models or techniques can be tested with it.

To test it, put any of the files that can be found at https://www.statmt.org/europarl/ in the folder that contains the Python files. Each file contains sentences in a single language, and the program treats each file as a different language. Three simple classification models are available:

  • TF-IDF programmed from scratch + dot product
  • Bigram
  • TF-IDF + DecisionTree, both from sklearn
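
As a rough illustration of the first model, here is a minimal sketch of TF-IDF plus dot product scoring. It is not the code from this repository. It assumes each language is treated as a single document, that tokens are lowercased and split on whitespace, and that a smoothed IDF is used. The names `tokenize`, `train_language_vectors`, `classify` and the tiny `corpus` dictionary are all made up for the example.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase and split on whitespace; the project may clean text differently.
    return sentence.lower().split()


def train_language_vectors(corpus):
    """Build one unit-length TF-IDF vector per language.

    corpus maps a language name to a list of sentences (hypothetical layout).
    """
    counts = {lang: Counter(tok for s in sents for tok in tokenize(s))
              for lang, sents in corpus.items()}
    n_langs = len(counts)
    # Document frequency: in how many languages each word occurs.
    df = Counter(word for c in counts.values() for word in c)
    vectors = {}
    for lang, c in counts.items():
        total = sum(c.values()) or 1
        # Smoothed IDF, so words shared by every language keep a small positive weight.
        vec = {w: (n / total) * (math.log((1 + n_langs) / (1 + df[w])) + 1)
               for w, n in c.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors[lang] = {w: v / norm for w, v in vec.items()}
    return vectors


def classify(sentence, vectors):
    # Dot product between the sentence's term counts and each language vector.
    query = Counter(tokenize(sentence))
    scores = {lang: sum(n * vec.get(w, 0.0) for w, n in query.items())
              for lang, vec in vectors.items()}
    return max(scores, key=scores.get)


# Toy data standing in for the Europarl files (one entry per language).
corpus = {
    "en": ["the parliament resumed the session", "i would like to thank the president"],
    "es": ["se reanuda el periodo de sesiones", "quiero dar las gracias al presidente"],
    "fr": ["la séance est reprise", "je voudrais remercier le président"],
}
vectors = train_language_vectors(corpus)
print(classify("thank the parliament", vectors))  # prints "en"
```

With real Europarl files, the language vectors would be built from far more text, so sentences containing no known word (which here fall back to the first language in the dictionary) should be rare.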

The code accepts 2 parameters: add "-n" or "--normalize" to remove the 100 most frequent words from the corpus, and "-m" or "--model" to choose the classification technique.
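
The two options could be wired up roughly as in the sketch below. Only the flag names come from this README. Everything else is an assumption: that "-n" is a boolean switch, the model labels `tfidf`, `bigram` and `sklearn`, the helper names `build_parser` and `remove_most_frequent`, and the choice to count word frequencies over all sentences together.

```python
import argparse
from collections import Counter


def build_parser():
    parser = argparse.ArgumentParser(description="Language recognition on Europarl files")
    # Assumed to be a boolean switch; the real script may parse it differently.
    parser.add_argument("-n", "--normalize", action="store_true",
                        help="remove the most frequent words from the corpus")
    # The choice names below are hypothetical labels for the three models.
    parser.add_argument("-m", "--model", choices=["tfidf", "bigram", "sklearn"],
                        default="tfidf", help="classification technique to use")
    return parser


def remove_most_frequent(sentences, top_k=100):
    """Drop the top_k most frequent words, counted over all tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    frequent = {word for word, _ in counts.most_common(top_k)}
    return [[tok for tok in sent if tok not in frequent] for sent in sentences]


# Simulate a command line such as: -n -m bigram
args = build_parser().parse_args(["-n", "-m", "bigram"])
sentences = [["the", "house", "the", "vote"], ["the", "vote", "passed"]]
if args.normalize:
    # top_k=2 only because the toy data is tiny; the README describes removing 100 words.
    sentences = remove_most_frequent(sentences, top_k=2)
print(args.model, sentences)  # prints: bigram [['house'], ['passed']]
```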
