m4t1ss / parallel-corpora-tools
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
License: MIT License
Remove sentences where the number of non-space characters is equal to the number of tokens (or possibly very close to it), i.e. sentences written with a space between every character. For example:
English: ( c o n t i n u a t i o n )
Slovenian: ( n a d a l j e v a n j e )
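A minimal sketch of such a check, written in Python rather than the repository's PHP. The function names, the tolerance parameter and the min_tokens guard are all assumptions made for illustration, not part of the existing scripts. The tolerance covers the "very close" case, and min_tokens avoids flagging very short lines such as a lone punctuation mark.

```python
def is_spaced_out(sentence, tolerance=0, min_tokens=3):
    """Return True when (almost) every token is a single character.

    tolerance: how many extra characters beyond one per token are allowed.
    min_tokens: lines with fewer tokens are never flagged (assumed guard).
    """
    tokens = sentence.split()
    if len(tokens) < min_tokens:
        return False
    non_space_chars = sum(len(token) for token in tokens)
    return non_space_chars - len(tokens) <= tolerance


def filter_spaced_out(pairs, tolerance=0):
    """Drop a sentence pair when either side looks spaced out."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if not is_spaced_out(src, tolerance) and not is_spaced_out(tgt, tolerance)
    ]
```

Both example lines above are flagged by is_spaced_out, while an ordinary sentence is not. Dropping the whole pair when either side matches keeps the two sides aligned.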
Language identification is by far the slowest of these scripts.
It should be given as little data as possible, for example by running it after the other filters have already reduced the corpus.
Also, allow a temporary directory to be specified for sort
in the unique-parallel script (for example with the -T option of sort).
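The ordering idea can be sketched as follows, in Python rather than the repository's PHP. The function run_filters_cheapest_first and its arguments are hypothetical. The expensive check stands in for language identification and is passed in as a plain callable, so no particular language identification library is assumed.

```python
def run_filters_cheapest_first(pairs, cheap_checks, expensive_check):
    """Apply all cheap checks first, then the expensive one on the survivors.

    Each check takes a (source, target) pair and returns True to keep it.
    """
    survivors = [
        pair for pair in pairs
        if all(check(pair) for check in cheap_checks)
    ]
    # The expensive check (e.g. language identification) only sees
    # pairs that passed every cheap check.
    return [pair for pair in survivors if expensive_check(pair)]
```

The final result is the same as running the checks in any other order. Only the number of calls to the slow check changes.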
Hi,
Thanks for the nice tool. I tried using the non-repeating token filter (repeating-tokens.php)
to clean my back-translated data, but the resulting files are no longer parallel. I am not sure whether I made a mistake.
Initial parallel corpus size: 836843 lines (in both the source and target files)
After filtering:
source file: 734928 lines
target file: 746393 lines
The command I used was:
php repeating-tokens.php source_corpus target_corpus
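Differing line counts like these would result if each file were filtered on its own, though whether that is what happens here is not confirmed. A minimal Python sketch of filtering both sides in lockstep, so that a pair is dropped whenever either side fails. The function has_repeated_tokens and its max_run threshold are hypothetical stand-ins, not the actual logic of repeating-tokens.php.

```python
def has_repeated_tokens(line, max_run=3):
    """Return True if one token repeats more than max_run times in a row."""
    run = 0
    previous = None
    for token in line.split():
        run = run + 1 if token == previous else 1
        if run > max_run:
            return True
        previous = token
    return False


def filter_in_lockstep(src_lines, tgt_lines, is_bad=has_repeated_tokens):
    """Drop line i from BOTH sides when either side fails the check."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    kept_src, kept_tgt = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        if is_bad(src) or is_bad(tgt):
            continue
        kept_src.append(src)
        kept_tgt.append(tgt)
    return kept_src, kept_tgt
```

With this approach the two outputs always have the same number of lines, which is a quick sanity check to run after any filter.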
The input files could be split into as many parts as there are available cores, and the results concatenated once all parts have been processed.
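A rough Python sketch of this split, process and concatenate approach. All names here are hypothetical, and keep_nonempty_pairs is only a placeholder for a real filter. The sketch splits sentence pairs rather than the two files separately so that the sides stay aligned, and it relies on map returning results in input order so that the concatenated output keeps the original order.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def split_into_parts(items, n_parts):
    """Split a list into contiguous chunks of near-equal size."""
    n_parts = max(1, min(n_parts, len(items) or 1))
    size, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def keep_nonempty_pairs(pairs):
    """Placeholder filter: keep pairs where both sides have content."""
    return [(s, t) for s, t in pairs if s.strip() and t.strip()]


def filter_in_parallel(pairs, worker=keep_nonempty_pairs, n_parts=None,
                       pool_factory=ProcessPoolExecutor):
    """Run worker on each chunk in parallel and concatenate the results."""
    n_parts = n_parts or os.cpu_count() or 1
    parts = split_into_parts(pairs, n_parts)
    with pool_factory(max_workers=len(parts)) as pool:
        results = pool.map(worker, parts)  # results come back in input order
        return [pair for part in results for pair in part]
```

When using the default process pool, the worker must be a module-level function so it can be sent to the worker processes, and on platforms that spawn new processes the call should sit under an `if __name__ == "__main__":` guard.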
Do not remove ALL source sentences that align to multiple target sentences, or ALL target sentences that align to multiple source sentences. Keep the first one.
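One possible reading of "keep the first one", sketched in Python: walk the pairs in order and keep a pair only if neither its source nor its target has already been kept. The function name is hypothetical, and this interpretation is an assumption rather than the behaviour of any existing script.

```python
def keep_first_alignment(pairs):
    """Keep the first pair for each source and each target sentence.

    A later pair is dropped if its source or its target already
    appears in a pair that was kept.
    """
    seen_src, seen_tgt = set(), set()
    kept = []
    for src, tgt in pairs:
        if src in seen_src or tgt in seen_tgt:
            continue
        seen_src.add(src)
        seen_tgt.add(tgt)
        kept.append((src, tgt))
    return kept
```

For the input ("a", "x"), ("a", "y"), ("b", "x"), ("c", "z") this keeps ("a", "x") and ("c", "z"): the second pair reuses source "a" and the third reuses target "x".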