microsofttranslator / ntrex
NTREX -- News Test References for MT Evaluation
License: Creative Commons Attribution Share Alike 4.0 International
Some files contain templatic fragments such as <seg id=12">. This may be a preprocessing artifact, or the translators may have misunderstood the request. It needs human verification:
NTREX/NTREX-128/newstest2019-ref.fij.txt
Line 484 in 1c14c94
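A quick way to list candidate lines for that human verification is a regex scan. The sketch below is only illustrative: the pattern assumes the leftover markup always looks like an opening or closing seg tag (possibly malformed, as in the fragment quoted above), and the names find_seg_fragments and scan_directory are made up for this example.

```python
import re
from pathlib import Path

# Assumption: leftover markup resembles a <seg ...> or </seg> tag, possibly
# malformed or missing its closing bracket. Other template residue is not caught.
SEG_FRAGMENT = re.compile(r"</?seg\b[^>]*>?", re.IGNORECASE)


def find_seg_fragments(lines):
    """Return (line_number, text) pairs for lines that contain a seg-like tag."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SEG_FRAGMENT.search(line)
    ]


def scan_directory(directory):
    """Map each reference file name to its suspicious lines (hypothetical helper)."""
    hits = {}
    for path in sorted(Path(directory).glob("newstest2019-ref.*.txt")):
        with path.open(encoding="utf-8") as handle:
            found = find_seg_fragments(handle)
        if found:
            hits[path.name] = found
    return hits

# Example call, assuming a local checkout of the repository:
# scan_directory("NTREX/NTREX-128")
```

The scan only flags lines; whether a flagged line is a preprocessing leftover or a translator error still has to be judged by a person.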
As per email conversation:
One of your published datasets is named newstest2019-ref.nso, which I assume is Northern Sotho (commonly referred to as Sepedi, since it is from the northern part of South Africa). However, the content inside is Sesotho (Southern Sotho, from the southern part), with code st, or sometimes sot.
Thanks a lot for your test set!
I noticed that some lines are empty in https://github.com/MicrosoftTranslator/NTREX/blob/main/NTREX-128/newstest2019-ref.vie.txt . Is the translation here normal?
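To see exactly which lines are empty, something like the following could be used. It is a minimal sketch: the function name empty_line_numbers is invented here, and a line is treated as empty when it contains only whitespace, which is an assumption about what counts as "empty".

```python
def empty_line_numbers(lines):
    """Return the 1-based numbers of lines that are empty or whitespace-only."""
    return [number for number, line in enumerate(lines, start=1) if not line.strip()]

# Example call, assuming a local checkout of the repository:
# with open("NTREX-128/newstest2019-ref.vie.txt", encoding="utf-8") as handle:
#     print(empty_line_numbers(handle))
```

Knowing the positions helps tell apart stray blank lines that shift the alignment from segments that were simply left untranslated.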
The Zulu reference line count (1998) seems one off from the other references and the English source. There might be a misalignment issue with Zulu.
Wolof seems to have 2019 lines, compared to the other files, which have 1997. I tried filtering out empty lines, and it does not seem to make a difference, unlike other files that actually had many empty lines!
This seems to be an outlier. Can someone please look into it?
Some files do not have 1997 lines:
1998 newstest2019-ref.fij.txt
1998 newstest2019-ref.tgk-Cyrl.txt
1998 newstest2019-ref.zul.txt
2019 newstest2019-ref.wol.txt
2031 newstest2019-ref.urd.txt
2042 newstest2019-ref.vie.txt
Am I right that this makes it impossible to match them line by line with newstest2019-src.eng.txt?
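For anyone who wants to reproduce the counts above, here is a rough sketch that compares every reference file against the English source. The helper names count_lines and mismatched_references are hypothetical, and the file name pattern is assumed from the names listed in this thread.

```python
from pathlib import Path


def count_lines(path):
    """Count lines by splitting on b"\n" only; a final unterminated line also counts."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def mismatched_references(directory, source_name="newstest2019-src.eng.txt"):
    """Return the source line count and a dict of reference files that differ from it."""
    directory = Path(directory)
    expected = count_lines(directory / source_name)
    mismatches = {}
    for path in sorted(directory.glob("newstest2019-ref.*.txt")):
        found = count_lines(path)
        if found != expected:
            mismatches[path.name] = found
    return expected, mismatches

# Example call, assuming a local checkout of the repository:
# expected, mismatches = mismatched_references("NTREX/NTREX-128")
```

The files are read in binary mode on purpose, so that only the newline byte ends a line and other line-separator characters inside a segment do not inflate the count.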
Thank you, this is really great.
I have two comments:
The punctuation is really erratic. I think it would be great if everything was normalized and then post-processed (both according to language): quote before or after a full stop or comma, space or no space before/after a quote, ... small things that make a difference.
Even though these are human translated, I am looking at French (my native language) and there are some inconsistencies.
Just one example: line 49. It seems to have been merged with line 48 and then cut off; it is not the translation of English line 49.
When comparing FR and CA_FR, I am also seeing odd things.
Again, great stuff.