microsofttranslator / ntrex

NTREX -- News Test References for MT Evaluation

License: Creative Commons Attribution Share Alike 4.0 International

ntrex's Issues

Templatic fragments

Some files contain templatic fragments such as <seg id="12">. This may be a preprocessing problem, or the translators may have misunderstood the request. It needs human verification:

NTREX/NTREX-128/newstest2019-ref.fij.txt

Line 484 in 1c14c94

 Na Amerika ena itekivu ni yabaki <seg id="12">A tauyavutaka na Tabacakacaka ni Veitaqomaki e dua na Koro ni Veivakaukauwataki Ni Vuku Cokovata, ka kena inaki me ra vakaitavi kina na itokani mai na tabana ni cakacaka kei na vuli, ka kacivaka na White House na tauyavutaki ni Komiti digitaki ena Artificial Intelligence.
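A scan along the following lines could help locate such leftovers before the human check. This is only a minimal sketch: the function name `find_seg_fragments` and the regular expression are assumptions for illustration, not part of the NTREX tooling, and the sample line is a shortened copy of the Fijian line quoted above.

```python
import re

# Matches leftover segment markup such as <seg id="12">, the malformed
# <seg id=12">, or a closing </seg>. The pattern is an assumption about
# what the fragments look like, based on the single example in this issue.
SEG_TAG = re.compile(r"</?seg\b[^>]*>")


def find_seg_fragments(lines):
    """Return (line_number, matched_tag) pairs for every leftover <seg> tag."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in SEG_TAG.finditer(line):
            hits.append((number, match.group(0)))
    return hits


# In-memory sample; the second entry is shortened from the reported line.
sample = [
    "A clean reference line.",
    'Na Amerika ena itekivu ni yabaki <seg id="12">A tauyavutaka na Tabacakacaka',
]
print(find_seg_fragments(sample))  # → [(2, '<seg id="12">')]
```

To check a real file, the lines could be read with `open(path, encoding="utf-8")` and passed to the same function. A hit only flags a line for review; it does not tell whether the fault lies in preprocessing or in the translation.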

Bad language code?

One of your published datasets is named newstest2019-ref.nso, which I assume is Northern Sotho (commonly referred to as Sepedi, because it is from the northern part of South Africa). However, the content inside is Sesotho (Southern Sotho, the Sotho from the southern part), with code st, or sometimes sot.

empty lines in Vietnamese

Thanks a lot for your test set!
I noticed that some lines are empty in https://github.com/MicrosoftTranslator/NTREX/blob/main/NTREX-128/newstest2019-ref.vie.txt. Is this expected in the translation?
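One way to list the empty lines, sketched here under the assumption that "empty" also covers whitespace-only lines. The helper name `empty_line_numbers` is made up for this example.

```python
def empty_line_numbers(lines):
    """Return the 1-based numbers of lines that are empty or whitespace-only."""
    return [number for number, line in enumerate(lines, start=1) if not line.strip()]


# In-memory sample standing in for the contents of a reference file.
sample = ["first segment", "", "   ", "last segment"]
print(empty_line_numbers(sample))  # → [2, 3]
```

For the file in question, the lines could come from `open("NTREX-128/newstest2019-ref.vie.txt", encoding="utf-8")`, assuming a local checkout of the repository at that relative path.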

Zulu reference line count

The Zulu reference line count (1998) seems one off from the other references and the English source. There might be a misalignment issue with Zulu.

Wolof seems to have incorrect no. of lines

The Wolof file seems to have 2019 lines, compared to 1997 in the other files. I tried filtering out empty lines, and it does not make a difference here, unlike in other files, which actually had many empty lines!

This seems to be an outlier. Can someone please look into this?

Different number of lines in some files

Some files do not have 1997 lines:

1998 newstest2019-ref.fij.txt
1998 newstest2019-ref.tgk-Cyrl.txt
1998 newstest2019-ref.zul.txt
2019 newstest2019-ref.wol.txt
2031 newstest2019-ref.urd.txt
2042 newstest2019-ref.vie.txt

Am I right that this makes it impossible to match them line by line with newstest2019-src.eng.txt?
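The counts above can be reproduced with a short check like the one below. It is a sketch under a few assumptions: the references sit in one directory next to newstest2019-src.eng.txt, they follow the newstest2019-ref.*.txt naming seen in this thread, and lines are counted by newline characters, as `wc -l` does. The function names are hypothetical.

```python
from pathlib import Path


def count_lines(path):
    """Count newline characters in a file, mirroring `wc -l`."""
    with open(path, "rb") as handle:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: handle.read(1 << 20), b""))


def mismatched_files(directory, source_name="newstest2019-src.eng.txt"):
    """Return (source_line_count, {reference_name: line_count}) for references
    whose line count differs from the source file's."""
    directory = Path(directory)
    expected = count_lines(directory / source_name)
    mismatches = {}
    for reference in sorted(directory.glob("newstest2019-ref.*.txt")):
        found = count_lines(reference)
        if found != expected:
            mismatches[reference.name] = found
    return expected, mismatches
```

Calling `mismatched_files("NTREX-128")` on a local checkout would return the source count and a dictionary of the files that deviate from it. A matching count is necessary for line-by-line alignment with the source but does not prove it, since an extra line and a missing line can cancel out.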

Small things

Thank you, this is really great.

I have two comments:

The punctuation is really erratic. I think it would be great if everything were normalized and then post-processed, both according to language: quote before or after a period or comma, space or no space before/after a quote, and so on. These are small things that make a difference.
Even though these are human translations, I am looking at French (my native language) and there are some inconsistencies.
Just one example: line 49. It appears to have been handled together with line 48 but cut off; it is not the translation of English line 49.
When comparing FR and CA_FR, I am also seeing odd things.

Again, great stuff.
