
opustools's Issues

opus_get downloads *all* corpora with just the -s switch

Hi,

I'm trying to create a monolingual corpus by taking all the source language data for a particular language, in this case ro. Here's how I'm trying to do it:

opus_get -s ro -p raw -q

I used this exact command a while ago and it only downloaded 'ro' corpora (as I expected). Now it downloads everything: I'm seeing files from GNOME_latest_raw_sw.zip to CCMatrix_latest_raw_de.zip. I ran the command yesterday evening, and about a day later I have a 650 GB folder of zips and it's still going strong; it has almost finished all 74 GB of the English Paracrawl even though that has nothing to do with "ro".

While it's easy for me to just delete all the non-"ro" files after everything finishes, it's a waste of bandwidth and time.
Or did the behavior change, and am I using the command wrongly?

Could you please advise?
Regards!

opus_express not checking correctly root directory

Hi,

I have previously downloaded some corpora with opus_get. After that, opus_express does not find the already downloaded files and asks to download them again.

$ opus_get -q -s en -t mt -p raw -d DGT -dl downloads/en-mt
$ ll downloads/en-mt
downloads/en-mt/DGT_latest_raw_en.zip  downloads/en-mt/DGT_latest_raw_mt.zip  downloads/en-mt/DGT_latest_xml_en-mt.xml.gz
$ opus_express --root-dir downloads/en-mt -s en -t mt -c DGT
Skipping DGT (no en-mt)...
Checking out DGT...
No alignment file "downloads/en-mt/DGT/latest/xml/en-mt.xml.gz" or "./DGT_latest_xml_en-mt.xml.gz" found
The following files are available for downloading:

  17 MB https://object.pouta.csc.fi/OPUS-DGT/v2019/xml/en-mt.xml.gz
   1 GB https://object.pouta.csc.fi/OPUS-DGT/v2019/xml/en.zip
 401 MB https://object.pouta.csc.fi/OPUS-DGT/v2019/xml/mt.zip

   2 GB Total size
Downloading 3 file(s) with the total size of 2 GB. Continue? (y/n)
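For reference, a minimal sketch of a path check that would accept both layouts: the nested OPUS tree that opus_express looks for and the flat file names that opus_get produces. `find_alignment_file` is a hypothetical helper illustrating the mismatch, not the actual opustools code.

```python
import os

def find_alignment_file(root_dir, corpus, release, pre, src, tgt):
    """Hypothetical helper: check the nested OPUS tree layout that
    opus_express expects, then fall back to the flat file name that
    opus_get produces (e.g. DGT_latest_xml_en-mt.xml.gz)."""
    candidates = [
        os.path.join(root_dir, corpus, release, pre,
                     "%s-%s.xml.gz" % (src, tgt)),
        os.path.join(root_dir,
                     "%s_%s_%s_%s-%s.xml.gz" % (corpus, release, pre, src, tgt)),
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```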

Thanks!

Opus_read: SentenceParserError

Hey,
I am trying to download and concatenate a number of English-Bulgarian corpora, and EMEA seems problematic.
It doesn't fail gracefully, so it breaks the whole pipeline with the following configuration:

common: 
  output_directory: "."
steps: 
  - 
    parameters: 
      corpus_name: EMEA
      preprocessing: xml
      release: v3
      source_language: en
      src_output: EMEA.en.gz
      target_language: bg
      tgt_output: EMEA.bg.gz
    type: opus_read

With the following error:

File "/data/anaconda/envs/traingoo/lib/python3.7/site-packages/opustools-0.0.54-py3.7.egg/opustools/parse/block_parser.py", line 76, in parse_line
    '{error}'.format(document=self.document.name, error=e.args[0]))
opustools.parse.sentence_parser.SentenceParserError: Sentence file "EMEA/xml/bg/humandocs/PDFs/EPAR/mmrvaxpro/H-604-PI-bg.xml" could not be parsed: not well-formed (invalid token): line 17, column 16
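Until opus_read can skip such files gracefully, one workaround is to pre-scan the sentence-file archive for malformed XML and exclude the offenders. A sketch using only the standard library:

```python
import zipfile
import xml.etree.ElementTree as ET

def find_malformed(zip_path):
    """Scan an OPUS sentence-file zip and return the names of members
    that are not well-formed XML, so they can be excluded before
    running opus_read."""
    bad = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".xml"):
                continue
            try:
                with zf.open(name) as f:
                    # Iterating forces the parse; errors surface here.
                    for _ in ET.iterparse(f):
                        pass
            except ET.ParseError:
                bad.append(name)
    return bad
```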

problem with os.rename in opus_langid

opus_langid throws an error if tmpdir and the file to be checked are not on the same file system. Here is the error message:

opus_langid -f FIH-SWH/xml/sv/Yle_MeMAD_013_FinSweSubs01_1/MEDIA_2015_01032469_SUBTITLE.xml -t test.xml
Traceback (most recent call last):
File "/homeappl/home/tiedeman/.local/bin/opus_langid", line 23, in
OpusLangid(**vars(args)).processFiles()
File "/homeappl/home/tiedeman/.local/lib/python3.4/site-packages/opustools/opus_langid.py", line 151, in processFiles
os.rename(tempname[1], self.target_file_path)
OSError: [Errno 18] Invalid cross-device link: '/tmp/tiedeman/tmpbs1b0mve' -> 'test.xml'
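A likely fix, sketched here under the assumption that the temp file only needs to end up at the target path: replace os.rename with shutil.move, which falls back to copy-and-delete when the source and destination are on different file systems.

```python
import shutil

def safe_replace(src, dst):
    """Move src to dst even across file systems. os.rename() raises
    EXDEV (invalid cross-device link) in that case; shutil.move()
    detects the failure and copies then deletes instead."""
    shutil.move(src, dst)
```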

Memory Issue: opus_read fails to extract MultiCCAligned

Using v1.2.1, the following command successfully downloads the resources of MultiCCAligned. After the download, however, the conversion to Moses-format fails without any error message due to a lack of memory (RAM).

opus_read --directory MultiCCAligned -r v1 --source en --target de --write en-de.en en-de.de --write_mode moses

opus_read seems to read the dataset into memory. The memory increases above 60GB before the process dies.
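The symptom is consistent with the whole document tree being accumulated in memory. For contrast, a streaming sketch (illustrative only, not the opustools parser) that keeps memory flat by discarding each element after use:

```python
import xml.etree.ElementTree as ET

def stream_sentences(path):
    """Yield sentence texts one at a time. elem.clear() drops each
    element's children after use, so memory stays bounded even for
    multi-gigabyte sentence files."""
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "s":
            yield "".join(elem.itertext()).strip()
            elem.clear()
```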

A similar operation to download the WMT dataset works:

opus_read --directory WMT-News -r v2019 --source en --target de --write en-de.en en-de.de --write_mode moses

Thanks for this library. A tool to collect and filter the ever-increasing datasets is of great use.

Query to get list of existing corpora (by language)

Hi,

Is there a way to query the data so that I can get all available corpora that contain a certain language? Right now I need to create a monolingual aggregated corpus for a number of languages. For example, I would like to get all corpora that contain "hr" ("hr" -> xx, no matter what the target language is, as long as one side has "hr" sentences).

Is there a way to achieve that automatically?
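One possible route, assuming opus_get's --list flag (used elsewhere in this tracker for language pairs) also accepts just a source language; whether -s alone works this way may depend on the opustools version:

```shell
# Hypothetical usage: list everything OPUS has that includes "hr",
# without downloading anything.
opus_get -s hr --list
```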

Thanks a lot!
By the way, thank you: I've been using OPUS from time to time throughout the years, and it's a great resource.

What is the tokenizer for all languages?

The provided TMX files contain tokenized text, and I wonder what tokenizer is used for languages like Thai, Chinese, etc.
Are there any docs where I can find this?
Thanks!

Alignment problem with JW300 corpora?

I have tried to obtain bitext from the JW300 corpus in plain text format. The webpage http://opus.nlpl.eu/JW300-v1.php gives the instruction to use opus-tools to extract bitext from the alignment XML files.

For example, for the language pair English (en) - Burmese (my) I used the following command:

opus_read -d JW300 -s en -t my -wm moses -w jw300.en jw300.my

While the resulting text files have the same number of lines, the alignment seems to be off.

The resulting files look like this:

$ head jw300.??
==> jw300.en <==
Can You Get By for Less ? PRICES keep going in one direction ​ — up !
The soaring cost of living today threatens to wipe out what little savings some have managed to scrape together .


Especially hard hit are people on fixed incomes . Is there anything that you can do to neutralize the impact of rising prices ?

Let us consider approaches to the problem that certain persons have found practical .

Must You Have It ?


==> jw300.my <==
အကုန်အကျ နည်း နည်း ဖြင့် သင် ရ နိုင် ပါ သလော
ကုန်ဈေးှုန်း သည် လား ရာ တစ်ဖက် တည်း ဖြစ် သော အထက်သို့ သာ တရိပ်ရိပ် တက်နေ သည် !
ယနေ့ လူ နေ ှု စရိတ် မြင့် တက်နေ ခြင်း က အချို့ သူများ ခြစ် ခြစ် ခြုတ် ခြုတ် စုဆောင်း ထား သည့် စု ငွေ လေး ကုန် သွား စေ ရန် ခြိမ်းခြောက် လျက် ှိ သည် ။
အထူးသဖြင့် ပို ၍ အခက်အခဲ ကြုံ ရ သူများ မှာ ပုံသေ ဝင်ငွေ ရ သူများ ဖြစ်သည် ။
ကုန်ဈေးှုန်း မြင့် တက် ခြင်း ၏ ဂယက်ရိုက် ှု ကို တားဆီး ရန် သင် လုပ်ဆောင် နိုင် သည့် အရာ တစ်စုံတစ်ရာ ှိ ပါ သလော ။
လက်တွေ့ ကျသည် ဟု အချို့ သူများ တွေ့ ှိ ခဲ့ ကြ သည့် ပြဿနာ ဖြေ ရှင်း နည်း များ ကို သုံးသပ် ကြည့် ကြ စို့ ။
ယင်း သည် သင့် တွင် ှိ ဖို့ လို သလော
တစ်ခုခု ကို ဝယ် မည် ဟု သင် စဉ်းစား သည့် အခါ “ ဤ အရာ သည် ကျွ်ုပ် အတွက် အမှန် လိုအပ် သလော ” ဟု မေး ခြင်း သည် အကျိုးဖြစ်ထွန်း ကြောင်း သင် တွေ့ မြင် ရ ပါ မည် ။
ဥပမာ ၊
ကား မှ ရှိ သော အသုံးတည့် ှုသည် ကား ဈေး ကျ သွား ၍ ဆုံှုံး ခြင်း ကို မ ဆို ဘဲ ယင်း ကို ဝယ် ခြင်း ၊

Multiple English sentences are aligned to a single Burmese sentence, and some English lines are empty.

If I look at the result of Google Translate on the Burmese part, it looks like all the information is there in principle, but the alignment is off:

Can you get it for a small fee?
Commodity prices are skyrocketing!
Today's rising cost of living threatens to deplete some of their scrap savings.
Particularly disadvantaged are those with a fixed income.
Is there anything you can do to prevent the effects of rising commodity prices?
Let us consider some of the solutions that some have found practical.
Do you have to have it?
When you are thinking of buying something, you will find it helpful to ask, "Is this really what I need?"
For example
The usefulness of a car is not limited to the loss of a car, but to the fact that it can be bought or sold.

Is there anything that can be done to fix this? It looks like, even when considering only 1:1 alignments, there is an offset that causes the wrong sentence pairs to align.

OPUS returns no data

Hi. I want to work on Kalaallisut, and I saw OPUS has a corpus for it. I have been trying to create sets using opus_express with Danish as the source, but it says that OPUS can find no data when I choose collection "ALL". I have also tried other, more generic combinations, like en and fr, but still the same.

Missing alignment data for English(en) - Oromo(om)?

I am attempting to retrieve the parallel corpus data for English (en) - Oromo (om), and I used the following command:

opus_read -d JW300 -s en -t om -wm moses -w jw300.en jw300.om

I'm getting this error, and was wondering if anyone knew whether it means there is no parallel data available:

Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/en-om.xml.gz not found. The following files are available for downloading:

Unable to retrieve the data.
Unable to retrieve the data.
No alignment file "/proj/nlpl/data/OPUS/JW300/latest/xml/en-om.xml.gz" or "./JW300_latest_xml_en-om.xml.gz" found

On the OPUS website, the "download files" (bottom-left triangle) are available, but there is no actual text, just an XML file with "certainty" and "xtargets" values. In the upper-right triangle there's a sample file for en-om, which makes me think there is a larger dataset. Any help on retrieving this?

Thank you in advance.

Misleading logging information in opus_express

The following lines in opus_express produce misleading logging info.

if not os.path.isfile(archive_path):
print('Skipping %s (no %s-%s)...' % (collection, src_lang, tgt_lang))

The printed output indicates that data for the given language pair is not available in the mentioned collection and that the collection will be skipped, which is not the case.
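A sketch of a less misleading message (wording is hypothetical): state that the local archive was not found and may still be downloadable, rather than implying the pair is absent from the collection.

```python
import os

def report_missing(archive_path, collection, src_lang, tgt_lang):
    """Return a clearer status line when the local archive is absent.
    Sketch only; not the actual opus_express code."""
    if not os.path.isfile(archive_path):
        return ('%s: local archive %s not found for %s-%s; '
                'it will be downloaded if available.'
                % (collection, archive_path, src_lang, tgt_lang))
    return None
```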

malformed tmx from opus_read

I made some small changes to the code to work around issue 24.
The TMX file was produced, but there is a mismatch in the tags.
E.g.

                <tu>
                        <tuv xml:lang="af"><seg>waarom treiter jy die mensheid met oorlog , pes , hongersnood ?</seg></tuv>
                        <tuv xml:lang="en"><seg>Hold !</seg></tuv>
                </tu>

                        <tuv xml:lang="en"><seg>Why dost thou scourge mankind with War , Plague , Famine ?</seg></tuv>
                </tu>

There is a blank line where there should be an opening <tu> tag. This is repeated throughout the file. It's possible that I introduced the error with my code modifications; I've already checked once, and I'll check again and update here if I find anything.

Using opus_read with -az, -sz, -tz options

Hi, I use opus_get -s ar -t en -d TED2013 --list to list the relevant files and download them manually. Then I run opus_read -s ar -t en -af ar-en.xml.gz -sz ar.zip -tz en.zip -d output, but it says:

There is no item named 'ar/ted2013.en-ar.xml.gz' in the archive 'ar.zip'
Continuing from next sentence file pair.

How should I retrieve plain texts from the downloaded files?
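To debug the mismatch, it helps to compare the member names the archive actually contains against the path in the error message. A small standard-library sketch:

```python
import zipfile

def list_members(zip_path, limit=20):
    """Return the first few member paths of a zip archive, to compare
    against the sentence-file paths referenced by the alignment file."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()[:limit]
```

If the prefix inside the zip differs from what the alignment file references (for example a missing corpus-name directory), the -sz/-tz archives and the -af alignment file do not belong to the same layout.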

Add progress indicator to opus_express

When downloading large amounts of data with opus_express, it would be useful to see some indication of progress; right now it is hard to guess when the job will finish.
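A sketch of what such a readout could look like, shaped like a urllib-style report hook (block count, block size, total size); purely illustrative, not opus_express code:

```python
def progress_line(count, block_size, total_size):
    """Format a one-line progress indicator for a download callback."""
    done = min(count * block_size, total_size)
    pct = 100 * done // total_size if total_size > 0 else 0
    return '%3d%% (%d/%d bytes)' % (pct, done, total_size)
```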

implement opus-filter

Add a tool for corpus cleaning like in the WMT 2019 tasks: a simple script that can be applied to any OPUS bitext, producing an XCES alignment file or filtered Moses-style data. The script should combine language identification, language-model filters, alignment-based filters, character-code filters, etc.

Spaces before punctuation marks in opus_read output

Apostrophes, commas, question marks, etc. are all printed with a leading space. Is this by design? I couldn't see any option to modify the behaviour.

(src)="8"> She 's calling herself jolene parker .
(trg)="7"> Je ne peux pas te forcer à y croire .
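The leading spaces come from the tokenized XML. As a rough post-processing sketch only; a proper detokenizer (e.g. the Moses detokenizer) handles many more cases:

```python
import re

def detokenize_punct(s):
    """Remove the space that OPUS tokenization inserts before
    punctuation and before apostrophe clitics like 's. Rough sketch."""
    s = re.sub(r"\s+([.,!?;:)\]%])", r"\1", s)   # " ." -> "."
    s = re.sub(r"\s+('\w)", r"\1", s)            # "She 's" -> "She's"
    return s
```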

opus_read fails to extract CCMatrix

I tried to extract the aligned sentence pairs from CCMatrix, previously downloaded using opus_express. The command I used was

opus_read --source en --target fi --directory CCMatrix --preprocess xml --leave_non_alignments_out --write_mode moses --write CCMatrix.raw.en CCMatrix.raw.fi --write_ids CCMatrix.raw.ids

The command runs for several days at 100% CPU, without producing any output. Perhaps expat is choking on some error in the data. To rule out package corruption after download, I allowed opus_read to download it again, with the same hanging result.

Traceback when killed:
  File "/home/stiggronroos/venvs/opustools/bin/opus_read", line 135, in <module>
    OpusRead(**vars(args)).printPairs()
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/opus_read.py", line 214, in printPairs
    self.alignmentParser.collect_links()
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/parse/alignment_parser.py", line 107, in collect_links
    blocks = self.bp.get_complete_blocks()
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/parse/block_parser.py", line 98, in get_complete_blocks
    self.parse_line(line)
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/parse/block_parser.py", line 82, in parse_line
    self.p.Parse(line)
KeyboardInterrupt

Workaround: (re)download the corpus directly in moses format from https://opus.nlpl.eu/CCMatrix.php

preserve inline tags

Add a flag to preserve inline tags inside sentences, such as <time id="T1S" value="00:00:51,301" /> in subtitle files (or possibly also <b>...</b>), in the output of the OPUS readers.

Cannot download resource due to `DH_KEY_TOO_SMALL`

I tried to use opus_get

Tried the most simple command from README:

$ opus_get --directory RF --source en --target sv

Downloading 3 file(s) with the total size of 121 KB. Continue? (y/n) y
Unable to retrieve the data.

Debugging shows where the error is caught:

except urllib.error.URLError as e:
print('Unable to retrieve the data.')
return

>>> e
URLError(SSLError(1, '[SSL: DH_KEY_TOO_SMALL] dh key too small (_ssl.c:1007)'))

I guess it is due to an old policy on the data server?

Reported environment (if it helps):

  • Ubuntu 22.04
  • python 3.10.12
  • opustools pip version: 1.6.1
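A client-side workaround sketch (not opustools code): OpenSSL 3, as shipped on Ubuntu 22.04, rejects the server's small DH key at the default security level. Lowering the level for one SSL context allows the handshake, at the cost of accepting the weak DH parameters:

```python
import ssl

# Create a context that tolerates the server's small DH key.
ctx = ssl.create_default_context()
ctx.set_ciphers("DEFAULT@SECLEVEL=1")
# To use it with urllib, one would build an opener around it:
#   import urllib.request
#   opener = urllib.request.build_opener(
#       urllib.request.HTTPSHandler(context=ctx))
```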

List of datasets | Monolingual raw files

Hi,

Just stumbling across this tool, which looks promising!

I've been wondering:

  • Is there an option to display the list of available datasets (and their available languages) on OPUS? That would be great for browsing and for programmatically downloading some or all of them.
  • Also: is there an option to download the raw monolingual files? I'm looking into unsupervised language modeling, so those are the ones I'd be using.
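A hedged sketch, assuming (as reported in another issue in this tracker) that opus_get with only a source language fetches the monolingual archives, and that --list prints what is available without downloading:

```shell
opus_get -s fi --list      # browse what is available for "fi"
opus_get -s fi -p raw -q   # download the raw monolingual zips
```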

Thanks a lot in advance!

PyPI wheel includes old files

The whl distributed by PyPI (downloaded and installed with pip) includes old versions of this project's source, which are extracted into site-packages. These include:

  • the entire opustools_pkg directory, which dates back to v0.0.50 (plus some filter files dating back even further)
  • several files in the opustools directory which have been deleted between v0.0.50 and the current v1.2.1

This seems to be because the automated wheel build is adding the latest files to an existing wheel archive, rather than creating a fresh one.

This could cause confusion if users run import opustools_pkg instead of import opustools and get an outdated version of the library, in particular one that doesn't match the shell scripts added to PATH.

change in OPUS yaml files

There has been a slight change in the YAML files in OPUS: the item 'latest release' has been renamed to 'latest_release' (with an underscore instead of a space). This also affects the OPUS-API and any DB update.

Format of downloaded files does not match the format expected by opus_read

I downloaded files with the following command

python opus_get -s th -t ru -d Opensubtitles

After the files were downloaded I ran the following command

python opus_read -d Opensubtitles -s th -t ru -wm tmx -w th-ru.tmx

Error messages of the following form were repeatedly displayed

There is no item named 'ru/2004/304141/158903.xml.gz' in the archive '.\Opensubtitles_latest_xml_ru.zip'
Continuing from next sentence file pair.

The format of the files in the downloaded zip is

OpenSubtitles\xml\ru\1191\3276470\5646552.xml

not sure if I can ask questions here, but I got stuck on this when I try to download a TMX from the OPUS JW300 set. It has nothing to do with the set, I think.

Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/en-ta.xml.gz not found. The following files are available for downloading:

   8 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en-ta.xml.gz
 263 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en.zip
  94 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/ta.zip

 365 MB Total size
Downloading 3 file(s) with the total size of 365 MB. Continue? (y/n) y
JW300_latest_xml_en-ta.xml.gz ... 100% of 8 MB
JW300_latest_xml_en.zip ... 100% of 263 MB
JW300_latest_xml_ta.zip ... 100% of 94 MB
Traceback (most recent call last):
  File "your_script.py", line 3, in <module>
    opus_reader.printPairs()
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 350, in printPairs
    lastline = self.readAlignment(gzipAlign)
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 308, in readAlignment
    lastline = self.outputPair(self.par, line)[1]
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 251, in outputPair
    self.sendPairOutput(wpair)
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 210, in sendPairOutput
    self.resultfile.write(wpair[0])
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1264.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 83-89: character maps to <undefined>

I have no idea how to fix this. Your help is highly appreciated.
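The last traceback frame shows the output file being written with Windows' default cp1252 codec. The usual fix is to force UTF-8, either by setting PYTHONUTF8=1 in the environment before running the tools, or by opening the result file with an explicit encoding; a sketch (the file name is illustrative):

```python
import os
import tempfile

# Open the result file with an explicit UTF-8 encoding instead of the
# platform default; cp1252 cannot represent most non-Latin characters.
path = os.path.join(tempfile.mkdtemp(), "en-ta.tmx")
with open(path, "w", encoding="utf-8") as f:
    f.write("\u0b85")  # TAMIL LETTER A, unencodable in cp1252
```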

Originally posted by @gertva in #3 (comment)

opus_express without confirmation?

Hi,

opus_get has an option to skip download confirmation

  -q, --suppress_prompts
                        Download necessary files without prompting "(y/n)"

Could opus_express have this option?

Thank you!

DB for off-line search

Add the possibility to create and use an off-line index of OPUS data instead of relying on the public OPUS-API. opus-tools could build a local DB in the same way as the OPUS-API update script does, and then query that database to find resources and how to download them. This would reduce dependence on an on-line service that might break or have downtime. Also add an update option that can refresh the local index DB.
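A minimal sketch of what such a local index could look like, using sqlite3 with a hypothetical schema and placeholder URLs (the real index would be populated from the OPUS metadata):

```python
import sqlite3

# Hypothetical schema: one row per downloadable resource,
# queryable offline instead of via the OPUS-API.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE resources (corpus TEXT, lang TEXT, url TEXT)")
con.executemany("INSERT INTO resources VALUES (?, ?, ?)", [
    ("RF", "en", "https://example.org/RF_v1_en.zip"),
    ("RF", "sv", "https://example.org/RF_v1_sv.zip"),
])
# Find everything available for Swedish.
rows = con.execute(
    "SELECT corpus, url FROM resources WHERE lang = ?", ("sv",)).fetchall()
```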

Question: monolingual dialogs (Finnish language)

I'm trying to create a Finnish-language chatbot.
Question: is it possible to create dialog pairs using OpusTools?
Something like:

No joo se on totta <tab> Mitä on mielessäsi

Any help, hints, etc. appreciated.
Thanks!
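OpusTools itself extracts aligned bitext rather than dialog turns, but as a naive sketch (an assumption, not a built-in feature), consecutive subtitle lines from a monolingual OpenSubtitles file could be paired as prompt/reply:

```python
def dialog_pairs(lines):
    """Pair each subtitle line with the following one as
    (prompt, reply). Naive: consecutive lines are often, but not
    always, turns of a dialog."""
    return [(a, b) for a, b in zip(lines, lines[1:])]

pairs = dialog_pairs(["No joo se on totta", "Mitä on mielessäsi"])
# Each pair can then be written out as "prompt<tab>reply".
```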

Where are the missing language pairs?

There seem to be 417 language varieties represented at https://opus.nlpl.eu/JW300.php. This would imply C(417, 2) = 86,736 undirected language pairs. However, I only count 54,376 of them, and the paper confirms this number. Do you know where the missing 32,360 language pairs are, and would you be willing to provide them?

I notice that the adjacency matrix seems to have only one fully connected component, so e.g. although ady has no parallel data with en, it has parallel data with "jw_rmv", which has parallel data with en. So it seems likely that ady and en can be aligned. Just to demonstrate that it's conceptually possible, I found these two pairs in the respective corpora:

jw_rmv: Пала со амэ подаса дума андэ авэр статья ?
ady: Сыда къыкІэлъыкІорэ статьям щызэхэтфыщтыр ?

jw_rmv: Пала со амэ подаса дума андэ авэр статья ?
en: What will we consider in the following article ?

Implication: the following is a sentence pair between English and Adyghe:

ady: Сыда къыкІэлъыкІорэ статьям щызэхэтфыщтыр ?
en: What will we consider in the following article ?

(Interestingly, jw_rmv, which actually seems to be Vlax Romani in Cyrillic script, is the one language that is aligned with the most other languages, more than English!)
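The pivoting idea above can be sketched as composing two alignments through the shared jw_rmv sentence; a toy implementation using the example sentences (real pivoting would also need link certainties and 1:n handling):

```python
def pivot_pairs(pivot_to_a, pivot_to_b):
    """Compose two alignments through a shared pivot sentence: whenever
    a pivot sentence is aligned to both an A and a B sentence, emit the
    (A, B) pair."""
    return [(a, pivot_to_b[p])
            for p, a in pivot_to_a.items() if p in pivot_to_b]

rmv = "Пала со амэ подаса дума андэ авэр статья ?"
rmv_to_ady = {rmv: "Сыда къыкІэлъыкІорэ статьям щызэхэтфыщтыр ?"}
rmv_to_en = {rmv: "What will we consider in the following article ?"}
pairs = pivot_pairs(rmv_to_ady, rmv_to_en)
```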

Can you adjust the paths so that the users don't have to manually enter "y" and the download will begin automatically? Thanks!

Hi, not sure if this is related, but I get these prompts with opus_read that the file is not found, and then the right file is suggested. Can you adjust the paths so that the users don't have to manually enter "y" and the download will begin automatically? Thanks!

`Alignment file /proj/nlpl/data/OPUS/JW300/v1/xml/en-hi.xml.gz not found. The following files are available for downloading:

6 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en-hi.xml.gz
263 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en.zip
75 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/hi.zip

344 MB Total size
Downloading 3 file(s) with the total size of 344 MB. Continue? (y/n) y`

Originally posted by @e-matusov in #3 (comment)

Recreate sample files shown in OpenSubtitles corpus

I would like to create files in the same format as those labeled "view" in the upper triangle of the top block on the OpenSubtitles page. Is it possible to generate those files with OpusTools?

This is an example from the beginning of one of the files.

# xml/hi/1980/81505/4254655.xml.gz
# xml/zh_tw/1980/81505/6363116.xml.gz

(src)="2"> मेरा श्री उलमन के साथ एक अपॉइंटमेंट है .
(src)="3"> मेरा नाम जैक टॉरंस है .
(trg)="2"> 你好 , 我 和 尤 爾曼 先生 有 約
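A hedged guess at reproducing that format: the "(src)="/"(trg)=" lines match the default (non-moses, non-tmx) output of opus_read shown in other issues in this tracker, so something like the following may come close (languages and file name chosen for illustration):

```shell
opus_read -d OpenSubtitles -s hi -t zh_tw -w sample.txt
```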
