
Comments (5)

miau1 commented on September 12, 2024

Unfortunately, those files are not well-formed XML, and opus_read currently crashes when parsing fails. A future update may let opus_read keep running even if some of the files are not well-formed XML.

from opustools.

Stamenov commented on September 12, 2024

Furthermore, I am getting an error with the EUbookshop dataset:

opustools.opus_read.AlignmentParserError: Alignment file "./EUbookshop_v2_xml_bg-en.xml.gz" could not be parsed: mismatched tag: line 225123, column 2


miau1 commented on September 12, 2024

In the latest version of OpusTools, 1.0.0, opus_read continues parsing from the next sentence file if it encounters a sentence file with invalid XML. If an alignment file contains an error, the file is parsed up to the error but no further. There are plans to fix the broken XML files in OPUS.
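To illustrate what "parsed up to the error" can look like in practice, here is a minimal sketch using only the Python standard library. It is not OpusTools' own parser. It assumes the alignment file follows the usual layout of `<linkGrp fromDoc=... toDoc=...>` elements containing `<link xtargets=...>` children; the function names are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def read_links_until_error(stream):
    """Collect (fromDoc, toDoc, xtargets) tuples until the XML breaks.

    Returns the links read so far and the parser's error message,
    or None as the message if the whole stream was well-formed.
    Assumes <linkGrp>/<link> elements as described above.
    """
    links = []
    error = None
    from_doc = to_doc = None
    try:
        for event, elem in ET.iterparse(stream, events=("start", "end")):
            if event == "start" and elem.tag == "linkGrp":
                # Attributes are already available on the start event.
                from_doc = elem.get("fromDoc")
                to_doc = elem.get("toDoc")
            elif event == "end" and elem.tag == "link":
                links.append((from_doc, to_doc, elem.get("xtargets")))
                elem.clear()  # free memory on very large files
    except ET.ParseError as exc:
        # Everything before the malformed spot has already been collected.
        error = str(exc)
    return links, error


def read_links_from_gzip(path):
    """Hypothetical convenience wrapper for a gzip-compressed alignment file."""
    with gzip.open(path, "rb") as stream:
        return read_links_until_error(stream)
```

Calling `read_links_from_gzip` on a downloaded alignment file returns whatever links precede the first malformed spot, together with the parser's message, so the position of the break can be inspected.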


Lauler commented on September 12, 2024

@miau1 Any progress on fixing the XML files?

An error occured during the creation of parallel-sentences2/EUbookshop-en-sv.tsv.gz
type error: Error while parsing alignment file: Document './opus/EUbookshop_latest_xml_en-sv.xml.gz' could not be parsed: mismatched tag: line 1964268, column 2

The file EUbookshop_latest_xml_en-sv.xml.gz seems to have many missing </linkGrp> closing tags. The first <linkGrp> has a closing tag, then none of them have one until the very end, where about 50 of them have a closing tag.
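A rough way to check for this kind of imbalance, without an XML parser, is to count opening and closing tags line by line. This is only a sketch with hypothetical function names; it assumes the tags are written literally as `<linkGrp ...>` and `</linkGrp>` and are not split across lines.

```python
import gzip
import re

# Matches "<linkGrp" followed by a word boundary, so "</linkGrp>" is not counted.
OPEN_TAG = re.compile(r"<linkGrp\b")


def count_linkgrp_tags(lines):
    """Return (number of opening tags, number of closing tags)."""
    opens = closes = 0
    for line in lines:
        opens += len(OPEN_TAG.findall(line))
        closes += line.count("</linkGrp>")
    return opens, closes


def count_linkgrp_tags_in_gzip(path):
    """Hypothetical wrapper that streams a gzip-compressed alignment file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_linkgrp_tags(handle)
```

If the two counts differ, the file has unclosed (or stray) `linkGrp` elements, which would be consistent with the "mismatched tag" error above.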

I somehow managed to sentence align this dataset a couple of months ago by downloading through here instead: https://opus.nlpl.eu/download.php?f=EUbookshop/v2/moses/en-sv.txt.zip

and using the non-corrupt alignment file EUbookshop.en-sv.ids to sentence align the data. But I can't for the life of me remember what terminal command args I used to successfully do this. Everything I try now fails. Yet I have a successfully aligned file from a couple of months ago that is sitting there (just not able to recreate it...).


ZenBel commented on September 12, 2024

Keeping this thread alive by reporting the same issue with opustools 1.3.1 and the following command:

opus_read --directory EUbookshop \
    --suppress_prompts \
    --source en \
    --target ar \
    --preprocess raw \
    --leave_non_alignments_out \
    --write_mode moses \
    --write EUbookshop.en.ar.txt

The error reads:

opustools.parse.alignment_parser.AlignmentParserError: Error while parsing alignment file: Document './EUbookshop_latest_xml_ar-en.xml.gz' could not be parsed: mismatched tag: line 1767, column 2

