b64filter's People

Contributors

jelmervdl, kpu, wwaites

b64filter's Issues

Bad I/O interaction with perl

For some input, e.g.

csd3:/rds/project/t2_vol4/rds-t2-cs119/internet_archive/wide00006-shards/mt/78/1/sentences.gz

b64filter deadlocks when run as

echo 'while ($var = <STDIN>) { print($var) }' > cat.perl
gzip -dc sentences.gz | b64filter -q 1 perl cat.perl > /dev/null

However, replacing perl with cat,

gzip -dc sentences.gz | b64filter -q 1 cat > /dev/null

works just fine.
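
One common cause of this kind of hang, which may or may not be what is happening here, is that the wrapper writes to the child's stdin and reads from its stdout in the same goroutine. If the child buffers its output differently (as perl does compared to cat), both sides can end up blocked on a full pipe. The Go sketch below is not b64filter's code; pipeThrough and makeLines are hypothetical names. It only illustrates keeping the write and the read concurrent:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os/exec"
	"strings"
)

// makeLines builds n numbered lines, enough to overflow a pipe buffer.
func makeLines(n int) string {
	var sb strings.Builder
	for i := 0; i < n; i++ {
		fmt.Fprintf(&sb, "line %d\n", i)
	}
	return sb.String()
}

// pipeThrough feeds input to the child's stdin and returns everything the
// child wrote to stdout. The write runs in its own goroutine so the parent
// keeps draining stdout while input is still being sent.
func pipeThrough(input, name string, args ...string) (string, error) {
	cmd := exec.Command(name, args...)
	stdin, err := cmd.StdinPipe()
	if err != nil {
		return "", err
	}
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return "", err
	}
	if err := cmd.Start(); err != nil {
		return "", err
	}

	writeErr := make(chan error, 1)
	go func() {
		_, werr := io.Copy(stdin, strings.NewReader(input))
		// Closing stdin tells the child that the input has ended.
		if cerr := stdin.Close(); werr == nil {
			werr = cerr
		}
		writeErr <- werr
	}()

	// Read until the child closes stdout, then collect the other results.
	var out bytes.Buffer
	_, readErr := io.Copy(&out, stdout)
	werr := <-writeErr
	if err := cmd.Wait(); err != nil {
		return out.String(), err
	}
	if readErr != nil {
		return out.String(), readErr
	}
	return out.String(), werr
}

func main() {
	in := makeLines(200000) // a few megabytes, well above a typical pipe buffer
	out, err := pipeThrough(in, "cat")
	fmt.Println(len(in) == len(out), err)
}
```

Swapping "cat" for "perl", "cat.perl" in main would exercise the case from this report.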

Lines being split

Hi. I am having an issue using b64filter in some Catalan tests with Apertium. It looks like b64filter splits certain lines. I was able to reproduce the issue without Apertium; it also happens with a plain cat:

sentences.gz

zcat sentences.gz | base64 -d | wc -l
25217
zcat sentences.gz | ~/go/bin/b64filter cat - | base64 -d | wc -l
2020/05/08 13:50:34 b64filter.go:198: writeDocs: written 100 docs, 4992 lines in 23.551143ms
2020/05/08 13:50:34 b64filter.go:198: writeDocs: written 200 docs, 18275 lines in 46.941439ms
2020/05/08 13:50:35 b64filter.go:274: processed 252 documents
25469

Note that wc -l over all the base64-decoded input documents gives 25217 lines, while the output after passing the same input through b64filter with cat - has more lines (25469).

This produces wrong output files when using translation engines, because the translations no longer belong to the sentences they are attached to.

In case you can't reproduce the output from my example, I attach it here (differences start at line 137):

sentences_output.gz
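
The difference between the two counts is 252, which equals the number of documents in the log above. That would be consistent with one extra line per document, although nothing in this report confirms it. To narrow it down, the per-document line counts of input and output can be compared. The Go sketch below is a hypothetical diagnostic helper, not part of b64filter; it assumes each document is one line of standard padded base64:

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
)

// lineCount decodes one base64-encoded document and counts its newline
// characters, which is what `wc -l` reports.
func lineCount(doc string) (int, error) {
	raw, err := base64.StdEncoding.DecodeString(doc)
	if err != nil {
		return 0, err
	}
	return bytes.Count(raw, []byte("\n")), nil
}

// firstMismatch returns the index of the first document whose line count
// differs between in and out, or -1 if all documents agree. If one list is
// simply shorter, the index just past its end is returned.
func firstMismatch(in, out []string) (int, error) {
	n := len(in)
	if len(out) < n {
		n = len(out)
	}
	for i := 0; i < n; i++ {
		a, err := lineCount(in[i])
		if err != nil {
			return i, fmt.Errorf("input document %d: %w", i, err)
		}
		b, err := lineCount(out[i])
		if err != nil {
			return i, fmt.Errorf("output document %d: %w", i, err)
		}
		if a != b {
			return i, nil
		}
	}
	if len(in) != len(out) {
		return n, nil
	}
	return -1, nil
}

func main() {
	in := []string{"YQo=", "YQpiCg=="} // "a\n" and "a\nb\n"
	out := []string{"YQo=", "YQo="}    // second document lost a line
	fmt.Println(firstMismatch(in, out)) // prints: 1 <nil>
}
```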

End of file error when running b64filter in Bitextor

Hi,
I am running Bitextor for martacrawl es-eu, and it crashed after outputting this:

2020/08/03 10:52:15 b64filter.go:198: writeDocs: written 22200 docs, 2233737 lines in 2m46.222660934s
2020/08/03 10:52:16 b64filter.go:198: writeDocs: written 22300 docs, 2242692 lines in 2m46.230375568s
terminate called after throwing an instance of 'util::EndOfFileException'
  what():  End of file
2020/08/03 10:52:16 b64filter.go:258: error waiting for command: signal: aborted

@lpla told me that this is a b64filter issue, so this is why I'm reporting here :)

Any clue on what's going on? I kinda know which shards are failing, so I can provide them if you need to reproduce the issue.

Thanks!

buffers - error introduced

The change ported from /paracrawl/b64map/1 caused the readNLines function to fail, because the underlying buffer returns io.EOF when it is full. readNLines sensibly takes io.EOF to mean that there is no more data to read, but in this case that is the wrong conclusion. For now this is masked by simply not treating io.EOF as an error, which will be a problem if the wrapped command does not actually produce enough lines.
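
One possible way to stop masking the error is to read through a reader that only reports io.EOF when the stream has really ended, and to report a short read explicitly. The Go sketch below reuses the name readNLines from this issue, but its signature and body are assumptions for illustration, not the project's actual code; readNLinesFromString is a hypothetical helper for trying it out:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readNLines reads exactly n lines from r. If the stream ends before n lines
// were read, it returns the lines seen so far and io.ErrUnexpectedEOF, so the
// caller can tell "the child produced too few lines" from a normal return.
func readNLines(r *bufio.Reader, n int) ([]string, error) {
	lines := make([]string, 0, n)
	for len(lines) < n {
		// ReadString grows its result as needed, so a long line does not
		// surface as an end-of-file condition.
		line, err := r.ReadString('\n')
		if err == io.EOF {
			if line == "" {
				return lines, io.ErrUnexpectedEOF
			}
			// A final line without a trailing newline still counts.
			lines = append(lines, line)
			continue
		}
		if err != nil {
			return lines, err
		}
		lines = append(lines, line)
	}
	return lines, nil
}

// readNLinesFromString wraps a string in a bufio.Reader for experiments.
func readNLinesFromString(s string, n int) ([]string, error) {
	return readNLines(bufio.NewReader(strings.NewReader(s)), n)
}

func main() {
	lines, err := readNLinesFromString("a\nb\n", 3)
	fmt.Println(len(lines), err) // prints: 2 unexpected EOF
}
```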

b64filter does not halt when child command fails

I've been able to reproduce it with this command:

gzip -cd ~/hieu/nn/0/1/sentences.gz | b64filter parallel --halt 2 -j 16 --pipe -k -l 1024 false

parallel reports that false failed, but b64filter does not return and hangs indefinitely.

It only happens for small sets of input sentences. If I try this with a sentences.gz from English, it fails and stops on a pipe error.
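
A plausible reading, not confirmed in this report, is that with small inputs the whole input fits in the pipe buffer, so the writer never sees a broken pipe and the reader keeps waiting for lines that will never arrive. The Go sketch below shows one way a wrapper could always terminate: read until stdout closes, then take the child's exit status from Wait and check the line count. filterLines is a hypothetical name and this is not b64filter's implementation:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os/exec"
)

// filterLines sends lines to the child and expects one output line per input
// line. It returns an error if the child exits with a failure status or
// produces the wrong number of lines, instead of waiting for more output.
func filterLines(lines []string, name string, args ...string) ([]string, error) {
	cmd := exec.Command(name, args...)
	stdin, err := cmd.StdinPipe()
	if err != nil {
		return nil, err
	}
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return nil, err
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}

	done := make(chan struct{})
	go func() {
		defer close(done)
		defer stdin.Close()
		for _, l := range lines {
			// A write error (for example a broken pipe) just ends the
			// writer; the exit status is reported by Wait below.
			if _, err := io.WriteString(stdin, l+"\n"); err != nil {
				return
			}
		}
	}()

	var out []string
	sc := bufio.NewScanner(stdout)
	sc.Buffer(make([]byte, 0, 64*1024), 64*1024*1024) // allow long lines
	for sc.Scan() {
		out = append(out, sc.Text())
	}
	// If scanning stopped early, drain the rest so the child cannot block.
	io.Copy(io.Discard, stdout)
	<-done

	if err := cmd.Wait(); err != nil {
		return out, fmt.Errorf("child command failed: %w", err)
	}
	if err := sc.Err(); err != nil {
		return out, err
	}
	if len(out) != len(lines) {
		return out, fmt.Errorf("expected %d lines, got %d", len(lines), len(out))
	}
	return out, nil
}

func main() {
	_, err := filterLines([]string{"a", "b"}, "false")
	fmt.Println(err)
}
```

With false as the child, stdout closes as soon as the child exits, so the read loop ends and the failure is returned to the caller.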
