
Comments (4)

sugeeth14 commented on May 25, 2024

Hi,
Thanks a lot, it works fine now!

from parallel-corpora-tools.

M4t1ss commented on May 25, 2024

Hi! Thanks! 😄
But this sounds bad.
What about the files generated in the /output/removed directory? Are those not parallel as well? What number of removed sentences does it report after filtering?
The command looks OK.

Perhaps you could split the files into 4 parts and run it on each part separately? There may be some strange symbols in the data...


sugeeth14 commented on May 25, 2024

Hi,
Sorry for the delayed reply.

What about the files generated in the /output/removed directory? Are those not parallel as well?

No specific /output/removed directory was created, but alongside my input files two new files were created, one for each language. So the final files are input.txt and target.txt, plus the removed lines in input.txt.reptok and target.txt.reptok, and these are not parallel.

What number of removed sentences does it report after filtering?

It reports the following:
Removed 58304 sentence pairs with repeating tokens

and I get the two files mentioned above. Also, on a side note, I am doing this for English-to-Vietnamese translation, so input.txt contains the English text and target.txt contains the Vietnamese text. But this is a parallel corpus.


M4t1ss commented on May 25, 2024

The /output and /output/removed directories are created when you run https://github.com/M4t1ss/parallel-corpora-tools/blob/master/parallel/0-do-it-all.sh which calls all filtering scripts in order.

But now I think I figured out a problem... The script assumes that there is /output in the input file paths and attempts to replace that with /output/removed to figure out the directory for saving removed sentences.
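The behaviour described here can be sketched in a few lines of Python. This is only an illustration of the failure mode, assuming the removed-sentences location is derived by a plain substring replacement; the function name `removed_path` is hypothetical and this is not the script's actual code.

```python
def removed_path(input_path: str) -> str:
    """Derive where removed sentences would be saved (hypothetical helper).

    Assumption: the location is computed by replacing "/output" in the
    input path with "/output/removed", as described in the comment above.
    """
    return input_path.replace("/output", "/output/removed")


# With the intended directory structure the replacement finds a match.
with_output = removed_path("./output/general.clean.tc.id.en")

# Without "/output" in the path nothing is replaced, so the derived path
# is identical to the input path and no separate location is produced.
without_output = removed_path("general.clean.tc.id.en")

print(with_output)
print(without_output)
```

When the substring is missing, the derived path silently falls back to the original one, which is consistent with the run outside an /output directory misbehaving.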

Run like this, it fails. I also got files with mismatching line counts:

matiss@tontons:~/data/test-pct$ php repeating-tokens.php general.clean.tc.id.en general.clean.tc.id.lt
Removed 14962 sentence pairs with repeating tokens
matiss@tontons:~/data/test-pct$ wc -l *.reptok
  2677792 general.clean.tc.id.en.reptok
  2677892 general.clean.tc.id.lt.reptok
  5355684 total

This works:

matiss@tontons:~/data/test-pct$ php repeating-tokens.php ./output/general.clean.tc.id.en ./output/general.clean.tc.id.lt
Removed 14962 sentence pairs with repeating tokens
matiss@tontons:~/data/test-pct$ wc -l output/*.reptok
  2687626 output/general.clean.tc.id.en.reptok
  2687626 output/general.clean.tc.id.lt.reptok
  5375252 total

Either change or comment out lines 14 and 15 (and 25-28 if commenting out) or use the directory structure that I intended and it should work fine. 😄
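The `wc -l` comparison used above can also be scripted as a quick sanity check after filtering. A minimal sketch, assuming only that both sides of a parallel corpus must have the same number of lines; `count_lines` and `is_parallel` are hypothetical helpers and the demo file names are made up.

```python
import os
import tempfile


def count_lines(path: str) -> int:
    """Count lines in a file; a final line without a trailing newline
    is counted too, which differs slightly from `wc -l`."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def is_parallel(src_path: str, tgt_path: str) -> bool:
    """Return True when both files have the same number of lines."""
    return count_lines(src_path) == count_lines(tgt_path)


# Small demonstration with temporary files (names are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.en.reptok")
    tgt = os.path.join(tmp, "demo.lt.reptok")
    with open(src, "w", encoding="utf-8") as f:
        f.write("a\nb\n")
    with open(tgt, "w", encoding="utf-8") as f:
        f.write("x\ny\n")
    matching = is_parallel(src, tgt)      # both files have 2 lines

    with open(tgt, "a", encoding="utf-8") as f:
        f.write("z\n")
    mismatching = is_parallel(src, tgt)   # 2 lines vs 3 lines

print(matching, mismatching)
```

Equal line counts do not prove that the sentences are still aligned, but a mismatch like the one in the first run above is a reliable sign that something went wrong.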

