Comments (4)
Hi,
Thanks a lot, it works fine now!
from parallel-corpora-tools.
Hi! Thanks! 😄
But this sounds bad.
What about the files generated in the /output/removed directory? Are those not parallel as well? How many removed sentences does it report after filtering?
The command looks OK.
Perhaps you can try to split the files and run it on 4 parts? There may be some strange symbols in the data...
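One possible way to do the split, as a rough sketch: the helper name `split_four` and the part prefixes are made up for illustration and are not part of parallel-corpora-tools. It assumes both files have the same number of lines and end with a newline, so that the chunks of the two languages line up.

```shell
# Hypothetical helper (not part of parallel-corpora-tools):
# split FILE into 4 chunks with the same number of lines each
# (the last chunk may be shorter). Usage: split_four FILE PREFIX
split_four() {
    total=$(wc -l < "$1" | tr -d ' ')
    # Nothing to split in an empty file
    [ "$total" -gt 0 ] || return 1
    # Round up so that at most 4 chunks are produced
    per_part=$(( (total + 3) / 4 ))
    # Writes PREFIXaa, PREFIXab, PREFIXac, PREFIXad
    split -l "$per_part" "$1" "$2"
}

# Example calls (file names are placeholders):
#   split_four input.txt  input.part.
#   split_four target.txt target.part.
```

Each pair of chunks (for example `input.part.aa` and `target.part.aa`) could then be passed to the filtering script separately to narrow down which part of the data causes the mismatch.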
from parallel-corpora-tools.
Hi,
Sorry for the delayed reply.
What about the files generated in the /output/removed directory? Are those not parallel as well?
No /output/removed directory is created, but besides my input files there are two new files, one for each language. So the final files are input.txt and target.txt, and the removed lines are in input.txt.reptok and target.txt.reptok, which are not parallel.
What number of removed sentences does it report after filtering?
It reports the following:
Removed 58304 sentence pairs with repeating tokens
and I get the two files mentioned above. As a side note, I am doing this for English-to-Vietnamese translation, so input.txt contains English text and target.txt contains Vietnamese text. The input itself is a parallel corpus.
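A quick way to check whether two files are still aligned is to compare their line counts. The sketch below uses a made-up helper name, `same_line_count`. Equal counts are necessary for the files to be parallel but do not prove that every pair matches.

```shell
# Hypothetical helper: succeeds only if both files have
# the same number of lines. Usage: same_line_count FILE_A FILE_B
same_line_count() {
    a=$(wc -l < "$1" | tr -d ' ')
    b=$(wc -l < "$2" | tr -d ' ')
    [ "$a" -eq "$b" ]
}

# Example call with the file names from this comment:
#   same_line_count input.txt.reptok target.txt.reptok \
#       && echo "line counts match" || echo "line counts differ"
```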
from parallel-corpora-tools.
The /output and /output/removed directories are created when you run https://github.com/M4t1ss/parallel-corpora-tools/blob/master/parallel/0-do-it-all.sh which calls all filtering scripts in order.
But now I think I figured out the problem... The script assumes that there is /output in the input file paths and attempts to replace that with /output/removed to figure out the directory for saving removed sentences.
So when run like this it fails, and I also got files with mismatching line counts:
matiss@tontons:~/data/test-pct$ php repeating-tokens.php general.clean.tc.id.en general.clean.tc.id.lt
Removed 14962 sentence pairs with repeating tokens
matiss@tontons:~/data/test-pct$ wc -l *.reptok
2677792 general.clean.tc.id.en.reptok
2677892 general.clean.tc.id.lt.reptok
5355684 total
This works:
matiss@tontons:~/data/test-pct$ php repeating-tokens.php ./output/general.clean.tc.id.en ./output/general.clean.tc.id.lt
Removed 14962 sentence pairs with repeating tokens
matiss@tontons:~/data/test-pct$ wc -l output/*.reptok
2687626 output/general.clean.tc.id.en.reptok
2687626 output/general.clean.tc.id.lt.reptok
5375252 total
Either change or comment out lines 14 and 15 (and 25-28 if commenting out) or use the directory structure that I intended and it should work fine. 😄
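The path assumption described above can be illustrated with a small sketch. This is not the actual PHP code from repeating-tokens.php: `removed_path` is a made-up helper, and it assumes the script does a plain substring replacement of "/output" with "/output/removed".

```shell
# Hypothetical helper (an assumption, not the script's real code):
# derive the path for removed sentences by replacing "/output"
# with "/output/removed" in the input path.
removed_path() {
    printf '%s\n' "$1" | sed 's|/output|/output/removed|'
}

# Input path contains "/output": the result points into /output/removed
removed_path ./output/general.clean.tc.id.en
# prints ./output/removed/general.clean.tc.id.en

# Input path has no "/output": nothing is replaced, the path is unchanged
removed_path general.clean.tc.id.en
# prints general.clean.tc.id.en
```

Under this assumption, an input path without "/output" gives no separate location for the removed sentences, which would fit the mismatching line counts in the first run.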
from parallel-corpora-tools.