
Comments (5)

miau1 avatar miau1 commented on September 12, 2024

Yes, you can use opus_get for that. To list available raw datasets for a language:
opus_get --source <lang_id> --preprocess raw --list
And to download them, just remove the --list flag:
opus_get --source <lang_id> --preprocess raw
You can find more examples here: https://github.com/Helsinki-NLP/OpusTools/blob/master/opustools_pkg/README.md#opus_get
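If you want to drive the listing from a script, a minimal Python sketch could look like the one below. It only uses the flags shown above (--source, --preprocess, --list); the function names opus_get_command and list_raw_datasets are made up for this illustration and are not part of OpusTools, and it assumes opus_get is on your PATH.

```python
import subprocess


def opus_get_command(lang_id, preprocess="raw", list_only=True):
    """Build the argument list for the opus_get calls shown above.

    Hypothetical helper for illustration; not part of OpusTools.
    """
    cmd = ["opus_get", "--source", lang_id, "--preprocess", preprocess]
    if list_only:
        # With --list the datasets are only listed; without it they are downloaded.
        cmd.append("--list")
    return cmd


def list_raw_datasets(lang_id):
    """Run the listing command and return whatever opus_get prints, as text."""
    result = subprocess.run(
        opus_get_command(lang_id),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if opus_get exits with an error
    )
    return result.stdout
```

Calling list_raw_datasets("fr") should return the same text you would see in the terminal for the first command.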

from opustools.

jchwenger avatar jchwenger commented on September 12, 2024

Amazing! Thank you so much!! Silly me for not reading properly. That's just great, it works nicely.

One last question, however. I notice that the 'raw' category gives me XML files. You can reproduce this with:

opus_get --source fr --preprocess raw -d bible-uedin -dl test

I'm after the untokenized raw text, if at all possible. I can get it through the api by doing something like:

corpus="EUconst"
curl "http://opus.nlpl.eu/opusapi/?corpus=$corpus&source=fr&preprocessing=mono&version=latest" \
  | jq '.corpora[1].url' \
  | xargs wget

(The API returns an object with a "corpora" key, and the second entry in that list is the raw, untokenized monolingual text file.)

Cf. here.
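For reference, the same lookup can be sketched in Python with only the standard library. The query parameters, the "corpora" key, the "url" field and the choice of the second entry all come from the curl/jq pipeline above; the function names are hypothetical, and nothing else about the response layout is assumed.

```python
import json
import urllib.parse
import urllib.request

API_URL = "http://opus.nlpl.eu/opusapi/"


def build_query(corpus, source, preprocessing="mono", version="latest"):
    """Build the same query URL as the curl call above."""
    params = {
        "corpus": corpus,
        "source": source,
        "preprocessing": preprocessing,
        "version": version,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def pick_url(response, index=1):
    """Return the "url" of one entry under "corpora".

    index=1 mirrors jq '.corpora[1].url'; treating the second entry as the
    untokenized file is an observation from this thread, not a documented rule.
    """
    return response["corpora"][index]["url"]


def fetch_mono_url(corpus, source):
    """Query the API and return the download URL of the second entry."""
    with urllib.request.urlopen(build_query(corpus, source)) as reply:
        return pick_url(json.load(reply))
```

fetch_mono_url("EUconst", "fr") would then give the URL that the pipeline above hands to wget.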

Any chance you might integrate that option in the future? The list functionality is great, especially with the size displayed: I'll be able to use that programmatically (to sort the download queue by size).
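To make the sorting idea concrete, here is a rough sketch. It assumes, and this is only an assumption about the listing format, that each dataset line looks like "<number> <unit> <target>" with KB, MB or GB as the unit; lines that do not match (a total, for instance) are skipped. The names parse_listing and smallest_first are made up for the example.

```python
import re

# Size of each unit in KB. The unit names are an assumption about the listing.
UNITS = {"KB": 1, "MB": 1024, "GB": 1024 ** 2}

# Assumed line shape: "<number> <unit> <target>", e.g. "  7 MB some/file.zip".
LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB)\s+(\S+)\s*$")


def parse_listing(text):
    """Return (size_in_kb, target) pairs for every line that matches LINE."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            number, unit, target = match.groups()
            entries.append((float(number) * UNITS[unit], target))
    return entries


def smallest_first(text):
    """Return the targets ordered from smallest to largest download."""
    return [target for _, target in sorted(parse_listing(text))]
```

The regular expression would need adjusting if the real listing is laid out differently.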


jchwenger avatar jchwenger commented on September 12, 2024

In case it helps anyone, I made this little script that uses the Python utility to download the mono raw files.


miau1 avatar miau1 commented on September 12, 2024

Ok, I see. Yes, the raw files are XML files that contain untokenized text. We can probably make the raw non-XML files downloadable with opus_get as well, good suggestion!

There is also the opus_cat script that can be used to get plain monolingual text from a given corpus:
opus_cat --directory <corpus_name> --language <lang_id> --plain --no_ids
opus_cat was originally designed for taking a quick look into a corpus, so its output always includes file names and is always tokenized. In the future, we will probably also add options to remove the file names and to output untokenized text.


jchwenger avatar jchwenger commented on September 12, 2024

Sounds good, thanks for your answer! Awesome that you developed this utility, very useful. As you can see I could get what I wanted with a bit of Python scripting, but thanks for the opus_cat as well, I'm sure it'll come in handy in the future for me as well.

