oscar-tools's People

Contributors

ruamismail, uinelj

oscar-tools's Issues

CLI Documentation

The CLI documentation is very lackluster; we need to add more details about each operation.

[BUG] can't `extract-text` on Japanese (ja) dataset for OSCAR 2201 (v2)

Describe the bug
Error when using extract-text on Japanese dataset for OSCAR 2201 (v2).

To Reproduce
Steps to reproduce the behavior:

  1. Ran `oscar-tools v2 extract-text ja_meta_part_1.jsonl.gz ja_part_1.txt`
  2. Got: `Error: Io(Error { kind: InvalidData, message: "stream did not contain valid UTF-8" })`
  3. An empty destination file is produced.

Expected behavior
Expected output in the OSCAR-2019 (v1) text format.

Desktop (please complete the following information):

  • OS: Linux 4.19.0-16-amd64
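
One possible cause, stated here as an assumption and not a confirmed diagnosis, is that the gzip-compressed `.jsonl.gz` input is read as plain UTF-8 text. The gzip magic bytes `0x1f 0x8b` are not valid UTF-8, which would produce this kind of error. A minimal Python sketch of a reader that detects gzip input before decoding (the `open_text` helper is hypothetical and not part of oscar-tools):

```python
import gzip

# The first two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def open_text(path):
    """Open `path` as UTF-8 text, decompressing it first if it is gzip."""
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == GZIP_MAGIC
    if is_gzip:
        # "rt" makes gzip decode the decompressed bytes as text.
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

Checking the magic bytes instead of the file extension keeps the helper usable on files that were renamed after compression.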

Feature: content/metadata file splitter

Since the new pipeline outputs a single .jsonl file containing both textual content and metadata, working exclusively with textual data may be less convenient:

  1. Metadata must be downloaded even though it will be discarded
  2. A jsonl-compatible parser is required to extract the content

For now, we should provide a tool that splits a given OSCAR Schema v2 file, producing an OSCAR Schema v1.2-compatible file.
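
A rough Python sketch of such a splitter follows. It rests on two assumptions that are not stated in this issue: that each v2 line is a JSON object whose text lives under a `content` key, and that the text-only format separates documents with a blank line. The function names are hypothetical.

```python
import json


def v2_to_text(jsonl_lines):
    """Yield the text of each document found in OSCAR Schema v2 JSONL lines."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        # Assumption: the document text is stored under the "content" key.
        yield json.loads(line)["content"]


def write_text_only(jsonl_lines, out):
    """Write documents to `out`, each followed by a blank line (assumed layout)."""
    for text in v2_to_text(jsonl_lines):
        out.write(text)
        out.write("\n\n")
```

Because both functions work on iterables of lines, the metadata is dropped one record at a time and the whole file never has to fit in memory.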

Port from Ungoliant: `package` command

The `package` command works in two steps:

  • Create one folder per language, and move splits into these folders
  • Compute the sha256sum of each split and write them to a sha256sum file.

The command name might be ambiguous, and there may be a better way of doing all of this.
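
The two steps above could look roughly like this in Python. This is a sketch under assumptions, not the Ungoliant implementation: the language is taken to be the part of the file name before the first underscore (as in `ja_part_1.txt`), and the checksum file name `checksum.sha256` is hypothetical.

```python
import hashlib
import shutil
from pathlib import Path

CHECKSUM_FILE = "checksum.sha256"  # hypothetical name


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package(src_dir, dst_dir):
    """Move splits into per-language folders, then write their checksums."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    # Step 1: one folder per language.
    for split in sorted(p for p in src.iterdir() if p.is_file()):
        lang = split.name.split("_", 1)[0]  # assumption: "<lang>_..." naming
        lang_dir = dst / lang
        lang_dir.mkdir(exist_ok=True)
        shutil.move(str(split), str(lang_dir / split.name))

    # Step 2: one checksum file per language folder.
    for lang_dir in sorted(p for p in dst.iterdir() if p.is_dir()):
        lines = [
            f"{sha256_of(split)}  {split.name}\n"
            for split in sorted(lang_dir.iterdir())
            if split.name != CHECKSUM_FILE
        ]
        (lang_dir / CHECKSUM_FILE).write_text("".join(lines))
```

Each line uses the `<digest>  <file name>` layout, so the result should be checkable with `sha256sum -c` from inside the language folder.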

[BUG] Wrong order of documents in split operation

Describe the bug
Wrong order of documents in split op

To Reproduce
Steps to reproduce the behavior:
Split a file into multiple parts.

Expected behavior
Part one contains the first documents, part two the next ones, and so on.
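
For reference, the expected ordering can be expressed as a short Python sketch. This illustrates the desired behavior only and is not the oscar-tools implementation; `split_in_order` and `part_size` are hypothetical names.

```python
from itertools import islice


def split_in_order(docs, part_size):
    """Yield consecutive parts of at most `part_size` documents, in input order."""
    if part_size < 1:
        raise ValueError("part_size must be at least 1")
    it = iter(docs)
    while True:
        # islice consumes the next `part_size` items from the shared iterator,
        # so each part starts exactly where the previous one stopped.
        part = list(islice(it, part_size))
        if not part:
            return
        yield part
```

Concatenating the parts in the order they are produced gives back the original document sequence.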

Improve CLI ergonomics

For now, each operation implementation needs some boilerplate code to generate a CLI command and options.

Implementing a specific operation on a corpus version should automatically add that operation to the operations list.

This should, however, require the user to explicitly choose an OSCAR Schema version. We could add a `versions` subcommand listing the different OSCAR Schema versions and their corpora:

OSCAR Schema v1:     19XX
OSCAR Schema v1.1:   2109
OSCAR Schema v2:     2201
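
One way to picture the idea is a small registry in which declaring an operation for a schema version is enough to expose it on the command line. The following Python sketch uses `argparse` purely as an illustration; the decorator, the placeholder operation body, and the single `source` argument are hypothetical and do not reflect the actual oscar-tools code.

```python
import argparse

# Schema versions and their corpora, as listed above.
SCHEMA_VERSIONS = {"v1": ["19XX"], "v1.1": ["2109"], "v2": ["2201"]}

# Registry: schema version -> operation name -> implementation.
OPERATIONS = {version: {} for version in SCHEMA_VERSIONS}


def operation(version, name):
    """Register the decorated function as `name` for a schema version."""
    def register(func):
        OPERATIONS[version][name] = func
        return func
    return register


@operation("v2", "extract-text")
def extract_text(args):
    # Placeholder body: a real operation would process args.source.
    return f"extract-text on {args.source}"


def format_versions():
    """Return one line per schema version with its corpora."""
    return "\n".join(
        f"OSCAR Schema {version}: {', '.join(corpora)}"
        for version, corpora in SCHEMA_VERSIONS.items()
    )


def build_parser():
    """Build the CLI from the registry, with no per-operation boilerplate."""
    parser = argparse.ArgumentParser(prog="oscar-tools")
    top = parser.add_subparsers(dest="command", required=True)
    top.add_parser("versions")
    for version, ops in OPERATIONS.items():
        version_parser = top.add_parser(version)
        sub = version_parser.add_subparsers(dest="operation", required=True)
        for name, func in ops.items():
            op_parser = sub.add_parser(name)
            op_parser.add_argument("source")
            op_parser.set_defaults(func=func)
    return parser
```

With this layout the schema version is always the first positional word, so the user has to choose it explicitly, and a new operation only needs the decorator to appear in the CLI.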
