oscar-project / oscar-tools
The original tooling for the OSCAR corpus rewritten in Rust
License: Apache License 2.0
The CLI documentation is lacking; we need to add more details about operations.
Describe the bug
Error when using extract-text on the Japanese dataset for OSCAR 2201 (v2).
To Reproduce
Steps to reproduce the behavior:
oscar-tools v2 extract-text ja_meta_part_1.jsonl.gz ja_part_1.txt
Error: Io(Error { kind: InvalidData, message: "stream did not contain valid UTF-8" })
Expected behavior
Expected an OSCAR-2019 (v1) text format
We should print an error and then move on to the next file rather than stopping everything.
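A minimal sketch of that behaviour, using only the standard library. `process_file` and `process_all` are hypothetical names, not functions from oscar-tools, and the per-file step (reading the file as UTF-8 text and counting lines) is a stand-in for a real operation. It fails with the same `InvalidData` error kind as in the report above whenever a file is not valid UTF-8.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical per-file step: read the whole file as UTF-8 text and count lines.
/// `fs::read_to_string` returns `ErrorKind::InvalidData` on non-UTF-8 bytes.
fn process_file(path: &Path) -> io::Result<usize> {
    let text = fs::read_to_string(path)?;
    Ok(text.lines().count())
}

/// Process every file, reporting failures instead of aborting on the first one.
/// Returns the successes and the failures so the caller can decide what to do.
fn process_all(paths: &[PathBuf]) -> (Vec<(PathBuf, usize)>, Vec<(PathBuf, io::Error)>) {
    let mut done = Vec::new();
    let mut failed = Vec::new();
    for path in paths {
        match process_file(path) {
            Ok(line_count) => done.push((path.clone(), line_count)),
            Err(err) => {
                // Print the error and keep going with the next file.
                eprintln!("error: {}: {}; skipping", path.display(), err);
                failed.push((path.clone(), err));
            }
        }
    }
    (done, failed)
}
```

Returning the failures alongside the successes lets the caller still exit with a non-zero status once every file has been attempted.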
Since the new pipeline outputs a single .jsonl file containing both textual content and metadata, working exclusively with textual data may be less comfortable: a jsonl-compatible parser is needed in order to extract content.
For now, we should provide a tool that splits up a given OSCAR Schema v2 file into an OSCAR Schema v1.2-compatible file.
The package command is in two steps:
The command name might be ambiguous, and there may be a better way of doing all of this.
This is the command to rebuild the corpus from rebuild (avro) files.
It might be trickier to port because it has more dependencies, iirc.
Code is here: https://github.com/oscar-project/ungoliant/blob/v1.2.3/src/processing/rebuild.rs
Describe the bug
Wrong order of documents in split op
To Reproduce
Steps to reproduce the behavior:
Split a file into multiple parts.
Expected behavior
Part one has the first docs, and so on.
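The expected ordering can be sketched as follows. This is an illustration only, not the split implementation in oscar-tools; `split_in_order` is a hypothetical helper that works on documents already held in memory.

```rust
/// Split `docs` into consecutive parts of at most `part_size` documents,
/// preserving input order: part 0 holds the first documents, and so on.
fn split_in_order<T: Clone>(docs: &[T], part_size: usize) -> Vec<Vec<T>> {
    assert!(part_size > 0, "part_size must be positive");
    docs.chunks(part_size).map(|chunk| chunk.to_vec()).collect()
}
```

Concatenating the parts in index order gives back the original sequence, which is the property the report expects from the split operation.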
Compress attempts to compress at max_depth=1, but splitting puts each language in a folder.
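To illustrate the mismatch, here is a small depth-limited directory walk using only the standard library. It assumes that a depth of 1 means "direct children of the root only"; `files_up_to_depth` is a hypothetical helper, not the code used by the compress operation.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect regular files under `root`, descending at most `max_depth` levels.
/// Depth 1 means only the direct children of `root` are considered.
fn files_up_to_depth(root: &Path, max_depth: usize) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    if max_depth == 0 {
        return Ok(found);
    }
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        if path.is_dir() {
            // Sub-folders consume one level of the remaining depth budget.
            found.extend(files_up_to_depth(&path, max_depth - 1)?);
        } else {
            found.push(path);
        }
    }
    found.sort();
    Ok(found)
}
```

With one folder per language under the root, a depth of 1 finds no files at all, while a depth of 2 reaches the files inside each language folder.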
For now, each operation implementation needs some boilerplate code to generate a CLI command and options.
Implementing a specific operation on a corpus version should automatically add the said operation into the operations list.
This should, however, require the user to explicitly choose an OSCAR Schema version. We could add a versions subcommand listing the different OSCAR Schema versions and their corpora:
OSCAR Schema v1: 19XX
OSCAR Schema v1.1: 2109
OSCAR Schema v2: 2201
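A rough sketch of what such a listing could look like, with the table copied from the list above (including the unspecified "19XX" entry). `SCHEMA_VERSIONS` and `render_versions` are hypothetical names; wiring this into the actual CLI parser is left out.

```rust
/// OSCAR Schema versions and their corpora, copied from the list above.
const SCHEMA_VERSIONS: &[(&str, &[&str])] = &[
    ("v1", &["19XX"]),
    ("v1.1", &["2109"]),
    ("v2", &["2201"]),
];

/// Render the text a hypothetical `versions` subcommand could print,
/// one line per schema version.
fn render_versions() -> String {
    SCHEMA_VERSIONS
        .iter()
        .map(|(schema, corpora)| format!("OSCAR Schema {}: {}", schema, corpora.join(", ")))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Keeping the table in one place means a new schema version or corpus only has to be added once for the listing to pick it up.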