
vibrato's Introduction

🎤 vibrato: VIterbi-Based acceleRAted TOkenizer


Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm.

A Python wrapper is also available here.

Wasm Demo (it takes a little time to load the model).

Features

Fast tokenization

Vibrato is a Rust reimplementation of the fast tokenizer MeCab, although its implementation has been simplified and optimized for even faster tokenization. Especially for language resources with a large matrix (e.g., unidic-cwj-3.1.1 with a matrix of 459 MiB), Vibrato will run faster thanks to cache-efficient id mappings.

For example, the following figure shows an experimental result of tokenization time with MeCab and its reimplementations. The detailed experimental settings and other results are available on the Wiki.

MeCab compatibility

Vibrato supports options for outputting tokenized results identical to MeCab, such as ignoring whitespace.

Training parameters

Vibrato also supports training the parameters (or costs) of dictionaries from your own corpora. A detailed description can be found here.

Basic usage

This software is implemented in Rust. First of all, install rustc and cargo following the official instructions.

1. Dictionary preparation

You can easily get started with Vibrato by downloading a precompiled dictionary. The Releases page distributes several precompiled dictionaries from different resources.

Here, consider using mecab-ipadic v2.7.0. (Replace VERSION with an appropriate Vibrato release tag, such as v0.5.0.)

$ wget https://github.com/daac-tools/vibrato/releases/download/VERSION/ipadic-mecab-2_7_0.tar.xz
$ tar xf ipadic-mecab-2_7_0.tar.xz

You can also compile or train system dictionaries from your own resources. See the docs for more advanced usage.
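As a rough sketch of what compilation from raw resources looks like, the command below follows the pattern of the bundled compile tool; the input file names are placeholders for the MeCab-format lexicon, connection matrix, unknown-word, and character definition files:

```shell
$ cargo run --release -p compile -- \
    -l lex.csv -m matrix.def -u unk.def -c char.def \
    -o system.dic.zst
```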

2. Tokenization

To tokenize sentences using the system dictionary, run the following command.

$ echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst

The resultant tokens will be output in the MeCab format.

本	名詞,一般,*,*,*,*,本,ホン,ホン
と	助詞,並立助詞,*,*,*,*,と,ト,ト
カレー	名詞,固有名詞,地域,一般,*,*,カレー,カレー,カレー
の	助詞,連体化,*,*,*,*,の,ノ,ノ
街	名詞,一般,*,*,*,*,街,マチ,マチ
神保	名詞,固有名詞,地域,一般,*,*,神保,ジンボウ,ジンボー
町	名詞,接尾,地域,*,*,*,町,マチ,マチ
へ	助詞,格助詞,一般,*,*,*,へ,ヘ,エ
ようこそ	感動詞,*,*,*,*,*,ようこそ,ヨウコソ,ヨーコソ
。	記号,句点,*,*,*,*,。,。,。
EOS

If you want to output tokens separated by spaces, specify -O wakati.

$ echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst -O wakati
本 と カレー の 街 神保 町 へ ようこそ 。

Notes for Vibrato APIs

The distributed models are compressed in zstd format. If you want to load these compressed models with the vibrato API, you must decompress them outside of the API.

// Requires the zstd crate (or ruzstd as a pure-Rust alternative)
use std::fs::File;
use vibrato::Dictionary;

let reader = zstd::Decoder::new(File::open("path/to/system.dic.zst")?)?;
let dict = Dictionary::read(reader)?;

Tokenization options

MeCab-compatible options

Vibrato is a reimplementation of the MeCab algorithm, but with the default settings it can produce different tokens from MeCab.

For example, MeCab ignores spaces (more precisely, SPACE defined in char.def) in tokenization.

$ echo "mens second bag" | mecab
mens	名詞,固有名詞,組織,*,*,*,*
second	名詞,一般,*,*,*,*,*
bag	名詞,固有名詞,組織,*,*,*,*
EOS

However, Vibrato handles such spaces as tokens with the default settings.

$ echo 'mens second bag' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst
mens	名詞,固有名詞,組織,*,*,*,*
 	記号,空白,*,*,*,*,*
second	名詞,固有名詞,組織,*,*,*,*
 	記号,空白,*,*,*,*,*
bag	名詞,固有名詞,組織,*,*,*,*
EOS

If you want to obtain the same results as MeCab, specify the arguments -S and -M 24.

$ echo 'mens second bag' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst -S -M 24
mens	名詞,固有名詞,組織,*,*,*,*
second	名詞,一般,*,*,*,*,*
bag	名詞,固有名詞,組織,*,*,*,*
EOS

-S makes the tokenizer ignore spaces, and -M sets the maximum grouping length for unknown words.

Notes

There are corner cases where tokenization produces different outcomes due to cost tie-breaking. However, this is not an essential problem in practice.

User dictionary

You can use your user dictionary along with the system dictionary. The user dictionary must be in the CSV format.

<surface>,<left-id>,<right-id>,<cost>,<features...>

The first four columns are always required. The others (i.e., <features...>) are optional.

For example,

$ cat user.csv
神保町,1293,1293,334,カスタム名詞,ジンボチョウ
本とカレーの街,1293,1293,0,カスタム名詞,ホントカレーノマチ
ようこそ,3,3,-1000,感動詞,ヨーコソ,Welcome,欢迎欢迎,Benvenuto,Willkommen
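As an illustration of this row layout, here is a minimal parsing sketch in plain Rust. The UserEntry struct and parse_row function are made up for illustration and are not part of the vibrato API; quoted fields are not handled.

```rust
// Illustrative parser for one user-dictionary row:
// <surface>,<left-id>,<right-id>,<cost>,<features...>
// Not vibrato's actual loader; it only demonstrates the column layout.
#[derive(Debug, PartialEq)]
struct UserEntry {
    surface: String,
    left_id: u16,
    right_id: u16,
    cost: i16,
    features: Vec<String>,
}

fn parse_row(row: &str) -> Option<UserEntry> {
    let mut cols = row.split(',');
    let surface = cols.next()?.to_string();
    let left_id = cols.next()?.parse().ok()?;
    let right_id = cols.next()?.parse().ok()?;
    let cost = cols.next()?.parse().ok()?;
    // Everything after the fourth column is an optional feature string.
    let features = cols.map(str::to_string).collect();
    Some(UserEntry { surface, left_id, right_id, cost, features })
}

fn main() {
    let e = parse_row("神保町,1293,1293,334,カスタム名詞,ジンボチョウ").unwrap();
    assert_eq!(e.surface, "神保町");
    assert_eq!(e.cost, 334);
    assert_eq!(e.features, vec!["カスタム名詞", "ジンボチョウ"]);
    println!("{:?}", e);
}
```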

To use the user dictionary, specify the file with the -u argument.

$ echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst -u user.csv
本とカレーの街	カスタム名詞,ホントカレーノマチ
神保町	カスタム名詞,ジンボチョウ
へ	助詞,格助詞,一般,*,*,*,へ,ヘ,エ
ようこそ	感動詞,ヨーコソ,Welcome,欢迎欢迎,Benvenuto,Willkommen
。	記号,句点,*,*,*,*,。,。,。
EOS

More advanced usages

The directory docs provides descriptions of more advanced usages such as training or benchmarking.

Slack

We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.

License

Licensed under either of

at your option.

Acknowledgment

The initial version of this software was developed by LegalOn Technologies, Inc., but it is not an officially supported LegalOn Technologies product.

Contribution

See the guidelines.

References

Technical details of Vibrato are available in the following resources:

vibrato's People

Contributors

akiomik, kampersanda, vbkaisetsu


vibrato's Issues

Loading a large dictionary such as UniDic cwj 2023-02 is slow in comparison to mecab

Is your feature request related to a problem? Please describe.
Loading a large dictionary such as UniDic 2023-02 is slow in comparison to mecab.

I've downloaded the latest unidic cwj 2023-02 at https://clrd.ninjal.ac.jp/unidic/download.html#unidic_bccwj and built my own compiled vibrato dictionary using

cargo run --release -p compile -- -l unidic-cwj-202302_full/lex.csv -m unidic-cwj-202302_full/matrix.def -u unidic-cwj-202302_full/unk.def -c unidic-cwj-202302_full/char.def -o system.dic.zst

and then I tried tokenizing the example sentence from the docs

> time echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i system.dic.zst
     Running `target/release/tokenize -i system.dic.zst`
Loading the dictionary...
Ready to tokenize
本      名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と      助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー  名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の      助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街      名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町  名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,
"3,0",*,*,5174035466035712,18823
へ      助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう    形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ    助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。      補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS

________________________________________________________
Executed in   13.96 secs    fish           external
   usr time   13.09 secs    0.00 micros   13.09 secs
   sys time    0.86 secs    0.00 micros    0.86 secs

but it takes around 14 seconds to load the dictionary.

In comparison, mecab is near instant

> time echo "本とカレーの街神保町へようこそ。" | mecab --dicdir="unidic-cwj-202302_full"
本      名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と      助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー  名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の      助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街      名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町  名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,
"3,0",*,*,5174035466035712,18823
へ      助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう    形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ    助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。      補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS

________________________________________________________
Executed in   28.32 millis    fish           external
   usr time    0.00 millis    0.00 micros    0.00 millis
   sys time   31.25 millis    0.00 micros   31.25 millis

I looked at the code, and it seems like all the time is spent deserializing bincode into the DictionaryInner struct, in particular when it runs the read_common function:

    fn read_common<R>(mut rdr: R) -> Result<DictionaryInner>
    where
        R: Read,
    {
        let mut magic = [0; MODEL_MAGIC.len()];
        rdr.read_exact(&mut magic)?;
        if magic != MODEL_MAGIC {
            return Err(VibratoError::invalid_argument(
                "rdr",
                "The magic number of the input model mismatches.",
            ));
        }
        let config = common::bincode_config();
        let data = bincode::decode_from_std_read(&mut rdr, config)?;
        Ok(data)
    }

It takes a long time to complete let data = bincode::decode_from_std_read(&mut rdr, config)?; so it seems like bincode deserialization is slow.

How is mecab able to return results so quickly despite not loading everything into memory like vibrato? It seems like mecab doesn't use much memory, whereas vibrato takes 1 GB of memory to cache everything before being able to tokenize.

Describe the solution you'd like
Could we use a faster serialization framework like rkyv? According to its benchmarks, it's a lot faster than bincode.

The rkyv docs say:

It’s similar to other zero-copy deserialization frameworks such as Cap’n Proto and FlatBuffers. However, while the former have external schemas and heavily restricted data types, rkyv allows all serialized types to be defined in code and can serialize a wide variety of types that the others cannot. Additionally, rkyv is designed to have little to no overhead, and in most cases will perform exactly the same as native types.

Not sure if there's any other way to speed it up? Could we somehow parallelize deserialization?

Describe alternatives you've considered
Apparently bincode is slow for structs that use Vec and byte slices, and the recommendation is to use serde_bytes.

The features such as

pub struct UnkEntry {
    pub cate_id: u16,
    pub left_id: u16,
    pub right_id: u16,
    pub word_cost: i16,
    pub feature: String,
}
pub struct WordFeatures {
    features: Vec<String>,
}

are stored as strings; maybe they could be stored as Vec<u8> instead?
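As an illustration of the suggestion above, a common layout is to flatten a Vec&lt;String&gt; into one byte buffer plus offsets, so deserialization becomes a single bulk copy instead of one allocation per string. This is only a sketch of the idea; FlatFeatures is a made-up type, not vibrato's actual data structure.

```rust
// Illustrative alternative layout for feature strings: instead of
// Vec<String> (one heap allocation per word), store one flat UTF-8
// byte buffer plus offsets. Deserializing this is a single bulk copy.
struct FlatFeatures {
    bytes: Vec<u8>,    // all feature strings concatenated (UTF-8)
    offsets: Vec<u32>, // offsets.len() == number of entries + 1
}

impl FlatFeatures {
    fn from_strings(features: &[String]) -> Self {
        let mut bytes = Vec::new();
        let mut offsets = vec![0u32];
        for f in features {
            bytes.extend_from_slice(f.as_bytes());
            offsets.push(bytes.len() as u32);
        }
        Self { bytes, offsets }
    }

    fn get(&self, i: usize) -> &str {
        let (s, e) = (self.offsets[i] as usize, self.offsets[i + 1] as usize);
        // Safe to unwrap: only valid UTF-8 is stored above.
        std::str::from_utf8(&self.bytes[s..e]).unwrap()
    }
}

fn main() {
    let flat = FlatFeatures::from_strings(&[
        "名詞,一般".to_string(),
        "助詞,格助詞".to_string(),
    ]);
    assert_eq!(flat.get(0), "名詞,一般");
    assert_eq!(flat.get(1), "助詞,格助詞");
}
```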

Additional context
I'm using vibrato version 0.5.1

And here are the compiled dictionary sizes

> du -sh system.dic.zst
291M    system.dic.zst

> du -sh system.dic
988M    system.dic

The difference between Vibrato and Vaporetto regarding performance for general tokenization

Is your feature request related to a problem? Please describe.

I'd like to know the difference between Vibrato and Vaporetto regarding performance for general tokenization:
https://github.com/daac-tools/vibrato#fast-tokenization

Describe the solution you'd like

Could you add Vaporetto to the following image?
https://github.com/daac-tools/vibrato#fast-tokenization

Describe alternatives you've considered

Nothing

Additional context

Nothing

Add N best results

Is your feature request related to a problem? Please describe.
Mecab has the flag -N which provides the N best results

mecab --help
...
 -N, --nbest=INT                output N best results (default 1)

However, I couldn't find in the docs or in the source code how to do this with vibrato.

Describe the solution you'd like
Allow support for providing the N best results

Describe alternatives you've considered
N/A

Additional context
Vibrato 0.5.1

Take `BufRead` as arguments

First, see #31 (comment).

Here, I consider the two options:

  • Keep Read and remove the Notes section.
  • Fix the csv crate to take a BufRead argument, and use BufRead throughout all arguments.

Distribute compiled dictionaries from JumanDic

In v0.3.1, compiled dictionaries from JumanDic have not been distributed because the lexicon file is in an unexpected CSV format.
More precisely, we will get the following error message from the compile command.

Error: InvalidFormat(InvalidFormatError { arg: "lex.csv", msg: "A csv row of lexicon must have five items at least, \"\\n\"" })

We need to modify the code to compile this file.

Cannot tokenize text longer than 65535 characters

Bug description
Text longer than 65535 characters cannot be tokenized.

Steps to reproduce
The panic reproduces 100% of the time when text contains more than 65535 characters. I ran the following code:

let mut worker = tokenizer.new_worker();
worker.reset_sentence(&text);
worker.tokenize();

The following message was output:
thread 'main' panicked at 'assertion failed: input.len() <= 0xFFFF', /Users/saitoukosuke/.cargo/registry/src/github.com-1ecc6299db9ec823/vibrato-0.4.0/src/dictionary/lexicon/map/trie.rs:53:9

Expected behavior
I would like tokenization to work even for text with more than 65535 characters.
Removing the check at https://github.com/daac-tools/vibrato/blob/main/vibrato/src/dictionary/lexicon/map/trie.rs#L53 would probably make it work, but if there is a reason for the limit, the current behavior is fine as is.

Environment

  • OS: macOS Big Sur
  • Rust: 1.67.1

Add compile option for user dictionary

In v0.3.1, user dictionaries can be given in a text CSV format. However, if the dictionary file is huge, the loading time can be a problem. This issue suggests adding a feature to compile user dictionaries into a binary format. This feature will help install Neologd entries.

Keep member functions of a struct in the same file

Currently, member functions of a struct are sometimes placed in different files.
e.g., Dictionary is placed at dictionary.rs, but Dictionary::from_readers is placed at dictionary/builder.rs.

This makes finding functions difficult in development.
However, simply moving the implementation of Dictionary::from_readers into dictionary.rs enlarges the file.

Solution

Define a new structure Dictionary::Builder in dictionary/builder.rs so that the member functions of a struct are always kept in the same file.

Keep an easy description in the top-level readme

This issue proposes maintaining the command line tools related to training in one workspace and putting the description of training under that workspace (like prepare), to keep the top-level readme simple.

Why

For many light users, the primary concern would be how to get started on tokenization using Vibrato. So, I think the top-level readme should keep an easy description.

The description of 3. Training in the Basic usage section is for advanced users, because understanding it requires knowing how dictionary files are configured in MeCab.

Also, maintaining multiple workspaces increases the costs of updating dependencies in Cargo.toml.

Solution

We keep the command line tools train, dictgen, evaluate, and split in one workspace (such as named train) and write the usage in the readme under this workspace.

Benefits

  • Descriptions related to training are maintained in a single document, so only advanced users need to be directed there.
  • Fewer workspaces need to be maintained, avoiding redundant Cargo.toml update costs.

Using a memmap for the dictionary

Is your feature request related to a problem? Please describe.

It's harder to support lower-end hardware (with limited memory) particularly with bigger dictionaries.

Describe the solution you'd like

I would like the option to use a memory map referring to an uncompressed dictionary, since storage is usually cheaper than memory. The application I'm building does not need extreme performance, so I feel the I/O penalty would be acceptable. If the dictionary gets processed by Vibrato into something else, it would also be nice to be able to serialize it to a file and memmap it as well. fst offers something like that: https://docs.rs/fst/latest/fst/#example-stream-to-a-file-and-memory-map-it-for-searching

Describe alternatives you've considered

None that I'm aware of.

Additional context

None

Remove lifetime parameter from `Worker`

Currently, the Worker struct is defined as follows:

struct Worker<'a> {
    ...
}

where 'a is a lifetime parameter of the Tokenizer. By this definition, the Worker can refer to the Tokenizer automatically for every tokenization.

This definition causes a problem when creating wrappers for other programming languages that use garbage collectors (GC).
The above definition means that the Tokenizer cannot be dropped while the Worker is alive, but there is no way to impose this constraint on a GC.

To solve this problem, we need to remove the lifetime parameter and give the Worker struct to the Tokenizer for every tokenization.
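The contrast between the two API shapes can be sketched with dummy types (none of these are vibrato's actual definitions):

```rust
// Dummy stand-in for the tokenizer.
struct Tokenizer;

// Current shape: the worker borrows the tokenizer for its whole lifetime,
// a constraint a GC-based wrapper cannot express.
#[allow(dead_code)]
struct BorrowingWorker<'a> {
    tokenizer: &'a Tokenizer,
}

// Proposed shape: no lifetime parameter; the tokenizer is passed per call,
// so the two objects can be dropped or collected independently.
struct OwnedWorker {
    buffer: String,
}

impl OwnedWorker {
    fn tokenize(&mut self, _tokenizer: &Tokenizer) -> usize {
        // Placeholder logic: pretend we produce one token per character.
        self.buffer.chars().count()
    }
}

fn main() {
    let t = Tokenizer;
    let mut w = OwnedWorker { buffer: "本と".to_string() };
    assert_eq!(w.tokenize(&t), 2);
}
```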

Embed versions into models

Is your feature request related to a problem? Please describe.

Current models do not embed a version number, so there is a risk that a model from a different version will be loaded incorrectly.

Describe the solution you'd like

Embedding the version number at the head of models and verifying it.

Describe alternatives you've considered

NA

Additional context

NA

Update the benchmark results

In v0.3.1, the benchmark results in the README are outdated: Vibrato has been updated, other dictionaries have been distributed, and Lindera has been accelerated.

Support text longer than 65535 characters

Currently, Vibrato does not support texts longer than 65535 characters.
The limit is specified here:

pub const MAX_SENTENCE_LENGTH: u16 = 0xFFFF;

This limit is long enough to parse a typical sentence but should be changed to a larger value, such as 2^32-1, to increase robustness in actual operations.
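Until the limit is raised, a caller-side workaround is to feed long text in chunks of at most 65535 characters, preferably split at sentence boundaries. A minimal sketch follows; it is not part of vibrato, and the splitting heuristic (break at '。', hard-split otherwise) is an assumption:

```rust
// Split text into chunks of at most MAX characters, preferring to break
// after a sentence-ending '。'. A caller-side workaround for the current
// MAX_SENTENCE_LENGTH limit, not part of the vibrato API.
const MAX: usize = 0xFFFF;

fn chunk_text(text: &str) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut count = 0usize;
    for c in text.chars() {
        current.push(c);
        count += 1;
        if c == '。' || count == MAX {
            chunks.push(std::mem::take(&mut current));
            count = 0;
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let long = "あ".repeat(70000);
    let chunks = chunk_text(&long);
    // Every chunk stays within the tokenizer's limit.
    assert!(chunks.iter().all(|c| c.chars().count() <= MAX));
    assert_eq!(chunks.len(), 2);
}
```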

Distribute precompiled dictionaries instead of scripts that download external resources

Issue

Currently, Vibrato provides scripts to download and compile external resources. However, those scripts are dangerous to users because they may download large amounts of unintended data.

Solution

Distributing precompiled dictionaries through Assets.

Then, the main concern is licensing. KFTT, used in the current version, follows CC BY-SA 3.0, which conflicts with the licenses of IPADIC and UniDic.

One solution would be to use freely available texts such as Aozora Bunko.
