

BudouX


Standalone. Small. Language-neutral.

BudouX is the successor to Budou, the machine learning powered line break organizer tool.


It is standalone. It works with no dependency on third-party word segmenters such as the Google Cloud Natural Language API.

It is small. It takes only around 15 KB including its machine learning model. It's reasonable to use it even on the client-side.

It is language-neutral. You can train a model for any language by feeding a dataset to BudouX’s training script.

Last but not least, BudouX supports HTML inputs.

Demo

https://google.github.io/budoux

Natural languages supported by default

  • Japanese
  • Simplified Chinese
  • Traditional Chinese
  • Thai

Supported Programming languages

Python module

Install

$ pip install budoux

Usage

Library

You can get a list of phrases by feeding a sentence to the parser. The easiest way to get a parser is to load the default parser for each language.

Japanese:

import budoux
parser = budoux.load_default_japanese_parser()
print(parser.parse('今日は天気です。'))
# ['今日は', '天気です。']

Simplified Chinese:

import budoux
parser = budoux.load_default_simplified_chinese_parser()
print(parser.parse('今天是晴天。'))
# ['今天', '是', '晴天。']

Traditional Chinese:

import budoux
parser = budoux.load_default_traditional_chinese_parser()
print(parser.parse('今天是晴天。'))
# ['今天', '是', '晴天。']

Thai:

import budoux
parser = budoux.load_default_thai_parser()
print(parser.parse('วันนี้อากาศดี'))
# ['วัน', 'นี้', 'อากาศ', 'ดี']
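Once you have a phrase list like the ones above, a layout step can break lines only at phrase boundaries. The following is a minimal sketch of that idea, not part of BudouX: `wrap_phrases` is a hypothetical helper, and it assumes every character occupies one column, which real text rendering does not guarantee.

```python
def wrap_phrases(phrases, width):
    """Greedily pack phrases into lines of at most `width` characters.

    A phrase is never split, so a phrase longer than `width` gets its own line.
    """
    lines = []
    current = ''
    for phrase in phrases:
        if current and len(current) + len(phrase) > width:
            # The phrase does not fit: close the current line first.
            lines.append(current)
            current = phrase
        else:
            current += phrase
    if current:
        lines.append(current)
    return lines


# Phrase list taken from the Japanese example above.
phrases = ['今日は', '天気です。']
for line in wrap_phrases(phrases, width=5):
    print(line)
```

In a browser, the same effect comes from CSS line breaking rather than from manual packing like this.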

You can also translate an HTML string to wrap phrases with non-breaking markup. The default parser uses zero-width space (U+200B) to separate phrases.

print(parser.translate_html_string('今日は<b>とても天気</b>です。'))
# <span style="word-break: keep-all; overflow-wrap: anywhere;">今日は<b>\u200bとても\u200b天気</b>です。</span>

Please note that separators are denoted as \u200b in the example above for illustrative purposes, but the actual output is an invisible string as it's a zero-width space.
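Because the separator is invisible, it can help to make it visible while debugging. The sketch below uses only the standard library; `reveal_separators` is a hypothetical helper name, and the sample string is built by hand rather than produced by BudouX.

```python
ZWSP = '\u200b'  # zero-width space, the separator described above


def reveal_separators(text):
    """Replace each zero-width space with the visible text '\\u200b'."""
    return text.replace(ZWSP, '\\u200b')


# Hand-built sample that mimics the inner part of the output shown above.
sample = '今日は<b>' + ZWSP + 'とても' + ZWSP + '天気</b>です。'
print(reveal_separators(sample))
print('separators found:', sample.count(ZWSP))
```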

If you have a custom model, you can use it as follows.

import json
import budoux

with open('/path/to/your/model.json') as f:
  model = json.load(f)
parser = budoux.Parser(model)

A model file for BudouX is a JSON file that contains pairs of a feature and its score extracted by machine learning training. Each score represents the significance of the feature in determining whether to break the sentence at a specific point.
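To get a feel for what such a file contains, you can load it as ordinary JSON and list the features with the largest scores. The sketch below is an illustration under assumptions: the toy model uses a nested `{feature group: {feature value: score}}` layout, and the group names and scores are made up rather than taken from a shipped model. `top_features` is a hypothetical helper.

```python
import json

# Toy model for illustration only; group names and scores are assumptions.
TOY_MODEL_JSON = '{"UW4": {"a": 1001, "b": -20}, "UW3": {"z": 300}}'


def top_features(model, n=3):
    """Return the n (group, feature, score) triples with the largest |score|."""
    rows = [
        (group, feature, score)
        for group, features in model.items()
        for feature, score in features.items()
    ]
    rows.sort(key=lambda row: abs(row[2]), reverse=True)
    return rows[:n]


model = json.loads(TOY_MODEL_JSON)
for group, feature, score in top_features(model):
    print(group, repr(feature), score)
```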

For more details of the JavaScript model, please refer to the JavaScript module README.

CLI

You can also format inputs on your terminal with the budoux command.

$ budoux 本日は晴天です。 # default: japanese
本日は
晴天です。

$ budoux -l ja 本日は晴天です。
本日は
晴天です。

$ budoux -l zh-hans 今天天气晴朗。
今天
天气
晴朗。

$ budoux -l zh-hant 今天天氣晴朗。
今天
天氣
晴朗。

$ budoux -l th วันนี้อากาศดี
วัน
นี้
อากาศ
ดี

$ echo $'本日は晴天です。\n明日は曇りでしょう。' | budoux
本日は
晴天です。
---
明日は
曇りでしょう。

$ budoux 本日は晴天です。 -H
<span style="word-break: keep-all; overflow-wrap: anywhere;">本日は\u200b晴天です。</span>

Please note that separators are denoted as \u200b in the example above for illustrative purposes, but the actual output is an invisible string as it's a zero-width space.
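The text-mode layout shown above (one phrase per line, with a delimiter line between input sentences) can be reproduced from phrase lists in a few lines. This is a sketch of the output format only, not the CLI's actual code; `format_text_mode` is a hypothetical name and the phrase lists are copied from the example output.

```python
def format_text_mode(sentences, delim='---'):
    """Join phrases with newlines and sentences with a delimiter line."""
    blocks = ['\n'.join(phrases) for phrases in sentences]
    return ('\n' + delim + '\n').join(blocks)


# Phrase lists copied from the CLI example above.
sentences = [
    ['本日は', '晴天です。'],
    ['明日は', '曇りでしょう。'],
]
print(format_text_mode(sentences))
```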

If you want to see help, run budoux -h.

$ budoux -h
usage: budoux [-h] [-H] [-m JSON | -l LANG] [-d STR] [-V] [TXT]

BudouX is the successor to Budou,
the machine learning powered line break organizer tool.

positional arguments:
  TXT                    text (default: None)

optional arguments:
  -h, --help             show this help message and exit
  -H, --html             HTML mode (default: False)
  -m JSON, --model JSON  custom model file path (default: /path/to/budoux/models/ja.json)
  -l LANG, --lang LANG   language of custom model (default: None)
  -d STR, --delim STR    output delimiter in TEXT mode (default: ---)
  -V, --version          show program's version number and exit

supported languages of `-l`, `--lang`:
- ja
- zh-hans
- zh-hant
- th

Caveat

BudouX supports HTML inputs and outputs HTML strings with markup that wraps phrases, but it's not meant to be used as an HTML sanitizer. BudouX doesn't sanitize any inputs. Malicious HTML inputs yield malicious HTML outputs. Please use it with an appropriate sanitizer library if you don't trust the input.
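If the untrusted input is meant to be plain text rather than HTML, one conservative option is to segment it as text and escape each phrase before building markup yourself. The sketch below uses only the standard library; `phrases_to_escaped_html` is a hypothetical helper, the style string is copied from the output shown earlier, and escaping is not a substitute for a real sanitizer when the input must be allowed to contain markup.

```python
import html

ZWSP = '\u200b'
STYLE = 'word-break: keep-all; overflow-wrap: anywhere;'


def phrases_to_escaped_html(phrases):
    """Escape each phrase, join with zero-width spaces, and wrap in a span."""
    escaped = [html.escape(phrase) for phrase in phrases]
    return '<span style="{}">{}</span>'.format(STYLE, ZWSP.join(escaped))


# A phrase list as it might come from parser.parse() on hostile text.
print(phrases_to_escaped_html(['<script>alert(1)</script>', '今日は']))
```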

Background

English text has many clues, like spacing and hyphenation, that enable beautiful and readable line breaks. However, some CJK languages lack these clues, and so are notoriously more difficult to process. Without a more careful approach, line breaks can fall anywhere, often in the middle of a word or a phrase. This is a long-standing issue in typography on the Web, which results in a degradation of readability.

Budou was proposed as a solution to this problem in 2016. It automatically translates CJK sentences into HTML with lexical phrases wrapped in non-breaking markup, so as to semantically control line breaks. Budou has solved this problem to some extent, but it still has some problems integrating with modern web production workflow.

The biggest barrier to applying Budou to a website is its dependency on third-party word segmenters. A word segmenter is usually a large program that is infeasible to download for every web page request. Making a request to a cloud-based word segmentation service for every sentence is also undesirable, considering the speed and cost. That’s why we need a standalone line break organizer tool equipped with its own segmentation engine, small enough to be bundled in client-side JavaScript code.

BudouX is the successor to Budou, designed to be integrated with your website with no hassle.

How it works

BudouX uses the AdaBoost algorithm to segment a sentence into phrases, treating the task as a binary classification problem: for each position between two adjacent characters, it predicts whether to break there or not. It uses features such as the characters around the break point, their Unicode blocks, and combinations of them to make a prediction. The trained machine learning model, which is encoded as a JSON file, stores pairs of a feature and its significance score. The BudouX parser takes a model file to construct a segmenter and translates input sentences into lists of phrases.
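The per-position decision can be illustrated with a deliberately simplified scorer. This sketch is not the BudouX parser: it looks at only two feature groups (assumed here to mean "character before the break point" for `UW3` and "character after the break point" for `UW4`), and thresholding at half the total weight is an assumption of this sketch. `toy_segment` and the toy model are hypothetical.

```python
def toy_segment(sentence, model):
    """Split `sentence` wherever the summed feature scores are positive."""
    if not sentence:
        return []
    # Start from minus half of the total weight, so that a break needs
    # enough positive evidence to push the score above zero.
    total = sum(sum(group.values()) for group in model.values())
    base_score = -0.5 * total
    chunks = [sentence[0]]
    for i in range(1, len(sentence)):
        score = base_score
        score += model.get('UW3', {}).get(sentence[i - 1], 0)  # char before
        score += model.get('UW4', {}).get(sentence[i], 0)      # char after
        if score > 0:
            chunks.append(sentence[i])   # break before sentence[i]
        else:
            chunks[-1] += sentence[i]    # keep extending the current phrase
    return chunks


# Toy model: strong evidence for a break right before the letter 'a'.
toy_model = {'UW4': {'a': 1001}}
print(toy_segment('xyzabcabc', toy_model))
```

A real model combines many more feature groups, so this only shows the shape of the computation.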

Building a custom model

You can build your own custom model for any language by preparing training data in the target language. A training dataset is a large text file of sentences in which phrases are delimited by the separator symbol "▁" (U+2581), as shown below.

私は▁遅刻魔で、▁待ち合わせに▁いつも▁遅刻してしまいます。
メールで▁待ち合わせ▁相手に▁一言、▁「ごめんね」と▁謝れば▁どうにか▁なると▁思っていました。
海外では▁ケータイを▁持っていない。
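Each line in this format implies one break-or-not label for every position between adjacent characters, which is what makes the task a binary classification problem. The sketch below shows that reading of a line; it is not the project's `encode_data.py`, and `boundary_labels` is a hypothetical helper.

```python
SEP = '\u2581'  # the "▁" separator used in the training data


def boundary_labels(line):
    """Return (text, labels) where labels[i] is 1 if a break falls
    between text[i] and text[i + 1], and 0 otherwise."""
    phrases = [p for p in line.strip().split(SEP) if p]
    text = ''.join(phrases)
    breaks = set()
    position = 0
    for phrase in phrases[:-1]:
        position += len(phrase)
        breaks.add(position)
    labels = [1 if i in breaks else 0 for i in range(1, len(text))]
    return text, labels


text, labels = boundary_labels('海外では▁ケータイを▁持っていない。')
print(text)
print(labels)
```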

Assuming the text file is saved as mysource.txt, you can build your own custom model by running the following commands.

$ pip install .[dev]
$ python scripts/encode_data.py mysource.txt -o encoded_data.txt
$ python scripts/train.py encoded_data.txt -o weights.txt
$ python scripts/build_model.py weights.txt -o mymodel.json

Please note that train.py takes time to complete depending on your computer resources. The good news is that the training algorithm is an anytime algorithm, so you can get a weights file even if you interrupt the execution. You can build a valid model file by passing that weights file to build_model.py even in such a case.

Constructing a training dataset from the KNBC corpus for Japanese

The default model for Japanese (budoux/models/ja.json) is built using the KNBC corpus. You can create a training dataset, which we name source_knbc.txt below for example, from the corpus by running the following commands:

$ curl -o knbc.tar.bz2 https://nlp.ist.i.kyoto-u.ac.jp/kuntt/KNBC_v1.0_090925_utf8.tar.bz2
$ tar -xf knbc.tar.bz2  # outputs KNBC_v1.0_090925_utf8 directory
$ python scripts/prepare_knbc.py KNBC_v1.0_090925_utf8 -o source_knbc.txt

Author

Shuhei Iitsuka

Disclaimer

This is not an officially supported Google product.

budoux's People

Contributors

amitmarkel, dependabot[bot], eggplants, harukaichii, hiro0218, junseinagao, kojiishi, ryu22e, sassy, step-security-bot, tamanyan, tushuhei


budoux's Issues

[quality] あなたの意図したとおりに情報を伝えることができます。

Input: あなたの意図したとおりに情報を伝えることができます。
Actual:
あなたの/意図したと/おりに/情報を/伝える/ことができます。
Expected:
あなたの/意図したとおりに/情報を/伝える/ことができます。
あなたの/意図した/とおりに/情報を/伝える/ことが/できます。

Found in Blink web_tests.

[quality] まとめる

Input: 要点をまとめる必要がある。
Expected: 要点を/まとめる/必要が/ある。
Actual: 要点を/まと/める/必要が/ある。

Unopened HTML tag causes exception in budoux 0.6

With budoux 0.6.0

budoux --html "foo</p>"

Traceback (most recent call last):
  File "/home/johnc/.local/bin/budoux", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/main.py", line 187, in main
    print(_main(test))
          ^^^^^^^^^^^
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/main.py", line 171, in _main
    res = parser.translate_html_string(inputs_html)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/parser.py", line 102, in translate_html_string
    return resolve(chunks, html)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/html_processor.py", line 124, in resolve
    resolver.feed(html)
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/html/parser.py", line 172, in goahead
    k = self.parse_endtag(i)
        ^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/html/parser.py", line 413, in parse_endtag
    self.handle_endtag(elem)
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/html_processor.py", line 84, in handle_endtag
    self.to_skip = self.element_stack.get_nowait()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/queue.py", line 199, in get_nowait
    return self.get(block=False)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/queue.py", line 168, in get
    raise Empty
_queue.Empty
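The traceback shows an element stack being popped for an end tag that was never opened. One defensive pattern is to check the stack before popping. The sketch below illustrates that pattern with the standard library only; it is not the fix adopted by the project, `TolerantTagTracker` is a hypothetical class, and it ignores void elements such as `<br>` for brevity.

```python
from html.parser import HTMLParser


class TolerantTagTracker(HTMLParser):
    """Track open tags and record unmatched end tags instead of raising."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.unmatched = []   # end tags seen without a matching start tag

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # Nothing to pop: remember the stray tag and carry on.
            self.unmatched.append(tag)


tracker = TolerantTagTracker()
tracker.feed('foo</p>')  # the input from the report above
print(tracker.unmatched)
```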

[java] `HTMLProcessor.getText()` collapses whitespaces

HTMLProcessor.getText() calls:

    return Jsoup.parseBodyFragment(html).text();

It looks like this collapses whitespaces. Example:

    String html = " H      e ";
    String result = HTMLProcessor.getText(html);

The result becomes H e, collapsing consecutive spaces into one space, and leading and trailing spaces to none.

/cc @tushuhei

Numbers become <a href="tel:">links even if <meta name="format-detection" content="telephone=no"> is set on iOS

By setting <meta name="format-detection" content="telephone=no">, numbers such as phone numbers become text on iOS.

However, the "telephone=no" setting does not work for text in <budoux>, and numbers such as phone numbers become <a href="tel:">.

Is this an iOS issue or a budoux issue?

Sample:
https://codepen.io/rhsk/full/RwEVLBa

<h2>normal</h2>
<p>090-1234-5678</p>
<p>テキストテキストテキストテキストテキスト 090-1234-5678 テキストテキストテキストテキスト</p>

<h2>budoux</h2>
<p><budoux-ja>090-1234-5678</budoux-ja></p>
<p><budoux-ja>テキストテキストテキストテキストテキスト 090-1234-5678 テキストテキストテキストテキスト</budoux-ja></p>

Use overflow-wrap: anywhere; instead of overflow-wrap: break-word;

Motivation

If display: flex; is applied to the parent element, the overflow-wrap: break-word; becomes ineffective, resulting in text overflow.

overflow-wrap: anywhere; resolves the issue. It's supported on modern web browsers.

https://developer.mozilla.org/en-US/docs/Web/CSS/overflow-wrap


Key points

The primary difference between overflow-wrap: anywhere; and overflow-wrap: break-word is reflected in the way soft wrap opportunities introduced by word-break are handled when calculating min-content intrinsic sizes.

https://developer.mozilla.org/en-US/docs/Web/CSS/overflow-wrap

anywhere
To prevent overflow, an otherwise unbreakable string of characters — like a long word or URL — may be broken at any point if there are no otherwise-acceptable break points in the line. No hyphenation character is inserted at the break point. Soft wrap opportunities introduced by the word break are considered when calculating min-content intrinsic sizes.

break-word
The same as the anywhere value, with normally unbreakable words allowed to be broken at arbitrary points if there are no otherwise acceptable break points in the line, but soft wrap opportunities introduced by the word break are NOT considered when calculating min-content intrinsic sizes.

Steps to reproduce

Go to the example https://codepen.io/tamanyan/pen/KKyyxMj

<div class="flex">
  <div>
    <h3>5. word-break: keep-all; + overflow-wrap: break-word; + &lt;wbr&gt;  + flex</h3>
    <p class="keep-all-break-word box">
      グレートブリテン<wbr>および<wbr>北アイルランド連合王国という<wbr>言葉は<wbr>本当に<wbr>長い言葉<wbr>ですね
    </p>
  </div>
</div>

.flex {
  display: flex;
}

.keep-all-break-word {
  word-break: keep-all;
  overflow-wrap: break-word;
}

.box {
  padding: 10px;
  border: 1px solid;
  font-size: 40px;
}

Proposal

Use overflow-wrap: anywhere; instead of overflow-wrap: break-word;

Issue with custom model

Description

Hi there,
First, thanks for the lib. The results are impressive for such a small footprint 😄

The results were not exactly what I wanted for Japanese tokenization, so I decided to train my own model, which was quite simple and straightforward. Sadly, after importing the generated model in JavaScript, it doesn't work.

import { Parser, loadDefaultJapaneseParser } from 'budoux'
import model from './mymodel.json'

// obviously the following works
const parser = loadDefaultJapaneseParser()
console.log(parser.parse('今日は天気です。'))

// but this doesn't
const customParser = new Parser(model)
console.log(customParser.parse('今日は天気です。'))

Uncaught TypeError: this.model.values is not a function or its return value is not iterable
at Parser.parse (parser.js:120:47)

Discussion: does it have any negative impact for SEO?

I have a blog site focused mainly on interviews, and budoux is a great tool for its readability. However, I wonder whether applying it to every article body has a negative impact on SEO, because the text is divided into many parts by span tags. Maybe this is a bit out of scope for budoux development, but users might face this issue when using it, so I posted it here. I would like to hear your opinions.

Thanks.

[Java] Java version emits close tag for self-closing tags

Input: <img>abcdef
Expected: <img>abcdef
Actual: <img></img>abcdef

Unlike Python's HTMLParser, the Java version uses Jsoup.parseBodyFragment, which implements the HTML parsing algorithm, so we don't have to worry about issues like #355.

But the serialization code should be aware of self-closing tags.

[quality] あのイーハトーヴォのすきとおった風、夏でも底に冷たさをもつ青いそら、うつくしい森で飾られたモリーオ市、郊外のぎらぎらひかる草の波。

Input:
あのイーハトーヴォのすきとおった風、夏でも底に冷たさをもつ青いそら、うつくしい森で飾られたモリーオ市、郊外のぎらぎらひかる草の波。
Actual:
あの/イーハトーヴォの/すきと/おった/風、/夏でも/底に/冷たさを/もつ青い/そら、/うつくしい/森で/飾られた/モリーオ市、/郊外の/ぎらぎら/ひかる/草の/波。
Expected:
あの/イーハトーヴォの/すきとおった/風、/夏でも/底に/冷たさを/もつ/青い/そら、/うつくしい/森で/飾られた/モリーオ市、/郊外の/ぎらぎら/ひかる/草の/波。

Deduplicate separators when processing HTML

Demo
https://codepen.io/tushuhei/pen/VwqMywj

Setup

<p>xyz<wbr>abcabc</p>

const parser = new HTMLProcessingParser({
  UW4: {a: 1001}, // means "should separate right before 'a'".
});
const paragraph = document.querySelector('p');
parser.applyElement(paragraph);
console.log(paragraph.innerHTML);

Expected

<p>xyz<wbr>abc<wbr>abc</p>

Actual

<p>xyz<wbr><wbr>abc<wbr>abc</p>

We may want to remove duplicated separators in case we need to apply BudouX to the same element multiple times (e.g. Web Components that reuse their Light DOM).
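As a rough illustration of the expected result, duplicated separators can be collapsed at the string level. This Python sketch only demonstrates the desired output on the example above; it is not a proposal for html_processor.ts, which works on the DOM, and `dedupe_wbr` is a hypothetical helper.

```python
import re


def dedupe_wbr(html):
    """Collapse runs of two or more adjacent <wbr> tags into one."""
    return re.sub(r'(?:<wbr>){2,}', '<wbr>', html)


actual = '<p>xyz<wbr><wbr>abc<wbr>abc</p>'
print(dedupe_wbr(actual))
```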

@kojiishi Could you take a look at what changes should be applied to html_processor.ts?

`mypy` error does not stop GitHub Actions

As shown in https://github.com/google/budoux/runs/6555048469, our "Style Check" GitHub Action is not interrupted even if mypy finds an error. Any mypy error should fail the job so that contributors are alerted to the type error.

Run sasanquaneuf/mypy-github-action@a0c442aa252655d7736ce6696e06227ccdd62870
/opt/hostedtoolcache/Python/3.9.12/x64/bin/mypy .
build/lib/budoux/__init__.py: error: Duplicate module named "budoux" (also at
"./budoux/__init__.py")
build/lib/budoux/__init__.py: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#mapping-file-paths-to-modules for more info
build/lib/budoux/__init__.py: note: Common resolutions include: a) using `--exclude` to avoid checking one of them, b) adding `__init__.py` somewhere, c) using `--explicit-package-bases` or adjusting MYPYPATH
Found 1 error in 1 file (errors prevented further checking)

`javascript/data/models/zh-hans` missing

npm test

results in:

> [email protected] test
> ts-node node_modules/jasmine/bin/jasmine tests/*.ts

src/parser.ts:20:36 - error TS2307: Cannot find module './data/models/zh-hans' or its corresponding type declarations.

20 import {model as zhHansModel} from './data/models/zh-hans';
                                      ~~~~~~~~~~~~~~~~~~~~~~~

[quality] "のみ"

input: ここから先はチケットを購入されたお客様のみ入ることができます。
actual: ここから/先は/チケットを/購入された/お客様の/み入る/ことができます。
expected: ここから/先は/チケットを/購入された/お客様のみ/入ることが/できます。

input: 基本ギアパワーのみ有効だ
actual: 基本ギアパワーの/み有効だ
expected: 基本ギアパワーのみ/有効だ

[quality] いよいよ

Input: いよいよはじまる
Expected: いよいよ/はじまる
Actual: いよいよは/じまる

[quality] お問い合わせ​

input: お気軽にお問い合わせください。
actual: お気軽に​お問い​/合わせください。
expected: お気軽に​/​お問い合わせ​/ください。

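Kinsoku shori (禁則処理)

I've noticed a few examples where kinsoku shori (禁則処理, the Japanese rules that prohibit certain characters at the start or end of a line) does not work properly in BudouX, although they are rare. I suspect this is because such cases are not covered by the training data, possibly among other causes.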

I used the latest main branch https://github.com/google/budoux/tree/cb21dadb92fce3ee21157457539e061d9f04d99a

Input:
Adobe Illustratorとは?デザイン・レイアウトの決定版
Actual:
Adobe Illustratorとは/?デザイン・レイアウトの/決定版
Expected:
Adobe Illustratorとは?/デザイン・レイアウトの/決定版

Input:
『バクマン。』に影響を受け、漫画家を目指す
Actual:
『バクマン。/』に/影響を/受け、/漫画家を/目指す
Expected:
『バクマン。』に/影響を/受け、/漫画家を/目指す

Input:
[動画で見る]魚眼レンズで撮影したような動画を作成する方法
Actual:
[動画で/見る/]魚眼レンズで/撮影したような/動画を/作成する/方法
Expected:
[動画で/見る]/魚眼レンズで/撮影したような/動画を/作成する/方法

Input:
電子サインの法的な効力は?
Actual:
電子サインの/法的な/効力は/?
Expected:
電子サインの/法的な/効力は?
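One possible workaround, independent of the model, is a post-processing pass that moves prohibited line-start characters onto the end of the previous phrase. The sketch below is an assumption-laden illustration: `apply_kinsoku` is a hypothetical helper, the character set is a small hand-picked sample rather than a complete rule set, and for the 『バクマン。』 case it produces a split after 』 instead of the grouping given as expected above.

```python
# Small, hand-picked sample of characters that should not start a line.
NO_LINE_START = set('?？!！』」）)]］。、')


def apply_kinsoku(phrases):
    """Move leading prohibited characters to the end of the previous phrase."""
    result = []
    for phrase in phrases:
        n = 0
        while n < len(phrase) and phrase[n] in NO_LINE_START:
            n += 1
        if result and n > 0:
            result[-1] += phrase[:n]
            phrase = phrase[n:]
        if phrase:
            result.append(phrase)
    return result


# "Actual" segmentation from the first report above.
print(apply_kinsoku(['Adobe Illustratorとは', '?デザイン・レイアウトの', '決定版']))
```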

Consider using DocumentFragment

Have you considered directly building a DocumentFragment instead of returning strings from your parser?
This would allow you to avoid unnecessary serialization/parsing/sanitization steps.

Something like this:

// in dom.ts
function parseFromString(html: string): Node {
  const template = document.createElement('template');
  template.innerHTML = html;
  return template.content;
}

// parser.ts
function translateHTMLString(html: string): Node {
  if (html === '') return new DocumentFragment();
  const fragment = parseFromString(html);
  if (Parser.hasChildTextNode(fragment)) {
    const wrapper = document.createElement('span');
    wrapper.append(...fragment.childNodes);
    fragment.appendChild(wrapper);
  }
  this.applyElement(fragment.firstChild as HTMLElement);
  return fragment;
}

// in budoux-base.ts
sync() {
  const translated = this.parser.translateHTMLString(this.innerHTML);
  this.shadow.textContent = '';
  this.shadow.appendChild(translated);
}

You can even avoid having to parse anything at all if you clone the existing nodes instead of grabbing this.innerHTML.

// in budoux-base.ts
sync() {
  let translated: HTMLElement;
  if (Parser.hasChildTextNode(this)) {
    translated = document.createElement('span');
    // NodeList has no map(), so convert it to an array while cloning.
    translated.append(...Array.from(this.childNodes, node => node.cloneNode(true)));
  } else {
    translated = this.firstElementChild!.cloneNode(true) as HTMLElement;
  }

  this.parser.applyElement(translated);

  this.shadow.textContent = '';
  this.shadow.appendChild(translated);
}

This is also likely to be more performant than parsing and serializing the tree multiple times.

Originally posted by @engelsdamien in google/safevalues#256 (comment)

[configuration] Usage on browser's web worker

The module is not usable in a web worker without significant patching. IMO, this can be solved in one of the following ways:

  1. Let the user dynamically insert the needed window.
  2. Fall back to jsdom if window is undefined.
  3. Make the Parser class more independent from HTML/DOM, since I think the main functionality is still the "parse" function. I don't know how feasible this is, and I may be wrong here.
