
Budou 🍇

Budou is in maintenance mode. The development team is focusing on developing its successor, BudouX.

English text has many clues, like spacing and hyphenation, that enable beautiful and legible line breaks. Some CJK languages lack these clues, and so are notoriously more difficult to process. Without a more careful approach, breaks can occur randomly and usually in the middle of a word. This is a long-standing issue with typography on the web and results in a degradation of readability.

Budou automatically translates CJK sentences into HTML with lexical chunks wrapped in non-breaking markup, so as to semantically control line breaks. Budou uses word segmenters to analyze input sentences. It can also concatenate proper nouns to produce meaningful chunks utilizing part-of-speech (POS) tagging and other syntactic information. Processed chunks are wrapped with the SPAN tag. These semantic units will no longer be split at the end of a line if given a CSS display property set to inline-block.

Installation

The package is listed in the Python Package Index (PyPI), so you can install it with pip:

$ pip install budou

Output

Budou outputs an HTML snippet wrapping chunks with span tags:

<span><span class="ww">常に</span><span class="ww">最新、</span>
<span class="ww">最高の</span><span class="ww">モバイル。</span></span>

Semantic chunks in the output HTML will not be split at the end of a line once each span tag is styled with display: inline-block in CSS.

.ww {
  display: inline-block;
}

By using the output HTML from Budou and the CSS above, sentences on your webpage will be rendered with legible line breaks:

https://raw.githubusercontent.com/wiki/google/budou/images/nexus_example.jpeg

Using as a command-line app

You can process your text by running the budou command:

$ budou 渋谷のカレーを食べに行く。

The output is:

<span><span class="ww">渋谷の</span><span class="ww">カレーを</span>
<span class="ww">食べに</span><span class="ww">行く。</span></span>

You can also configure the command with optional parameters. For example, you can change the backend segmenter to MeCab and change the class name to wordwrap by running:

$ budou 渋谷のカレーを食べに行く。 --segmenter=mecab --classname=wordwrap

The output is:

<span><span class="wordwrap">渋谷の</span><span class="wordwrap">カレーを</span>
<span class="wordwrap">食べに</span><span class="wordwrap">行く。</span></span>

Run the help command budou -h to see other available options.

Using programmatically

You can use the budou.parse method in your Python scripts.

import budou
results = budou.parse('渋谷のカレーを食べに行く。')
print(results['html_code'])
# <span><span class="ww">渋谷の</span><span class="ww">カレーを</span>
# <span class="ww">食べに</span><span class="ww">行く。</span></span>
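To illustrate the shape of the html_code output, here is a minimal sketch that rebuilds the same structure from a list of already-segmented chunks (wrap_chunks is a hypothetical helper for illustration only, not part of Budou's API):

```python
import html

def wrap_chunks(chunks, classname='ww'):
    """Wrap pre-segmented chunks in span tags, mimicking Budou's output shape."""
    spans = ''.join(
        '<span class="{}">{}</span>'.format(classname, html.escape(chunk))
        for chunk in chunks)
    # Budou wraps the whole sentence in an outer span as well.
    return '<span>{}</span>'.format(spans)

print(wrap_chunks(['渋谷の', 'カレーを', '食べに', '行く。']))
# <span><span class="ww">渋谷の</span><span class="ww">カレーを</span><span class="ww">食べに</span><span class="ww">行く。</span></span>
```

Budou itself handles segmentation and HTML sanitization; this sketch only shows how the chunks map onto the markup.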

You can also make a parser instance to reuse the segmenter backend with the same configuration. If you want to integrate Budou into your web development framework in the form of a custom filter or build process, this would be the way to go.

import budou
parser = budou.get_parser('mecab')
results = parser.parse('渋谷のカレーを食べに行く。')
print(results['html_code'])
# <span><span class="ww">渋谷の</span><span class="ww">カレーを</span>
# <span class="ww">食べに</span><span class="ww">行く。</span></span>

for chunk in results['chunks']:
  print(chunk.word, chunk.pos)
# 渋谷の 名詞
# カレーを 名詞
# 食べに 動詞
# 行く。 動詞
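As a sketch of the custom-filter integration mentioned above (the factory name and the commented registration line are illustrative assumptions, not part of Budou's API), you could wrap a reusable parser in a template filter:

```python
def make_budou_filter(parser):
    """Build a template filter from any parser object exposing
    parse(text) -> {'html_code': ...}, as Budou parsers do."""
    def budou_filter(text):
        return parser.parse(text)['html_code']
    return budou_filter

# Hypothetical registration in a Flask app, reusing one parser across requests:
# app.jinja_env.filters['budou'] = make_budou_filter(budou.get_parser('mecab'))
```

Because the parser instance is created once and captured by the closure, the segmenter backend and its cache are shared across all template renderings.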

(deprecated) authenticate method

authenticate, the method used to create a parser in previous releases, is now deprecated. It is now a wrapper around the get_parser method that returns a parser with the Google Cloud Natural Language API segmenter backend. The method is still available, but it may be removed in a future release.

import budou
parser = budou.authenticate('/path/to/credentials.json')

# This is equivalent to:
parser = budou.get_parser(
    'nlapi', credentials_path='/path/to/credentials.json')

Available segmenter backends

You can choose different segmenter backends depending on the needs of your environment. Currently, the segmenters below are supported.

Name                               Identifier     Supported Languages
Google Cloud Natural Language API  nlapi          Chinese, Japanese, Korean
MeCab                              mecab          Japanese
TinySegmenter                      tinysegmenter  Japanese

Specify the segmenter when you run the budou command or load a parser. For example, you can run the budou command with the MeCab segmenter by passing the --segmenter=mecab parameter:

$ budou 今日も元気です --segmenter=mecab

You can pass the segmenter parameter when you load a parser:

import budou
parser = budou.get_parser('mecab')
parser.parse('今日も元気です')

If no segmenter is specified, the Google Cloud Natural Language API is used as the default.

Google Cloud Natural Language API Segmenter

The Google Cloud Natural Language API (https://cloud.google.com/natural-language/) (NL API) analyzes input sentences using machine learning technology. The API can extract not only syntax but also entities included in the sentence, which can be used for better quality segmentation (see more in the Entity mode section). Since this is a simple REST API, you don't need to maintain a dictionary, and you can support multiple languages from a single source.

Supported languages

  • Simplified Chinese (zh)
  • Traditional Chinese (zh-Hant)
  • Japanese (ja)
  • Korean (ko)

For those considering using Budou for Korean sentences, please refer to the Korean support section.

Authentication

The NL API requires authentication before use. First, create a Google Cloud Platform project and enable the Cloud Natural Language API. Billing also needs to be enabled for the project. Then, download a credentials file for a service account by accessing the Google Cloud Console and navigating through "API & Services" > "Credentials" > "Create credentials" > "Service account key" > "JSON".

Budou will handle authentication once the path to the credentials file is set in the GOOGLE_APPLICATION_CREDENTIALS environment variable.

$ export GOOGLE_APPLICATION_CREDENTIALS='/path/to/credentials.json'

You can also pass the path to the credentials file when you initialize the parser.

parser = budou.get_parser(
    'nlapi', credentials_path='/path/to/credentials.json')

The NL API segmenter uses Syntax Analysis and incurs costs according to monthly usage. The NL API has a free quota so you can start testing the feature without charge. Please refer to https://cloud.google.com/natural-language/pricing for more detailed pricing information.

Caching system

Parsers using the NL API segmenter cache responses from the API in order to prevent unnecessary requests to the API and to make processing faster. If you want to force-refresh the cache, set use_cache to False.

parser = budou.get_parser(segmenter='nlapi', use_cache=False)
result = parser.parse('明日は晴れるかな')

In the Google App Engine Python 2.7 Standard Environment, Budou tries to use the memcache service to cache output efficiently across instances. In other environments, Budou creates a cache file in the Python pickle format in your file system.

Entity mode

The default parser only uses results from Syntactic Analysis for parsing, but you can also utilize results from Entity Analysis by specifying use_entity=True. Entity Analysis will improve the accuracy of parsing for some phrases, especially proper nouns, so it is recommended if your target sentences include names of individual people, places, organizations, and so on.

Please note that Entity Analysis will result in additional pricing because it requires additional requests to the NL API. For more details about API pricing, please refer to https://cloud.google.com/natural-language/pricing.

import budou
# Without Entity mode (default)
result = budou.parse('六本木ヒルズでご飯を食べます。', use_entity=False)
print(result['html_code'])
# <span class="ww">六本木</span><span class="ww">ヒルズで</span>
# <span class="ww">ご飯を</span><span class="ww">食べます。</span>

# With Entity mode
result = budou.parse('六本木ヒルズでご飯を食べます。', use_entity=True)
print(result['html_code'])
# <span class="ww">六本木ヒルズで</span>
# <span class="ww">ご飯を</span><span class="ww">食べます。</span>

MeCab Segmenter

MeCab (https://github.com/taku910/mecab) is an open source text segmentation library for the Japanese language. Unlike the Google Cloud Natural Language API segmenter, the MeCab segmenter does not require any billed API calls, so you can process sentences for free and without an internet connection. You can also customize the dictionary by building your own.

Supported languages

  • Japanese

Installation

You need to have MeCab installed to use the MeCab segmenter in Budou. You can install MeCab with an IPA dictionary by running

$ make install-mecab

in the project's home directory after cloning this repository.

TinySegmenter-based Segmenter

TinySegmenter (http://chasen.org/~taku/software/TinySegmenter/) is a compact Japanese tokenizer created by Taku Kudo (© 2008). It tokenizes sentences by matching against a combination of patterns carefully designed using machine learning, which means you can use this backend without any additional setup!

Supported languages

  • Japanese

Korean support

Korean has spaces between chunks, so you can perform line breaking simply by putting word-break: keep-all in your CSS. We recommend that you use this technique instead of using Budou.
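For example, a rule like the following (the class name is illustrative) keeps Korean words unbroken without any preprocessing:

.korean-text {
  word-break: keep-all;
}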

Use cases

Budou is designed to be used mostly for eye-catching sentences, such as titles and headings, on the assumption that split chunks would stand out negatively at larger font sizes.

Accessibility

Some screen reader software packages read Budou's wrapped chunks one by one. This may degrade the user experience for those who need audio support. You can attach any attribute to the output chunks to enhance accessibility. For example, you can make screen readers read undivided sentences by combining the aria-describedby and aria-label attributes in the output.

<p id="description" aria-label="やりたいことのそばにいる">
  <span class="ww" aria-describedby="description">やりたい</span>
  <span class="ww" aria-describedby="description">ことの</span>
  <span class="ww" aria-describedby="description">そばに</span>
  <span class="ww" aria-describedby="description">いる</span>
</p>

This functionality is currently nonfunctional due to the html5lib sanitizer's behavior, which strips ARIA-related attributes from the output HTML. Progress on this issue is tracked at #74.

Author

Shuhei Iitsuka

Disclaimer

This library is authored by a Googler and copyrighted by Google, but is not an official Google product.

License

Copyright 2018 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

budou's People

Contributors

adamyi, aodag, bungoume, juancamilo1, kant, kotarok, luzhang, pieterdeschepper, tushuhei, yaboo-oyabu, yosukesuzuki, zborboa-g, zepplu, zrlk


budou's Issues

Non-breaking space character (\u00A0) causes AssertionError

Here is the problem string: Chatbot\u00a0\u2013

Traceback (most recent call last):
  File "<console>", line 5, in <module>
  File "/usr/local/lib/python3.6/site-packages/budou/parser.py", line 78, in parse
    chunks = self.segmenter.segment(source, language)
  File "/usr/local/lib/python3.6/site-packages/budou/tinysegmentersegmenter.py", line 94, in segment
    assert source[seek] == ' '
AssertionError


Using `budou` name in Node.js port

I've been working on a complete port of your awesome library to Node.js, budou-node.

I would like to use the name budou in the npm package. I thought I would check in to see if you had any issues with that? Happy to make changes ✌️

Process brackets properly

The current implementation concatenates all PUNCT marks (、。「」, etc.) to the previous chunk, but this is not appropriate for some marks, such as opening brackets.

[run error] set GOOGLE_APPLICATION_CREDENTIALS

I installed budou using pip and ran an example like 'budou 안녕하세요', but I got this error the first time I ran the program:
:: [google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://developers.google.com/accounts/docs/application-default-credentials.]
Is there anything more to do than just 'pip install budou'?

Can't install budou using pip

Quick isolated case:

virtualenv venv
source venv/bin/activate
pip install budou

I get error traceback:

Collecting budou                                   
  Using cached budou-0.6.0.tar.gz                  
    Complete output from command python setup.py egg_info:                                             
    Traceback (most recent call last):             
      File "<string>", line 20, in <module>        
      File "/private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou/setup.py", line 31, in <module>                                                                                   
        install_requires=read_file('requirements.txt').splitlines(),                                   
      File "/private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou/setup.py", line 19, in read_file                                                                                  
        with open(os.path.join(os.path.dirname(__file__), name), 'r') as f:                            
    IOError: [Errno 2] No such file or directory: '/private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou/requirements.txt'                                                          
                                                   
    ----------------------------------------       
    Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou    

budou.py returns an error when input text is recognized as 'zh'

Got an HttpError when I input text that is recognized as 'zh'.
Budou must handle CJK texts...

For example:
result = parser.parse(u'再会', 'wordwrap')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/budou/budou.py", line 113, in parse
    chunks = self._get_source_chunks(input_text)
  File "/usr/local/lib/python2.7/dist-packages/budou/budou.py", line 178, in _get_source_chunks
    tokens = self._get_annotations(input_text)
  File "/usr/local/lib/python2.7/dist-packages/budou/budou.py", line 150, in _get_annotations
    response = request.execute()
  File "/usr/local/lib/python2.7/dist-packages/oauth2client/util.py", line 137, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/http.py", line 838, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://language.googleapis.com/v1beta1/documents:annotateText?alt=json returned "The language zh is not supported for syntax analysis.">

Use pickle for caching

The current caching uses the shelve module, which makes unit testing tricky because the suffix of the cache file name may differ by environment. By changing the cache format to pickle, we can improve the portability of cache files and simplify unit testing.

Handle proper nouns

Proper nouns (固有名詞) are sometimes separated into chunks that should ideally be wrapped as one chunk. Possible solutions would be:

  • Use "entity" property in Natural Language API's response to force every entity to be wrapped in one chunk.
  • Allow users to put a list of proper nouns (maybe .csv file) to wrap as one chunk.

Span, Zero-width space, or wbr elements?

I just learned about this nice work. It is particularly useful for people with dyslexia.

In a meeting of the Japanese DAISY project for textbooks, we discussed how hints for line breaking should be represented. The use of span elements was suggested, but people do not want to use span elements for this purpose because DAISY textbooks already use span elements too heavily for multimedia synchronization. Thus, Keio Advanced Publishing Laboratory is inclined to adopt zero-width spaces or wbr elements. Florian's personal draft is based on this assumption. See w3c/jlreq#17

Caching feature improvement

The current caching feature uses a shelve file, but this approach will not work with some serverless architectures, such as the App Engine environment, which may launch multiple instances for front-end serving. In order to enable caching for PaaS services, it would be better to update the caching feature with a factory method pattern and let each platform use its own specialized implementation.

Accessibility Improvement

Some screen reader programs read Budou-enabled paragraphs chunk by chunk, which makes their reading speed slow. We may want to add the capability to configure the attributes of each SPAN tag in order to let users add ARIA attributes to control a screen reader's behavior.

Here's an example which controls screen reading properly.

<p id="description" aria-label="やりたいことのそばにいる Android">
  <span class="ww" aria-describedby="description">やりたい</span>
  <span class="ww" aria-describedby="description">ことの</span>
  <span class="ww" aria-describedby="description">そばに</span>
  <span class="ww" aria-describedby="description">いる Android</span>
</p>

WARNING:googleapiclient.discovery_cache:file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth

When upgrading google-api-python-client and oauth2client, there is this warning from the GCP cloud:

"WARNING:googleapiclient.discovery_cache:file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth"

It would be great to be able to specify whether the build call in this line should use the cache or not:

service = googleapiclient.discovery.build(

I was thinking something like this:

class NLAPISegmenter(Segmenter):
  ...
  def __init__(self, cache_filename, credentials_path, use_entity, use_cache, cache_discovery):
    ...
    self._authenticate(cache_discovery)
  ...

  def _authenticate(self, cache_discovery):
    ...
    service = googleapiclient.discovery.build(
        'language', 'v1beta2', http=authed_http, cache_discovery=cache_discovery)

Set maximum length for the chunk

Some Japanese katakana terms are too long to fit on one line, which may cause layout degradation. Setting a maximum length for each chunk would be a solution to this issue.

Minor mistake in README

In this section of the README, the code for Traditional Chinese should be zh-Hant, while zh-Hans is for Simplified Chinese.

Resolve html5lib's DeprecationWarning

The current implementation keeps returning the warning below.

DeprecationWarning: This method will be removed in future versions.  Use 'list(elem)' or iteration over elem instead.

English characters should be ignored

The output of
Google XXXX YYYY へこんにちは
should be
Google XXXX <span>YYYY へ</span><span>こんにちは</span>
instead of
<span>Google </span><span>XXXX </span><span>YYYY へ</span><span>こんにちは</span>

Documentation enhancement

The items below should be covered:

  • Custom filter integration (e.g. Flask and Django)
  • Accessibility enhancement (aria-describedby)

Chinese language name

Natural Language API accepts 'zh' and 'zh-Hant' as supported languages, but the current implementation may pass 'zh', 'zh-TW', 'zh-CN', or 'zh-HK' to the API. They need to be aligned.

Copy icon removes spaces (breaks Korean)

The copy icon on the tool removes spaces from the text, which effectively breaks Korean text.


Copy pasting manually:
<span><span class="ww">취소에</span> <span class="ww">대해</span> <span class="ww">궁금한</span> <span class="ww">점이</span> <span class="ww">있으면</span> <span class="ww">가족</span> <span class="ww">그룹</span> <span class="ww">관리자에게</span> <span class="ww">문의하세요.</span></span>

Using the copy button (spaces removed):
<span><span class="ww">취소에</span><span class="ww">대해</span><span class="ww">궁금한</span><span class="ww">점이</span><span class="ww">있으면</span><span class="ww">가족</span><span class="ww">그룹</span><span class="ww">관리자에게</span><span class="ww">문의하세요.</span></span>
