seimei's Introduction

seimei

seimei is a Go port of namedivider-python, a Python tool that splits Japanese full names into family name and given name.

For implementation details, see namedivider-python, the project this tool was ported from.

Installation

go install github.com/glassmonkey/seimei/v2/cmd/seimei@latest

Usage

Options

$ seimei -h
Usage:
  seimei [flags]
  seimei [command]

Available Commands:
  name        It parse single full name.
  file        It bulk parse full name lit in the file.
  help        Help about any command

Flags:
  -h, --help      help for seimei
  -v, --version   version for seimei

Use "seimei [command] --help" for more information about a command

Example

$ seimei name --name 竈門炭治郎
竈門 炭治郎

$ seimei name --name 竈門禰豆子 --parse @
竈門@禰豆子

$ cat /tmp/kimetsu.txt
竈門炭治郎
竈門禰豆子
我妻善逸
嘴平伊之助

$ seimei file --file /tmp/kimetsu.txt
竈門 炭治郎
竈門 禰豆子
我妻 善逸
嘴平 伊之助

$ seimei file --file /tmp/kimetsu.txt --parse @
竈門@炭治郎
竈門@禰豆子
我妻@善逸
嘴平@伊之助
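
When the output is consumed by another program, a custom separator such as @ makes each line easy to split back into its two parts. The following is a minimal sketch of that downstream step, assuming the output format shown in the examples above. The function name splitDivided is hypothetical and is not part of seimei.

```go
package main

import (
	"fmt"
	"strings"
)

// splitDivided splits one line of divided output, such as "竈門@禰豆子",
// into family and given name at the first occurrence of sep.
// ok is false when sep does not occur in the line.
// Hypothetical helper for illustration; not part of seimei.
func splitDivided(line, sep string) (family, given string, ok bool) {
	return strings.Cut(line, sep)
}

func main() {
	// Sample lines copied from the example output above.
	lines := []string{"竈門@炭治郎", "竈門@禰豆子", "我妻@善逸", "嘴平@伊之助"}
	for _, line := range lines {
		family, given, ok := splitDivided(line, "@")
		if !ok {
			fmt.Printf("no separator in %q\n", line)
			continue
		}
		fmt.Printf("family=%s given=%s\n", family, given)
	}
}
```

Choosing a separator that cannot appear in a name keeps the split unambiguous.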

License

MIT

Author

glassmonkey(@glassmonekey)

seimei's Issues

Add Features Testing

The Feature package is a lightweight implementation of numpy functionality, but it is not well tested. The softmax function is one example.
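
As an illustration of the kind of function that needs tests, here is a minimal softmax sketch in Go. It is an assumption about what such a function could look like, not the code in seimei's Feature package. The properties worth testing are that the outputs sum to 1, that input order is preserved, and that large inputs do not overflow.

```go
package main

import (
	"fmt"
	"math"
)

// softmax returns the softmax of xs. The maximum is subtracted before
// exponentiation so that large inputs do not overflow.
// An empty input yields an empty result.
// Illustrative sketch only; not seimei's implementation.
func softmax(xs []float64) []float64 {
	out := make([]float64, len(xs))
	if len(xs) == 0 {
		return out
	}
	maxVal := xs[0]
	for _, x := range xs[1:] {
		if x > maxVal {
			maxVal = x
		}
	}
	sum := 0.0
	for i, x := range xs {
		out[i] = math.Exp(x - maxVal)
		sum += out[i]
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}

func main() {
	fmt.Println(softmax([]float64{1, 2, 3}))
}
```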

Feature: read a list of undivided names from a text file

The English text below is a translation of the Japanese original (from DeepL.com).

About

It would be great if you could port the file subcommand that exists in namedivider-python, the project seimei was ported from.

Merits

  • To split multiple names, you no longer have to call $ seimei --name ${undividedName} once per name, which improves performance by removing the overhead of starting the command each time.
  • The risk of OS command injection is reduced because the undivided name is no longer passed directly as a command argument.

Demerits

  • The maintenance cost of this command increases because file handling adds complexity. For example:
    • The character encoding of the input file has to be considered.
    • Files over a certain size have to be rejected.

Added missing information

Expected file format and worst case

  • As in namedivider-python, the command accepts a text file with one undivided full name per line and writes the divided full names to standard output.
  • In my use case, I expect it to read and output text files of up to about 1GB.
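
One way to meet the size expectation is to stream the input line by line instead of loading the whole file. The sketch below shows that approach under stated assumptions: divideFunc, divideLines, divideString, and naiveDivide are hypothetical names, and naiveDivide is a placeholder that splits after the second character. It is not the algorithm seimei uses.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// divideFunc is a hypothetical stand-in for the real name divider.
type divideFunc func(fullName string) (family, given string)

// divideLines reads one undivided full name per line from r and writes
// "family<sep>given" lines to w. Input is processed one line at a time,
// so memory use does not grow with the size of the input.
func divideLines(r io.Reader, w io.Writer, sep string, divide divideFunc) error {
	sc := bufio.NewScanner(r)
	bw := bufio.NewWriter(w)
	for sc.Scan() {
		name := strings.TrimSpace(sc.Text())
		if name == "" {
			continue // skip blank lines
		}
		family, given := divide(name)
		if _, err := fmt.Fprintf(bw, "%s%s%s\n", family, sep, given); err != nil {
			return err
		}
	}
	if err := sc.Err(); err != nil {
		return err
	}
	return bw.Flush()
}

// divideString is a convenience wrapper for in-memory input.
func divideString(input, sep string, divide divideFunc) (string, error) {
	var sb strings.Builder
	err := divideLines(strings.NewReader(input), &sb, sep, divide)
	return sb.String(), err
}

// naiveDivide is a placeholder that splits after the second character.
// It is NOT the algorithm seimei uses.
func naiveDivide(fullName string) (string, string) {
	rs := []rune(fullName)
	if len(rs) <= 2 {
		return fullName, ""
	}
	return string(rs[:2]), string(rs[2:])
}

func main() {
	out, err := divideString("竈門炭治郎\n我妻善逸\n", " ", naiveDivide)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

Because divideLines takes an io.Reader, the same function works for an opened file, standard input, or a string in a test. bufio.Scanner limits a single line to 64 KiB by default, which is ample for names but would make a malformed file without line breaks fail with an error instead of exhausting memory.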

follow namedivider-python (0.2.0)

reference

rskmoi/namedivider-python@bb9e091

topic

Algorithm Modification

The handling of names by number of characters (n=4) has some points where a change would improve accuracy.

https://github.com/rskmoi/namedivider-python/blob/bb9e091bd81790f1a647d38122600df0abde5505/namedivider/divider/config.py#L41

Gradient Boosted Decision Tree (GBDT)

Accuracy improves from about 99.5% to about 99.9%, at the cost of execution speed.

detail:
https://dune-fifth-da7.notion.site/NameDivider-9118f1a74ca545629dbbfa606a39ba0a
