kospacing's Introduction

KoSpacing

License: GPL v3

R package for automatic Korean word spacing.

Python version can be found here.

Introduction

Word spacing is an important preprocessing step in Korean text analysis, and its accuracy strongly affects the accuracy of the analysis that follows. KoSpacing spaces words fairly accurately and works especially well on online text from SNS or SMS.

For example, “아버지가방에들어가신다.” can be spaced in either of the two ways below.

  1. “아버지가 방에 들어가신다.” means “My father enters the room.”
  2. “아버지 가방에 들어가신다.” means “My father goes into the bag.”

By common sense, the first is the right answer.

KoSpacing is based on a deep learning model trained on a large corpus (more than 100 million news articles from Chan-Yub Park).

Performance

Test Set                                  Accuracy
Sejong (colloquial style) Corpus (1M)     97.1%
OOOO (literary style) Corpus (3M)         94.3%

  • Accuracy = (number of correctly spaced characters) / (number of characters in the test data).
    • Accuracy might improve if compound words were normalized.
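The accuracy definition above can be sketched in R. This is only an illustration under assumptions, not the evaluation script behind the table: `space_labels` and `spacing_accuracy` are hypothetical helper names, and the sketch assumes that a character counts as "correctly spaced" when the presence or absence of a space after it matches the reference.

```r
# Hypothetical helper: for each non-space character, TRUE if a space follows it.
space_labels <- function(x) {
  chars <- strsplit(x, "")[[1]]
  is_space <- chars == " "
  # Shift by one position to see whether the next character is a space.
  followed <- c(is_space[-1], FALSE)
  followed[!is_space]
}

# Hypothetical helper: share of characters whose spacing label matches the reference.
spacing_accuracy <- function(predicted, reference) {
  p <- space_labels(predicted)
  r <- space_labels(reference)
  # Both strings must contain the same number of non-space characters.
  stopifnot(length(p) == length(r))
  mean(p == r)
}

reference <- "아버지가 방에 들어가신다."
predicted <- "아버지 가방에 들어가신다."
spacing_accuracy(predicted, reference)
```

With the two spacings of the example sentence, 10 of the 12 non-space characters carry the same label, so the misplaced space costs two characters rather than one.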

Install

To install from GitHub, use

install.packages('remotes')
remotes::install_github('haven-jeon/KoSpacing')
library(KoSpacing)
set_env()

Example

library(KoSpacing)
#> If you install package first fime,
#> Please set_env() run before using spacing()
spacing("김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.")
#> loaded KoSpacing model!
#> [1] "김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다."

Model Architecture

Citation

@misc{heewon2018,
author = {Heewon Jeon},
title = {KoSpacing: Automatic Korean word spacing},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/haven-jeon/KoSpacing}}
}

kospacing's People

Contributors

gogamza, haven-jeon, mrchypark

kospacing's Issues

Upgrading the keras version fails.

Currently, KoSpacing requires the versions to be pinned to Python 3.6, h5py 2.10.0, TensorFlow 1.9.0, and keras 2.1.5.
(Python 3.6 is a constraint of TensorFlow 1.9.0, so the Python version cannot be raised unless TensorFlow is upgraded.)
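As a rough sketch of what such pinning could look like with reticulate: the version numbers are the ones quoted in this issue, while the environment name `kospacing-pinned` and the helper `create_pinned_env` are made-up placeholders. Whether this works together with KoSpacing's own `set_env()`, or avoids the conda discovery problem described in this issue, is not confirmed here.

```r
# Version pins quoted in this issue.
pins <- c(h5py = "2.10.0", tensorflow = "1.9.0", keras = "2.1.5")
# Build pip-style specifications such as "keras==2.1.5".
pip_specs <- unname(paste0(names(pins), "==", pins))

# Hypothetical helper (not part of KoSpacing): create a conda env with
# Python 3.6, install the pinned packages via pip, and bind reticulate to it.
create_pinned_env <- function(envname = "kospacing-pinned") {
  reticulate::conda_create(envname, python_version = "3.6")
  reticulate::conda_install(envname, packages = pip_specs, pip = TRUE)
  reticulate::use_condaenv(envname, required = TRUE)
}

print(pip_specs)
```

The helper is only defined, not called, so nothing is installed until `create_pinned_env()` is run explicitly in a session where conda is available.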

When working around reticulate's Python version pinning by setting up a new conda env, reticulate 1.25 fails to find the conda env and reports the error message below. (Probably on Windows.)

Error in Sys.setenv(PATH = new_path) : wrong length for argument

So either the existing r-reticulate environment has to be usable as-is, or Python has to be upgraded to 3.8 or later to work around the problem of the environment not being found.

Raising keras above version 2.1.5 produces the error below.

keras <- reticulate::import("keras")
> model <- keras$models$load_model(model_file, compile = FALSE)
2022-07-30 13:48:43.523325: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  NotImplementedError: Cannot convert a symbolic Tensor (gru_1/strided_slice:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported
> model <- keras$models$load_model(model_file, compile = TRUE)
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  NotImplementedError: Cannot convert a symbolic Tensor (gru_1/strided_slice_1:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported

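So there are two ways to keep the package maintained.

  1. Get the problem in the reticulate package resolved.
  2. Modify the model so that it can be provided with up-to-date package versions or with a different framework.

For this repo, I think it would be best to proceed with option 1.
It would probably help if people interested in using the package kept raising the issue with the reticulate package.
The same error message seems to have been reported a few times, but it does not look easy to resolve.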
https://github.com/rstudio/reticulate/issues?q=Error+in+Sys.setenv%28PATH+%3D+new_path%29+%3A+wrong+length+for+argument

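I will personally try the approach in option 2.
First, I will try using a model converter to move to a different framework.
If that does not work either, the model would have to be built anew, but I am not sure yet.

For now, I am opening this issue to let others know that the problem has become difficult to resolve.
Thank you.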

KoSpacing package installation error

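Hello. I am a university student who has just started learning R.
For cleaner text mining in R,
I tried to install the KoSpacing package you created,
but a non-zero exit status error keeps occurring, so I am writing to ask about it.
This error does not occur when I install other packages;
it occurs only when installing KoSpacing and hashmap, which is needed to install it.

The error output is below. Thank you.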

ERROR: compilation failed for package 'hashmap'

  • removing 'C:/R-4.1.2/library/hashmap'
    Downloading GitHub repo nathan-russell/hashmap@HEAD
    Skipping hashmap, it is already being installed
    √ checking for file 'C:\Users\user1\AppData\Local\Temp\RtmpIzOgGY\remotes6ae0457a50c2\haven-jeon-KoSpacing-f6728eb/DESCRIPTION' ...
  • preparing 'KoSpacing':
    √ checking DESCRIPTION meta-information ...
  • checking for LF line-endings in source and make files and shell scripts
  • checking for empty or unneeded directories
    Omitted 'LazyData' from DESCRIPTION
  • building 'KoSpacing_0.1.2.tar.gz'

ERROR: dependency 'hashmap' is not available for package 'KoSpacing'

  • removing 'C:/R-4.1.2/library/KoSpacing'
    Warning messages:
    1: In i.p(...) :
    installation of package ‘C:/Users/user1/AppData/Local/Temp/RtmpIzOgGY/file6ae058f73297/hashmap_0.2.2.tar.gz’ had non-zero exit status
    2: In i.p(...) :
    installation of package ‘C:/Users/user1/AppData/Local/Temp/RtmpIzOgGY/file6ae02cd961f2/KoSpacing_0.1.2.tar.gz’ had non-zero exit status

package install error on windows

The r-conda environment can be set up under x64 but not under x32 (i386), and I got the error message below.

> devtools::install_github('haven-jeon/KoSpacing')
Downloading GitHub repo haven-jeon/KoSpacing@master
from URL https://api.github.com/repos/haven-jeon/KoSpacing/zipball/master
Installing KoSpacing
"C:/PROGRA~1/R/R-35~1.1/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD  \
  INSTALL  \
  "D:/Users/patrick_int/AppData/Local/Temp/1/RtmpuO7zwV/devtools28e840cb2c6d/haven-jeon-KoSpacing-381b2e0"  \
  --library="D:/Users/patrick_int/Documents/R/win-library/3.5" --install-tests 

* installing *source* package 'KoSpacing' ...
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
  converting help for package 'KoSpacing'
    finding HTML links ... done
    hello                                   html  
    spacing                                 html  
** building package indices
** testing if installed package can be loaded
*** arch - i386
Error: package or namespace load failed for 'KoSpacing':
 .onAttach failed in attachNamespace() for 'KoSpacing', details:
  call: NULL
  error: Unable to find conda binary. Is Anaconda installed?
Error: loading failed
Execution halted
*** arch - x64
2018-09-17 18:06:51.057878: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
ERROR: loading failed for 'i386'
* removing 'D:/Users/patrick_int/Documents/R/win-library/3.5/KoSpacing'
In R CMD INSTALL
Installation failed: Command failed (1)

session info

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] installr_0.20.1 stringr_1.3.1  

loaded via a namespace (and not attached):
 [1] httr_1.3.1      compiler_3.5.1  R6_2.2.2        magrittr_1.5    tools_3.5.1     withr_2.1.2    
 [7] curl_3.2        memoise_1.1.0   stringi_1.1.7   git2r_0.23.0    digest_0.6.17   devtools_1.13.6

conda

> reticulate::conda_version()
[1] "conda 4.5.11"
> reticulate::conda_list()
        name                                                                                   python
1 miniconda3                C:\\Users\\<<user_name>>\\AppData\\Local\\Continuum\\miniconda3\\python.exe
2    r-conda C:\\Users\\<<user_name>>\\AppData\\Local\\Continuum\\miniconda3\\envs\\r-conda\\python.exe
