
scel2txt

Converts Sogou cell dictionaries (.scel) to Rime (Squirrel) dictionaries. Python 3 and Golang implementations are provided.

Usage

Put the *.scel files downloaded from the official Sogou dictionary site into the scel folder, then run:

Python

python3 scel2txt.py

Or download the prebuilt binary scel2txt-darwin-amd64-0.0.1.gz:

gunzip scel2txt-darwin-amd64-0.0.1.gz
chmod +x scel2txt-darwin-amd64-0.0.1
./scel2txt-darwin-amd64-0.0.1

Generated files

  • A dictionary file with the same name as each input and a .txt suffix
  • All *.txt files are automatically merged into luna_pinyin.sogou.dict.yaml
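
The merge step can be pictured roughly as follows. This is only a sketch under assumptions: it supposes each line of a .txt file is already a complete Rime entry (word, pinyin and weight separated by tabs), which is not stated above, and the helper name `merge_txt_files` and the header values are hypothetical rather than taken from the project.

```python
from pathlib import Path

# Assumed Rime dictionary header; the values are illustrative only.
HEADER = """\
# Rime dictionary
---
name: luna_pinyin.sogou
version: "1.0"
sort: by_weight
use_preset_vocabulary: true
...

"""


def merge_txt_files(txt_dir: str, out_path: str) -> int:
    """Write the unique, non-empty lines of every *.txt file in txt_dir to out_path.

    Returns the number of entries written.
    """
    seen, merged = set(), []
    for txt_file in sorted(Path(txt_dir).glob("*.txt")):
        for line in txt_file.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # Keep the first occurrence of each entry, preserving order.
            if line and line not in seen:
                seen.add(line)
                merged.append(line)
    Path(out_path).write_text(HEADER + "\n".join(merged) + "\n", encoding="utf-8")
    return len(merged)
```

A call such as `merge_txt_files("scel", "luna_pinyin.sogou.dict.yaml")` would then produce the merged file; the directory name here is also an assumption.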

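Sogou cell dictionary (.scel file) format

A Unicode-encoded file with a fixed layout, in which every two bytes represent one character (a Chinese character or an English letter).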
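
As a small illustration of the two-bytes-per-character layout, the sketch below decodes a byte string as UTF-16 little-endian. The little-endian byte order and the helper name `decode_chars` are assumptions made for this example.

```python
def decode_chars(raw: bytes) -> str:
    """Decode a run of 2-byte characters, assuming UTF-16 little-endian."""
    if len(raw) % 2 != 0:
        raise ValueError("expected an even number of bytes")
    return raw.decode("utf-16-le")


# Four characters (two Chinese, two ASCII) occupy eight bytes.
sample = "词库ab".encode("utf-16-le")
print(len(sample))           # 8
print(decode_chars(sample))  # 词库ab
```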

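The file has two main parts:

  1. Global pinyin table, located at file offset 0x1540+4, in the format (py_idx, py_len, py_str)

    • py_idx: a two-byte integer, the index of this pinyin syllable
    • py_len: a two-byte integer, the length of the pinyin string in bytes
    • py_str: the pinyin string itself, two bytes per character, py_len bytes in total
  2. Chinese phrase table, located at file offset 0x2628 or 0x26c4, in the format (word_count, py_idx_count, py_idx_data, (word_len, word_str, ext_len, ext){word_count}), where the group (word_len, word_str, ext_len, ext) repeats word_count times, meaning that word_count words share the same pinyin

    • word_count: a two-byte integer, the number of homophones
    • py_idx_count: a two-byte integer, the number of pinyin indices
    • py_idx_data: a sequence of two-byte integers, each one a pinyin index; there are py_idx_count of them
    • word_len: a two-byte integer, the length of the Chinese phrase in bytes
    • word_str: the Chinese phrase, two bytes per Chinese character, word_len bytes in total
    • ext_len: a two-byte integer, probably the length of the extension data; it appears to always be 10
    • ext: extension data, 10 bytes in total; the first two bytes are an integer (possibly the word frequency, unconfirmed) and the remaining eight bytes are all 0. ext_len and ext together take 12 bytes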
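
The two record layouts above can be sketched as a pair of small parsers. This is a minimal illustration, not the project's code: it assumes little-endian integers and UTF-16LE text, it treats py_idx_count as a count of two-byte indices exactly as described above (worth checking against real files, where the field may instead be a byte length), and all function names are hypothetical. It runs on a hand-built buffer rather than a real .scel file.

```python
import struct


def read_u16(buf: bytes, pos: int) -> int:
    """Read one unsigned 16-bit little-endian integer at pos."""
    return struct.unpack_from("<H", buf, pos)[0]


def parse_pinyin_table(buf: bytes) -> dict:
    """Parse repeated (py_idx, py_len, py_str) records into {py_idx: py_str}."""
    table, pos = {}, 0
    while pos + 4 <= len(buf):
        py_idx, py_len = read_u16(buf, pos), read_u16(buf, pos + 2)
        pos += 4
        table[py_idx] = buf[pos:pos + py_len].decode("utf-16-le")
        pos += py_len
    return table


def parse_word_table(buf: bytes, pinyin: dict) -> list:
    """Parse word groups into (syllables, word, ext_value) tuples."""
    entries, pos = [], 0
    while pos + 4 <= len(buf):
        word_count, py_idx_count = read_u16(buf, pos), read_u16(buf, pos + 2)
        pos += 4
        # Assumption: py_idx_count counts 2-byte indices, not bytes.
        syllables = [pinyin[read_u16(buf, pos + 2 * i)] for i in range(py_idx_count)]
        pos += 2 * py_idx_count
        for _ in range(word_count):
            word_len = read_u16(buf, pos)
            pos += 2
            word = buf[pos:pos + word_len].decode("utf-16-le")
            pos += word_len
            ext_len = read_u16(buf, pos)
            pos += 2
            ext_value = read_u16(buf, pos)  # first two bytes of ext; meaning uncertain
            pos += ext_len
            entries.append((syllables, word, ext_value))
    return entries


def u16(n: int) -> bytes:
    return struct.pack("<H", n)


def text(s: str) -> bytes:
    return s.encode("utf-16-le")


# Synthetic pinyin table: index 0 -> "ni", index 1 -> "hao".
py_buf = b"".join(u16(i) + u16(len(text(s))) + text(s) for i, s in enumerate(["ni", "hao"]))

# Synthetic word group: one word with two syllables, ext = integer 7 plus eight zero bytes.
word = text("你好")
ext = u16(7) + bytes(8)
word_buf = u16(1) + u16(2) + u16(0) + u16(1) + u16(len(word)) + word + u16(len(ext)) + ext

pinyin = parse_pinyin_table(py_buf)
print(parse_word_table(word_buf, pinyin))  # [(['ni', 'hao'], '你好', 7)]
```

For a real file, the same functions would be applied to the slices of the file that start at the offsets listed above.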

Dictionaries tested so far

References

  1. scel2mmseg
  2. scel-to-txt

Contributors

asc8384, lewangdev


scel2txt's Issues

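Some dictionaries fail to convert

"网络流行新词【官方推荐】.scel" was not converted (seven .scel files were placed in the folder). The output, which lists each file name followed by its word count (个词 means "words"), was: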


数学词汇大全【官方推荐】.scel:15992 个词

**高等院校(大学)大全【官方推荐】.scel:7192 个词

物理词汇大全【官方推荐】.scel:13107 个词

计算机词汇大全【官方推荐】.scel:10300 个词

开发大神专用词库【官方推荐】.scel:430 个词

linux少量术语.scel:136 个词

Traceback (most recent call last):
  File "scel2txt.py", line 175, in <module>
    scel2txt.deal(os.path.join("./scel", scel_file))
  File "scel2txt.py", line 147, in deal
    self.getChinese(data[self.startChinese:])
  File "scel2txt.py", line 116, in getChinese
    py = self.getWordPy(data[pos: pos+py_table_len])
  File "scel2txt.py", line 87, in getWordPy
    ret.append(self.GPy_Table[index])
KeyError: 34945
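
The traceback ends in a dictionary lookup on an index that is not in the pinyin table. One defensive option, sketched below, is to detect unknown indices and let the caller skip the affected entry instead of aborting the whole run. The function name `lookup_pinyin` and the skip-on-unknown policy are assumptions for illustration; this is not the project's fix, and an unknown index may equally indicate that the phrase table was read from the wrong offset.

```python
import struct
from typing import Dict, List, Optional


def lookup_pinyin(py_idx_data: bytes, table: Dict[int, str]) -> Optional[List[str]]:
    """Map 2-byte little-endian indices to syllables.

    Returns None if any index is missing from the table, so the caller
    can skip or log the entry rather than raise KeyError.
    """
    syllables = []
    for (index,) in struct.iter_unpack("<H", py_idx_data):
        if index not in table:
            return None
        syllables.append(table[index])
    return syllables


table = {0: "ni", 1: "hao"}
print(lookup_pinyin(struct.pack("<HH", 0, 1), table))  # ['ni', 'hao']
print(lookup_pinyin(struct.pack("<H", 34945), table))  # None
```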
