Comments (13)

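BlindingDark commented on July 19, 2024

The current plan is:

  1. Set up a separate repository for each dictionary source to maintain and release it, then provide aggregated download links in this repository.
  2. Provide configuration options so that users can easily swap dictionaries or combine several of them.

The purpose of this issue is to collect reliable dictionary sources. A reliable source must meet the following criteria:

  1. No copyright problems (the source must explicitly state that it may be freely used, processed, and distributed).
  2. Reliable content (for example, a dictionary reposted on some forum, with no traceable original author and no way to maintain it, does not qualify).
  3. Easy for programs to process (for example, PDFs and web pages do not qualify).

Everyone is welcome to share suitable dictionaries here.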

from rime-easy-en.

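BlindingDark commented on July 19, 2024

The default dictionary comes from https://github.com/skywind3000/ECDICT
Generation tool: https://github.com/BlindingDark/rime_easy_eng_dict
By default, the tool exports words that have a word frequency or are at most 9 characters long, and that contain no digits.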
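
The default export rule described above can be sketched roughly as follows. This is only an illustration, not the actual code of rime_easy_eng_dict. It assumes the rule groups as "(has a frequency OR length <= 9) AND contains no digits", that the input is a CSV with `word`, `bnc`, and `frq` columns as in ECDICT, and that an empty or zero value means "no frequency data". The names `keep_word` and `filter_rows` and the sample rows (including their numbers) are made up for the example.

```python
import csv
import io


def keep_word(word: str, bnc: int, frq: int, max_len: int = 9) -> bool:
    """Return True if the word passes the assumed default export rule."""
    # Assumption: words containing any digit are always excluded.
    if any(ch.isdigit() for ch in word):
        return False
    # Assumption: a positive bnc or frq value means the word has a frequency.
    has_frequency = bnc > 0 or frq > 0
    return has_frequency or len(word) <= max_len


def filter_rows(csv_text: str) -> list[str]:
    """Read CSV text and return the words that would be exported."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = []
    for row in reader:
        # Empty cells are treated as 0, i.e. no frequency information.
        bnc = int(row.get("bnc") or 0)
        frq = int(row.get("frq") or 0)
        if keep_word(row["word"], bnc, frq):
            kept.append(row["word"])
    return kept


# Made-up sample rows; the numbers are not real ECDICT values.
SAMPLE = """word,bnc,frq
apple,2446,2695
internationalization,0,0
mp3,0,0
zyzzyva,0,0
counterrevolutionary,15000,0
"""

print(filter_rows(SAMPLE))
```

With the sample above, `apple` and `counterrevolutionary` are kept because they carry a frequency value, `zyzzyva` is kept because it is short, `internationalization` is dropped because it is long and has no frequency, and `mp3` is dropped because it contains a digit.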

VimWei commented on July 19, 2024

Word frequency corpora

Global English
Notes on interpreting frequency markers in monolingual English learner's dictionaries
Dictionary of word-sense proportions

Learn these words first

Learn These Words First
What is a Multi-Layer Dictionary?

New General Service List

New General Service List
General Service List - Wikipedia
New General Service List - Wikipedia
NGSL by Frequency | Quizlet

COCA

COCA Word frequency: based on 450 million word COCA corpus
Corpus of Contemporary American English (COCA)
Word frequency: based on 450 million word COCA corpus
COCA Frequency list: view frequency rankings for 60,000 words

BNC - British National Corpus

[bnc] British National Corpus
BNCweb registration - Main
BNC database and word frequency lists
British National Corpus (BYU-BNC)
Paul Nation Vocabulary Lists based on BNC
Companion Website for: Word Frequencies in Written and Spoken English
BNC 15000 - Sustainable English

Oxford words list

Oxford 3000 and 5000 | OxfordLearnersDictionaries.com
About Word Lists at Oxford Learner's Dictionaries | OxfordLearnersDictionaries.com

Longman Communication 3000 and 9000

Longman Communication 3000
sapbmw/Longman-Communication-3000: Longman Communication 3000 word list, English Words List - Learn English Words
Longman Communication 9000 from LDOCE6 - PDAWIKI
Longman Communication 9000

google 10000

google-10000-most common english words
Google Ngram Viewer

Longman Defining Vocabulary

Longman Defining Vocabulary/alphabetically - FrathWiki
The Longman Defining Vocabulary
Longman Defining Vocabulary - FrathWiki

Macmillan Dictionary Common English Words

Macmillan Dictionary Common English Words
English Language Resources from Macmillan Dictionary

Famous Freq Lists - Lextutor.ca

Famous Freq Lists - Lextutor.ca
Compleat Lexical Tutor

VOA Special English Word Book
Wikipedia:Lists of common misspellings - Wikipedia
Hyper Collocation — dictionary based on arXiv repository
Media Language Corpus (MLC)
Open American National Corpus | Open Data for Language Research and Education
Richard Kennaway's Constructed Languages List
Mills Basic Vocabulary - FrathWiki
Basic English - Wikipedia
jjzz/ZZ-WordFreq: words frequency top100k from BNC/ANC/COCA, dsl format, for goldendict
[Updated 2016-04-26] 170,000-word BNC+ANC+COCA word frequency dictionary
Update: Word Frequency of 170,000 Words
Just The Word: collocation usage frequency

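BlindingDark commented on July 19, 2024

@VimWei Please try to post only dictionary sources that are free of copyright concerns and that programs can process. For example, we probably have no right to use university study materials directly, and some of these are PDFs, wiki links, and the like, which are inconvenient to process. Could you please put together a filtered list?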

VimWei commented on July 19, 2024

Specialized word lists: https://pinyin.sogou.com/dict/
Specialized dictionaries: https://www.pdawiki.com/forum/forum.php?mod=viewthread&tid=42412

BlindingDark commented on July 19, 2024

Specialized word lists: https://pinyin.sogou.com/dict/
Specialized dictionaries: https://www.pdawiki.com/forum/forum.php?mod=viewthread&tid=42412

  1. No copyright permission
  2. Cannot be processed by a program

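VimWei commented on July 19, 2024

@VimWei Please try to post only dictionary sources that are free of copyright concerns and that programs can process. For example, we probably have no right to use university study materials directly, and some of these are PDFs, wiki links, and the like, which are inconvenient to process. Could you please put together a filtered list?

Right, the list above is simply an export of my browser bookmarks and has indeed not been curated. That said, we don't need to turn every available resource into a ready-made dictionary.

Suggestion: provide one or two representative dictionaries as examples, offer a mechanism for using custom dictionaries along with a tutorial on how to build one, and let users work out the rest themselves.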

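BlindingDark commented on July 19, 2024

a mechanism for using custom dictionaries along with a tutorial on how to build one

Ordinary users have neither the ability nor the energy for that.

Dictionaries are meant for end users. The purpose of this issue is to collect reliable dictionary sources. A reliable source must meet the following criteria:

  1. No copyright problems (the source must explicitly state that it may be freely used, processed, and distributed).
  2. Reliable content (for example, a dictionary reposted on some forum, with no traceable original author and no way to maintain it, does not qualify).
  3. Easy for programs to process (for example, PDFs and web pages do not qualify).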

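VimWei commented on July 19, 2024

Material that is purely open source and free of copyright is indeed extremely rare, and it does not work well in practice either.

Let's just disregard the material above; it only serves to illustrate what a word frequency corpus is and what a specialized word list is.

PS: People who use Rime probably all enjoy tinkering, so they can't really be classed as ordinary users. I downloaded it once and gave up, and only picked it up again in the last couple of days.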

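BlindingDark commented on July 19, 2024

Material that is purely open source and free of copyright is indeed extremely rare

ECDICT is the most reliable open-source dictionary I have been able to find. We can build several trimmed and patched variants around it.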

VimWei commented on July 19, 2024

wiktionary: https://en.wiktionary.org/

Wiktionary is a wiki, which means that you can edit it, and all the content is dual-licensed under both the Creative Commons Attribution-ShareAlike 3.0 Unported License and the GNU Free Documentation License.

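Although the original is web-based, there are already quite a few mdx dictionaries built from it, so it should be fairly easy to convert.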

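blackhole889 commented on July 19, 2024

Rime is an excellent input method program, but it also has some significant shortcomings. One of them is building and curating dictionaries.
There are two sides to making a dictionary more efficient: the words I need are in it, and the words I don't need are not. Focusing on only one of them, for example enlarging the dictionary, does not make it more efficient.
At present Rime does not seem to be able to delete words from existing dictionaries. The advertised shortcuts (Ctrl+Del, Shift+Del, Ctrl+K) can delete user-created words but not words that come from a dictionary, and even lowering the weight of an existing dictionary entry is hard to do; it occasionally works but is very unreliable. Some words are never used by typical users, and removing them from the dictionary would improve input efficiency. Could the original program be modified so that any word or phrase can be deleted?

Solving the problem of deleting words from a dictionary would also let users remove privacy-sensitive words and share their personal dictionaries, from which a good dictionary could then be distilled.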

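BlindingDark commented on July 19, 2024

Could the original program be modified

That is outside the scope of this issue. Please send that feedback to the Rime project.

However, you can edit the dictionary directly with a text editor.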
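
For anyone who would rather script that edit than do it by hand, here is a minimal sketch. It assumes the dictionary is a Rime `*.dict.yaml` text file: a YAML header that ends with a line containing only `...`, followed by one entry per line with tab-separated fields, the first field being the word itself. The function name `remove_words` and the sample content are made up for illustration; check the layout of your own file before relying on this.

```python
def remove_words(dict_text: str, unwanted: set[str]) -> str:
    """Return the dictionary text with entries for unwanted words removed."""
    out = []
    in_body = False
    for line in dict_text.splitlines():
        if not in_body:
            # Copy the YAML header unchanged, up to and including "...".
            out.append(line)
            if line.strip() == "...":
                in_body = True
            continue
        # Assumption: the first tab-separated field is the word.
        entry = line.split("\t", 1)[0]
        if line and not line.startswith("#") and entry in unwanted:
            continue  # drop this entry
        out.append(line)
    return "\n".join(out) + "\n"


# Made-up sample dictionary, only for demonstration.
SAMPLE = (
    "---\n"
    "name: demo\n"
    'version: "1"\n'
    "...\n"
    "apple\tapple\t100\n"
    "zyzzyva\tzyzzyva\t1\n"
)

print(remove_words(SAMPLE, {"zyzzyva"}))
```

After editing the file, Rime most likely needs to be redeployed before the change takes effect.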
