JWE

Source code for our EMNLP 2017 paper "Joint Embeddings of Chinese Words, Characters, and Fine-grained Subcharacter Components".

Preparation

You need to prepare a training corpus and the Chinese subcharacter radicals or components.

  • Training corpus. Download the Chinese Wikipedia dump. Following the instructions on the blog, you can extract the raw content from the XML file and preprocess it, for example by removing pure digits and non-Chinese characters. Alternatively, you can download the preprocessed corpus from the online Baidu box.
  • Subcharacter radicals and components. By deploying the Scrapy code in JWE/ChineseCharCrawler on Scrapy Cloud, you can crawl the resource from HTTPCN. We provide a copy of the data in ./subcharacters for research convenience. The copyright and all rights therein of the subcharacter data are reserved by the website HTTPCN.
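
The cleaning step mentioned for the training corpus can be sketched in Python as follows. This is only an illustration, not the preprocessing used for the paper: the function names are hypothetical, the input is assumed to be plain text already extracted from the XML dump, and "Chinese character" is assumed to mean the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
import re

# Assumption: anything outside the basic CJK block counts as non-Chinese.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def clean_line(line):
    """Replace runs of non-Chinese characters with a space and return the remaining tokens."""
    return NON_CHINESE.sub(" ", line).split()


def clean_corpus(src_path, dst_path):
    """Write one cleaned, space-separated line for every input line that keeps any token."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            tokens = clean_line(line)
            if tokens:
                dst.write(" ".join(tokens) + "\n")
```

Pure digits and Latin text disappear entirely because every character in them is removed. Word segmentation is a separate step and is not covered here.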

Model Training

  • cd JWE/src and compile the code with make all.
  • Run ./jwe to see the parameter details.
  • Run ./run.sh to start model training; you can modify the parameters in run.sh.
  • Input file formats: the corpus wiki.txt contains segmented Chinese words in UTF-8 encoding; the subcharacter file comp.txt contains a list of components separated by blank spaces; in char2comp.txt, each line consists of a Chinese character followed by its components, in the following format:
侩 亻 人 云
侨 亻 乔
侧 亻 贝 刂
侦 亻 卜 贝
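
To make the char2comp.txt layout concrete, here is a minimal Python sketch that reads lines in this format into a dictionary. The function name parse_char2comp is hypothetical and the training code in JWE/src does its own parsing; this only shows how the first field is the character and the remaining fields are its components.

```python
def parse_char2comp(lines):
    """Map each character to the list of components that follow it on its line."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and characters with no listed components
        mapping[fields[0]] = fields[1:]
    return mapping


sample = ["侩 亻 人 云", "侨 亻 乔", "侧 亻 贝 刂", "侦 亻 卜 贝"]
char2comp = parse_char2comp(sample)
print(char2comp["侨"])  # ['亻', '乔']
```

For a real file, pass an open file object instead of the sample list, for example parse_char2comp(open("char2comp.txt", encoding="utf-8")).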

Model Evaluation

The JWE/evaluation folder contains two Chinese word similarity datasets, 240.txt and 297.txt, and one Chinese analogy dataset, analogy.txt, all provided by Chen et al. (IJCAI, 2015).

cd JWE/src, then

  • Run python word_sim.py -s <similarity_file> -e <embed_file> for word similarity evaluation, where similarity_file is the word similarity file (e.g., 240.txt or 297.txt) and embed_file is the trained word embedding file.
  • Run python word_analogy.py -a <analogy_file> -e <embed_file> or ./word_analogy <embed_file> <analogy_file> for word analogy evaluation.
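
For readers unfamiliar with word similarity evaluation, the sketch below shows what such an evaluation typically computes: the Spearman correlation between human similarity scores and the cosine similarities of the corresponding embeddings. It is not the code of word_sim.py. The function names, the (word1, word2, score) tuple layout, and the in-memory embeddings dictionary are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(xs) < 2:
        return 0.0
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    scale = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return cov / scale if scale else 0.0


def evaluate_similarity(pairs, embeddings):
    """Score (word1, word2, human_score) pairs; pairs with a missing word are skipped."""
    gold, predicted = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            gold.append(score)
            predicted.append(cosine(embeddings[w1], embeddings[w2]))
    return spearman(gold, predicted)
```

Skipping pairs with out-of-vocabulary words is one common convention; other setups assign such pairs a default similarity instead, which changes the reported correlation.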
