
nwc-toolkit's Introduction

Project URL: http://code.google.com/p/nwc-toolkit/

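nwc-toolkit is software developed to build the [http://s-yata.jp/corpus/nwc2010/ Japanese Web Corpus]. It consists of tools that create a [http://s-yata.jp/corpus/nwc2010/texts/ text archive] from an [http://s-yata.jp/corpus/nwc2010/htmls/ HTML archive], and tools that create an [http://s-yata.jp/corpus/nwc2010/ngrams/ N-gram corpus] from word-segmented text. Because the tools are written in C++ with HTML archives larger than 1 TiB in mind, they run considerably faster than, for example, HTML parsers that rely heavily on regular expressions.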
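As a rough illustration of what building an N-gram corpus from word-segmented text involves, the following is a minimal sketch in C++. It is not taken from nwc-toolkit: the function name `CountNgrams`, its parameters, and the assumption that tokens are separated by whitespace are all hypothetical, and the actual tools may work quite differently (for example, to cope with data that does not fit in memory).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of nwc-toolkit.
// Counts every word N-gram of length 1..max_n in one line of
// whitespace-separated tokens and adds the counts to *counts.
void CountNgrams(const std::string &line, std::size_t max_n,
                 std::map<std::string, long> *counts) {
  assert(counts != nullptr);

  // Split the line into tokens on whitespace.
  std::istringstream stream(line);
  std::vector<std::string> tokens;
  for (std::string token; stream >> token; ) {
    tokens.push_back(token);
  }

  // For each start position, grow the N-gram one token at a time.
  for (std::size_t begin = 0; begin < tokens.size(); ++begin) {
    std::string ngram;
    for (std::size_t n = 1; n <= max_n && begin + n <= tokens.size(); ++n) {
      if (n > 1) {
        ngram += ' ';
      }
      ngram += tokens[begin + n - 1];
      ++(*counts)[ngram];
    }
  }
}
```

Calling `CountNgrams` once per input line with the same map accumulates frequencies across the whole text. Keeping everything in one in-memory `std::map` is only practical for small inputs; it is used here to keep the sketch short.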

== Documentation ==

 * [http://nwc-toolkit.googlecode.com/svn/trunk/docs/index.html nwc-toolkit (toolkit for the Japanese Web Corpus)]
  * Creating text archives
   * [http://nwc-toolkit.googlecode.com/svn/trunk/docs/tools/text-extractor.html nwc-toolkit-text-extractor (text extraction tool)]
    * [http://s-yata.jp/apps/nwc-toolkit/text-extractor Web service for the text extraction tool]
   * [http://nwc-toolkit.googlecode.com/svn/trunk/docs/tools/html-parser.html nwc-toolkit-html-parser (HTML parsing tool)]
   * [http://nwc-toolkit.googlecode.com/svn/trunk/docs/tools/unicode-normalizer.html nwc-toolkit-unicode-normalizer (Unicode normalization tool)]
   * [http://nwc-toolkit.googlecode.com/svn/trunk/docs/tools/text-filter.html nwc-toolkit-text-filter (simple sentence extraction tool)]
  * Creating N-gram corpora
   * [http://nwc-toolkit.googlecode.com/svn/trunk/docs/tools/ngram-counter.html nwc-toolkit-ngram-counter (N-gram frequency counting tool)]
   * [http://nwc-toolkit.googlecode.com/svn/trunk/docs/tools/ngram-merger.html nwc-toolkit-ngram-merger (N-gram corpus merging tool)]
  * Other
   * [http://nwc-toolkit.googlecode.com/svn/trunk/docs/tools/hash-calculator.html nwc-toolkit-hash-calculator (hash value calculation tool)]
   * [http://nwc-toolkit.googlecode.com/svn/trunk/docs/tools/duplicate-detector.html nwc-toolkit-duplicate-detector (duplicate detection tool)]
