
ldig's Introduction

ldig (Language Detection with Infinity Gram)

This is a prototype language detector for short messages (such as Twitter posts), with 99.1% accuracy across 17 languages.

Usage

  1. Extract the model directory: tar xf models/[select model archive]

  2. Run detection: ldig.py -m [model directory] [text data file]
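
The two steps above could be scripted roughly as follows. This is only a sketch: the helper names `extract_model` and `build_detect_command`, the default interpreter name `python`, and the example paths are assumptions for illustration and are not part of ldig. Only the `-m` option and the argument order come from the usage above.

```python
import subprocess
import tarfile


def extract_model(archive_path, dest_dir="."):
    """Equivalent of `tar xf <archive>`: unpack a model archive into dest_dir."""
    # Mode "r:*" lets tarfile detect the compression format on its own.
    with tarfile.open(archive_path, "r:*") as archive:
        archive.extractall(dest_dir)


def build_detect_command(model_dir, data_file, script="ldig.py", python="python"):
    """Build the argument list for `ldig.py -m [model directory] [text data file]`."""
    return [python, script, "-m", model_dir, data_file]


def run_detection(model_dir, data_file):
    """Run ldig on a data file and return its standard output as text."""
    command = build_detect_command(model_dir, data_file)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical paths; substitute the archive and data file you actually use.
    print(build_detect_command("model.small", "tweets.txt"))
```

Building the command as a list rather than a single string avoids shell quoting problems when a path contains spaces.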

Data format

In the input data, each tweet occupies one line of a text file, in the format below.

[label]\t[some metadata separated '\t']\t[text without '\t']

[label] is a language name such as en, de, fr and so on. Like the metadata, it is optional. (ldig doesn't use metadata and label for detection, of course :D)
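
As an illustration of this layout, here is a minimal sketch that writes and reads one input line. The helper names `to_ldig_line` and `parse_line` are hypothetical and not part of ldig, and the parser assumes every line carries at least a label field and a text field. How ldig itself treats lines with the label or metadata omitted is not covered here.

```python
def to_ldig_line(label, metadata, text):
    """Build one input line: [label]\t[metadata...]\t[text without tabs]."""
    # The text field must not contain tabs (or line breaks), so collapse
    # every run of whitespace into a single space.
    clean_text = " ".join(text.split())
    return "\t".join([label, *metadata, clean_text])


def parse_line(line):
    """Split a line into (label, metadata list, text).

    Assumption: the first field is the label and the last field is the text;
    everything in between is treated as metadata.
    """
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1:-1], fields[-1]


if __name__ == "__main__":
    line = to_ldig_line("en", ["123", "user"], "hello\tworld\n")
    print(repr(line))
    print(parse_line(line))
```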

The output data of ldig is in the format below.

[correct label]\t[detected label]\t[original metadata and text]
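
Given output in this format, accuracy can be estimated by comparing the first two fields of each line. The sketch below is not part of ldig; the function name `accuracy` is hypothetical, and skipping lines whose correct label is empty is an assumption about how unlabeled input appears in the output.

```python
def accuracy(lines):
    """Return the fraction of labeled lines whose detected label is correct.

    Each line is expected to look like:
    [correct label]\t[detected label]\t[original metadata and text]
    Returns None when no labeled line is found.
    """
    total = 0
    correct = 0
    for line in lines:
        # Split off only the first two fields; the rest may contain tabs.
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) < 2 or not fields[0]:
            continue  # assumption: no correct label means nothing to score
        total += 1
        if fields[0] == fields[1]:
            correct += 1
    return correct / total if total else None


if __name__ == "__main__":
    sample = ["en\ten\tgood morning", "de\tnl\tguten morgen"]
    print(accuracy(sample))
```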

Estimation Tool

ldig has an estimation tool.

./server.py -m [model directory]

Open http://localhost:48000 and enter the target text into the textarea. ldig then outputs the language probabilities and the feature parameters found in the text.

Supported Languages

  • cs Czech
  • da Danish
  • de German
  • en English
  • es Spanish
  • fi Finnish
  • fr French
  • id Indonesian
  • it Italian
  • nl Dutch
  • no Norwegian
  • pl Polish
  • pt Portuguese
  • ro Romanian
  • sv Swedish
  • tr Turkish
  • vi Vietnamese

Documents

Copyright & License

  • (c)2011-2012 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.
  • All code and resources are available under the MIT License.

ldig's People

Contributors

saffsd, shuyo


ldig's Issues

segfault

I get a segfault on this line:

sum_w = numpy.dot(param[events.keys(),].T, events.values())

I'm not sure why. The values look good.

Question about model

Hi. How did you build the model.latin.20120315 model? What input corpus was used to build it?
Thanks

How to train other language model

Hi
This is great work. But I do hope you can share how to build a model for other languages which are not covered by your trained model. Thanks

Confused by documentation

Hello,

I would like to ask how ldig can be used to detect the language of a short text. Is it possible to use it as a command line tool where I would load a model directory and pass some text to detect?

Something like this:
$ /c/Python27/python ldig.py -m models/ldig.model.small/ldig.model.small/model.small/ -t "Dneska mi je krasne"
and the program will return cs

I am confused by the use of the data file. I only need to detect languages for a list of texts; I don't need to train my own model or measure accuracy against labels.

Thank you.
