lemmatizer's Introduction

lemmatizer

Lemmatizer for text in English. Inspired by Python's nltk.corpus.reader.wordnet.morphy package.

Based on code posted by mtbr in his blog entry "WordNet-based lemmatizer".

Version 0.2 added the ability to supply user data at runtime.

Installation

sudo gem install lemmatizer

Usage

require "lemmatizer"
  
lem = Lemmatizer.new
  
p lem.lemma("dogs",    :noun ) # => "dog"
p lem.lemma("hired",   :verb ) # => "hire"
p lem.lemma("hotter",  :adj  ) # => "hot"
p lem.lemma("better",  :adv  ) # => "well"
  
# When no part-of-speech symbol is given as the second argument,
# lemmatizer tries :verb, :noun, :adj, and :adv, in this order.
p lem.lemma("fired")           # => "fire"
p lem.lemma("slow")            # => "slow"

Limitations

# Lemmatizer leaves alone words that its dictionary does not contain.
# This keeps proper names such as "James" intact.
p lem.lemma("MacBooks", :noun) # => "MacBooks" 
  
# If an inflected form is included as a lemma in the word index,
# lemmatizer may not give an expected result.
p lem.lemma("higher", :adj) # => "higher" not "high"!

# This happens because "higher" is itself an entry word listed in dict/index.adj.
# To fix this, either modify the original dict directly (lib/dict/index.{noun|verb|adj|adv})
# or supply custom dict files (recommended).

Supplying user dictionaries

# You can supply custom dict files consisting of lines in the format <pos>\s+<form>\s+<lemma>.
# The data in user-supplied files overrides the preset data. Here is a sample.

# --- sample.dict1.txt (don't include hash symbol on the left) ---
# adj   higher   high
# adj   highest  high
# noun  MacBooks MacBook
# ---------------------------------------------------------------

lem = Lemmatizer.new("sample.dict1.txt")

p lem.lemma("higher", :adj)     # => "high"
p lem.lemma("highest", :adj)    # => "high"
p lem.lemma("MacBooks", :noun)  # => "MacBook"

# The argument to Lemmatizer.new can be either of the following:
# 1) a path string to a dict file (e.g. "/path/to/dict.txt")
# 2) an array of paths to dict files (e.g. ["./dict/noun.txt", "./dict/verb.txt"])

Resolving abbreviations

# You can use 'abbr' tag in user dicts to resolve abbreviations in text.

# --- sample.dict2.txt (don't include hash symbol on the left) ---
# abbr  utexas   University of Texas
# abbr  mit      Massachusetts Institute of Technology
# ---------------------------------------------------------------

# <NOTE>
# 1. Expressions on the right (substitutes) can contain white space,
#    while expressions in the middle (words to be replaced) cannot.
# 2. Double or single quotation marks can be used around substitute expressions,
#    but not around original expressions.

lem = Lemmatizer.new("sample.dict2.txt")

p lem.lemma("utexas", :abbr) # => "University of Texas"
p lem.lemma("mit", :abbr)    # => "Massachusetts Institute of Technology"
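
The line format described above (a tag, a form without white space, then a substitute that may contain white space and optional quotes) can be illustrated with a short sketch. This is not the gem's own parser; `parse_dict_line` is a hypothetical helper written only to show how such a line might be read, assuming a split limited to three fields.

```ruby
# Hypothetical helper, NOT part of the lemmatizer gem: read one user-dict line
# laid out as <pos> <form> <lemma>, where only <lemma> may contain white space.
def parse_dict_line(line)
  # Limiting the split to 3 fields keeps any white space inside the third field.
  pos, form, lemma = line.strip.split(/\s+/, 3)
  return nil if lemma.nil? # blank or incomplete line

  # Drop one pair of matching single or double quotes around the substitute.
  lemma = lemma[1..-2] if lemma =~ /\A(["']).*\1\z/
  [pos.to_sym, form, lemma]
end

p parse_dict_line("abbr  utexas   University of Texas")
# => [:abbr, "utexas", "University of Texas"]
```

Capping the split at three fields is what lets the substitute keep its spaces; an uncapped split would break "University of Texas" into separate tokens.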

Author

Thanks for assistance and contributions:

License

Licensed under the MIT license.

lemmatizer's People

Contributors

dankimio, jywarren, mifrill, t-cool, vividness, yohasebe

lemmatizer's Issues

The exceptions collection does not include the full list

The assignment w, s = line.split(/\s+/) keeps only the first two fields of a line, even when the line contains three:

open_file(exc) do |io|
  io.each_line do |line|
    w, s = line.split(/\s+/)
    @exceptions[pos][w] ||= []
    @exceptions[pos][w] << s
  end
end

For example:

zamindaris zamindari zemindari

      open_file(exc) do |io|
        io.each_line do |line|
          w, s = line.split(/\s+/)
          if line =~ /zamin/
            puts line
            puts w
            puts s
          end
          @exceptions[pos][w] ||= []
          @exceptions[pos][w] << s
        end
      end

# => 
# zamindaris zamindari zemindari
# zamindaris
# zamindari

The word zemindari is never stored. Is this a bug?
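
One possible way to keep every field, sketched here with a plain local hash rather than the gem's @exceptions structure (the names `exceptions`, `form`, and `lemmas` are assumptions for illustration), is to capture the remaining fields with a splat:

```ruby
# Stand-in for the per-POS exceptions table; each form maps to a list of lemmas.
exceptions = Hash.new { |hash, key| hash[key] = [] }

line = "zamindaris zamindari zemindari"

# The splat collects every field after the first, not just the second one.
form, *lemmas = line.split
exceptions[form].concat(lemmas)

p exceptions["zamindaris"] # => ["zamindari", "zemindari"]
```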

Turning off logging?

Hi, we're still getting quite a few log statements here:

https://travis-ci.org/publiclab/plots2/builds/525101925

I am wondering how we could turn these off, or suppress them?

Thanks so much for this library and for your responsiveness to our input and questions! We are truly grateful!

Note, this is the kind of output we see:

"awesome"
2
"spectrometer"
1
"activity:spectrometer"
1
"spam"
1
"chapter"
1
"question:spectrometer"
4
"lat:71.0"
1
"lon:52.0"
1
"blog"
1

Adding custom dictionaries

Is there a means to point to additional dictionaries supplied at initialization, like Lemmatizer.new(['path/to/dict']) or Lemmatizer.new(dictHash)?

How about the ability to detect part of speech?

Hi, developer.

Very helpful gem, thanks!

I just finished something like a wordextractor gem with some features, and I was playing around with the idea of detecting phrasal verbs in text by analyzing templates such as verb + preposition and verb + pronoun + preposition.

I thought of adding a function to detect part of speech.

def pos(w)
  a = []
  [:noun, :verb, :adj, :adv].each { |p| a << p if lemma(w, p) }
  a.join(', ')
end

But in the gem documentation and in the source I see that lemma always returns the input parameter when nothing is found, so the check above is always true.

p lem.lemma("at", :noun) #=> at
p lem.lemma("in", :noun) #=> in

Can you provide something like a pos(word) method that returns all parts of speech for a word?
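
Since lemma returns a string either way, a truthiness test cannot tell a hit from a miss. A rough workaround, using only the lemma(word, pos) call shown in the README, is to compare the result with the input. `changed_pos` below is a hypothetical helper, not a gem method, and it is only a heuristic: it misses words that are already base forms (for example "dog" as a noun), so it is not a real part-of-speech detector.

```ruby
# Hypothetical helper, NOT part of the lemmatizer gem: list the parts of speech
# for which the lemmatizer changes the word. `lemmatizer` is any object that
# responds to lemma(word, pos), as Lemmatizer.new does in the README usage.
def changed_pos(lemmatizer, word, tags = %i[noun verb adj adv])
  # A result equal to the input means "not found" or "already a base form";
  # the two cases cannot be told apart from the return value alone.
  tags.select { |tag| lemmatizer.lemma(word, tag) != word }
end
```

With the gem installed, this could be called as changed_pos(Lemmatizer.new, "hired").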

spaces in dict entries?

We're adding custom dictionaries as in #5, but weren't sure if we could put a phrase into quotation marks so that two words can be shown to be associated, like:

h2s "Hydrogen Sulphide"
data-logging "data logging"
cafo "factory farm"

We're using this for synonyms, which I know isn't the original use case, but curious if the gem supports this syntax, or could be made to? Thank you!

Performance optimization: replacing 'split' with 'partition'

Where only the first field of a line is needed, split can be replaced with partition to speed up processing.

w = line.split(/\s+/)[0]

require 'benchmark'

Benchmark.bm do |x|
  string = "'hood n 1 2 @ ; 1 0 08641944"
  x.report { 50000.times { string.split(/\s+/)[0] } }
  x.report { 50000.times { string.partition(/\s+/)[0] } }
end
[
 #<Benchmark::Tms:0x0000560ef6b6cc30 @label="", @real=0.18285021604970098, @cstime=0.0, @cutime=0.0, @stime=2.3999999999996247e-05, @utime=0.18282500000000024, @total=0.18284900000000023>, 
 #<Benchmark::Tms:0x0000560ef6b864a0 @label="", @real=0.05183851800393313, @cstime=0.0, @cutime=0.0, @stime=1.2000000000012001e-05, @utime=0.0518200000000002, @total=0.05183200000000021>
]

The drop in total time from 0.18284900000000023 to 0.05183200000000021 looks nice. @yohasebe, what do you think?

Source:
https://stackoverflow.com/questions/7533479/ruby-string-search-which-is-faster-split-or-regex
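
Before swapping one call for the other, it may be worth confirming that both return the same first field. The sketch below checks that on a few sample lines (the `samples` array is made up for illustration, apart from the benchmark string above) and shows the one difference I am aware of, the empty string:

```ruby
samples = [
  "'hood n 1 2 @ ; 1 0 08641944",
  "zamindaris zamindari zemindari",
  "single"
]

samples.each do |line|
  via_split     = line.split(/\s+/)[0]
  via_partition = line.partition(/\s+/)[0]
  raise "mismatch for #{line.inspect}" unless via_split == via_partition
end

# Edge case: on an empty line, split yields no fields at all,
# while partition always returns three strings.
p "".split(/\s+/)[0]     # => nil
p "".partition(/\s+/)[0] # => ""
```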
