seamusabshere / fuzzy_match

Find a needle (a document or record) in a haystack using string similarity and (optionally) regular expression rules. Uses Dice's Coefficient (aka Pair Similarity) and Levenshtein Distance internally.

License: MIT License

Ruby 100.00%

fuzzy_match's Introduction

Top 3 reasons you should use FuzzyMatch

intelligent defaults: it uses a combination of Pair Distance (2-gram) and Levenshtein Edit Distance to effectively match many examples with no configuration
all-vs-all: it takes care of finding the optimal match by comparing everything against everything else (when that's necessary)
refinable: you might get to 90% with no configuration, but if you need to go beyond you can use regexps, grouping, and stop words

It solves many mid-range matching problems. If your haystack is ~10k records, or if you can winnow down the initial possibilities at the database level and only bring good contenders into app memory, why not give it a shot?

FuzzyMatch

Find a needle in a haystack based on string similarity and regular expression rules.

Replaces loose_tight_dictionary because that was a confusing name.

Warning! normalizers are gone in version 2 and above! See the CHANGELOG and check out enhanced (and hopefully more intuitive) groupings.

Quickstart

>> require 'fuzzy_match'
=> true
>> FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus')
=> "seamus"

See also the blog post Fuzzy match in Ruby.

Default matching (string similarity)

At the core, and even if you configure nothing else, string similarity (calculated by "pair distance" aka Dice's Coefficient) is used to compare records.

You can tell FuzzyMatch what field or method to use via the :read option... for example, let's say you want to match a Country object like #<Country name:"Uruguay" iso_3166_code:"UY">

>> fz = FuzzyMatch.new(Country.all, :read => :name)
=> #<FuzzyMatch: [...]>
>> fz.find('youruguay')
=> #<Country name:"Uruguay" iso_3166_code:"UY">
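To make the :read option concrete, here is a minimal sketch of how a record could be turned into the string that gets compared. The Country struct and the read_record helper are assumptions made up for illustration; they are not part of the fuzzy_match API.

```ruby
# Stand-in for the Country model mentioned above (an assumption for illustration).
Country = Struct.new(:name, :iso_3166_code)

# Hypothetical helper: apply a :read option (a Proc or a method name) to a record.
def read_record(record, read)
  read.respond_to?(:call) ? read.call(record) : record.public_send(read)
end

uruguay = Country.new('Uruguay', 'UY')

by_symbol = read_record(uruguay, :name)
# => "Uruguay"
by_proc = read_record(uruguay, ->(c) { "#{c.name} (#{c.iso_3166_code})" })
# => "Uruguay (UY)"
```

A Proc is handy when the text you want to match on is assembled from several fields.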

Optional rules (regular expressions)

You can improve the default matchings with rules. There are 3 different kinds of rules. Each rule is a regular expression.

We suggest that you first try without any rules and only define them to improve matching, prevent false positives, etc.

Groupings

Group records together. The two laws of groupings:

If a needle matches a grouping, only compare it with straws in the same grouping; (the "buddies vs buddies" rule)
If a needle doesn't match any grouping, only compare it with straws that also don't match ANY grouping (the "misfits vs misfits" rule)

The two laws of chained groupings: (new in v2.0 and rather important)

Sub-groupings (e.g., /plaza/i below) only match if their primary (e.g., /ramada/i) does
In final grouping decisions, sub-groupings win over primaries (so "Ramada Inn" is NOT grouped with "Ramada Plaza", but if you removed /plaza/i sub-grouping, then they would be grouped together)

Hopefully they are rather intuitive once you start using them.

That will...

separate "Orient Express Hotel" and "Ramada Conference Center Mandarin" from real Mandarin Oriental hotels
keep "Trump Hotel Collection" away from "Luxury Collection" (another real hotel brand) without messing with the word "Luxury"
make sure that "Ramada Plaza" hotels are always grouped with other RPs (and not with plain old Ramadas) and vice versa
split out Hyatts into their different brands
and more

You specify chained groupings as arrays of regexps:

groupings = [
  /mandarin/i,
  /trump/i,
  [ /ramada/i, /plaza/i ],
  ...
]
fz = FuzzyMatch.new(haystack, groupings: groupings)

This way of specifying groupings is meant to be easy to load from a CSV, like bin/fuzzy_match does.

Formerly called "blockings," but that was jargon that confused people.
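The grouping laws can be sketched in plain Ruby. This is only an illustration of the rules as described in this section, under simplifying assumptions: grouping_key and comparable? are hypothetical helpers (not the gem's API), and a string that matches several groupings is simply assigned to the first one.

```ruby
# Hypothetical helper: returns a key naming the grouping a string falls into,
# or nil when the string is a "misfit" that matches no grouping.
def grouping_key(str, groupings)
  groupings.each_with_index do |grouping, i|
    primary, *subs = Array(grouping)
    next unless str =~ primary
    # A sub-grouping is only consulted once its primary has matched,
    # and it wins over the primary when it matches too.
    sub_index = subs.index { |sub| str =~ sub }
    return sub_index ? [i, sub_index] : [i, :primary]
  end
  nil
end

# Buddies vs buddies, misfits vs misfits: only compare strings with equal keys.
def comparable?(needle, straw, groupings)
  grouping_key(needle, groupings) == grouping_key(straw, groupings)
end

groupings = [/mandarin/i, /trump/i, [/ramada/i, /plaza/i]]

comparable?('Ramada Plaza Hotel', 'RAMADA PLAZA Downtown', groupings)  # => true
comparable?('Ramada Inn', 'Ramada Plaza', groupings)                   # => false
comparable?('Trump Hotel Collection', 'Luxury Collection', groupings)  # => false
comparable?("Joe's Motel", "Bob's Inn", groupings)                     # => true
```

Only pairs that pass this check would go on to be scored by string similarity.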

Identities

Prevent impossible matches. Can be very confusing—see if you can make things work with groupings first.

Adding an identity like /(f)-?(\d50)/i ensures that "Ford F-150" and "Ford F-250" never match.

Note that identities do not establish certainty. They just say whether two records could be identical... then string similarity takes over.
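One way to picture an identity, as a sketch under assumptions rather than the gem's actual code: when the regexp matches both strings, their captures must agree; when it fails to match either one, the rule has no opinion. The identity_verdict helper below is hypothetical.

```ruby
# Hypothetical helper (not the gem's API): returns true or false when the
# identity applies to both strings, and nil when it has no opinion.
def identity_verdict(identity, a, b)
  match_a = identity.match(a)
  match_b = identity.match(b)
  return nil unless match_a && match_b

  normalize = ->(match) { match.captures.map { |c| c.to_s.downcase } }
  normalize.call(match_a) == normalize.call(match_b)
end

f_series = /(f)-?(\d50)/i

identity_verdict(f_series, 'Ford F-150', 'Ford F-250') # => false, can never match
identity_verdict(f_series, 'Ford F-150', 'FORD F150')  # => true, could be identical
identity_verdict(f_series, 'Ford F-150', 'Ford Focus') # => nil, rule does not apply
```

A true verdict only means the pair is still a candidate; string similarity decides the rest.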

Stop words

Ignore common and/or meaningless words when doing string similarity.

Adding a stop word like THE ensures that it is not taken into account when comparing "THE CAT", "THE DAT", and "THE CATT"

Stop words are NOT removed when checking :must_match_at_least_one_word and when doing identities and groupings.
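As a rough sketch of the idea (the strip_stop_words helper is hypothetical, and the word-boundary regexp is an assumption about how you might write the rule), stop words are dropped before similarity is scored, so only the remaining text is compared:

```ruby
# Hypothetical helper: remove every stop word and tidy up leftover whitespace.
def strip_stop_words(str, stop_words)
  stop_words.inject(str) { |memo, stop_word| memo.gsub(stop_word, ' ') }.split.join(' ')
end

stop_words = [/\bthe\b/i]

stripped = ['THE CAT', 'THE DAT', 'THE CATT'].map { |s| strip_stop_words(s, stop_words) }
# => ["CAT", "DAT", "CATT"]
```

With THE out of the way, the shared prefix no longer inflates the similarity of all three strings.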

Find options

read: how to interpret each record in the 'haystack', either a Proc or a symbol
must_match_grouping: don't return a match unless the needle fits into one of the groupings you specified
must_match_at_least_one_word: don't return a match unless the needle shares at least one word with the match. Note that "Foo's" is treated like one word (so that it won't match "'s") and "Bolivia," is treated as just "bolivia"
gather_last_result: enable last_result
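The word handling described for must_match_at_least_one_word can be approximated as follows. The tokenizing regexp and both helper names are assumptions chosen to reproduce the two examples above ("Foo's" and "Bolivia,"); the gem's own tokenizer may differ.

```ruby
# Hypothetical tokenizer: downcase, keep apostrophes inside a word, drop punctuation.
def words(str)
  str.downcase.scan(/[a-z0-9]+(?:'[a-z0-9]+)*/)
end

# Hypothetical check: do the two strings have at least one word in common?
def shares_word?(a, b)
  (words(a) & words(b)).any?
end

words("Foo's")                                 # => ["foo's"]
words('Bolivia,')                              # => ["bolivia"]
shares_word?("Foo's bar", "'s")                # => false
shares_word?('Bolivia, Plurinational', 'bolivia') # => true
```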

Case sensitivity

String similarity is case-insensitive. Everything is downcased before scoring. This is a change from previous versions.

Be careful with uppercase letters in your rules; in general, things are downcased before comparing.
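A plain-Ruby illustration of why uppercase letters in a rule can silently stop matching once the text has been downcased (this shows ordinary Regexp behavior, not fuzzy_match internals):

```ruby
downcased = 'FORD F-150'.downcase # "ford f-150"

uppercase_rule   = (downcased =~ /F-150/)  # => nil, the uppercase F no longer matches
lowercase_rule   = (downcased =~ /f-150/)  # => 5
insensitive_rule = (downcased =~ /F-150/i) # => 5, /i makes the rule case-insensitive
```

Adding the /i flag to every rule is the simplest way to avoid the problem.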

String similarity algorithm

The algorithm is Dice's Coefficient (aka Pair Distance) because it seemed to work better than Longest Substring, Hamming, Jaro Winkler, Levenshtein (although see edge case below) etc.

Here's a great explanation copied from the wikipedia entry:

to calculate the similarity between:

    night
    nacht

We would find the set of bigrams in each word:

    {ni,ig,gh,ht}
    {na,ac,ch,ht}

Each set has four elements, and the intersection of these two sets has only one element: ht.

Inserting these numbers into the formula, we calculate, s = (2 · 1) / (4 + 4) = 0.25.
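The same calculation as a minimal Ruby sketch. The method names are illustrative, and this is not the gem's internal implementation:

```ruby
# Split a string into its set of adjacent character pairs (bigrams).
def bigrams(str)
  str.downcase.each_char.each_cons(2).map(&:join).uniq
end

# Dice's Coefficient: twice the number of shared bigrams over the total bigram count.
def dice_coefficient(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  return 0.0 if pairs_a.empty? || pairs_b.empty?

  2.0 * (pairs_a & pairs_b).size / (pairs_a.size + pairs_b.size)
end

bigrams('night')                   # => ["ni", "ig", "gh", "ht"]
dice_coefficient('night', 'nacht') # => 0.25
```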

Edge case: when Dice's fails, use Levenshtein

In edge cases where Dice's finds that two strings are equally similar to a third string, then Levenshtein distance is used. For example, pair distance considers "RATZ" and "CATZ" to be equally similar to "RITZ" so we invoke Levenshtein.

>> 'RITZ'.pair_distance_similar 'RATZ'
=> 0.3333333333333333 
>> 'RITZ'.pair_distance_similar 'CATZ'
=> 0.3333333333333333                   # pair distance can't tell the difference, so we fall back to levenshtein...
>> 'RITZ'.levenshtein_similar 'RATZ'
=> 0.75 
>> 'RITZ'.levenshtein_similar 'CATZ'
=> 0.5                                  # which properly shows that RATZ should win
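The tie-break can be sketched in plain Ruby. This assumes Levenshtein similarity is normalized as 1 - distance / longer length, which reproduces the figures above; all method names are illustrative rather than the gem's API.

```ruby
# Dice's Coefficient over character bigrams.
def pair_similarity(a, b)
  pairs = [a, b].map { |s| s.downcase.each_char.each_cons(2).map(&:join).uniq }
  return 0.0 if pairs.any?(&:empty?)

  2.0 * (pairs[0] & pairs[1]).size / (pairs[0].size + pairs[1].size)
end

# Classic dynamic-programming Levenshtein edit distance.
def levenshtein_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
def levenshtein_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein_distance(a, b).to_f / longest
end

needle = 'RITZ'
# Rank by pair similarity first; Levenshtein only matters when that is tied.
winner = %w[CATZ RATZ].max_by do |straw|
  [pair_similarity(needle, straw), levenshtein_similarity(needle, straw)]
end
# => "RATZ"
```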

Cached results

Make sure you add active_record_inline_schema to your Gemfile.

TODO write documentation. For now, please see how we manually cache matches between aircraft and flight segments.

Glossary

The admittedly imperfect metaphor is "look for a needle in a haystack"

needle: the search term
haystack: the records you are searching (your result will be an object from here)

Using amatch to make it faster

You can optionally use amatch by Florian Frank (thanks Flori!) to make string similarity calculations in a C extension.

require 'fuzzy_match'
require 'amatch' # note that you have to require this... fuzzy_match won't require it for you
FuzzyMatch.engine = :amatch

Otherwise, pure ruby versions of the string similarity algorithms derived from the answer to a StackOverflow question and the text gem are used. Thanks marzagao and threedaymonk!

Real-world usage

We use fuzzy_match for data science at Brighter Planet and in production at

We often combine it with remote_table and errata:

download table with remote_table
correct serious or repeated errors with errata
fuzzy_match the rest

Contributors

Seamus Abshere
Ian Hough
Andy Rossmeissl
Luke Rodgers @lukeasrodgers

Copyright

fuzzy_match's Issues

Licence missing in the rubygems version and in the gemspec

The "fuzzy_match" gem seems not to have a license at all. Unless a license that specifies otherwise is included, nobody else can use, copy, distribute, or modify that library without being at risk of take-downs, shake-downs, or litigation.

I know that this gem has a license on GitHub; however, it's missing from RubyGems and from the gemspec, so in the context of my work I cannot use this lib 👍

Would you mind adding a license in the gemspec as well? Thank you!

ref https://rubygems.org/gems/fuzzy_match

Match only beginning of words needed

Hi,

We've recently used fuzzy_match and found out that it produces some pretty weird name matches. Instead of matching "art" to "Artem", it chooses "Karl".

Could you please add an option "Match only from the beginning of words", so "art" would prefer "artem" to "karl" ?

Thank you.

Specs failing

When I run bundle exec rspec I get the following failures:

Failures:

  1) FuzzyMatch::CachedResult works with weighted_average
     Failure/Error: aircraft.flight_segments.weighted_average(:seats, :weighted_by => :passengers).should == 5.45455
     ActiveRecord::StatementInvalid:
       Mysql2::Error: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ' `flight_segments`, 'passengers', `flight_segments`.`passengers`, 0, `flight_seg' at line 1: SELECT SUM('flight_segments', `flight_segments`, 'passengers', `flight_segments`.`passengers`, 0, `flight_segments`, 'seats', `flight_segments`.`seats`, `flight_segments`.`passengers` * `flight_segments`.`seats`, 1.0, `flight_segments`.`passengers` * `flight_segments`.`seats` * 1.0) / SUM(`flight_segments`.`passengers`) FROM `flight_segments`  WHERE `flight_segments`.`aircraft_description` IN ('bing 747', 'BORING 747200') AND `flight_segments`.`seats` IS NOT NULL AND `flight_segments`.`passengers` > 0
     # ./spec/cache_spec.rb:115:in `block (2 levels) in <top (required)>'

  2) FuzzyMatch::Rule::Grouping has chains
     Failure/Error: h.xjoin?(r('grund hyatt'), r('grand hyatt')).should == true
       expected: true
            got: false (using ==)
     # ./spec/grouping_spec.rb:33:in `block (2 levels) in <top (required)>'

Running on OSX 10.9.3, ruby 1.9.3-p392, and mysql Ver 14.14 Distrib 5.6.16, for osx10.9.

For what it's worth, I get the same failures when trying the specs with sqlite3.

Object caching with mongoid

Hey guys,
Can you think of any way to disable following caching?

matcher = FuzzyMatch.new(Project.all, read: :name)
 => #<FuzzyMatch:0x00000004bdf6d8 @read=:name, @groupings=[], @identities=[], @stop_words=[], @default_options={:must_match_grouping=>false, :must_match_at_least_one_word=>false, :gather_last_result=>false, :find_all=>false, :find_all_with_score=>false, :threshold=>0, :find_best=>false, :find_with_score=>true}, @haystack=[w("project")]>

matcher.find('project').first
 => #<Project _id: 5217d95177796b80510000c7, name: "project", url: "">

Project.find_by(name: 'project').update_attribute(:url, 'test.dev')
 => true

matcher.find('project').first.url
 => ""

One character word matching score is ignored in some cases

Reproducible example

names = [
"Apple iPhone",
"Apple iPhone 8",
]
matcher = FuzzyMatch.new(names)

matcher.find_all_with_score("Apple 8")
=>
[["Apple iPhone", 0.6153846153846154, 0.5],
 ["Apple iPhone 8", 0.6153846153846154, 0.5]]

Expected behavior

Apple iPhone 8 to have a higher match score than Apple iPhone when matched against Apple 8.
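One reading that is consistent with the reported scores (an inference from the numbers, not something confirmed by the maintainers) is that bigrams are taken within each word, so a one-character word such as "8" contributes no pairs. A minimal sketch with hypothetical helper names:

```ruby
# Bigrams taken within each whitespace-separated word; a one-character
# word has no adjacent pairs and so adds nothing to the set.
def word_bigrams(str)
  str.downcase.split.flat_map { |word| word.each_char.each_cons(2).map(&:join) }.uniq
end

def word_dice(a, b)
  pairs_a = word_bigrams(a)
  pairs_b = word_bigrams(b)
  return 0.0 if pairs_a.empty? || pairs_b.empty?

  2.0 * (pairs_a & pairs_b).size / (pairs_a.size + pairs_b.size)
end

word_bigrams('8') # => []

plain   = word_dice('Apple 8', 'Apple iPhone')   # 2 * 4 / (4 + 9), roughly 0.615
with_8  = word_dice('Apple 8', 'Apple iPhone 8') # the same value, since "8" adds no pairs
```

Under that assumption the two candidates are indistinguishable to pair distance, which would explain the identical first scores.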

Match threshold

Could you add an optional parameter to find that allows you to set a match threshold? I am looking for matches that are only really close vs. getting matches that are .2 or .3 in similarity. By setting the threshold I could eliminate anything that isn't almost exactly a match.

how to match all-vs-all?

Given FuzzyMatch.new(['a', 'b', 'c'])
How do I match between all of them?

By default I get a 0.2 score match

Hello!

I'm trying to use this gem to find similar car makes, but I'm getting a result I don't understand:

If I have the make Citroen and I look for the make "Volkswagen" I get the "Citroen" make:

FuzzyMatch.new(['Citroen']).find("Volkswagen") -> Citroen
FuzzyMatch.new(['Citroen']).find_with_score("Volkswagen") -> ["Citroen", 0.13333333333333333, 0.19999999999999996]

How is this possible? Am I doing something wrong? How can I change the threshold?

Thanks!

Undefined method `map`

I am suddenly getting the following error when I attempt to create a new FuzzyMatch object with a global map of users as the object to search.

`initialize': undefined method `map' for nil:NilClass (NoMethodError)
Did you mean?  tap

To create the new FuzzyMatch object I use the example:

  def search_users(user)
    fz = FuzzyMatch.new(@user)
    project = fz.find(user)
    return user['id'], user['name']
  end

However the error only occurs after running the same search twice. If I search for user 'john' and then in the same query later search for a user 'john' it throws the error.

I am using Sinatra Version 3.6.2 and Ruby 2.3.1-p112

Threshold Ignored For 1 and 2 letter search terms.

First of all, thanks for a great gem! However, when I search like this:

FuzzyMatch.new(Organization.where("is_counsel = false"), :read => :name, find_all_with_score: true, threshold: 0.5).find("li")

I'll get back results like this.

[#<Organization:0x007fb000d15530
id: 67,
name: "Little Inc",
created_at: Wed, 14 Dec 2016 02:15:00 UTC +00:00,
updated_at: Wed, 14 Dec 2016 02:15:00 UTC +00:00,
is_counsel: false>,
0.25,
0.19999999999999996],
[#<Organization:0x007fb000cf4998
id: 95,
name: "Lindgren Inc",
created_at: Wed, 14 Dec 2016 02:15:00 UTC +00:00,
updated_at: Wed, 14 Dec 2016 02:15:00 UTC +00:00,
is_counsel: false>,
0.2,
0.16666666666666663],
[#<Organization:0x007fb000cd6128
id: 26,
name: "Grant-Little",
created_at: Wed, 14 Dec 2016 02:14:59 UTC +00:00,
updated_at: Wed, 14 Dec 2016 02:14:59 UTC +00:00,
is_counsel: false>,
0.16666666666666666,
0.16666666666666663],
[#<Organization:0x007fb000cd50c0
id: 36,
name: "Kessler-Kling",
created_at: Wed, 14 Dec 2016 02:14:59 UTC +00:00,
updated_at: Wed, 14 Dec 2016 02:14:59 UTC +00:00,
is_counsel: false>,
0.15384615384615385,
0.15384615384615385],

Obviously, the scores fall well below the threshold. I can get around it by filtering afterward with select or something, but that's expensive, and it would be nice if I didn't have to do this.
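For reference, the select workaround mentioned above might look like the sketch below. Plain names stand in for the Organization records, the scores are copied from the output above, and the first score in each triple is assumed to be the pair-distance score.

```ruby
# Each entry mirrors the [record, first_score, second_score] triples shown above.
scored = [
  ['Little Inc',    0.25,                0.19999999999999996],
  ['Lindgren Inc',  0.2,                 0.16666666666666663],
  ['Grant-Little',  0.16666666666666666, 0.16666666666666663],
  ['Kessler-Kling', 0.15384615384615385, 0.15384615384615385]
]

threshold = 0.5

# Keep only the results whose first score reaches the threshold.
close_enough = scored.select { |_record, first_score, _second_score| first_score >= threshold }
# => [] because none of the four results reaches 0.5

# A looser cut-off of 0.2 keeps the first two names.
looser = scored.select { |_record, first_score, _second_score| first_score >= 0.2 }
                .map(&:first)
# => ["Little Inc", "Lindgren Inc"]
```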

Best way to search over multiple active record models?

I have a search box in my webapp which fuzzy-searches over several different ActiveRecord models (e.g. Company, Person).

How would you recommend providing the top 10 results from across all sources?

ruby 1.8.7 compatibility

this line uses new hash syntax:

https://github.com/seamusabshere/fuzzy_match/blob/master/lib/fuzzy_match.rb#L143

RubyNLP

Dear Seamus,

I've recently added your project to our RubyNLP list: https://github.com/arbox/nlp-with-ruby

I wonder if you want to participate in the Ruby for NLP network. You could do this in a very simple step by adding the rubynlp topic to your GitHub repository.

Thank you for the project!

find similar?

how can I get all that match say "[email protected]"?

In a test case I also have "[email protected]" and I would like to find all. All I am getting back are exact matches.

Feature request - returning array of multiple possible matches

First, thanks so much for the helpful module - it's been incredibly useful on past applications. Currently though, I have a need to do some fuzzy matching with multiple possible results. It would be great to add that functionality to this module (with an optional "threshold" parameter).

Example use case - say a customer searches for a "shover", you could return valid options of "shovel" or "shaver" and allow him/her to specify what was meant.

I know Sphinx and other search options exist which do this, don't want to go the full route of new search functionality for something so close to what this module already does.

Curious about Similarity#satisfy? logic

It seems like issue #4 has been resolved, but I'm wondering about the logic at line https://github.com/seamusabshere/fuzzy_match/blob/master/lib/fuzzy_match/similarity.rb#L27 which seems to undermine some of the usefulness of threshold.

My use case is basically this: searching for near-matches in a list of tweets. Typically, the one I'm looking for will either be a perfect match, or be off by 5-12 characters, localized to one particular word (a URL, to be specific).

Adjusting the values of threshold has gotten me good results, but only if I remove the aforementioned line, which results in matches that are, for my needs, false positives.

I don't see any specs that make use of threshold, so I'm not entirely clear on what use cases (needle.words & record2.words).any? is supporting.

In any case, thanks for the helpful gem.

Is there any python implementation of this library? Please let me know.

pair_distance_similar and levenshtein_similar examples don't work

You mention similarity example, like:

require "fuzzy_match"
FuzzyMatch.engine = :amatch
'RITZ'.pair_distance_similar 'RATZ'
'RITZ'.levenshtein_similar 'RATZ'

But that does not work. Can you explain how you calculate similarity between strings?

Crash using amatch

Hi,

using the amatch engine significantly speeds up my matching.

Unfortunately it is not stable (at least for me).

I often get this crash when using amatch:

.../ruby/2.1.0/gems/fuzzy_match-2.1.0/lib/fuzzy_match/score/amatch.rb:12: [BUG] Segmentation fault at 0x00000000000008

This part of the crash log might be of some interest … I don't know:

-- C level backtrace information -------------------------------------------
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1e29ec) [0x7f24462279ec] vm_dump.c:690
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x77533) [0x7f24460bc533] error.c:312
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(rb_bug+0xb3) [0x7f24460bd183] error.c:339
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x15a2ef) [0x7f244619f2ef] signal.c:812
/lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f2445cb6d40]
/var/www/my-project/shared/bundle/ruby/2.1.0/gems/amatch-0.3.0/lib/amatch_ext.so(pair_array_match+0x4) [0x7f243ec70ba4] pair.c:46
/var/www/my-project/shared/bundle/ruby/2.1.0/gems/amatch-0.3.0/lib/amatch_ext.so(+0x66e5) [0x7f243ec6d6e5] amatch_ext.c:500
/var/www/my-project/shared/bundle/ruby/2.1.0/gems/amatch-0.3.0/lib/amatch_ext.so(+0x8396) [0x7f243ec6f396] amatch_ext.c:1142
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1cbd04) [0x7f2446210d04] vm_insnhelper.c:1489
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d09c4) [0x7f24462159c4] insns.def:1028
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d462f) [0x7f244621962f] vm.c:1398
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d5d34) [0x7f244621ad34] vm_eval.c:171
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d6a7b) [0x7f244621ba7b] vm_eval.c:50
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x301dd) [0x7f24460751dd] array.c:2396
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(ruby_qsort+0x294) [0x7f24461e3964] util.c:420
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(rb_ary_sort_bang+0xf3) [0x7f244607b483] array.c:2438
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(rb_ary_sort+0x11) [0x7f244607f1d1] array.c:2510
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1cbd04) [0x7f2446210d04] vm_insnhelper.c:1489
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d09c4) [0x7f24462159c4] insns.def:1028
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d462f) [0x7f244621962f] vm.c:1398
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d7aea) [0x7f244621caea] vm.c:817
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(rb_yield+0x69) [0x7f244621db19] vm.c:856
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(rb_ary_each+0x52) [0x7f2446073012] array.c:1785
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1cbd04) [0x7f2446210d04] vm_insnhelper.c:1489
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d1305) [0x7f2446216305] insns.def:999
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d462f) [0x7f244621962f] vm.c:1398
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d7aea) [0x7f244621caea] vm.c:817
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(rb_yield+0x69) [0x7f244621db19] vm.c:856
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(rb_ary_each+0x52) [0x7f2446073012] array.c:1785
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1cbd04) [0x7f2446210d04] vm_insnhelper.c:1489
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d1305) [0x7f2446216305] insns.def:999
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d462f) [0x7f244621962f] vm.c:1398
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d54ba) [0x7f244621a4ba] vm_eval.c:1279
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d5a7f) [0x7f244621aa7f] vm_eval.c:1320
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x8600d) [0x7f24460cb00d] proc.c:377
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1cbd04) [0x7f2446210d04] vm_insnhelper.c:1489
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d09c4) [0x7f24462159c4] insns.def:1028
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d462f) [0x7f244621962f] vm.c:1398
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1dc5f5) [0x7f24462215f5] vm.c:817
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(rb_rescue2+0xbe) [0x7f24460c4f9e] eval.c:754
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1ca5ce) [0x7f244620f5ce] vm_eval.c:1042
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d9981) [0x7f244621e981] vm_insnhelper.c:1489
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d1305) [0x7f2446216305] insns.def:999
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d462f) [0x7f244621962f] vm.c:1398
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1dd109) [0x7f2446222109] vm.c:817
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1cdbaa) [0x7f2446212baa] vm_eval.c:1849
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(rb_catch_obj+0xc) [0x7f2446212c6c] vm_eval.c:1828
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1cdd0e) [0x7f2446212d0e] vm_eval.c:1814
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d9981) [0x7f244621e981] vm_insnhelper.c:1489
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d1305) [0x7f2446216305] insns.def:999
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d462f) [0x7f244621962f] vm.c:1398
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1dd109) [0x7f2446222109] vm.c:817
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1cdbaa) [0x7f2446212baa] vm_eval.c:1849
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(rb_catch_obj+0xc) [0x7f2446212c6c] vm_eval.c:1828
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1cdd0e) [0x7f2446212d0e] vm_eval.c:1814
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d9981) [0x7f244621e981] vm_insnhelper.c:1489
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d1305) [0x7f2446216305] insns.def:999
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d462f) [0x7f244621962f] vm.c:1398
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(rb_iseq_eval+0x1a9) [0x7f2446222809] vm.c:1649
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x8304e) [0x7f24460c804e] load.c:615
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(rb_require_safe+0x7d2) [0x7f24460c9b22] load.c:644
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d9981) [0x7f244621e981] vm_insnhelper.c:1489
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d09c4) [0x7f24462159c4] insns.def:1028
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x1d462f) [0x7f244621962f] vm.c:1398
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(rb_iseq_eval_main+0x1f6) [0x7f2446222a76] vm.c:1662
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(+0x7d6ff) [0x7f24460c26ff] eval.c:253
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(ruby_exec_node+0x1d) [0x7f24460c476d] eval.c:318
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/../lib/libruby.so.2.1(ruby_run_node+0x1c) [0x7f24460c6f3c] eval.c:310
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/ruby() [0x40087b] main.c:36
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f2445ca1ec5]
/home/my-user/.rvm/rubies/ruby-2.1.5/bin/ruby() [0x4008a9] main.c:38

License missing from gemspec

RubyGems.org doesn't report a license for your gem. This is because it is not specified in the gemspec of your last release.

via e.g.

spec.license = 'MIT'
# or
spec.licenses = ['MIT', 'GPL-2']

Including a license in your gemspec is an easy way for rubygems.org and other tools to check how your gem is licensed. As you can imagine, scanning your repository for a LICENSE file or parsing the README, and then attempting to identify the license or licenses is much more difficult and more error prone. So, even for projects that already specify a license, including a license in your gemspec is a good practice. See, for example, how rubygems.org uses the gemspec to display the rails gem license.

There is even a License Finder gem to help companies/individuals ensure all gems they use meet their licensing needs. This tool depends on license information being available in the gemspec. This is an important enough issue that even Bundler now generates gems with a default 'MIT' license.

I hope you'll consider specifying a license in your gemspec. If not, please just close the issue with a nice message. In either case, I'll follow up. Thanks for your time!

Appendix:

If you need help choosing a license (sorry, I haven't checked your readme or looked for a license file), GitHub has created a license picker tool. Code without a license specified defaults to 'All rights reserved'-- denying others all rights to use of the code.
Here's a list of the license names I've found and their frequencies

p.s. In case you're wondering how I found you and why I made this issue, it's because I'm collecting stats on gems (I was originally looking for download data) and decided to collect license metadata, too, and make issues for gemspecs not specifying a license as a public service :). See the previous link or my blog post about this project for more information.

Latest version is not compatible with Ruby 1.8

It seems to use the 1.9 hash literal syntax instead of the old one.

I know that 1.9/2.0 is awesome and I personally love it and hate legacy compatibility but OSX still comes with 1.8 by default so if non-ruby-programmers want to install tools all their libraries have to be 1.8 compatible.

Library/Ruby/Gems/1.8/gems/fuzzy_match-2.0.0/lib/fuzzy_match.rb:83: syntax error, unexpected ':', expecting '='
...stop_words: @stop_words, read: @read }

Engine results should match

Here's a test and failure that show the :pure_ruby engine and the :amatch engine giving different results:

  test "pure_ruby and amatch behave the same" do
    haystack = [
      "    uniqueness: { message: \"is invalid\" }\n\n  validates :name, presence: true\n\n  def clear_reset_password_token!\n    clear_reset_password_token\n    save\n",
      "  validates :email,\n    uniqueness: { message: \"is invalid\" }\n\n  validates :name, presence: true\n\n  def clear_reset_password_token!\n    clear_reset_password_token\n    save\n",
    ]

    needle = "  validates :email,\n   uniqueness: { message: \"is invalid\" }\n\n  def clear_reset_password_token!\n    clear_reset_password_token\n    save\n"

    FuzzyMatch.engine = :pure_ruby
    fzp = FuzzyMatch.new(haystack)
    resultp = fzp.find_all_with_score(needle)

    FuzzyMatch.engine = :amatch
    fza = FuzzyMatch.new(haystack)
    resulta = fza.find_all_with_score(needle)

    expect(resultp).to eq resulta
  end

Result:

Error:
Test#test_pure_ruby_and_amatch_behave_the_same:
RSpec::Expectations::ExpectationNotMetError:
expected: [["    uniqueness: { message: \"is invalid\" }\n\n  validates :name, presence: true\n\n  def clear_re...ssword_token!\n    clear_reset_password_token\n    save\n", 0.8888888888888888, 0.7869822485207101]]
     got: [["  validates :email,\n    uniqueness: { message: \"is invalid\" }\n\n  validates :name, presence: t...ssword_token!\n    clear_reset_password_token\n    save\n", 0.8826291079812206, 0.6190476190476191]]

(compared using ==)

I would expect the results to match. For my use case I actually prefer the result that the :pure_ruby engine gives, but I'm not sure which one is more technically correct. Unfortunately I need my use case to run faster as well, so I'm stuck between a rock and a hard place.

Any idea why they differ and how I might be able to get them to behave the same?

jaro winkler * pair distance might work better than pair distance + levenshtein

in terms of the core decision logic, i've seen cases where jaro winkler score * pair distance score is better than first trying pair distance and only using levenshtein to break a tie (the current system)

Unexpected results when specifying stop words as Regexps

FuzzyMatch.new(['AAI Limited', 'LITED'], :stop_words=>['limited']).find('AAI Limited')
=> "AAI Limited"  # good


FuzzyMatch.new(['AAI Limited', 'LITED'], :stop_words=>[/limited/i]).find('AAI Limited')
=> "LITED"  # bad

I would expect the same result in either case, given the absence of special characters in the regexp.
