mpkorstanje / simmetrics

This project forked from nickmancol/simmetrics

Similarity or Distance Metrics, e.g. Levenshtein, for Java

License: Apache License 2.0

Java 100.00%

simmetrics's Issues

Performance tests and benchmarks

Version 1 had benchmark tests for all metrics. They were too much spaghetti code to refactor, so I tossed them out. Still, having a benchmark would be nice. There is currently a single performance test in org.simmetrics.performance.BatchPerformance, but it suffers from CPU look-ahead and a few other JVM optimizations.

Using Google Caliper would be nice.
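To illustrate the kind of JVM effects mentioned above, here is a minimal hand-rolled sketch, not Caliper and not the project's BatchPerformance test. The class name NaiveBenchmark, its methods, and the use of ToDoubleBiFunction as a stand-in for a metric are all assumptions made for illustration. It warms up the code before timing and writes every result to a volatile field so the JIT cannot discard the measured work as dead code. A dedicated harness such as Caliper would handle these concerns far more thoroughly.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper, not part of SimMetrics.
final class NaiveBenchmark {
    // Results are stored here so the JIT cannot treat the comparisons as dead code.
    static volatile double sink;

    /** Compares every input with every other input once and returns the summed scores. */
    static double runOnce(ToDoubleBiFunction<String, String> metric, String[] inputs) {
        double sum = 0;
        for (String a : inputs) {
            for (String b : inputs) {
                sum += metric.applyAsDouble(a, b);
            }
        }
        return sum;
    }

    /** Returns the average wall-clock time of one round in nanoseconds, after warm-up. */
    static long averageNanos(ToDoubleBiFunction<String, String> metric, String[] inputs,
                             int warmupRounds, int measuredRounds) {
        if (measuredRounds <= 0) {
            throw new IllegalArgumentException("measuredRounds must be positive");
        }
        // Warm-up rounds give the JIT a chance to compile the hot path before timing starts.
        for (int i = 0; i < warmupRounds; i++) {
            sink = runOnce(metric, inputs);
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRounds; i++) {
            sink = runOnce(metric, inputs);
        }
        return (System.nanoTime() - start) / measuredRounds;
    }
}
```

The numbers from a loop like this are still only indicative; they vary between runs and machines.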

MatchingCoefficient is not symmetric

The current implementation of MatchingCoefficient isn't symmetric and the implementation doesn't seem to make much sense as a string metric. See:

nickmancol#5

Soundex optimization

This code is slow because every call to replaceAll compiles a new pattern. The calls should be replaced by precompiled patterns.

        wordStr = wordStr.replaceAll("[aeiouwh]+", "0");
        wordStr = wordStr.replaceAll("[bpfv]+", "1");
        wordStr = wordStr.replaceAll("[cskgjqxz]+", "2");
        wordStr = wordStr.replaceAll("[dt]+", "3");
        wordStr = wordStr.replaceAll("[l]+", "4");
        wordStr = wordStr.replaceAll("[mn]+", "5");
        wordStr = wordStr.replaceAll("[r]+", "6");
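One possible shape for the fix, as a sketch only: the seven character classes are taken from the snippet above, while the class name SoundexCodes and the method encode are hypothetical. Each pattern is compiled once when the class loads, and the replacements run in the same order as the original calls.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not the actual SimMetrics class.
final class SoundexCodes {
    // Compiled once at class load instead of on every replaceAll call.
    // The array index doubles as the replacement digit.
    private static final Pattern[] GROUPS = {
        Pattern.compile("[aeiouwh]+"),
        Pattern.compile("[bpfv]+"),
        Pattern.compile("[cskgjqxz]+"),
        Pattern.compile("[dt]+"),
        Pattern.compile("[l]+"),
        Pattern.compile("[mn]+"),
        Pattern.compile("[r]+"),
    };

    /** Replaces each run of letters from group i with the digit i. */
    static String encode(String wordStr) {
        for (int i = 0; i < GROUPS.length; i++) {
            wordStr = GROUPS[i].matcher(wordStr).replaceAll(Integer.toString(i));
        }
        return wordStr;
    }
}
```

Pattern instances are immutable and safe to share between threads, so a static array like this does not need synchronization.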

Implement Metaphone and DoubleMetaphone

Use the libsimmetrics version as a base.

Question about SimMetrics license

I have a question about the license of SimMetrics and how it was relicensed from GPL 2.0.

In this commit (8307a58), the license of the project was changed from GPL 2.0 and several authors were removed. I'd like to know the circumstances behind this change.

Dedicated unit tests for List and SetMetrics

List and Set metrics are currently tested as if they were string metrics.

This prevents testing specific features of the metrics. Tests should be decoupled in some way.

More information required about the different metrics

Hi, sorry, this is not really an issue, but I have asked a SimMetrics question at http://stackoverflow.com/questions/40740577/should-i-use-stringmetric-or-multisetmetric-for-comparing-these-strings-with-sim that I hope you can help me with.

Having said that, it would be helpful if there were a page that grouped and explained the metrics, so that casual users have a better chance of picking the right algorithm. For example, I have only just realized that CosineSimilarity with the WhiteSpace tokenizer treats the words in a sentence as a set and ignores their order, although happily this is essentially what I want it to do.
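The order-insensitivity described here can be shown with a toy sketch. This is not the SimMetrics implementation: the class SetCosineSketch, its methods, and the choice to return 0 for empty input are assumptions for illustration. It splits each sentence on whitespace, collects the words into a set, and computes the set-based cosine |A ∩ B| / sqrt(|A| · |B|).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy illustration only, not the SimMetrics classes.
final class SetCosineSketch {
    /** Splits on whitespace and keeps each distinct word once; word order is lost here. */
    static Set<String> tokenize(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /** Set-based cosine similarity; returns 0 if either set is empty (an assumption). */
    static double cosine(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size() / Math.sqrt((double) a.size() * b.size());
    }
}
```

With this sketch, "the quick brown fox" and "fox brown quick the" produce the same set of four words, so their similarity is 1.0 even though the word order differs.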

Immutable and Thread-Safe

Add javadoc comments for immutable and thread-safe.

Longer Soundex lengths

The current implementation of the SoundexSimplifier has a maximum Soundex size because it pads the code with a fixed number of zeros before truncating it. The implementation should be fixed so that it works for any length.

        // Drop first letter code and remove zeros
        wordStr = wordStr.substring(1).replaceAll("0", "");
        // FIXME: This will not work for all soundex lengths
        wordStr += "000000000000000000"; /* pad with zeros on right */
        // Add first letter of word and size to taste
        wordStr = firstLetter + "-" + wordStr.substring(0, length - 2);
        return wordStr;
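One way to remove the fixed padding, sketched under assumptions: the class SoundexPadding and the method format are hypothetical, and the output layout (first letter, a dash, then digits, for a total of length characters) is inferred from the snippet above. Instead of appending a fixed run of zeros, it pads until the requested length is reached.

```java
// Hypothetical helper, not the actual SimMetrics class.
final class SoundexPadding {
    /**
     * Builds firstLetter + "-" + digits, truncated or zero-padded so that the
     * result is exactly `length` characters long.
     */
    static String format(char firstLetter, String digits, int length) {
        if (length < 2) {
            throw new IllegalArgumentException("length must be at least 2");
        }
        StringBuilder code = new StringBuilder(length);
        code.append(firstLetter).append('-');
        // Copy at most length - 2 digits, so longer codes are truncated.
        code.append(digits, 0, Math.min(digits.length(), length - 2));
        // Pad on the right with zeros, however long the requested code is.
        while (code.length() < length) {
            code.append('0');
        }
        return code.toString();
    }
}
```

Because the padding loop is driven by the requested length, there is no upper bound tied to a hard-coded string of zeros.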

Smith-Waterman

Smith-Waterman implementation needs to be checked against the research paper.

Static Metric build method.

The current builder is a bit verbose:

        StringMetric metric = new StringMetricBuilder()
                .with(new CosineSimilarity<String>())
                .simplify(new CaseSimplifier.Lower())
                .tokenize(new QGramTokenizer(2))
                .build();

It could be much shorter with a static import of org.simmetrics.StringMetricBuilder.with:

        StringMetric metric = with(new CosineSimilarity<String>())
                .simplify(new CaseSimplifier.Lower())
                .tokenize(new QGramTokenizer(2))
                .build();
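A static entry point like this is usually a small change to the builder itself. The sketch below uses toy types rather than the real SimMetrics classes: ToyMetricBuilder, its methods, and the use of BiFunction as a stand-in for a metric are all assumptions for illustration. The constructor is private and a static with method creates the builder, so callers can drop the `new ...()` prefix after a static import.

```java
import java.util.function.BiFunction;
import java.util.function.UnaryOperator;

// Toy builder for illustration; not org.simmetrics.StringMetricBuilder.
final class ToyMetricBuilder {
    private final BiFunction<String, String, Float> metric;
    private UnaryOperator<String> simplifier = UnaryOperator.identity();

    private ToyMetricBuilder(BiFunction<String, String, Float> metric) {
        this.metric = metric;
    }

    /** Static entry point, usable as with(...) after a static import. */
    static ToyMetricBuilder with(BiFunction<String, String, Float> metric) {
        return new ToyMetricBuilder(metric);
    }

    /** Sets the simplifier applied to both inputs before they are compared. */
    ToyMetricBuilder simplify(UnaryOperator<String> simplifier) {
        this.simplifier = simplifier;
        return this;
    }

    /** Returns a metric that simplifies both strings and then delegates to the wrapped metric. */
    BiFunction<String, String, Float> build() {
        UnaryOperator<String> s = simplifier;
        return (a, b) -> metric.apply(s.apply(a), s.apply(b));
    }
}
```

A side effect of this design is that a metric must be supplied before any other step can be chained, since there is no other way to obtain a builder.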

SmithWatermanGotoh

SmithWatermanGotoh and SmithWatermanGotohWindowedAffine need to be refactored and checked against research papers.

maven-enforcer-plugin 3.0.0-M1

I just checked the pom.xml and found this milestone version for the maven-enforcer-plugin. Is it really needed?

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>3.0.0-M1</version>
</plugin>

Dependency Dashboard

This issue provides visibility into Renovate updates and their statuses.

Refactor Jaro, Jaro-Winkler and Levenshtein

Use libsimmetrics code instead.