simmetrics's People

Contributors

jokillsya, mpkorstanje, renovate-bot, renovate[bot], twillouer

simmetrics's Issues

Performance tests and benchmarks

Version 1 had benchmark tests for all metrics. It was too much spaghetti code to refactor, so I tossed it out. Still, having a benchmark would be nice. There is currently a single performance test in org.simmetrics.performance.BatchPerformance, but it suffers from CPU look-ahead and a few other JVM optimizations.

Using Google Caliper would be nice.

Soundex optimization

This code is slow because each call to replaceAll compiles a new pattern. It should be replaced by precompiled patterns.

        wordStr = wordStr.replaceAll("[aeiouwh]+", "0");
        wordStr = wordStr.replaceAll("[bpfv]+", "1");
        wordStr = wordStr.replaceAll("[cskgjqxz]+", "2");
        wordStr = wordStr.replaceAll("[dt]+", "3");
        wordStr = wordStr.replaceAll("[l]+", "4");
        wordStr = wordStr.replaceAll("[mn]+", "5");
        wordStr = wordStr.replaceAll("[r]+", "6");
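One possible shape for the fix, as a minimal sketch: compile each pattern once into a static field and reuse it through a Matcher. The class name SoundexCodes and the method name encode are made up for illustration and are not part of simmetrics.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not the simmetrics implementation.
final class SoundexCodes {
    // Each pattern is compiled once, when the class is loaded,
    // instead of on every call to String.replaceAll.
    private static final Pattern CODE_0 = Pattern.compile("[aeiouwh]+");
    private static final Pattern CODE_1 = Pattern.compile("[bpfv]+");
    private static final Pattern CODE_2 = Pattern.compile("[cskgjqxz]+");
    private static final Pattern CODE_3 = Pattern.compile("[dt]+");
    private static final Pattern CODE_4 = Pattern.compile("[l]+");
    private static final Pattern CODE_5 = Pattern.compile("[mn]+");
    private static final Pattern CODE_6 = Pattern.compile("[r]+");

    private SoundexCodes() {
    }

    // Applies the same seven replacements as the snippet above, in the same order.
    static String encode(String wordStr) {
        wordStr = CODE_0.matcher(wordStr).replaceAll("0");
        wordStr = CODE_1.matcher(wordStr).replaceAll("1");
        wordStr = CODE_2.matcher(wordStr).replaceAll("2");
        wordStr = CODE_3.matcher(wordStr).replaceAll("3");
        wordStr = CODE_4.matcher(wordStr).replaceAll("4");
        wordStr = CODE_5.matcher(wordStr).replaceAll("5");
        wordStr = CODE_6.matcher(wordStr).replaceAll("6");
        return wordStr;
    }
}
```

Pattern instances are immutable and safe to share between threads, so static fields should be fine here. A single pass over the characters with a lookup table would likely be faster still, but that changes more of the code.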

Question about SimMetrics license

I had a question about the license of SimMetrics and how it was relicensed from GPL 2.0.

In this commit (8307a58), the license of the project has been changed from GPL 2.0 and several authors removed. I'd like to know the circumstances behind this change.

More information required about the different metrics

Hi, sorry, this is not really an issue, but I have asked a simmetrics question on http://stackoverflow.com/questions/40740577/should-i-use-stringmetric-or-multisetmetric-for-comparing-these-strings-with-sim that I hope you can help me with.

Having said that, it would be helpful if there were a page that grouped and explained the metrics, so that casual users have a better chance of picking the right algorithm. For example, I have only just realized that CosineSimilarity with the WhiteSpace tokenizer treats the words in a sentence as a set and ignores their order in the sentence, although happily this is essentially what I want it to do.
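To illustrate the order-insensitivity described here, a minimal self-contained sketch of cosine similarity over whitespace token sets follows. It does not use simmetrics and is not its implementation; the class TokenSetCosine and its methods are hypothetical, and it assumes the set form of cosine similarity, |A ∩ B| / sqrt(|A| * |B|).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration, not the simmetrics CosineSimilarity class.
final class TokenSetCosine {
    private TokenSetCosine() {
    }

    // Splits on runs of whitespace and collects the distinct tokens.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Set-based cosine similarity: |A ∩ B| / sqrt(|A| * |B|).
    static double similarity(String a, String b) {
        Set<String> x = tokenize(a);
        Set<String> y = tokenize(b);
        if (x.isEmpty() && y.isEmpty()) {
            return 1.0; // assumption: two empty inputs count as identical
        }
        if (x.isEmpty() || y.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        return common.size() / Math.sqrt((double) x.size() * y.size());
    }
}
```

Because the tokens end up in a set, "the quick fox" and "fox quick the" compare as identical, which is the behavior noted above.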

Longer Soundex lengths

The current implementation of the SoundexSimplifier has a maximum soundex size because the implementation is rather naive: it pads the code with a fixed-length string of zeros. The implementation should be fixed so that it works for any length.

        // Drop first letter code and remove zeros
        wordStr = wordStr.substring(1).replaceAll("0", "");
        // FIXME: This will not work for all soundex lengths
        wordStr += "000000000000000000"; /* pad with zeros on right */
        // Add first letter of word and size to taste
        wordStr = firstLetter + "-" + wordStr.substring(0, length - 2);
        return wordStr;
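A sketch of how the padding step might be made independent of the requested length, assuming the same output layout as the snippet above (first letter, a dash, then length - 2 digits). The class SoundexPadding, the method finish, and its parameter names are hypothetical.

```java
// Hypothetical helper, not the simmetrics implementation.
final class SoundexPadding {
    private SoundexPadding() {
    }

    // codes: the digit string after the first letter's code and all zeros
    // have been removed. length: total length of the result, including
    // the first letter and the dash.
    static String finish(char firstLetter, String codes, int length) {
        int digits = length - 2;
        if (digits < 0) {
            throw new IllegalArgumentException("length must be at least 2");
        }
        StringBuilder result = new StringBuilder(length);
        result.append(firstLetter).append('-');
        // Copy at most `digits` codes, then pad with zeros up to `length`.
        result.append(codes, 0, Math.min(codes.length(), digits));
        while (result.length() < length) {
            result.append('0');
        }
        return result.toString();
    }
}
```

Padding in a loop instead of appending a fixed "000..." literal means substring can no longer run past the end of the string when a long code is requested.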

Smith-Waterman

Smith-Waterman implementation needs to be checked against the research paper.

Static Metric build method.

The current builder is a bit verbose:

        StringMetric metric = new StringMetricBuilder()
                .with(new CosineSimilarity<String>())
                .simplify(new CaseSimplifier.Lower())
                .tokenize(new QGramTokenizer(2))
                .build();

It could be much shorter by using import static org.simmetrics.StringMetricBuilder.with:

        StringMetric metric = with(new CosineSimilarity<String>())
                .simplify(new CaseSimplifier.Lower())
                .tokenize(new QGramTokenizer(2))
                .build();

SmithWatermanGotoh

SmithWatermanGotoh and SmithWatermanGotohWindowedAffine need to be refactored and checked against research papers.

maven-enforcer-plugin 3.0.0-M1

I just checked the pom.xml and found this milestone version for the maven-enforcer-plugin. Is it really needed?

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>3.0.0-M1</version>
</plugin>
