mpkorstanje / simmetrics
This project forked from nickmancol/simmetrics
Similarity or Distance Metrics, e.g. Levenshtein, for Java
License: Apache License 2.0
Version 1 had benchmark tests for all metrics. They were too much spaghetti code to refactor, so I tossed them out. Still, having a benchmark would be nice. There is currently a single performance test in org.simmetrics.performance.BatchPerformance, but it suffers from CPU look-ahead and a few other JVM optimizations.
Using Google Caliper would be nice.
The current implementation of MatchingCoefficient isn't symmetric and the implementation doesn't seem to make much sense as a string metric. See:
This code is slow because each call to replaceAll compiles a new pattern. It should be replaced by precompiled patterns.
wordStr = wordStr.replaceAll("[aeiouwh]+", "0");
wordStr = wordStr.replaceAll("[bpfv]+", "1");
wordStr = wordStr.replaceAll("[cskgjqxz]+", "2");
wordStr = wordStr.replaceAll("[dt]+", "3");
wordStr = wordStr.replaceAll("[l]+", "4");
wordStr = wordStr.replaceAll("[mn]+", "5");
wordStr = wordStr.replaceAll("[r]+", "6");
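A minimal sketch of the precompiled variant, assuming the input has already been lowercased as the character classes above imply. The class name SoundexPatterns, the method name encode, and the CODE_n constants are hypothetical and not taken from the project; only java.util.regex.Pattern from the JDK is used.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not the project's actual class.
final class SoundexPatterns {
    // Each pattern is compiled once at class initialization
    // instead of on every call to String.replaceAll.
    private static final Pattern CODE_0 = Pattern.compile("[aeiouwh]+");
    private static final Pattern CODE_1 = Pattern.compile("[bpfv]+");
    private static final Pattern CODE_2 = Pattern.compile("[cskgjqxz]+");
    private static final Pattern CODE_3 = Pattern.compile("[dt]+");
    private static final Pattern CODE_4 = Pattern.compile("[l]+");
    private static final Pattern CODE_5 = Pattern.compile("[mn]+");
    private static final Pattern CODE_6 = Pattern.compile("[r]+");

    private SoundexPatterns() {
    }

    // Applies the same seven replacements, in the same order, as the snippet above.
    static String encode(String wordStr) {
        wordStr = CODE_0.matcher(wordStr).replaceAll("0");
        wordStr = CODE_1.matcher(wordStr).replaceAll("1");
        wordStr = CODE_2.matcher(wordStr).replaceAll("2");
        wordStr = CODE_3.matcher(wordStr).replaceAll("3");
        wordStr = CODE_4.matcher(wordStr).replaceAll("4");
        wordStr = CODE_5.matcher(wordStr).replaceAll("5");
        wordStr = CODE_6.matcher(wordStr).replaceAll("6");
        return wordStr;
    }
}
```

Pattern instances are immutable and safe to share between threads, so static constants fit here. A Matcher is not thread-safe, which is why a fresh one is created per call.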
Use libsimmetrics version as base
I had a question about the license of SimMetrics and how it was relicensed from GPL 2.0.
In this commit (8307a58), the license of the project has been changed from GPL 2.0 and several authors removed. I'd like to know the circumstances behind this change.
List and Set metrics are currently tested as if they were string metrics.
This prevents testing specific features of the metrics. Tests should be decoupled in some way.
Hi, sorry, this is not really an issue, but I have raised a simmetrics question on http://stackoverflow.com/questions/40740577/should-i-use-stringmetric-or-multisetmetric-for-comparing-these-strings-with-sim that I hope you can help me with.
Having said that, it would be helpful if there was a page that grouped and explained the metrics, so that casual users have a better shot at choosing the right algorithm. For example, I have only just realized that CosineSimilarity with the WhiteSpace tokenizer treats the words in a sentence as a set, ignoring their order in the sentence, although happily this is essentially what I want it to do.
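To illustrate the observation about word order, here is a small self-contained sketch of cosine similarity over sets of whitespace-separated tokens. It is not the library's implementation: the class TokenSetCosine and its methods are hypothetical, and whether the library's CosineSimilarity counts repeated words is not confirmed here. For sets, the cosine reduces to the number of shared tokens divided by the square root of the product of the set sizes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration, not part of the library.
final class TokenSetCosine {
    private TokenSetCosine() {
    }

    // Splits on runs of whitespace; duplicates and order are discarded by the set.
    static Set<String> tokens(String s) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptySet();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Cosine similarity of two token sets: |A ∩ B| / sqrt(|A| * |B|).
    static double cosine(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() || tb.isEmpty()) {
            return 0.0; // arbitrary choice for this sketch
        }
        Set<String> common = new HashSet<>(ta);
        common.retainAll(tb);
        return common.size() / Math.sqrt((double) ta.size() * tb.size());
    }
}
```

With this sketch, cosine("the quick brown fox", "fox brown quick the") is 1.0, because both sentences produce the same token set even though the word order differs.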
Add javadoc comments for immutable and thread-safe.
The current implementation of the SoundexSimplifier has a maximum soundex size because it pads the code with a fixed-length string of zeros. The implementation should be fixed so that it works for any requested length.
// Drop first letter code and remove zeros
wordStr = wordStr.substring(1).replaceAll("0", "");
// FIXME: This will not work for all soundex lengths
wordStr += "000000000000000000"; /* pad with zeros on right */
// Add first letter of word and size to taste
wordStr = firstLetter + "-" + wordStr.substring(0, length - 2);
return wordStr;
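One possible way to lift the size limit is to pad to the requested width instead of appending a fixed run of zeros. The sketch below assumes the same output shape as the snippet above (first letter, a dash, then length - 2 code digits). The class SoundexPadding and the methods fit and format are hypothetical names, not the project's API.

```java
// Hypothetical helper, not the project's actual class.
final class SoundexPadding {
    private SoundexPadding() {
    }

    // Truncates codes to width characters, or right-pads with '0' up to width.
    static String fit(String codes, int width) {
        StringBuilder sb = new StringBuilder(width);
        sb.append(codes, 0, Math.min(codes.length(), width));
        while (sb.length() < width) {
            sb.append('0');
        }
        return sb.toString();
    }

    // Builds "<firstLetter>-<codes>" with a total size of exactly length characters.
    static String format(char firstLetter, String codes, int length) {
        if (length < 2) {
            throw new IllegalArgumentException("length must be at least 2");
        }
        return firstLetter + "-" + fit(codes, length - 2);
    }
}
```

Because the padding is generated on demand, the result has exactly the requested size for any length of 2 or more, where the fixed eighteen-zero string cannot guarantee that.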
Smith-Waterman implementation needs to be checked against the research paper.
The current builder is a bit verbose:
StringMetric metric = new StringMetricBuilder()
    .with(new CosineSimilarity<String>())
    .simplify(new CaseSimplifier.Lower())
    .tokenize(new QGramTokenizer(2))
    .build();
It could be much shorter by using import static org.simmetrics.StringMetricBuilder.with;
StringMetric metric = with(new CosineSimilarity<String>())
    .simplify(new CaseSimplifier.Lower())
    .tokenize(new QGramTokenizer(2))
    .build();
SmithWatermanGotoh and SmithWatermanGotohWindowedAffine need to be refactored and checked against research papers.
I just checked the pom.xml and found this milestone version for the maven-enforcer-plugin. Is it really needed?
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-enforcer-plugin</artifactId>
    <version>3.0.0-M1</version>
</plugin>
This issue provides visibility into Renovate updates and their statuses.
These updates have all been created already. Click a checkbox below to force a retry/rebase of any.
Use libsimmetrics code instead.