Comments (11)
Thanks! As you may figure it out yourself, DBSCan use RNNSearch inside. When a Distance or Metric is supplied, we wrap it with LinearSearch or CoverTree. setIdenticalExcluded(true) is important for some uses including DBScan. But for your use case, it is different due to the meaning of JAR_WINKLER_DISTANCE. I will check if I should update the semantical meaning of setIdenticalExcluded, for example it means exactly same object (not just distance 0). I have to double check before making this change. Thanks!
from smile.
BTW, can you please contribute your JARO_WINKLER_DISTANCE implementation to SMILE? Thank you very much in advance! If you agree, please put it in package smile.math.distance. Thanks again!
from smile.
Nearest neighbor search is used in many density estimation and clustering algorithms. In most situations, distance 0 means that two objects are the same in the feature space. Without excluding identical data points, we may get into troubles in some cases. It is why we do setIdenticalExcluded(true) by default. I think that it is reasonable and also more efficient. But in your use case, many objects have 0 distance even if they are different. Your approach with customized LinearSearch is correct and I am glad it works. The problem comes from the distance definition not DBSCan. I won't change the default behavior otherwise many other algorithms will not work correctly. Thanks!
from smile.
Thank a lot you for fast and detailed reply.
I think I was not clear last time with my explanation, please let me rephrase it.
My task is to cluster the set of strings into groups.
In my case, distance 0 is possible only with equal or identical (equal by reference in Java terms) strings.
The problem occurs when I have two identical Java strings in dataset.
They have distance 0 and they are reference-equal. I expect them to be in one cluster, but because of identicalExcluded = false
, objects are labeled as noise.
You can check that with dataset {"123", "123"} strings are labeled as noise because they are the same Java objects (thanks, Java string pool), but with dataset {new String("123"), new String("123")} objects are clustered together because they have different references and distance 0.
In my opinion this example exposes inconsistent behavior because the algorithm depends on reference equality in Java language.
Anyway, I am not a specialist at machine learning, I just want to be sure you understand me right because of my terrible English. Thanks again for reviewing this issue!
About Jaro-Winkler distance implementation
In my case I used implementation provided by Apache commons library, please see code below:
public static final Distance<String> JARO_WINKLER_DISTANCE = new Distance<String>() {
@Override
public double d(String from, String to) {
return 1.0 - StringUtils.getJaroWinklerDistance(from, to);
}
};
Do you think I should add dependency below in SMILE and add distance to package smile.math.distance
?
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.3.2</version>
</dependency>
Many thanks for your time.
from smile.
Thanks for explanation. Value equality and reference equality are treated differently in almost all programming languages. I feel that our behavior is correct in this sense. In DBScan, we need to query every object's neighborhood. If we return the query object in the neighborhood, we will get into a self-reference loop. Although our code avoids getting into dead loop, it is still not a desired behavior to including the query object in the neighborhood. It is the purpose of setIdenticalExcluded.
So it is clear that Java pool string literal, which is the root cause. As you said, new String() is probably the right solution. With other data, I observe that the result is worse if I do setIdenticalExclude(false). Even though it may work for you, be careful with the clustering results.
Since you use apache commons for Jar-Winkler distance, it is not necessary to include it here. Thanks!
from smile.
Thanks for clear and detailed explanation.
With your permission I'd like to add couple of lines to documentation of RNNSearch
class, explaining default behavior.
If you are OK with this, I'll open pull request.
Also, I will follow your advice and rewrite my code to use new String()
approach
from smile.
Sure. Go ahead adding comments and make a pull request. Thanks!
from smile.
Btw, not sure your use case. But usually BKTree and CoverTree are much faster than linear search. We do have a very efficient implementation of edit distance. If your data is large, maybe worth of a try.
from smile.
Thanks for advice, but according to this article, Jaro-Winkler is not a metric because it does not satisfy triangle equality. It seems I can use BKTree
and CoverTree
only with metric.
Do we have better approach than LinearSearch
when distance does not satisfy triangle inequality?
BTW, my dataset should not be more than 10K objects.
from smile.
Since your data is not big, it is okay with linear search. I don't know your application, which may need that string distance. Just suggest to try edit distance if it makes sense in your case. It really depends on the use case. Speed is only a secondary consideration for small data.
from smile.
Many thanks for help.
I am closing the issue.
from smile.
Related Issues (20)
- FR: Compact "how to load dirty data" example HOT 1
- Arff.java writeField can fail when type isn't in the list of handled types HOT 1
- BarPlot.getUpperBound() computes wrong bound. HOT 1
- FR: Warn before trying to train where the label column has any nulls HOT 1
- Dot product Question HOT 2
- stringVector(0) error HOT 1
- Suggest changing license to Apache 2.0 license or MIT
- Non-monotonic cluster tree -- the linkage is probably not appropriate! HOT 1
- HiddenLayerBuilder does not add dropout to HiddenLayer HOT 4
- Method in interface BaseArray can never return an int[] HOT 2
- Making the plot module available in Java API HOT 4
- InnerProduct of vectors created with cas.Vars not being simplified HOT 6
- Support header attribute on facet / row / column encoding channels HOT 2
- Incorrect spec generated for encoding channel sort HOT 4
- How can I set up in Intellij or other IDE to compile and read code? HOT 3
- What is the efficient way to fill null values in a column with an arbitrary string in a Dataframe? HOT 3
- ClassCastException when calling DataFrame.omitNullRows() HOT 1
- smile.plot.swing.BarPlot works with smile-plot 3.0.2 but not with 3.1.0 HOT 1
- IllegalArgumentException when suing SimpleImputer for data sourced from json file HOT 1
- Is there any possibility to use ID3 or C4.5 via the Smile Package in Java? HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from smile.