Hi, First let me thank you for the great library, it is pleasure to

Thanks for advice, but according to <a href="http://richardminerich.com/tag/jaro-winkl

Handling of identical objects in DBScan about smile HOT 11 CLOSED

badgersow commented on May 3, 2024

Handling of identical objects in DBScan

from smile.

Comments (11)

haifengl commented on May 3, 2024

Thanks! As you may figure it out yourself, DBSCan use RNNSearch inside. When a Distance or Metric is supplied, we wrap it with LinearSearch or CoverTree. setIdenticalExcluded(true) is important for some uses including DBScan. But for your use case, it is different due to the meaning of JAR_WINKLER_DISTANCE. I will check if I should update the semantical meaning of setIdenticalExcluded, for example it means exactly same object (not just distance 0). I have to double check before making this change. Thanks!

from smile.

haifengl commented on May 3, 2024

BTW, can you please contribute your JARO_WINKLER_DISTANCE implementation to SMILE? Thank you very much in advance! If you agree, please put it in package smile.math.distance. Thanks again!

from smile.

haifengl commented on May 3, 2024

Nearest neighbor search is used in many density estimation and clustering algorithms. In most situations, distance 0 means that two objects are the same in the feature space. Without excluding identical data points, we may get into troubles in some cases. It is why we do setIdenticalExcluded(true) by default. I think that it is reasonable and also more efficient. But in your use case, many objects have 0 distance even if they are different. Your approach with customized LinearSearch is correct and I am glad it works. The problem comes from the distance definition not DBSCan. I won't change the default behavior otherwise many other algorithms will not work correctly. Thanks!

from smile.

badgersow commented on May 3, 2024

Thank a lot you for fast and detailed reply.

I think I was not clear last time with my explanation, please let me rephrase it.
My task is to cluster the set of strings into groups.
In my case, distance 0 is possible only with equal or identical (equal by reference in Java terms) strings.

The problem occurs when I have two identical Java strings in dataset.
They have distance 0 and they are reference-equal. I expect them to be in one cluster, but because of identicalExcluded = false, objects are labeled as noise.

You can check that with dataset {"123", "123"} strings are labeled as noise because they are the same Java objects (thanks, Java string pool), but with dataset {new String("123"), new String("123")} objects are clustered together because they have different references and distance 0.

In my opinion this example exposes inconsistent behavior because the algorithm depends on reference equality in Java language.

Anyway, I am not a specialist at machine learning, I just want to be sure you understand me right because of my terrible English. Thanks again for reviewing this issue!

About Jaro-Winkler distance implementation
In my case I used implementation provided by Apache commons library, please see code below:

public static final Distance<String> JARO_WINKLER_DISTANCE = new Distance<String>() {
    @Override
    public double d(String from, String to) {
        return 1.0 - StringUtils.getJaroWinklerDistance(from, to);
    }
};

Do you think I should add dependency below in SMILE and add distance to package smile.math.distance?

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.3.2</version>
</dependency>

Many thanks for your time.

from smile.

haifengl commented on May 3, 2024

Thanks for explanation. Value equality and reference equality are treated differently in almost all programming languages. I feel that our behavior is correct in this sense. In DBScan, we need to query every object's neighborhood. If we return the query object in the neighborhood, we will get into a self-reference loop. Although our code avoids getting into dead loop, it is still not a desired behavior to including the query object in the neighborhood. It is the purpose of setIdenticalExcluded.

So it is clear that Java pool string literal, which is the root cause. As you said, new String() is probably the right solution. With other data, I observe that the result is worse if I do setIdenticalExclude(false). Even though it may work for you, be careful with the clustering results.

Since you use apache commons for Jar-Winkler distance, it is not necessary to include it here. Thanks!

from smile.

badgersow commented on May 3, 2024

Thanks for clear and detailed explanation.
With your permission I'd like to add couple of lines to documentation of RNNSearch class, explaining default behavior.

If you are OK with this, I'll open pull request.

Also, I will follow your advice and rewrite my code to use new String() approach

from smile.

haifengl commented on May 3, 2024

Sure. Go ahead adding comments and make a pull request. Thanks!

from smile.

haifengl commented on May 3, 2024

Btw, not sure your use case. But usually BKTree and CoverTree are much faster than linear search. We do have a very efficient implementation of edit distance. If your data is large, maybe worth of a try.

from smile.

badgersow commented on May 3, 2024

Thanks for advice, but according to this article, Jaro-Winkler is not a metric because it does not satisfy triangle equality. It seems I can use BKTree and CoverTree only with metric.
Do we have better approach than LinearSearch when distance does not satisfy triangle inequality?
BTW, my dataset should not be more than 10K objects.

from smile.

haifengl commented on May 3, 2024

Since your data is not big, it is okay with linear search. I don't know your application, which may need that string distance. Just suggest to try edit distance if it makes sense in your case. It really depends on the use case. Speed is only a secondary consideration for small data.

from smile.

badgersow commented on May 3, 2024

Many thanks for help.
I am closing the issue.

from smile.

Handling of identical objects in DBScan about smile HOT 11 CLOSED

Comments (11)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent