Git Product home page Git Product logo

Comments (11)

haifengl avatar haifengl commented on May 3, 2024

Thanks! As you may figure it out yourself, DBSCan use RNNSearch inside. When a Distance or Metric is supplied, we wrap it with LinearSearch or CoverTree. setIdenticalExcluded(true) is important for some uses including DBScan. But for your use case, it is different due to the meaning of JAR_WINKLER_DISTANCE. I will check if I should update the semantical meaning of setIdenticalExcluded, for example it means exactly same object (not just distance 0). I have to double check before making this change. Thanks!

from smile.

haifengl avatar haifengl commented on May 3, 2024

BTW, can you please contribute your JARO_WINKLER_DISTANCE implementation to SMILE? Thank you very much in advance! If you agree, please put it in package smile.math.distance. Thanks again!

from smile.

haifengl avatar haifengl commented on May 3, 2024

Nearest neighbor search is used in many density estimation and clustering algorithms. In most situations, distance 0 means that two objects are the same in the feature space. Without excluding identical data points, we may get into troubles in some cases. It is why we do setIdenticalExcluded(true) by default. I think that it is reasonable and also more efficient. But in your use case, many objects have 0 distance even if they are different. Your approach with customized LinearSearch is correct and I am glad it works. The problem comes from the distance definition not DBSCan. I won't change the default behavior otherwise many other algorithms will not work correctly. Thanks!

from smile.

badgersow avatar badgersow commented on May 3, 2024

Thank a lot you for fast and detailed reply.

I think I was not clear last time with my explanation, please let me rephrase it.
My task is to cluster the set of strings into groups.
In my case, distance 0 is possible only with equal or identical (equal by reference in Java terms) strings.

The problem occurs when I have two identical Java strings in dataset.
They have distance 0 and they are reference-equal. I expect them to be in one cluster, but because of identicalExcluded = false, objects are labeled as noise.

You can check that with dataset {"123", "123"} strings are labeled as noise because they are the same Java objects (thanks, Java string pool), but with dataset {new String("123"), new String("123")} objects are clustered together because they have different references and distance 0.

In my opinion this example exposes inconsistent behavior because the algorithm depends on reference equality in Java language.

Anyway, I am not a specialist at machine learning, I just want to be sure you understand me right because of my terrible English. Thanks again for reviewing this issue!

About Jaro-Winkler distance implementation
In my case I used implementation provided by Apache commons library, please see code below:

public static final Distance<String> JARO_WINKLER_DISTANCE = new Distance<String>() {
    @Override
    public double d(String from, String to) {
        return 1.0 - StringUtils.getJaroWinklerDistance(from, to);
    }
};

Do you think I should add dependency below in SMILE and add distance to package smile.math.distance?

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.3.2</version>
</dependency>

Many thanks for your time.

from smile.

haifengl avatar haifengl commented on May 3, 2024

Thanks for explanation. Value equality and reference equality are treated differently in almost all programming languages. I feel that our behavior is correct in this sense. In DBScan, we need to query every object's neighborhood. If we return the query object in the neighborhood, we will get into a self-reference loop. Although our code avoids getting into dead loop, it is still not a desired behavior to including the query object in the neighborhood. It is the purpose of setIdenticalExcluded.

So it is clear that Java pool string literal, which is the root cause. As you said, new String() is probably the right solution. With other data, I observe that the result is worse if I do setIdenticalExclude(false). Even though it may work for you, be careful with the clustering results.

Since you use apache commons for Jar-Winkler distance, it is not necessary to include it here. Thanks!

from smile.

badgersow avatar badgersow commented on May 3, 2024

Thanks for clear and detailed explanation.
With your permission I'd like to add couple of lines to documentation of RNNSearch class, explaining default behavior.

If you are OK with this, I'll open pull request.

Also, I will follow your advice and rewrite my code to use new String() approach

from smile.

haifengl avatar haifengl commented on May 3, 2024

Sure. Go ahead adding comments and make a pull request. Thanks!

from smile.

haifengl avatar haifengl commented on May 3, 2024

Btw, not sure your use case. But usually BKTree and CoverTree are much faster than linear search. We do have a very efficient implementation of edit distance. If your data is large, maybe worth of a try.

from smile.

badgersow avatar badgersow commented on May 3, 2024

Thanks for advice, but according to this article, Jaro-Winkler is not a metric because it does not satisfy triangle equality. It seems I can use BKTree and CoverTree only with metric.
Do we have better approach than LinearSearch when distance does not satisfy triangle inequality?
BTW, my dataset should not be more than 10K objects.

from smile.

haifengl avatar haifengl commented on May 3, 2024

Since your data is not big, it is okay with linear search. I don't know your application, which may need that string distance. Just suggest to try edit distance if it makes sense in your case. It really depends on the use case. Speed is only a secondary consideration for small data.

from smile.

badgersow avatar badgersow commented on May 3, 2024

Many thanks for help.
I am closing the issue.

from smile.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.