Comments (4)
The convergence guarantees are asymptotic: the method converges in the limit of infinite data. The stated bounds are merely what makes their particular method of proof go through, not hard requirements for convergence. The authors themselves note that they believe better bounds may exist and still yield convergence; they simply couldn't construct a proof for them at the time of writing the paper.
Given all that I wouldn't worry too much about the bounds for practical purposes.
I also expect any convergence result true of single linkage ought to carry over to HDBSCAN, since for suitable choices of parameters HDBSCAN will replicate the single-linkage hierarchy. Of course I don't have a proof of comparable convergence, but I can handwave in the appropriate direction, I believe.
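A quick numerical sanity check of that handwave (my own sketch, not from the thread): with min_samples=1 the core distance of each point is just its nearest-neighbour distance, which never exceeds d(a, b) for any other point b, so HDBSCAN's mutual reachability metric collapses to the raw metric and the two single-linkage hierarchies coincide.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

D = squareform(pdist(X))  # raw pairwise distances

# Core distance with min_samples=1: distance to the nearest *other* point.
core = np.where(np.eye(len(X), dtype=bool), np.inf, D).min(axis=1)

# Mutual reachability distance: max(core(a), core(b), d(a, b)).
mreach = np.maximum(np.maximum(core[:, None], core[None, :]), D)
np.fill_diagonal(mreach, 0.0)

# Since core(a) <= d(a, b) for every b != a, mutual reachability reduces
# to the raw distance, so the single-linkage dendrograms are identical.
assert np.allclose(mreach, D)
assert np.allclose(linkage(squareform(mreach), method="single"),
                   linkage(squareform(D), method="single"))
```

This is only the min_samples=1 degenerate case, of course; it shows why the hierarchies can agree, not that convergence transfers in general.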
from hdbscan.
You're right that it might be useful to have a notebook on parameter selection. I'll see if I can get something useful put together.
In summary, though, the short answer is that HDBSCAN is designed to be fairly robust to parameter selection. Some rules of thumb:
- Don't touch alpha unless you know what you're doing; the default (1.0) should be fine.
- Think of min_samples as being "how noisy is my data". Larger min_samples anticipates noisier data (to some degree).
- Things should be largely robust to min_samples: something in the 5-15 range should work well for a wide variety of data, and the exact choice shouldn't matter too much.
- We set min_samples to min_cluster_size by default (if min_samples isn't specified), and this is usually the right choice.
I am also working on eliminating min_samples as a parameter altogether (along with alpha), leaving only min_cluster_size, which should be relatively intuitive to pick. This involves some theoretical considerations; I have the skeleton of the relevant theory, but I haven't yet worked out the details needed for an efficient implementation.
from hdbscan.
As an added note, from Chaudhuri and Dasgupta one can find bounds of
alpha = sqrt(2)
min_samples ~ d log(n)
where d is the dimension of the data and n is the number of data points. These aren't actually very practical values, however: they are the values relevant for proving convergence to the level-set tree (a valuable theoretical result!), but in practice they are larger than you really want to use. They might help get you into an appropriate ballpark, however.
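For a sense of scale, here is that bound evaluated numerically (my own arithmetic; I assume the natural log, since the bound is only stated up to constants anyway):

```python
import math

def cd_min_samples(d, n):
    """Chaudhuri-Dasgupta-style value: min_samples ~ d * log(n)."""
    return math.ceil(d * math.log(n))

alpha = math.sqrt(2)  # the corresponding alpha from the bound

# Even for modest data the bound far exceeds the practical 5-15 range:
print(cd_min_samples(10, 10_000))  # d=10, n=10,000 -> prints 93
```

So for 10-dimensional data with 10,000 points the theory asks for min_samples around 93, versus the 5-15 that works well in practice.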
from hdbscan.
Thanks for the reply!
Basically, you're arguing that convergence is less important than keeping those parameters small. Is that a fair summary?
Also, they mention in the paper that regular old single-linkage clustering always converges in one dimension. Do you think this would be true of the hierarchical clustering underlying HDBSCAN as well?
from hdbscan.