HDBScan performance issue when choosing Best algorithm (scikit-learn-contrib/hdbscan)

Comments (6)

jc-healy commented on June 3, 2024

That is indeed strange behaviour. By 10% of input data I presume you mean clustering roughly 16,000 points which are 15-dimensional. If so, 5hrs+ is remarkably slow, and even two minutes is a bit slow. I can cluster 16,000 15-dimensional points with your parameters in about 4 seconds (TruncatedSVD to 15 dimensions on top of MNIST). For scaling context, I can handle 70,000 15-dimensional points in about 30 seconds.

My best guess is that there is something strange going on with your data being loaded from your csv. Is it properly numeric data? Or do you have 15 string columns that are being loaded as categorical values and being transformed via a one-hot encoder or some such thing? Have you loaded it into a numpy array?

As an aside, I think your parameter of p=1.5 is being ignored. It is a parameter for Minkowski distance and should be ignored when your metric='euclidean'.
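A minimal sketch of the "is it properly numeric?" check suggested above. The inline CSV text and its column values are hypothetical stand-ins for the real file; only `pandas` and `numpy` calls are used.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV; replace with pd.read_csv("your_file.csv").
csv_text = """Customer ID,Gender,Income,Members Within Household
1,F,30000,1
2,M,45000,2
3,F,60000,4
"""
df = pd.read_csv(io.StringIO(csv_text))

# Columns that pandas did not parse as numbers (strings, categories, ...).
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print("Non-numeric columns:", non_numeric)

# Keep only the numeric columns and hand the clusterer a plain float array.
# Note that an ID column is numeric too and still has to be dropped by hand.
X = df.select_dtypes(include="number").to_numpy(dtype=np.float64)
print("Array shape and dtype:", X.shape, X.dtype)
```

If `non_numeric` is not empty, those columns need to be encoded or dropped before the data reaches the clusterer.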

…

On Mon, Apr 22, 2024 at 5:30 AM divya-agrawal3103 ***@***.***> wrote:

Hi, I am attempting to execute a stream that uses the *HDBScan* clustering algorithm on a set of input data to generate a model. When I select *Best* as the algorithm and randomly pass 10% of the total input data (the input data is a csv file that has *15 columns* and *~169379 rows*), the stream executes and never finishes. I tracked it for *5hrs 9 mins* and then had to stop it.

This is the piece of code from the Python script that is used to build the model. It runs forever and takes all the time:

hdb = hdbscan.hdbscan_.HDBSCAN(min_cluster_size=param['min_cluster_size'],
                               min_samples=param['min_samples'],
                               metric=param['metric'],
                               alpha=param['alpha'],
                               p=param['p'],
                               algorithm=param['algorithm'],
                               leaf_size=param['leaf_size'],
                               approx_min_span_tree=param['approx_min_span_tree'],
                               cluster_selection_method=param['cluster_selection_method'],
                               allow_single_cluster=param['allow_single_cluster'],
                               gen_min_span_tree=param['gen_min_span_tree'],
                               prediction_data=True).fit(X)

Below are the inputs we are feeding:

*min_cluster_size* = 50
*min_samples* = 5
*metric* = euclidean
*alpha* = 1.0
*p* = 1.5
*algorithm* = best
*leaf_size* = 30
*approx_min_span_tree* = True
*cluster_selection_method* = eom
*allow_single_cluster* = False
*gen_min_span_tree* = True

Can you help us with this? 5hrs+ seems like a lot of time. We need to optimise it.

Note: This happens if we choose the algorithm as *Best and 10% of input data*. With other algorithms it finishes in reasonable time, and if we choose Best as the algorithm but only pass *8% of input data, it finishes within 2 minutes*.

from hdbscan.

divya-agrawal3103 commented on June 3, 2024

Hi @jc-healy Thanks for the swift response.
I am attaching the input data file here, and as far as I can see it includes categorical columns (Gender, Marital Status).
Could you please try using this input to test?
Really appreciate your time!
sample.zip

divya-agrawal3103 commented on June 3, 2024

Hi @jc-healy, we are stuck and would really appreciate any input from your side to help resolve the problem.
Thank you in advance.

jc-healy commented on June 3, 2024

Hi there, I grabbed your data and filtered out the categorical columns (and your customer ID column) before hitting it with hdbscan and it took 3 to 5 minutes for me to cluster the 198,000 records.

Looking at your data, I have two recommendations for clustering. First, your numeric values are on vastly different scales, so Euclidean distance over this data will be dominated by your Income column, which is on a vastly different scale than "Members Within Household". To fix that I'd use something like a RobustScaler from sklearn.preprocessing to normalize your numeric columns. You can do fancier things but that's a pretty solid first thing to try.
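To illustrate the scale problem, here is a small sketch on made-up numbers (the two columns and all values are hypothetical, not taken from the attached file). It measures how much of the squared Euclidean distance between two rows comes from the income column, before and after `RobustScaler`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy rows: [income, members_within_household].
X = np.array([
    [30_000.0, 1.0],
    [30_500.0, 6.0],
    [60_000.0, 1.0],
    [61_000.0, 2.0],
    [90_000.0, 5.0],
])


def income_share(a, b):
    """Fraction of the squared Euclidean distance contributed by column 0."""
    squared_diffs = (a - b) ** 2
    return float(squared_diffs[0] / squared_diffs.sum())


# Rows 0 and 1 have similar incomes but very different household sizes.
raw_share = income_share(X[0], X[1])

# RobustScaler centres each column on its median and divides by its IQR.
X_scaled = RobustScaler().fit_transform(X)
scaled_share = income_share(X_scaled[0], X_scaled[1])

print(f"income share of distance, raw:    {raw_share:.4f}")
print(f"income share of distance, scaled: {scaled_share:.4f}")
```

On the raw values the income difference accounts for nearly all of the distance even though the two households differ mainly in size; after scaling, the household column carries most of it.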

I'd also one-hot encode your two categorical fields to convert them to numeric. I'd do this in a pipeline using sklearn's OneHotEncoder. Again you can get fancier but this is a good start.

I'd also suggest wrapping your preprocessing in a ColumnTransformer. It keeps track of your column transformations so they can be consistently applied to future data. Not necessary here but still a good habit.

Here is some sample code to get you started:

import hdbscan
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

# Load the data and drop the identifier and campaign columns before clustering.
data = pd.read_csv('sample.csv')
cluster_data = data.drop(['Customer ID', 'Campaign ID', 'Response'], axis=1)

categorical_features = ['Gender', 'Marital Status']
# Everything else is treated as numeric; a list comprehension keeps the column order stable.
numeric_features = [c for c in cluster_data.columns if c not in categorical_features]

# One-hot encode the categorical columns and robustly scale the numeric ones.
# sparse_threshold=0 forces a dense array so it can be wrapped in a DataFrame.
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), categorical_features),
    ('numeric', RobustScaler(), numeric_features)
], remainder='passthrough', sparse_threshold=0)
normalized = preprocessor.fit_transform(cluster_data)
normalized_df = pd.DataFrame(normalized, columns=preprocessor.get_feature_names_out())

# Cluster the preprocessed features.
model = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5).fit(normalized_df)

Cheers,
John

divya-agrawal3103 commented on June 3, 2024

Hi @jc-healy
Thanks a lot for the detailed analysis.
Will try to incorporate the suggestions.
Appreciate your time.

jc-healy commented on June 3, 2024

Closing this for now
