Comments (6)
from hdbscan.
Hi @jc-healy Thanks for the swift response.
I am attaching the input data file here. As far as I can see, it includes categorical columns (Gender, Marital Status).
Could you please test with this input?
Really appreciate your time!
sample.zip
Hi @jc-healy We are stuck and would really appreciate any input from your side to help resolve the problem.
Thank you in advance.
Hi there, I grabbed your data and filtered out the categorical columns (and your customer ID column) before running hdbscan on it. Clustering the 198,000 records took 3 to 5 minutes for me.
Looking at your data, I have two recommendations for clustering. First, your numeric values are on vastly different scales, so Euclidean distance over this data will be dominated by your Income column, which is on a vastly larger scale than "Members Within Household". To fix that I'd use something like RobustScaler from sklearn.preprocessing to normalize your numeric columns. You can do fancier things, but that's a pretty solid first thing to try.
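To make the scale problem concrete, here is a minimal stdlib-only sketch. The rows are made-up values, not taken from the attached file, and `robust_scale` and `income_share` are hypothetical helpers. `robust_scale` only mimics the idea behind RobustScaler (subtract the median, divide by the interquartile range); it is not sklearn's implementation.

```python
import statistics

# Made-up (income, household_members) rows, for illustration only.
rows = [
    (30_000.0, 1.0),
    (32_000.0, 6.0),
    (90_000.0, 1.0),
    (60_000.0, 3.0),
    (45_000.0, 2.0),
]


def robust_scale(column):
    """Center a column on its median and divide by its interquartile range."""
    q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
    median = statistics.median(column)
    return [(value - median) / (q3 - q1) for value in column]


def income_share(a, b):
    """Fraction of the squared Euclidean distance between a and b due to income."""
    d_income = (a[0] - b[0]) ** 2
    d_members = (a[1] - b[1]) ** 2
    return d_income / (d_income + d_members)


incomes, members = zip(*rows)
scaled_rows = list(zip(robust_scale(incomes), robust_scale(members)))

# Rows 0 and 1 have similar incomes but very different household sizes.
raw_share = income_share(rows[0], rows[1])
scaled_share = income_share(scaled_rows[0], scaled_rows[1])
print(f"income share of squared distance, raw: {raw_share:.5f}")
print(f"income share of squared distance, scaled: {scaled_share:.5f}")
```

On the raw values almost all of the distance between the first two rows comes from income, even though the household sizes differ a lot. After scaling, the household difference is what separates them.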
Second, I'd one-hot encode your two categorical fields to convert them to numeric. I'd do this in a pipeline using sklearn's OneHotEncoder. Again, you can get fancier, but this is a good start.
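If one-hot encoding is unfamiliar, the sketch below shows roughly what it produces. It is a hand-rolled illustration with a hypothetical `one_hot` helper and made-up category values, not what OneHotEncoder does internally: each distinct category becomes its own 0/1 column.

```python
# Hypothetical category values; the levels in the real file may differ.
genders = ["F", "M", "F", "M"]


def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded


categories, encoded = one_hot(genders)
for value, row in zip(genders, encoded):
    print(value, row)
```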
I'd also suggest wrapping your preprocessing in a ColumnTransformer. It keeps track of your column transformations so they can be applied consistently to future data. Not necessary here, but still a good habit.
Here is some sample code to get you started:
import hdbscan
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

# Load the data and drop the identifier and outcome columns before clustering.
data = pd.read_csv('sample.csv')
cluster_data = data.drop(['Customer ID', 'Campaign ID', 'Response'], axis=1)

categorical_features = ['Gender', 'Marital Status']
# Treat every remaining column as numeric, keeping the original column order.
numeric_features = [c for c in cluster_data.columns if c not in categorical_features]

# One-hot encode the categorical columns and robustly scale the numeric ones.
# sparse_threshold=0 forces a dense array so it can go straight into a DataFrame.
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), categorical_features),
    ('numeric', RobustScaler(), numeric_features)
], remainder='passthrough', sparse_threshold=0)

normalized = preprocessor.fit_transform(cluster_data)
normalized_df = pd.DataFrame(normalized, columns=preprocessor.get_feature_names_out())

model = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5).fit(normalized_df)
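Once the model is fitted, `model.labels_` holds one integer label per row, with -1 marking points hdbscan treats as noise. A quick way to sanity-check the result is to count clusters and the noise fraction. The sketch below uses a made-up label list as a stand-in for `model.labels_`, and `summarize` is a hypothetical helper:

```python
from collections import Counter

# Hypothetical stand-in for model.labels_; real values come from the fit above.
labels = [0, 0, 1, -1, 1, 1, -1, 0, 2, 2]


def summarize(labels):
    """Count clusters, cluster sizes, and the fraction of noise points (-1)."""
    counts = Counter(int(label) for label in labels)
    noise = counts.pop(-1, 0)
    return {
        "n_clusters": len(counts),
        "noise_fraction": noise / len(labels),
        "cluster_sizes": dict(sorted(counts.items())),
    }


summary = summarize(labels)
print(summary)
```

If most points end up as noise or in one giant cluster, that is usually a sign to revisit the scaling or the min_cluster_size and min_samples settings.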
Cheers,
John
Hi @jc-healy
Thanks a lot for the detailed analysis.
Will try to incorporate the suggestions.
Appreciate your time.
Closing this for now