
Comments (6)

jc-healy avatar jc-healy commented on June 3, 2024

from hdbscan.

divya-agrawal3103 avatar divya-agrawal3103 commented on June 3, 2024

Hi @jc-healy Thanks for the swift response.
I am attaching the input data file here. As far as I can see, it includes categorical columns (Gender, Marital Status).
Could you please try testing with this input?
Really appreciate your time!
sample.zip


divya-agrawal3103 avatar divya-agrawal3103 commented on June 3, 2024

Hi @jc-healy We are stuck and would really appreciate any input from your side to help resolve the problem.
Thank you in advance.


jc-healy avatar jc-healy commented on June 3, 2024

Hi there, I grabbed your data and filtered out the categorical columns (and your customer ID column) before hitting it with hdbscan and it took 3 to 5 minutes for me to cluster the 198,000 records.

Looking at your data, I have two recommendations for clustering. First, your numeric values are on vastly different scales, so Euclidean distance over this data will be dominated by your Income column, which is on a vastly different scale than "Members Within Household". To fix that, I'd use something like RobustScaler from sklearn.preprocessing to normalize your numeric columns. You can do fancier things, but that's a pretty solid first thing to try.
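To make the scale problem concrete, here is a minimal sketch with made-up numbers (the values are hypothetical and not taken from sample.csv). It measures how much of the squared Euclidean distance between two rows comes from the Income column, before and after RobustScaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical rows: [Income, Members Within Household]
X = np.array([
    [30_000.0, 1.0],
    [45_000.0, 2.0],
    [60_000.0, 3.0],
    [75_000.0, 4.0],
    [90_000.0, 5.0],
])


def income_share(data):
    """Fraction of the squared distance between the first and last row
    that is contributed by column 0 (Income)."""
    diff_sq = (data[0] - data[-1]) ** 2
    return diff_sq[0] / diff_sq.sum()


raw_share = income_share(X)

# RobustScaler centers each column on its median and divides by its IQR.
scaled = RobustScaler().fit_transform(X)
scaled_share = income_share(scaled)

print(f"Income share of squared distance, raw:    {raw_share:.6f}")
print(f"Income share of squared distance, scaled: {scaled_share:.6f}")
```

On the raw values, Income accounts for essentially all of the distance. After scaling, both columns contribute equally for this toy data.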

I'd also one-hot encode your two categorical fields to convert them to numeric. I'd do this in a pipeline using sklearn's OneHotEncoder. Again, you can get fancier, but this is a good start.

As general good practice, I'd suggest wrapping your preprocessing in a ColumnTransformer. It keeps track of your column transformations so they can be applied consistently to future data. Not necessary here, but still a good habit.

Here is some sample code to get you started:

import hdbscan
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

# Load the data and drop the identifier/outcome columns before clustering.
data = pd.read_csv('sample.csv')
cluster_data = data.drop(['Customer ID', 'Campaign ID', 'Response'], axis=1)

categorical_features = ['Gender', 'Marital Status']
# Keep the original column order (a set difference does not guarantee it).
numeric_features = [col for col in cluster_data.columns if col not in categorical_features]

# One-hot encode the categorical columns and robustly scale the numeric ones.
# sparse_threshold=0 forces a dense array so it can be wrapped in a DataFrame.
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), categorical_features),
    ('numeric', RobustScaler(), numeric_features)
], remainder='passthrough', sparse_threshold=0)
normalized = preprocessor.fit_transform(cluster_data)
normalized_df = pd.DataFrame(normalized, columns=preprocessor.get_feature_names_out())

model = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5).fit(normalized_df)

Cheers,
John


divya-agrawal3103 avatar divya-agrawal3103 commented on June 3, 2024

Hi @jc-healy
Thanks a lot for the detailed analysis.
Will try to incorporate the suggestions.
Appreciate your time.


jc-healy avatar jc-healy commented on June 3, 2024

Closing this for now
