In this project, I applied K-Prototypes clustering and anomaly detection to discover patterns in the dataset, keeping in mind the benefits stakeholders could gain from the model.
💡 Instead of K-Means clustering, which is best suited to continuous variables, I used K-Prototypes clustering to handle the mix of categorical and continuous variables in my dataset.
Kickstarter is a platform where creators share their project visions with the communities that will come together to fund them.
By grouping the projects with a clustering algorithm, management could uncover the distinct characteristics of each cluster. Below are the characteristics of the 8 clusters.
Cluster 1: These projects showed a preference for swift project initiation, and the pledged amount tended to increase as the launch-to-deadline period extended.
Cluster 2: Projects with high fundraising goals and recent deadlines. Interestingly, some projects in this cluster managed to attract a high number of backers, even with lofty fundraising goals.
Cluster 3: This cluster stood out for its projects' ability to attract a significant number of backers.
Cluster 4: These projects struck a balance between goal and backers, with an average duration between project creation and launch.
Cluster 5: Projects in this cluster enjoyed high backers and pledged amounts. These projects maintained a relatively low goal and were picked by staff, potentially for the spotlight.
Cluster 6: This cluster embodied a preference for rapid project development and execution.
With a 24% success rate, Cluster 7 contains projects whose fundraising goals are high relative to the amount pledged.
Cluster 8, with a 29% success rate, contains the projects with the highest fundraising goals across clusters. Notably, these projects were created a long time ago, which suggests that older projects tended to have lower success rates.
Below is the detailed process for model building.
- 🧹 Data preprocessing:
  - Removed non-informative columns, including `id`, `name`, `name_len`, `blurb_len`, and `pledged`.
  - Created a new feature, `goal_usd`, by multiplying `goal` and `static_usd_rate`.
  - Replaced non-US countries with 'Non-US' and filled missing values in `category` with 'No Category'.
  - Dropped irrelevant columns, such as the original date columns and hour-specific columns.
- Anomaly detection:
  - Before running any clustering algorithm, I ran an Isolation Forest model to identify and remove anomalies. The model detected 1,344 anomalies with unusually high `goal_usd`, `backers_count`, and `usd_pledged`.
- Clustering model:
  - Applied K-Prototypes clustering to accommodate both the numerical and categorical features.
  - By comparing the K-Prototypes cost function across different values of K, I observed elbow points at K=8 and K=10, where the cost drops sharply. I chose K=8 because it gave clusters that were easier to interpret from a business perspective.
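The preprocessing and anomaly-detection steps above can be sketched roughly as follows. This is a minimal illustration on made-up toy data, not the project's actual script: the sample values, the planted outliers, and the `contamination` setting are all assumptions, and only the column names come from the steps listed above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy stand-in for the Kickstarter data: all values below are made up.
rng = np.random.default_rng(0)
n = 200
goal = rng.uniform(500, 20_000, n)
backers = rng.integers(0, 300, n)
usd_pledged = rng.uniform(0, 15_000, n)

# Plant three extreme projects so the detector has something to find.
goal[:3] = [5e6, 7e6, 9e6]
backers[:3] = [40_000, 50_000, 60_000]
usd_pledged[:3] = [4e6, 5e6, 6e6]

category = rng.choice(["Web", "Hardware", "Games"], n).astype(object)
category[::10] = None  # simulate missing categories

df = pd.DataFrame({
    "id": np.arange(n),
    "goal": goal,
    "static_usd_rate": rng.choice([1.0, 1.3, 0.75], n),
    "backers_count": backers,
    "usd_pledged": usd_pledged,
    "country": rng.choice(["US", "GB", "CA"], n),
    "category": category,
})

# Preprocessing: drop an identifier column, build goal_usd,
# collapse non-US countries, and fill missing categories.
df = df.drop(columns=["id"])
df["goal_usd"] = df["goal"] * df["static_usd_rate"]
df["country"] = df["country"].where(df["country"] == "US", "Non-US")
df["category"] = df["category"].fillna("No Category")

# Anomaly detection on the numeric features named in the write-up.
num_cols = ["goal_usd", "backers_count", "usd_pledged"]
iso = IsolationForest(contamination=0.02, random_state=0)  # assumed setting
labels = iso.fit_predict(df[num_cols])  # -1 = anomaly, 1 = inlier

# Keep only the inliers for the clustering step.
clean = df[labels == 1].reset_index(drop=True)
print(f"Removed {(labels == -1).sum()} of {len(df)} rows as anomalies")
```

A frame like `clean` could then be passed to `KPrototypes` from the `kmodes` package, fitting once per candidate K and recording the fitted model's `cost_` attribute to draw the elbow curve.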
In summary, the K-Prototypes clustering algorithm provided valuable insights into diverse project profiles, offering a comprehensive understanding of Kickstarter projects. Overall, successful projects have a high number of backers, high pledged amounts, and staff-pick status, and they ultimately secure a place on the Kickstarter spotlight page.
When it comes to evaluating clustering performance, it is hard to measure "distance" between categorical variables. Instead, it is more suitable to assess the "dissimilarity" between them. To address this, I attempted to use 'kprototypes.matching_dissim' and 'kprototypes.check_distance' to compute dissimilarity for categorical variables and distance for continuous ones. My goal was to obtain the silhouette score, but unfortunately, I encountered errors in the process.
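One possible workaround, sketched here under assumptions and not taken from the project's code, is to build the pairwise mixed dissimilarity matrix by hand and pass it to scikit-learn's `silhouette_score` with `metric="precomputed"`. The helper `mixed_dissimilarity`, the toy data, and the choice of `gamma` are all hypothetical. The dissimilarity follows the usual K-Prototypes form: squared Euclidean distance on the numeric columns plus `gamma` times the number of mismatched categorical columns.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def mixed_dissimilarity(X_num, X_cat, gamma):
    """Pairwise K-Prototypes-style dissimilarity (hypothetical helper).

    Squared Euclidean distance on numeric columns plus gamma times the
    count of categorical columns where two rows disagree.
    """
    num = ((X_num[:, None, :] - X_num[None, :, :]) ** 2).sum(axis=2)
    cat = (X_cat[:, None, :] != X_cat[None, :, :]).sum(axis=2)
    return num + gamma * cat


# Made-up toy data: two well-separated groups of three rows each.
X_num = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
])
X_cat = np.array([
    ["US", "Web"], ["US", "Web"], ["US", "Games"],
    ["Non-US", "Film"], ["Non-US", "Film"], ["Non-US", "Web"],
])
labels = np.array([0, 0, 0, 1, 1, 1])  # stand-in for cluster assignments

gamma = 0.5 * X_num.std()  # arbitrary illustrative weight for categorical mismatches
D = mixed_dissimilarity(X_num, X_cat, gamma)

# The matrix is symmetric with a zero diagonal, as "precomputed" requires.
score = silhouette_score(D, labels, metric="precomputed")
print(f"Silhouette score on mixed dissimilarity: {score:.3f}")
```

The full pairwise matrix grows quadratically with the number of rows, so on a large dataset this approach would likely need to be run on a random sample.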
- 👩‍💻 Python script for clustering models
- Entire dataset and Data Dictionary
- Data exploration and other charts