
knayyar0416 / unsupervised-model-clustering


Built an unsupervised clustering model using K-Prototypes clustering and anomaly detection algorithms to discover patterns in a dataset of 13,000+ projects.

Home Page: https://www.kickstarter.com

Python 100.00%
anomaly-detection k-means-clustering k-prototypes unsupervised-learning data-preprocessing sklearn

unsupervised-model-clustering's Introduction

Exploring distinctive features of various projects using an unsupervised clustering algorithm

In this project, I applied K-Prototypes clustering and anomaly detection to discover patterns in the dataset, keeping in mind the benefits stakeholders could get from the model.

💡 Instead of using K-Means clustering, which is best suited to continuous variables, I used K-Prototypes clustering to handle the mix of categorical and continuous variables in my dataset.
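As a rough illustration of why K-Prototypes fits mixed data, the sketch below computes the kind of dissimilarity the algorithm is built on: squared Euclidean distance on the numeric features plus a weighted count of mismatching categories. It is not taken from this repository; the function name, the feature values, and the weight `gamma=0.5` are assumptions chosen for the example.

```python
from typing import Sequence


def mixed_dissimilarity(
    num_a: Sequence[float],
    num_b: Sequence[float],
    cat_a: Sequence[str],
    cat_b: Sequence[str],
    gamma: float = 1.0,
) -> float:
    """K-Prototypes-style dissimilarity between two records.

    Numeric part: squared Euclidean distance.
    Categorical part: number of mismatching categories, weighted by gamma.
    """
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    categorical = sum(x != y for x, y in zip(cat_a, cat_b))
    return numeric + gamma * categorical


# Two hypothetical projects: (scaled goal, scaled backers) and (country, category).
project_a = ([0.2, 0.5], ["US", "Hardware"])
project_b = ([0.4, 0.5], ["Non-US", "Hardware"])

d = mixed_dissimilarity(
    project_a[0], project_b[0], project_a[1], project_b[1], gamma=0.5
)
print(round(d, 4))
```

K-Means only has the numeric term, so categorical columns would have to be dropped or encoded as numbers; the `gamma` weight is what lets the two kinds of feature share one cost function.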

🌐 About Kickstarter

Kickstarter is a platform where creators share their project visions with the communities that will come together to fund them.

🏆 What did I discover?

By grouping the projects with a clustering algorithm, management could uncover the distinct characteristics of each cluster. Below are the characteristics of the 8 clusters.

⚖️ Moderate Goals, Quick Launchers

These projects showed a preference for swift project initiation, and the pledged amount tended to increase as the launch-to-deadline period extended.

📆 High Goals, Recent Projects

This cluster contains projects with high fundraising goals and recent deadlines. Interestingly, some projects in this cluster managed to attract a high number of backers, even with lofty fundraising goals.

👥 Backer-Friendly Projects

This cluster stood out for its projects' ability to attract a significant number of backers.

💡 Modest Achievers

These projects struck a balance between goal and backers, with an average duration between project creation and launch.

👍 Well-Supported Initiatives

Projects in this cluster enjoyed high backer counts and pledged amounts. These projects maintained a relatively low goal and were picked by staff, potentially for the spotlight.

🚀 Rapid Projects

This cluster embodied a preference for rapid project development and execution.

🎯 Ambitious Newcomers

With a 24% success rate, Cluster 7 features projects whose fundraising goals are high relative to the amount pledged.

🕰️ Long-Term High-Stakes

Cluster 8, with a 29% success rate, represents projects with the highest fundraising goals across clusters. Notably, these projects were created a long time ago, suggesting a lower success rate for older projects.

🛠️ How did I achieve this?

Below is the detailed process for model building.

  1. 🧹 Data preprocessing:
    • Removed non-informative columns including 𝑖𝑑, π‘›π‘Žπ‘šπ‘’, π‘›π‘Žπ‘šπ‘’_𝑙𝑒𝑛, π‘π‘™π‘’π‘Ÿπ‘_𝑙𝑒𝑛, and 𝑝𝑙𝑒𝑑𝑔𝑒𝑑.
    • Created a new feature, π‘”π‘œπ‘Žπ‘™_𝑒𝑠𝑑, by multiplying π‘”π‘œπ‘Žπ‘™ and π‘ π‘‘π‘Žπ‘‘π‘–π‘_𝑒𝑠𝑑_π‘Ÿπ‘Žπ‘‘π‘’.
    • Replaced non-US countries with 'Non-US' and filled missing values in π‘π‘Žπ‘‘π‘’π‘”π‘œπ‘Ÿπ‘¦ with 'No Category'.
    • Dropped irrelevant columns like original date columns and hour-specific columns.
  2. 🔍 Anomaly detection:
    • Before running any clustering algorithm, I ran an Isolation Forest model to identify and remove anomalies. The model detected 1,344 anomalies with unusually high goal_usd, backers_count and usd_pledged.
  3. 🤖 Clustering model:
    • Applied K-Prototypes clustering to accommodate both the numerical and categorical features.
    • By computing the K-Prototypes cost function for different values of K, I observed that K=8 and K=10 are elbow points at which the cost drops drastically. I chose K=8 since it gave better business interpretations.
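The preprocessing and anomaly-detection steps above might look roughly like the following sketch. It is an illustration under assumptions, not the repository's actual code: the data is synthetic, the `country` column name and the helper names `preprocess` and `remove_anomalies` are hypothetical, the Isolation Forest uses scikit-learn's default settings, and the date and hour columns are left out because their names are not given here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def preprocess(df):
    """Apply the preprocessing steps described above to a raw projects frame."""
    df = df.copy()
    # New feature: fundraising goal converted to USD.
    df["goal_usd"] = df["goal"] * df["static_usd_rate"]
    # Collapse every non-US country into one level ("country" is an assumed name).
    df["country"] = df["country"].where(df["country"] == "US", "Non-US")
    # Give missing categories an explicit label.
    df["category"] = df["category"].fillna("No Category")
    # Drop non-informative columns; errors="ignore" tolerates absent ones.
    drop_cols = ["id", "name", "name_len", "blurb_len", "pledged"]
    return df.drop(columns=drop_cols, errors="ignore")


def remove_anomalies(df, numeric_cols, seed=0):
    """Flag outliers with an Isolation Forest; return (inliers, outliers)."""
    model = IsolationForest(random_state=seed)  # default contamination="auto"
    labels = model.fit_predict(df[numeric_cols])  # -1 = anomaly, 1 = inlier
    return df[labels == 1], df[labels == -1]


# Synthetic stand-in for the Kickstarter data (all values are made up).
rng = np.random.default_rng(0)
n = 200
goal = rng.uniform(1_000, 20_000, n)
usd_pledged = rng.uniform(0, 20_000, n)
backers = rng.integers(0, 300, n)
# Plant one extreme project so the forest has something obvious to isolate.
goal[0], usd_pledged[0], backers[0] = 5_000_000, 2_000_000, 50_000

raw = pd.DataFrame({
    "id": np.arange(n),
    "name": [f"project {i}" for i in range(n)],
    "goal": goal,
    "static_usd_rate": 1.0,
    "pledged": usd_pledged,
    "usd_pledged": usd_pledged,
    "backers_count": backers,
    "country": rng.choice(["US", "GB", "CA"], n),
    "category": [None if i % 10 == 0 else "Hardware" for i in range(n)],
})

clean = preprocess(raw)
inliers, outliers = remove_anomalies(
    clean, ["goal_usd", "backers_count", "usd_pledged"]
)
print(len(inliers), "inliers,", len(outliers), "flagged as anomalies")
```

Only the inliers would then be passed on to the clustering step, so that a handful of extreme projects do not dominate the numeric part of the cost.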

🎉 Conclusion

In summary, the K-Prototypes clustering algorithm provided valuable insights into diverse project profiles, offering a comprehensive understanding of Kickstarter projects. Overall, successful projects have a high number of backers, high pledged amounts, and are staff-picked, ultimately securing a place on Kickstarter's spotlight page.

❗️ Challenges faced

When it comes to measuring clustering performance, it is hard to define a ‘distance’ between categorical variables. Instead, it is more suitable to assess the "dissimilarity" between categorical variables. To address this, I attempted to use 'kprototypes.matching_dissim' and 'kprototypes.check_distance' to compute dissimilarity for categorical variables and distance for continuous ones. My goal was to obtain the silhouette score, but unfortunately, I encountered errors in the process.
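One possible workaround, sketched here under assumptions and not tested against this project's data, is to build the pairwise mixed dissimilarity matrix by hand and pass it to scikit-learn's `silhouette_score` with `metric="precomputed"`. The helper `mixed_distance_matrix`, the toy data, the cluster labels, and `gamma=0.5` are all hypothetical; the sketch does not use the kprototypes helpers mentioned above.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def mixed_distance_matrix(X_num, X_cat, gamma=1.0):
    """Pairwise K-Prototypes-style dissimilarity for mixed data.

    X_num: (n, p) array of scaled numeric features.
    X_cat: (n, q) array of categorical labels.
    """
    X_num = np.asarray(X_num, dtype=float)
    X_cat = np.asarray(X_cat)
    # Squared Euclidean distance on the numeric block.
    diff = X_num[:, None, :] - X_num[None, :, :]
    num_part = (diff ** 2).sum(axis=2)
    # Simple matching dissimilarity: count of mismatching categories.
    cat_part = (X_cat[:, None, :] != X_cat[None, :, :]).sum(axis=2)
    return num_part + gamma * cat_part


# Toy data: two well-separated groups of three projects each.
X_num = np.array([
    [0.00, 0.10], [0.10, 0.00], [0.05, 0.05],
    [5.00, 5.10], [5.10, 5.00], [5.05, 5.05],
])
X_cat = np.array([
    ["US", "Web"], ["US", "Web"], ["US", "Web"],
    ["Non-US", "Hardware"], ["Non-US", "Hardware"], ["Non-US", "Web"],
])
labels = np.array([0, 0, 0, 1, 1, 1])  # stand-in for K-Prototypes output

D = mixed_distance_matrix(X_num, X_cat, gamma=0.5)
score = silhouette_score(D, labels, metric="precomputed")
print(f"silhouette: {score:.3f}")
```

The matrix is n by n, so for a dataset of this size it may be more practical to compute the score on a random sample of projects. The result also depends on `gamma`, which should match the weight used when fitting the clusters.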

🔗 Supporting files

