mariaorabi / machine-learning-products-prediction

This dataset from "ShufersalML" captures customer order history, aiming to predict future purchases using Python. It involves interconnected files that detail customer orders over time. The goal is to build a predictive model leveraging past order patterns to anticipate which products a user is likely to include in their next order.

adaboost gmm gradient-boosting juypter-notebook kmeans kmeans-clustering knn machine-learning pca prediction

Machine-learning-products-prediction

The Data

This dataset is a relational set of files that describes customers' orders over time from a grocery delivery company named "ShufersalML". The goal is to predict which products will be in a user's next order, based on their past orders.

The dataset contains a sample of over 3 million grocery orders from more than 200,000 users. For each user, between 4 and 100 of their orders are provided, along with the sequence of products purchased in each order, the week and hour of day the order was placed, and a relative measure of time between orders.

The dataset is anonymized and contains several files that are associated with each entity, such as customers, products, orders, aisles, and departments:

  • aisles.csv
  • departments.csv
  • products.csv - provides information on each product, such as its name, aisle ID, and department ID.
  • order_products__*.csv files - specify which products were purchased in each order.
    • order_products__prior.csv - contains past orders of customers. This file should be used for feature engineering, which involves creating new features from the raw data that can be used to train the model.
    • order_products__train_test.csv - contains the orders and products that should be used to train and test the model.
  • target.csv - contains the labels for each order and product combination of the train and test samples.
  • orders.csv - indicates to which set (prior, train, test) an order belongs, and extra details about the orders.
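As a rough illustration of how these files relate, the sketch below joins tiny in-memory stand-ins for orders.csv, order_products__prior.csv, and products.csv. All column names (`order_id`, `user_id`, `eval_set`, `product_id`, `reordered`, `aisle_id`, `department_id`) are assumptions based on the file descriptions above and may differ from the real files.

```python
import pandas as pd

# Tiny in-memory stand-ins for the CSV files; all column names are assumed.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [10, 10, 11],
    "eval_set": ["prior", "train", "prior"],
})
order_products_prior = pd.DataFrame({
    "order_id": [1, 1, 3],
    "product_id": [100, 101, 100],
    "reordered": [0, 1, 0],
})
products = pd.DataFrame({
    "product_id": [100, 101],
    "product_name": ["Milk", "Bread"],
    "aisle_id": [1, 2],
    "department_id": [1, 1],
})

# Keep only prior orders, then attach the purchased products and their details.
prior_orders = orders[orders["eval_set"] == "prior"]
prior = (
    prior_orders
    .merge(order_products_prior, on="order_id", how="inner")
    .merge(products, on="product_id", how="left")
)
```

With the real data, the stand-in frames would be replaced by `pd.read_csv(...)` calls on the corresponding files.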

Section A (Data Exploration and Visualization)

Exploring the data using tables, visualizations, and other relevant methods.

  • Each plot or table comes with a short description of key observations, limited to content that would be meaningful to a "ShufersalML" manager.
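One possible exploration table is the number of orders per hour of day. The sketch below uses made-up rows and an assumed column name, `order_hour_of_day`; it is only an example of the kind of summary this section describes, not the notebook's actual analysis.

```python
import pandas as pd

# Made-up sample of orders; "order_hour_of_day" is an assumed column name.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "order_hour_of_day": [9, 9, 14, 18, 9],
})

# Count orders per hour and find the busiest hour.
orders_per_hour = (
    orders.groupby("order_hour_of_day")["order_id"].count().sort_index()
)
busiest_hour = int(orders_per_hour.idxmax())
```

A bar chart of `orders_per_hour` (for example via `orders_per_hour.plot(kind="bar")`) would give the visual counterpart of this table.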

Section B (Data Pre-processing)

Preparing the data for the models in the next sections.

  • Performing feature engineering on the data using prior samples.
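A minimal sketch of what feature engineering on the prior samples could look like: per-user and per user-product aggregates. The feature names (`user_total_orders`, `up_order_count`, `up_order_rate`) and the column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Made-up prior purchases; column names are assumed.
prior = pd.DataFrame({
    "user_id":    [10, 10, 10, 10, 11],
    "order_id":   [1, 1, 2, 2, 3],
    "product_id": [100, 101, 100, 102, 100],
})

# Per-user feature: number of distinct prior orders.
user_orders = (
    prior.groupby("user_id")["order_id"].nunique().rename("user_total_orders")
)

# Per (user, product) feature: number of prior orders containing the product.
user_product = (
    prior.groupby(["user_id", "product_id"])["order_id"]
    .nunique()
    .rename("up_order_count")
    .reset_index()
)

# Combine both and derive the share of the user's orders containing the product.
features = user_product.merge(user_orders.reset_index(), on="user_id")
features["up_order_rate"] = (
    features["up_order_count"] / features["user_total_orders"]
)
```

Each row of `features` describes one user-product pair, which matches the granularity of the labels in target.csv.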

Section C

Using different machine learning models to predict each customer's next order, according to target.csv. The models predict the value of the binary column "Was_In_Order" for each order and product combination.
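The repository's topic tags mention AdaBoost, gradient boosting, and KNN. The sketch below compares these three classifiers on synthetic data generated with scikit-learn; the data, hyperparameters, and the choice of F1 as the metric are assumptions, not the author's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the engineered features (X) and "Was_In_Order" (y).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Default-ish settings, chosen only for illustration.
models = {
    "adaboost": AdaBoostClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test))
```

On the real data, `X` would hold the features built from the prior orders and `y` the "Was_In_Order" labels from target.csv.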

Section D (Clustering)

  • Clustering algorithms on the prior data to cluster the different customers.
  • Identifying the most important features that contribute to the differences between the clusters.
  • Clustering the different products and examining the differences between the clusters.
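The steps above can be sketched with K-Means and a Gaussian mixture model (both appear in the repository's topic tags). The customer features here are synthetic, the feature names are hypothetical, and ranking features by the spread of the standardized cluster centers is just one simple way to see which features separate the clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

feature_names = ["orders_per_month", "avg_basket_size", "reorder_rate"]

# Two synthetic customer groups that differ in the first two features only.
rng = np.random.default_rng(0)
group_a = rng.normal([2.0, 5.0, 0.5], [0.5, 1.0, 0.1], size=(100, 3))
group_b = rng.normal([8.0, 20.0, 0.5], [0.5, 2.0, 0.1], size=(100, 3))
X = StandardScaler().fit_transform(np.vstack([group_a, group_b]))

# Cluster the customers with both algorithms.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
kmeans_labels = kmeans.labels_
gmm_labels = gmm.predict(X)

# Features whose standardized cluster centers lie far apart contribute
# most to the difference between clusters.
center_spread = np.ptp(kmeans.cluster_centers_, axis=0)
ranking = [feature_names[i] for i in np.argsort(center_spread)[::-1]]
```

In this toy setup `reorder_rate` ends up last in `ranking`, because both groups were generated with the same mean for it.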

Section E (Clustering and Dimensions Reduction)

  • Reducing the dimensions of the customers' data (from Section D) using the PCA algorithm.
  • Showing, with a plot, which principal components explain the majority of the variance in the data, and identifying the features that are most strongly represented in each component.
  • Using the top principal components to perform clustering on the customers, using the same clustering algorithms as before.
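A minimal sketch of this workflow with scikit-learn, on synthetic data. The 90% cumulative-variance threshold, the number of clusters, and the use of K-Means are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic customer matrix: six features driven by two latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA and find how many components reach 90% cumulative explained variance.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)

# For each kept component, the index of the feature with the largest absolute loading.
top_feature_per_component = np.abs(pca.components_[:n_keep]).argmax(axis=1)

# Cluster the customers in the reduced space.
X_reduced = pca.transform(X_scaled)[:, :n_keep]
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
```

Plotting `cumulative` against the component index gives the explained-variance plot described above.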

Section F (Chi-Square test)

Performing a Chi-Square test on the train dataset to determine the relationship between the "Reordered" feature and the binary label "Was_In_Order". The Chi-Square test of independence is a statistical method for testing whether two categorical variables are independent, and it is commonly used in feature selection for classification tasks. This test helps determine whether the "Reordered" feature should be selected as a predictor for model training. I define a null hypothesis (that the two variables are independent) and reject or fail to reject it based on the Chi-Square statistic, at a significance level (alpha) of 0.05, which corresponds to a 95% confidence level.
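The test can be sketched with `scipy.stats.chi2_contingency` on a contingency table of the two columns. The counts below are invented purely to make the example runnable and say nothing about the project's actual result.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented rows for illustration only; not the project's data.
train = pd.DataFrame({
    "Reordered":    [1] * 40 + [1] * 10 + [0] * 15 + [0] * 35,
    "Was_In_Order": [1] * 40 + [0] * 10 + [1] * 15 + [0] * 35,
})

# 2x2 contingency table of observed counts.
contingency = pd.crosstab(train["Reordered"], train["Was_In_Order"])

# Null hypothesis: "Reordered" and "Was_In_Order" are independent.
chi2, p_value, dof, expected = chi2_contingency(contingency)

alpha = 0.05
reject_null = p_value < alpha
```

If `reject_null` is true, the data give evidence of an association between the two variables, which supports keeping "Reordered" as a predictor; otherwise the test does not provide evidence against independence.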
