A simple Flask web application powered by XGBoost that predicts the price of a bike from the given inputs
- Model accuracy of 92.5% has been achieved
- Generalised model with a train accuracy of 96% and a test accuracy of 92.5%
- Detailed Jupyter notebook with each and every step explained
- Usage of scikit-learn pipelines [ for training 11 different models iteratively ]
- Python
- Flask
- Pandas
- NumPy
- Seaborn
- Scikit-learn
- HTML
- CSS
- Pickle
The dataset is taken from Kaggle and can be downloaded from here.
It contains information about different bikes and their prices, and has 7857 rows and 8 columns.
Column name | Description |
---|---|
model_name | The name of the bike's model. It also embeds extra information such as the model year and engine details. |
model_year | The year in which the model was built. |
kms_driven | Total kilometers the bike has been driven. |
owner | The ownership history of the bike: "first owner" means the current owner bought the bike new, "second owner" means the bike was sold to this owner by the first owner, and so on. |
location | The location of the seller. |
mileage | Average mileage the bike gives, expressed in kilometers per liter of petrol (kmpl). |
power | Engine power in bhp (brake horsepower), the rate at which the torque generated by the engine is delivered to the wheels. Broadly, a lower-bhp bike pulls heavier loads at lower speeds, while a higher-bhp bike reaches higher top speeds. |
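For illustration, here is a two-row frame mirroring this schema. The values are invented, and `price` is assumed to be the eighth column (the prediction target); the real Kaggle file has 7857 rows.

```python
import pandas as pd

# Tiny stand-in for the Kaggle dataset: same 8-column schema, made-up values.
df = pd.DataFrame({
    "model_name": ["Bajaj Pulsar 150cc 2015", "Royal Enfield Classic 350cc 2018"],
    "model_year": [2015, 2018],
    "kms_driven": ["17,000 Km", "9,500 Km"],
    "owner": ["first owner", "second owner"],
    "location": ["delhi", "mumbai"],
    "mileage": ["65 kmpl", "35 kmpl"],
    "power": ["14 bhp", "19.1 bhp"],
    "price": [45000, 130000],  # assumed target column
})
print(df.shape)  # (2, 8)
```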
Clone the project

```bash
git clone https://github.com/RishiBakshii/Bike-Price-Predictor.git
```

Go to the project directory

```bash
cd path/to/the/cloned/repository
```

Install dependencies

```bash
pip install -r requirements.txt
```

Start the server

```bash
py app.py
```
- Data Cleaning and Pre-Processing
- Exploratory Data analysis
- Feature Engineering
- Modelling
- Deployment
- Columns like model_name, mileage, kms_driven and power contained raw, unstructured strings at the initial stage
- A lot of data cleaning was performed to bring them into a clean, usable form; all the cleaning functions were written from scratch
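One of those from-scratch helpers might look like this. The function name and the raw string format are illustrative, not taken verbatim from the notebook:

```python
import re

def clean_kms_driven(value: str) -> float:
    """Strip units and commas from strings like '17,000 Km' and return a float.

    Hypothetical example of a from-scratch cleaning helper; the actual raw
    formats in the dataset may differ.
    """
    match = re.search(r"[\d,.]+", str(value))
    if match is None:
        raise ValueError(f"no numeric part in {value!r}")
    return float(match.group().replace(",", ""))

print(clean_kms_driven("17,000 Km"))  # 17000.0
```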
- All values given in other units, such as HP and kW, were converted to bhp in the power column
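A sketch of such a conversion. The factor 1 kW ≈ 1.341 bhp is standard; treating HP as interchangeable with bhp is an assumption about the notebook's choice, not a documented detail:

```python
def to_bhp(power: str) -> float:
    """Convert a power string such as '12 kW' or '20 HP' to bhp.

    Illustrative sketch: assumes 1 kW = 1.341 bhp and treats HP
    as equivalent to bhp.
    """
    raw_value, unit = power.split()
    value = float(raw_value)
    if unit.lower() == "kw":
        return round(value * 1.341, 2)
    return round(value, 2)  # 'hp' and 'bhp' both treated as bhp

print(to_bhp("12 kW"))  # 16.09
```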
- Created column transformers for encoding ( OneHotEncoder and OrdinalEncoder ) and scaling ( MinMaxScaler ) of the data
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

encoder_transformer = ColumnTransformer([
    # on scikit-learn >= 1.2 use sparse_output=False instead of sparse=False
    ('onehotencoding', OneHotEncoder(sparse=False, handle_unknown='ignore', drop='first'), [0, 4]),
    ('ordinalencoder', OrdinalEncoder(categories=[['fourth owner or more', 'third owner',
                                                   'second owner', 'first owner']],
                                      handle_unknown='error'), [3]),
], remainder='passthrough')

scaler_transformer = ColumnTransformer([
    # step renamed: it applies MinMaxScaler, not StandardScaler
    ('minmaxscaler', MinMaxScaler(), [1, 2, 5, 6]),
], remainder='passthrough')
```
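The ordinal category order above means ownership is encoded 0 to 3, with "first owner" mapped to the highest code. A quick standalone check of the encoder outside the ColumnTransformer:

```python
from sklearn.preprocessing import OrdinalEncoder

# Same category order as in the ColumnTransformer above:
# "fourth owner or more" -> 0 ... "first owner" -> 3
enc = OrdinalEncoder(categories=[["fourth owner or more", "third owner",
                                  "second owner", "first owner"]])
codes = enc.fit_transform([["first owner"], ["second owner"], ["third owner"]])
print(codes.ravel())  # [3. 2. 1.]
```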
- Created pipelines for the iterative training of 11 different models

```python
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

pipeline_lr = Pipeline([('encoder_transformer', encoder_transformer),
                        ('scaler_transformer', scaler_transformer),
                        ('linear', LinearRegression())])
pipeline_las = Pipeline([('encoder_transformer', encoder_transformer),
                         ('scaler_transformer', scaler_transformer),
                         ('lasso', Lasso())])
pipeline_ridge = Pipeline([('encoder_transformer', encoder_transformer),
                           ('scaler_transformer', scaler_transformer),
                           ('ridge', Ridge())])
pipeline_knn = Pipeline([('encoder_transformer', encoder_transformer),
                         ('scaler_transformer', scaler_transformer),
                         ('knn', KNeighborsRegressor())])
pipeline_dt = Pipeline([('encoder_transformer', encoder_transformer),
                        ('scaler_transformer', scaler_transformer),
                        ('dt', DecisionTreeRegressor())])
pipeline_svm = Pipeline([('encoder_transformer', encoder_transformer),
                         ('scaler_transformer', scaler_transformer),
                         ('svm', SVR())])
pipeline_rf = Pipeline([('encoder_transformer', encoder_transformer),
                        ('scaler_transformer', scaler_transformer),
                        ('rf', RandomForestRegressor())])
pipeline_gbr = Pipeline([('encoder_transformer', encoder_transformer),
                         ('scaler_transformer', scaler_transformer),
                         ('gbr', GradientBoostingRegressor())])
pipeline_abr = Pipeline([('encoder_transformer', encoder_transformer),
                         ('scaler_transformer', scaler_transformer),
                         ('abr', AdaBoostRegressor())])
pipeline_etr = Pipeline([('encoder_transformer', encoder_transformer),
                         ('scaler_transformer', scaler_transformer),
                         ('etr', ExtraTreesRegressor())])
pipeline_xgb = Pipeline([('encoder_transformer', encoder_transformer),
                         ('scaler_transformer', scaler_transformer),
                         ('xgb', XGBRegressor())])
```
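The pipelines are then fitted and compared in a single loop. A minimal, self-contained sketch of that loop, using synthetic data and just two of the eleven models (the real features and train/test split live in the notebook):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the bike dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two of the eleven pipelines, keyed by name so one loop handles them all
pipelines = {
    "linear": Pipeline([("scaler", MinMaxScaler()), ("model", LinearRegression())]),
    "ridge": Pipeline([("scaler", MinMaxScaler()), ("model", Ridge())]),
}

scores = {}
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)  # R^2 on the held-out split
    print(f"{name}: R^2 = {scores[name]:.3f}")
```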
- XGBoost was the most generalised model, with the highest accuracy and the smallest difference between its train and test scores (the best balance of bias and variance)
- Residuals are densely populated in the range of -100 to 100
- Some outliers are present at the lower end
- A very strong linear relationship can be seen between the actual and predicted values
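The residual check behind these observations is just actual minus predicted. A tiny numeric sketch, with illustrative values in the target's units:

```python
import numpy as np

# Illustrative actual vs. predicted prices (not real model outputs)
actual = np.array([450.0, 1309.0, 794.0])
predicted = np.array([438.0, 1391.0, 760.0])

residuals = actual - predicted
print(residuals)  # [ 12. -82.  34.]

# Residuals within the dense -100..100 band noted above
print(bool(((residuals > -100) & (residuals < 100)).all()))  # True
```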
- This project is currently deployed on Render and can be visited here: Bike Price Predictor