- Created a tool that estimates data science salaries (MAE ~ $11K) to help data scientists negotiate their income when they get a job.
- Scraped over 1000 job descriptions from Glassdoor using Python and Selenium.
- Engineered features from the text of each job description to quantify the value companies put on Python, Excel, AWS, and Spark.
- Optimized Linear, Lasso, and Random Forest regressors using GridSearchCV to reach the best model.
- Built a client-facing API using Flask.
Python Version: 3.9
Packages: pandas, numpy, sklearn, matplotlib, seaborn, selenium, flask, json, pickle
For Web Framework Requirements: `pip install -r requirements.txt`
Scraper GitHub: https://github.com/arapfaik/scraping-glassdoor-selenium
Scraper Article: https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905
Flask Productionization: https://towardsdatascience.com/productionize-a-machine-learning-model-with-flask-and-heroku-8201260503d2
YouTube >> https://www.youtube.com/playlist?list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t
Ken Jee's GitHub Profile >>> https://github.com/PlayingNumbers
❤️ ❤️ ❤️ And a big thank you to Ken Jee > this is my first end-to-end project 😊 😊 ❤️ ❤️ ❤️
Tweaked the web scraper GitHub repo (above) to scrape 1000 job postings from glassdoor.com. With each job, we got the following fields (a scraping sketch follows the list):
- Job title
- Salary Estimate
- Job Description
- Rating
- Company
- Location
- Company Headquarters
- Company Size
- Company Founded Date
- Type of Ownership
- Industry
- Sector
- Revenue
- Competitors
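The scraper follows the linked repo and tutorial; a minimal sketch of the loop is below. The CSS selectors and output file name are placeholders (Glassdoor's real class names change frequently), and the waits and popup handling the real scraper needs are omitted.

```python
# Minimal scraping sketch -- selectors are hypothetical placeholders,
# not Glassdoor's actual class names.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.glassdoor.com/Job/data-scientist-jobs-SRCH_KO0,14.htm")

jobs = []
for card in driver.find_elements(By.CSS_SELECTOR, ".jobCard"):  # placeholder selector
    card.click()  # open the details pane for this posting
    jobs.append({
        "Job Title": card.find_element(By.CSS_SELECTOR, ".jobTitle").text,
        "Salary Estimate": card.find_element(By.CSS_SELECTOR, ".salaryEstimate").text,
        "Job Description": driver.find_element(By.CSS_SELECTOR, ".jobDescription").text,
        "Rating": card.find_element(By.CSS_SELECTOR, ".rating").text,
        "Company Name": card.find_element(By.CSS_SELECTOR, ".employerName").text,
        "Location": card.find_element(By.CSS_SELECTOR, ".location").text,
    })

pd.DataFrame(jobs).to_csv("glassdoor_jobs.csv", index=False)
driver.quit()
```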
After scraping the data, I needed to clean it up so that it was usable for our model. I made the following changes and created the following variables (a pandas sketch follows the list):
- Parsed numeric data out of salary
- Made columns for employer-provided salary and hourly wages
- Removed rows without salary
- Parsed rating out of company text
- Made a new column for company state
- Added a column for whether the job was at the company’s headquarters
- Transformed founded date into age of company
- Made columns for whether different skills were listed in the job description:
  - Python
  - R
  - Excel
  - AWS
  - Spark
- Column for simplified job title and seniority
- Column for description length
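In pandas, the core of those steps looks roughly like this. Column names follow the scraped fields above but are simplified, the `-1` sentinel for missing values is an assumption, and edge cases (and the rating parse) are omitted.

```python
import pandas as pd

df = pd.read_csv("glassdoor_jobs.csv")

# Drop rows without a salary estimate ("-1" sentinel assumed for missing values)
df = df[df["Salary Estimate"].notna() & (df["Salary Estimate"] != "-1")]

# Flag hourly wages and employer-provided salaries
df["hourly"] = df["Salary Estimate"].str.contains("per hour", case=False).astype(int)
df["employer_provided"] = df["Salary Estimate"].str.contains("employer provided", case=False).astype(int)

# Parse the numeric range out of strings like "$53K-$91K (Glassdoor est.)"
salary = (df["Salary Estimate"].str.split("(").str[0].str.lower()
          .str.replace("per hour", "", regex=False)
          .str.replace("employer provided salary:", "", regex=False)
          .str.replace("$", "", regex=False)
          .str.replace("k", "", regex=False))
df["min_salary"] = salary.str.split("-").str[0].str.strip().astype(int)
df["max_salary"] = salary.str.split("-").str[1].str.strip().astype(int)
df["avg_salary"] = (df["min_salary"] + df["max_salary"]) / 2

# Company state and whether the posting is at the headquarters
df["job_state"] = df["Location"].str.split(",").str[-1].str.strip()
df["same_state"] = (df["Location"] == df["Headquarters"]).astype(int)

# Founded date -> company age (-1 kept as the missing-value sentinel)
df["age"] = df["Founded"].apply(lambda x: x if x < 1 else pd.Timestamp.now().year - x)

# Skill flags from the job description text (a bare "r" needs word-boundary care)
desc = df["Job Description"].str.lower()
for col, pattern in {"python": "python", "R": r"\br\b", "excel": "excel",
                     "aws": "aws", "spark": "spark"}.items():
    df[col] = desc.str.contains(pattern, regex=True).astype(int)

df["desc_len"] = df["Job Description"].str.len()
```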
I looked at the distributions of the data and the value counts for the various categorical variables. Below are a few highlights from the pivot tables.
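The pivot tables themselves are plain pandas; for example (using the cleaned DataFrame `df` from above, with `job_simp` and `seniority` standing in for the simplified-title and seniority columns):

```python
import pandas as pd

# Average salary by simplified job title and seniority
print(pd.pivot_table(df, index=["job_simp", "seniority"], values="avg_salary"))

# Average salary by state, highest first
print(pd.pivot_table(df, index="job_state", values="avg_salary")
        .sort_values("avg_salary", ascending=False))
```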
First, I transformed the categorical variables into dummy variables. I also split the data into train and test sets with a test size of 20%.
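In sketch form (`df_model` standing in for the subset of cleaned columns used for modeling):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the categorical columns; avg_salary is the target
df_dum = pd.get_dummies(df_model)   # df_model: modeling subset of the cleaned DataFrame
X = df_dum.drop("avg_salary", axis=1)
y = df_dum["avg_salary"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```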
I tried three different models and evaluated them using Mean Absolute Error (MAE). I chose MAE because it is relatively easy to interpret and outliers aren’t particularly bad for this type of model. A sketch of the comparison follows the model list below.
- Multiple Linear Regression – baseline for the model
- Lasso Regression – because of the sparse data from the many categorical variables, I thought a regularized regression like lasso would be effective.
- Random Forest – again, with the sparsity associated with the data, I thought that this would be a good fit.
The Random Forest model far outperformed the other approaches on the test and validation sets.
- Random Forest: MAE = 11.120102768456377
- Linear Regression: MAE = 3919437.2410207116
- Ridge Regression: MAE = 11.120102768456377
In this step, I built a Flask API endpoint hosted on a local web server by following along with the TDS tutorial in the reference section above. The API endpoint takes in a request with a list of values from a job listing and returns an estimated salary.
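A minimal sketch of the endpoint is below; the file path, model-file layout, and request format are assumptions based on the tutorial, not necessarily what this repo uses.

```python
# flask_api/app.py -- minimal prediction endpoint sketch
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

# Pickled model produced by the training step (path and dict layout are assumptions)
with open("models/model_file.p", "rb") as f:
    model = pickle.load(f)["model"]

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"input": [feature values in training column order]}
    data = request.get_json()["input"]
    x = np.array(data).reshape(1, -1)
    prediction = float(model.predict(x)[0])
    return jsonify({"response": prediction})

if __name__ == "__main__":
    app.run(debug=True)
```

It can then be exercised locally with a POST request, e.g. `curl -X POST -H "Content-Type: application/json" -d '{"input": [...]}' http://127.0.0.1:5000/predict`.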