- Created a tool that estimates data science salaries (MAE ~ $11K) to help data scientists negotiate their income when they get a job.
- Scraped over 1000 job descriptions from Glassdoor using Python and Selenium.
- Engineered features from the text of each job description to quantify the value companies put on Python, Excel, AWS, and Spark.
- Optimized Linear, Lasso, and Random Forest regressors using GridSearchCV to reach the best model.
- Built a client-facing API using Flask.
Python Version: 3.9
Packages: pandas, numpy, sklearn, matplotlib, seaborn, selenium, flask, json, pickle
For Web Framework Requirements: `pip install -r requirements.txt`
Scraper GitHub: https://github.com/arapfaik/scraping-glassdoor-selenium
Scraper Article: https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905
Flask Productionization: https://towardsdatascience.com/productionize-a-machine-learning-model-with-flask-and-heroku-8201260503d2
YouTube >> https://www.youtube.com/playlist?list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t
Ken Jee's GitHub Profile >>> https://github.com/PlayingNumbers
❤️ ❤️ ❤️ And a big thank you to Ken Jee > this is my first end-to-end project 😊 😊 ❤️ ❤️ ❤️
Tweaked the web scraper GitHub repo (above) to scrape 1000 job postings from glassdoor.com. With each job, we got the following fields (a scraping sketch follows the list):
- Job title
- Salary Estimate
- Job Description
- Rating
- Company
- Location
- Company Headquarters
- Company Size
- Company Founded Date
- Type of Ownership
- Industry
- Sector
- Revenue
- Competitors
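The scraper follows the linked repo and tutorial; a minimal sketch of the loop is below. The CSS selectors and output file name are placeholders (Glassdoor's real class names change frequently), and the waits and popup handling the real scraper needs are omitted.

```python
# Minimal scraping sketch -- selectors are hypothetical placeholders,
# not Glassdoor's actual class names.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.glassdoor.com/Job/data-scientist-jobs-SRCH_KO0,14.htm")

jobs = []
for card in driver.find_elements(By.CSS_SELECTOR, ".jobCard"):  # placeholder selector
    card.click()  # open the details pane for this posting
    jobs.append({
        "Job Title": card.find_element(By.CSS_SELECTOR, ".jobTitle").text,
        "Salary Estimate": card.find_element(By.CSS_SELECTOR, ".salaryEstimate").text,
        "Job Description": driver.find_element(By.CSS_SELECTOR, ".jobDescription").text,
        "Rating": card.find_element(By.CSS_SELECTOR, ".rating").text,
        "Company Name": card.find_element(By.CSS_SELECTOR, ".employerName").text,
        "Location": card.find_element(By.CSS_SELECTOR, ".location").text,
    })

pd.DataFrame(jobs).to_csv("glassdoor_jobs.csv", index=False)
driver.quit()
```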
After scraping the data, I needed to clean it up so that it was usable for our model. I made the following changes and created the following variables (a pandas sketch follows the list):
- Parsed numeric data out of salary
- Made columns for employer-provided salary and hourly wages
- Removed rows without salary
- Parsed rating out of company text
- Made a new column for company state
- Added a column for whether the job was at the company’s headquarters
- Transformed founded date into age of company
- Made columns for whether different skills were listed in the job description:
  - Python
  - R
  - Excel
  - AWS
  - Spark
- Column for simplified job title and seniority
- Column for description length
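In pandas, the core of those steps looks roughly like this. Column names follow the scraped fields above but are simplified, the `-1` sentinel for missing values is an assumption, and edge cases (and the rating parse) are omitted.

```python
import pandas as pd

df = pd.read_csv("glassdoor_jobs.csv")

# Drop rows without a salary estimate ("-1" sentinel assumed for missing values)
df = df[df["Salary Estimate"].notna() & (df["Salary Estimate"] != "-1")]

# Flag hourly wages and employer-provided salaries
df["hourly"] = df["Salary Estimate"].str.contains("per hour", case=False).astype(int)
df["employer_provided"] = df["Salary Estimate"].str.contains("employer provided", case=False).astype(int)

# Parse the numeric range out of strings like "$53K-$91K (Glassdoor est.)"
salary = (df["Salary Estimate"].str.split("(").str[0].str.lower()
          .str.replace("per hour", "", regex=False)
          .str.replace("employer provided salary:", "", regex=False)
          .str.replace("$", "", regex=False)
          .str.replace("k", "", regex=False))
df["min_salary"] = salary.str.split("-").str[0].str.strip().astype(int)
df["max_salary"] = salary.str.split("-").str[1].str.strip().astype(int)
df["avg_salary"] = (df["min_salary"] + df["max_salary"]) / 2

# Company state and whether the posting is at the headquarters
df["job_state"] = df["Location"].str.split(",").str[-1].str.strip()
df["same_state"] = (df["Location"] == df["Headquarters"]).astype(int)

# Founded date -> company age (-1 kept as the missing-value sentinel)
df["age"] = df["Founded"].apply(lambda x: x if x < 1 else pd.Timestamp.now().year - x)

# Skill flags from the job description text (a bare "r" needs word-boundary care)
desc = df["Job Description"].str.lower()
for col, pattern in {"python": "python", "R": r"\br\b", "excel": "excel",
                     "aws": "aws", "spark": "spark"}.items():
    df[col] = desc.str.contains(pattern, regex=True).astype(int)

df["desc_len"] = df["Job Description"].str.len()
```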
I looked at the distributions of the data and the value counts for the various categorical variables. Below are a few highlights from the pivot tables.
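The pivot tables themselves are plain pandas; for example (using the cleaned DataFrame `df` from above, with `job_simp` and `seniority` standing in for the simplified-title and seniority columns):

```python
import pandas as pd

# Average salary by simplified job title and seniority
print(pd.pivot_table(df, index=["job_simp", "seniority"], values="avg_salary"))

# Average salary by state, highest first
print(pd.pivot_table(df, index="job_state", values="avg_salary")
        .sort_values("avg_salary", ascending=False))
```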
First, I transformed the categorical variables into dummy variables. I also split the data into train and test sets with a test size of 20%.
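In sketch form (`df_model` standing in for the subset of cleaned columns used for modeling):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the categorical columns; avg_salary is the target
df_dum = pd.get_dummies(df_model)   # df_model: modeling subset of the cleaned DataFrame
X = df_dum.drop("avg_salary", axis=1)
y = df_dum["avg_salary"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```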
I tried three different models and evaluated them using Mean Absolute Error (MAE). I chose MAE because it is relatively easy to interpret and outliers aren’t particularly bad for this type of model. A sketch of the comparison follows the model list below.
- Multiple Linear Regression – baseline for the model
- Lasso Regression – because of the sparse data from the many categorical variables, I thought a regularized regression like lasso would be effective.
- Random Forest – again, with the sparsity associated with the data, I thought that this would be a good fit.
The Random Forest model far outperformed the other approaches on the test and validation sets.
- Random Forest: MAE = 11.120102768456377
- Linear Regression: MAE = 3919437.2410207116
- Ridge Regression: MAE = 11.120102768456377
In this step, I built a Flask API endpoint hosted on a local web server by following along with the TDS tutorial in the reference section above. The API endpoint takes in a request with a list of values from a job listing and returns an estimated salary.
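A minimal sketch of the endpoint is below; the file path, model-file layout, and request format are assumptions based on the tutorial, not necessarily what this repo uses.

```python
# flask_api/app.py -- minimal prediction endpoint sketch
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

# Pickled model produced by the training step (path and dict layout are assumptions)
with open("models/model_file.p", "rb") as f:
    model = pickle.load(f)["model"]

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"input": [feature values in training column order]}
    data = request.get_json()["input"]
    x = np.array(data).reshape(1, -1)
    prediction = float(model.predict(x)[0])
    return jsonify({"response": prediction})

if __name__ == "__main__":
    app.run(debug=True)
```

It can then be exercised locally with a POST request, e.g. `curl -X POST -H "Content-Type: application/json" -d '{"input": [...]}' http://127.0.0.1:5000/predict`.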