I am a Data Scientist, Data Analyst, and Systems Engineer skilled in Python, SQL, Machine Learning, Optimization, and modeling. I use data visualization to present results to stakeholders and tell the story the data is longing to tell. My past work experience as a Java Developer, Linux System Admin, and IT Operations Manager has given me a broader perspective. I take a logical approach to problem solving, perform well in high-pressure situations, and thrive in team-oriented environments by enabling my teammates. I am interested in both Data Science and Data Analytics, and in solving problems.
Technical Skills:
• Languages: Python, SQL
• Predictive Modeling: Linear/Logistic Regression, Classification, Clustering, Decision Tree, Random Forest, Support Vector Machines, K-Nearest Neighbors
• Machine Learning: Deep Learning, Neural Networks, Keras, TensorFlow, Time Series
• Databases: MySQL, Oracle, SQLite, MongoDB
• Data Visualization: Matplotlib, Seaborn, Tableau
• Environments: Google Colab, Jupyter Notebook
• Data Science Methods: Gathering, Cleaning, Scrubbing, Exploration, Mining, Modeling, Visualization
This project uses Machine Learning classification to model 2018 Domestic Airline Flight Delays.
We performed Inferential Analysis of 7M+ records, looking at Airlines, Destinations of Flights, Delays, and Times. We then reduced the data set to just the Top 5 Airlines by number of flights, and reduced the number of Origins and Destinations from 358 to the top 30.
We also performed Classification Analysis with the Machine Learning algorithms Logistic Regression, Decision Trees, Random Forests, and XGBoost, using GridSearch to narrow down the best hyperparameters for predicting delayed flights and to assess the strength and importance of the different features and their relationship to delayed flights.
This project implements a Recommendation System for Movies.
We performed KFold Cross Validation on the movie ratings with Matrix reduction algorithms and optimized them with GridSearch. We were looking for minimal error, choosing RMSE as our main metric, and also tracked the time it takes to fit the model, as the matrix will need to be refit after a user updates their ratings.
We used a Model-Based Collaborative Filtering approach for this first implementation.
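The evaluation loop described above (k-fold splits, RMSE per fold, and fit time per fold) can be sketched roughly as follows. This is only an illustration using the standard library: `GlobalMeanModel` is a hypothetical stand-in predictor, not the matrix-based model used in the project, and the ratings are synthetic.

```python
import random
import time
from math import sqrt


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


class GlobalMeanModel:
    """Hypothetical stand-in: always predicts the mean training rating."""

    def fit(self, ratings):
        self.mean_ = sum(r for _, _, r in ratings) / len(ratings)
        return self

    def predict(self, user, movie):
        return self.mean_


def cross_validate(ratings, model_factory, k=5):
    """Return one (rmse, fit_seconds) pair per fold."""
    results = []
    for held_out in k_fold_indices(len(ratings), k):
        held = set(held_out)
        train = [r for i, r in enumerate(ratings) if i not in held]
        test = [ratings[i] for i in held_out]
        start = time.perf_counter()
        model = model_factory().fit(train)
        fit_seconds = time.perf_counter() - start
        predicted = [model.predict(u, m) for u, m, _ in test]
        results.append((rmse([r for _, _, r in test], predicted), fit_seconds))
    return results


# Synthetic (user, movie, rating) triples with ratings from 1 to 5.
ratings = [(u, m, 1 + (u * 7 + m * 3) % 5) for u in range(20) for m in range(10)]
results = cross_validate(ratings, GlobalMeanModel, k=5)
```

Tracking fit time next to RMSE reflects the trade-off mentioned above: a slightly less accurate model may be preferable if it refits much faster after a user updates their ratings.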
This project uses multiple regression to model home sales data. We performed Inferential Analysis on over 21,000 home sales from Kings County, removing any outliers with a z-score larger than 3. In a normal distribution, about 99.7% of all data falls within a z-score of 3. We also performed a multiple regression analysis, which allows us to build a pricing model and assess the strength and importance of the different features and their relationship to the estimated price of a property.
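As a rough illustration of the z-score filter, the sketch below keeps only values within 3 standard deviations of the mean. The function name `remove_outliers` and the sample prices are made up for the example, and the population standard deviation is an assumption; the project may have used a different estimator.

```python
from statistics import mean, pstdev


def remove_outliers(values, threshold=3.0):
    """Keep only values whose absolute z-score is at most `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return list(values)  # all values identical, nothing to remove
    return [v for v in values if abs((v - mu) / sigma) <= threshold]


# Made-up sale prices: 50 ordinary homes plus one extreme sale.
prices = [300_000 + 1_000 * i for i in range(50)] + [7_500_000]
cleaned = remove_outliers(prices)
```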
A new feature we created was the distance from four major employment locations in Kings County, computed using the haversine formula, with the following blogs as reference: We also created the district feature to divide the county into 10 separate districts based on ZIP codes.
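A minimal sketch of the haversine distance feature is shown here. The employment-center names and coordinates are placeholders, not the four locations used in the project, and miles are an assumed unit.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Placeholder employment centers (hypothetical names and coordinates).
employment_centers = {
    "center_a": (47.61, -122.33),
    "center_b": (47.64, -122.13),
}


def distance_features(lat, lon):
    """One distance column per employment center for a single home."""
    return {
        f"dist_{name}": haversine_miles(lat, lon, c_lat, c_lon)
        for name, (c_lat, c_lon) in employment_centers.items()
    }


features = distance_features(47.55, -122.25)
```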
This project analyzes movie data in order to create a portfolio strategy for entrance into the Entertainment industry.
Regression Analysis for Domestic Box Office with the Bass Diffusion Model and Monte Carlo Simulation
Data
• The Numbers: Yearly Box Office Revenue (11 years); Weekly Box Office Revenue (11 years); Distributor, Genre, Source, Creative Type, Inflation Adjusted Domestic Bo
• IMDB Daily Dumps: 8 million records; movies, principals, actors, actresses, directors
Methods
• Created an Actor Influence formula
• Created a Director Influence formula
• Classified each movie as Franchise or Not
• Fit each movie to the Bass Model for 3 coefficients: M (market size, initially set to 1,000,000), p (coefficient of innovation, initially set to 0.003), and q (coefficient of imitation, initially set to 0.5)
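For reference, the standard Bass curve behind those three coefficients can be sketched as below, evaluated at the initial values listed above. The helper names and the sum-of-squared-errors objective are assumptions for illustration; the optimizer actually used to fit each movie is not shown here.

```python
from math import exp


def bass_cumulative(t, m, p, q):
    """Cumulative adoptions at time t under the Bass diffusion model."""
    decay = exp(-(p + q) * t)
    return m * (1 - decay) / (1 + (q / p) * decay)


def bass_per_period(periods, m, p, q):
    """Adoptions in each period 1..periods (differences of the cumulative curve)."""
    return [
        bass_cumulative(t, m, p, q) - bass_cumulative(t - 1, m, p, q)
        for t in range(1, periods + 1)
    ]


def sum_squared_error(observed, m, p, q):
    """Objective a fitting routine could minimize over (m, p, q)."""
    predicted = bass_per_period(len(observed), m, p, q)
    return sum((o - e) ** 2 for o, e in zip(observed, predicted))


# Initial values from the list above.
M0, P0, Q0 = 1_000_000, 0.003, 0.5
curve = bass_per_period(52, M0, P0, Q0)
```

A nonlinear least-squares routine would start from `(M0, P0, Q0)` and adjust the three coefficients to reduce `sum_squared_error` against each movie's observed series.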