SF-DAT-20

## Lecture 1 Summary (Introduction to Data Science Part I)

We talked about different roles of Data Scientists
T-Shaped Data Scientists
Data Science Workflow
Continuous, Discrete and Qualitative Data
Supervised vs Unsupervised Learning
Set up GitHub accounts
Set up IPython Notebook
Introduced NumPy

## Lecture 2 Summary (Introduction to Data Science Part II)

Classification vs Clustering and Regression vs Dimensionality Reduction
Flexibility vs Interpretability
Different types of data (Cross-Sectional, Time-Series, Panel Data)
Walkthrough of Acquire & Parse with Pandas
HW 1 assigned - Due date Feb 8th at 6:30PM

## Lecture 3 Summary (Basic Statistics - Review Session)

Measures of central tendency (Mean, Median, Mode, Quartiles, Percentiles)
Measures of Variability (IQR, Standard Deviation, Variance)
Skewness Coefficient
Kurtosis Coefficient
Boxplots
Bias vs Variance
Central Limit Theorem – Standard Error of Mean
Class/Dummy Variables
Walkthrough describing and visualizing data in Pandas
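
As a quick refresher on the measures listed above, here is a minimal sketch using only Python's standard `statistics` module. The sample values are made up for illustration, and the class walkthrough itself used Pandas rather than this module.

```python
import math
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 11]

# Measures of central tendency.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Quartiles (default "exclusive" method); the IQR is Q3 - Q1.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Sample variance and standard deviation (n - 1 in the denominator).
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

# Standard error of the mean: sample standard deviation / sqrt(n).
sem = std_dev / math.sqrt(len(data))

print(mean, median, mode, iqr, variance, round(sem, 3))
```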

## Lecture 4 Summary (Linear Regression Lines - Part I)

Linear Regression lines
Single Variable and Multi-Variable Regression Lines
Capture non-linearity using Linear Regression lines.
Interpreting regression coefficients
Dealing with dummy variables in regression lines
Intro to the sklearn and seaborn libraries
HW 2 assigned - Due date Feb 17th 2016 at 6:30PM

## Lecture 5 Summary (Linear Regression Lines - Part II)

Hypothesis test - test of significance on regression coefficients
p-value
Capture non-linearity using Linear Regression lines.
Different types of errors and R-squared
Interaction Effects
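
To illustrate capturing non-linearity with a linear model and reading R-squared, here is a small NumPy sketch. The data are synthetic and noiseless, and the helper names `fit_ols` and `r_squared` are made up for this example; the lectures used sklearn.

```python
import numpy as np

# Hypothetical noiseless data following y = 1 + 2x + 3x^2.
x = np.linspace(-2, 2, 21)
y = 1 + 2 * x + 3 * x**2


def fit_ols(design, target):
    """Least-squares coefficients for target ~ design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def r_squared(design, target, coef):
    """R^2 = 1 - SS_residual / SS_total."""
    residuals = target - design @ coef
    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1 - ss_res / ss_tot


ones = np.ones_like(x)
X_linear = np.column_stack([ones, x])            # intercept + x
X_quadratic = np.column_stack([ones, x, x**2])   # adds x^2 as a feature

coef_linear = fit_ols(X_linear, y)
coef_quadratic = fit_ols(X_quadratic, y)

r2_linear = r_squared(X_linear, y, coef_linear)
r2_quadratic = r_squared(X_quadratic, y, coef_quadratic)

print(coef_quadratic, r2_linear, r2_quadratic)
```

The model stays linear in its coefficients; only the feature `x**2` is non-linear in `x`.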

## Lecture 6 Summary (Model Selection)

Bias-Variance Trade off
Validation (Test vs Train set)
Cross-Validation
Ridge and Lasso Regression
(Optional) Backward Selection, Forward Selection, All Subset Selection. (If you want to use these methods you need to use R)
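
The sketch below combines two of the topics above: ridge regression (closed form, no intercept) and k-fold cross-validation to choose the penalty. It is a NumPy illustration on synthetic data, not the sklearn code from class; `ridge_fit`, `kfold_mse`, and the candidate `alpha` values are assumptions made for the example.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: (X'X + alpha * I)^-1 X'y (no intercept)."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)


def kfold_mse(X, y, alpha, k=5, seed=0):
    """Average test-fold mean squared error over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = ridge_fit(X[train], y[train], alpha)
        residuals = y[test] - X[test] @ beta
        errors.append(np.mean(residuals**2))
    return float(np.mean(errors))


# Synthetic data with known coefficients and a little noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

# Pick the penalty with the lowest cross-validated error.
cv_scores = {alpha: kfold_mse(X, y, alpha) for alpha in (0.0, 1.0, 100.0)}
best_alpha = min(cv_scores, key=cv_scores.get)
print(cv_scores, best_alpha)
```

With a strong signal and little noise, a heavy penalty shrinks the coefficients too far, so the cross-validated error favors a small `alpha`.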

## Lecture 7 Summary (Missing Data and Imputation)

Types of missing data (MCAR, MAR, NMAR)
Single imputation and its limitations
Imputation using regression lines and error
Hot deck imputation
Multiple imputation
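
One limitation of single imputation can be seen in a few lines: filling gaps with the mean keeps the mean but shrinks the variance. This is a minimal NumPy sketch on made-up values.

```python
import numpy as np

# Hypothetical column with two missing entries.
values = np.array([2.0, 4.0, np.nan, 6.0, np.nan, 8.0])

missing = np.isnan(values)
observed = values[~missing]

# Mean imputation: replace every missing entry with the observed mean.
filled = np.where(missing, observed.mean(), values)

var_observed = np.var(observed, ddof=1)
var_filled = np.var(filled, ddof=1)

print(filled, var_observed, var_filled)
```

Because every imputed value sits exactly at the mean, the spread of the filled column understates the spread of the observed data.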

## Lecture 8 Summary (K-Nearest Neighbors)

Classification Problems
Misclassification Error
KNN algorithm for Classification
Cross-Validation for KNN Algorithm
Limitations of KNN Algorithm
KNN algorithm for Regression
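
A bare-bones version of KNN for both classification (majority vote) and regression (neighbor average) might look like the sketch below. It uses Euclidean distance and toy data; the function names are made up for this example and this is not sklearn's implementation.

```python
from collections import Counter

import numpy as np


def nearest_indices(X_train, x_new, k):
    """Indices of the k training points closest to x_new (Euclidean)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return np.argsort(distances)[:k]


def knn_classify(X_train, y_train, x_new, k=3):
    """Majority vote among the k nearest neighbors."""
    nearest = nearest_indices(X_train, x_new, k)
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


def knn_regress(X_train, y_train, x_new, k=3):
    """Average target value of the k nearest neighbors."""
    nearest = nearest_indices(X_train, x_new, k)
    return float(np.mean(y_train[nearest]))


# Toy data: two well-separated clusters.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
targets = np.array([1.0, 1.0, 1.0, 10.0, 10.0, 10.0])

print(knn_classify(X_train, labels, np.array([0.5, 0.5])))
print(knn_regress(X_train, targets, np.array([5.5, 5.5])))
```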

## Lecture 9 Summary (Logistic Regression Part I)

Intro to Logistic Regression
Odds vs Probability
Using Logistic Regression to Make predictions
How to interpret the coefficients of a Logistic Regression model
Strengths and weaknesses of the Logistic Regression model
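
The relationship between odds, probability, and coefficients can be checked numerically. In this sketch the coefficients `b0` and `b1` are hypothetical values for a one-feature model, not fitted to any data.

```python
import math


def odds(p):
    """Odds corresponding to probability p."""
    return p / (1 - p)


def sigmoid(z):
    """Map log-odds z to a probability."""
    return 1 / (1 + math.exp(-z))


# Hypothetical one-feature model: log-odds = b0 + b1 * x.
b0, b1 = -1.0, 0.5


def predict_proba(x):
    return sigmoid(b0 + b1 * x)


# Raising x by one unit multiplies the odds by exp(b1).
odds_ratio = odds(predict_proba(3.0)) / odds(predict_proba(2.0))

print(odds(0.8), predict_proba(2.0), odds_ratio, math.exp(b1))
```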

## Lecture 10 Summary (Logistic Regression Part II)

Unbalanced observations and Logistic Regression
FP/FN/TP/TN/FPR/TPR
The effect of changing the threshold
ROC Curves
Area Under Curve
How to compare classification algorithms
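
The sketch below shows how the threshold moves TPR and FPR, and computes AUC as the share of (positive, negative) pairs ranked correctly. It is plain Python on four made-up scores; the function names are assumptions for this example.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (TP, FP, TN, FN) when predicting 1 for score >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def tpr_fpr(y_true, scores, threshold):
    """True positive rate and false positive rate at one threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    return tp / (tp + fn), fp / (fp + tn)


def auc(y_true, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(tpr_fpr(y_true, scores, 0.5))
print(tpr_fpr(y_true, scores, 0.3))
print(auc(y_true, scores))
```

Lowering the threshold from 0.5 to 0.3 catches every positive but also lets in a false positive, which is the trade-off an ROC curve traces out.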

## Lecture 11 Summary (Decision Trees Part I)

Decision Tree for Regression
Greedy Approach
Decision Tree for Classification
Gini Index and Entropy index
Limitation of Simple Decision Tree
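
For reference, the two impurity measures used to score classification splits can be written in a few lines. This is a plain-Python sketch; `split_impurity` is a hypothetical helper showing the size-weighted score a greedy tree would compare across candidate splits.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Share of each class among the labels in a node."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]


def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    return 1 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: -sum of p * log2(p)."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def split_impurity(left, right, measure=gini):
    """Size-weighted impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


mixed = [0, 0, 1, 1]
print(gini(mixed), entropy(mixed))
print(split_impurity([0, 0], [1, 1]))   # a perfect split
print(split_impurity([0, 1], [0, 1]))   # a useless split
```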

## Lecture 12 Summary (Decision Trees Part II)

Bagging
Random Forest
Boosting
Tuning parameters for boosting and Random Forest

Additional Resources

## Lecture 13 Summary (Natural Language Processing)

Definition of Natural Language Processing
NLP applications
Basic NLP practice
Stop words, bag-of-words, TF-IDF
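
Here is a small plain-Python sketch of stop-word removal, bag-of-words counts, and term weighting. The stop-word list and documents are made up, and the weight used is the textbook `count * log(N / document_frequency)`; scikit-learn's vectorizers use a smoothed variant, so their numbers will differ.

```python
import math
from collections import Counter

# Tiny hypothetical stop-word list and corpus, for illustration only.
STOP_WORDS = {"the", "a", "on", "and"}
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "a cat and a dog",
]


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


# Bag-of-words: one term-count mapping per document.
bags = [Counter(tokenize(doc)) for doc in docs]

# Document frequency: in how many documents each term appears.
n_docs = len(docs)
doc_freq = Counter(term for bag in bags for term in bag)


def tf_idf(bag):
    """Weight each term by count * log(N / document frequency)."""
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()}


weights = [tf_idf(bag) for bag in bags]
print(bags[0])
print(weights[0])
```

A term that appears in only one document ("mat") gets a higher weight than terms shared across documents ("cat", "sat").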

Additional Resources

If you want to learn a lot more NLP, check out the excellent video lectures and slides from this Coursera course (which is no longer being offered).
Natural Language Processing with Python is the most popular book for going in-depth with the Natural Language Toolkit (NLTK).
A Smattering of NLP in Python provides a nice overview of NLTK, as does this notebook from DAT5.
spaCy is a newer Python library for text processing that is focused on performance (unlike NLTK).
If you want to get serious about NLP, Stanford CoreNLP is a suite of tools (written in Java) that is highly regarded.
When working with a large text corpus in scikit-learn, HashingVectorizer is a useful alternative to CountVectorizer.
Automatically Categorizing Yelp Businesses discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses.
Modern Methods for Sentiment Analysis shows how "word vectors" can be used for more accurate sentiment analysis.
Identifying Humorous Cartoon Captions is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest.

## Lecture 14 Summary (Principal Component Analysis)

Principal Component Analysis
Computation of PCAs
Geometry of PCAs
Proportion of Variance Explained
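
One common way to compute principal components is through the SVD of the centered data matrix. The NumPy sketch below does that on synthetic data with two strongly correlated columns and reports the proportion of variance explained; it is an illustration, not necessarily the computation shown in class.

```python
import numpy as np

# Synthetic data: the second column is roughly twice the first.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([base, 2 * base + rng.normal(scale=0.1, size=200)])

# Center each column, then take the SVD of the centered matrix.
X_centered = X - X.mean(axis=0)
U, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions (loadings).
components = Vt

# Variance along each component and its share of the total.
explained_variance = singular_values**2 / (len(X) - 1)
pve = explained_variance / explained_variance.sum()

# Coordinates of the observations in the principal component basis.
scores = X_centered @ components.T

print(components)
print(pve)
```

Because the two columns move together, nearly all of the variance falls on the first component.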

Additional Resources

This tutorial on Principal Components Analysis (PCA) includes good refreshers on covariance and linear algebra
To go deeper on Singular Value Decomposition, read Kirk Baker's excellent tutorial.
Chapter 10 of An Introduction to Statistical Learning with Applications in R

## Lecture 15 Summary (Time Series Models)

AutoRegressive Models
Moving Averages
ARMA
ARIMA
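
As a concrete example of the simplest autoregressive model, the sketch below simulates an AR(1) series, x[t] = phi * x[t-1] + noise, and recovers phi by least squares. It uses NumPy only; the value `phi=0.7`, the series length, and the function names are arbitrary choices for this illustration.

```python
import numpy as np


def simulate_ar1(phi, n, seed=0):
    """Simulate x[t] = phi * x[t-1] + e[t] with standard normal noise."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x


def estimate_ar1(x):
    """Least-squares estimate of phi from regressing x[t] on x[t-1]."""
    previous, current = x[:-1], x[1:]
    return float(previous @ current / (previous @ previous))


series = simulate_ar1(phi=0.7, n=2000)
phi_hat = estimate_ar1(series)
print(phi_hat)
```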

Additional Resources

This is a good resource for AR models
Seemingly easy to read book on time series.

## Lecture 16 Summary (Databases and SQL)

Talked about databases and data warehouse design.
Introduced SQL and the Fundamental Growth Query (FGQ).
Looked at product engagement data of a fictional company and used FGQ to compute retention curves.
Applied convolution to the retention curve to project future active users.
Built a model to predict the retention likelihood of individual customers.
Thanks to Michael
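
One way to read the convolution step: active users in a period are the sum, over past signup cohorts, of cohort size times the fraction retained at that cohort's age. The NumPy sketch below assumes that reading; the signup counts and retention curve are made-up numbers, not the fictional company's data from class.

```python
import numpy as np

# Hypothetical new signups per period and fraction still active
# k periods after signup (k = 0, 1, 2).
new_users = np.array([100, 100, 100])
retention = np.array([1.0, 0.5, 0.25])

# Full convolution: active[t] = sum over k of new_users[t - k] * retention[k].
full = np.convolve(new_users, retention)

# Periods covered by the signup data.
active = full[: len(new_users)]

# Remaining entries: existing cohorts decaying if no new users sign up.
run_off = full[len(new_users):]

print(active)
print(run_off)
```

With these numbers the active counts are 100, 150, and 175, followed by a run-off of 75 and 25 once signups stop.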

Additional Resources

Well organized and easy to understand tutorials - Thanks to Catherine
More tutorials on SQL - Thanks to Randy

## Lecture 17 Summary (Naive Bayes)

Pre-work: Please review Bayes Questions.
