
Solutions to "Cracking the Machine Learning Interview"

-------> Currently under construction! <-------

I am currently writing solutions to the questions from the Medium article "Cracking the Machine Learning Interview," written by Subhrajit Roy. In the year since the article went public, Subhrajit has only written down the questions, with no update on the solutions, so I plan on finishing the job. I may add more questions outside the article's domain. Some of my solutions may contain code from the following frameworks: Scikit-Learn, PyTorch, and TensorFlow.

https://medium.com/subhrajit-roy/cracking-the-machine-learning-interview-1d8c5bb752d8

Contents:

1. Linear Algebra
2. Numerical Optimization
3. Basics of Probability and Information Theory
4. Confidence Interval
5. Learning Theory
6. Model and Feature Selection
7. Curse of dimensionality
8. Universal approximation of neural networks
9. Deep Learning motivation
10. Support Vector Machine
11. Bayesian Machine Learning
12. Regularization
13. Evaluation of Machine Learning systems
14. Clustering
15. Dimensionality Reduction
16. Basics of Natural Language Processing
17. Some basic questions
18. Optimization Procedures
19. Sequence Modeling
20. Autoencoders
21. Representation Learning
22. Monte Carlo Methods
23. Generative Models
24. Reinforcement Learning
25. Probabilistic Graphical Models
26. Computational Logic

Linear Algebra

(Return to Contents)

  1. What is broadcasting in connection to Linear Algebra?
  2. What are scalars, vectors, matrices, and tensors?
  3. What is Hadamard product of two matrices?
  4. What is an inverse matrix?
  5. If inverse of a matrix exists, how to calculate it?
  6. What is the determinant of a square matrix? How is it calculated? What is the connection of determinant to eigenvalues?
  7. Why does a negative determinant (a negative signed area) indicate a flip in orientation? (Check out lecture 6 from 3Blue1Brown's Essence of Linear Algebra.)
  8. Justify in one sentence why the identity det(M_{1}M_{2}) = det(M_{1})det(M_{2}) is true, i.e., why the determinant of a product of two matrices equals the product of their determinants. (If you try to justify it with numbers, it would take a long time.)
  9. Discuss span and linear dependence.
  10. Following up on question #9, what does the following definition mean: "The basis of a vector space is a set of linearly independent vectors that span the full space."
  11. What is Ax = b? When does Ax = b have a unique solution?
  12. In Ax = b, what happens when A is fat or tall?
  13. When does inverse of A exist?
  14. What is a norm? What is L1, L2 and L infinity norm?
  15. What are the conditions a norm has to satisfy?
  16. Why is the squared L2 norm preferred in ML over the plain L2 norm?
  17. When is the L1 norm preferred over the L2 norm?
  18. Can the number of nonzero elements in a vector be defined as the L0 norm? If not, why?
  19. What is Frobenius norm?
  20. What is a diagonal matrix?
  21. Why is multiplication by diagonal matrix computationally cheap? How is the multiplication different for square vs. non-square diagonal matrix?
  22. At what conditions does the inverse of a diagonal matrix exist?
  23. What is a symmetric matrix?
  24. What is a unit vector?
  25. When are two vectors x and y orthogonal?
  26. In R^n, what is the maximum possible number of mutually orthogonal vectors with non-zero norm?
  27. When are two vectors x and y orthonormal?
  28. What is an orthogonal matrix? Why is it computationally preferred?
  29. What are eigendecomposition, eigenvectors, and eigenvalues?
  30. How to find the eigenvalues of a matrix?
  31. Write the eigendecomposition formula for a matrix. If the matrix is real symmetric, how will this change?
  32. Is the Eigendecomposition guaranteed to be unique? If not, then how do we represent it?
  33. What are positive definite, negative definite, positive semi definite and negative semi definite matrices?
  34. What is Singular Value Decomposition? Why do we use it? Why not just use ED?
  35. Given a matrix A, how will you calculate its Singular Value Decomposition?
  36. What are singular values, left singulars and right singulars?
  37. What is the connection of Singular Value Decomposition of A with functions of A?
  38. Why are singular values always non-negative?
  39. What is the Moore-Penrose pseudoinverse and how to calculate it?
  40. If we apply the Moore-Penrose pseudoinverse to Ax = b, what solution is provided if A is fat? Moreover, what solution is provided if A is tall?
  41. Which matrices can be decomposed by ED?
  42. Which matrices can be decomposed by SVD?
  43. What is the trace of a matrix?
  44. How to write Frobenius norm of a matrix A in terms of trace?
  45. Why is trace of a multiplication of matrices invariant to cyclic permutations?
  46. What is the trace of a scalar?
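
Until the full write-ups land, here is the kind of sketch my solutions will include: a NumPy illustration (mine, not the article's) of broadcasting (question 1), the Hadamard product (question 3), the Frobenius norm written via the trace (question 44), and the SVD (questions 34-38).

```python
import numpy as np

# Broadcasting (question 1): the (3,)-shaped vector is stretched across
# each row of the (2, 3) matrix without copying.
A = np.array([[1., 2., 3.], [4., 5., 6.]])
b = np.array([10., 20., 30.])
print(A + b)                                  # shape (2, 3)

# Hadamard product (question 3): elementwise, not matrix, multiplication.
B = 2. * np.ones((2, 3))
print(A * B)

# Frobenius norm via the trace (question 44): ||A||_F = sqrt(tr(A A^T)).
assert np.isclose(np.sqrt(np.trace(A @ A.T)), np.linalg.norm(A, 'fro'))

# SVD (questions 34-35): A = U diag(s) V^T, with singular values
# non-negative (question 38) and sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, A)
print(s)
```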

Numerical Optimization

(Return to Contents)

  1. What is underflow and overflow?
  2. How to tackle the problem of underflow or overflow for softmax function or log softmax function?
  3. What is poor conditioning?
  4. What is the condition number?
  5. What are grad, div and curl?
  6. What are critical or stationary points in multi-dimensions?
  7. Why should you do gradient descent when you want to minimize a function?
  8. What is line search?
  9. What is hill climbing?
  10. What is a Jacobian matrix?
  11. What is curvature?
  12. What is a Hessian matrix?
  13. What is gradient checking?
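
A minimal NumPy sketch of two of the more code-friendly answers above: the log-sum-exp shift that tackles softmax underflow/overflow (question 2) and central-difference gradient checking (question 13). This is my own illustration; the constants and names are arbitrary.

```python
import numpy as np

def log_softmax(x):
    # Shift by max(x) (question 2): exp() can no longer overflow, and the
    # sum-exp term is at least 1, so the log cannot underflow to -inf.
    shifted = x - np.max(x)
    return shifted - np.log(np.sum(np.exp(shifted)))

def numerical_grad(f, x, eps=1e-5):
    # Gradient checking (question 13): central differences approximate
    # each partial derivative for comparison against an analytic gradient.
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

print(log_softmax(np.array([1000., 1001., 1002.])))  # naive exp() overflows

f = lambda v: v @ v                                   # gradient is 2v
x0 = np.array([1., -2., 3.])
print(numerical_grad(f, x0) - 2 * x0)                 # ~0 everywhere
```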

Basics of Probability and Information Theory

(Return to Contents)

  1. Compare “Frequentist probability” vs. “Bayesian probability”?
  2. What is a random variable?
  3. What is a probability distribution?
  4. What is a probability mass function?
  5. What is a probability density function?
  6. What is a joint probability distribution?
  7. What are the conditions for a function to be a probability mass function?
  8. What are the conditions for a function to be a probability density function?
  9. What is a marginal probability? Given the joint probability function, how will you calculate it?
  10. What is conditional probability? Given the joint probability function, how will you calculate it?
  11. State the Chain rule of conditional probabilities.
  12. What are the conditions for independence and conditional independence of two random variables?
  13. What are expectation, variance and covariance?
  14. Compare covariance and independence.
  15. What is the covariance for a vector of random variables?
  16. What is a Bernoulli distribution? Calculate the expectation and variance of a random variable that follows Bernoulli distribution?
  17. What is a multinoulli distribution?
  18. What is a normal distribution?
  19. Why is the normal distribution a default choice for a prior over a set of real numbers?
  20. What is the central limit theorem?
  21. What are exponential and Laplace distribution?
  22. What are Dirac distribution and Empirical distribution?
  23. What is a mixture of distributions?
  24. Name two common examples of mixtures of distributions. (Empirical and Gaussian mixture.)
  25. Is Gaussian mixture model a universal approximator of densities?
  26. Write the formulae for logistic and softplus function.
  27. Write the formulae for Bayes rule.
  28. What do you mean by measure zero and almost everywhere?
  29. If two random variables are related in a deterministic way, how are the PDFs related?
  30. Define self-information. What are its units?
  31. What are Shannon entropy and differential entropy?
  32. What is Kullback-Leibler (KL) divergence?
  33. Can KL divergence be used as a distance measure?
  34. Define cross-entropy.
  35. What are structured probabilistic models or graphical models?
  36. In the context of structured probabilistic models, what are directed and undirected models? How are they represented? What are cliques in undirected structured probabilistic models?
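
For the information-theory questions (31-34), a small NumPy sketch of entropy, KL divergence, and cross-entropy on discrete distributions; it also demonstrates why KL is not a true distance (question 33), since it is asymmetric.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))        # Shannon entropy in nats (Q31)

def kl(p, q):
    return np.sum(p * np.log(p / q))     # KL divergence (Q32)

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))        # cross-entropy (Q34)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# KL is non-negative and asymmetric, hence not a distance (question 33).
print(kl(p, q), kl(q, p))

# Identity connecting the three: H(p, q) = H(p) + KL(p || q).
assert np.isclose(cross_entropy(p, q), entropy(p) + kl(p, q))
```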

Confidence interval

(Return to Contents)

  1. What is population mean and sample mean?
  2. What is population standard deviation and sample standard deviation?
  3. Why does the population s.d. have N degrees of freedom while the sample s.d. has N-1 degrees of freedom? In other words, why is it 1/N inside the root for the population s.d. and 1/(N-1) inside the root for the sample s.d.?
  4. What is the formula for calculating the s.d. of the sample mean?
  5. What is confidence interval?
  6. What is standard error?
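
A short sketch of a 95% confidence interval for the sample mean (questions 3-6), assuming a normal approximation; the data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)    # synthetic sample

n = x.size
mean = x.mean()
sd = x.std(ddof=1)        # sample s.d., dividing by N-1 (question 3)
se = sd / np.sqrt(n)      # standard error of the mean (questions 4 and 6)

# 95% interval under a normal approximation; 1.96 is the z for 95%.
lo, hi = mean - 1.96 * se, mean + 1.96 * se
print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```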

Learning Theory

(Return to Contents)

  1. Describe bias and variance with examples.
  2. What is Empirical Risk Minimization?
  3. What is Union bound and Hoeffding’s inequality?
  4. Write the formulae for training error and generalization error. Point out the differences.
  5. State the uniform convergence theorem and derive it.
  6. What is sample complexity bound of uniform convergence theorem?
  7. What is error bound of uniform convergence theorem?
  8. What is the bias-variance trade-off theorem?
  9. From the bias-variance trade-off, can you derive the bound on training set size?
  10. What is the VC dimension?
  11. What does the training set size depend on for a finite and infinite hypothesis set? Compare and contrast.
  12. What is the VC dimension for an n-dimensional linear classifier?
  13. How is the VC dimension of a SVM bounded although it is projected to an infinite dimension?
  14. Considering that Empirical Risk Minimization is an NP-hard problem, how do the logistic regression and SVM losses work?
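
Hoeffding's inequality (question 3) can be sanity-checked by simulation. A sketch, assuming i.i.d. Bernoulli draws so the bound P(|mean - p| >= t) <= 2 exp(-2 n t^2) applies:

```python
import numpy as np

# Hoeffding for i.i.d. variables bounded in [0, 1]:
# P(|sample mean - p| >= t) <= 2 exp(-2 n t^2).
rng = np.random.default_rng(0)
n, p, t, trials = 100, 0.5, 0.1, 100_000

means = rng.binomial(n, p, size=trials) / n
empirical = np.mean(np.abs(means - p) >= t)
bound = 2 * np.exp(-2 * n * t ** 2)
print(f"empirical={empirical:.4f} <= bound={bound:.4f}")
```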

Model and feature selection

(Return to Contents)

  1. Why are model selection methods needed?
  2. How do you do a trade-off between bias and variance?
  3. What are the different attributes that can be selected by model selection methods?
  4. Why is cross-validation required?
  5. Describe different cross-validation techniques.
  6. What is hold-out cross validation? What are its advantages and disadvantages?
  7. What is k-fold cross validation? What are its advantages and disadvantages?
  8. What is leave-one-out cross validation? What are its advantages and disadvantages?
  9. Why is feature selection required?
  10. Describe some feature selection methods.
  11. What is forward feature selection method? What are its advantages and disadvantages?
  12. What is backward feature selection method? What are its advantages and disadvantages?
  13. What is filter feature selection method and describe two of them?
  14. What is mutual information and KL divergence?
  15. Describe KL divergence intuitively.
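
A sketch of k-fold cross-validation (question 7) using scikit-learn; the iris dataset and logistic regression are placeholders for whatever model is actually being selected.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds is held out once while the other 4 train the model;
# the mean score estimates generalization performance.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```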

Curse of dimensionality

(Return to Contents)

  1. Describe the curse of dimensionality with examples.
  2. What is local constancy or smoothness prior or regularization?

Universal approximation of neural networks

(Return to Contents)

  1. State the universal approximation theorem? What is the technique used to prove that?
  2. What is a Borel measurable function?
  3. Given the universal approximation theorem, why can’t a Multi Layer Perceptron (MLP) still reach an arbitrarily small positive error?

Deep Learning motivation

(Return to Contents)

  1. What is the mathematical motivation of Deep Learning as opposed to standard Machine Learning techniques?
  2. In standard Machine Learning vs. Deep Learning, how is the order of the number of samples related to the order of regions that can be recognized in the function space?
  3. What are the reasons for choosing a deep model as opposed to shallow model?
  4. How does Deep Learning tackle the curse of dimensionality?

Support Vector Machine

(Return to Contents)

  1. How can the SVM optimization function be derived from the logistic regression optimization function?
  2. What is a large margin classifier?
  3. Why is SVM an example of a large margin classifier?
  4. SVM being a large margin classifier, is it influenced by outliers?
  5. What is the role of C in SVM?
  6. In SVM, what is the angle between the decision boundary and theta?
  7. What is the mathematical intuition of a large margin classifier?
  8. What is a kernel in SVM? Why do we use kernels in SVM?
  9. What is a similarity function in SVM? Why is it named so?
  10. How are the landmarks initially chosen in an SVM? How many and where?
  11. Can we apply the kernel trick to logistic regression? Why is it not used in practice then?
  12. What is the difference between logistic regression and SVM without a kernel?
  13. How does the SVM parameter C affect the bias/variance trade off?
  14. How does the SVM kernel parameter sigma² affect the bias/variance trade off?
  15. Can any similarity function be used for SVM?
  16. Logistic regression vs. SVMs: When to use which one?
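
A scikit-learn sketch of how C (question 13) and the RBF kernel width (question 14) move the bias/variance trade-off; note that sklearn's gamma corresponds to 1/(2*sigma^2), and the dataset here is a toy.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Small C / small gamma -> smoother boundary (higher bias);
# large C / large gamma -> wigglier boundary (higher variance).
for C, gamma in [(0.01, 0.1), (1.0, 1.0), (1000.0, 100.0)]:
    clf = SVC(C=C, kernel='rbf', gamma=gamma).fit(X_tr, y_tr)
    print(C, gamma, clf.score(X_tr, y_tr), clf.score(X_te, y_te))
```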

Bayesian Machine Learning

(Return to Contents)

  1. What are the differences between the "Bayesian" and "Frequentist" approaches to Machine Learning?
  2. Compare and contrast maximum likelihood and maximum a posteriori estimation.
  3. How do Bayesian methods do automatic feature selection?
  4. What do you mean by Bayesian regularization?
  5. When will you use Bayesian methods instead of Frequentist methods?
  6. Please explain the Expectation-Maximization algorithm.
  7. What is Variational Inference?
  8. What is Latent Dirichlet Allocation (LDA)?
  9. What is a Markov chain?
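
A tiny sketch contrasting maximum likelihood with maximum a posteriori estimation (question 2) for a coin with a Beta prior; the prior pseudo-counts are an illustrative assumption.

```python
# k heads in n flips with a Beta(a, b) prior: the posterior is
# Beta(a + k, b + n - k), and MAP is its mode. With a = b = 1 (uniform
# prior) the MAP estimate reduces to the MLE, k / n.
k, n = 7, 10
a, b = 2.0, 2.0          # prior pseudo-counts (illustrative assumption)

mle = k / n
map_est = (k + a - 1) / (n + a + b - 2)
print(f"MLE={mle:.3f}, MAP={map_est:.3f}")  # the prior pulls MAP toward 0.5
```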

Regularization

(Return to Contents)

  1. What is L1 regularization?
  2. What is L2 regularization?
  3. Compare L1 and L2 regularization.
  4. Why does L1 regularization result in sparse models?
  5. What is dropout?
  6. How will you implement dropout during forward and backward pass?
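
A NumPy sketch of inverted dropout in the forward and backward pass (questions 5-6); this is one common implementation, not the only one.

```python
import numpy as np

def dropout_forward(x, p_drop, rng, train=True):
    # Inverted dropout: zero each unit with probability p_drop and scale
    # survivors by 1/(1 - p_drop), so expected activations match test
    # time and the test-time forward pass needs no change.
    if not train:
        return x, None
    mask = (rng.random(x.shape) >= p_drop) / (1.0 - p_drop)
    return x * mask, mask

def dropout_backward(dout, mask):
    # The backward pass reuses the same mask: gradients flow only through
    # the kept units, with the same 1/(1 - p_drop) scaling.
    return dout * mask

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))
out, mask = dropout_forward(x, p_drop=0.5, rng=rng)
dx = dropout_backward(np.ones_like(out), mask)
print(out, dx, sep="\n")
```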

Evaluation of Machine Learning systems

(Return to Contents)

  1. What are accuracy, sensitivity, specificity, ROC?
  2. What are precision and recall?
  3. Describe t-test in the context of Machine Learning.
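
A quick sketch computing accuracy, sensitivity, specificity, and precision (questions 1-2) from raw confusion-matrix counts on made-up labels.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # made-up labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)       # recall / true positive rate
specificity = tn / (tn + fp)       # true negative rate
precision = tp / (tp + fp)
print(accuracy, sensitivity, specificity, precision)
```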

Clustering

(Return to Contents)

  1. Describe the k-means algorithm.
  2. What is the distortion function? Is it convex or non-convex?
  3. Tell me about the convergence of the distortion function.
  4. Topic: EM algorithm
  5. What is the Gaussian Mixture Model?
  6. Describe the EM algorithm intuitively.
  7. What are the two steps of the EM algorithm?
  8. Compare Gaussian Mixture Model and Gaussian Discriminant Analysis.
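
A NumPy sketch of k-means (questions 1-3); each iteration can only decrease the distortion, which is why it converges, though possibly to a local minimum. The toy data assumes no cluster goes empty.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    # Alternate assignment and centroid updates; neither step can increase
    # the distortion J = sum ||x - c(x)||^2, so J converges (questions
    # 2-3), though possibly to a local minimum.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```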

Dimensionality Reduction

(Return to Contents)

  1. Why do we need dimensionality reduction techniques?
  2. Why do we need PCA and what does it do?
  3. What is the difference between logistic regression and PCA?
  4. What are the two pre-processing steps that should be applied before doing PCA?
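
A sketch of PCA via the SVD, including the two usual pre-processing steps from question 4 (mean-centering and feature scaling); this is my illustration, not the article's code.

```python
import numpy as np

def pca(X, n_components):
    # Pre-processing (question 4): mean-center each feature, then scale
    # to unit variance. Then project onto the top right-singular vectors
    # of the processed data matrix.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
print(pca(X, n_components=2).shape)    # (200, 2)
```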

Basics of Natural Language Processing

(Return to Contents)

  1. What is WORD2VEC?
  2. What is t-SNE? Why do we use PCA instead of t-SNE?
  3. What is sampled softmax?
  4. Why is it difficult to train an RNN with SGD?
  5. How do you tackle the problem of exploding gradients?
  6. What is the problem of vanishing gradients?
  7. How do you tackle the problem of vanishing gradients?
  8. Explain the memory cell of an LSTM.
  9. What type of regularization does one use in an LSTM?
  10. What is Beam Search?
  11. How to automatically caption an image?
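
A sketch of beam search (question 10) over a toy stand-in for a trained language model; the model, vocabulary size, and beam width are all illustrative assumptions.

```python
import numpy as np

def beam_search(step_logprobs, vocab, beam_width=3, steps=4):
    # Keep only the beam_width highest-scoring partial sequences per step
    # instead of all vocab**steps of them (greedy search is beam_width=1).
    beams = [((), 0.0)]                       # (token sequence, log prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            logp = step_logprobs(seq)
            for tok in range(vocab):
                candidates.append((seq + (tok,), score + logp[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)
        beams = beams[:beam_width]
    return beams

rng = np.random.default_rng(0)
VOCAB = 5

def toy_model(seq):
    # Stand-in for a trained model's next-token log-probabilities.
    return np.log(rng.dirichlet(np.ones(VOCAB)))

print(beam_search(toy_model, VOCAB))
```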

Some basic questions

(Return to Contents)

  1. Can you state Tom Mitchell’s definition of learning and discuss T, P and E?
  2. What can be different types of tasks encountered in Machine Learning?
  3. What are supervised, unsupervised, semi-supervised, self-supervised, multi-instance learning, and reinforcement learning?
  4. Loosely how can supervised learning be converted into unsupervised learning and vice-versa?
  5. Consider linear regression. What are T, P and E?
  6. Derive the normal equation for linear regression.
  7. What do you mean by affine transformation? Discuss affine vs. linear transformation.
  8. Discuss training error, test error, generalization error, overfitting, and underfitting.
  9. Compare representational capacity vs. effective capacity of a model. Discuss VC dimension.
  10. What are nonparametric models? What is nonparametric learning?
  11. What is an ideal model? What is Bayes error? What is/are the source(s) of Bayes error?
  12. What is the no free lunch theorem in connection to Machine Learning?
  13. What is regularization? Intuitively, what does regularization do during the optimization procedure?
  14. What is weight decay? Why is it added?
  15. What is a hyperparameter? How do you choose which settings are going to be hyperparameters and which are going to be learned?
  16. Why is a validation set necessary?
  17. What are the different types of cross-validation? When do you use which one?
  18. What are point estimation and function estimation in the context of Machine Learning? What is the relation between them?
  19. What is the maximum likelihood estimate of a parameter vector $\theta$? Where does the log come from?
  20. Prove that for linear regression, MSE can be derived from maximum likelihood under proper assumptions.
  21. Why is maximum likelihood the preferred estimator in ML?
  22. Under what conditions does the maximum likelihood estimator guarantee consistency?
  23. What is cross-entropy loss?
  24. What is the difference between loss function, cost function and objective function?
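
For the normal equation (question 6), a numerical sketch: the minimizer of ||X theta - y||^2 satisfies X^T X theta = X^T y. The data and coefficients below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])  # bias column
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=100)

# Normal equation: X^T X theta = X^T y. Solve the linear system rather
# than forming the explicit inverse, which is numerically worse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)      # close to [1, 2, -3]
```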

Optimization procedures

(Return to Contents)

  1. What is the difference between an optimization problem and a Machine Learning problem?
  2. How can a learning problem be converted into an optimization problem?
  3. What is empirical risk minimization? Why the term empirical? Why do we rarely use it in the context of deep learning?
  4. Name some typical loss functions used for regression. Compare and contrast.
  5. What is the 0–1 loss function? Why can’t the 0–1 loss function or classification error be used as a loss function for optimizing a deep neural network?

Sequence Modeling

(Return to Contents)

  1. Write the equation describing a dynamical system. Can you unfold it? Now, can you use this to describe a RNN?
  2. What determines the size of an unfolded graph?
  3. What are the advantages of an unfolded graph?
  4. What does the output of the hidden layer of a RNN at any arbitrary time t represent?
  5. Are the outputs of the hidden layers of RNNs lossless? If not, why?
  6. RNNs are used for various tasks. From an RNN's point of view, what tasks are more demanding than others?
  7. Discuss some examples of important design patterns of classical RNNs.
  8. Write the equations for a classical RNN where the hidden layer has recurrence. How would you define the loss in this case? What problems might you face while training it?
  9. What is backpropagation through time?
  10. Consider an RNN that has only output-to-hidden recurrence. What are its advantages or disadvantages compared to an RNN having only hidden-to-hidden recurrence?
  11. What is Teacher forcing? Compare and contrast with BPTT.
  12. What is the disadvantage of using a strict teacher forcing technique? How to solve this?
  13. Explain the vanishing/exploding gradient phenomenon for recurrent neural networks.
  14. Why don’t we see the vanishing/exploding gradient phenomenon in feedforward networks?
  15. What is the key difference in architecture of LSTMs/GRUs compared to traditional RNNs?
  16. What is the difference between LSTM and GRU?
  17. Explain Gradient Clipping.
  18. Adam and RMSProp adjust the size of gradients based on previously seen gradients. Do they inherently perform gradient clipping? If no, why?
  19. Discuss RNNs in the context of Bayesian Machine Learning.
  20. Can we do Batch Normalization in RNNs? If not, what is the alternative?
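
A sketch of gradient clipping by global norm (question 17), which bounds the update magnitude during exploding gradients without changing its direction; the gradient values are made up.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # If the global norm across all gradient arrays exceeds max_norm,
    # rescale every array by the same factor; direction is preserved.
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([12.0])]    # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
print(norm, clipped)                                # rescaled to norm 5
```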

Autoencoders

(Return to Contents)

  1. What is an Autoencoder? What does it “auto-encode”?
  2. What were Autoencoders traditionally used for? Why has there been a resurgence of Autoencoders for generative modeling?
  3. What is recirculation?
  4. What loss functions are used for Autoencoders?
  5. What is a linear autoencoder? Can it be optimal (lowest training reconstruction error)? If yes, under what conditions?
  6. What is the difference between Autoencoders and PCA?
  7. What is the impact of the size of the hidden layer in Autoencoders?
  8. What is an undercomplete Autoencoder? What is it typically used for?
  9. What is a linear Autoencoder? Discuss its equivalence with PCA. Which one is better at reconstruction?
  10. What problems might a nonlinear undercomplete Autoencoder face?
  11. What are overcomplete Autoencoders? What problems might they face? Does the scenario change for linear overcomplete autoencoders?
  12. Discuss the importance of regularization in the context of Autoencoders.
  13. Why do generative autoencoders not require regularization?
  14. What are sparse autoencoders?
  15. What is a denoising autoencoder? What are its advantages? How does it solve the overcomplete problem?
  16. What is score matching? Discuss its connections to DAEs.
  17. Are there any connections between Autoencoders and RBMs?
  18. What is manifold learning? How are denoising and contractive autoencoders equipped to do manifold learning?
  19. What is a contractive autoencoder? Discuss its advantages. How does it solve the overcomplete problem?
  20. Why is a contractive autoencoder named so?
  21. What are the practical issues with CAEs? How to tackle them?
  22. What is a stacked autoencoder? What is a deep autoencoder? Compare and contrast.
  23. Compare the reconstruction quality of a deep autoencoder vs. PCA.
  24. What is predictive sparse decomposition?
  25. Discuss some applications of Autoencoders.
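
A minimal PyTorch sketch of an undercomplete denoising autoencoder (questions 8 and 15): the input is corrupted with Gaussian noise, but the reconstruction target is the clean input. Sizes, noise level, and training data are illustrative.

```python
import torch
from torch import nn

class DenoisingAE(nn.Module):
    def __init__(self, d_in=784, d_hidden=32):
        super().__init__()
        # The narrow bottleneck makes the autoencoder undercomplete (Q8).
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x, noise_std=0.3):
        x_noisy = x + noise_std * torch.randn_like(x)   # corruption (Q15)
        return self.decoder(self.encoder(x_noisy))

model = DenoisingAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                  # stand-in for a batch of images

for _ in range(10):                      # a few illustrative steps
    loss = nn.functional.mse_loss(model(x), x)  # target is the CLEAN input
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```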

Representation Learning

(Return to Contents)

  1. What is representation learning? Why is it useful?
  2. What is the relation between Representation Learning and Deep Learning?
  3. What is one-shot and zero-shot learning (Google’s NMT)? Give examples.
  4. What trade-offs does representation learning have to consider?
  5. What is greedy layer-wise unsupervised pretraining (GLUP)? Why greedy? Why layer-wise? Why unsupervised? Why pretraining?
  6. What were/are the purposes of the above technique? (deep learning problem and initialization)
  7. Why does unsupervised pretraining work?
  8. When does unsupervised training work? Under which circumstances?
  9. Why might unsupervised pretraining act as a regularizer?
  10. What is the disadvantage of unsupervised pretraining compared to other forms of unsupervised learning?
  11. How do you control the regularizing effect of unsupervised pretraining?
  12. How to select the hyperparameters of each stage of GLUP?

Monte Carlo Methods

(Return to Contents)

  1. What are deterministic algorithms?
  2. What are Las Vegas algorithms?
  3. What are deterministic approximate algorithms?
  4. What are Monte Carlo algorithms?
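
A sketch of a Monte Carlo algorithm (question 4): estimating pi from random points, with error shrinking roughly like 1/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
pts = rng.random((n, 2))               # uniform points in the unit square
inside = np.sum(pts[:, 0] ** 2 + pts[:, 1] ** 2 <= 1.0)
print(4 * inside / n)                  # ~3.14; error shrinks like 1/sqrt(n)
```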

Generative Models

(Return to Contents)

  1. What is a Variational Autoencoder (VAE)?
  2. How is VAE different from a regular Autoencoder?
  3. What are the basics of GANs?
  4. How do you train a GAN (backpropagation)?
  5. How is the GAN cost function derived?
  6. What are the drawbacks of GANs?
  7. Implement a GAN with PyTorch.
  8. Implement a GAN with TensorFlow.
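
For questions 4-7, a deliberately minimal PyTorch GAN sketch on a 1-D toy target, N(4, 1), using the non-saturating generator loss; the architectures and hyperparameters are illustrative assumptions, not a recipe.

```python
import torch
from torch import nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = torch.randn(64, 1) + 4.0      # samples from the target N(4, 1)
    fake = G(torch.randn(64, 8))         # generator maps noise to samples
    # Discriminator step: push D(real) toward 1 and D(fake) toward 0;
    # detach() keeps this step from updating G.
    d_loss = (bce(D(real), torch.ones(64, 1))
              + bce(D(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator step (non-saturating loss): make D label fakes as real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())   # should drift toward 4.0
```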

Reinforcement Learning

(Return to Contents)

  1. What is Reinforcement Learning?
  2. What are the factors in Reinforcement Learning?
  3. What are the types of Reinforcement Learning?
  4. What is positive Reinforcement Learning?
  5. What is negative Reinforcement Learning?
  6. Reinforcement Learning vs. Supervised Learning
  7. Decision making
  8. Dependency and labels
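
To ground these, a tabular Q-learning sketch on a toy chain MDP; the environment, reward, and hyperparameters are my own illustrative assumptions, not from the article.

```python
import numpy as np

# Toy chain MDP: 5 states in a row, actions left/right, reward 1 only at
# the rightmost state. Q-learning bootstraps from the best next action.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    for _ in range(20):
        # Epsilon-greedy exploration.
        a = rng.integers(n_actions) if rng.random() < eps else Q[s].argmax()
        s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
        if r == 1.0:
            break

print(Q)   # "right" should dominate in every state
```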
