
AI Questions


Inspired by the repository https://github.com/lydiahallie/javascript-questions, I created a questionnaire for several topics that fall under the umbrella of Artificial Intelligence, covering Basic Techniques and Foundations, Computer Vision, Natural Language Processing, and Reinforcement Learning.

The purpose of this repository is for you to learn the concepts, refresh your knowledge, or prepare yourself for an interview.

The concept is the same as in Lydia's repo, although I don't provide multiple-choice options. The correct answer can be seen by clicking on the expandable part. I'm happy to incorporate feedback, so feel free to drop me a line in case something feels off.


Question
Answer

Example of an answer.

1 EN: ML Basics 1

  1. What is the output when using a convolution kernel?
    Answer
    • A real number
  2. How is the conv result calculated?
    Answer
    • The sum of the element-wise multiplications of the image values with the convolution kernel values (optionally normalized by taking the mean)
  3. What is a first-order tensor called?
    Answer
    • Vector
  4. What is a second-order tensor called?
    Answer
    • Matrix
  5. What is a third-order tensor called?
    Answer
    • Tensor
  6. How can nominal data be transformed to numerical data?

    Answer

    • One Hot encoding

    data/1%20EN%20ML%20Basics%201%20c4761834b24c465c9eeabd9dcf126f86/Untitled.png
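
    A minimal sketch of this (assuming pandas is available; the feature values are made up):

    ```python
    import pandas as pd

    # Nominal feature with three categories
    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

    # One-hot encoding: one binary indicator column per category
    one_hot = pd.get_dummies(df["color"], prefix="color").astype(int)
    print(one_hot)
    #    color_blue  color_green  color_red
    # 0           0            0          1
    # 1           0            1          0
    # 2           1            0          0
    # 3           0            1          0
    ```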

  7. What is meant by segmentation?

    Answer

    • Set of data belonging to an object of an object class c

  8. Is the Euclidean distance a measure of similarity or dissimilarity?
    Answer
    • Dissimilarity measure
  9. Is the correlation a similarity measure or dissimilarity measure?
    Answer
    • Similarity measure
  10. What is a Voronoi cell?
    Answer
    • Partition of the solution space into volumes (multi-dimensional) / surfaces (two-dimensional) using the perpendicular bisectors between two adjacent points.
  11. How does the K-Neighbourhood Classifier work?

    Answer

    • A hypersphere around the new point in feature space is enlarged until it contains k elements. The class is then assigned by a simple majority decision.
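
    A minimal NumPy sketch of this procedure (toy data; sklearn's KNeighborsClassifier implements the same idea):

    ```python
    import numpy as np

    def knn_classify(x_new, X_train, y_train, k=3):
        """Classify x_new by majority vote among its k nearest neighbours."""
        # Euclidean distances from the new point to all training points
        dists = np.linalg.norm(X_train - x_new, axis=1)
        # Indices of the k closest points ("growing the hypersphere")
        nearest = np.argsort(dists)[:k]
        # Simple majority decision over their labels
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]

    X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y = np.array([0, 0, 1, 1])
    print(knn_classify(np.array([0.2, 0.1]), X, y, k=3))  # -> 0
    ```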

  12. What is a Template?

    Answer

    • A template object is the mean of all samples of a class, i.e. the center of gravity of the cluster.

      data/1%20EN%20ML%20Basics%201%20c4761834b24c465c9eeabd9dcf126f86/Untitled%201.png

      Source: Karlsruhe University of Applied Science - Prof. Dr. Norbert Link

  13. What is the Minimum Distance Classifier?

    Answer

    • If vec(x) is closer to template 1, assign class c_1; otherwise assign class c_2

  14. Perceptron: What is the idea of the weight vector?

    Answer

    • The intuition behind the weight vector comes from calculating the Euclidean distance between an input vector and a template vector (see the questions on this above). The Euclidean distance of two vectors can be expanded to

      data/1%20EN%20ML%20Basics%201%20c4761834b24c465c9eeabd9dcf126f86/Untitled%202.png

      Source: Karlsruhe University of Applied Science - Prof. Dr. Norbert Link

      The middle part is the weight vector and the +2 the bias. Because the squared Euclidean distance expands into such a dot product plus a constant, we can directly learn a weight vector w and a bias instead. The scalar product of the input vector with the weight vector (plus bias) then acts as the decision value: since the Euclidean distance is a dissimilarity measure, a score greater than 0 is assigned to a different class than a score less than 0.

      data/1%20EN%20ML%20Basics%201%20c4761834b24c465c9eeabd9dcf126f86/Untitled%203.png

      Source: Karlsruhe University of Applied Science - Prof. Dr. Norbert Link
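
      The algebra behind this, reconstructed here since the referenced slides are images: comparing the squared Euclidean distances of x to two templates t_1 and t_2, the ||x||^2 term cancels on both sides, leaving a linear decision function:

      $$\|\vec{x}-\vec{t}_1\|^2 \lessgtr \|\vec{x}-\vec{t}_2\|^2 \iff \underbrace{2(\vec{t}_1-\vec{t}_2)^\top}_{\vec{w}^\top}\vec{x} + \underbrace{\|\vec{t}_2\|^2-\|\vec{t}_1\|^2}_{b} \gtrless 0$$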

  15. Perceptron: What is the idea of input vector and weight vector?

    Answer

    • Calculate the Euclidean distance

  16. Perceptron: What is the idea of the perceptron?
    Answer
    • By means of continuous learning, we change the weight vector so that it creates a dividing line between two classes in our feature space.
  17. A linear decision function in 2D space may have how many input variables?

    Answer

    • Only one, because the function value adds another axis

    data/1%20EN%20ML%20Basics%201%20c4761834b24c465c9eeabd9dcf126f86/Untitled%204.png

    (Linear decision function)

    Source: Karlsruhe University of Applied Science - Prof. Dr. Norbert Link

  18. What property must a cost function have so that its minimum can be found?

    Answer

    • Function must be continuous and differentiable

  19. What are the two options for the minimum search of the cost function?

    Answer

    • Numerical (full scanning of the parameter space)
    • Analytical (setting the derivative to zero)
    • In the following example, the cost function is the number of incorrectly classified vec(x)

    data/1%20EN%20ML%20Basics%201%20c4761834b24c465c9eeabd9dcf126f86/Untitled%205.png

    Source: Karlsruhe University of Applied Science - Prof. Dr. Norbert Link

  20. How can the distance of a point vec(x) to the decision line be calculated?

    Answer

    data/1%20EN%20ML%20Basics%201%20c4761834b24c465c9eeabd9dcf126f86/Untitled%206.png

    data/1%20EN%20ML%20Basics%201%20c4761834b24c465c9eeabd9dcf126f86/Untitled%207.png

    Source: Karlsruhe University of Applied Science - Prof. Dr. Norbert Link
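
    The standard formula behind these slides (stated here as a known fact, with w and b as in the perceptron questions): for the decision line defined by w^T x + b = 0, the signed distance of a point x_0 is

    $$d(\vec{x}_0) = \frac{\vec{w}^\top \vec{x}_0 + b}{\lVert \vec{w} \rVert}$$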

  21. How do I make the Perceptron cost function continuous and differentiable?

    Answer

    • Use the distances of the incorrectly classified samples to the hyperplane and add them up. The function becomes continuous and fine-grained: a slight change in w now produces a correspondingly slight change in the function value, instead of a discrete jump in the error count.

    data/1%20EN%20ML%20Basics%201%20c4761834b24c465c9eeabd9dcf126f86/Untitled%208.png

    Source: Karlsruhe University of Applied Science - Prof. Dr. Norbert Link

  22. How are the weight values initialized?

    Answer

    • Random!
    • Never use 0, otherwise the algorithm cannot start

  23. How do I update the weights of a perceptron?
    Answer
    • Weight = weight - learning rate * slope of the cost function at position k (i.e. the derivative of the cost function with respect to w)
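
    A minimal NumPy sketch of this update rule (assuming labels are coded as +1/-1; for a misclassified sample, the derivative of the perceptron cost with respect to w is -y_i * x_i):

    ```python
    import numpy as np

    def train_perceptron(X, y, lr=0.1, epochs=100):
        """Perceptron learning with labels y in {-1, +1}."""
        rng = np.random.default_rng(0)
        w = rng.normal(size=X.shape[1])  # random init, never all zeros
        b = rng.normal()
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                # Misclassified if the score has the wrong sign
                if y_i * (w @ x_i + b) <= 0:
                    # w <- w - lr * dL/dw, with dL/dw = -y_i * x_i
                    w += lr * y_i * x_i
                    b += lr * y_i
        return w, b
    ```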
  24. Multiclass extension of the perceptron: Which Loss do I choose?
    Answer
    • Hinge-Loss or Maximum Margin Loss
  25. How does the Hinge Loss work?

    Answer

    • For each class a score is calculated, and the goal is that the score of the correct class is greater than the score of every other class by at least a margin delta. If so, the cost is 0; if another class's score falls within the delta margin, costs are accrued proportionally.

      data/1%20EN%20ML%20Basics%201%20c4761834b24c465c9eeabd9dcf126f86/Untitled%209.png
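
      Written out (the standard form of the multiclass hinge loss, with scores s_j and correct class y_i):

      $$L_i = \sum_{j \neq y_i} \max\left(0,\; s_j - s_{y_i} + \Delta\right)$$

      The loss is 0 only if the correct class's score exceeds every other score by at least Δ; within the margin, costs grow linearly.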

  26. What is the composition of the Multiclass SVM Loss?

    Answer

    • Composed of hinge loss and regularization loss (with lambda as regularization parameter)

  27. When do I use the loss and when a score?
    Answer
    • Loss value implies how well or poorly a certain model behaves after each iteration of optimization. Ideally, one would expect the reduction of loss after each, or several, iteration(s).
    • The accuracy of a model is usually determined after the model parameters are learned and fixed and no learning is taking place. Then the test samples are fed to the model and the number of mistakes (zero-one loss) the model makes are recorded, after comparison to the true targets.
    • Source: https://stackoverflow.com/questions/34518656/how-to-interpret-loss-and-accuracy-for-a-machine-learning-model
  28. What is a weakness of the template approach?
    Answer
    • The template becomes blurry if class instances have very different feature values
  29. How can the problem be solved for a template?
    Answer
    • Transformation of the actual features to so-called intermediate features, which are invariant to differences in original features. (non-linear transformation). This results in an intermediate feature template
  30. What is the structure of a non-linear classifier that addresses the problem of intermediate feature templates?

    Answer

    data/1%20EN%20ML%20Basics%201%20c4761834b24c465c9eeabd9dcf126f86/Untitled%2010.png

    Source: Karlsruhe University of Applied Science - Prof. Dr. Norbert Link

  31. How can non-linearity be generated?

    Answer

    • With AND, OR and XOR gates (combining several linear classifiers)

  32. What is the advantage of the templates in terms of inference?
    Answer
    • They are faster, because the Voronoi cells must only be calculated for the representatives (templates)
  33. What is the disadvantage of templates?
    Answer
    • They are inaccurate
  34. How do I get the hyperplane of a perceptron?
    Answer
    • Set one of the two variables (x1, x2) to zero and solve the equation for the other; then do the same the other way round. The two resulting intercepts define the line.
  35. What is the principle of the minimum distance classifier?
    Answer
    • Calculate a center of gravity vector from the sample vectors of a class (mean over sum). This is also called a template.
    • Classification based on distance to template
  36. What is the advantage of the minimum distance classifier compared to the nearest neighbour classifier?
    Answer
    • The Minimum Distance Classifier is more robust against outliers, because not every data point of the sample is considered, but only the class centroids.
    • The inference should be faster, since only the distance to N class centers is determined and not the distance to each individual data point of the sample.
    • The distances must be calculated only for the centers of gravity
  37. What is the disadvantage of the minimum distance classifier compared to the nearest neighbour classifier?
    Answer
    • Only works under certain conditions
    • Mean values are only a good template in special cases
    • Only linear class boundaries can be represented
  38. How does the perceptron divide a two-dimensional feature space?
    Answer
    • Through a straight line into two half spaces
  39. What are the parameters of the perceptrons and on which principle does the perceptron learn its values?
    Answer
    • Parameters: weight values w and threshold (bias) value
    • Principle: minimization of the classification-error loss function using a gradient descent method
  40. What does the perceptron cost function represent and what is it ?
    Answer
    • The perceptron cost function is the sum of the distances of all incorrectly classified samples to the decision hyperplane.
  41. How can the limitation of linear classifiers be overcome?
    Answer
    • Combination of linear classifiers in layers with non-linear activation function

3 EN: Data Science

Basics

  1. What are the two most common stochastic modelling approaches and in which areas (unsupervised/supervised) are they applied?
    Answer
    • Joint probability function (unsupervised learning)
    • Conditional probability function (supervised learning)
  2. What are the reasons for prediction/forecast errors?
    Answer
    • The model used cannot capture the real situation
    • Randomness at generating the dataset from the real model (if dataset is small)
    • Randomness at learning (e.g. stochastic gradient descent, simulated annealing)
  3. What is the concept of I.I.D ?
    Answer
    • Independent Identically Distributed. Each of the n feature vectors is independent from the others and all have their origin in the same probability distribution.
  4. What is the definition of the Likelihood L(𝜗 | G)?
    Answer
    • For the likelihood L we usually take the data G as given and evaluate the likelihood as a function of 𝜗
  5. What is the definition of P(G | 𝜗)?

    Answer

    For a given 𝜗 we evaluate the probability of different results G of the data generation.

    data/3%20EN%20Data%20Science%20fa58d61a8fcb4fe6ae468822baa9dde3/Untitled.png

    probability is the quantity most people are familiar with which deals with predicting new data given a known model ("what is the probability of getting heads six times in a row flipping this 50:50 coin?") while likelihood deals with fitting models given some known data ("what is the likelihood that this coin is/isn't rigged given that I just flipped heads six times in a row?"). I wanted to add this perspective because using the example in this clip, a likelihood quantity such as 0.12 is not meaningful to the layman unless it is explained exactly what 0.12 means: a measure of the sample's support for the assumed model i.e. low values either mean rare data or incorrect model!

    Source: https://www.youtube.com/watch?v=pYxNSUDSFH4

  6. What is the Maximum Likelihood Principle?

    Answer

    • We determine parameter 𝜗 that results in the max Likelihood for given data G

      𝜗_hat = argmax_𝜗 L(𝜗 | G)
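
      A short worked example (standard statistics, not from the slides): for n coin flips with k heads,

      $$L(\vartheta \mid G) = \vartheta^{k}(1-\vartheta)^{n-k}, \qquad \frac{d}{d\vartheta}\log L = \frac{k}{\vartheta} - \frac{n-k}{1-\vartheta} = 0 \;\Rightarrow\; \hat{\vartheta} = \frac{k}{n}$$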

  7. What is the definition of a generative stochastical model?

    Answer

    • We speak of a generative stochastic model for a data-mining problem when we assume that the data was generated by a stochastic process with a known structure (which usually has unknown parameters)

  8. What is the difference between the Naive Bayes Classifier and the Bayes classifier?
    Answer
    • The Naive Bayes classifier additionally assumes that all features are conditionally independent given the class, so the joint likelihood factorizes into a product of per-feature likelihoods; the (full) Bayes classifier makes no such assumption and models the complete joint distribution, which quickly becomes intractable for many features.

Bias-Variance Tradeoff & Bayesian Statistic

  1. What is the definition of Bias?
    Answer
    • The bias of a learned model for a specific predicted value is the deviation between the expected value E(f̂(x_0)) of the prediction and the predictable part f(x_0) of the true value (ergo the estimation of the parameter is not unbiased)
  2. How is the Bias produced/generated?
    Answer
    • It is often a consequence of the function class used in learning not being powerful enough to model the true relationship, or of regularization or a prior preventing the parameters from being chosen appropriately.
  3. Definition Variance
    Answer
    • The variance of the learned model for a certain predicted value f̂(x) is the random deviation from the predictable part f(x) of the true value.
    • Especially for models with (too) many degrees of freedom, the prediction result is strongly influenced by the randomness of the perturbation Δ, because the model can and will also fit this random noise, which turns out differently with each repetition.
  4. What is the composition of the Mean Squared Error (MSE) ?
    Answer
    • MSE = variance + squared bias (E(f̂(x)) - f(x))^2 + non-predictable part/noise (epsilon)
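
    Written out as the standard decomposition (with f the true function, f̂ the learned model and Δ the noise with variance σ²):

    $$\mathbb{E}\bigl[(y-\hat f(x))^2\bigr] = \underbrace{\bigl(\mathbb{E}[\hat f(x)] - f(x)\bigr)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\bigl[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\bigr]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{noise}}$$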
  5. Describe the Tradeoff between Bias and Variance
    Answer
    • Suppose I vary the degrees of freedom: with fewer degrees of freedom my model will have a higher bias but a lower variance
    • The optimal total error is usually achieved somewhere in the middle of the model complexity, i.e. neither too much bias nor too much variance is desired.
    • However, if I know that a parameter is in fact independent of others, then the model would be limited if I did not treat it as such. In this case, after remodelling, I reduce the variance without increasing the bias.
    • → Goal: Find independent Variables

Bayes

  1. What is the difference between the Bayesian learning approach and the classical one?
    Answer
    • The a-priori probability distribution is given and we generate a parameter vector THETA based on it (one-time execution). This is used to generate the data via random experiments.
    • Based on the random sample, Bayes' rule is applied and the a-posteriori probability distribution is calculated (without Bayes you would have a histogram here, which would basically be the prior?)
    • The a-posteriori distribution is used for inference/recall
  2. What are the advantages of the Bayesian learning approach?
    Answer
    • The uncertainty of the estimates is taken into account in a consistent manner
    • Therefore less tendency to overfitting
    • The assumptions become explicitly visible in the form of the prior
  3. What are the disadvantages of Bayesian learning?
    Answer
    • The Prior must be mostly subjectively guessed.
    • The recall/inference is usually too complex for exact calculation and must be approximated numerically (predictions must be made with all imaginable parameter values, followed by weighted averaging).
  4. Definition Prior
    Answer
    • A-priori (i.e. assumed before seeing the data, on the basis of prior knowledge) probability distribution of the unknown parameters
  5. Definition Posterior
    Answer
    • A-posteriori (i.e. determined in consideration of data and prior knowledge) probability distribution of unknown parameters
  6. Is Bayes suitable for practice?
    Answer
    • No, because calculating the integral over all possible parameters analytically is too costly
  7. Why do people still like to use Bayes?
    Answer
    • Serves as a source of intuition as to what one should expect for a good prognosis
    • Approaches of Model Averaging can be considered as approximations of Bayesian learning
  8. Give an example approach for the approximation of Bayes
    Answer
    • Bagging: Different sampling (random) and training of separate models. Inference averages the predictions of all models. Predictions turn out differently because of different samples.

Dimensionality reduction

  1. Why should dimensions be reduced?
    Answer
    • Too many dimensions lead either to computational problems or to worse performance due to irrelevant inputs (curse of dimensionality)
  2. Which modeling assumptions are made to solve the problem?
    Answer
    • Assumption of a linear relationship (correlation of two features)
    • Assumption of irrelevant features (input pruning)
    • Assumption that the problem can be solved by a linear projection of the input space (PCA)
  3. Name concrete approaches for the reduction of dimensions
    Answer
    • Omit unimportant inputs (Input Pruning, Feature Selection), e.g. by
      1. Greedy Algorithm: remove the (apparently) least important one several times in succession
      2. Ideal selection: Train a model with each subset of inputs and take the best one.
    • Summarize inputs. (e.g. through aggregation or more complex transformations such as PCA (Principal Component Analysis))
  4. What are the critical points of a) Greedy, b) Ideale Selection and c) PCA?
    Answer
    • a) Greedy selection can lead to suboptimal decisions
    • b) Impracticable, as it is too costly. Overtraining in input selection possible.
    • c) PCA considers only the statistical distribution of input values, without considering the relevance for the output.
  5. Explain the idea of PCA (Principal Component Analysis)?

    Answer

    • The idea is to rotate the coordinate system (the columns of the rotation matrix are orthogonal) so that axes expressing little variance of the data (caused, for example, by the correlation of two variables) can be omitted. This results in two goals:
      • the resulting variables are linearly independent of each other.
      • the transformed dimensions have as much variance as possible and are sorted by the decreasing variance of the data
    • Intuition:
      • Preserve the information available in n variables as well as possible in m variables, where m < n.

    However: Result depends on the scaling of the inputs.

  6. Name the steps of the PCA

    Answer

    • Step 1: Make the variables as independent as possible by rotation
    • Step 2: Omit the variables with the lowest variance, for example the second principal component
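
    A minimal NumPy sketch of these two steps (sklearn.decomposition.PCA wraps the same idea):

    ```python
    import numpy as np

    def pca(X, m):
        """Project X (n_samples x n_features) onto its m principal components."""
        # Centre the data (the result also depends on the scaling of the inputs)
        Xc = X - X.mean(axis=0)
        # Step 1: rotate so the variables become linearly independent:
        # eigendecomposition of the covariance matrix
        eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        # Sort directions by decreasing variance
        order = np.argsort(eigvals)[::-1]
        # Step 2: keep the m directions with the highest variance
        return Xc @ eigvecs[:, order[:m]]
    ```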

Sampling

  1. What is sampling and when should sampling be applied?
    Answer
    • If the data set is too large, the approaches are first tested on small excerpts. If the results are good, the large data set can be used.
  2. What are the possibilities for downsizing?
    Answer
    • Reduce number of tables or columns
    • Reduce number of rows
  3. What options are there for reducing the number of columns?
    Answer
    • Omitting tables or columns (only the important columns are kept)
    • Column aggregation (averaging of columns, e.g. pixels of images) E.g. replace measured values from different sensors by their average value
    • Dimension reduction, e.g. with PCA
  4. What options are there for reducing the number of rows?

    Answer

    • Aggregation of several rows

      E.g. measured values recorded at millisecond intervals can be coarsened by averaging them over one-second intervals

    • Simple Random Sampling

      • Without replacement: the default case
      • With replacement: loses more information; desired if stronger randomness is required
    • Stratified Sampling
      • Amount of data is divided into groups/categories and simple random sampling is performed on these groups. This ensures that the same amount of data is sampled from both groups. (Important if you want to keep the original distribution, e.g. elections)
    • Systematic selection (by hand, manual)

  5. What is the advantage of random sampling?
    Answer
    • Both variants are robust in case the data set is arranged according to some systematic order.
  6. What is the difference between the variant "without replacement" and the variant "with replacement"?
    Answer
    • The variant without replacement is the standard procedure and produces less variance than "with".
    • "With" is used if more variance is desired (e.g. bagging).
  7. What is a bias?
    Answer
    • A systematic estimation error
  8. By which factor does the standard deviation increase for pure random sampling?
    Answer
    • By the reciprocal of the square root of the sampling rate
  9. When is bias generated during sampling?
    Answer
    • If the sampling selects data objects with different probabilities
  10. How can the generated bias be compensated?
    Answer
    • By assigning the data sets in the sample weighting factors that are inversely proportional to the probability of their being selected into the sample (the reciprocal value).
      • Example: fraud cases enter the sample with probability 100%, non-fraud cases with 0.1%. Compensate by weighting the data sets in the sample with the reciprocal of their sampling probability => non-fraud * 1000
  11. Give two examples of criteria for systematic selection
    Answer
    • Deterministic selection criterion linked to the data contents. An extreme form of stratified sampling in which nothing is left to random chance.
      Example: A department store with an inventory of 100,000 products and 100 branches creates a data record of 10 million purchase transactions - one transaction per product from each branch.
    • Deterministic selection criterion without relevance to the data contents.
      Example: Every 10th row of the data set (corresponds to random sampling for randomly sorted data)
  12. What does the abbreviation IID stand for and what does it mean?
    Answer
    • Independent and identically distributed random variables
    • Rows are independent of each other and are all subject to the same distribution
  13. What problems can occur when sampling relational data (across multiple tables)?
    Answer
    • Integrity violations during subsequent joins (e.g. a purchase order item with an order_id for which there is no master data)
    • Cross-record key figures can be falsified (e.g. number of articles per shopping cart, average time interval between orders from a customer)

    Falsifications can be corrected, but this is very costly and depends on the type of key-figure calculation. In addition, the corrected values tend to have a very high variance.
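
    A short pandas sketch contrasting simple and stratified random sampling (hypothetical fraud data with made-up sizes):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "label": ["fraud"] * 50 + ["ok"] * 950,
        "amount": range(1000),
    })

    # Simple random sampling without replacement (the default case)
    simple = df.sample(n=100, random_state=0)

    # Stratified sampling: the same fraction from every group,
    # so the original label distribution is preserved
    stratified = df.groupby("label").sample(frac=0.1, random_state=0)
    print(stratified["label"].value_counts())  # fraud: 5, ok: 95
    ```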

  14. What would be the best solution for relational sampling?

    Answer

    • Look at the data; it is best to sample at the top level and then join downwards (e.g. UserId is the top level; join the order_ids to the sampled users)

  15. What are the disadvantages of relational sampling?
    Answer
    • If more than one table is processed, it may not be feasible
    • Sampling depends on the model approach and the features used (e.g. if, like Instacart, we want a variable for every order that tells us how long before it the previous order was received - no matter from which user)

Metrics

  1. What is the ROC (Receiver Operating Characteristic) curve?

    Answer

    ROC and AUC, Clearly Explained!

    • ROC is used if you want to compare different values for a threshold.
    • Comparison of the True Positive Rate (Sensitivity) and the False Positive Rate (1 - Specificity)

      data/3%20EN%20Data%20Science%20fa58d61a8fcb4fe6ae468822baa9dde3/Untitled%201.png

    • For each new threshold the values are recalculated and plotted again. Here you can see that there are two points that would be best. The left point would have no false positives at all. Depending on the use case, a decision must be made here. The points can still be connected to a line.

      data/3%20EN%20Data%20Science%20fa58d61a8fcb4fe6ae468822baa9dde3/Untitled%202.png

  2. What is AUC?

    Answer

    • Area under the Curve. Describes the area below the curve and is very useful for comparing different approaches

      data/3%20EN%20Data%20Science%20fa58d61a8fcb4fe6ae468822baa9dde3/Untitled%203.png
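
      A minimal scikit-learn sketch (made-up labels and scores) that computes the curve points and the AUC:

      ```python
      import numpy as np
      from sklearn.metrics import roc_curve, roc_auc_score

      y_true = np.array([0, 0, 1, 1, 0, 1])
      y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # predicted probabilities

      # One (FPR, TPR) point per threshold
      fpr, tpr, thresholds = roc_curve(y_true, y_score)
      print(roc_auc_score(y_true, y_score))  # area under that curve
      ```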

  3. What is the disadvantage of AUC?

    Answer

    • Compared to precision, an imbalance in the data (e.g. many true negatives) can affect the result.

  4. Provide metrics for regression models
    Answer
    • Mean squared error
    • Root mean squared error
    • Mean absolute error
    • Median absolute error
  5. What is the advantage of Root Mean Squared Error?
    Answer
    • Makes the mean squared error more interpretable
  6. What is the advantage of Mean Absolute Error?
    Answer
    • Not so sensitive to outliers (more robust)
  7. What is the advantage of Median Absolute Error?
    Answer
    • Even more robust against outliers
  8. When should Precision be used?
    Answer
    • Suitable for highly unbalanced problems
  9. What are the two approaches to multiclass classification?

    Answer

    • One-vs-Rest: train one binary classifier per class (class vs. all others) and pick the class with the highest score
    • One-vs-One: train a binary classifier for every pair of classes and take a majority vote

Graphical Models

  1. What does a graphical model represent?
    Answer
    • The stochastic (in)dependence structures between random variables, that are modeled as nodes
  2. What types of networks are there?
    Answer
    • Causal networks (includes causal structure)
    • Bayesian networks (dependency structure like for causal networks, but without causal structure). Directed Graph
    • Markov networks (undirected graph)
  3. What is the difference between a causal network and a Bayesian network?
    Answer
    • Knowing the causality relationships makes it possible to predict what happens when you change variables. With Bayesian networks, one can only conclude what happens with what probability as long as one does not manipulate/intervene in the system.

EM (Expectation-Maximization)-Algorithm

  1. What is the goal of the EM-Algos?
    Answer
    • Clustering of data points by estimating the distribution parameters (variance and mean), starting from a-priori values.
  2. Describe the process of the EM

    Answer

    Starting point: A cluster of data points without knowing by which probability function the values are created.

    data/3%20EN%20Data%20Science%20fa58d61a8fcb4fe6ae468822baa9dde3/Untitled%205.png

    1. Estimate the a-priori values for variance and mean (start with two randomly placed Gaussians)
    2. Expectation step: Use Bayes' rule to estimate the probability of a point belonging to a certain distribution. Either blue or yellow, but no hard assignment, only a probability (between 0 and 1).
    3. Maximization step: Update mu and the variance based on the softly assigned points

      data/3%20EN%20Data%20Science%20fa58d61a8fcb4fe6ae468822baa9dde3/Untitled%206.png

    Source: https://www.youtube.com/watch?v=REypj2sy_5U&list=PLBv09BD7ez_4e9LtmK626Evn1ion6ynrt&ab_channel=VictorLavrenko
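
    The same procedure in a few lines via scikit-learn's Gaussian mixture implementation (synthetic 1-D data; the E- and M-steps run inside fit):

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two 1-D Gaussian clusters with unknown parameters
    X = np.concatenate([rng.normal(0, 1, 300),
                        rng.normal(5, 0.5, 300)]).reshape(-1, 1)

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    print(gmm.means_, gmm.covariances_)  # M-step results after convergence
    print(gmm.predict_proba(X[:3]))      # soft assignments from the E-step
    ```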

  3. How does EM differ from K-Means?

    Answer

    • If you make a hard assignment in the E-step (hard assignment to a cluster), i.e. each missing value is set to its most likely value instead of keeping the whole probability distribution, you get exactly the k-means algorithm.

Data Quality, Missing Values, and Outliers

  1. When do data flaws not bother you?
    Answer
    • If the frequency and type of flaws do not suggest a major change in the result for the selected model type (e.g. single, minor measurement errors, rare outliers with an outlier-robust approach such as Median Absolute Error instead of Mean Squared Error)
    • If they can be considered as part of the learning problem
  2. When does it make sense to take data flaws into training?
    Answer
    • Training the model with it makes sense if the same deficiencies are also present as input when using the model. The advantage is that no effort is required for data cleansing and if the model can handle it, it is an optimal solution. The disadvantage is that a combination with expert knowledge or models from other sources is difficult. If necessary, rename the feature, for example Temperature → Temperature reported by sensor
  3. Specify causes for missing values
    Answer
    • Incomplete data collection (sensor failure, questionnaire incomplete)
    • Variable makes no sense (e.g. "time since last order" for first order)
  4. How can outliers be detected?
    Answer
    • Check the value range of the variables for plausibility, possibly look at the histogram.
    • Display joint distribution of pairs of variables, e.g. as a scatter diagram
    • Learn the probabilistic model of the common probability distribution of all variables (i.e. unsupervised). Data objects with the smallest probability are outlier candidates. (Or, for non-probabilistic clustering model: Objects with a large distance from their cluster center, as well as clusters with very few assigned points are candidates).
  5. How should outliers be dealt with?
    Answer
    • Leave it in (in any case at (3); also at (2), if the learning process can cope with it).
    • Possibly, use a robust model type (e.g. Median Absolute Error instead of Mean Squared Error)
    • Replace with replacement value or treat as missing.
  6. What types of missing values are there?

    Answer

    1. MCAR: Missing Completely at Random:

      Whether a value is missing or not is stochastically independent of all input and output variables. Example: Transmission error due to radio interference when reading out a radio temperature sensor.

    2. MAR: Missing at Random:

      Whether a value is missing or not is stochastically independent of the missing value itself, given the observed variables. Example: Women are less willing to give their age in a survey than men. Here: stochastic dependence on gender, but not on age itself.

    3. NMR: Not Missing at Random: Whether a value is missing or not depends on the missing value. Example: Older women are less willing to reveal their age than younger ones.

    Which of the three types does not lead to a bias?

    • MCAR

    Options for dealing with missing values:

    1. Use approaches that can handle missing values (for example, Naive Bayes classifier, certain decision tree variants)
    2. Ignore affected data objects completely
    3. Use affected data objects only partially
    4. Insert a substitute value (imputation)
    5. Use the probability distribution of the missing value as a substitute (EM algorithm)
    6. Code 'value missing' as a special value of the variable. Thus, a model (or, in the case of manual analysis, a human) with approaches that require complete data can be applied to incomplete data.

    All these procedures only work under certain conditions. There is no way around finding out (or making assumptions) why the values are missing.

  7. Handling missing values: Explain advantages/disadvantages of Naive Bayes classifier

    Answer

    • Usually leads to bias, if not MCAR.
    • Advantage: Very simple, because you can simply omit one variable in the term.

  8. Handling missing values: Explain advantages/disadvantages when affected data objects are completely ignored
    Answer
    • Advantage: Simplest approach.
    • Disadvantage: Available data shrinks. Under certain circumstances this can happen very strongly. Example: If there are 100 variables, and each individual value is missing with 2% probability (independent of the others), only 13% of the data objects are complete
    • May lead to bias if not MCAR and the output variable is stochastically dependent on the absence of the input variable (Supervised Learning), or if the absence of the input variable depends on the value of a variable whose probability distribution is to be modeled (Unsupervised Learning).
    • Prerequisite: Only suitable when little is missing.
  9. Handling missing values: Explain advantages/disadvantages of affected data objects only partially used

    Answer

    • Advantage: Data is optimally used
    • Prerequisites/Disadvantage: May lead to bias if not MCAR, just like when affected data objects are completely ignored.

      Example: In the scatterplot for the dimensions (X1, X2) omit the data objects where X1 or X2 are missing, in the scatterplot (X2, X3) omit those where X2 or X3 are missing.

  10. Handling missing values: Explain advantages/disadvantages of using substitute values (imputation)

    Answer

    • (a) For quantitative feature, e.g. mean or median
    • (b) For ordinal feature: median
    • (c) For nominal feature: modal value; or
    • d) CPS hot deck (Current Population Survey): Use the value of the previous data object for this feature. (This means that a random value is taken from the distribution of this feature)
    • e) Or in general: Learn a predictive model for the missing feature and use it to estimate the missing values. Common: k-nearest neighbor

    Advantage of a-d: Simple

    Downside:

    • a-c falsify the distribution of the missing attribute even with MCAR

      → possibly Bias in the results

    • (d) still generates some bias (because dependencies are not taken into account), less than a-c, but additionally introduces variance (model error/variance due to the arrangement of objects)

    • (e) produces no variance and even less bias than (d) (bias by using only the most likely value). Is more expensive. (Minimum bias is present because I take the best value)
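
    A scikit-learn sketch of variants (a) and (e) (toy matrix with one missing value):

    ```python
    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer

    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, 6.0]])

    # (a) quantitative feature: replace by the column mean
    mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)

    # (e) model-based: estimate from the k nearest neighbours
    knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
    ```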

  11. Handling missing values: Explain advantages/disadvantages of using probability distribution of missing value as a substitute. (EM algorithm)

    Answer

    • Learn the prediction model for the probability distribution of the missing feature to estimate the missing feature from the other features (EM algorithm)
    • Create multiple data objects with the estimated probability values and weight the estimated value with the assigned probability.

      Advantage: Best practice from a statistical point of view, with MCAR there is no bias (or only to the extent that the model used has bias).

      Disadvantage: High effort. Only worth it if many important values are missing.

  12. Handling missing values: Explain advantages/disadvantages of "value missing" encoding as a special value of the variable.

    Answer

    • No problems with nominal variables
    • Not very nice with metric or ordinal variables (e.g. -99 = "missing").

      Whether this is an issue depends on the model types used.

      → Solution:

      • Perhaps assign additional binary variables (F1 ... Fn) to the variables (X1 ... Xn), which specify, which of the X-values are missing (nevertheless you need replacement values for the missing values in X).

      Prerequisite: It makes sense to learn missing values as part of the problem definition.

      Advantage: Little effort. If the model can handle it, optimal solution. Disadvantage: With ordinal or metric features the subject is difficult to model for many model types.

  13. How does the bias behave with MCAR and "omitting data objects", imputation and the EM algorithm?

    Answer

    • Generally unproblematic.
    • In this case, Omitting the relevant data objects results in no bias, but in an increase in the variance of the model because the amount of training data is reduced. A data scientist works on a random sample of the (unknown) complete data set.
    • If you let a model learn to estimate the probability distribution of the missing value depending on the other variables and thus generate replacement values, this also leads to no bias.
    • However, using only the most probable replacement value, or the mean or median, generally results in a systematic deviation of the completed data set from the true values, and therefore in a bias.

  14. How does the bias behave with MAR when using "omitting data objects", imputation or the EM algorithm?

    Answer

    • Omitting in MAR, if the value of the missing variable is to be estimated, leads to distorted estimates (bias increases)
    • Replacing by mean value in MAR if the value of the missing variable is to be estimated, leads as well to distorted estimates (bias increases).

    Under MAR, the absence is stochastically independent of the missing value itself under the condition that the other variables are known:

    1. If the variables on which the absence is stochastically dependent only serve as input variables, no bias occurs in the conditional probability distribution (or regression task) to be estimated, if one just leaves out the incomplete data sets.
    2. If the absence also depends on output variables, or if a common probability distribution is to be estimated, omitting the data objects with missing values results in a bias. However, this bias can be compensated by creating an estimation model for the probability of the absence depending on the variable values, and then weighting the data records with the reciprocal of the probability of absence.

  15. How does the bias behave with NMR when using "omitting data objects", imputation or the EM algorithm?

    Answer

    • Omitting the incomplete data objects generally leads to a bias.
    • Problem: This cannot be compensated, because the dependency of the missing probability on the missing value cannot be estimated (no training data)
    • Also imputation procedures lead to a bias because there is no training data for data objects with missing values, but the value distribution for these data objects is different from that for those without missing values.

    If there is a risk that the missing values will have a significant influence on the results, the data must be collected again.

  16. When is a clean approach to the problem critical?

    Answer

    • If a strong correlation between the missingness and the target value of the modelling is to be feared

  17. What should be done first when it is noticed that features are incomplete?
    Answer
    • Make assumptions about the mechanism by which the missing values arise.
  18. When are the generated model errors probably not critical and a simpler procedure should be chosen?
    Answer
    • If only a few values are missing in a non-critical feature, or in general if values are missing in a non-essential feature.
  19. What should be done if data with the same systematics is expected to be missing when given as input to the inference?

    Answer

    → Interpret missing data as part of the problem definition, give the model the information what is missing as input

  20. What are the requirements for a Bayes network?

    Answer

    • It must not contain cycles
    • It must have directed edges

5 EN: AI - Computer Vision, NLP, Reinforcement Learning

Computer Vision

  1. What are the difficulties in classifying an image?
    Answer
    • Intra-class variation (Different breed of dogs)
    • Shape Variation (Size of dogs)
    • Illumination variation
  2. What kinds of features did people use before 2012?

    Answer

    • Color Histogram
    • Key Point descriptor

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled.png

    • Histogram of gradients

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%201.png

    • Bag of words model

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%202.png

  3. Name popular benchmark datasets

    Answer

    • MNIST
    • CIFAR-10 (Canadian Institute for Advanced Research)
    • IMAGENET (about 22,000 categories)

  4. Why weren't CNNs used before?
    Answer
    • They require huge datasets
    • and immense computational power
  5. Name the most popular loss functions for regression and classification
    Answer
    • Regression: L2 loss (aka mean squared error), L1 loss (aka mean absolute error)
    • Classification: cross-entropy loss, hinge loss (used in SVMs)
  6. Explain cross-entropy loss

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%203.png

    • M - number of classes
    • y_i,c - binary indicator (0 or 1) whether class label c is the correct classification for observation i
    • (p_i,c) - the predicted probability that observation i is of class c

    Source: Karlsruhe University of Applied Science - Yurii Tolochko, 2019

    Actually the log function would be mirrored on the x-axis, and the log of a small predicted probability would be a large negative value; but since our goal is to minimize the loss, we flip the function by multiplying it by -1.

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%204.png

    Source: ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html

    Therefore, eventually the network learns to output high probabilities for the correct class and low probabilities for the incorrect ones.
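
    A NumPy sketch of the formula above (one-hot labels, made-up predicted probabilities):

    ```python
    import numpy as np

    def cross_entropy(y_onehot, p_pred, eps=1e-12):
        """Mean cross-entropy: -sum_c y_c * log(p_c), averaged over samples."""
        return -np.mean(np.sum(y_onehot * np.log(p_pred + eps), axis=1))

    y = np.array([[0, 0, 1]])  # correct class: 2
    print(cross_entropy(y, np.array([[0.1, 0.1, 0.8]])))  # low loss (~0.22)
    print(cross_entropy(y, np.array([[0.8, 0.1, 0.1]])))  # high loss (~2.30)
    ```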

  7. What is a dense layer?

    Answer

    • A dense layer (also called a fully connected layer) is a layer where each input is connected to every output, i.e. what we modelled

  8. What is the standard activation function to use?

    Answer

    • ReLU (Rectified Linear Unit): f(x) = max(0, x)

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%205.png

    Source: https://cdn-images-1.medium.com/max/1200/1*ZafDv3VUm60Eh10OeJu1vw.png

  9. What is the output of the conv filter when it slides over an image?

    Answer

    • The dot product of the filter weights with the image patch it currently covers

  10. What does a pooling layer do?

    Answer

    • Used for dimensionality reduction
    • Slide with a stride > 1
    • Instead of the dot product of a conv filter, the output is the max (there are others, like the average)

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%206.png

      Source Illustration: Fei-Fei Li & Justin Johnson & Serena Yeung

  11. Impression of a ConvNet from the inside

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%207.png

    Source: Karlsruhe University of Applied Science - Yurii Tolochko, 2019

  12. What kind of level feature does a ConvNet encode in the beginning and at the end?

    Answer

    • The early layers encode the most basic features, whereas deeper layers pick up more specific ones (e.g. the eyes of a face)

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%208.png

    Source: Karlsruhe University of Applied Science - Yurii Tolochko, 2019

  13. How did they improve neural networks further?

    Answer

    • Smaller conv filters but deeper networks

  14. What is the idea of Inception Modules?

    Answer

    Create a good local network in a module and stack the modules on top of each other.

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%209.png

    Source Illustration : Fei-Fei Li & Justin Johnson & Serena Yeung

  15. How do Inception Modules work?

    Answer

    • Different techniques are applied independently and concatenated afterwards. It is not clear in the beginning which of the approaches will work, but one of these just might.

  16. What is the problem of Inception Modules?

    Answer

    • It creates a huge computational complexity due to a large increase of the output depth
    • Example:

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2010.png

      Source Illustration : Fei-Fei Li & Justin Johnson & Serena Yeung

  17. How can the problem of Inception Modules be solved?

    Answer

    • By applying 1x1 convolutions at the right position, such as prior to a 5x5 or as successor of a 3x3 max pooling

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2011.png

      Source Illustration : Fei-Fei Li & Justin Johnson & Serena Yeung

    Applying 1x1, leads to:

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2012.png

    Source Illustration : Fei-Fei Li & Justin Johnson & Serena Yeung

  18. What is the goal/purpose of a 1x1 conv filter?

    Answer

    • It reduces the depth by combining feature maps

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2013.png

    Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  19. What are the three parts of GoogLeNet?

    Answer

    • Stem Network
    • Stacked Inception Modules
    • Classifier Outputs

  20. What is the idea of Deep Residual Learning?

    Answer

    • Learn the difference to the identity map instead of the map itself

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2014.png

      Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  21. What is Transfer Learning?

    Answer

    • Transfer learning focuses on storing knowledge gained for a specific problem and applying it to a different but related problem

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2015.png

      Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  22. What different challenges are solvable with computer vision?

    Answer

    • Object localization: output the class of an image (object) but also output the position of the object (e.g. in form of a bounding box)
    • Object detection: classify multiple objects on an image as well as their locations (bounding boxes)

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2016.png

    • Semantic segmentation: label the individual pixels that belong to different categories (don’t differentiate between instances). Widely used in self-driving cars.

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2017.png

    • Instance segmentation: detect multiple objects, then label the pixels belonging to each of the detected objects. In contrast to semantic segmentation: differentiate between instances!

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2018.png

      Source Illustration: Fei-Fei Li & Justin Johnson & Serena Yeung

  23. What is the idea for object localization?

    Answer

    • We provide the class we are looking for and the NN tries to locate it. We still need to train a classifier for scoring the different classes, because there could be more than one class in the picture. If the picture showed a dog and cat, and we specify the class as cat, we would only get the bounding boxes for the cat. (or all cats)
    • train a regression output for the box coordinates

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2019.png

      Source Illustration: Fei-Fei Li & Justin Johnson & Serena Yeung

  24. What are the different loss functions for object localization?

    Answer

    • Goal: find the object and its bounding box in the image
    • softmax loss for classification
    • L2 loss (MSE) for bounding boxes

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2020.png

      Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  25. What is pose estimation?

    Answer

    • The human body gets split up in different parts, such as left hand, head, right foot. The NN tries to figure out these parts in the image.

  26. What's the output of a NN doing pose estimation?

    Answer

    • A vector holding different parts and their coordinates.

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2021.png

      Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  27. What is the loss of pose estimation?

    Answer

    • L2 loss

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2022.png

      Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  28. What are some approaches to object detection (and why is it harder than object localization)?

    Answer

    • We do not know beforehand how many objects there will be
    • Therefore, we cannot use the same approach as in localization (where only one object was to be found and classified).
    • A brute-force approach: apply a CNN to many different subsets of an image and classify each as an object or background.

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2023.png

      Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  29. Why is the brute force approach bad?

    Answer

    • checking every crop is computationally costly

  30. What are other algorithms to do object detection?

    Answer

    • Region-proposal methods: R-CNN, Fast R-CNN, Faster R-CNN
    • Single-shot detectors such as YOLO (You Only Look Once) and SSD (Single Shot Detector)

  31. R-CNNs are slow, thats why Fast R-CNN were born. What's the difference?

    Answer

    • R-CNN runs the CNN separately on every region proposal. Fast R-CNN passes the whole image through the CNN once and pools a feature crop per proposal (RoI pooling) from the shared feature map, which makes it much faster.

  32. What does Fast R-CNN do?

    Answer

    • Fast R-CNN uses a predefined algorithm (e.g. selective search) for creating region proposals; its successor Faster R-CNN instead uses a Region Proposal Network (RPN) to predict proposals from the features.

  33. What are MobileNets and what do they use?
    Answer
    • On mobile devices we need neural networks to run in real time, so the number of computations has to be reduced. The engine for this is called depthwise separable convolutions.
  34. Explain the idea of Depthwise Separable Convolutions

    Answer

    The key is that the full dot product across all input channels is expensive. A depthwise separable convolution splits a standard convolution into a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution that combines the channels, which drastically reduces the number of multiplications.
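
    A back-of-the-envelope sketch using the cost formulas from the MobileNets paper (feature map D_f x D_f, kernel D_k x D_k, M input channels, N output channels):

    ```python
    # Multiplication counts for one layer
    D_f, D_k, M, N = 14, 3, 512, 512

    standard  = D_k * D_k * M * N * D_f * D_f
    depthwise = D_k * D_k * M * D_f * D_f  # one filter per input channel
    pointwise = M * N * D_f * D_f          # 1x1 conv recombines the channels
    separable = depthwise + pointwise

    print(standard / separable)  # roughly 8-9x fewer multiplications
    ```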

  35. What impact do the two parameters have on MobileNets?

    Answer

    The parameters can be used to decrease latency at the cost of some accuracy.

    • Width multiplier: scales how thin or thick the feature maps get (i.e. the number of channels)
    • Resolution multiplier: scales the resolution of the input image and of the feature maps
    • The image shows the accuracy dropping with fewer computation steps

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2029.png

    Source: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, A. Howard et al.

    Figure 5 shows the trade off between ImageNet Accuracy and number of parameters for the 16 models made from the cross product of width multiplier α∈{1,0.75,0.5,0.25} and resolutions {224,192,160,128}

  36. What's the difference between semantic segmentation and instance segmentation?

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2030.png

    Source: http://vladlen.info/wp-content/uploads/FSO-1.jpg

  37. What is Panoptic Segmentation?

    Answer

    • Combines segmenting stuff (background regions, e.g. grass, sky) and Things (Objects with well defined shape, e.g. person, car). Therefore it combines Semantic and Instance Segmentation

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2031.png

      Source: Panoptic Segmentation A. Kirillov et al.

  38. How does Panoptic Segmentation work?

    Answer

    • The algorithm must label every pixel of the found objects but also every pixel of the background "stuff". We map every pixel i to a tuple:

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2032.png

      Source: Karlsruhe University of Applied Science - Yurii Tolochko, 2019

  39. Name datasets to train Semantic or Instance Segmentation models

    Answer

    • COCO (Common Objects in Context)
    • BDD100K (A Large-scale Diverse Driving Video Database)
    • Mapillary Vistas Dataset (Data from cities all over the world)
    • Cityscapes (50 cities (49 german cities + Zurich), various seasons (spring, summer, fall))
    • The KITTI Vision Benchmark Suite

  40. What is the Loss Function for Semantic Segmentation?

    Answer

    • Per-pixel cross-entropy between the predicted class probabilities and the ground-truth label of each pixel, averaged over all pixels

  41. What is the problem for the Loss Function for Semantic Segmentation?

    Answer

    • Here: every pixel is treated equally (because of the averaging). This can lead to some issues related to imbalanced classes.

  42. How can you evaluate the performance of semantic segmentation?

    Answer

    • Pixel accuracy, and above all Intersection over Union (IoU) per class, usually averaged over the classes (mean IoU)

  43. What approaches exist for Semantic Segmentation?

    Answer

    • Sliding Windows (Inefficient and unusable in practice)
    • Fully Convolutional Network for Segmentation (still inefficient, because no dimensionality reduction is applied) //Recall: when a CNN gets deeper we want more features, but of lower dimension.
    • Image Segmentation Network

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2036.png

      Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  44. How can we incorporate dimensionality reduction to a CNN for Semantic Segmentation but still have the output of the original size?

    Answer

    • Design the network with downsampling and upsampling (same principle as an autoencoder)

  45. How can downsampling be done?

    Answer

    • Pooling or Strided Convolutions

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2037.png

      Source: Karlsruhe University of Applied Science - Yurii Tolochko, 2019

  46. How can upsampling be done?

    Answer

    • Unpooling (parameter-free), e.g. "max-unpooling"
    • Transpose Convolution (trainable upsampling)

  47. What types of Unpooling do you know?

    Answer

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2038.png

      Source: Fei-Fei Li & Justin Johnson & Serena Yeung

    • Bed of Nails (Fill rest of the squares with zeroes)

    • Nearest Neighbour Unpooling, a kind of inversion of the average pooling operation (fill the rest of the squares with the same number)

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2039.png

      Source: Fei-Fei Li & Justin Johnson & Serena Yeung

    • Max Unpooling as Max Pooling Inverse

      • We remember which element was the max value and recover this in the new output

        data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2040.png

        data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2041.png

        Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  48. Why is the value of the unpooled output slightly different than it was before?

    Answer

    • For this a transpose convolution is used and the scalar of the reduced output is multiplied by the filter. Important: The values in the filter are trained by the network and not predefined.

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2042.png

      Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  49. How does the checkerboard artifact arise?

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2043.png

    Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  50. Convolution and its transpose side-by-side

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2044.png

    Source: https://github.com/vdumoulin/conv_arithmetic

  51. What are other names for transpose convolution?

    Answer

    • Upconvolution
    • Backward strided convolution
    • Fractionally strided convolution
    • Deconvolution - this one is particularly dangerous because deconvolution is a well-defined mathematical operation which is not the same as transpose convolution.

  52. What is the architecture idea for Instance segmentation?

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2047.png

    Source: https://saytosid.github.io/images/segnet/Complete architecture.png

  53. What is the motivation for Autoencoders?

    Answer

    We want them to learn efficient and pertinent encodings of the data.

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2048.png

    Source: https://cdn-images-1.medium.com/max/1600/1*[email protected]

  54. Give a definition of Autoencoders

    Answer

    • An autoencoder is a type of encoder-decoder network where the target space is the same as the input space.

  55. Provide Characteristics of Autoencoders
    Answer
    • Data does not have to be labelled
    • Latent feature space is usually of lower dimension than the input feature space
    • L2 as loss function
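
    A minimal PyTorch sketch reflecting these characteristics (layer sizes are assumptions): unlabelled data, a latent space of lower dimension than the input, and an L2 (MSE) reconstruction loss against the input itself.

    ```python
    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, in_dim=784, latent_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                         nn.Linear(128, latent_dim))
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                         nn.Linear(128, in_dim))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = Autoencoder()
    x = torch.rand(16, 784)                 # a dummy unlabelled batch
    loss = nn.MSELoss()(model(x), x)        # L2 loss: the target is the input itself
    loss.backward()
    ```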
  56. Where are autoencoders used?

    Answer

    • For example for weight initialization (see question 57), anomaly detection (question 58), and the applications listed in question 59.

  57. How could we initialize our neural network quite well?

    Answer

    • By passing the features of the autoencoder to the weight initialization

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2052.png

      Source: Karlsruhe University of Applied Science - Yurii Tolochko, 2019

  58. How can Autoencoders be used for Anomaly detection?

    Answer

    • Main idea: let’s say we have a well-trained autoencoder (which means it can reconstruct the data it was trained on without too much error):
      • If it works well on the new input, i.e. reconstruction error is low, we can assume that the input is of the normal class.
      • If the reconstruction error is high, we should think that the input is of an ‘unseen’ class, i.e. it is an anomaly.
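
      A small sketch of that decision rule (`reconstruct` and `threshold` are assumptions; the threshold could e.g. be a high percentile of the reconstruction errors observed on normal training data):

      ```python
      import numpy as np

      def is_anomaly(x, reconstruct, threshold):
          """Flag inputs whose per-sample reconstruction error exceeds the
          threshold - high error suggests an 'unseen' class, i.e. an anomaly."""
          err = np.mean((reconstruct(x) - x) ** 2, axis=-1)
          return err > threshold
      ```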

  59. Potential applications for Autoencoders
    Answer
    • Detect if input data is of the class that our model was trained on or not.
    • Detect outliers.
    • Fraud analytics.
    • Monitoring sensor data
  60. Give an idea of Generative Models

    Answer

    • It’s not about predicting and classifying things, it’s about learning some underlying hidden structure of the data at hand.

      We want to train a model distribution p_Ξ, parameterized by our choice, that would fit or resemble p_data.

      If we can successfully obtain this trained model p_Ξ, it in fact means that we know the underlying structure of the data and we can do a lot of interesting things with it.

      → We start from an empirical data distribution

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2053.png

      Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  61. Which famous models belong to generative models?

    Answer

    • Variational Autoencoder
    • Markov Chain (e.g. Boltzmann Machine)
    • GANs

  62. What are the flavours of Generative Models?

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2054.png

    Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  63. On which flavour do VAEs and GANs focus?

    Answer

    • Variational Autoencoder → Explicit
    • GANs → Implicit

  64. What are Variational Autoencoders (VAE)?

    Answer

    • Probabilistic extension of normal autoencoder
    • Also considered a latent variable model
    • Make it learn latent parameters that describe the probability distribution of the data
      • We start with a prior p_theta(z), typically a standard normal distribution

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2055.png

    Source: http://kvfrans.com/content/images/2016/08/vae.jpg

  65. What is the loss function of VAE?

    Answer

    • KL (Kullback Leibler) divergence between the learned latent distribution and its prior distribution. This forces the network to learn latent features that follow the prior distribution
    • Reconstruction loss - just as in previous autoencoders, it forces the decoder to match the input.

      "Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution is different from a second, reference probability distribution"

  66. What is the key difference between Autoencoder and Variational Autoencoder?

    Answer

    • Autoencoders learn a “compressed representation” of the input (could be an image, a text sequence, etc.) automatically by first compressing the input (encoder) and decompressing it back (decoder) to match the original input. The learning is aided by a distance function that quantifies the information loss that occurs from the lossy compression. So learning in an autoencoder is a form of unsupervised learning (or self-supervised, as some refer to it) - there is no labeled data.
    • Instead of just learning a function representing the data ( a compressed representation) like autoencoders, variational autoencoders learn the parameters of a probability distribution representing the data. Since it learns to model the data, we can sample from the distribution and generate new input data samples. So it is a generative model like, for instance, GANs.

  67. Do GANs use a probability distribution?

    Answer

    • No - GANs model the data distribution only implicitly (see the flavours above); they do not define an explicit density.

  68. What is the idea of GANs?
    Answer
    • Train a neural network that produces images, typically starting from random noise as input. This part is called the Generator Network. The goal of the Generator Network is to improve its ability to produce images so that the discriminator fails.
    • The second actor is the Discriminator Network. Its task is to differentiate between the fake images from the generator and real images from the training set.
  69. What does the loss function look like?

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2056.png

    Source: Karlsruhe University of Applied Science - Yurii Tolochko, 2019

  70. What is a drawback of the loss function and how to solve it?

    Answer

    When actually training the generator, it was found that the objective function does not work very well because of a gradient plateau (flat gradients mean very slow training, if any).

    Solve it by:

    • For this reason the generator objective function is “flipped”.
    • An intuitive interpretation is that now instead of minimizing the probability of the discriminator being correct, we maximize the probability of it being wrong.
    • What do vanilla GANs use?
      • Only fully connected layers, no CNNs
    • When GANs use CNNs, the discriminator uses a normal CNN. What about the generator?
      • It uses an upsampling CNN with transpose convolution operations (recall the image segmentation problems)
    • How do deepfake videos work?
      • They rely on GANs for data generation.
      • They combine existing videos with source videos to create new, almost indistinguishable videos of events that are actually completely artificial.
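
    A sketch of both objectives using binary cross-entropy (assuming the discriminator outputs probabilities, i.e. ends in a sigmoid); `generator_loss` is the "flipped", non-saturating variant described in this answer:

    ```python
    import torch
    import torch.nn.functional as F

    def discriminator_loss(d_real, d_fake):
        """Push D(real) towards 1 and D(fake) towards 0."""
        return (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
                F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

    def generator_loss(d_fake):
        """Instead of minimizing log(1 - D(G(z))) (flat gradients early on),
        maximize log(D(G(z))): maximize the probability of D being wrong."""
        return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    ```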

NLP

  1. What is the idea of weight initialization?
    Answer
    • Using better weight initialization methods leads to a better gradient transport through the network and a faster convergence
  2. What are different approaches for weight initialization?

    Answer

  3. Why is it important to keep the variance of gradient and activation values of each layer constant?

    Answer

    • Otherwise it would lead to vanishing or exploding gradients, which leads to problems in training.

  4. For linear functions it is easier to have constant variances

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2060.png

    Source: Karlsruhe University of Applied Science - Yurii Tolochko, 2019

  5. Constant variances for NON linear, like relu looks like:

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2061.png

    Source: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. He et. al.

  6. What is Batch Normalization and why should it be applied?

    Answer

    Why: Covariate Shift ends in bad performance

    But the problem appears in the intermediate layers because the distribution of the activations is constantly changing during training. This slows down the training process because each layer must learn to adapt themselves to a new distribution in every training step. This problem is known as internal covariate shift.

    Batch normalization: theory and how to use it with Tensorflow

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2062.png

    That simply means that the datapoints can vary extremely, which forces the intermediate layers to readjust.

    Where and on what do we do Batch Norm?

    • Usually for the intermediate layers, but it can also be applied to the input layer (taking the raw values). It is not applied to the input values themselves but to the result of the x vector times the weight vector w, which is z. z is passed to the activation function, and that's why it has to be normalized beforehand. (Sometimes the result of the activation is normalized instead, but this doesn't make sense to me.)

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2063.png

    Source: Andrew Ng, Deep Learning, Coursera

  7. What is the idea of Batch Normalization?

    Answer

    • BN prevents a neural net from exploding or vanishing gradients and reduces the training time caused by internal covariate shift. It forces the activations to have mean 0 and unit variance by standardizing them.

  8. What are the steps for Batch normalization?

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2064.png
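
    A minimal NumPy sketch of these steps for one mini-batch (eps is the usual numerical-stability constant; gamma and beta are the learnable scale and shift):

    ```python
    import numpy as np

    def batch_norm_forward(z, gamma, beta, eps=1e-5):
        """Standardize the pre-activations z over the batch dimension,
        then rescale and shift with the learnable gamma and beta."""
        mu = z.mean(axis=0)                      # per-feature batch mean
        var = z.var(axis=0)                      # per-feature batch variance
        z_hat = (z - mu) / np.sqrt(var + eps)    # mean 0, unit variance
        return gamma * z_hat + beta
    ```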

  9. Should Batch Normalization always be applied?

    Answer

    No, with the following explanation:

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2065.png

  10. What is the idea of Dropout?

    Answer

    The idea is to prevent an NN from overfitting by randomly removing a specific percentage of neurons from a layer during training. Commonly used rates are between 0.25 and 0.5.

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2066.png

    Source: https://static.commonlounge.com/fp/original/aOLPWvdc8ukd8GTFUhff2RtcA1520492906_kc
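
    A minimal NumPy sketch of inverted dropout, the variant commonly used in practice (the rate is illustrative); rescaling by 1/(1-rate) keeps the expected activation unchanged:

    ```python
    import numpy as np

    def dropout_forward(a, rate=0.5, training=True):
        """Randomly zero a fraction `rate` of the activations during training
        and rescale the survivors; do nothing at test time."""
        if not training:
            return a
        mask = (np.random.rand(*a.shape) >= rate) / (1.0 - rate)
        return a * mask
    ```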

  11. Explain Cosine Similarity

    Answer

    • Cosine similarity measures the cosine of the angle between two vectors: sim(a, b) = (a · b) / (‖a‖ · ‖b‖). A value of 1 means same direction, 0 means orthogonal. Because it ignores vector magnitude, it is a popular similarity measure for word and document embeddings.
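
    A one-line NumPy sketch:

    ```python
    import numpy as np

    def cosine_similarity(a, b):
        """cos(a, b) = (a . b) / (|a| * |b|); 1 = same direction, 0 = orthogonal."""
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    ```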

  12. Explain the BLEU score and when to use it

    Answer

    BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is"

    Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. Intelligibility or grammatical correctness are not taken into account

    BLEU's output is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts. Few human translations will attain a score of 1, since this would indicate that the candidate is identical to one of the reference translations. For this reason, it is not necessary to attain a score of 1. Because there are more opportunities to match, adding additional reference translations will increase the BLEU score

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2069.png

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2070.png

    Source: https://cloud.google.com/translate/automl/docs/evaluate

    1. Compute the precision for each n-gram (taking the minimum of the reference and candidate counts)
    2. Multiply over all n-gram precision values
    3. Apply a brevity penalty for short sentences
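
    A usage sketch with NLTK's implementation (assuming the nltk package is installed; bigram-only weights are chosen so the toy example does not collapse to 0 for missing higher-order n-grams):

    ```python
    # pip install nltk
    from nltk.translate.bleu_score import sentence_bleu

    reference = [["the", "cat", "is", "on", "the", "mat"]]
    candidate = ["the", "cat", "sat", "on", "the", "mat"]

    # weights=(0.5, 0.5) -> use only 1-gram and 2-gram precision
    score = sentence_bleu(reference, candidate, weights=(0.5, 0.5))
    print(score)  # a value between 0 and 1
    ```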

  13. What are the disadvantages of the Bleu Scores?

    Answer

    • No distinction between content and functional words (ESA - NASA).
    • Weak in capturing the meaning and grammar of a sentence
    • Penalty for short sentences can have strong impact

  14. Why do we need RNNs and not normal Neural Nets?

    Answer

    • Inputs and outputs can have different lengths in different examples.
    • A standard network doesn't share features learned across different positions of the text.

  15. What's the architecture of a Vanilla RNN?

    Answer

    • a is usually referred to as the hidden state, whereas I just see it as the activation from the neuron.
    • x<t> is the word of the sentence at position t (same as at time t)
    • As activation function, mostly tanh/ReLU is used, but only for the hidden state (a)
    • For y_hat we usually use a sigmoid function
    • a<0> (hidden state) is initialized with zero.
    • g is the function (it's actually just the layer): it has a weight matrix for the incoming a, which is multiplied with the activation value, and a weight matrix for the new word x, which is multiplied with the new word, plus a bias; this results in a<t>. g also has a weight matrix for calculating y: it multiplies that matrix with the just-calculated a<t>, plus a bias. The result is the prediction y_hat<t> at position t.

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2071.png

      Source: Andrew Ng, Deep Learning, Coursera

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2072.png

  16. What is the loss of the RNN?

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2073.png

    Source: Fei-Fei Li & Justin Johnson & Serena Yeung

  17. What are problems of Vanilla RNNs?

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2074.png

    • Not good at learning long-term dependencies
    • What are LSTMs and why are they needed?
      • Vanilla RNNs are not good at learning long-term dependencies. LSTMs can be seen as a fancier family of RNNs.
    • Of which components is an LSTM composed?
      • Cell State: C(t), the internal cell state (memory)
      • Hidden State: the external hidden state used to calculate the predictions
      • Input Gate: determines how much of the current input is read into the cell state
      • Forget Gate: determines how much of the previous cell state is kept in the current cell state
      • Output Gate: determines how much of the cell state is output into the hidden state

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2075.png

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2076.png

      Source: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

  18. What is the sigmoid function for?

    Answer

    • The output is a value between 0 and 1. If the network thinks something should be forgotten, the output tends towards 0; if it wants to keep or update something, it tends towards 1. This is because the output is mostly used for multiplication.

  19. What is the tanh function for?
    Answer
    • The tanh function squishes values to always be between -1 and 1. This helps prevent the vector values from growing too large.
  20. What is the cell state for?
    Answer
    • The memory of the network. As the cell state goes on its journey, information gets added to or removed from the cell state via gates.
  21. What does the forget gate do?

    Answer

    • It decides (via a sigmoid) how much of the previous cell state is kept: values near 0 mean forget, values near 1 mean keep.

  22. What does the input gate do?

    Answer

    • It decides (via a sigmoid) how much of the current input (the candidate values) is written into the cell state.

  23. How is the cell state computed?

    Answer

    • Input: previous cell state + forget gate output + input gate output
    • Multiply prev cell state pointwise with forget gate output
    • Add input gate output to the result
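
    A minimal NumPy sketch of one LSTM step (biases omitted; weight names are assumptions). Note the cell state line: pointwise multiply the previous cell state with the forget gate, then add the input gate's contribution - exactly the computation described above.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(c_prev, h_prev, x, W_f, W_i, W_o, W_c):
        z = np.concatenate([h_prev, x])
        f = sigmoid(W_f @ z)            # forget gate: keep how much of c_prev?
        i = sigmoid(W_i @ z)            # input gate: read how much of the input?
        o = sigmoid(W_o @ z)            # output gate: expose how much of c?
        c_tilde = np.tanh(W_c @ z)      # candidate cell state
        c = f * c_prev + i * c_tilde    # new cell state (memory)
        h = o * np.tanh(c)              # new hidden state for predictions
        return c, h
    ```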

  24. What does the output gate do?

    Answer

    • It decides how much of the (tanh-squashed) cell state is output into the new hidden state, which is used for the predictions.

  25. How does GRU work?

    Answer

    • A newer generation of RNNs
    • Got rid of the cell state
    • Uses the hidden state to transport information
    • A little speedier to train than LSTMs
    • Has a reset gate and an update gate

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2080.png

      Source: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

    • Let the network learn when to open and close the gates, i.e. update the “highway” with new information.

    • Therefore, instead of updating the hidden state at every RNN cell, the network can learn itself when to update the hidden state with new information.

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2081.png

    Source: https://wagenaartje.github.io/neataptic/docs/builtins/gru/

    1. Reset Gate:
      • Input: Weight Matrix W, prev hidden state, and word (x)
      • It first computes the reset vector, which indicates which values of the input are important and which aren't (recall: sigmoid value → 0)
      • The product of the previous hidden state and the reset vector is passed through the tanh function. This deactivates some previous hidden state values.
    2. Update Gate:

      • Input: Weight, prev hidden state, x

        data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2082.png

    3. Compute Output

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2083.png

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2084.png

    I like this illustration more:

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2085.png

    Quote: "Similar to normal RNNs the input gets multiplied by the weight matrix and is added to a hidden layer. However here the input is added to h˜ . The r is a reset switch which represents how much of the previous hidden state to use for the current prediction. Coincidentally there is also a neural network for this reset switch which learns how much of the previous state to allow for predicting the next state. z represents whether to only use the existing hidden state h or use its sum with h~(new hidden state) to predict the output character."

    Source: https://medium.com/datadriveninvestor/lstm-vs-gru-understanding-the-2-major-neural-networks-ruling-character-wise-text-prediction-5a89645028f0

  26. What is the idea of Attention models?

    Answer

    • The main idea of the attention mechanism is to imitate the human behaviour of text processing by "reading" a chunk of the input sentence and processing it, instead of the whole sentence.
    • The more words the encoder has processed, the less information about the individual words is contained in the vector. Attention models bypass this by saving each output vector h_t of a state. In contrast to vanilla Seq2Seq models, another processing layer between encoder and decoder is added which calculates a score for each h_t. The scores indicate how much attention the decoder should pay to a specific h_t.
    • → Goal: increase performance for long sentences

  27. Describe the algorithm of Attention Models

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2086.png

  28. Shorten the algorithm in your own words

    Answer

    1. Choose the window of words whose attention you want to incorporate
    2. Save the hidden states of them and compute a score for these hidden states
    3. Apply softmax on the scores to get a probability distribution, resulting in attention weights
    4. Compute the context vector by attention weights times hidden states
    5. Take context vector and prev hidden state to compute the next word
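
    A minimal NumPy sketch of steps 2-4 with simple dot-product scores (the scoring function is an assumption; Bahdanau/Luong-style attention computes scores with small learned networks instead):

    ```python
    import numpy as np

    def attention_context(decoder_state, encoder_states):
        """Score every saved encoder hidden state h_t, softmax the scores
        into attention weights, and return the weighted sum (context vector)."""
        scores = encoder_states @ decoder_state       # one score per h_t
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                      # softmax -> distribution
        return weights @ encoder_states               # context vector
    ```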

  29. What is the advantage we have with ConvNets for NLP?

    Answer

  30. How does CNN for NLP work?

    Answer

    • Use a filter spanning multiple words (similar approach to n-grams)
    • Vanilla CNNs can be used in One-to-One, Many-to-One or Many-to-Many architectures.
    • Used in text classification, sentiment analysis and topic categorization
    • Problem: loses the relational information between words, caused by the max pooling layers.

  31. How can CNNs be combined with Seq2Seq Models?

    Answer

    • Each layer has two Matrices
    • W - Kernel for convolving input
    • V - used for GRU computation
    • GRU calculation decides based on the sigmoid outcome how much of the conv output is propagated to the next level.

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2088.png

  32. Describe the basic steps of Speech Recognition

    Answer

    • Transform the sound wave into a spectrogram using the Fourier transform
    • The spectrogram is split into multiple parts, e.g. 20 ms blocks, resulting in 50 blocks per second. A second usually contains between 2 and 3 words.
    • Problem: get rid of repeated characters and separate them into words (CTC = Connectionist Temporal Classification)

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2089.png

      Source: Hannun, "Sequence Modeling with CTC", Distill, 2017.

  33. How can the problem of removing repeated characters be solved?

    Answer

    • Apply Beam Search and take the most likely output

  34. Explain a simple approach for Speech to Text

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2090.png

  35. Explain Backpropagation through time (BPTT)

    Answer

    1. Gather the losses of all outputs at every timestep and sum them up (per state):

      Loss = Σ_t L(y_t, ŷ_t)

    2. Apply "normal" backpropagation to each state, keeping the partial derivatives in mind

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2091.png

      Source: http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/

  36. What is Truncated Backpropagation through Time?

    Answer

    • The longer the sequence, the harder it is to calculate the gradients → solution: truncated backprop
    • BPTT is applied periodically on a fixed number of timesteps

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2092.png

      Source: https://srdas.github.io/DLBook/RNNs.html

  37. What is a Language Model?

    Answer

    • It assigns a probability to a sequence of words/characters: P(Y) = p(y1, y2, y3, ..., yn)
    • This effectively estimates the likelihood of different phrases (in a language).

    → Simpler models may look at a context of a short sequence of words, whereas larger models may work at the level of sentences or paragraphs.

    Useful for:

    • Speech Recognition
    • Machine translation
    • Part-Of-Speech (POS) tagging
    • Optical Character Recognition (OCR)

  38. Explain the idea of N-Grams

    Answer

    • Combination of n consecutive words into "one" unit
    • 1-gram = Unigram
    • 2-gram = Bigram
    • Can be used to calculate the probability of a given word being followed by a particular one.
    • P(x1, x2, ..., xn) = P(x1)P(x2|x1)...P(xn|x1,...xn-1)
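
    A tiny sketch of n-gram extraction:

    ```python
    def ngrams(tokens, n):
        """All groups of n consecutive words."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(ngrams("the cat sits on the mat".split(), 2))
    # [('the', 'cat'), ('cat', 'sits'), ('sits', 'on'), ('on', 'the'), ('the', 'mat')]
    ```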

  39. Define One-Hot Encoding

    Answer

  40. What are pros and cons of one hot encoding for text data?

    Answer

    Pros:

    • Can be used on different stages for example words or letters
    • Can represent every text with a single vector

    Cons: There are a few problems with the one-hot approach for encoding:

    • The number of dimensions (columns in this case), increases linearly as we add words to the vocabulary. For a vocabulary of 50,000 words, each word is represented with 49,999 zeros, and a single “one” value in the correct location. As such, memory use is prohibitively large.
    • The embedding matrix is very sparse, mainly made up of zeros.
    • There is no shared information between words and no commonalities between similar words. All words are the same “distance” apart in the n-dimensional embedding space.
    • Example of one hot encoding

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2094.png

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2095.png

    Source: https://programmersought.com/article/64902775751/

  41. What is the idea of word embeddings?

    Answer

  42. What are word embeddings used for?

    Answer

    • Semantic Analysis
    • Named Entity Recognition
    • Weight initialization

  43. What is a problem with word embeddings?
    Answer
    • Need huge corpora to train, that's why mostly pre-trained embeddings are used
  44. What is the basic approach of word embeddings?
    Answer
    • Words that appear in similar contexts (close to each other) are likely to be semantically related.
    • If we want to “understand” the meaning of a word, we take information about the words it coappears with.
  45. How does the training process look like?

    Answer

    We want to have p(context|wt), where wt is the word we are looking at and context the surrounding words.

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2097.png

    1. Used loss function: -log[p(context|w_t)] // somewhat like a cross-entropy loss
    2. Optimize the NN based on this loss function
    3. If we can predict the context, we have a good embedding
    4. Delete the classifier, keep the embeddings

    Source Illustration: https://laptrinhx.com/get-busy-with-word-embeddings-an-introduction-3637999906/

  46. What was an early attempt at word embeddings?

    Answer

    • Neural probabilistic Language Models
    • → Generate joint probability functions of word sequences

  47. What were shortcomings of Neural Probabilistic Language Models?
    Answer
    • They only looked at the words before the predicted one, but not at the words after it
  48. How does Skip-Gram work?

    Answer

    blue is the given word, used to predict the surrounding words → + + + C + + +

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2098.png

    Source: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

  49. How does word2vec work?

    Answer

    The idea relies on estimating word co-occurrences.

    For Skip-gram:

    1. Select Windows size
    2. Select a word wt
    3. For a selected word wt in our training corpus, look at a randomly selected word within m positions of wt.
    4. train to predict correct context words
    5. Predict by softmax classifier

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%2099.png

  50. What is a problem with word2vec?

    Answer

    • At each window the normalization factor over the whole vocabulary has to be trained (e.g. 10,000 dot products) → computationally too expensive

  51. How can the problems of word2vec be solved?
    Answer
    • Applying negative sampling
  52. What is negative sampling?

    Answer

    Main idea: we want to avoid having to train the full softmax at each prediction. “A good model should be able to differentiate data from noise by means of logistic regression”.

    • Train two binary logistic regressions
      1. One for the true pair (center word, output)
      2. One for random pairs (center word, random word)

    Example:

    • "I want a glass of orange juice to go along with my cereal." For every positive example, we have k-1 negative examples. Here k = 4.

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20100.png
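
      A usage sketch with gensim (assuming gensim >= 4.0, where the size parameter is called `vector_size`); sg=1 selects skip-gram and `negative` sets the number of negative samples per positive pair:

      ```python
      # pip install gensim
      from gensim.models import Word2Vec

      sentences = [["i", "want", "a", "glass", "of", "orange", "juice"],
                   ["i", "like", "apple", "juice"]]
      model = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=4,
                       min_count=1)
      print(model.wv.most_similar("juice"))   # nearest words in embedding space
      ```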

  53. How does CBOW work?

    Answer

    • Target is a single word → C C C + C C C
    • Context is now all words within the context window of size m.
    • Average embeddings of all context words

  54. Differentiate CBOW from Skip-gram

    Answer

    According to the authors, CBOW is faster, but skip-gram does a better job for infrequent words.

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20101.png

    Source: https://www.knime.com/blog/word-embedding-word2vec-explained

  55. What does GloVe do?

    Answer

    • Calculate co-occurrence directly for the whole corpus
    • We go through the entire corpus and update the co-occurrence counts.

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20102.png

    • Go through all pairs of words in the co-occurrence matrix.
    • Minimize distance between dot product of 2 word embeddings
    • Function f(Pij) - allows one to weight the co-occurrences. E.g. give a lower weight to frequent co-occurrences.
    • Trains much more quickly and with small corpora
    • How can word embeddings be evaluated?
    • Intrinsic Evaluation

      1. Evaluate on a specifically created subtask

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20103.png

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20104.png

      Source Illustration: Pennington, Socher, Manning: https://nlp.stanford.edu/projects/glove/

      2. Another intrinsic evaluation approach is comparing with human judgments.
    • Extrinsic Evaluation

      • How good are our embeddings for solving actual tasks?
      • Example of such task is named entity recognition.
    • Information on dependence on Hyperparameters

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20105.png

    Source Slides: Socher, Manning

  56. What are advantages of Character Level Models?

    Answer

    • Vocabulary is just of the size of the alphabet + punctuation
    • No need for tokenization
    • No unknown word (out of vocabulary)
    • Can learn structured data that is encoded with characters in plain text (for example gene information)

  57. What are disadvantages of Character Level Models?
    Answer
    • Require deeper RNNs, because there are more characters in a sentence than words
    • Generally worse at capturing long-distance dependencies
    • In order to model long-term dependencies, hidden layers need to be larger
  58. What are different types of NNs?

    Answer

    One to One: Image Classification

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20106.png

    Illustration: Andrey Karpathy. The Unreasonable Effectiveness of Recurrent Neural Networks

    One to Many: Generating caption for an image

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20107.png

    Source: https://medium.com/swlh/automatic-image-captioning-using-deep-learning-5e899c127387

    Many to One: Sentiment Classification

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20108.png


    Many to Many (Sequence2Sequence): Machine Translation where length of input and output can be different or Entity Recognition

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20109.png

    Source: https://confusedcoders.com/data-science/deep-learning/how-to-build-deep-neural-network-for-custom-ner-with-keras

  59. What does the Sequence2Sequence architecture consist of?

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20110.png

    Source: https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20111.png

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20112.png

  60. How can the process of predicting in the decoder part be improved?

    Answer

    • Further improvements: remember, we are creating our output word by word based on a probabilistic model. Predicting the output vector using a greedy approach may not be the best solution.

  61. Describe Beam Search

    Answer

    Beam search is an approximate search algorithm, so there is no guarantee of finding the global maximum; it is a heuristic and does not guarantee the optimal solution.

    Instead of selecting only the single best word, we take the b best matching words and calculate the probabilities for all the next words. Out of these, we again choose the top b matching samples and repeat. If we choose b=1, beam search reduces to greedy search.

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20113.png

    Source: https://d2l.ai/chapter_recurrent-modern/beam-search.html

  62. What is a pain point of beam search?

    Answer

    The probability of a sentence is defined as product of probabilities of words conditioned on previous words. Due to compounding probabilities, longer sentences have a lower probability by definition. Shorter sentences can have higher probability even if they make less sense.

  63. How can the pain points of beam search be solved?

    Answer

    We need to normalize sentence probabilities by length: multiply the probability of a sequence by its inverse length.

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20114.png

  64. Describe the light normalization implementation

    Answer

    1. Run the search and, for every sequence length (e.g. defined max sequence length = 5 → 1, 2, 3, 4, 5), select and save the top 3 most likely sequences
    2. This results in sets of size 3 containing

      • 1-sequences (single words, e.g.: ("Oh", "Donald", "Trump"), ("Go", "Barack", "Obama"), (...))
      • 2-sequences (two-word sequences)
      • up to N (here 5) sequences
    3. For every sequence length, run a normalized probability score calculation and assign the score to every sequence.

    4. Select the sequence with the highest probability of all
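
    A sketch of beam search with length normalization (`step_fn` is an assumed model hook returning (token, log-probability) pairs for the possible continuations of a sequence):

    ```python
    def beam_search(step_fn, start_token, beam_width=3, max_len=5):
        """Keep the `beam_width` best partial sequences per step; score the
        finished sequences by their length-normalized log-probability."""
        beams = [([start_token], 0.0)]
        for _ in range(max_len):
            candidates = []
            for seq, logp in beams:
                for token, token_logp in step_fn(seq):
                    candidates.append((seq + [token], logp + token_logp))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = candidates[:beam_width]          # prune to the top b
        # divide by length so longer sentences are not unfairly penalized
        return max(beams, key=lambda c: c[1] / len(c[0]))
    ```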

  65. Which two errors can arise with Beam Search?

    Answer

    • Model errors from our RNN
    • Search errors from our beam search hyperparameter

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20115.png

Reinforcement Learning

  1. Describe the idea of RL
    Answer
    • The idea of RL is that an algorithm learns a task by itself without precise instructions from a human. A developer defines the goal and provides a reward. In addition, it must be defined which actions the model is allowed to perform. The model explores several strategies over a period of time and acquires the best one. This is also known as trial-and-error search.
  2. Describe the terminology of RL
    Answer
    • Environment: The abstract world that everything happens in.
    • Agent: The Algorithm which interacts with the environment
    • State: Part of the environment. Every interaction of the agent with the environment changes the state.
    • Reward: Every interaction of the agent with the environment produces a reward. The reward is a signal of how good or bad the interaction was in terms of some goal. At some point, the interaction is terminated. The final goal of the agent is to maximize the cumulative reward.
    • Policy: The rules by which the agent acts (π)
  3. When to use Reinforcement Learning
    Answer
    • If the problem is an inherently sequential decision-making problem i.e. a board game like chess or games in general.
    • If the optimal behaviour is not known in advance (unsolved games without "perfect strategy")
    • When we can simulate the system behaviour as well as possible to imitate the real world
    • Solving problems with a non-differentiable loss function
  4. RL is separated into Model-Free RL and Model-Based RL. Name subcategories and associated algorithms

    Answer

    Model-Free RL:

    • Policy Optimization (DDPG, Policy Gradient)
    • Q-Learning (DQN, DDPG)

    Model Based RL:

    • Learn the Model (World Models)
    • Given the Model (AlphaZero)

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20116.png

  5. Tasks of the Environment

    Answer

    • Interface: An Env need to provide an interface with which the agent could interact
    • Calculating Effects: For an action performed by the agent, the Env needs to calculate the:
      • next state
      • reward for this action

  6. Describe Model Free approaches
    Answer
    • The focus is on finding the value functions directly through interaction with the environment. Most model-free algorithms learn the value function directly from the environment (not all of them, strictly speaking: policy gradient approaches do not learn a value function).
  7. Define the Markov Decision Problem (MDP)

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20117.png

    • The MDP is inspired by the theorem "The future is independent of the past given the present" (Markov property). That means that to calculate values for the future state $S_{t+1}$, only the state $S_t$ counts; it should keep all the relevant information from already passed states.

  8. Describe the Markov Property

    Answer

    • A state satisfies the Markov property. The matrix P gives the probabilities of all possible next states conditioned on the current state. (Markov property: capture all relevant information from the past)

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20118.png

  9. What is the discounted reward?

    Answer

    • It is the discounted sum over all rewards that happen in the future. Intuitively speaking, it is a measurement of how efficiently the agent solved an episode. E.g.: an agent decided to walk loops in a GridWorld scenario, and after a few loops it changed its route towards the goal and reached it. The discounted reward is still positive, but it would have been higher on the direct path, because rewards further in the future are discounted the most. Hence, the final reward is always discounted the most.
    • Recall that the return is the total discounted reward in an episode:
      • G_t = R_{t+1} + ÎłR_{t+2} + ... + Îł^{T-1}R_T
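
    A small sketch of the return computation (done backwards, so each reward is discounted once per remaining step):

    ```python
    def discounted_return(rewards, gamma=0.9):
        """G_t = R_{t+1} + gamma * R_{t+2} + ... , computed back to front."""
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    print(discounted_return([0, 0, 1]))  # 0.81: reaching the goal late is worth less
    ```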

  10. What is a policy?

    Answer

    A policy is the agent’s behaviour. It is a map from state to action.

    • Deterministic policy: a=π(s).
    • Stochastic policy: π(a|s) =P[A=a|S=s].

  11. What is the State Value Function?

    Answer

    • Value function for a state is the reward under the assumption of taking the best action/policy from there. → Kind of the best Q Value
    • The Value function takes the reward of all possible next states into account (and the next rewards of those states, which is G) and calculates the value by multiplying the action probability (policy) by the reward for the particular state.

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20119.png

    However the description above can be written differently (much smarter so that we can apply dynamic programming on it)!

    Explanation of the Bellman equation:

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20120.png

    Discounted Reward: That's just the sum of all rewards (discounted with gamma, but this isn't important to understand the concept here).

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20121.png

    Defining G_t brings us to the Expectation_policy

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20122.png

    of

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20123.png

    The expectation is mathematically defined as the probability (here the policy π) times the value (here G_t), summed over all elements. The policy/probability is the probability of a specific action under state s.

    DP, MC and TD Learning methods are value-based methods (Learnt Value Function, Implicit policy).

  12. What is the Action Value Function?

    Answer

    • Equation almost the same as State Value Function

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20124.png

  13. Why do we need the Bellman equation and how can it be solved?

    Answer

    Can also be applied to q-values!

  14. How is optimal policy defined?

    Answer

  15. The optimal policy and the Bellman equation lead to the Bellman optimality equation

    Answer

    • By that we formulate the equation without referring to a specific policy (that's why a star is beneath v)
    • "Intuitively, this says that the value of a state under an optimal policy must equal the expected return for the best action from that state."
    • The next image shows this for q values. The order would be that we select the best action at the current state s, following the optimal policy which results in the best value function value.

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20127.png

      Source: https://www.analyticsvidhya.com/blog/2018/09/reinforcement-learning-model-based-planning-dynamic-programming/

  16. What is the Markov Reward Process (MRP)?

    Answer

    • MRP is a tuple of (S,P,R, gamma)

      • S is a state space
      • P is the state transition function (probability function)
      • R is the reward function with E(R_t+1|S), i.e. how much immediate reward we would get for a state S.
      • Everything results in G_t, which is the total discounted reward for a time step t. The goal is to maximize this function:

        data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20128.png

      • Gamma is the discount factor applied to future rewards: rewards further in the future should count less than closer ones. This also lets us avoid infinite returns in Markov processes.

        • Another important point is the value function V(S) which can be adapted to different use cases. The original MDP uses the Bellman Equation:

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20129.png

      For gamma = 1 we aim at long term rewards and for gamma = 0 we tend to short term profit.

      In the following example, a discount of 0.9 is used from the final state reward(Peach) to the start state of the agent (Mario).

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20130.png

      However, to solve the Bellman equation we need almost complete information about our environment: the Markov property must hold and we need v* for all other states. Since this rarely works, many RL methods instead approximate the Bellman function with the last observed state transition, rather than requiring complete knowledge of all transitions.

  17. What is the difference between State Value V(s) and Action Values Q(s,a)?

    Answer

    • Basically two different ways to solve reinforcement learning problems.
    • V_Pi(s) describes the expected value of following the policy Pi when the agent starts at state S; every action, including the first, is chosen by the policy.
    • Q_Pi (s,a) describes the expected value when an action a is taken from state S under policy Pi.

  18. What is the relationship between Q-Values (Q_Pi) and the Value function (V_Pi)?

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20131.png

  19. Describe the idea of Q-Learning

    Answer

    • The goal of Q-learning is to learn a policy, which tells an agent what action to take under what circumstances.
    • Approximation of the MDP via action values, without solving the Bellman equation. It works for MDPs with a finite and small enough state and action space.
    • How can a grid world problem be solved with Q-learning?

      1. Initialize a table for each state-action pair
      2. Fill in the table step by step while interacting with the environment
      3. Calculate the Q-Values

        data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20132.png

        Source: http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf

      4. Repeat step 2 + 3 for N epochs

      5. Check how well the learned table approximated the State Values (V(s)) for the Bellmann equation. This could look like the following:

        data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20133.png

        Source: https://towardsdatascience.com/qrash-course-deep-q-networks-from-the-ground-up-1bbda41d3677
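
        A minimal sketch of the tabular update behind steps 1-3 (grid size, learning rate and discount factor are assumptions):

        ```python
        import numpy as np

        n_states, n_actions = 16, 4                 # e.g. a 4x4 grid world
        alpha, gamma = 0.1, 0.9
        Q = np.zeros((n_states, n_actions))         # step 1: the state-action table

        def q_update(s, a, r, s_next):
            """Step 3: move Q(s,a) towards the bootstrapped target
            r + gamma * max_a' Q(s', a')."""
            target = r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
        ```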

  20. What is Policy Iteration?

    Answer

    Policy iteration includes: policy evaluation + policy improvement (both done iteratively)

    • Also uses the Bellman equation, but without the maximum.
    • For each state S we check whether the Q-value is larger than the state value: q(s,a) > v(s) under the same policy (testing different policies). If the Q-value is larger, that strategy performs better and we switch to it.

    Sometimes it only finds a local optimum

    1. Choose a policy randomly (e.g. go in each direction with prob = 0.25)
    2. Do policy evaluation: Calculate the Value for each state
    3. Do policy improvement: change an action a (resulting in a new policy pi') and see if the values are better. If so, switch to the new policy. (We could use the value function to do this successively for each state.)

    Source: https://www.analyticsvidhya.com/blog/2018/09/reinforcement-learning-model-based-planning-dynamic-programming/

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20134.png

  21. Example of Policy Iteration for Monte Carlo

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20135.png

    Source: https://www.programmersought.com/article/39135619445/

  22. What is Value iteration?

    Answer

    Value iteration includes: finding optimal value function + one policy extraction

    Usually finds the global optimum

    It is similar to policy iteration, although it differs in the following:

    1. Value Iteration tries to find the optimal value function first, rather than the optimal policy. Once the best value function has been found, the policy can be derived from it.
    2. It shares the policy improvement function/approach and a truncated policy evaluation

    Value iteration may need fewer iterations, but each iteration involves more work, because it uses the Bellman optimality (maximum) equation. Policy iteration typically needs more iterations, each of which is cheaper.

  23. Compare Policy Iteration with Value Iteration

    Answer

  24. What is a disadvantage of policy iteration?

    Answer

    • Can get stuck in a local optimum

  25. How can the disadvantages of policy iteration be compensated?

    Answer

    • Early stopping

  26. What is the idea of the Monte Carlo method?

    Answer

    In general: a statistical approach that estimates the value of a state by averaging the returns observed for it over episodes.

    1. Initialize rewards randomly
    2. Run through a specific number of episodes (e.g. 1000)
    3. ~~Calculate the discounted reward for each state~~
    4. Apply either:
      1. First visit method: Average returns only for first time s is visited in an episode.
      2. Every visit method: Average returns for every time s is visited in an episode.

    (Check again if example is correct (might be wrong))

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20136.png

    Source: https://www.analyticsvidhya.com/blog/2018/11/reinforcement-learning-introduction-monte-carlo-learning-openai-gym/
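
    A minimal sketch of first-visit Monte Carlo evaluation (the episode format - a list of (state, reward) pairs - is an assumption):

    ```python
    from collections import defaultdict

    def first_visit_mc(episodes, gamma=0.9):
        """value = mean return over the FIRST visit of each state per episode."""
        returns, V = defaultdict(list), {}
        for episode in episodes:
            g, g_by_state = 0.0, {}
            for s, r in reversed(episode):
                g = r + gamma * g
                g_by_state[s] = g        # ends up holding the first visit's return
            for s, g_first in g_by_state.items():
                returns[s].append(g_first)
        for s, gs in returns.items():
            V[s] = sum(gs) / len(gs)
        return V
    ```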

  27. Give a summary of Monte Carlo

    Answer

    • MC methods learn directly from episodes of experience.
    • MC is model-free: no knowledge of MDP transitions / rewards.
    • MC uses the simplest possible idea: value = mean return.
    • Episode must terminate before calculating return.
    • Average return is calculated instead of using true return G.
    • First Visit MC: The first time-step t that state s is visited in an episode.
    • Every Visit MC: Every time-step t that state s is visited in an episode.
    • https://github.com/omerbsezer/Reinforcement_learning_tutorial_with_demo/blob/master/README.md#MonteCarlo

  28. What is the idea of Temporal Difference Learning?

    Answer

    • Temporal Difference Learning (also TD learning) is a method of reinforcement learning. In reinforcement learning, an agent receives a reward after a series of actions and adjusts its strategy to maximize the reward. An agent with a TD-learning algorithm makes the adjustment not when it receives the reward, but after each action based on an estimated expected reward.

      Source: https://de.wikipedia.org/wiki/Temporal_Difference_Learning

    • Is an on-policy learning approach

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20137.png

      Source: https://github.com/omerbsezer/Reinforcement_learning_tutorial_with_demo

    • Compared to DP and MC, which are both offline because they have to finish the epoch first, TD is a mix of the two: Monte Carlo has to run a whole branch of states, and DP even has to run all of them.

    • Summary of Temporal Difference Learning:
      • TD methods learn directly from episodes of experience.
      • TD updates a guess towards a guess.
      • TD learns from incomplete episodes, by bootstrapping.
      • TD uses bootstrapping like DP and learns from experience like MC (it combines MC and DP).
    • MC-TD differences:
      • MC and TD both learn from experience.
      • TD can learn before knowing the final outcome.
      • TD can learn online after every step. MC must wait until the end of the episode before the return is known.
      • TD can learn without the final outcome.
      • TD can learn from incomplete sequences. MC can only learn from complete sequences.
      • TD works in continuing environments. MC only works for episodic environments.
      • MC has high variance, zero bias. TD has low variance, some bias.
    • What is the idea of Dynamic Programming?
      • DP is a general framework concept. It is derived from the divide-and-conquer principle: DP tries to break a complex problem down into sub-problems that are easier to solve. In the end, the solutions of the subproblems are merged to solve the main problem.

        In order to apply it, the following properties must be given:

        • The optimal solution can be divided into subproblems
        • Subproblems occur repeatedly
        • Subproblem solutions can be cached
    • Are the dynamic programming requirements met for an MDP?
      • The Bellman equation (which is a DP approach) gives us a recursive decomposition
      • The value function stores the action values (serves as a cache) and can thus be used for DP
    • What is meant by full backups vs sample backups and shallow vs deep?
      • This refers to where the update information comes from. A full backup uses many paths and states, whereas a sample backup uses a single branch of the tree diagram.
      • Shallow means that the method does not go very deep into the tree before making an update (fast), whereas deep waits for a complete episode.
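
    A minimal sketch of the TD(0) value update, which adjusts the estimate after every single step towards a bootstrapped target ("a guess towards a guess"); the learning rate and discount are illustrative:

    ```python
    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
        """Move V(s) towards the bootstrapped target r + gamma * V(s')
        immediately after each step - no need to wait for the episode to end."""
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

    V = {"A": 0.0, "B": 0.0}
    td0_update(V, "A", r=1.0, s_next="B")
    print(V["A"])  # 0.1
    ```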

  29. What is the difference between on-policy and off-policy?

    Answer

    • Online: learning during the task, e.g. the TD algorithm
    • On-policy learning (e.g. Monte Carlo): improve the same policy that is used to make decisions (generate experience).

    • Off-policy learning (e.g. Q-learning, a TD algorithm): improve a different policy than the one used to generate the experience. Here, we improve the target policy while following the behaviour policy.

      • Target policy - what we want to learn. Therefore, we evaluate and improve upon.
      • Behaviour policy - what we use to generate data.

      Motivation for having 2 policies is the trade-off between exploration and exploitation. In order to know what actions are best, we need to explore as much as possible - i.e. take suboptimal actions as well.

      Target policy π(St) is greedy w.r.t. Q(s,a), because we want our target policy to provide the best action available.

      Behavior policy ÎŒ(St) is often 𝜀-greedy w.r.t. Q(s,a).

      • 𝜀-greedy policy is an extension of greedy policy
        • With probability 𝜀 it will output a random action.
        • Otherwise, it outputs the best action.
        • 𝜀 controls how much exploration we want to allow.

      This combination of two policies lets us combine “the best of both worlds”: our behaviour policy gives enough randomness to explore all possible state-action pairs, while our target policy still learns to output the best action from each state.
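
      A minimal sketch of an 𝜀-greedy behaviour policy over one row of a Q-table:

      ```python
      import numpy as np

      def epsilon_greedy(q_row, epsilon=0.1):
          """With probability epsilon explore (random action),
          otherwise exploit (best known action)."""
          if np.random.rand() < epsilon:
              return np.random.randint(len(q_row))
          return int(np.argmax(q_row))
      ```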

  30. What is the difference between Episodic and Continuous Tasks?

    Answer

    • Episodic task: a task which lasts a finite amount of time is called an episodic task (an episode)
    • Continuous task: a task which never ends is called a continuous task, for example trading in the cryptocurrency markets or learning machine learning on the internet.

  31. Differentiate Exploitation from Exploration

    Answer

    • Exploitation: following a strict policy and exploiting the known information from already known states to maximize the reward.
    • Exploration: finding more information about the environment by exploring many states and actions instead of following a greedy policy at every action.

  32. What is the sample efficiency problem?

    Answer

    • After improving our policy with an experience, we never use this information again and only learn from future experiences, which is not sample efficient.

  33. Define Experience Replay
    Answer
    • Another technique that imitates human behaviour is called experience replay. The idea behind experience replay is that the agent builds a huge memory of already experienced events.
    • For this, a memory M is needed which keeps experiences e. We store the agent's experience e_t = (s_t, a_t, r_t, s_{t+1}) there. This means that instead of running Q-learning on state/action pairs as they occur during simulation or actual experience, the system stores the data discovered for [state, action, reward, next_state] - typically in a large table. Note this does not store associated values - this is the raw data to feed into action-value calculations later.
    • The learning phase is then logically separated from gaining experience and is based on taking random samples from this table, which breaks the temporal ordering. You still want to interleave the two processes - acting and learning - because improving the policy will lead to different behaviour that should explore actions closer to optimal ones, and you want to learn from those.
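
    A minimal sketch of such a memory (the capacity is an assumption): it stores raw transitions and hands out random mini-batches, which breaks the temporal correlation of the trajectory.

    ```python
    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, capacity=10000):
            self.memory = deque(maxlen=capacity)   # oldest experiences fall out

        def push(self, s, a, r, s_next):
            """Store the raw transition e_t = (s_t, a_t, r_t, s_{t+1})."""
            self.memory.append((s, a, r, s_next))

        def sample(self, batch_size):
            """Random mini-batch, independent of the order of collection."""
            return random.sample(self.memory, batch_size)
    ```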
  34. What are the advantages of experience replay?
    Answer
    • More efficient use of previous experience, by learning from it multiple times. This is key when gaining real-world experience is costly, as you can get full use of it. The Q-learning updates are incremental and do not converge quickly, so multiple passes over the same data are beneficial, especially when there is low variance in the immediate outcomes (reward, next state) given the same state-action pair.
    • Better convergence behaviour when training a function approximator, partly because the data is more like the i.i.d. data assumed in most supervised-learning convergence proofs.
  35. What are further advantages of Experience Replay?

    Answer

    • It turns our RL problem into a supervised learning problem!
    • We now reuse our data, which is far more sample efficient.
    • Before, we worked with non-i.i.d., strongly correlated data: we always saw states one after the other in the trajectory (now we randomly sample).

      • We break the correlation by presenting (state, value) pairs in a random order.
      • This greatly increases the convergence rate (or even makes convergence possible at all).

  36. What are the disadvantages of experience replay?

    Answer

    • It is harder to use multi-step learning algorithms, such as $Q(\lambda)$, which can be tuned to give better learning curves by balancing between bias (due to bootstrapping) and variance (due to delays and randomness in long-term outcomes). Multi-step DQN with experience replay is one of the extensions explored in the paper Rainbow: Combining Improvements in Deep Reinforcement Learning.

  37. How can we re-use the data?

    Answer

    1. Saving experiences (gather data)

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20138.png

    2. Apply Experience Replay
      1. Define a cost function, such as least squares
      2. Randomly sample a (state, value) pair from experience
      3. Apply an SGD update

    What is the idea of Deep Q-Learning (DQN)?

    DQN is quite young and was introduced only in 2013. It is end-to-end learning of the Q(s,a) values from pixels, with which well-known Atari games can be solved. Two main ideas are followed:

    • Experience Replay (a kind of generator for the data of the training process).
    • Fixed Q-Targets:
      • A form of off-policy method
      • Directly manipulating the Q-values that also define the targets leads to an unstable NN.
      • To solve this, two sets of parameters are needed:
        1. the current Q-network weights
        2. the target Q-network weights
      • During training, we update the current Q-network with Stochastic Gradient Descent, computing the targets from the frozen target network. After a fixed number of steps, we copy the current weights into the target network and continue training.
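
    A sketch of the fixed Q-targets update, using a linear Q-function for brevity (a real DQN uses a deep network; all names here are illustrative):

    ```python
    import numpy as np

    n_features, n_actions, gamma, lr = 8, 4, 0.99, 0.01
    w_current = np.random.randn(n_features, n_actions) * 0.01
    w_target = w_current.copy()          # frozen copy used to compute targets

    def q_values(w, state):              # state: feature vector of the state
        return state @ w

    def train_step(batch):
        for state, action, reward, next_state in batch:
            target = reward + gamma * q_values(w_target, next_state).max()
            td_error = target - q_values(w_current, state)[action]
            w_current[:, action] += lr * td_error * state  # SGD on current net only

    def sync_target():
        global w_target
        w_target = w_current.copy()      # periodic copy keeps targets stable
    ```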

  38. So far we have seen value-based RL, which relies on improving the model via the state value function, but there is also the possibility to learn a policy directly. What are the advantages?

    Answer

    • Better convergence properties
    • When the space is large, memory usage and computation consumption grow rapidly with value-based methods. Policy-based RL avoids this because the objective is to learn a set of parameters that is far smaller than the number of states.
    • Can learn stochastic policies. Stochastic policies can be better than deterministic ones, especially in two-player games, where a player who acts deterministically allows the other player to develop countermeasures in order to win.

  39. What are disadvantages of Policy Based RL?
    Answer
    • Typically converges to a local rather than the global optimum
    • Evaluating a policy is typically inefficient and has high variance; however, there are techniques to reduce this variance.
  40. How can we fix an RL case where the problem is too large (keeping track of every single state or state-action pair becomes infeasible)?

    Answer

    • Value Function Approximation. We extend the original value function of a state and the Q-values with a weight parameter.

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20139.png

  41. What kind of function can be used for Value Function approximation? Elaborate.

    Answer

    • Linear combinations of features:

      • In this case we represent the value function by a linear combination of the features and weights.

        data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20140.png

        • This lets us generalize from seen states to unseen states: we fit the function to the value estimates of already seen states and then apply the learned parameters to unseen states.

      A big advantage of this approach: it converges to a global optimum.

    • Neural networks:

      • Most general goal: find the optimal parameter w by finding a (local) minimum of a cost function; the cost function is the mean squared error.
    • In which step of the value function approximation scheme is the approximation evaluated?
      • Policy evaluation. (A minimal linear-approximator sketch follows the sources below.)

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20141.png

    Source: https://dnddnjs.gitbook.io/rl/chaper-8-value-function-approximatiobn/learning-with-function-approximator

    Slide: D. Silver 2015.
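
    A minimal sketch of a linear approximator trained with a semi-gradient TD(0) update (the feature mapping x(s) is a hypothetical ingredient you must supply):

    ```python
    import numpy as np

    def td0_update(w, x_s, reward, x_next, gamma=0.99, alpha=0.05):
        """One semi-gradient TD(0) step for v(s, w) = x(s)·w."""
        td_error = reward + gamma * (x_next @ w) - (x_s @ w)
        return w + alpha * td_error * x_s   # gradient of v w.r.t. w is x(s)
    ```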

  42. What are Policy Gradient Methods?

    Answer

    • Returns a probability for each action (via softmax)
    • Exploration is handled automatically
    • We don't care about calculating accurate state values
    • Adjusts the action probabilities depending on the rewards

  43. What is a parametrized policy (belongs to Policy Gradient Methods)?
    Answer
    • Assign a probability density to each action
    • Such policies can select actions without consulting a value function
    • As we will see, the most sophisticated methods still utilize value functions to speed up learning, but not for action selection (that variant is considered Actor-Critic)
  44. A stochastic policy belongs to policy-based RL. Why would we choose it?

    Answer

    First, it is important to note that stochastic does not mean randomness in all states; the policy can be stochastic only in the states where it makes sense. Usually, maximizing reward leads to a deterministic policy, but in some cases a deterministic policy is not a good fit for the problem. For example, in any two-player game, playing deterministically means the other player can come up with countermeasures in order to win all the time. In Rock-Paper-Scissors, if we play deterministically, i.e. the same shape every time, then the other player can easily counter our policy and win every game.

    • I.e. at every point, select one of the three actions with equal probability (1/3).
    • Therefore, a stochastic policy is desirable!

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20142.png

    Source: https://en.wikipedia.org/wiki/Rock_paper_scissors

  45. Another policy-based algorithm is REINFORCE, explain it. (Also called Monte Carlo Policy Gradient.)

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20143.png
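
    A hedged, minimal REINFORCE sketch with a softmax policy over linear preferences (episode generation is assumed to happen elsewhere; all names are illustrative):

    ```python
    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def reinforce_update(theta, episode, gamma=0.99, alpha=0.01):
        """episode: list of (feature vector x_s, action a, reward r) tuples;
        theta has shape (n_features, n_actions)."""
        G = 0.0
        for x_s, a, r in reversed(episode):
            G = r + gamma * G                      # Monte Carlo return
            probs = softmax(theta.T @ x_s)
            grad_log = -np.outer(x_s, probs)       # grad of log pi(a|s) for softmax
            grad_log[:, a] += x_s
            theta = theta + alpha * G * grad_log   # gradient ascent
        return theta
    ```

    Note that the full return G is needed, which is why REINFORCE cannot run online (see the disadvantages below).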

  46. REINFORCE, also called Vanilla Policy Gradient (VPG), can also work with a baseline, explain.

    Answer

    • With the baseline, the variance of the gradient estimate is mitigated

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20144.png

    Source Illustration: https://papoudakis.github.io/announcements/policy_gradient_a3c/

  47. What are disadvantages of REINFORCE?

    Answer

    • REINFORCE is unbiased and will converge to a local optimum.
    • However, it is a Monte Carlo method, which has its disadvantages:
    • Slow learning because of high variance (partially solved by the baseline).
    • Impossible to run online because of the return:
    • we need to generate a full episode before performing any learning.
    • Therefore, it can only be applied to episodic tasks (the task needs to terminate).

  48. Provide an intersection diagram showing what kinds of learning an RL algorithm can combine

    Answer

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20145.png

    Source: Karlsruhe University of Applied Science - Yurii Tolochko, 2019

  49. Explain the Actor-Critic Method

    Answer

    Actor-Critic methods build on top of REINFORCE-with-baseline.

    Two parts of the algorithm:

    • Actor - updates the action selection probabilities
    • Critic - provides a baseline for actions’ value

    The Actor-Critic method has the same goal as the baseline for REINFORCE: it should reduce the high variance in predicting an action. Instead of applying a Monte Carlo method for value function approximation, we perform a one-step Temporal Difference update (a sketch follows the illustration below).

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20146.png

    Source Illustration: Reinforcement Learning: An Introduction. A. Barto, R. Sutton. Page 325
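
    A minimal one-step actor-critic update sketch with linear features (softmax as in the REINFORCE sketch above; all names are illustrative):

    ```python
    import numpy as np

    def actor_critic_step(theta, w, x_s, a, r, x_next, gamma=0.99,
                          alpha_actor=0.01, alpha_critic=0.05):
        # Critic: one-step TD error replaces the Monte Carlo return.
        td_error = r + gamma * (x_next @ w) - (x_s @ w)
        w = w + alpha_critic * td_error * x_s
        # Actor: policy-gradient step scaled by the TD error.
        probs = softmax(theta.T @ x_s)
        grad_log = -np.outer(x_s, probs)
        grad_log[:, a] += x_s
        theta = theta + alpha_actor * td_error * grad_log
        return theta, w
    ```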

  50. What is the motivation for Monte Carlo Tree Search rather than Depth-First Search?

    Answer

    • DFS fails with more complicated games (too many states and branches to go through)

  51. Explain Monte Carlo Tree search

    Answer

    • TLDR: random moves are the main idea behind MCTS.
    • Monte Carlo Tree Search is an efficient reinforcement learning approach to look for the best path in a tree without performing a depth search over each path. Imagine an easy game like tic-tac-toe, where we have only 9 fields, so a game definitely ends after at most 9 actions. A depth-first search can easily find the best path through all actions. But for games like Go or chess, where we have a huge number of states (on the order of 10^120), it is impossible to perform a normal depth-first search. This is where MCTS comes in.

    Minimax and Game Trees

    • A game tree is a directed graph whose nodes are positions in a game. The following picture is a game tree for the game tic-tac-toe.

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20147.png

      Source: https://upload.wikimedia.org/wikipedia/commons/1/1f/Tic-tac-toe-full-game-tree-x-rational.jpg

    • The main idea behind MCTS is to make random moves in the graph. For any given state we no longer look through every future combination of states and actions. Instead, we use random exploration to estimate a state’s value. If we simulate 1000 games from a state s with random move selection, and see that on average P1 wins 75% of the time, we can be fairly certain that this state is better for P1 than P2.

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20148.png

  52. Vanilla MCTS has a problem with the sheer number of possible next states. How could it be improved?

    Answer

    • Vanilla MCTS suffers when there are many possible actions in a state but only a few of them lead to a good result: the chance to hit one of these good states by uniform random selection is really low. An improvement is to replace the uniform choice with something more intelligent. This is where the MCTS Upper-Confidence Bound (UCT) comes in. We can use the information gathered after backpropagation: the states keep information about their winning chance, and the actual move from the root is selected by visit count (maximum, or as a proportional probability). The idea is that under UCT, "good" nodes are visited more often.

    • But of course, a balance between exploration and exploitation is quite important. We need to make sure that we explore as many states as possible, especially in the first few rounds. The following formula can be used to calculate the selection of the next node:

      data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20149.png

    • This approach can be improved further by using a neural net that performs value function approximation: then we do not need to explore as many states, because the NN tells us how good a state is.
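
    A selection sketch based on the standard UCB1/UCT formula (node attributes are illustrative):

    ```python
    import math

    def uct_score(wins, visits, parent_visits, c=math.sqrt(2)):
        if visits == 0:
            return float("inf")      # always try unvisited children first
        exploit = wins / visits      # empirical winning chance
        explore = c * math.sqrt(math.log(parent_visits) / visits)
        return exploit + explore

    def select_child(children, parent_visits):
        """children: nodes with .wins and .visits counters."""
        return max(children,
                   key=lambda ch: uct_score(ch.wins, ch.visits, parent_visits))
    ```

    The constant c trades off exploitation (first term) against exploration (second term).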

  53. How does AlphaGo Zero work?

    Answer

    How AlphaGo Zero works - Google DeepMind

    data/5%20EN%20AI%20-%20Computer%20Vision,%20NLP,%20Reinforcement%20Lear%2079441a9ad0954925a2b006d0ae801fd2/Untitled%20150.png

    Source Illustrations: David Silver

  54. Summarize the algorithm of AlphaGo Zero

    Answer

    1. Use the current network to play some games and create training data; use MCTS (Monte Carlo Tree Search) for that. The process is completely focused on self-play.
    2. Use this data to retrain the network (fixed target values). We train a policy head and a value head.
    3. Now we have a slightly better network; repeat steps 1 + 2.

    Notes:

    • AlphaGo Zero uses a ResNet
    • Takes the last 8 images of the board into account

    Output of the network:

    • Value: an estimate of the probability of winning from the current position
    • Policy vector: a probability distribution over all actions

    Define Actor-Critic:

    • Learnt value function
    • Learnt policy
    • The value function “helps” learn the policy

6 XAI

Explainability

  1. What is the Saliency Map?

    Answer

    • Saliency describes unique features in an image (pixels, resolution, ...). The purpose is to visualize the conspicuity of an image, or in other words, to differentiate its visual features.
    • This could be the color scale of an image, which is done by converting it to black and white, displaying a given temperature, or think of night vision (brightness is green, darkness is black).

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled.png

    Could be used for traffic light detection. (Saliency detection)

    Source: https://analyticsindiamag.com/what-are-saliency-maps-in-deep-learning/

  2. What is Integrated Gradients?

    Answer

    • An attribution algorithm. It works by specifying a target class and accumulating the gradients along a straight-line path from a baseline input (e.g. a black image) to the actual input, attributing the prediction to the input features.

  3. What is the difference between CNN Heat Maps: Gradients vs. DeconvNets vs. Guided Backpropagation?

    Answer

    CNN Heat Maps: Gradients vs. DeconvNets vs. Guided Backpropagation

  4. What is NMF?

    Answer

    • A matrix factorisation technique (another one would be SVD). It stands for Non-Negative Matrix Factorization and splits the user-item matrix R into the matrices U and V. The optimization problem is that the product of U and V should approximate R, with all entries of U and V constrained to be non-negative.
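
    A small NMF example with scikit-learn (the rating matrix is made up; note that scikit-learn's NMF treats all entries, including zeros, as observed values):

    ```python
    import numpy as np
    from sklearn.decomposition import NMF

    R = np.array([[5, 3, 0, 1, 0],      # hypothetical user-item matrix
                  [4, 0, 0, 1, 1],
                  [1, 1, 0, 5, 4],
                  [0, 1, 5, 4, 0]], dtype=float)

    model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
    U = model.fit_transform(R)          # user factors (4 x 2), non-negative
    V = model.components_               # item factors (2 x 5), non-negative
    R_hat = U @ V                       # U · V should approximate R
    ```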

  5. What is NCF (Neural Collaborative Filtering)?

    Answer

    • Models user-item interactions in the latent space with a NN
    • Is a generalization of Matrix Factorization
    • In the input layer, the user and item are one-hot encoded.
    • The Neural CF Layers can be anything that is non-linear

    The authors claim that the dot product of matrix factorization limits the expressiveness of the user and item latent vectors (vector a might be closer to vector b than to c in the latent space, although it is more similar to c).

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%201.png

    The Neural CF Layer is replaced by a multiplication layer, which performs element-wise multiplication on the two inputs. The weight matrix from the multiplication layer to the output is an identity matrix with a linear activation function.

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%202.png

    y_ui = predicted entry of the recommender matrix

    L = Linear activation function

    circle with dot = element-wise multiplication

    p_u = latent vector user

    q_i = latent vector item

    Since J is an identity matrix, it turns the element-wise multiplication into a dot product. Because L is a linear function, the input is only scaled up or down, but there is no real transformation; hence we can omit it. We end up with the same function as for Matrix Factorization, which shows that NCF can be a generalization of Matrix Factorization.

    To leverage the architecture of NCF we need non-linearity. It includes a Multi-Layer Perceptron (MLP), Generalized Matrix Factorization (GMF) and a non-linear sigmoid activation function.

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%203.png

    Source:

    https://towardsdatascience.com/paper-review-neural-collaborative-filtering-explanation-implementation-ea3e031b7f96
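
    As an illustration, a minimal GMF building block in PyTorch (sizes and names are assumptions; this is a sketch, not the paper's reference implementation):

    ```python
    import torch
    import torch.nn as nn

    class GMF(nn.Module):
        """Element-wise product of user and item embeddings, followed by a
        learned linear layer and a sigmoid."""
        def __init__(self, n_users, n_items, dim=8):
            super().__init__()
            self.user_emb = nn.Embedding(n_users, dim)
            self.item_emb = nn.Embedding(n_items, dim)
            self.out = nn.Linear(dim, 1)

        def forward(self, user_ids, item_ids):
            p_u = self.user_emb(user_ids)            # latent user vector
            q_i = self.item_emb(item_ids)            # latent item vector
            return torch.sigmoid(self.out(p_u * q_i)).squeeze(-1)
    ```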

  6. What are the components of a NeuMF (Non-linear NCF)?

    Answer

    • GMF (Generalized Matrix Factorization)
    • MLP (Multi Layer Perceptron)

  7. What is a Multilayer perceptron network (MLP)?

    Answer

    It is just a name for the simplest kind of deep neural network.

    • There's at least one hidden layer
    • Each neuron has a non-linear activation function
    • Densely connected (each neuron in one layer is connected to every neuron in the next and previous layers)
    • Uses a usual optimizer (e.g. Adam) and, as a loss function, for example binary cross-entropy

    What is the Mean Reciprocal Rank (MRR)?

    MRR takes the single highest-ranked positive/relevant item (since a system always returns an ordered list, it is the first positive solution), here depicted as rank_i. It does not care about other relevant items (e.g. at position 4 or 20).

    Use MRR if:

    1. There is only one result, or
    2. You care only about the highest-ranked result (web search scenario)

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%204.png

    Example:

    ```text
    Query: "Cities in California"

    Ranked results: "Portland", "Sacramento", "Los Angeles"
    Ranked results (binary relevance): [0, 1, 1]
    Number of correct answers possible: 2

    Reciprocal Rank = 1/2

    Precision at 1: 0/1
    Precision at 2: 1/2
    Precision at 3: 2/3
    Average precision = 1/m * [1/2 + 2/3] = 1/2 * [1/2 + 2/3] ≈ 0.58
    ```

    Source: https://stats.stackexchange.com/questions/127041/mean-average-precision-vs-mean-reciprocal-rank
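
    The example can be reproduced with a few lines of Python (a sketch, assuming binary relevance lists ordered by rank):

    ```python
    def reciprocal_rank(relevance):
        for i, rel in enumerate(relevance, start=1):
            if rel:
                return 1 / i          # only the first relevant item counts
        return 0.0

    def average_precision(relevance):
        hits, precisions = 0, []
        for i, rel in enumerate(relevance, start=1):
            if rel:
                hits += 1
                precisions.append(hits / i)   # precision at each relevant hit
        return sum(precisions) / len(precisions) if precisions else 0.0

    print(reciprocal_rank([0, 1, 1]))     # 0.5
    print(average_precision([0, 1, 1]))   # (1/2 + 2/3) / 2 ≈ 0.58
    ```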

  8. How does MRR differ from MAP?

    Answer

    • MRR only considers the rank of the single highest-ranked relevant item, whereas MAP averages the precision at each relevant item's position over all relevant items (see the example above).

  9. What is the difference between implicit and explicit matrix factorization?

    Answer

    • Implicit mostly describes unary ratings, such as whether the user has seen an item or not (0 or 1)
    • Explicit describes an actually given rating, such as 3.5 on a scale from 0 to 5

  10. What is the median rank?

    Answer

    • The position of the relevant item that lies in the middle of the list of relevant items (50%). If the list has an even length, take the position just above the middle.

    ```python
    # Example
    relevant_item_positions = [4, 60, 67, 77, 80]
    median_rank = 67
    ```

  11. What is Binary Cross-entropy?

    Answer

    • With binary cross-entropy I can only classify two classes, whereas categorical cross-entropy can classify many classes.
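
    A small sketch of the formula BCE = −mean(y·log(p) + (1−y)·log(1−p)):

    ```python
    import numpy as np

    def binary_cross_entropy(y_true, y_pred, eps=1e-12):
        y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
        return -np.mean(y_true * np.log(y_pred)
                        + (1 - y_true) * np.log(1 - y_pred))

    print(binary_cross_entropy(np.array([1, 0, 1]),
                               np.array([0.9, 0.1, 0.8])))  # ≈ 0.14
    ```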

  12. What algorithms can be used for recommendation?

    Answer

    • For example: collaborative filtering via matrix factorization (SVD, NMF, weighted MF), neural approaches such as NCF/NeuMF, or content-based methods.

  13. What is the Monte Carlo method?

    Answer

    • A family of methods that estimate a quantity by repeated random sampling, e.g. estimating a state's value by averaging the returns of many randomly sampled episodes.

XAI Methods

  1. What is a partial dependence plot and how does it work?

    Answer

    • Shows how the average prediction in your dataset changes when the j-th feature is changed
    • Requires viewing the model as additive.
    • Computes the marginal effect of one or two features.

    Note: the marginal effect is what one feature adds to the prediction if the other features are independent. In order to calculate it, we have to average over those non-interesting features.

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%207.png

    Source:

    https://christophm.github.io/interpretable-ml-book/pdp.html

    This means the following: in order to make a statement about a variable and its partial dependence on the output, we need to run inference with all possible values of the other features (here B3, C2, C3). In the end we have n predicted outcomes, which need to be averaged. This averaged y is the marginal contribution of A1.

    Example:

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%208.png

    Source: https://towardsdatascience.com/introducing-pdpbox-2aa820afd312
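
    With scikit-learn, a partial dependence can be computed directly (toy data; the feature index is arbitrary):

    ```python
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.inspection import partial_dependence

    X, y = make_regression(n_samples=500, n_features=5, random_state=0)
    model = GradientBoostingRegressor().fit(X, y)

    # Average prediction while feature 0 is varied over a grid.
    result = partial_dependence(model, X, features=[0])
    print(result["average"])
    ```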

  2. What are advantages of PDP?

    Answer

    • perfect result if the features are independent
    • easy to implement
    • the calculation is causal for the model, but not necessarily for the real world

  3. What are disadvantages of PDP?
    Answer
    • at most 2 features per plot
    • Some plots do not show the feature distribution → misleading interpretation
    • Assumption of independence (weight ↔ height example) → Solution: Accumulated Local Effect plots (ALE), which work with the conditional distribution
    • Heterogeneous effects might be hidden, because large positive and negative predictions cancel out (→ resulting in zero, suggesting no effect on the model) → Solution: Individual Conditional Expectation curves ⇒ ICE curves
  4. What is Individual Conditional Expectation plot?
    Answer
    • Extension of the Partial Dependence Plot: it draws one curve per instance instead of only the average curve.
  5. What is an additive model?

    Answer

    • It is characterized by the effect of one variable not depending on another (independence) and the contributions/effects adding up to the total effect/prediction. Therefore, we can understand each variable individually (synonym: marginally), and as a result we are able to interpret the model.
    • Think of good old regression models. In a simple regression model, the effect of one variable does not depend on other variables and their effects add up to the total effect. If we can understand each variable individually or marginally, we would be able to interpret the entire model. However, this will no longer be true in the presence of interactions.

  6. What is the idea of Shapley Values?

    Answer

    • Averaging a feature's contribution over all N! possible orderings (since in non-linear models the contribution depends on the order in which the features are added).
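
    A tiny exact computation over all orderings (a toy, hypothetical value function; real feature-attribution libraries approximate this):

    ```python
    import itertools
    from math import factorial

    def shapley_values(players, value):
        """Average each player's marginal contribution over all N! orderings."""
        phi = {p: 0.0 for p in players}
        for order in itertools.permutations(players):
            coalition = frozenset()
            for p in order:
                phi[p] += value(coalition | {p}) - value(coalition)
                coalition = coalition | {p}
        n = factorial(len(players))
        return {p: v / n for p, v in phi.items()}

    # Toy: feature "a" contributes 10; "b" adds 5 only together with "a".
    v = lambda S: 10 * ("a" in S) + 5 * ("a" in S and "b" in S)
    print(shapley_values(["a", "b"], v))   # {'a': 12.5, 'b': 2.5}
    ```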

  7. What is LAYER-WISE RELEVANCE PROPAGATION (LRP)?

    Answer

    • It is similar to Saliency Maps as it produces a heatmap, but LRP connects it to the neurons that contributed the most.

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%209.png

    Source: https://dbs.uni-leipzig.de/file/XAI_Alexandra_Nau.pdf

  8. What is the idea of a regularizer?

    Answer

    • Discourage the NN from memorizing/overfitting

  9. What is the manifold space?

    Answer

    • A continuous, non-intersecting surface

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2010.png

  10. What is the null hypothesis?

    Answer

    • A null hypothesis is a hypothesis that says there is no statistical significance between the two variables in the hypothesis. It is the hypothesis that the researcher is trying to disprove. In the example, Susie's null hypothesis would be something like this: There is no statistically significant relationship between the type of water I feed the flowers and growth of the flowers. A researcher is challenged by the null hypothesis and usually wants to disprove it, to demonstrate that there is a statistically-significant relationship between the two variables in the hypothesis. Source: https://study.com/academy/lesson/what-is-a-null-hypothesis-definition-examples.html

  11. What Is an Alternative Hypothesis?
    Answer
    • An alternative hypothesis simply is the inverse, or opposite, of the null hypothesis. So, if we continue with the above example, the alternative hypothesis would be that there IS indeed a statistically-significant relationship between what type of water the flower plant is fed and growth. More specifically, here would be the null and alternative hypotheses for Susie's study:
    • Null: If one plant is fed club soda for one month and another plant is fed plain water, there will be no difference in growth between the two plants.
    • Alternative: If one plant is fed club soda for one month and another plant is fed plain water, the plant that is fed club soda will grow better than the plant that is fed plain water.
    • Source: https://study.com/academy/lesson/what-is-a-null-hypothesis-definition-examples.html
  12. What is the Friedman's test?
    Answer
  13. What is the Lower Bound?

    Answer

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2011.png

    Source: https://www.mathsisfun.com/definitions/lower-bound.html

  14. What is Jensen's inequality?

    Answer

    For a linear function, transforming a variable and then taking the mean leads to the same result as taking the mean first and transforming it afterwards.

    Example:

    ```python
    mean(f(x)) == f(mean(x)), for linear f()
    ```

    Jensen's inequality says:

    The intuition from linear mappings does not hold for nonlinear functions.

    Non-linear functions are not a straight line when we think of the transformation from the input to the output. They are either convex (curving upward) or concave (curving downward).

    Example:

    ```python
    f(x) = x^2 is a convex function
    ```

    I quote:

    "Instead, the mean transform of an observation mean(f(x)) is always greater than the transform of the mean observation f(mean(x)), if the transform function is convex and the observations are not constant. We can state this as:"

    ```python
    mean(f(x)) >= f(mean(x)), for convex f() and x is not a constant
    ```

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2012.png

    Our intuition that the mean of a row of numbers passed through a non-linear function (e.g. the square) equals applying the square to each number first and then taking the mean is wrong! The latter is greater.

    • used to make claims about a function where little is known about the distribution
    • used to compare arithmetic mean and geometric mean

    Source: https://machinelearningmastery.com/a-gentle-introduction-to-jensens-inequality/
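
    A quick numerical check of the inequality:

    ```python
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    f = lambda v: v ** 2                    # convex function

    print(np.mean(f(x)))                    # 7.5  (mean of the transforms)
    print(f(np.mean(x)))                    # 6.25 (transform of the mean)
    assert np.mean(f(x)) >= f(np.mean(x))   # Jensen's inequality holds
    ```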

  15. What is the geometric mean?

    Answer

    • The Geometric Mean is a special type of average where we multiply the numbers together and then take a square root (for two numbers), cube root (for three numbers), etc. For example, the geometric mean of 2 and 8 is sqrt(2 · 8) = 4.
    • For n numbers we take the nth root. (The nth root is concave.)

  16. What is data scarcity?

    Answer

  17. What is a Cox Model?

    Answer

    • Quote: "A Cox model is a statistical technique for exploring the relationship between the survival of a patient and several explanatory variables. Survival analysis is concerned with studying the time between entry to a study and a subsequent event (such as death)."

      Source: http://www.bandolier.org.uk/painres/download/whatis/COX_MODEL.pdf

  18. Can a linear model describe a curve?

    Answer

    Yes, that is possible. Usually a linear model only contains linear parameters and linear variables, but as long as the parameters remain linear and the transformed variable is independent, its exponent can be changed. For example, an independent variable can be squared to fit a curve.

    Rule for a linear model:

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2013.png

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2014.png

    Source: https://statisticsbyjim.com/regression/difference-between-linear-nonlinear-regression-models/

  19. What are Restricted Boltzmann Machines?

    Answer

    • Can be used for: dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling.
    • The goal is to learn a compressed version of the features (a lower-dimensional representation) that describes the input well. The concept is similar to AEs, but in contrast they don't have a dedicated decoding part. The decoding flows back to the input and updates the weights directly instead of using additional weights. The reconstruction that flows back is compared with the input, and the weights are updated.
    • Therefore, we need two biases: one for the forward pass (hidden bias) and one for the backward pass (reconstruction). The input layer is considered the visible layer and the second layer the hidden layer.

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2015.png

    • RBMs use stochastic units with particular distributions. Based on this, Gibbs Sampling is used:

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2016.png

    • As an alternative to backpropagation, Contrastive Divergence is used. (I guess this is due to the probability distribution; VAEs solve this problem with the reparameterization trick.)

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2017.png

    Source: https://medium.com/edureka/restricted-boltzmann-machine-tutorial-991ae688c154

    Additional Source: https://www.patrickhebron.com/learning-machines/week5.html

  20. What is a Deep Belief Network?

    Answer

    • Many layers of RBMs stacked together. For updating the weights, Contrastive Divergence is still used.

  21. What is synthetic data?
    Answer
    • data generated by an algorithm
  22. What is the inverse covariance matrix?

    Answer

    • The inverse of the covariance matrix, also called the precision matrix. For Gaussian variables, a zero entry means the two variables are conditionally independent given all the others.

    https://www.quora.com/How-can-I-use-Gaussian-processes-to-perform-regression

  23. What are explicit factor models?

    Answer

    • Explicit factor models (EFM) [17] generate explanations based on the explicit features extracted from users’ reviews.

  24. What is Explainable Matrix Factorization (EMF)?

    Answer

    • Another algorithm that uses an explainability regularizer or soft constraint in the objective function of classical matrix factorization. The constraining term tries to bring the user and explainable item's latent factor vectors closer, thus favoring the appearance of explainable items at the top of the recommendation list.

    Source: An Explainable Autoencoder For Collaborative Filtering Recommendation (https://arxiv.org/abs/2001.04344)

  25. What is the optimal transport cost (OT)?

    Answer

    • It uses the Wasserstein distance to measure how much work has to be invested to transport one probability distribution to another. In other words: The cost to transform one distribution into the other.

  26. What is the infimum?

    Answer

    • The greatest lower bound of a set (not to be confused with the minimum: the infimum does not have to be attained by an element of the set).

  27. What is the difference between Wasserstein and Kullback-Leibler?

    Answer

  28. What is the chi-squared test?

    Answer

    • used for categorical data
    • Does one categorical variable affect another one?
    • calculates the p-value
    • In case of a low p-value:
      • Result is thought of as being "significant" meaning we think the variables are not independent.
    • chi-square = ÎŁ (Observed − Expected)^2 / Expected, summed over all cells
    • Calculate the degrees of freedom: Degrees of Freedom = (rows − 1) × (columns − 1)
    • Look up the p-value in a pre-computed table using the chi-square value and the degrees of freedom

      data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2019.png

    • Source: https://www.mathsisfun.com/data/chi-square-table.html
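
    With SciPy, the test statistic, degrees of freedom, and p-value come from one call (the contingency table is made up):

    ```python
    from scipy.stats import chi2_contingency

    observed = [[207, 282, 241],    # hypothetical counts: rows x columns
                [234, 242, 232]]

    chi2, p, dof, expected = chi2_contingency(observed)
    print(f"chi2={chi2:.2f}, dof={dof}, p-value={p:.4f}")
    # A low p-value suggests the two variables are not independent.
    ```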

  29. What is a CSR Matrix?

    Answer

    Compressed Sparse Row (CSR): a format that compresses a sparse matrix by storing only the non-zero values together with their column indices and row pointers.

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2020.png

    Source: https://machinelearningmastery.com/sparse-matrices-for-machine-learning/
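
    The three CSR arrays can be inspected with SciPy (toy matrix):

    ```python
    import numpy as np
    from scipy.sparse import csr_matrix

    dense = np.array([[1, 0, 0, 1],
                      [0, 0, 2, 0],
                      [0, 3, 0, 0]])
    sparse = csr_matrix(dense)

    print(sparse.data)      # [1 1 2 3]  non-zero values
    print(sparse.indices)   # [0 3 2 1]  column index of each value
    print(sparse.indptr)    # [0 2 3 4]  where each row starts in `data`
    ```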

  30. What is rocchio feedback?

    Answer

    • Improving the search query by transforming the query and documents into a vector space and moving the query vector toward the centroid of the relevant documents while penalizing the distance to the centroid of the non-relevant documents.
    • Described as an equation:

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2021.png

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2022.png

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2023.png

    Source and example calculation: https://www.coursera.org/lecture/text-retrieval/lesson-5-2-feedback-in-vector-space-model-rocchio-PyTkW

  31. What is a Weighted Matrix Factorisation?

    Answer

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2024.png

    • In contrast to SVD, whose cost function processes all entries at once (seen and unseen), weighted MF sums over observed and unobserved items separately. SVD leads to poor generalisation when the corpus is large, due to the sparse matrix.

    data/6%20XAI%200fe6f8c58d1e489ba64035eab21d66e2/Untitled%2025.png

  32. How can the objective function (cost) be minimized?

    Answer

  33. What are the differences betweeen SGD and WALS?

    Answer

    SGD

    +: Very flexible—can use other loss functions.

    +: Can be parallelized.

    -: Slower—does not converge as quickly.

    -: Harder to handle the unobserved entries (need to use negative sampling or gravity).

    WALS

    -: Reliant on squared loss only.

    +: Can be parallelized.

    +: Converges faster than SGD.

    +: Easier to handle unobserved entries.

  34. What are advantages and disadvantages of Collaborative Filtering?

    Answer

    Taken from: https://developers.google.com/machine-learning/recommendation/collaborative/summary

    Advantages:

    No domain knowledge necessary

    We don't need domain knowledge because the embeddings are automatically learned.

    Serendipity

    The model can help users discover new interests. In isolation, the ML system may not know the user is interested in a given item, but the model might still recommend it because similar users are interested in that item.

    Great starting point

    To some extent, the system needs only the feedback matrix to train a matrix factorization model. In particular, the system doesn't need contextual features. In practice, this can be used as one of multiple candidate generators.

    Disadvantages:

    Cannot handle fresh items

    The prediction of the model for a given (user, item) pair is the dot product of the corresponding embeddings. So, if an item is not seen during training, the system can't create an embedding for it and can't query the model with this item. This issue is often called the cold-start problem. However, the following techniques can address the cold-start problem to some extent:

    • Projection in WALS. Given a new item i0 not seen in training, if the system has a few interactions with users, then the system can easily compute an embedding vi0 for this item without having to retrain the whole model. The system simply has to solve the following equation or the weighted version:

      min_{v_i0 ∈ R^d} ‖A_i0 − U · v_i0‖

      The preceding equation corresponds to one iteration in WALS: the user embeddings are kept fixed, and the system solves for the embedding of item i0. The same can be done for a new user.

    • Heuristics to generate embeddings of fresh items. If the system does not have interactions, the system can approximate its embedding by averaging the embeddings of items from the same category, from the same uploader (in YouTube), and so on.

    Hard to include side features for query/item

    Side features are any features beyond the query or item ID. For movie recommendations, the side features might include country or age. Including available side features improves the quality of the model. Although it may not be easy to include side features in WALS, a generalization of WALS makes this possible.

    To generalize WALS, augment the input matrix with features by defining a block matrix Ā, where:

    • Block (0, 0) is the original feedback matrix A.

    • Block (0, 1) is a multi-hot encoding of the user features.

    • Block (1, 0) is a multi-hot encoding of the item features.
