In version dated "2018-05-12":
- Part 2.4 Predictive Modeling Across Sets, first sentence:
Physicians have an strong preference towards logistic regression due to its inherent interpretability.
Should be "a strong".
- Part 2.4 Predictive Modeling Across Sets, under Fig 2.10:
It is also interesting to note that the model of the risk set requires all 8 predictors while recursive feature elimination for the risk, imaging predictors and imaging predictor interactions set only requires only 4 predictors to achieve a better cross-validated area under the ROC curve.
One "only" is unnecessary.
- Part 2.2 Preprocessing, under Fig 2.3:
These three pairs are highlighted in red boxes along the diagonal of the coorleation matrix in Figure 2.3.
Should be "correlation".
- Part 3 A Review of the Predictive Modeling Process, second sentence:
These topics are fairly general with regards to empirical modeling and include: metric for measuring performance for regression and classification problems, approaches for optimal data usage which includes data splitting and resampling, best practices for model tuning, and recommendations for comparing model performance.
"which includes" should be "which include", since the subject is "approaches".
- Part 3.1 Illustrative Example: OkCupid Profile Data, second paragraph:
While the imbalance hasa significant impact on the analysis, the illustration presented here will mostly side-step this issue by down-sampling the instances such that the number of profiles in each class are equal.
Should be "has a".
- Part 3.2 Measuring Performance, near "(McElreath 2016)" reference:
The question that one really wants to know is “if my value was predicted to be an event, what is are the chances that it is truly is an event?” or Pr[Y = STEM|P = STEM].
The "is" in "what is are" is unnecessary, as is one "is" in "it is truly is".
- Part 3.2 Measuring Performance, near "(McElreath 2016)" reference:
Sensitivity (or specificity, depending on one’s point of view) are the “likelihood” parts of this equation.
Should be "is ... part" instead.
- Part 3.2 Measuring Performance, above Fig 3.3:
Table 3.1 can also be visualized using a mosaic plot such as the one shown in Figure 3.3(b) where the size of the blocks are proportional to the amount of data in each cell.
Should be either "where the sizes ... are" or "where the size ... is".
- Part 3.2 Measuring Performance, above Fig 3.3:
The mosaic plot for this confusion matrix is shown in Figure 3.3(a) where the blue block in the upper left becomes larger but there is also an increase in the red block in the lower right.
There is no red block in the lower right (probably meant to be upper right).
- Part 3.3 Data Splitting, first paragraph:
The test set is used only at the conclusion of these activities for estimating a final, unbiased assessment of the model’s performance. It is critical that the test set not be used prior to this point. Looking at its results will bias the outcomes since the testing data will have become part of the model development process.
"Its" in the last sentence refers to the test set, which has no "results", so that sentence probably needs rephrasing.
- Part 3.4.1 V-Fold Cross-Validation and Its Variants, third to last paragraph:
Also, Section 4.4 has a more extensive description of how the assessment datasets can be used to drive improvements to models.
Should be "data sets" instead of "datasets".
- Part 3.4.6 What Should Be Included Inside of Resampling?, second sentence:
This is somewhat of a simplification.
"of" is unnecessary.
- Part 3.4.6 What Should Be Included Inside of Resampling?, first paragraph:
For example, in Section 1.1, a transformation procedure was used to modify the predictors variables and this resulted in an improvement in performance.
"predictors variables" is incorrect; it should be "predictor variables", or just "predictors" or "variables".
- Part 3.4.6 What Should Be Included Inside of Resampling?, second to last paragraph:
While the test set data often have the outcome data blinded, it is possible to “train to the test” by only using the training set samples that are most similar to the test set data.
Should be "has" instead of "have".
- Part 3.6 Model Optimization and Tuning, near "(Srivastava et al. 2014)" reference:
This is the rate at which coefficients are randomly set to zero during and is most likely to attenuate overfitting (Srivastava et al. 2014).
The sentence seems unfinished: "during" is missing its object.
- Part 3.6 Model Optimization and Tuning, above Table 3.3:
The learning rate parameter controls the rate of decent during the parameter estimation iterations and these values were contrasted to be between zero and one.
Should be "descent".
- Part 3.6 Model Optimization and Tuning, last paragraph:
Depending on the problem, this bias might over-estimate the model’s true performance.
Should be "overestimate" instead.