carpentries-incubator / high-dimensional-stats-r

High-dimensional statistics with R

Home Page: https://carpentries-incubator.github.io/high-dimensional-stats-r

License: Other

Ruby 0.01% Makefile 0.02% R 0.07% Shell 0.01% Python 0.17% TeX 0.01% HTML 99.72% Dockerfile 0.01%
lesson pre-alpha carpentries-incubator statistics high-dimensional-statistics english r

high-dimensional-stats-r's Introduction

High dimensional stats with R

Create a Slack Account with us

This repository is part of The Carpentries Incubator, a place for The Carpentries community to collaboratively create, test, and improve lessons.

Contributing

We welcome all contributions to improve the lesson! Maintainers will do their best to help you if you have any questions, concerns, or experience any difficulties along the way.

We'd like to ask you to familiarize yourself with our Contribution Guide and have a look at the more detailed guidelines on proper formatting, ways to render the lesson locally, and even how to write new episodes.

Please see the current list of issues for ideas for contributing to this repository. For making your contribution, we use the GitHub flow, which is nicely explained in the chapter Contributing to a Project in Pro Git by Scott Chacon. Look for the tag good_first_issue. This indicates that the maintainers will welcome a pull request fixing this issue.

Reviews

The lesson has been iteratively developed and improved. For information on the development process, reviews, and feedback from instructors who have taught the lesson, see REVIEWS.

Maintainer(s)

Current maintainers of this lesson are

  • Alan O'Callaghan
  • Ailith Ewing
  • Catalina Vallejos
  • Hannes Becher

Authors

A list of contributors to the lesson can be found in AUTHORS.

Citation

To cite this lesson, please consult CITATION.

high-dimensional-stats-r's People

Contributors

actions-user, ailithewing, alanocallaghan, andrzejromaniuk, annajiat, catavallejos, david-a-parry, ewallace, gsrob-scu, hannesbecher, hyweldd, mallewellyn, nathansam, tobyhodges, zkamvar


high-dimensional-stats-r's Issues

Explain design / model matrix more?

Currently the lesson refers to design / model matrices several times starting in episode 2 with:

The process of running a model in limma is somewhat different to what you may have seen when running linear models. Here, we define a model matrix or design matrix, which is a way of representing the coefficients that should be fit in each linear model. These are used in similar ways in many different modelling libraries.

In instructor discussions during the 2022-09-28 delivery with @hannesbecher and @luciewoellenstein44, we wondered if it would help to explain this more. The model/design matrix is a core concept for the lesson, and learners who have only used lm/glm may never have encountered it before.

Maybe add a drop-down box explaining it in more detail, with links to some further reading?
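A minimal sketch of what such an explainer could show, assuming a hypothetical two-level factor called group (this is not the lesson's exact code):

# A design (model) matrix encodes which coefficients are fitted for each sample.
# `group` is a hypothetical two-level factor, e.g. control vs treated.
group <- factor(c("control", "control", "treated", "treated"))

# model.matrix() builds the design matrix that lm() constructs internally
# and that limma's lmFit() expects as an explicit argument.
design <- model.matrix(~ group)
design
# The result has an intercept column of 1s and a 0/1 indicator column
# ("grouptreated") marking the samples in the treated group.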

Review comments: Episode 4 - principal component analysis

Episode 4

I really like this practical presentation of PCA - I can see this being genuinely very useful to someone actually wanting to implement it. I have made some comments below, with minor comments written at the bottom.

Again, where possible, I will submit pull requests for these changes.

  • Line 49/Introduction: propose a minor re-wording here just for clarity (also, if learners have completed previous episodes, they'll have a good idea what this looks like - "imagine" leads me to believe you're talking about something different). Something like:

"Suppose a dataset contains many variables ($p$), close to the total number of rows in the dataset ($n$). It is likely that some of these variables are highly correlated. Variables may even be so highly correlated that they represent the same overall effect."

Also, just checking this - the gene expression example later has p>>n. Maybe it's better to say something more vague here about the use cases of PCA so it's consistent with this. Something like:

"If a dataset contains many variables ($p$), it is likely that some of these variables are highly correlated. Variables may even be so highly correlated that they represent the same overall effect."

  • Line 65: Could add a small extension to this sentence to make it clear that this single feature is capturing the overall
    effect of the previous 3 variables, reinforcing that this is intuitively the goal of PCA. Something like:

"As an example, PCA might reduce several variables representing aspects of patient health (blood pressure, heart rate, respiratory rate) into a single feature capturing an overarching "patient health" effect."

  • Line 70/Advantages and disadvantages of PCA: I like this summary of the advantages and disadvantages, but I would propose moving this to the end of the episode as it's quite difficult to understand without first understanding what PCA is.

  • Line 75/Advantages and disadvantages of PCA: I would propose rewording "The calculations used in a PCA are easy to understand for statisticians and non-statisticians alike" as "The calculations used by PCA are simple to understand compared to other methods for dimension reduction".

  • Line 156/What is a principal component?: I really appreciate this description of PCA - I think it explains PCA in an extremely
    understandable way and avoids the temptation to just present the maths. As such, I think this section deserves to be called "Principal component analysis" for signposting, as it describes the whole process. A short sentence at the start saying that PCA describes the data by breaking it down into "principal components" could also help with this.

  • Line 203/What is a principal component: I think this formula could be linked with the description of the first PC above just to make it absolutely clear how this mathematical description comes about and how these two parts are linked (and what the PC "scores" are in the example above).

  • Line 216/A prostate cancer dataset: This prostate dataset is used throughout the episodes, where it is perhaps more informative to demonstrate the methods on a non-high-dimensional dataset. I don't have a problem with this per se, but I think a brief statement making it clear that the data are not technically high-dimensional (and are simply used to illustrate the method,
    as in episode 1) could be included to avoid confusion. Could even say that we apply the method to a (very!) high-dimensional dataset later (the gene expression data).

Also, I'd be tempted to remove this title because there's no text between it and the title before. It could be combined into the title "How do we perform PCA?" or removed, since the subsequent text makes clear that this is the dataset.

  • Line 240/A prostate cancer dataset: "Standard PCAs are carried out using continuous variables only."
I think this sort of information is better given in the section above explaining PCA. It may get lost in the example here. I'm thinking that people may back-reference the section on PCA for all examples in this section/their own examples.

  • Line 264/Do we need to standardise the data: I think a brief sentence at the start of this section about why you would
    standardise data for PCA would help the subsequent explanation and the justification for not standardising
    in the next example. It may also help someone practically implement PCA on a new data set.

Something like:

"Since PCA derives principal components based on the variance they explain in the data, we may need to scale variables
in our data set if we want to ensure that each variable is considered equally in the PCA. This is particularly useful
if we don't want the PCA to ignore variables that may be important to our analysis just because they have low variance."

  • Line 277/Do we need to standardise the data: "It is clear from this output that we need to scale each.." would suggest removing "It is clear" as it may not be.

If editing this section as per the previous comment, could rewrite to "Since we want each of these variables to contribute equally to our analysis, but there are large differences in variance, we need to scale each of these variables before including them in the PCA. In this example, we standardise all five variables to have a mean of 0 and a standard..."

Then the challenge just reinforces this.

  • Line 318/A prostate cancer dataset: Query - why is a different package for PCA used now?

  • Line 324/A prostate cancer dataset: I don't think the scale = TRUE argument changes the mean; that is what center = TRUE does. Perhaps this should say:
    "Note that the [center = TRUE and] scale = TRUE arguments are used to standardise the variables to have a mean of 0 and a standard deviation of 1." (See the sketch after this list of comments.)

  • Line 373/How many principal components do we need?: Adding lines to this scree plot would really help in visualising the elbow.

  • Line 380/How many principal components do we need?: Add a brief sentence explaining how many PCs we would choose from this scree plot, as we haven't addressed this yet despite the section heading. (Both scree-plot suggestions are sketched after this list of comments.)

  • Line 467/Using PCA to analyse gene expression data: It's not clear why we're using another package again here.

  • Line 527/A gene expression dataset of cancer patients: I think swapping the order of the first two points in this paragraph may help with flow.

  • I think it needs to be stated somewhere that choosing <p (or <n if high-dim) PCs results in loss of information from the model/data set.

  • Line 656/Challenge 4: "...and suggest an appropriate number of principal components." to test how well people have understood?
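For the prcomp() and scree-plot suggestions above, a minimal sketch (the name pros2 is a stand-in for the lesson's data frame of five continuous prostate variables, and the elbow position is illustrative):

# PCA with the variables standardised, so each contributes equally.
# Note that in prcomp() the scaling argument is spelled `scale.`.
pca <- prcomp(pros2, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Scree plot with guide lines to help visualise the elbow.
plot(prop_var, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")
abline(v = 3, lty = 2)    # hypothetical elbow position
abline(h = 0.1, lty = 3)  # e.g. retain PCs explaining > 10% of variance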

Minor changes

  • Line 278/Do we need to standardise the data: "In this example ..." -> "In this example, ..."

  • Line 334/A prostate cancer dataset: "importance of each component" -> "importance of (variance explained by) each component"

  • Line 354/A prostate cancer dataset: repetition of "also called". Could reword as "A plot of the amount of variance accounted for by each PC is called a scree plot. Note that the amount of variance accounted for by a principal component is given by its "eigenvalue". Thus, the y-axis in scree plots is often labelled “eigenvalue”."

  • Line 376/How many principal components do we need?: "scree plot" -> "screeplot".

  • Line 529/A gene expression dataset of cancer patients: "high dimensional data" -> "high-dimensional data".

  • Line 751: "prooces" -> "produces"

  • Line 768/Principal component regression: Repetition of "This is called PC regression"

  • Captions/alt text to be filled.

Review comments: Regularised regression

Regularised regression

https://carpentries-incubator.github.io/high-dimensional-stats-r/03-regression-regularisation/index.html

Overall, I really like the approach in this episode, which emphasises genuine understanding of the methods. It is very long, though. There is quite a lot of necessary jargon, and it would help if the terms and the links between them were explained more fully. Overall, the steps and ideas could be broken down so that the rationale for each bit of code, output, or exploration is clearer. Some of the outputs and figures are left to speak for themselves. I might be wrong, but I think the average biologist with a standard stats/maths background would be overwhelmed. You could consider breaking this episode into separate episodes covering 1. what regularisation is and the rationale for it, including the concepts of scaling and cross-validation, 2. ridge and lasso regression, and 3. blending ridge regression and the LASSO. I think tidymodels is a whole other course - or alternatively just use that approach throughout.

I would expand the intro to give an overview of what is going to happen in this lesson.

IIRC this episode is the first time the term "feature selection" is used - perhaps a sentence or two could be added to previous episodes to explain that that is the collective term for the processes covered?

In the The objective of a linear model section:

I think "minimises sum of the squared residuals." is clearer than "minimises the residual sum of squares."

"Mathematically, we can write that as" -> "Mathematically, we can write the sum of the squared residuals as"

Say what Horvath is.

I think this is the first time the data have been centred and scaled, so it is not clear why that is being done.

In Exercise 2 ("Fit a model on the training data matrix and training age vector"), it might not appear "similar to what we did in the previous episode" because lmFit() was used then.

I would explicitly explain what the pred_lm vs test_age figure shows.

In the "Why would we want to restrict our model?" section - this is the first use of "penalised", and it is not made clear that penalised regression is what was discussed in the section above.

"2.Plot the predictions for each method against the ground truth. " -> "2. Plot predicted ages for each method against the observed ages"

Consider losing the Bayesian section, and possibly adding a sentence at the start of the episode that explains there are Bayesian and frequentist approaches and that the one considered here is the latter.

If cross-validation is not to be covered in detail here, consider leaving out the "Centering and scaling in cross-validation" detail.

clustering with correlation example

# Two uncorrelated samples, plus a third that is sample_a shifted by +5:
# perfectly correlated with sample_a, but far from it in Euclidean distance.
set.seed(1)
cor_example <- data.frame(sample_a = rnorm(10),
                          sample_b = rnorm(10))
cor_example$sample_c <- cor_example$sample_a + 5
rownames(cor_example) <- paste(
  "Feature", 1:nrow(cor_example)
)
head(cor_example)

library("pheatmap")
pheatmap(cor_example)

# Line plot of the three samples across features.
plot(1:nrow(cor_example),
     rep(range(cor_example), 5),
     type = "n", xlab = "Feature", ylab = "Value")
lines(cor_example$sample_a, col = "firebrick")
lines(cor_example$sample_b, col = "dodgerblue")
lines(cor_example$sample_c, col = "forestgreen")

# Clustering on Euclidean distance groups sample_a with sample_b...
clust_dist <- hclust(dist(t(cor_example)))
plot(clust_dist)

# ...whereas clustering on correlation distance groups sample_a with sample_c.
clust_cor <- hclust(as.dist(1 - cor(cor_example)))
plot(clust_cor)

Addition to Extra Resources

I think adding a few non-book/text based resources within the "Extra resources" field could also be useful.

For instance, the YouTube channel StatQuest by Josh Starmer has some wonderful fundamental videos on high-dimensional stats and, more recently, ML as well (https://www.youtube.com/c/joshstarmer). This has personally been really helpful for me in nailing the fundamentals.

These are my two cents,
Harithaa

feedback

Day 1 Feedback:
positives:

  • I liked that it was a very practical example
  • Examples are very clear and easy to understand as a biologist
  • Easy to follow
  • Good stepwise explanations when live coding
  • I liked the pace and depth of explanations of each topic

suggestions for improvement:

  • Some of the equations are written out in code but not visible as equations in the html page (displays fine for me on firefox but not on chrome-based browsers)
  • At times pace could be a bit faster
  • Perhaps dwelling on simpler and widely understood concepts (e.g. Bonferroni correction) unnecessary
  • Feel like we could go through the material a bit quicker and just summarise some coding answers instead of live coding them again.
  • feel like we could go through the exercises a bit quicker

Day 2:
Positives:

  • informative figures in the slides and I liked how Alan went through the plots step by step

Suggestions for improvement:

  • It was quite theoretical and I missed some examples of practical application

Day 3:
Positives:

  • The step-by-step walk-through of what the R objects looked like was helpful in thinking about how to plot / analyse the data!

Suggestions for improvement:

  • it would be great to have the code available, maybe with some comments explaining what it does. I can follow today, but if I look at it in a month I think I won't remember what I've done, and I am trying to cope with the speed of coding and catching up as well. Overall it's a really useful course!
  • The theoretical explanation in the slides went a little too quick for me!

Day 4:

Positives:

  • It was good to evaluate the hierarchical clustering with lots of plots. The visualisation makes it easier to conceptualise.

Suggestions for improvement:

  • It could be better if we had more examples of each method; in different cases, the use of the method may differ.
  • I would've liked an overview of the general steps for each method, just to recap what the specific characteristics of the methods are and when they're useful.

Ready for Pilot Workshops?

Hi @alanocallaghan and Gail Robertson!

I'm working on organizing a series of mini-workshops for next semester at my institution. We are hoping to pilot some of the incubator lessons. We have an instructor interested in teaching this lesson. Do you think this lesson would be in a good place to pilot later this year? Is there any additional info you could share on it? We could also provide feedback from learners/instructors/helpers.

Happy to meet and discuss more. If you'd like to set up a meeting, let me know.

Best,
Sarah

cc: @SteveGoldstein

Remove or explain more of the linkage methods

The methods that produce weird dendrograms are probably useful in some scenarios, so just showing them failing here maybe undersells them. Probably best to recommend settings, imo.

Third delivery suggested changes

A list of proposed changes following the May delivery of HDS

These are in addition to the changes in the pull request ailith_delivery3 and to the changes that Hannes made that have yet to be pushed to the main course materials.

Throughout

  • bold package names and include () for functions

Intro

  • Change high-dimensional data definition
  • Switch out prostate dataset or make it much clearer that it's a toy dataset for the purposes of explanation
  • Change view to head and dim
  • Expand challenge 1 solution
  • More specific question than examine the dataset in challenge 2 (from Emma's review in #39)
  • Check how we're referring to figures e.g not by number if there's no number
  • Could add a challenge question to show what happens with correlated variables (see Emma's review in #39)
  • Take out the Bioconductor intro as we never teach it (maybe condense it and put it in a callout box?)
  • Add brackets for function names in text, e.g. pairs() (from Emma's review in #39)
  • Explain why you are using here? (from Emma's review in #39)
  • STRUCTURAL: focus the Challenges section on two things: (a) an ill-defined model (more predictors than observations), perhaps adding a figure with only one dot, and (b) correlated predictors, perhaps adding code to show unstable coefficient estimates.
  • STRUCTURAL: rewrite the section on which statistical methods are used so that it gives an overview of the course. Focus on problems and on which analysis is used when: exploring one outcome with many similar features (methylation/expression); predicting outcomes with more features than observations; reducing dimensionality/grouping/making sense of similar predictors; clustering observations.

Regression with many features (many outcomes)

  • rank results in toptable by effect size
  • include small intro to feature selection to motivate why these techniques are useful as we took the feature selection lesson out of the 2-day course.
  • check exercises aren't introducing new concepts
  • check direction of smoker is consistent between model and plot
  • Add brackets for function names in text, e.g. pairs() (from Emma's review in #39)
  • Explore whether the episode can be made shorter or divided (from Emma's review in #47)
  • Add a reference for the source of the methylation data
  • Change title to regression with many outcomes and add a brief comment to distinguish between dealing with many outcomes and/or many features (we can mention that the regularisation episode will address that). Potentially, we can create a separate episode Regression in high-dimensional settings where we introduce the methylation data and the two different types of problems. However, this is outside the scope for this round of changes. Creating this separate episode would also address some of Emma's concerns.
  • Add mention of dream() from VariancePartition which is similar to limma but can handle grouping (random effects)

Regularisation

  • needs split up
    • motivation & rationale - in expanded intro
    • intro to model selection/cross validation
    • what is regularisation in general?
    • ridge and lasso
  • more explanation of Horvath
  • greater figure explanation in the materials
  • fix overuse of Xi
  • more detail on extracting coefficients and model interpretation
  • glossary of jargon
  • add link to ML course for related materials (from #7)

CAV (20220206) Link added to episode 1 instead as it's general across different types of ML approaches.

  • review plot labels (from #7)

CAV (20220206) I can't recall what the specific issue was, but the episode has been extensively revised and labels look ok.

  • review phrasing in "why would we...?" - Alan marked it as convoluted (from #7)

CAV (20220206) Paragraph was revised, so hopefully OK now.

  • review ridge/EN equations (partially from #7)

CAV (20220206) Notation review.

  • in exercise 2, maybe ask why mean squared rather than sum of squared (from #7)
  • Add brackets for function names in text, e.g. pairs() (from Emma's review in #39)
  • move up the section "Using regularisation to improve generalisability"
  • add reason for training and test intro, like: "Before we move on to regularised regression, we have to introduce..."
  • when talking about the elastic net, say we've used it all along - ridge and lasso are the special cases alpha = 0 and alpha = 1 (see the sketch below)
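A minimal sketch of that point, assuming the lesson's methyl_mat matrix and age vector (any numeric predictor matrix and response would work):

library("glmnet")

# alpha mixes the two penalties: alpha = 0 is pure ridge (L2),
# alpha = 1 is pure lasso (L1), and anything in between is an elastic net.
fit_ridge   <- glmnet(methyl_mat, age, alpha = 0)
fit_lasso   <- glmnet(methyl_mat, age, alpha = 1)
fit_elastic <- glmnet(methyl_mat, age, alpha = 0.5)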

PCA

  • consider removing scaling from gene expression pca (include box about gene expression normalisation to emphasise that that's not what we're talking about)
  • Add brackets for function names in text, e.g. pairs() (from Emma's review in #39)
  • Is the equation halfway down needed at all (the one which refers to the original example)?
  • add a note that PCAtools takes data in the Bioconductor orientation (features as rows, samples as columns)
  • STRUCTURAL add table comparing terms for loadings and scores used in different packages

FA

  • move advantages and disadvantages of FA up so it's in the introduction
  • more detail on communality and uniqueness
  • mention confirmatory factor analysis
  • discuss ways of determining number of factors
  • Add brackets for function names in text, e.g. pairs() (from Emma's review in #39)

K means

  • fix border=NA in sil plot
  • check the coloured blocks on bootstrapping the clusters (check set.seed)
  • include this for silhouette scores: https://medium.com/@cmukesh8688/silhouette-analysis-in-k-means-clustering-cefa9a7ad111 (a silhouette sketch is included after this list)
  • exercise 1 bugged (from #7; unsure if still needed)
  • initial mcq before callout? (from #7; unsure if still needed)
  • title for 1st practical bit (from #7; unsure if still needed)
  • formal description of silhouette width (from #7; unsure if still needed)
  • k of 5 or k=5, not both (from #7; unsure if still needed)
  • title for introducing bootstrap (from #7; unsure if still needed)
  • title for applying bootstrap (from #7; unsure if still needed)
  • more detail on bootstrap (from #7; unsure if still needed)
  • Add brackets for function names in text, e.g. pairs() (from Emma's review in #39)
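For the silhouette items above (including the border = NA fix), a minimal sketch using the cluster package; the matrix x is a hypothetical stand-in for the lesson's data:

library("cluster")

set.seed(42)
x <- matrix(rnorm(200), ncol = 2)  # hypothetical stand-in data
km <- kmeans(x, centers = 5)

# Silhouette widths from the cluster assignments and pairwise distances.
sil <- silhouette(km$cluster, dist(x))
summary(sil)

# border = NA avoids the solid black bars that can hide the silhouette plot.
plot(sil, border = NA)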

Hierarchical clusters

Other

  • Consider temporarily removing optional episodes until reviewed/edited.
  • Edit setup.md to indicate approx time based on RStudio cloud (~30 mins) (from #34)
  • Check whether the list in dependencies.csv can be reduced (see #34)
  • Test setup.md in different environments. (see #34)
  • Create a Docker image with setup.md?

Review comments: Regression with many features

Regression with many features
https://carpentries-incubator.github.io/high-dimensional-stats-r/02-high-dimensional-regression/index.html

This episode does a really good job of demonstrating why high-dimensional data need a different approach than just manually repeating thousands of linear models. I especially like that a methylation ~ age model was fitted for one feature. This episode seems long. You could consider breaking it into 2 or 3 shorter sessions with more challenges testing the concepts presented.

Maybe show people how to see the sample-level data: colData(methylation)
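A minimal sketch of that suggestion, assuming methylation is a SummarizedExperiment-style object as in the lesson:

library("SummarizedExperiment")  # provides colData() and assay()

# Sample-level covariates (age, sex, smoker, ...) live in colData;
# the methylation values themselves are in the assay matrix.
colData(methylation)
assay(methylation)[1:5, 1:5]  # features as rows, samples as columns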

You could lose "even with this reduced number of features - the original dataset contained over 800,000!" since this isn't discussed earlier

The first challenge makes good points but introduces new concepts rather than tests presented content. That might be fine depending how closely you want to flow the carpentries model.

Typos in the Solution of the first exercise:
"2.......... sample sizes for soem ethnicities" -> "2....... sample sizes for some ethnicities"
"3. If we perform 14 \times 5000..." Latex issue. Maybe: "5000 tests for each of fourteen variables" so it is more clear where 14 came from

Maybe say: if we use a p-value threshold of 0.05, we would expect 0.05 × 14 × 5000 = 3500 significant results just by chance. (Later I noticed this forms an exercise.)

"this is to see if we can “predict” the expected methylation value for sample j at a given locus i, which we can write as X_{ij}, using age"
Would be clearer as
"this is to see if we can use age to “predict” the expected methylation value for sample j at a given locus i, which we can write as X_{ij}."

fit_smoke: I think the model specified is smoker = 1, not smoker = 0, but the figure is for smoker = 0, not smoker = 1; the effect direction is reversed.

The multiple testing section is really good.

Structural changes suggested after recent delivery

Structural changes

Episode 1 - intro

Challenges section

  • split into an ill-defined regression model (no coefficients in default regression) and
    • figure with only one dot
  • correlated variables
    • figure/code?

Statistical methods section

  • rewrite and mention what to use when
  • exploring one outcome with many similar features (methylation/expression)
  • predicting outcomes with more features than observations
  • reducing dimensionality/grouping/making sense of similar predictors
  • clustering observations

Episode 2 - many regressions

  • add mention of dream() from VariancePartition

Episode 3 - regularised regression

  • move up the section "Using regularisation to improve generalisability"
  • add a reason for the training and test intro, like: "Before we move on to regularised regression, we have to introduce..."
  • when talking about the elastic net, say we've used it all along - ridge and lasso are the special cases alpha = 0 and alpha = 1

Episode 4 - PCA

Is the equation halfway down needed at all (the one which refers to the original example)?

  • table comparing terms in loadings sections
  • add a note that PCAtools takes data in the Bioconductor orientation (features as rows, samples as columns)

Episode 7 - hierarchical clustering

Issues spotted during second delivery

Regression with many features:

coef_df not defined/created in episode code. (vector of p-values)
Requested code for heatmap (maybe we could hide it in a box, like the solutions?)

Regularisation:

Sum of squared residuals isn't squared (Fixed)
glmnet scales and centres internally, so there is no need to scale/centre separately

Leftover issues in Intro episode

After compiling the website, I could find some unresolved issues:

  • In Challenge 1, the answer for 4. is missing.
  • In the solution to Challenge 2, the requested changes failed to propagate
  • Same for Challenge 3.

@hannesbecher, can you go through and see if I missed anything?

Note: I think this happened whenever my review was indicated as comments rather than direct suggested edits.

Review comments: Episode 2 - regression with many outcomes

Episode 2

Again, I really like this episode and believe it's really valuable to explore many outcomes as well as many predictors! I have a few comments here again. This time the comments are largely "queries" about the content.

I will again submit pull requests where possible :)

  • Line 61/DNA methylation data: Query - I'm not sure what this object is - is it derived from an object, which in turn is derived from a class? And what does this mean? This possibly makes more sense to someone familiar with Python rather than R. I'm really not sure without understanding, but perhaps there's a way to simplify the language/remove anything not relevant to the subsequent analysis and reference the vignette as you have done?

  • Line 75/DNA methylation data: Initially reading this, I thought we had many more observations than features. I think putting the statement that "samples or observations are stored as columns, while features (in this case, sites in the genome) are stored as rows" first would help clarify. The first sentence, stating the number of features etc., could come after this. Also, stating the number of rows/columns and then translating these to features and "observations/samples" in this sentence could help clarify.

  • Line 110/DNA methylation data: Query - is this definitely the Prostate data or is this a typo (should it be the methylation data)?

  • Line 112/Figure caption: Query - it's unclear to me in what way there are too many models to fit by hand. Do you mean by hand, or in R using a model formula? The examples below are just two models, right? And we could fit these in R pretty easily (they just might not be very good)? Is this clarified by "it's clear that there's too many features to fit each possible model (combination of features) separately"?

But if this latter statement is correct, how does this link to points (1) and (2) that follow?

  • Line 129/DNA methylation data: Perhaps something like "In general, it is scientifically interesting to answer two main questions using the three types of data:"

  • Line 145/Challenge 1: Query - possibly quote the name of the column rather than "columns in colData" to clarify what relationship we're hypothesising? How does this relate to the age-related problems above? Have I misunderstood?

  • Line 206/Regression with many outcomes: Query - the first two paragraphs of this section make perfect sense to me, but it's unclear to me how the plots relate to the third paragraph and regression with many outcomes overall and then how looking at linear regression helps.

Minor comments:

  • Line 11/Objectives: "high dimensional regression" -> "high-dimensional regression"

  • Line 64/DNA methylation data: ", and optional sample-level colData and feature-level metadata" should this be ", optional sample-level colData, and feature-level metadata"?

  • Line 140/DNA methylation data: Possibly clearer to swap the order in the list above and say we focus on the first problem here and second problem next?

  • Line 132/DNA methylation data: "leves" -> "levels"

  • Line 157/Challenge 1: "signif" -> "significant"

  • Line 224 & 228/Regression with many outcomes: "primary" -> "first", "secondary" -> "second"?

  • Line 229/Regression with many outcomes: "in very high-dimensional..." -> "when implemented on very high-dimensional..."

  • Alt text and captions for figures.

This could well be me misunderstanding things, please feel free to correct me!

Review comments: Episode 3 - regularised regression

Episode 3

Another nice episode. Although it's quite long, I think it covers what are often quite challenging ideas in a very approachable way. Most of the comments I have relate to how regularisation is motivated and some minor re-wording to clarify. I do, however, have a more challenging query about the placement of the linear regression background information. I have highlighted this in bold below!

Again, I will submit pull requests where possible and very happy to discuss anything.

  • Line 22/Key points: I can't see where this last key point re computational speed is discussed. Potentially omit or add a brief comment about this in the text?

  • Line 46/Introduction: The end of this sentence could clarify the differences between the approach of this episode and the previous one. Something like: "that instead [what regularisation does and how it differs from information sharing]".

  • Line 79/Introduction: I like the example to show what happens when you fit a model using all features. Could add a sentence like "We explain what singularities are and why they appear when fitting high-dimensional data below" to just clarify that this is addressed!

Also, a brief explanation of why some effect sizes are very high would help, as this doesn't seem to be addressed.

  • Line 84/Singularities: I think it needs to be clearer why high-dimensional data result in singularities (i.e., explicitly saying that the determinant of the matrix is zero when there are more features than observations, so singularities are always present when fitting models naively to high-dimensional data, and R often cannot fit the model). (See the sketch after this list of comments.)

  • Line 141/Correlated features -- common in high-dimensional data: Perhaps this summary of the challenges could be placed outside of the pinned box. I think it deserves more attention as it succinctly summarises both of the above points and motivates the need for regularisation.

Also, the fact that p>n is problematic is discussed in a lot of detail (resulting in singularities), and so I think should be added to this point to justify the use of regularisation (or else the discussion on high-dimensional data being problematic by its very size could be removed and just discussion of correlations retained)!

Something like: "Regularisation can help us to deal with correlated features." -> "Regularisation can help us to deal with correlated features, as well as effectively reduce the number of features (dimension) in our model, and thus addresses these issues".

  • Line 152/Challenge 1: "Discuss in groups:" could be "Consider or discuss in groups:" to make suitable for an individual learner.

  • Lines 175 & Line 311: Query - it feels, from the section above, as though you are about to describe regularisation as this is now fully motivated. These regression sections therefore confused me a bit. There's a similar interlude in the previous episode. Could it be possible to cover this as preliminary material at the start of the episode, reference another tutorial on linear regression + training and testing models and reference this throughout all episodes, or use this as a "bridging" episode at the start of the formal set of episodes and back-reference?

Could then provide the information re restricting the model after the example (where we try to fit a linear model on the whole data set) as further motivation and explain how this relates to generalisability.

  • Line 515/Using regularisation to improve generalisability: A brief few words about the trade-off here with the extent of regularisation would help to clarify.

  • Line 516/Why would we want to restrict our model?: "This type of regularisation is called ridge regression" makes it sound as though ridge regression=OLS. Slight re-wording to tie ridge regression to the non-zero penalty may help.

  • Line 518: Somewhere in this section, could possibly add a title to improve signposting of the ridge method (since there's one for lasso).

  • Line 520: These problems already feel explained and thus I would propose a minor re-wording to make this a recap and demonstration of the fact that regularisation actually works.

  • Line 742/Cross-validation to find the best value of $\lambda$: We haven't defined what a 'good' value for lambda looks like (needed to define what the 'best' looks like).

  • Line 749: Cross-validation to find the best value of $\lambda$: "Cross-validation is a really deep topic that we're not going to cover in more detail today, though!" possibly a slight re-wording as the subsequent analysis seems to cover cross-validation in a bit more detail.

  • Line 795/Blending ridge regression and the LASSO - elastic nets: Would possibly re-order this title to put elastic nets first if also using lasso and ridge titles (just for signposting).

  • Line 1019/Other types of outcomes: It is a little unclear to me that the intercept-only model is selected. Could annotate the code to demonstrate how this happened.
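For the singularities point above, a minimal sketch that shows the problem directly (simulated data; the dimensions are hypothetical):

set.seed(42)
n <- 10  # observations
p <- 50  # features; p > n makes the normal-equations matrix singular
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)

# lm() cannot estimate more coefficients than there are observations;
# the extra (aliased) coefficients are returned as NA.
fit <- lm(y ~ x)
sum(is.na(coef(fit)))  # 41 of the 51 coefficients are NA

# The matrix t(X) %*% X is rank-deficient, so its determinant is
# (effectively) zero and it cannot be inverted.
det(crossprod(cbind(1, x)))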

Minor comments

  • Line 87/Singularities: remove comma before "and why are they..."

  • Line 441/Using regularisation to improve generalisability: "This can be done with regularisation. The idea to add another condition to the problem we’re solving with linear regression." -> "This can be done with regularisation: adding another condition to the problem we’re solving with linear regression."

  • Line 512/Using regularisation to improve generalisability: "punished" -> "punishes"

  • Line 548/Why would we want to restrict our model?: "trend" -> "tend".

  • Line 756/Cross-validation to find the best value of $\lambda$: "We can use this new idea to choose a lambda value, by" -> "We can use this new idea to choose a lambda value by"

  • Line 798/Blending ridge regression and the LASSO - elastic nets: "So far, we've used ridge regression, where alpha = 0, and LASSO regression, where alpha = 1." -> "So far, we've used ridge regression (where alpha = 0) and LASSO regression (where alpha = 1)."

"improve this page" link does not work

Mistake in hclust exercise

Examine how changing the h or k arguments in the hclust function affects the value of the Dunn index

This should be "the cutree function"
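For reference, a minimal sketch of the corrected exercise (mtcars is just a built-in stand-in dataset):

# hclust() builds the dendrogram; the k and h arguments belong to cutree().
dist_m <- dist(scale(mtcars))
clust <- hclust(dist_m)

cutree(clust, k = 3)  # cut the tree into exactly 3 clusters
cutree(clust, h = 5)  # or cut at a given height instead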

Review comments: Episode 6 - K-means

Episode 6

Again, a really nice episode. I particularly like how this episode builds gradually from an initial example. The narrative is therefore very clear. I have a couple of comments that largely relate to wording, but overall I found this episode really
informative and easy to understand.

I will submit pull requests :)

  • Line 35: I think a short sentence here linking back to previous episodes and the data considered may clarify the motivation for this episode from the outset, particularly how clustering is different to the methods already introduced/when it should be applied.

The subsequent paragraph also appears to jump between motivation, methodological description and application areas and is a little confusing. I would perhaps re-order this for flow while addressing the above:

"As we saw in previous episodes, visualising high-dimensional data with a large amount of features is difficult and can limit our understanding of the data and associated processes. In some cases, a known grouping causes this heterogeneity (sex, treatment groups, etc). In other cases, heterogeneity may arise from the presence of unknown subgroups in the data. [Something linking to PCA/FA, "PCA/FA..."]

Clustering is a set of techniques that [how different to PCA] and allows us to discover unknown groupings. Cluster analysis involves finding groups of observations that are more similar to each other (according to some feature) than they are to observations in other groups and are thus likely to represent the same source of heterogeneity. Once groups (or clusters) of observations have been identified using cluster analysis, further analyses or interpretation can be carried out on the groups, for example, using metadata to further explore groups.

Cluster analysis is commonly used to discover unknown groupings in fields such as bioinformatics, genomics, and image processing, in which large datasets that include many features are often produced."

(Note that I'm unsure if the description has lost accuracy!)

  • Line 58: Would propose a minor rewording to make it clear that the iterative updating of clusters follows

"Clusters can be updated in an iterative process so that over time we can become more confident in size and shape of clusters." -> "Using this process, we can also iteratively update clusters so that we become more confident about
the shape and size of the clusters"

  • Line 62: I would propose presenting this section on believing in clusters after the methodology ("What is K-means clustering") has been introduced as it's hard to follow without really understanding what clustering is doing.

  • Line 109: "K-means clustering is a clustering method which groups data points into a user-defined number of distinct non-overlapping clusters." defines clustering using the word clustering, which is unclear.
    Maybe simply: "K-means clustering groups data points into a user-defined number of distinct non-overlapping clusters."

This paragraph could also link the way we group (minimising within-clustering variation) to how this creates clusters for clarity. A small change (with the above changes too) like:

"K-means clustering groups data points into a user-defined number of distinct non-overlapping clusters. To create clusters of 'similar' data points, K-means clustering forms clusters by minimising the within-cluster variation."

  • Line 114: I would possibly omit this sentence - I'm not really sure what it means: what is a specified clustering algorithm, and how does it increase our confidence that our data can be partitioned into groups?

  • Line 119: Given the considerations re defining the initial point described below, picking co-ordinates randomly here may be misleading for someone just referring to this section. I would suggest just referencing that this is discussed later.

  • Line 130: For signposting and consistency with the way additional considerations are presented later, I would
    present this as a new section "Initialisation".

  • Line 157: Reference data set name (scRNAseq) here?

  • Line 163: I think it needs to be clear here (and from the start as above) how clustering and PCA are different to
    clarify why we would apply PCA first. Technically both can be used for dimension reduction?

  • Line 203: "K" not explicitly defined yet. Instead of "Cluster the data using a $K$ of 5,.." could say "Cluster the data using $K=5$ clusters,"

  • Line 247: I think the intuitive definition of silhouette width needs to be given here rather than just its properties.

  • Line 314: "Is it better or worse than before? Can you identify where the differences lie?" a little unclear as the method hasn't changed and I'm not sure what "where the differences lie" refers to. Should this be "Do 5 clusters appear appropriate? Why/why not?"

  • Line 369: I think, given this whole section is about bootstrapping, the callout of a small section is confusing. This section could simply be called "Cluster robustness - bootstrapping" and the callout combined into the main text as an example.

  • Line 377: unclear how the bootstrap helps us to address the sensitivity of clusters to the data. Could add something like this (see the sketch below):

"We can bootstrap: sample the data with replacement to produce a 'new' data set. We can then calculate new clusters for this data set and compare these to the clusters on the original data set, thus helping us to see how the clusters may change for small changes in the data."

  • Line 467: "Are the results better or worse" is confusing - the results/clusters won't be better or worse, we've just investigated them more.

Maybe "Do the results appear better or worse"?

instead of...

"To assess this, we can use the bootstrap. What we do here is to take a sample from the data with replacement"

Minor comments

  • Line 110: "In K-means clustering ..." -> "In K-means clustering, ..."

Also, this and the sentence beginning Line 112 start the same way. Could probably just remove "In K-means clustering, ..."
from the second.

  • Line 120: "until convergence..." -> "until appropriate clusters have been formed:"

  • Line 154: "Single-cell RNA sequence data" throughout (half abbreviated currently)

  • Line 372: "That is, if the data we observed were slightly different, the clusters we would identify in this different data would be very similar." -> "That is, we want to ensure that the clusters identified do not change substantially if the observed data change slightly."

  • Line 501: "This method can use k-means, or other clustering methods." -> "This method can use k-means or other clustering methods."

  • Mixture of lower-case and upper-case 'K's referring to the number of clusters throughout.

  • Alt text and captions

Self-review notes

Some notes for revisions after largely finishing content.

In general:

  • revise timings

  • revise/clarify questions/objectives/keypoints

  • probably don't title sections boringly like "introduction"

  • more "further reading" links

  • 02:

    • description of linear regression could use more detail
    • screening should use permuted age
    • plot labels/legends are at times iffy
  • 03:
    - [ ] plot labels (moved to #64)
    • callout box on collinearity
    • brackets on yhat - y -> fixed by 7ff074f
    - [ ] 2nd exercise a bit on-the-nose (agreed to skip; see comments below)
    - [ ] maybe ask why mean squared rather than sum of squared (moved to #64)
    - [ ] brackets on l2 norm/ridge equation a bit confusing (moved to #64)
    - [ ] phrasing convoluted in "why would we...?" (moved to #64)
    - [ ] callout box on flipping signs (I don't think this is needed; trying not to add too much extra info)
    - [ ] bayesian callout box needs finishing (Ailith and I decided to leave Bayesian refs out as there is no time to cover background knowledge)
    - [ ] lasso bayesian callout box (Ailith and I decided to leave Bayesian refs out as there is no time to cover background knowledge)
    • finish bias-variance callout, make it distinct from elastic net; should come before final exercise
    • explain multinomial warnings in text

- 04: moved to #65
- [ ] machine learning course isn't carpentries and should be linked (or an alternative)
- [ ] forward stepwise description a bit sparse
- [ ] forward/reverse could do with more detail

- 09: Moved to #64
- [ ] exercise 1 bugged
- [ ] initial mcq before callout?
- [ ] title for 1st practical bit
- [ ] formal description of silhouette width
- [ ] k of 5 or k=5, not both
- [ ] title for introducing bootstrap
- [ ] title for applying bootstrap
- [ ] more detail on bootstrap

- 10: moved to #65
- [ ] explain densfun warnings (good Q)
- [ ] error in dnorm, exercise 1
- [ ] warnings in histograms (suppress?)
- [x] #17
- [ ] more detail on fitting a bivariate normal

Review comments: Introduction to high-dimensional data

https://carpentries-incubator.github.io/high-dimensional-stats-r/01-introduction-to-high-dimensional-data/index.html

I really like this introduction - it's a great idea to explicitly define high-dimensional data and give examples. There's good reiteration of the important points in the text.

I like Challenge 1. It would be great if the reasons for the correct and incorrect answers were given in the solution.

Challenge 2 is good. Perhaps "Examine the dataset" could be replaced by explicit questions that provoke examination, e.g., How many patients were recorded? What variables were measured? What are n and p?

Possibly just a style difference, but I like to add brackets for function names in text (e.g., variables using the pairs() function) and make package names bold (e.g., from the lasso2 package).

"ratio of observations to features in a dataset is almost equal" might be clearer as "the number of observations and features in a dataset are almost equal".

As the figure isn't numbered, perhaps use "the figure on the left below" etc instead of 2a etc. Or number the fig.

I'm not sure what in Challenge 3 prompted:
## correlation matrix for variables describing cancer/clinical variables
cor(Prostate[, c(1, 2, 4, 6, 9)])

Or what prompted the examination of the residuals.

Maybe be some specific questions could be added to the Challenge to help illustrate what happens in linear models when we include correlated explanatory variables.

I really like the rationale given for using different methods for hi-di data

"Let’s install the minfi package" -> " Let’s load the minfi package"

I'm getting 404s for the minfi User's Guide.

Explain why you are using here?

Review comments: Episode 1 - introduction to high-dimensional data

Episode 1

Overall, I really like this as an introduction to the course. I think it strikes a good balance between motivating the lesson clearly while avoiding overwhelming a learner with information. I have listed some comments below. In the most part, I think these can be addressed by re-ordering sentences and adding additional signposting. I've also proposed some minor changes at the very bottom.

I will submit pull requests for these changes but feel free to reject changes where appropriate, of course!

  • Line 53/What are high-dimensional data?: Sentence "Such data sets pose a challenge for data analysis as standard methods of analysis, such as linear regression, are no longer appropriate." Possibly a brief few words to foreshadow that this is discussed later or why this is an issue would help to motivate the lesson more clearly.

  • Lines 41-66/What are high-dimensional data?: I think these paragraphs are very informative
    but perhaps a little cyclical in places since each paragraph starts with a more specific use
    case of the above and ends with the challenges. It may be clearer (not definitely!) if the applications are first described (getting more descriptive as is) and then the challenges are described to keep the flow of ideas. The challenges could even come after the initial plot of the data.

  • Line 107/Challenges in dealing with high-dimensional data: I like this as a heading, but given we first see "Challenge" with reference to problems for the learners to try directly above, the use of "Challenge" again in this title initially confused me a little. Maybe a synonym for challenge would help in this title.

  • Line 112/Challenges in dealing with high-dimensional data: Sentences in the paragraph from "This is because, ...". I think these sentences nicely explain why high-dimensional data are so prevalent, but as such I think it should be moved to the section above "What are high-dimensional data". It feels as though the first sentence is starting to explain why methods for high-dimensional data analysis are challenging (don't have as many tools historically) which feels relevant to the section, but the next parts go on to explain why this exists. If moving the explanation for the existence of high-dimensional data, the paragraph could also then pin down what the overall challenge is with high-dimensional data (even if this is a list of the things described later in the lesson like less developed or impossible visualisation) just to clearly motivate what's to come.

  • Line 140/Challenge 2: Given that this is the first bit of R code in the episode, a comment could be added to make sure people have completed the setup instructions. I think this is probably particularly useful for independent learners. We could also consider extending this to every first R challenge of each episode (I'm thinking that this would be consistent with the Carpentries' consideration for independent learners).

Second, is there any way we can use a high-dimensional data set here? It feels confusing to talk about high-dimensional data and the challenges and use this as an example of the challenges in high-dimensional data (as stated in the text above). I understand the point completely after looking at the problem and solution, but it's not immediately clear to me.

  • Line 144/Challenge 2: Possibly a new line for each part a, b and c of this challenge. Should part c also end with a question like "What problem with high-dimensional data analysis does this illustrate?" to get people thinking?

  • Line 188: I think the example used in this section is really clear. However, I'm a little confused if this section is related to the challenge we discussed above re visualising lots of variables being difficult, or if it's a distinct issue. Some text here either differentiating the issues or linking may be useful. If differentiating between the two, titles/subtitles/paragraph titles could also help signpost and make clear.

  • Line 210: This really feels like the biggest problem and is explained below, but maybe more link between the challenge and first text here would help. Something like "Let's explore why high correlations might be an issue in a Challenge".

  • Line 287/What statistical methods are used to analyse high-dimensional data?: I'm confused how the challenges described in this sentence relate to what we've just discussed. Further, some list the cause of the challenge (e.g. high correlation), while some list the effect (over-fitting). Maybe just summarising the causes first and then effect just to reinforce what was just discussed: difficult to visualise leading to challenges identifying suitable response variables, more features than observations leading to over-fitting, correlations between variables causing challenges including over-fitting (I don't think this latter point is actually explained above but could be useful to explain there. Also there may be more challenges with correlation between variables, hence the wording with that one). Having some sort of change re Line 188 could also help clarify.

Also - "can be difficult due to..." rather than "is difficult due to: " may be more accurate since there are other issues we've not considered yet. Also, not sure about this, but is the use of the colon correct in the latter phrase?

Some minor comments:

  • Line 53/What are high-dimensional data?: LaTeX formatting of the inequality of "$p$>=$n$" (-> "$p>=n$")

  • Line 57/What are high-dimensional data?: "Subjects like genomics and medical sciences often use both tall (in terms of $n$) and wide (in terms of $p$) datasets that can be difficult to analyse or visualise using standard statistical tools." Is it more precise to say "large $n$" and "large $p$" here since n and p can't themselves be tall and wide respectively.

  • Line 77/Challenge 1: "Which of these are considered to have high-dimensional data?" Should this be "Which of these scenarios use high-dimensional data?"

  • Line 289/ What statistical methods are used to analyse high-dimensional data?: "In this course we will cover four topics:" suggest a slight re-wording to make it clear that these are methods for dealing with high-dimensional data "In this course, we will cover four methods that help in dealing with high-dimensional data:"

  • Line 322/What statistical methods are used to analyse high-dimensional data?: hyphenation of "high dimensional datasets" (-> "high-dimensional datasets")

  • Line 348 & 367/Using Bioconductor to access high-dimensional data in the biosciences: minfi is loaded twice

  • Alt text and captions for figures.

Happy to discuss :)

Feedback from September 2022 delivery

DRAFT TO BE UPDATED AFTER DAY 4 - saved here to get started, currently updated to day 3.

EdCarp delivery 2022-09-27 to 2022-09-30, with instructors @hannesbecher, @luciewoellenstein44, @ewallace.
https://edcarp.github.io/2022-09-27_ed-dash_high-dim-stats/

Collaborative document:
https://pad.carpentries.org/2022-09-27_ed-dash_high-dim-stats

Overall went very well, good material, happy and engaged students.

Day 1 - Introduction, Regression with many features

Learner feedback

Please list 1 thing that you liked or found particularly useful

  • Well, all this is exactly what I need right now for my work. So, it was all very useful. (Very useful help on the model.matrix - thank you!) (Pete)
  • It's nice to go through every function/word in R and know what they mean all the time.
  • Very helpful, especially in explaining what each part of the function actually means
  • Great learning experience
  • Very useful and insightful first day!
  • I really appreciate getting a chance to go through the code step by step. It's useful to be able to hear what it is exactly, and how it works.

Please list another thing that you found less useful, or that could be improved

  • While this is out of your control, moving between windows and internet tabs on a small screen takes a little time so, from time to time I missed something. (Pete) +1
  • Sometimes it is hard to read the material in time for the group sessions. +1
  • maybe more breaks so people could catch up +1
  • Perhaps a glossary/definitions of functions used could be useful in case you miss anything that has been spoken
  • I spent a bit of time trying to find the column header for the smoking exercise! Should have checked the question first, but I didn't and wasted loads of time trying to figure out it was $smoking.

Instructor feedback

Day 2 - Regularised regression

Learner feedback

Please list 1 thing that you liked or found particularly useful

  • the detailed explanation of regression models, from ridge to lasso and elastic net - it is just fantastic to know how those algorithms relate to each other; I've been using them for many years and never understood the links.
    The coding and visualisation of the results are really helpful.
  • The depth of the models and the background was great. +1
  • Very happy with the explanations of how the maths works. Also it's great to be finally able to make a predictive model, even if it was very simple.
  • Increased my understanding of regression, but it was a tough day! Lots to take in.

Please list another thing that you found less useful, or that could be improved

  • There were a few times I was confused by the R syntax being used, mainly because I am not used to it. Are there any supporting documents that could be displayed for some of the exercises to help us solve the tasks?
  • Although in contradiction to my "positive" comment, it was heavy going :) - although I did enjoy it. The material is there for us to go back over. +1
  • I think it's good to do some examples. I struggled to keep up at times and got a little lost. I think this is just my lack of familiarity. Maybe a little slower would be good.
  • I found it tough going, and there was a lot of detail. Felt a bit out of my depth at times, but I did learn a bit more.

Instructor feedback

Learners had several questions about extra arguments in calls to lm(), glmnet(), and so on. See the etherpad for day 2. Those should give clues to places to simplify:

  • Why as.data.frame? Comparing the simpler fit_horvath <- lm(train_age ~ train_mat) to the example
    fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))
  • What does the -1 do to the methyl_mat matrix in k-fold cross-validation? (as in lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1)) A sketch of both points follows below.
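
A minimal sketch of both points, using a small simulated matrix as a stand-in for the lesson's train_mat and train_age objects (all names and sizes are placeholders):

set.seed(42)
train_mat <- matrix(rnorm(100 * 5), ncol = 5,
                    dimnames = list(NULL, paste0("cpg", 1:5)))
train_age <- rnorm(100, mean = 50, sd = 10)

## lm() accepts a matrix on the right-hand side, but converting to a data
## frame and using "~ ." gives one coefficient per column, named after the
## columns, which is easier to inspect and to use with predict()
fit_matrix <- lm(train_age ~ train_mat)
fit_df     <- lm(train_age ~ ., data = as.data.frame(train_mat))

## a negative index drops rather than selects: [, -1] removes the first
## column (e.g. an intercept column) before passing the matrix to cv.glmnet()
dim(train_mat[, -1])  ## 100 4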

Day 3 - Principal component analyses, Factor analysis

Learner feedback

Please list 1 thing that you liked or found particularly useful

  • I thought this lesson was explained really well! I finally understand what these two models do. The run-through in R was in depth and really helpful.
  • Very practical lesson! Easy to follow.
  • For me this was great as I kind of do this sort of thing anyway. So, to actually be taught it filled in some gaps. The course material, as every day, is excellent. Very detailed. Course delivery excellent too.
  • perfect level for me today - I've used PCA in genetics to look for relatedness, so had a bit of understanding into how it works, but didn't know how to use it on non-genetic data. Really helpful, and I get it now!! I can see how to use it in my research.
  • likewise - only ever used PCA for pop genomics as a bit of a black box so great to develop my understanding. v interested in factor analysis
  • Fantastic, I can see the material here coming in very good use!
  • very interesting and detailed explanation of PCA and factor analysis, love it

Please list another thing that you found less useful, or that could be improved

  • difficult one: Maybe the time for coding could be expanded slightly?
  • I am curious about factor analysis and it would be great to discuss it more

Instructor feedback

PCA (Episode 4)

  • Really nice introductory explanations.
  • Episode 4 PCA, Challenge 1, example 2 is ambiguous as it could be interpreted as PCA-appropriate. Could that be clarified or discussed?

An online retailer has collected data on user interactions with its online app and has information on the number of times each user interacted with the app, what products they viewed per interaction, and the type and cost of these products. The retailer would like to use this information to predict whether or not a user will be interested in a new product.

  • Challenge 2: some of the students said it "seems like a trick question".
  • Loading is introduced approximately 3 times, but only explained later in the lesson. Could that be rationalised so it's introduced strongly once? Understanding the loadings helps you understand how PCs are calculated, and that could come before deciding how many PCs to keep.
  • The difference between the base-plot style used earlier and the ggplot2-based style used later is striking and perhaps distracting. For example, one biplot looks very different from another biplot. This could also make the code fragile for learners, as within the same lesson biplot is used for both PCAtools::biplot and stats::biplot (see the namespacing sketch after this list).
  • Are the labels in the biplot needed in the PCAtools/microarray example? They seem like unnecessary and distracting information here, given we are not going to explain GSMxxxxx or 211122_s_at. They are also hard to read - too small and/or overlapping - and give ggrepel error messages.
  • This lesson introduced to me the terms "screeplot" and "biplot", as I didn't have special names for them before. Maybe an extra sentence of explanation for each would be helpful.
  • "Remove the lower 20% of PCs with lower variance" was unclear to learners.
  • In some code snippets, comments placed after the code appear after the output instead of next to the code they refer to. Maybe it would be more helpful to move the comments immediately before the line of code they refer to?
  • plotloadings was unclear to instructors and to learners. We wondered how the included variables are chosen, and whether it is important to include them. Reading ?plotloadings, it says that the rangeRetain argument gives a "Cut-off value for retaining variables" in terms of the "top/bottom fraction of the loadings range". I (Edward) find that unintuitive. For example, there are still many points in 1/10000th of the loadings range: plotloadings(pc, labSize = 3, rangeRetain = 1e-5)
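
One way to reduce that fragility could be to namespace-qualify the calls; a minimal sketch (pc and pca are placeholders for a PCAtools object and a prcomp() result, and we assume PCAtools's lab argument behaves as documented):

## being explicit about the namespace makes clear which biplot() is meant
PCAtools::biplot(pc, lab = NULL)  ## lab = NULL suppresses the point labels
stats::biplot(pca)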

Factor analysis (Episode 5)

  • There's some confusion about the difference between PCA and FA. The current introduction says "we introduce another method", "Factor analysis is used to identify latent features in a dataset from among a set of original variables ... FA does this in a similar way to PCA", and "Unlike with PCA, researchers using FA have to specify the number of latent variables.". Overall this gives the impression of "similar but different", and explains well neither why you'd need to learn both nor the ideas underlying the difference.
  • Some online materials give clearer PCA vs FA explanations, e.g. https://towardsdatascience.com/what-is-the-difference-between-pca-and-factor-analysis-5362ef6fa6f9 and https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/203-30.pdf
  • Still, the learners seemed very happy. That seems to reflect the hands-on approach of the lesson, which they can follow along with and which is less mathy than previous episodes.

Day 4 - K-means clustering, Hierarchical clustering

Learner feedback

Instructor feedback

Factor Analysis conceptual suggestions

Hi there!

I'm new to Carpentries and quite a weak coder. But I do have extensive experience with factor analysis within a social studies context and have a few suggestions about conceptual ideas if they're relevant. I totally understand things must be kept brief due to time constraints, but I think there are a few concepts that could be helpful.

EFA vs CFA: I wonder if it may be useful to describe the differences between exploratory and confirmatory factor analysis. As far as I can see, your analyses seem to be exclusively using EFA. Even if just focusing on EFA, it may be useful to mention there are more confirmatory approaches too. As of now, students may assume that EFA is the only method of factor analysis available.

Factor Enumeration: I wonder if it may be useful to briefly mention that there are various methods for determining the optimum number of latent factors in EFA. You don't even have to describe Kaiser's criterion, Cattell's scree plots, Horn's parallel analysis, and Velicer's minimum average partial - just mentioning that such methods exist may be helpful.

Next steps: I think pointing students towards R packages that make factor analysis easier may be beneficial. The psych package (for running EFA), the EFA.dimensions package (for factor enumeration in EFA), and lavaan (for CFA) are all incredibly useful; a rough sketch of using psych follows below.
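
A rough sketch of what using psych might look like, with the built-in mtcars data purely as a stand-in for real survey data (nfactors = 2 is arbitrary):

library("psych")
## Horn's parallel analysis, one way to suggest how many factors to retain
fa.parallel(mtcars, fa = "fa")
## exploratory factor analysis with two factors, maximum-likelihood fitting
efa_fit <- fa(mtcars, nfactors = 2, fm = "ml")
## show the loadings, hiding small ones for readability
print(efa_fit$loadings, cutoff = 0.3)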

I am happy to discuss or try to implement any of the comments raised further!

Many thanks,
Christie

Potential changes to optional episodes

  • 03 (from #7):

    • machine learning course isn't a Carpentries course and should be linked (or an alternative found)
    • forward stepwise description a bit sparse
    • forward/reverse could do with more detail
  • 10 (from #7):

    • explain densfun warnings (good Q)
    • error in dnorm, exercise 1
    • warnings in histograms (suppress?)
    • #17
    • more detail on fitting a bivariate normal

Setup page code seems outdated

The code on the Setup page seems outdated, although I may be missing something here. "small_methylation.rds" is listed as one of the downloadables, but running the code returns an error, and I also cannot find this file manually within the repo. Conversely, "prostate.rds" is not mentioned in the setup, but the first lesson page refers to it as if the learner has already downloaded it (and it can be downloaded).

Actions problems

Recent build actions are skipped, e.g. https://github.com/carpentries-incubator/high-dimensional-stats-r/actions/runs/1842306596 or https://github.com/carpentries-incubator/high-dimensional-stats-r/actions/runs/1842192882

The latter provides the error An error occurred while provisioning resources (Error Type: Failure).

This seems to be either a random failure or a resource-usage issue: https://github.community/t/jobs-on-macos-latest-sometimes-gets-cancelled-an-error-occurred-while-provisioning-resource/17434/13

Merging and updating build ahead of workshops

Would it be possible to merge the approved pull requests and rebuild the website before the next teaching round begins (27th February)?

Happy to give the website build a go if this helps.

Review comments: Episode 5 - factor analysis

I really like this episode and think the length is good given the information in the previous episode. I have relatively few comments, listed below.

I will also submit pull requests!

  • Line 18 & 20/Keypoints: these key points about identifying the number of factors are discussed at the start but I think should also be mentioned where choosing the number of factors is discussed (paragraph beginning Line 198)

  • Line 32/Introduction: This possibly needs to differentiate between FA and PCA and when you may use them more clearly. It's sort of covered towards the end of the introduction, but I think it needs to be more explicit.

  • Line 35: "Here, we introduce more general set of methods..." -> "Here we introduce an alternative but related set of methods.." to clarify that they're different approaches.

  • Line 40/Introduction: "latent variable" not defined until later. Could just define here instead.

  • Line 41/Introduction: I would remove "data-driven" here as both EFA and CFA are data-driven techniques.

  • Line 54/An example: Call this section "Student scores" for consistency with other episodes?

  • Line 74/Advantages and disadvantages of Factor Analysis: As in the last episode, I think it is hard to understand
    the advantages and disadvantages of FA without understanding what it is. I think this should come at the end of the episode.

  • Line 190/Performing EFA: A brief statement summarising the interpretation of factors/loadings in this example may be useful here just to clarify why you might use EFA.

  • Line 198/Performing EFA: Could add a section heading here for consistency with PCA episode/consistency. Could also
    back reference to PCA in Line 200 to highlight the similarities between the approaches here.

  • Line 200/Performing EFA: "In practise, we repeat the factor analysis using different values in the factors argument." ->
    "In practice, we repeat the factor analysis for different numbers of factors (by specifying different values in the factors argument)", since the upshot is that we're changing the number of factors.

  • Line 206/Performing EFA: the hypothesis test wording (see the factanal() sketch after this list). Should it be:

"If the p-value is less than 0.05, we reject the null hypothesis that the number of factors is sufficient. If the p-value
is greater than 0.05, we do not reject the null hypothesis that the number of factors used captures variation in the data. We
often therefore conclude that this number of factors is sufficient"

rather than

"If the p-value is less than 0.05, we reject the null hypothesis and accept that the number of factors included is too small. If the p-value is greater than 0.05, we accept the null hypothesis that the number of factors used captures variation in the data."

Would also add "and we repeat the analysis with more factors. When the p-value is greater than 0.05..." after the first sentence to make it clear that this is iterative.

Also, I know we don't want to complicate things, but it may be more accurate to say "if the p-value is less than our significance level..." instead of using 0.05 as a hard threshold. If the p-value was 0.06, you'd probably also reject in practice?

  • Line 261: This feels incomplete. Maybe it could include a brief statement about what this plot tells us about the relationship between variables and factors, to tie things together.

  • Line 269/Challenge 2: "discuss in groups" should maybe be adapted for the individual learner. "Consider or discuss in groups" as proposed by my review of episode 3?
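
A minimal sketch of the iterative check described in the hypothesis-test comment above, assuming a numeric data frame scores of student test results (the name is a placeholder):

## increase the number of factors until the sufficiency test stops rejecting
for (k in 1:3) {
  fit <- factanal(scores, factors = k)
  cat(k, "factor(s): p-value =", signif(fit$PVAL, 3), "\n")
}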

Minor comments

  • Captions and alt text.

  • Line 43/Introduction: "a-priori" -> "a priori".

  • Line 100/Prostate cancer patient data: I like that the prostate data is used as a simple example in these episodes but again think that it needs to be made clear that it's not high-dimensional and is used for pedagogical purposes!

  • Line 200/Performing EFA: "In practise" -> "In practice"

  • Line 203/Performing EFA: "output shows" -> "output then shows"

  • Line 224/Performing EFA: "explaind" -> "explained"

GH actions takes far too long

There's most likely a way to make the build process much faster.

My suggestions:

  • docker image
  • better R package caching (maybe borrow an action or two from repos that do it well)
  • Reduce number of dependencies, and/or try to prevent scripts being needlessly triggered

minfi error when building the website

After merging the last set of changes, I get an error because minfi cannot be installed.

As far as I understand, minfi is not needed as we have a 'local' copy of the methylation data. @hannesbecher could you please check this? I am not quite sure why we are loading minfi in all/some episodes.

If we remove it, we will also need to update https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/main/dependencies.csv accordingly, and add a note to say what the source of the data is (similar to what was done for the prostate dataset).

Thanks!

Screening issue

Removing these for now

Selecting variables before running models

To get around the problem of multiple testing, people sometimes reduce the
number of variables as input. There are some valid and many invalid ways of
doing this. One (invalid) method is to select variables based on correlation
with the outcome. The p-values we get out of this kind of approach
are basically meaningless, because we're fitting a two-stage model and only
reporting one set of p-values (ignoring all the non-significant ones). This
means that we are biasing the results towards significance, and further that
we are not correctly adjusting for the true number of tests we're
performing.

library("limma") ## provides lmFit(), eBayes() and topTable() used below

## calculate correlation between each feature (row) and the outcome
cors <- apply(methyl_mat, 1, function(feature) cor(feature, age_perm))
## keep only the top 50% of features by absolute correlation
x_cor <- methyl_mat[abs(cors) > quantile(abs(cors), 0.5), ]
## create design matrix
design_age <- model.matrix(~age_perm)
## fit model and apply shrinkage
fit_cor <- lmFit(x_cor, design = design_age)
fit_cor <- eBayes(fit_cor)
## create table of features
toptab_cor <- topTable(fit_cor, coef = 2, number = nrow(fit_cor))
## make a plot with two panels
par(mfrow = c(1, 2))
## first panel
plot(toptab_cor$logFC, -log10(toptab_cor$P.Value),
    xlab = "Effect size", ylab = bquote(-log[10](p)),
    pch = 19
)
## get the feature names of our new matrix
feats <- rownames(toptab_cor)
## subset the original topTable results and our new results with this list 
## of features
## here we are using adjusted p-values
pvals_both <- cbind(
    Original = toptab_age[feats, "adj.P.Val"],
    Screened = toptab_cor[feats, "adj.P.Val"]
)
## calculate x and y limits for the plot, so it's symmetric
lims <- range(pvals_both)
plot(pvals_both, pch = 19, xlim = lims, ylim = lims, log = "xy")
## plot dashed red lines at p-value thresholds of 0.05
abline(h = 0.05, lty = "dashed", col = "firebrick")
abline(v = 0.05, lty = "dashed", col = "firebrick")
## plot a dashed black line through the identity line x=y
abline(coef = 0:1, lty = "dashed")

This two-step selection process biases the results towards
significance, and it means that the p-values we
report aren't accurate.

One way to screen for variables that does work is to use a filter
or screen that is independent of the test statistic.
Correlation with the outcome is not independent of the t-statistic. However,
the overall variance of a feature is independent of this statistic, because
the overall variability of a feature does not depend on the outcome. We might suspect that
features that don't vary much at all also don't vary in our groups of interest,
or alongside our continuous features (age in this example).

This approach was introduced by
Bourgon, Gentleman and Huber (2010)
and has been shown to be valid. This is because variance and the t-statistic
are not correlated under the null hypothesis, but are correlated under
the alternative.

## calculate variance of each feature independent of the outcome
vars <- apply(methyl_mat, 1, var)
## select the top 50% variable features
x_var <- methyl_mat[vars > quantile(vars, 0.5), ]
## fit model and apply shrinkage
fit_var <- lmFit(x_var, design = design_age)
fit_var <- eBayes(fit_var)
## get the results for our screened features
toptab_var <- topTable(fit_var, coef = 2, number = nrow(fit_var))
## make a plot with two panels beside each other
par(mfrow = c(1, 2))
## first plot - a volcano plot
plot(toptab_var$logFC, -log10(toptab_var$P.Value),
    xlab = "Effect size", ylab = bquote(-log[10](p)),
    pch = 19
)
## as before, select the screened feature from the original set of models
## and the ones screened by variance
feats <- rownames(toptab_var)
pvals_both_var <- cbind(
    Original = toptab_age[feats, "adj.P.Val"],
    Screened = toptab_var[feats, "adj.P.Val"]
)
lims <- range(pvals_both_var)
## plot these two sets of p-values against each other, with
## red dashed lines at p=0.05 and a black dashed line along the identity line
## of x=y
plot(pvals_both_var, pch = 16, xlim = lims, ylim = lims, log = "xy")
abline(h = 0.05, lty = "dashed", col = "firebrick")
abline(v = 0.05, lty = "dashed", col = "firebrick")
abline(coef = 0:1, lty = "dashed")


review comments: setup.md

The destfile argument is missing from the download.file() command,
e.g. destfile = "dependencies.csv".
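
For example (the URL is a placeholder for wherever the lesson actually hosts the file):

## download.file() needs both the source URL and a local destination
download.file(url = "https://example.com/dependencies.csv",
              destfile = "dependencies.csv")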

Maybe add a few comments to the setup page and give an indication of the time taken?

For reference: I ran the setup in a fresh RStudio Cloud project (public: https://rstudio.cloud/project/3121818); attempted installation of all 50 packages took 30 minutes and I had several failures:
1: package(s) not installed when version(s) same as current; use force = TRUE to re-install: 'stats' 'MASS' 'cluster' 'BiocManager' 'utils'
2: In .inet_warning(msg) :
installation of package ‘scran’ had non-zero exit status
3: In .inet_warning(msg) :
installation of package ‘GenomicFeatures’ had non-zero exit status
4: In .inet_warning(msg) :
installation of package ‘bumphunter’ had non-zero exit status
5: In .inet_warning(msg) :
installation of package ‘ensembldb’ had non-zero exit status
6: In .inet_warning(msg) :
installation of package ‘minfi’ had non-zero exit status
7: In .inet_warning(msg) :
installation of package ‘scRNAseq’ had non-zero exit status
8: In .inet_warning(msg) :
installation of package ‘IlluminaHumanMethylationEPICanno.ilm10b4.hg19’ had non-zero exit status
9: In .inet_warning(msg) :
installation of package ‘IlluminaHumanMethylationEPICmanifest’ had non-zero exit status
10: In .inet_warning(msg) :
installation of package ‘FlowSorted.Blood.EPIC’ had non-zero exit status
