Episode 6
Again, a really nice episode. I particularly like how this episode builds gradually from an initial example, which makes the narrative very clear. I have a couple of comments that largely relate to wording, but overall I found this episode really informative and easy to understand.
I will submit pull requests :)
- Line 35: I think a short sentence here linking back to previous episodes and the data considered may clarify the motivation for this episode from the outset, particularly how clustering is different to the methods already introduced/when it should be applied.
The subsequent paragraph also appears to jump between motivation, methodological description and application areas and is a little confusing. I would perhaps re-order this for flow while addressing the above:
"As we saw in previous episodes, visualising high-dimensional data with a large number of features is difficult and can limit our understanding of the data and associated processes. In some cases, a known grouping causes this heterogeneity (sex, treatment groups, etc.). In other cases, heterogeneity may arise from the presence of unknown subgroups in the data. [Something linking to PCA/FA, "PCA/FA..."]
Clustering is a set of techniques that [how different to PCA] and allows us to discover unknown groupings. Cluster analysis involves finding groups of observations that are more similar to each other (according to some feature) than they are to observations in other groups and are thus likely to represent the same source of heterogeneity. Once groups (or clusters) of observations have been identified using cluster analysis, further analyses or interpretation can be carried out on the groups, for example, using metadata to further explore groups.
Cluster analysis is commonly used to discover unknown groupings in fields such as bioinformatics, genomics, and image processing, in which large datasets that include many features are often produced."
(Note that I'm unsure if the description has lost accuracy!)
- Line 58: Would propose a minor rewording to make it clear that the iterative updating of clusters follows from this process:
"Clusters can be updated in an iterative process so that over time we can become more confident in size and shape of clusters." -> "Using this process, we can also iteratively update clusters so that we become more confident about the shape and size of the clusters."
- Line 62: I would propose presenting this section on believing in clusters after the methodology ("What is K-means clustering") has been introduced, as it's hard to follow without really understanding what clustering is doing.
- Line 109: "K-means clustering is a clustering method which groups data points into a user-defined number of distinct non-overlapping clusters." This defines clustering using the word "clustering", which is unclear.
Maybe simply: "K-means clustering groups data points into a user-defined number of distinct non-overlapping clusters."
This paragraph could also link the way we group (minimising within-cluster variation) to how this creates clusters for clarity. A small change (with the above changes too) like:
"K-means clustering groups data points into a user-defined number of distinct non-overlapping clusters. To create clusters of 'similar' data points, K-means clustering forms clusters by minimising the within-cluster variation."
- Line 114: I would possibly omit this sentence. I'm not really sure what it means: what is a specified clustering algorithm, and how does it increase our confidence that our data can be partitioned into groups?
- Line 119: Given the considerations re defining the initial point described below, picking co-ordinates randomly here may be misleading for someone just referring to this section. I would suggest just referencing that this is discussed later.
- Line 130: For signposting and consistency with the way additional considerations are presented later, I would present this as a new section "Initialisation".
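For reference when rewording Lines 109 to 130, here is a minimal sketch of the K-means procedure as I understand it: pick initial centres, assign each point to its nearest centre, recompute the centres, and repeat until the assignments stop changing. This is not the episode's code. The function names, the toy data and the choice to draw initial centres at random from the observations are my own assumptions for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_cluster_ss(points, labels, centres):
    """Total within-cluster sum of squares, the quantity K-means tries to reduce."""
    return sum(sq_dist(p, centres[l]) for p, l in zip(points, labels))


def kmeans(points, k, n_iter=100, seed=0):
    """Illustrative K-means (Lloyd's algorithm) on a list of coordinate tuples."""
    rng = random.Random(seed)
    # Assumption: initial centres are k distinct observations chosen at random.
    centres = rng.sample(sorted(set(points)), k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centres[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centres


# Toy data (made up for this sketch): two well-separated groups of three points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centres = kmeans(points, k=2)
wss = within_cluster_ss(points, labels, centres)
print(labels, round(wss, 3))
```

Changing `seed` changes the starting centres, which is the sensitivity that a dedicated "Initialisation" section would discuss.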
- Line 157: Reference data set name (scRNAseq) here?
- Line 163: I think it needs to be clear here (and from the start as above) how clustering and PCA are different to clarify why we would apply PCA first. Technically both can be used for dimension reduction?
- Line 203: "K" is not explicitly defined yet. Instead of "Cluster the data using a $K$ of 5,.." could say "Cluster the data using $K=5$ clusters,"
- Line 247: I think the intuitive definition of silhouette width needs to be given here rather than just its properties.
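To show the kind of intuitive definition I have in mind, here is a small sketch of the standard silhouette width. For each observation, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the width is (b - a) / max(a, b). Values near 1 suggest the observation sits comfortably in its cluster, and negative values suggest it is closer to another one. The function name and toy data are my own assumptions, and the code needs Python 3.8+ for `math.dist`.

```python
import math


def silhouette_widths(points, labels):
    """Silhouette width per observation; assumes at least two clusters."""
    clusters = sorted(set(labels))
    widths = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster.
        same = [
            math.dist(p, q)
            for j, (q, l) in enumerate(zip(points, labels))
            if l == own and j != i
        ]
        if not same:
            widths.append(0.0)  # convention for a cluster with a single member
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == c)
            / labels.count(c)
            for c in clusters
            if c != own
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Toy data (made up for this sketch): two tight, well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
good = silhouette_widths(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_widths(points, [0, 1, 0, 1, 0, 1])
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
print(round(mean_good, 3), round(mean_bad, 3))
```

The sensible labelling gives a mean width close to 1, while the deliberately mixed labelling gives a much lower one, which is the intuition I think the text should state.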
- Line 314: "Is it better or worse than before? Can you identify where the differences lie?" is a little unclear, as the method hasn't changed and I'm not sure what "where the differences lie" refers to. Should this be "Do 5 clusters appear appropriate? Why/why not?"
- Line 369: I think, given this whole section is about bootstrapping, the callout of a small section is confusing. This section could simply be called "Cluster robustness - bootstrapping" and the callout combined into the main text as an example.
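To make the bootstrapping idea in this section concrete, here is a rough sketch of the procedure as I understand it: resample the observations with replacement, re-cluster each resample, and check how often pairs of observations are grouped the same way as in the original clustering. This is not the episode's code. The function names, the pairwise-agreement score (a Rand-index-style measure I have swapped in for illustration) and the synthetic data are all my own assumptions.

```python
import random
from itertools import combinations


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points, k, rng, n_iter=100):
    """Illustrative K-means returning one cluster label per point."""
    centres = rng.sample(sorted(set(points)), k)  # distinct starting centres
    labels = None
    for _ in range(n_iter):
        new = [min(range(k), key=lambda j: sq_dist(p, centres[j])) for p in points]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def pair_agreement(labels_a, labels_b):
    """Share of pairs that both labellings treat alike (together or apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def bootstrap_stability(points, k, n_boot=20, seed=1):
    rng = random.Random(seed)
    original = kmeans_labels(points, k, rng)
    scores = []
    for _ in range(n_boot):
        # Sample observation indices with replacement to form a "new" data set.
        idx = [rng.randrange(len(points)) for _ in points]
        boot_labels = kmeans_labels([points[i] for i in idx], k, rng)
        # Compare against the original labels of the same sampled observations.
        scores.append(pair_agreement([original[i] for i in idx], boot_labels))
    return scores


# Synthetic data (made up for this sketch): two well-separated Gaussian groups.
data_rng = random.Random(42)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(15)]
points += [(data_rng.gauss(10, 1), data_rng.gauss(10, 1)) for _ in range(15)]
scores = bootstrap_stability(points, k=2)
mean_score = sum(scores) / len(scores)
print(round(mean_score, 3))
```

An agreement score near 1 across resamples would suggest the clusters change little when the data change slightly. Lower scores would suggest the clustering is sensitive to the particular observations sampled.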
- Line 377: It is unclear how the bootstrap helps us to address the sensitivity of clusters to the data. Could add something like this:
"We can bootstrap: sample the data with replacement to produce a 'new' data set. We can then calculate new clusters for this data set and compare these to the clusters on the original data set, thus helping us to see how the clusters may change for small changes in the data."
instead of...
"To assess this, we can use the bootstrap. What we do here is to take a sample from the data with replacement"
- Line 467: "Are the results better or worse" is confusing - the results/clusters won't be better or worse, we've just investigated them more.
Maybe "Do the results appear better or worse"?
Minor comments
- Line 110: "In K-means clustering ..." -> "In K-means clustering, ..."
Also, this and the sentence beginning on Line 112 start the same way. Could probably just remove "In K-means clustering, ..." from the second.
- Line 120: "until convergence..." -> "until appropriate clusters have been formed:"
- Line 154: "Single-cell RNA sequence data" throughout (half abbreviated currently)
- Line 372: "That is, if the data we observed were slightly different, the clusters we would identify in this different data would be very similar." -> "That is, we want to ensure that the clusters identified do not change substantially if the observed data change slightly."
- Line 501: "This method can use k-means, or other clustering methods." -> "This method can use k-means or other clustering methods."
- Mixture of lower case and upper case 'K's with reference to the number of clusters throughout.
- Alt text and captions