top_n limitation of 5000? (cola issue, open)

jarbet commented on July 17, 2024

top_n limitation of 5000?

from cola.

Comments (5)

jokergoo commented on July 17, 2024

If the expected number of subgroups is large, say 2000, then randomly sampling 5k rows from the top 20k rows and using all of the top 20k rows directly may give different results. But if the expected number of subgroups is small, say 10, then I would expect a random sample of 5k rows from the top 20k to already give a perfect approximation for subgroup identification, so it is not necessary to use the complete 20k rows.

jarbet commented on July 17, 2024

> If the expected number of subgroups is large, say 2000, then randomly sampling 5k rows from the top 20k rows and using all of the top 20k rows directly may give different results. But if the expected number of subgroups is small, say 10, then I would expect a random sample of 5k rows from the top 20k to already give a perfect approximation for subgroup identification, so it is not necessary to use the complete 20k rows.

To be more specific, my concern is when there is a very large number of features (I am working with ~200,000 DNA methylation CpG features). I want to generate clusters that reflect different methylation profiles, using ALL methylation features. Rather than assuming there is a sparse subset of important features, I want ALL (or most) features to contribute to the clusters.

Currently, cola would only be able to resample 5000 features at a time. My intuition is that this will not give all features enough chance to contribute to the clusters, since in any given partition roughly 97.5% of CpGs (all but 5000 of ~200,000) do not contribute at all. On the other hand, I understand that each feature would still receive approximately equal weight when averaging over all partitions in the final consensus clustering, so maybe this approach is fine. I am not sure.
What do you think?

jokergoo commented on July 17, 2024

@jarbet In cola, the 5000 features are not sampled from all 200K probes; they are sampled from the top_n top features. Say you have the 10k most variable probes: the 5000 features are then sampled only from those top 10k probes.
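
The selection described above can be sketched in a few lines. This is a minimal illustration and not cola's actual code: the helper name `sample_top_rows`, the use of variance as the ranking score, and the `max_rows` parameter are all assumptions made for this example.

```python
import random
import statistics

MAX_ROWS = 5000  # cap discussed in this thread; an assumption of this sketch


def sample_top_rows(matrix, top_n, max_rows=MAX_ROWS, seed=None):
    """Return the row indices used for one partition (hypothetical helper).

    matrix: list of rows, each row a list of numeric values.
    """
    rng = random.Random(seed)
    # Rank rows by variance, highest first (stand-in for a top-value method).
    ranked = sorted(
        range(len(matrix)),
        key=lambda i: statistics.pvariance(matrix[i]),
        reverse=True,
    )
    top = ranked[:top_n]
    # Subsample only when the top set is larger than the cap.
    if len(top) > max_rows:
        return sorted(rng.sample(top, max_rows))
    return sorted(top)
```

With `top_n` equal to the total number of rows, the ranking step no longer filters anything, and the random draw is effectively taken from all rows.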

jarbet commented on July 17, 2024

> @jarbet In cola, the 5000 features are not sampled from all 200K probes; they are sampled from the top_n top features. Say you have the 10k most variable probes: the 5000 features are then sampled only from those top 10k probes.

Okay, so if I set top_n = 200K, then it will resample 5000 features from ALL 200K, correct? That way I can give all 200K features a chance to appear in the clusters, right?
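
As a rough way to quantify "a chance to appear", the sketch below computes the probability that one given feature is drawn at least once. It assumes that each partition draws a fresh, uniform sample of 5000 features from the top_n set, independently of the other partitions. Whether cola resamples per partition is not confirmed in this thread, and the function name and the repeat count of 50 are hypothetical.

```python
def coverage_probability(top_n, sampled, repeats):
    """Chance that one given feature appears in at least one of `repeats`
    independent uniform samples of size `sampled` drawn from `top_n` features."""
    p_single = min(1.0, sampled / top_n)  # chance of being drawn in one sample
    return 1.0 - (1.0 - p_single) ** repeats


# One sample covers 5000 / 200000 = 2.5% of the features.
single = coverage_probability(200_000, 5000, 1)
# Hypothetical repeat count of 50, chosen only for illustration.
across_repeats = coverage_probability(200_000, 5000, 50)
print(f"{single:.3f} {across_repeats:.3f}")
```

Under these assumptions a given feature has a 2.5% chance of entering any single sample, and roughly a 72% chance of entering at least one of 50 samples.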

jokergoo commented on July 17, 2024

The thing is, if you expect, say, 5 to 20 clusters from all samples, then randomly sampling 5k features from 200K can give a good approximation. If you expect 1000 clusters from all samples, randomly sampling 5k features may not be a good idea.
