Hello, thanks for this package, very cool looking and I am glad you

Slow plotting of a big dataset (10^6 observations, 11 sets) about complex-upset HOT 6 CLOSED

krassowski commented on May 28, 2024

Slow plotting of a big dataset (10^6 observations, 11 sets)

from complex-upset.

Comments (6)

krassowski commented on May 28, 2024 1

Hi @aderzelle - I know it was a long wait, but I finally got around to optimizing the code, in a few steps:

First, I vectorized the loops in compute_matrix() and compute_unions(). upset_data() was benchmarked with TCGA BRCA mutations data, showing a speed-up by a factor of:
- two (from 0.8s to 0.4s) for ~1000 observations x 70 sets
- six (from 26s to 4s) for ~1000 observations x 1000 sets

Further rewriting of compute_matrix(), compute_unions(), and names_of_members() led to another improvement:
- down to 0.12s for ~1000 observations x 70 sets
- down to 0.65s for ~1000 observations x 1000 sets

Finally, I narrowed down another performance issue, this time specific to datasets with a very large number of observations (an incorrect assignment to the with_sizes data frame), bringing the times down to:
- 0.10s for ~1000 observations x 70 sets
- 0.52s for ~1000 observations x 1000 sets
- 6.1s for 1.5 million rows x 11 sets, as in your question @aderzelle.
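
To illustrate the kind of change that "vectorizing loops" refers to, here is a minimal sketch on synthetic data. It is not the actual code of compute_matrix() or compute_unions(); the helper names label_rows_loop() and label_rows_vectorized() are hypothetical and exist only for this example. The idea is to replace a loop over observations (many) with a loop over sets (few), operating on whole columns at a time.

```r
# Toy membership table: rows are observations, logical columns are sets.
# Synthetic data, not the TCGA BRCA benchmark mentioned above.
set.seed(1)
n_obs <- 1000
set_names <- paste0("set_", 1:5)
membership <- as.data.frame(
    matrix(runif(n_obs * length(set_names)) < 0.3, nrow = n_obs,
           dimnames = list(NULL, set_names))
)

# Loop version: builds the intersection label one row at a time.
label_rows_loop <- function(df) {
    labels <- character(nrow(df))
    for (i in seq_len(nrow(df))) {
        members <- names(df)[unlist(df[i, ])]
        labels[i] <- paste(members, collapse = "&")
    }
    labels
}

# Vectorized version: iterates over sets and updates all member rows at once.
label_rows_vectorized <- function(df) {
    labels <- rep("", nrow(df))
    for (set in names(df)) {
        in_set <- df[[set]]
        # Add a separator only where a label has already been started.
        sep <- ifelse(labels[in_set] == "", "", "&")
        labels[in_set] <- paste0(labels[in_set], sep, set)
    }
    labels
}

# Both versions produce the same labels.
stopifnot(identical(label_rows_loop(membership),
                    label_rows_vectorized(membership)))

# Intersection sizes then follow from a single table() call.
intersection_sizes <- table(label_rows_vectorized(membership))
```

Wrapping each call in system.time() shows the difference on your own machine; the gap should widen as the number of rows grows, because the loop version indexes the data frame once per observation.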

I used the TCGA BRCA data set (84723 unique SNPs), randomly selected 11 patients, and duplicated the data 18 times to get over 1.5 million rows. Plotting this dataset after all the optimizations took 7.9s.
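
A benchmark of this shape could be reproduced roughly as follows. This is a sketch with randomly generated membership data standing in for the TCGA BRCA table, and with a smaller base table so that it runs quickly; the column names and the 20% membership rate are assumptions made for the example. Note that timing upset() alone covers data preparation and construction of the plot object; rendering happens when the object is printed.

```r
# Synthetic stand-in for the benchmark data: 11 logical set columns.
set.seed(42)
n_unique <- 1000   # the comment above used 84723 unique SNPs
n_copies <- 18     # duplicated 18 times, as described above
sets <- paste0("patient_", 1:11)

base_data <- as.data.frame(
    matrix(runif(n_unique * length(sets)) < 0.2, nrow = n_unique,
           dimnames = list(NULL, sets))
)

# Stack n_copies copies of the base table by repeating its row indices.
big_data <- base_data[rep(seq_len(nrow(base_data)), times = n_copies), ]
rownames(big_data) <- NULL

# Time the plain upset(data, columns) call only if ComplexUpset is installed.
if (requireNamespace("ComplexUpset", quietly = TRUE)) {
    timing <- system.time(
        plot_object <- ComplexUpset::upset(big_data, sets)
    )
    print(timing["elapsed"])
}
```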

I only optimized the code for the plain use case of calling upset(data, columns), so there may still be arguments that execute non-optimized code (e.g. filtering or sorting). I will address these subsequently (please feel welcome to highlight any such cases by opening a new issue!). These changes will be available in version 0.7.4 (which can already be installed directly from GitHub).

I would love to learn if the performance is now satisfactory for you.

krassowski commented on May 28, 2024

Thanks for bringing this up. I also noted some slowdown when using omics datasets - I will have a look at improving the performance later tonight!

commented on May 28, 2024

Perfect, I will be happy to do the testing.

commented on May 28, 2024

It actually took roughly 1h to get the final plot.

krassowski commented on May 28, 2024

A recent commit, df58c0d, should have a side effect of slightly improving the performance (although it would not be a big difference).

krassowski commented on May 28, 2024

There are more speed-ups possible:

- Everything so far is single-core, yet most of the functions are embarrassingly parallelizable. parallel could become a suggested package and, if installed, it could be used for some of the heavier tasks.
- My code is to a degree still influenced by a Python mindset, where the key is not to vectorise everything but to make clever use of hashing. It is odd that R does not expose hash tables as a concept, but it seems they are indeed in use: https://www.r-bloggers.com/2015/04/hash-table-performance-in-r-part-iiin-part-i-of-this-series-i-explained-how-r-hashed/
- Use lazy evaluation for union size calculation.
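
The three ideas above can be sketched in base R as follows. This is only an illustration on synthetic data, not a proposal for how ComplexUpset should implement them; the variable names (set_sizes, counts, union_size_A_B) are hypothetical. It assumes the parallel package that ships with R, uses an environment as a hash table, and uses delayedAssign() for lazy evaluation.

```r
set.seed(7)
membership <- as.data.frame(
    matrix(runif(2000 * 4) < 0.5, ncol = 4,
           dimnames = list(NULL, c("A", "B", "C", "D")))
)

# 1. Parallelism: per-set work is independent, so it can be spread over cores.
#    mclapply() relies on forking, so mc.cores must stay at 1 on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
set_sizes <- parallel::mclapply(membership, sum, mc.cores = n_cores)

# 2. Hashing: an environment created with hash = TRUE behaves like a hash
#    table keyed by strings. Here it counts observations per intersection.
labels <- apply(membership, 1, function(row) {
    paste(names(membership)[row], collapse = "&")
})
counts <- new.env(hash = TRUE)
for (label in labels) {
    # An empty string is not a valid key, so use a placeholder for rows
    # that belong to no set.
    key <- if (label == "") "(none)" else label
    current <- counts[[key]]   # NULL when the key has not been seen yet
    counts[[key]] <- if (is.null(current)) 1L else current + 1L
}

# 3. Lazy evaluation: delayedAssign() stores a promise, so the union size
#    is computed only if and when the variable is first read.
delayedAssign("union_size_A_B", sum(membership$A | membership$B))
```

The counting loop in step 2 exists only to show the hash-table access pattern; for this particular task a single table(labels) call does the same job in vectorized form.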
