krassowski commented on May 28, 2024

Hi @aderzelle - I know it was a long wait, but I finally got around to optimizing the code, in a few steps:

  1. I vectorized the loops in compute_matrix() and compute_unions():
    upset_data() was benchmarked with TCGA BRCA mutations data, showing a speed-up by a factor of:
    • two (from 0.8s to 0.4s) for ~1000 observations x 70 sets
    • about six (from 26s to 4s) for ~1000 observations x 1000 sets
  2. Further re-writing of compute_matrix(), compute_unions(), and names_of_members() led to another improvement:
    • down to 0.12s for ~1000 observations x 70 sets
    • down to 0.65s for ~1000 observations x 1000 sets
  3. Finally, I narrowed down another performance issue, this time specific to datasets with a very large number of observations (an incorrect assignment to the with_sizes data frame), bringing the times down to:
    • 0.10s for ~1000 observations x 70 sets
    • 0.52s for ~1000 observations x 1000 sets
    • 6.1s for 1.5 million rows x 11 sets, as in your question @aderzelle.
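
The kind of loop vectorization mentioned in step 1 can be illustrated with a minimal sketch. It does not reproduce the actual compute_matrix() or compute_unions() code: the toy membership matrix and the helper names keys_loop() and keys_vectorized() are assumptions made for illustration only.

```r
# Hypothetical toy data: 1000 observations x 70 sets with random membership.
set.seed(1)
n_obs <- 1000
n_sets <- 70
membership <- matrix(
    runif(n_obs * n_sets) < 0.1,
    nrow = n_obs,
    dimnames = list(NULL, paste0("set_", seq_len(n_sets)))
)

# Loop version: build one 0/1 key per observation, one row at a time.
keys_loop <- function(m) {
    out <- character(nrow(m))
    for (i in seq_len(nrow(m))) {
        out[i] <- paste(as.integer(m[i, ]), collapse = "")
    }
    out
}

# Vectorized version: paste whole columns together, so each paste0()
# argument is a vector covering all observations at once.
keys_vectorized <- function(m) {
    columns <- lapply(seq_len(ncol(m)), function(j) as.integer(m[, j]))
    do.call(paste0, columns)
}

slow_keys <- keys_loop(membership)
fast_keys <- keys_vectorized(membership)

# Both approaches must assign the same intersection key to every row.
stopifnot(identical(slow_keys, fast_keys))

# Intersection sizes follow from counting identical keys.
intersection_sizes <- table(fast_keys)

# Timings depend on the machine, so none are quoted here.
print(system.time(keys_loop(membership)))
print(system.time(keys_vectorized(membership)))
```

The loop calls paste() once per observation, while the vectorized form calls paste0() once in total, so its advantage should grow with the number of observations.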

I used the TCGA BRCA data set (84723 unique SNPs), randomly selected 11 patients, and duplicated the resulting rows 18 times to get over 1.5 million rows. Plotting this dataset after all the optimizations took 7.9s.
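
For anyone wanting to try a dataset of the same shape, here is a rough sketch of the construction described above. It uses random membership as a stand-in for the TCGA BRCA mutations, which are not included here, so the 0.3 membership probability and the patient_* column names are assumptions and the timings will not match the ones reported.

```r
set.seed(2024)
n_unique <- 84723    # number of unique SNPs reported above
n_patients <- 11     # number of sets (patients)
n_copies <- 18       # how many times the rows are duplicated

# Assumption: random TRUE/FALSE membership stands in for the real data.
unique_rows <- as.data.frame(
    matrix(runif(n_unique * n_patients) < 0.3, nrow = n_unique)
)
colnames(unique_rows) <- paste0("patient_", seq_len(n_patients))

# Duplicate every column 18 times to reach over 1.5 million rows.
benchmark_data <- as.data.frame(lapply(unique_rows, rep, times = n_copies))

print(dim(benchmark_data))
```

With ComplexUpset installed, wrapping the call in system.time(), for example `system.time(ComplexUpset::upset_data(benchmark_data, colnames(benchmark_data)))`, should give a timing comparable in spirit to the ones quoted in this comment.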

I only optimized the code for the plain use case of calling upset(data, columns), so there may still be arguments that execute non-optimized code (e.g. filtering or sorting). I will address these subsequently (please feel welcome to highlight any such cases by opening a new issue!). These changes will be available in version 0.7.4 (which can already be installed directly from GitHub).

I would love to learn if the performance is now satisfactory for you.

from complex-upset.

krassowski commented on May 28, 2024

Thanks for bringing this up. I also noted some slowdown when using omics datasets - I will have a look at improving the performance later tonight!

(user name not shown) commented on May 28, 2024

Perfect, I will be happy to do the testing.

(user name not shown) commented on May 28, 2024

It actually took roughly 1h to get the final plot.

krassowski commented on May 28, 2024

A recent commit, df58c0d, should have a side effect of slightly improving the performance (although it would not be a big difference).

krassowski commented on May 28, 2024

There are more speed-ups possible:

  • Everything so far is single-core, yet most of the functions are embarrassingly parallelizable. parallel could become a suggested package and, if installed, be used for some of the heavier tasks.
  • My code is still influenced to a degree by a Python mindset, where the key is not to vectorise everything but to make clever use of hashing. It is odd that R does not expose hash tables as a concept, but it seems they are indeed in use: https://www.r-bloggers.com/2015/04/hash-table-performance-in-r-part-iiin-part-i-of-this-series-i-explained-how-r-hashed/
  • Use lazy evaluation for the union size calculation.
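
The first two ideas above can be sketched as follows. This is an illustration under stated assumptions, not code from the package: the toy data frame and the helper names count_with_hash() and apply_maybe_parallel() are hypothetical. It relies on R environments, which can be created hash-backed with new.env(hash = TRUE), and on parallel::mclapply(), which is only used when the parallel package is available.

```r
# Hypothetical toy data: 500 observations x 3 sets.
set.seed(42)
membership <- data.frame(
    a = runif(500) < 0.5,
    b = runif(500) < 0.3,
    c = runif(500) < 0.2
)

# One key per observation, e.g. "101" for membership in sets a and c.
keys <- paste0(
    as.integer(membership$a),
    as.integer(membership$b),
    as.integer(membership$c)
)

# Hashing: use an environment as a key -> count hash table.
count_with_hash <- function(keys) {
    counts <- new.env(hash = TRUE)
    for (key in keys) {
        current <- counts[[key]]   # NULL when the key has not been seen yet
        counts[[key]] <- if (is.null(current)) 1L else current + 1L
    }
    unlist(mget(ls(counts), envir = counts))
}

hash_counts <- count_with_hash(keys)

# Parallelism: fall back to lapply() when parallel is not installed.
# Forking is not supported on Windows, hence a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
apply_maybe_parallel <- if (requireNamespace("parallel", quietly = TRUE)) {
    function(x, f) parallel::mclapply(x, f, mc.cores = n_cores)
} else {
    function(x, f) lapply(x, f)
}

set_names <- colnames(membership)
set_sizes <- unlist(
    apply_maybe_parallel(set_names, function(name) sum(membership[[name]]))
)
names(set_sizes) <- set_names

print(hash_counts)
print(set_sizes)
```

In plain R the vectorized table(keys) gives the same counts and will often be faster than a for loop over an environment, so the hashing variant is mostly of interest where keys arrive incrementally. Lazy evaluation of union sizes is not covered by this sketch.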
