
Comments (7)

grst commented on May 31, 2024

Thanks, the multiprocessing seems to work now!

I think using BLAS with dense matrices would be a great enhancement. My use-case is that I used external tools (scanpy) for filtering and removing confounding factors. Additionally, some recent tools for combining multiple datasets (Scanorama, Harmony) work in PCA-space, and I think it would be great to apply too-many-cells to large, integrated datasets.

Concerning runtime:
The documentation says it takes "some time" to generate the tree. What ballpark runtime (hours/days/weeks?) can I expect on a (sparse) dataset with 10k cells on a server with 32 cores?

from too-many-cells.

GregorySchwartz commented on May 31, 2024

I'll close this since the initial issue has been resolved. If you have another issue, please open up a separate one.


GregorySchwartz commented on May 31, 2024

It's a csv file, so too-many-cells expects row and column names. Also, the initial column name should be empty as it names the rows. See the example section at https://gregoryschwartz.github.io/too-many-cells/. I'll try to add a better error message. Let me know if it works!
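The layout described above (labelled rows and columns, with an empty first header cell) can be sketched as follows. This is only an illustration using Python's standard csv module: the helper name write_labeled_csv, the labels, and the values are hypothetical, and putting components in rows and cells in columns is an assumption that should be checked against the linked example section.

```python
import csv
import io


def write_labeled_csv(rows, row_names, col_names, handle):
    """Write a matrix as CSV with an empty top-left header cell.

    The empty first header cell sits above the column of row names.
    """
    if len(rows) != len(row_names):
        raise ValueError("one row name is needed per matrix row")
    writer = csv.writer(handle, lineterminator="\n")
    # Header: empty cell for the row-name column, then the column names.
    writer.writerow([""] + list(col_names))
    for name, values in zip(row_names, rows):
        if len(values) != len(col_names):
            raise ValueError("each row needs one value per column name")
        writer.writerow([name] + list(values))


# Hypothetical data: two components (rows) by three cells (columns).
matrix = [[0.5, -1.2, 3.0], [2.1, 0.0, -0.7]]
buffer = io.StringIO()
write_labeled_csv(matrix, ["PC1", "PC2"], ["cell_a", "cell_b", "cell_c"], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The first line of the output starts with a bare comma, which is the empty initial column name the comment asks for.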


GregorySchwartz commented on May 31, 2024

Wait, needing column and row names is one issue. However, the real issue is that you are using PCA output, so the default cell and gene filtering and the default normalization are removing and scaling values incorrectly (they aren't counts). The fix is to use --no-filter --normalization NoneNorm whenever the values aren't counts. I updated the documentation to reflect this option. Also, because the matrix is now dense, I can't make any guarantees about how long it will take to run.


grst commented on May 31, 2024

Thanks, the filtering options did the trick! Now it does something :)

Concerning runtime:

  • Shouldn't it be a lot faster on PCA even though it's "dense" simply because it has a lot fewer dimensions?
  • The docker container only uses one core, is that normal behaviour?


GregorySchwartz commented on May 31, 2024

It's optimized for sparse matrices and the library I'm using uses an IntMap (IntMap Double) structure, so it would probably be slower on dense for multiplication and the like. I can see about allowing for dense matrices (and thus using BLAS/LAPACK for much faster running). To use multiple cores, add +RTS -N${NUMCORES} to the end of the entire command, where -N4 uses 4 cores for instance. Just -N uses all available cores. Of course, mileage may vary depending on the nature of the data.
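To make the placement of the flags concrete, here is a small sketch that assembles the argument list in Python. The --no-filter, --normalization NoneNorm, and +RTS -N flags come from this thread; the make-tree subcommand and --matrix-path option are not mentioned here and are assumed from the tool's documentation, so verify them before use. The helper name build_command and the file name pca.csv are hypothetical.

```python
import shlex


def build_command(matrix_path, cores=None, counts=True):
    """Assemble a too-many-cells invocation as a list of arguments.

    cores=None omits the runtime flags, cores=0 requests all cores (-N),
    and a positive number requests that many cores (e.g. -N4).
    """
    # Assumption: subcommand and input option are taken from the docs,
    # not from this thread.
    args = ["too-many-cells", "make-tree", "--matrix-path", matrix_path]
    if not counts:
        # Non-count input (e.g. PCA output): skip filtering and normalization.
        args += ["--no-filter", "--normalization", "NoneNorm"]
    if cores is not None:
        # The RTS options go at the very end of the entire command.
        args += ["+RTS", "-N" if cores == 0 else f"-N{cores}"]
    return args


command = build_command("pca.csv", cores=4, counts=False)
print(shlex.join(command))
```

Building the command as a list keeps the +RTS options last regardless of which other options are added, which matches the instruction to append them to the end of the entire command.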


GregorySchwartz commented on May 31, 2024

Thanks for using it! Let me know if you have any difficulties or want added features! As a ballpark, 10k cells in a sparse scRNA-seq count matrix would take around 20 minutes on a single core, and 50k cells maybe a few hours. I have not tested multi-core; it may actually be slower depending on the fine-grain or coarse-grain nature of the beast.

