
scrnaseq_benchmark's People

Contributors

davycats, lcmmichielsen, tabdelaal


scrnaseq_benchmark's Issues

Disable "default" jobs

Hi,

I am trying to run SVM and SVM_rejection using the Snakemake and Singularity pipeline on a new dataset. In the configuration file I have listed only these two tools.

When running snakemake, I get the list of the jobs that will be run:

Job counts:
        count   jobs
        1       SVM
        1       SVM_rejection
        1       all
        2       evaluate
        1       generate_dropouts_feature_rankings
        6

Is it possible to disable the evaluate and generate_dropouts_feature_rankings jobs? The reasons I want this are as follows.

  1. I want to change the evaluation and therefore am planning to manually run an updated evaluate.R on the results.

  2. The generate_dropouts_feature_rankings rule is failing because my data is too big (snippet from the log):

numpy.core._exceptions.MemoryError: Unable to allocate array with shape (58184, 11261) and data type float64

Thank you.
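
A workaround that might do what you want (hedged: the output paths below are hypothetical, so check the actual rule outputs in your Snakefile): instead of letting the default all target pull in everything, name only the prediction files you need, and Snakemake will schedule just the jobs required to produce them:

snakemake --configfile config.yml --use-singularity <output_dir>/SVM/SVM_pred.csv <output_dir>/SVM_rejection/SVM_rejection_pred.csv

Recent Snakemake versions also have --omit-from <rule>, which prunes the given rules and everything downstream of them from the DAG; that would cover evaluate, but note that omitting generate_dropouts_feature_rankings would also drop any job downstream of it, so check first whether your SVM runs consume the ranking file.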

Python imports in baseline container

You've been more than helpful already, so I'm sorry to keep bothering you, but I have come across one more bug that I just cannot figure out. It only manifested this afternoon; before that, everything was going well, even past this step. The problem is a Python module import failing when one of your scripts runs in the corresponding container. To reproduce it, try this:

$ singularity shell docker://scrnaseqbenchmark/baseline:latest
Singularity> python3 
>>> from sklearn import linear_model

I get this error:

/usr/local/lib/python3.5/dist-packages/scipy/__init__.py:115: UserWarning: Numpy 1.13.3 or above is required for this version of scipy (detected version 1.10.1)
  UserWarning)
RuntimeError: module compiled against API version 0xb but this version of numpy is 0xa
Traceback (most recent call last):
  File "rank_gene_dropouts.py", line 8, in <module>
    from sklearn import linear_model
  File "/usr/local/lib/python3.5/dist-packages/sklearn/__init__.py", line 76, in <module>
    from .base import clone
  File "/usr/local/lib/python3.5/dist-packages/sklearn/base.py", line 16, in <module>
    from .utils import _IS_32BIT
  File "/usr/local/lib/python3.5/dist-packages/sklearn/utils/__init__.py", line 13, in <module>
    from scipy.sparse import issparse
  File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/__init__.py", line 230, in <module>
    from .csr import *
  File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/csr.py", line 13, in <module>
    from ._sparsetools import (csr_tocsc, csr_tobsr, csr_count_blocks,
ImportError: numpy.core.multiarray failed to import

I am completely stumped. Running an import inside a container should be the definition of reproducible; what could either of us possibly be doing differently? If you have any ideas, I am happy to send any info you need.
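
For what it's worth, two standard commands (nothing repo-specific) may help pin this down: check which NumPy the container's Python actually imports, and try pulling in a newer one. The error means the compiled scipy/sklearn wheels expect a newer NumPy C API than the NumPy that is installed, so upgrading it may unblock you. Note that pip3 install --user writes into your bind-mounted home directory, which the container's Python will pick up; convenient, but it does undercut reproducibility:

Singularity> python3 -c "import numpy; print(numpy.__version__)"
Singularity> pip3 install --user --upgrade numpy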

Ask for help on Python-based package installation

I want to validate the results using the data you provided. However, the Python-based packages (DigitalCellSorter, Cell_BLAST, moana) depend on different versions of Python and on other Python packages. Could you please advise on how to install them on my own computer? Did you install them with conda, and if so, which Python version was used for each of the three packages? Or could you provide installation scripts? Thanks!
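
Not the authors' documented setup, just a common pattern (the PyPI package name and Python version below are assumptions; check each tool's README): give each tool its own conda environment so their conflicting requirements never meet:

conda create -n cell_blast python=3.7 -y
conda activate cell_blast
pip install Cell-BLAST

Repeat with a fresh environment (and whatever Python version its README asks for) for DigitalCellSorter and moana.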

Missing rank_genes_dropouts.csv

Hi all,

Very nice work on this project, and I really enjoyed your "one rule to rule them all" allusion! I am having an issue with the gene ranking. Details are attached. Is there a way to either 1) get the gene ranking to work or 2) bypass it, since I am using all genes anyway?

Thank you!

P.S. The details: I am using this command, with the attached Snakefile and config file, and I get the attached error log. I run it from the Snakemake/ subdirectory, because when I ran it from the root it could not find the Snakefile.

snakemake --configfile config.yml --use-singularity

Snakefile.txt

config.yml.txt

error.log
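
One unverified shortcut for option 2: since you use all genes anyway, a trivial stand-in ranking might satisfy the pipeline's dependency on the ranking file. The expected file layout is an assumption here, so compare against a ranking produced on a small test dataset before trusting it:

# hypothetical stand-in ranking file: "rank" genes in their original
# order, one column per fold (column count and layout are assumptions)
import pandas as pd

n_genes, n_folds = 20000, 5  # placeholders for your data
ranks = pd.DataFrame({f"fold{i}": range(n_genes) for i in range(n_folds)})
ranks.to_csv("rank_genes_dropouts.csv", index=False)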

Fix MemoryError

Hi,

I have a rather big dataset, and the Python scripts fail while loading the data (I have tried rank_gene_dropouts.py and run_SVM.py) with:

numpy.core._exceptions.MemoryError: Unable to allocate array with shape (58184, 11261) and data type float64

I have successfully fixed the issue by asking pandas to read the file in chunks. I have also applied the log transformation while reading the chunks.

This is how reading the data looks in these two Python files:
rank_gene_dropouts.py:

  • Current
# read the data
data = pd.read_csv(DataPath,index_col=0,sep=',')
data = data.iloc[tokeep]
data = np.log2(data+1)
  • Updated
# read the data in chunks to keep peak memory low
chunks = []
for chunk in pd.read_csv(DataPath, index_col=0, sep=',', chunksize=1000):
    chunks.append(np.log2(chunk + 1))
data = pd.concat(chunks)  # single concat; keeps the original row index
data = data.iloc[tokeep]

run_SVM.py:

  • Current (I have skipped the irrelevant lines that were in between)
# read the data
data = pd.read_csv(DataPath,index_col=0,sep=',')
data = data.iloc[tokeep]
# normalize data
data = np.log1p(data)
  • Updated
# read and normalize the data in chunks
chunks = []
for chunk in pd.read_csv(DataPath, index_col=0, sep=',', chunksize=1000):
    chunks.append(np.log1p(chunk))
data = pd.concat(chunks)  # single concat; keeps the original row index
data = data.iloc[tokeep]

It might be a good idea to update the Python scripts to handle bigger datasets more gracefully.
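
One further refinement worth considering (a sketch, not tested against the pipeline): down-cast each chunk to float32 before concatenating, which roughly halves the memory held by the assembled matrix; log-transformed expression values fit comfortably in float32:

# chunked read, log-transform, and down-cast (DataPath and tokeep
# come from the surrounding script, as above)
chunks = []
for chunk in pd.read_csv(DataPath, index_col=0, sep=',', chunksize=1000):
    chunks.append(np.log1p(chunk).astype(np.float32))
data = pd.concat(chunks)
data = data.iloc[tokeep]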

Adding Docker containers

In #7, you mention that the pipeline looks on Docker Hub for pre-built containers. If I want to add a container for another tool, such as ACTINN, how should I do that? I'm sorry to bother you, but I checked the README files and the Singularity docs about Docker, and nothing describes how to publish a new Docker image to Docker Hub. Do you use Docker to create the images and Singularity to run them?
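
In case it helps, the usual workflow (a sketch; the image name and tag are placeholders, not the authors' actual setup) is to build the image locally with Docker from a Dockerfile, push it to your own Docker Hub account, and point the pipeline at the resulting docker:// URI, just like the existing scrnaseqbenchmark/* images:

docker build -t <your-dockerhub-user>/actinn:latest .
docker login
docker push <your-dockerhub-user>/actinn:latest

Singularity can then pull and run it via docker://<your-dockerhub-user>/actinn:latest.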

Running benchmarking pipeline without Singularity

Hello! Thanks so much for making your code so well-documented and accessible. I'm interested in benchmarking a new cell type identification tool using your snakemake pipeline, but our computing cluster is having trouble installing Singularity. Is there a way to run the comparison tests without it?
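
One thing that may work (untested, and it assumes every tool's Python/R dependencies are installed on the host, which is exactly what the containers are meant to spare you): simply drop the --use-singularity flag so Snakemake runs each rule in the host environment:

snakemake --configfile config.yml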

Zheng et al. sorted data

Hi, your paper mentions that the Zheng et al. sorted PBMC data was subsampled down to 2,000 cells per cell type. I am wondering whether the data (or the cell barcodes) for these subsampled cells are provided, or could be provided. I do not see this dataset in the Zenodo repository.

Thanks!
Matt

Which figures are covered?

Hi Dr. Abdelaal et al! Very impressive and useful work here. Does your code reproduce all of the figures, or just some of them? I am particularly interested in Figure 4 (the cross-species comparison).

Link to the paper

Small request: please link to the paper from the main README, for people who find this project through GitHub. Thanks!

Re-running cross-species brain atlas tests

If I want to re-run your cross-species benchmarks, would the following tactic work?

  1. Disable the feature that generates folds for cross-validation. I will need to specify the training and test sets directly.
  2. In the data download from Zenodo, parse the cross-validation RData files and split them according to which dataset is used for training and which for testing. For example, MouseALM_HumanMTG_folds.RData would become train=MouseALM__test=HumanMTG_folds.RData and train=HumanMTG__test=MouseALM_folds.RData (a rough sketch of this step follows the list).
  3. Run the pipeline separately for each resulting CV folds file.
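
If it helps, here is a rough R sketch of step 2. Hedged: I am assuming the *_folds.RData files store the fold indices under names like Train_Idx, Test_Idx, and n_folds; load() one and inspect it with ls() to confirm the actual variable names before relying on this:

# load an existing folds file (assumed to define Train_Idx, Test_Idx, n_folds)
load("MouseALM_HumanMTG_folds.RData")

# collapse to a single fixed split; mouse_idx and human_idx are
# hypothetical vectors of row indices you would derive from the
# merged dataset's labels or statistics file
n_folds   <- 1
Train_Idx <- list(mouse_idx)
Test_Idx  <- list(human_idx)
save(n_folds, Train_Idx, Test_Idx,
     file = "train=MouseALM__test=HumanMTG_folds.RData")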

Thank you!

P.S. I got the snakemake pipeline working, with Singularity and cluster configuration and everything! It's always hard to get Snakemake configured on a new platform, but it has been really useful for learning more about these tools.

Reconstructing original datasets from merged ones

Many thanks for your study and data! I would like to reconstruct the original datasets from the merged ones you provide, but I am having trouble understanding the Statistics.xlsx file.
For example, in Inter-dataset/Pancreatic/statistics.xlsx, which indices should I take to reconstruct the full Xin dataset? And what about the Muraro dataset, which has two sets of training indices?
Thanks a lot for your help!

Paper's results: data for median F1-score vs. computation time?

Hi, I am looking for a scatter plot of the median F1-score versus the computation time of each method on a common dataset, perhaps Zheng sorted or Zheng 68K.

I could not find such a plot, nor a table of the underlying computation-time values, in the paper or the supplementary figures. The closest was Supplementary Figure S13, which plots computation time against the number of cells.

Is this information available anywhere? If not, would you be able to provide it?

scPred Dockerfile: which scPred version and other configuration?

I want to validate the scPred results, and I am using the newest version of scPred (1.9.2). However, it does not work: R reports that several functions the benchmark relies on, including eigenDecompose, are missing. So I would like to know which version of scPred should be used, and what other configuration this baseline needs.
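
One way to find out which version the benchmark actually ran (assuming the scPred container follows the same scrnaseqbenchmark/<tool> naming pattern as the baseline image in the issue above; I have not verified the exact image name):

singularity exec docker://scrnaseqbenchmark/scpred:latest R -e 'packageVersion("scPred")'

For what it's worth, eigenDecompose appears to belong to the older, pre-1.9 scPred API, so the version pinned inside that container is likely much older than 1.9.2.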

Could not find the config file

Hello!

I read your bioRxiv paper and was interested in taking a look at your snakemake pipeline.

I could not find your config file, though, and running snakemake yields this error:

KeyError in line 17 ./scRNAseq_Benchmark/Snakefile:
'tools_to_run'

Is it available somewhere?

Thanks,
Tiago
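
In case it helps, a minimal config.yml sketch; tools_to_run comes from the error above, but the other key names are assumptions based on the pipeline's inputs, so treat the README's configuration section as authoritative:

output_dir: results              # hypothetical paths throughout
datafile: data/Filtered_TM_data.csv
labfile: data/Labels.csv
folds: data/CV_folds.RData
column: 1
tools_to_run:
  - SVM
  - SVM_rejection

Passing it explicitly, as in snakemake --configfile config.yml, should resolve the KeyError.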

Train on one dataset and test on another

Great work on your cell identification paper!
How should I use

run_scPred('/TM/Filtered_TM_data.csv','/TM/Labels.csv','/TM/CV_folds.RData','/Results/TM/')

if I need to test the algorithm on separate training and testing datasets, the same way you did in your manuscript: training on one platform and testing on the other?
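
A possible approach, sketched under assumptions rather than taken from the authors' docs: concatenate the two datasets into one data file and one label file, then write a CV_folds.RData containing a single "fold" whose training indices cover dataset A and whose test indices cover dataset B (same assumed variable names as in the cross-species issue above; verify them against the repository's cross-validation script):

# hypothetical single-fold file: train on the first n_A rows, test on
# the rest (assumes datasets A and B were stacked in that order)
n_A <- 5000                     # placeholder: cells in training dataset
n_B <- 3000                     # placeholder: cells in test dataset
n_folds   <- 1
Train_Idx <- list(seq_len(n_A))
Test_Idx  <- list(n_A + seq_len(n_B))
save(n_folds, Train_Idx, Test_Idx, file = "CV_folds.RData")

run_scPred would then be called with this folds file, as in your example.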
