tabdelaal / scRNAseq_Benchmark
Benchmarking classification tools for scRNA-seq data
License: MIT License
Hi,
I am trying to run SVM and SVM_rejection using the snakemake and singularity pipeline on a new dataset. In the configuration file I have only listed the two tools.
When running snakemake, I get the list of the jobs that will be run:
Job counts:
count jobs
1 SVM
1 SVM_rejection
1 all
2 evaluate
1 generate_dropouts_feature_rankings
6
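For completeness, this is the shape of my configuration; the tools_to_run key name matches the Snakefile's lookup (the one that raises a KeyError when the key is missing), and any other required keys are omitted here:

```yaml
# config.yml -- sketch of the relevant part only; other required keys omitted
tools_to_run:
  - SVM
  - SVM_rejection
```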
Is it possible to disable the evaluate and generate_dropouts_feature_rankings jobs? I want this for two reasons:
1) I want to change the evaluation, so I am planning to manually run an updated evaluate.R on the results.
2) The generate_dropouts_feature_rankings rule fails on my data because the dataset is too big (snippet from the log):
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (58184, 11261) and data type float64
Thank you.
You've been more than helpful already, so I am sorry to keep bothering you, but I have come across one more bug that I just cannot figure out. It only manifested this afternoon; before that, everything was going well, even past this step. The problem is a Python module import failure when running one of your scripts in the corresponding container. To reproduce it, try this:
$ singularity shell docker://scrnaseqbenchmark/baseline:latest
Singularity> python3
>>> from sklearn import linear_model
I get this error:
/usr/local/lib/python3.5/dist-packages/scipy/__init__.py:115: UserWarning: Numpy 1.13.3 or above is required for this version of scipy (detected version 1.10.1)
UserWarning)
RuntimeError: module compiled against API version 0xb but this version of numpy is 0xa
Traceback (most recent call last):
File "rank_gene_dropouts.py", line 8, in <module>
from sklearn import linear_model
File "/usr/local/lib/python3.5/dist-packages/sklearn/__init__.py", line 76, in <module>
from .base import clone
File "/usr/local/lib/python3.5/dist-packages/sklearn/base.py", line 16, in <module>
from .utils import _IS_32BIT
File "/usr/local/lib/python3.5/dist-packages/sklearn/utils/__init__.py", line 13, in <module>
from scipy.sparse import issparse
File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/__init__.py", line 230, in <module>
from .csr import *
File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/csr.py", line 13, in <module>
from ._sparsetools import (csr_tocsc, csr_tobsr, csr_count_blocks,
ImportError: numpy.core.multiarray failed to import
I am completely stumped; running an import command inside a container should be the definition of reproducible, so what could either of us possibly be doing differently? If you have any ideas, I am happy to send any info you need.
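In case it helps with the diagnosis: the UserWarning says SciPy wants NumPy 1.13.3 or above but detected 1.10.1, which matches the C-API mismatch (0xb vs 0xa) in the RuntimeError, i.e. the image's NumPy is older than the one SciPy/scikit-learn were built against. A toy sketch of the version comparison involved (the two version strings come from the warning above; the parser itself is just illustrative):

```python
# Compare dotted version strings, as implied by the scipy warning above.
# Both versions are taken from the traceback; this parser is illustrative only.
def parse_version(v):
    """Turn '1.13.3' into (1, 13, 3) so tuples compare numerically."""
    return tuple(int(part) for part in v.split(".")[:3])

required = parse_version("1.13.3")  # minimum NumPy the scipy build asks for
detected = parse_version("1.10.1")  # NumPy actually present in the image

print(detected >= required)  # prints False: the image needs a newer NumPy
```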
I want to validate the results against the data you provided. However, the Python-based packages (DigitalCellSorter, Cell_BLAST, moana) depend on different versions of Python and of other Python packages. Could you please help me install them on my own computer? Did you install them with conda, and which versions were used for the three packages? Or could you please provide the installation scripts? Thanks!
Hi all,
Very nice work on this project, and I really enjoyed your "one rule to rule them all" allusion! I am having an issue with the gene ranking. Details are attached. Is there a way to either 1) get the gene ranking to work or 2) bypass it, since I am using all genes anyway?
Thank you!
P.S. Details: I am using this command with these snake- and config files, and I get this error log. I am running it from the Snakemake/ subdirectory, because when I tried running it from the root it could not find the Snakefile.
snakemake --configfile config.yml --use-singularity
Hi,
I have a rather big dataset, and the Python scripts (I have tried rank_gene_dropouts.py and run_SVM.py) fail while loading the data with:
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (58184, 11261) and data type float64
I have successfully fixed the issue by asking pandas to read the file in chunks, applying the log transformation to each chunk as it is read.
This is how the data loading looks, before and after, in these two Python files:
rank_gene_dropouts.py, before:
# read the data
data = pd.read_csv(DataPath, index_col=0, sep=',')
data = data.iloc[tokeep]
data = np.log2(data+1)
and after:
# read the data in chunks, log-transforming each chunk as it is read
data = pd.DataFrame()
for chunk in pd.read_csv(DataPath, index_col=0, sep=',', chunksize=1000):
    data = pd.concat([data, np.log2(chunk+1)], ignore_index=True)
data = data.iloc[tokeep]
run_SVM.py, before:
# read the data
data = pd.read_csv(DataPath, index_col=0, sep=',')
data = data.iloc[tokeep]
# normalize data
data = np.log1p(data)
and after:
# read and normalize the data in chunks
data = pd.DataFrame()
for chunk in pd.read_csv(DataPath, index_col=0, sep=',', chunksize=1000):
    data = pd.concat([data, np.log1p(chunk)], ignore_index=True)
data = data.iloc[tokeep]
It might be a good idea to update the python scripts to better handle bigger datasets.
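One refinement to the fix above: concatenating inside the loop copies the accumulated frame on every iteration, so for very large files it is faster to collect the transformed chunks in a list and concatenate once; downcasting to float32 roughly halves memory on top of that. A sketch under the same assumptions (comma-separated file, first column as index; the tiny inline CSV stands in for DataPath, and the fold indices are made up):

```python
import io

import numpy as np
import pandas as pd

# Stand-ins for DataPath and the fold indices; purely illustrative.
DataPath = io.StringIO("gene,c1,c2\ng1,0,3\ng2,7,1\ng3,2,2\n")
tokeep = [0, 2]

# Read, downcast, and log-transform one chunk at a time, then concat once.
chunks = [
    np.log2(chunk.astype(np.float32) + 1)
    for chunk in pd.read_csv(DataPath, index_col=0, sep=",", chunksize=1000)
]
data = pd.concat(chunks)
data = data.iloc[tokeep]
```

A single pd.concat over the list keeps the total work linear in the number of chunks, and it also preserves the original row index (the in-loop version with ignore_index=True discards it).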
In #7, you mention that the pipeline looks on Docker Hub for pre-built containers. If I want to add a container for another tool, such as ACTINN, how should I do that? I'm sorry to bother you, but I checked the README files and the Singularity docs about Docker, and nothing describes how to create a new Docker image on Docker Hub. Do you use Docker to create the images and Singularity to run them?
Hello! Thanks so much for making your code so well-documented and accessible. I'm interested in benchmarking a new cell type identification tool using your snakemake pipeline, but our computing cluster is having issues getting singularity installed. Is there a way to run the comparison tests without using singularity?
Hi, In your paper, it mentions that the Zheng et al. sorted PBMC data was subsampled down to 2k cells per cell type. I am wondering if the data (or cell barcodes) for these subsampled cells are provided or can be provided. I do not see this dataset on the Zenodo repository.
Thanks!
Matt
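If the subsampled barcodes turn out not to be archived, one way to redo the subsampling from the full sorted PBMC labels is sketched below; this is my own sketch, not necessarily how the authors did it, and the column names and toy table are assumptions (the paper's count is 2,000 per cell type, shrunk here so the toy data works):

```python
import pandas as pd

# Toy stand-in for the Zheng sorted PBMC label table; column names are assumed.
labels = pd.DataFrame({
    "barcode": [f"cell{i}" for i in range(10)],
    "cell_type": ["B"] * 6 + ["T"] * 4,
})

n_per_type = 3  # the paper used 2000 per cell type

# Sample at most n_per_type barcodes from each cell type, reproducibly.
subsampled = pd.concat([
    group.sample(n=min(len(group), n_per_type), random_state=0)
    for _, group in labels.groupby("cell_type")
])
```

A fixed random_state makes the draw reproducible, though it will not recover the authors' exact barcodes unless they used the same seed and procedure.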
Hi Dr. Abdelaal et al! Very impressive and useful work here. Does your code reproduce all of the figures, or just some of them? I am particularly interested in figure 4 (cross-species comparison).
Small request: link to the paper off of the main README for people finding this through GitHub. Thanks!
If I want to re-run your cross-species benchmarks, would the following tactic work?
MouseALM_HumanMTG_folds.RData would become
train=MouseALM__test=HumanMTG_folds.RData and train=HumanMTG__test=MouseALM_folds.RData.
Thank you!
P.S. I got the snakemake pipeline working with singularity and cluster configuration and everything! It's always hard to get snakemake configured on a new platform, but it's really useful for me to learn more about these tools.
Many thanks for your study and data!
I would like to reconstruct the original datasets from the merged ones you provide, but I have some trouble understanding the Statistics.xlsx file.
For example, in Inter-dataset/Pancreatic/statistics.xlsx, which indices should I take to reconstruct the full Xin dataset? And what about the Muraro dataset, which has two sets of training indices?
Thanks a lot for your help!
Hi, I am looking for a scatter plot of median F1 score vs. computation time for each method on a common dataset, perhaps Zheng sorted or Zheng 68K.
I could not find either the plot or a table of the actual computation-time values in the paper and supplementary figures. The closest was Supplementary Figure S13, which plots computation time vs. the number of cells.
Just wondering whether this information is available anywhere, and if not, whether you would be able to provide it?
I want to validate the results of scPred, and I am using the newest version of scPred (1.9.2). However, it didn't work: RStudio told me that several functions related to scPred, including eigenDecompose, are missing. So I want to know which version of scPred should be used, along with the rest of the configuration for this baseline!
Hello!
I read your bioRxiv paper and was interested in taking a look at your snakemake pipeline.
I could not find your config file, though, and running snakemake yields this error:
KeyError in line 17 ./scRNAseq_Benchmark/Snakefile:
'tools_to_run'
Is it available somewhere?
Thanks,
Tiago
Great work on your cell identification paper.
How should I use
run_scPred('/TM/Filtered_TM_data.csv','/TM/Labels.csv','/TM/CV_folds.RData','/Results/TM/')
if I need to test the algorithm on separate training and testing datasets, the same way you did in your manuscript (training on one platform and testing on the other)?