Cluster single cells and analyze cell clade relationships with colorful visualizations.

Home Page: https://gregoryschwartz.github.io/too-many-cells/

License: GNU General Public License v3.0

Haskell 97.93% R 0.38% Nix 1.69%

single-cell visualization single-cell-analysis single-cell-rna-seq bioinformatics-pipeline bioinformatics-algorithms

too-many-cells

Website

See https://github.com/GregorySchwartz/too-many-cells for latest version. See too-many-peaks for more information about scATAC-seq usage. See spatial for more information about spatial usage.

See the publication (and please cite!) for more information about the algorithm.

Description

too-many-cells is a suite of tools, algorithms, and visualizations focusing on the relationships between cell clades. This includes new ways of clustering, plotting, choosing differential expression comparisons, and more! While too-many-cells was intended for single cell RNA-seq, any abundance data in any domain can be used. Rather than opt for a unique positioning of each cell using dimensionality reduction approaches like t-SNE, UMAP, and LSA, too-many-cells recursively divides cells into clusters and relates clusters rather than individual cells. In fact, by recursively dividing until further dividing would be considered noise or random partitioning, we can eliminate noisy relationships at the fine-grain level. The resulting binary tree serves as a basis for a different perspective of single cells, using our birch-beer visualization and tree measures to describe simultaneously large and small populations, without additional parameters or runs. See below for a full list of features.

New features for v3.0.0.0

Added new spatial entry point for spatial analysis of cells! Can make interactive plots of the cells in-situ with their features as well as quantify spatial relationships between pairs of cells.
Overhauled the command line interface, so expect possible instability with the options. Open an issue at https://github.com/GregorySchwartz/too-many-cells/issues if you encounter any unexpected errors or behavior!
Added MinMaxNorm for min-max normalization and TransposeNorm to transpose the matrix to apply normalizations back and forth between axes. For instance, --normalization QuantileNorm --normalization TransposeNorm --normalization MinMaxNorm --normalization TransposeNorm will first apply quantile normalization to each cell, then min-max normalization to each column (before returning the cells to the proper axis with another transpose).
Incompatibility: Projection file format changed “barcode” column to “item”.
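
The transpose-normalize-transpose pattern from the normalization item above can be illustrated with a small Python sketch. This is not the too-many-cells implementation: the helper names and the toy matrix are hypothetical, and which axis too-many-cells normalizes by default is not specified here beyond that item.

```python
def min_max_rows(matrix):
    """Scale each row to [0, 1]; constant rows become all zeros."""
    out = []
    for row in matrix:
        lo, hi = min(row), max(row)
        span = hi - lo
        out.append([(v - lo) / span if span else 0.0 for v in row])
    return out


def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical toy matrix: rows are cells, columns are features.
cells = [[1.0, 10.0, 100.0],
         [3.0, 20.0, 300.0]]

# Min-max along the rows (one scale per cell).
per_cell = min_max_rows(cells)

# Transpose, min-max, transpose back: one scale per feature instead.
per_feature = transpose(min_max_rows(transpose(cells)))
```

Chaining a transpose before and after a normalization is what lets the same row-wise operation act on the other axis.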

New features for v2.2.0.0

--no-edger replaced with --edger as the default is now Kruskal-Wallis.
Can now use backgrounds for motifs.
Can specify motif for genome analysis (i.e. findMotifsGenome.pl from HOMER).
Temporary directories are now variables to correctly specify location.
Added q-values for differential.
Updated documentation for too-many-peaks.

New features for v2.0.0.0

Support for scATAC-seq for chromatin state relationships with too-many-peaks!
Find enriched regions as peaks for scATAC-seq with peaks.
Find motifs from differential chromatin state using motifs.
Linear relationships across the tree as pseudotime with paths.
Classify single-cell data from bulk with classify.
New dimensionality reductions with --lsa.
Output transformed matrix with matrix-output.
Bypass labels.csv with -Z quick labels.
MADs-from-median-based thresholds for multi-gene overlay plots
Multiple normalization application
And much more!

New features since initial launch

Now packaged for the functional package manager nix (Linux only)! No more dependency shuffling or root for Docker needed!
A new R wrapper was written to quickly get data to and from too-many-cells from R. Check it out here!
Now works with Cellranger 3.0 matrices in addition to Cellranger 2.0
Can prune (make into leaves) specified nodes with --custom-cut.
Can analyze sets of features averaged together (e.g. gene sets). Breaks API, so update your --draw-leaf "DrawItem (DrawContinuous \"Cd4\")" argument to --draw-leaf "DrawItem (DrawContinuous [\"Cd4\"])" (notice the list notation).
Outputs values from differential entry point plots (from --features), and can aggregate features by average.

Installation

We provide multiple ways to install too-many-cells. We recommend installing with nix. nix will provide all dependencies in the build, supports Linux, and should be reproducible, so try that first. We also have docker images and a Dockerfile to use in any system in case you have a custom build (for instance, a non-standard R installation) or difficulty installing. macOS and Windows users: too-many-cells was built and tested on Linux, so we highly recommend using the docker image (a completely isolated environment that requires no compiling or installation, other than docker itself), as there may be difficulties in installing the dependencies.

nix

too-many-cells can be installed using the functional package manager nix. While you will need sudo to install, no sudo is required after the correct setup. First, install nix following the instructions on the website. Then, with an unset LD_LIBRARY_PATH,

git clone https://github.com/GregorySchwartz/too-many-cells.git
cd too-many-cells
nix-env -f default.nix -i too-many-cells

Stack (unsupported in `too-many-cells >= v2.0.0.0`, use nix)

Dependencies

You may require the following dependencies to build and run (from Ubuntu 14.04, use the appropriate packages from your distribution of choice):

build-essential
libgmp-dev
libblas-dev
liblapack-dev
libgsl-dev
libgtk2.0-dev
libcairo2-dev
libpango1.0-dev
graphviz
r-base
r-base-dev

To install them, in Ubuntu:

sudo apt install build-essential libgmp-dev libblas-dev liblapack-dev libgsl-dev libgtk2.0-dev libcairo2-dev libpango1.0-dev graphviz r-base r-base-dev

too-many-cells also uses the following packages from R:

cowplot
ggplot2
edgeR
jsonlite

To install them in R,

install.packages(c("ggplot2", "cowplot", "jsonlite"))
install.packages("BiocManager")
BiocManager::install("edgeR")

Install `stack`

See https://docs.haskellstack.org/en/stable/README/ for more details.

curl -sSL https://get.haskellstack.org/ | sh
stack setup

Install `too-many-cells`

Source

Probably the easiest method if you don’t want to mess with dependencies (outside of the ones above).

git clone https://github.com/GregorySchwartz/too-many-cells.git
cd too-many-cells
stack install

Online

Only stack (or cabal) is required. You do not need to download any source code (but you might need the stack.yaml.old dependency versions). Run the following commands to place too-many-cells < v2.0.0.0 in your ~/.local/bin/:

mv stack.yaml.preV2 stack.yaml
stack install too-many-cells

If you run into errors like Error: While constructing the build plan, the following exceptions were encountered:, then follow the advice. Usually you just need to follow the suggestion and add the dependencies to the specified file. For a quick yaml configuration, refer to https://github.com/GregorySchwartz/too-many-cells/blob/master/stack.yaml.old.

macOS

We recommend using docker on macOS. The following is written for too-many-cells < v2.0.0.0. If you must compile too-many-cells, you should get the above dependencies. For some dependencies, you can use brew, then install too-many-cells (in the cloned folder, don’t forget to install the R dependencies above):

brew cask install xquartz
brew install glib cairo gtk gettext fontconfig freetype

brew tap brewsci/bio
brew tap brewsci/science
brew install r zeromq graphviz pkg-config gsl libffi gobject-introspection gtk+ gtk+3

# Needed so pkg-config and libraries can be found.
# For the second path, use the output of "brew info libffi".
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:/usr/local/opt/libffi/lib/pkgconfig

# Tell gtk that it's quartz
stack install --flag gtk:have-quartz-gtk

Docker

Different computers have different setups, operating systems, and repositories. To put the entire program in a container and bypass difficulties with the other methods above, we use docker. So first, install docker.

To get too-many-cells (replace 2.0.0.0 with any version needed):

docker pull gregoryschwartz/too-many-cells:2.0.0.0

To run too-many-cells in a docker container:

sudo docker run -it --rm -v "/home/username:/home/username" gregoryschwartz/too-many-cells:2.0.0.0 -h

Now you can follow the tutorial below with the addition of the docker paths and commands. If you add yourself to the docker group, sudo is not needed. For instance:

docker run -it --rm -v "/home/username:/home/username" \
    gregoryschwartz/too-many-cells:2.0.0.0 make-tree \
    --matrix-path /home/username/path/to/input \
    --labels-file /home/username/path/to/labels.csv \
    --draw-collection "PieRing" \
    --output /home/username/path/to/out \
    > clusters.csv

Make sure to increase the memory that can be used by docker containers if you use macOS or Windows. Also, docker won’t be able to find your files by default. You need to mount the folders with -v in order for docker to read from and write to the filesystem. Read the documentation about volumes for more information. You can mount just the relevant paths to handle both input and output, or mount your entire user directory as in the above example. Specifically, use -v "/home/username:/home/username" for the whole directory, or an individual -v /path/to/matrix/on/host:/input_matrix with -m /input_matrix, where the path before the : is on the host filesystem and the path after the : is what the docker program sees. You can write the output in the same way: -v /path/to/output/on/host:/output will write the output to the host folder before the :.

To build the too-many-cells image yourself if you want:

nix-build docker.nix
docker load < /nix/store/${NAME_OF_OUTPUT_IMAGE}.tar.gz

Troubleshooting

Using nix, I’m getting shared object not found errors.

Be sure to have LD_LIBRARY_PATH unset when running nix-env to make sure the linked libraries are in /nix/store.

I am getting errors like `AesonException "Error in $.packages.cassava.constraints.flags...` when running `stack` commands

Try upgrading stack with stack upgrade. The new installation will be in ~/.local/bin, so use that binary.

I use conda or custom ld library locations and I cannot install `too-many-cells` or run into weird R errors

stack and too-many-cells assume system libraries and programs. To solve this issue, first install the dependencies above at the system level, including system R. Then prepend PATH="$HOME/.local/bin:/usr/bin:$PATH" to every stack and too-many-cells command. For instance:

PATH="$HOME/.local/bin:/usr/bin:$PATH" stack install
PATH="$HOME/.local/bin:/usr/bin:$PATH" too-many-cells make-tree -h

If your shared libraries are abnormal and use libR.so from non-system locations, be sure to also have LD_LIBRARY_PATH=/usr/lib/:$LD_LIBRARY_PATH when installing (and / or the location of R libraries, such as /usr/local/lib/R/lib/).

I am still having issues with installation

Open an issue! While working on the issue, try out the docker for too-many-cells, it requires no installation at all (other than docker).

I am on macOS/Windows with docker and `too-many-cells` silently crashes.

Docker containers may run into this issue if the memory given to the containers is insufficient. Make sure to increase the memory that can be used by docker containers.

I am getting the error `--draw-leaf` cannot be read, but I copied the command!

For some computers, you may need to change the command to single quotations for the argument: --draw-leaf 'DrawItem (DrawContinuous ["Cd4"])'

Included projects

This project is a collection of libraries and programs written specifically for too-many-cells:

birch-beer: Generate a tree for displaying a hierarchy of groups with colors, scaling, and more.
modularity: Find the modularity of a network.
spectral-clustering: Library for spectral clustering.
hierarchical-spectral-clustering: Hierarchical spectral clustering of a graph.
differential: Finds out whether an entity comes from different distributions (statuses).

Usage

too-many-cells has several entry points depending on the desired analysis.

Argument	Analysis
`make-tree`	Generate the tree from single cell data with various measurement outputs and visualize tree
`interactive`	Interactive visualization of the tree, very slow
`differential`	Find differentially expressed features between two nodes
`diversity`	Conduct diversity analyses of multiple cell populations
`paths`	The binary tree equivalent of the so called “pseudotime”, or 1D dimensionality reduction

The main workflow is to first generate and plot the population tree using too-many-cells make-tree, then use the rest of the entry points as needed.

At any point, use -h to see the help of each entry point.

Also, check out tooManyCellsR for an R wrapper!

`make-tree`

too-many-cells make-tree generates a binary tree using hierarchical spectral clustering. We start with all cells in a single node. Spectral clustering partitions the cells into two groups. We assess the clustering using Newman-Girvan modularity: if $Q > 0$ then we recursively continue with hierarchical spectral clustering. If not, then there is only a single community and we do not partition – the resulting node is a leaf and is considered the finest-grain cluster.
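
To make the stopping rule concrete, here is a minimal Python sketch of Newman-Girvan modularity for a given partition of a weighted undirected graph. It only illustrates the $Q > 0$ criterion described above. It is not the too-many-cells code, and how too-many-cells builds its similarity graph from the feature matrix is not covered here. The function name and the toy graph are hypothetical.

```python
def modularity(adjacency, communities):
    """Newman-Girvan modularity Q of a partition.

    adjacency: symmetric square list of lists of edge weights.
    communities: one community id per node.
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)  # equals 2m for an undirected graph
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                # Observed weight minus the weight expected by chance.
                q += adjacency[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1

q_split = modularity(adj, [0, 0, 0, 1, 1, 1])  # one community per triangle
q_single = modularity(adj, [0] * 6)            # everything in one community
```

Splitting the two triangles gives a positive Q, so under the rule above the split would be kept and each side divided further. Leaving everything in a single community gives a Q of zero.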

The most important argument is the --prior argument. Making the tree may take some time, so if the tree was already generated and other analysis or visualizations need to be run on the tree, point the --prior argument to the output folder from a previous run of too-many-cells. If you do not use --prior, the entire tree will be recalculated even if you just wanted to change the visualization!

The main input is the --matrix-path argument. When a directory is supplied, too-many-cells interprets the folder to have matrix.mtx, genes.tsv, and barcodes.tsv files (cellranger outputs, see cellranger for specifics). If a file is supplied instead of a directory, we assume a csv file containing feature row names and cell column names. This argument can be called multiple times to combine multiple single cell matrices: --matrix-path input1 --matrix-path input2.

The second most important argument is --labels-file. Supply with a csv with a format and header of “item,label” to provide colorings and statistics of the relationships between labels. Here the “item” column contains the name of each cell (barcode) and the label is any property of the cell (the tissue of origin, hour in a time course, celltype, etc.). You can also now use -Z as a list for each matching -m in order to manually give the entire matrix that label (useful for situations like -m ./t-all -Z T-ALL -m ./control -Z Control). To get the newly generated labels with -Z into a labels.csv file, specify --labels-output and the labels.csv will be in the output folder.

To see the full list of options, use too-many-cells -h and -h for each entry point (i.e. too-many-cells make-tree -h).

Output

too-many-cells make-tree generates several files in the output folder. Below is a short description of each file.

File	Description
`clumpiness.csv`	When labels are provided, uses the clumpiness measure to determine the level of aggregation between each label within the tree.
`clumpiness.pdf`	When labels are provided, a figure of the clumpiness between labels.
`cluster_diversity.csv`	When labels are provided, the diversity, or “effective number of labels”, of each cluster.
`cluster_info.csv`	Various bits of information for each cluster and the path leading up to each cluster, from that cluster to the root. For instance, the `size` column has `cluster_size/parent_size/parent_parent_size/.../root_size`
`cluster_list.json`	The `json` file containing a list of clusterings.
`cluster_tree.json`	The `json` file containing the output tree in a recursive format.
`dendrogram.svg`	The visualization of the tree. There are many possible options for this visualization included. Can rename to choose between PNG, PS, PDF, and SVG using `--dendrogram-output`.
`graph.dot`	A `dot` file of the tree, with less information than the tree in `cluster_results.json`.
`node_info.csv`	Various information of each node in the tree.
`projection.pdf`	When `--projection` is supplied with a file of the format “item,x,y” (“barcode,x,y” before v3.0.0.0), provides a plot of each cell at the specified x and y coordinates (for instance, when looking at t-SNE plots with the same labelings as the dendrogram here).

Outline with options

The basic outline of the default matrix pre-processing pipeline with some relevant options is as follows (there are many additional options including cell whitelists that can be seen using too-many-cells make-tree -h):

Read matrix.
Optionally remove cells with fewer than X counts (--filter-thresholds).
Optionally remove features with fewer than X counts (--filter-thresholds).
Term frequency-inverse document frequency normalization (--normalization).
Optionally use dimensionality reduction (--lsa).
Finish.
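
The filtering and normalization steps of the outline can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the too-many-cells implementation: rows are cells and columns are features, the function names are hypothetical, and the TF-IDF formula shown is one common variant (the exact variant used by too-many-cells is not stated in this document).

```python
import math


def filter_matrix(cells, min_cell, min_feature):
    """Drop cells whose total count is below min_cell, then features whose
    total count across the remaining cells is below min_feature."""
    kept_cells = [row for row in cells if sum(row) >= min_cell]
    if not kept_cells:
        return [], []
    n_features = len(kept_cells[0])
    kept_features = [j for j in range(n_features)
                     if sum(row[j] for row in kept_cells) >= min_feature]
    return [[row[j] for j in kept_features] for row in kept_cells], kept_features


def tf_idf(cells):
    """One common TF-IDF variant: tf = count / cell total,
    idf = log(n_cells / number of cells containing the feature)."""
    n_cells = len(cells)
    n_features = len(cells[0])
    doc_freq = [sum(1 for row in cells if row[j] > 0) for j in range(n_features)]
    out = []
    for row in cells:
        total = sum(row)
        out.append([(row[j] / total) * math.log(n_cells / doc_freq[j])
                    if total and doc_freq[j] else 0.0
                    for j in range(n_features)])
    return out


# Hypothetical toy counts: three cells, three features.
counts = [[5, 0, 1],
          [0, 0, 1],
          [3, 0, 0]]
filtered, kept = filter_matrix(counts, min_cell=2, min_feature=1)
weights = tf_idf(filtered)
```

Here the second cell and the second feature are removed by the thresholds. The feature present in every remaining cell receives a weight of zero, which is the usual effect of the inverse document frequency term.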

Example

Setup

We start with our input matrix. Here,

ls ./input

barcodes.tsv  genes.tsv  matrix.mtx

Note that the input can be a directory (with the cellranger matrix format above) or a file (a csv file). You can also point to a cellranger >= 3.0 folder which has matrix.mtx.gz, features.tsv.gz, and barcodes.tsv.gz files instead. You don’t need to use scRNA-seq data! You can use any data that has observations (cells) and features (genes), as long as you agree that the observations are related by their feature abundances. If you do upstream batch effect correction, LSA, normalization, or anything else, be sure to use --normalization NoneNorm (and --shift-positive for LSA) to avoid wrong filters and scalings! If using dimensionality reduction such as PCA and t-SNE, we highly recommend generating your own similarity matrix for use with our cluster-tree program and plot with birch-beer, as we emphasize a feature matrix in too-many-cells and dimensionality reduction algorithms transform counts (our input which works with cosine similarity) into more nebulous information (which may not work with cosine similarity). cluster-tree, however, can be used with adjacency and similarity matrices. As for formats, the matrix market format contains three files like so:

The matrix.mtx file is in matrix market format.

%%MatrixMarket matrix coordinate integer general
%
23433 1981 4255069
4 1 1
5 1 1
11 1 2
23 1 2
25 1 2
40 1 2
48 1 1
...
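
For readers unfamiliar with the layout, the following Python sketch parses a coordinate-format matrix market file like the one above. It is only an illustration of the format (the function name and toy input are hypothetical), not how too-many-cells reads its input. The first non-comment line gives the number of rows (features), columns (cells), and non-zero entries. Each following line is a 1-based row index, a 1-based column index, and a value.

```python
def read_matrix_market(lines):
    """Parse coordinate-format Matrix Market text into (rows, cols, entries).

    entries maps (row, col) -> value using the file's 1-based indices.
    """
    # Skip blank lines and the %-prefixed header and comment lines.
    body = [ln.strip() for ln in lines
            if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in body[0].split())
    entries = {}
    for ln in body[1:]:
        row, col, value = ln.split()
        entries[(int(row), int(col))] = int(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries


# Hypothetical toy file: 3 features, 2 cells, 3 non-zero counts.
sample = """%%MatrixMarket matrix coordinate integer general
%
3 2 3
1 1 4
3 1 2
2 2 7
""".splitlines()

n_features, n_cells, counts = read_matrix_market(sample)
```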

The genes.tsv file (or features.tsv.gz) contains the features of each cell and corresponds to the rows of matrix.mtx. Here, both columns were the same gene symbols, but you can have Ensembl as the first column and gene symbol as the second, etc. The columns and column orders don’t matter, but make sure all matrices have the same format and specify the symbols you want to use (for overlaying gene expression, differential expression, etc.) with --feature-column COLUMN. So to use the second column for gene expression, you would use --feature-column 2.

Xkr4	Xkr4
Rp1	Rp1
Sox17	Sox17
Mrpl15	Mrpl15
Lypla1	Lypla1
Tcea1	Tcea1
Rgs20	Rgs20
Atp6v1h	Atp6v1h
Oprk1	Oprk1
Npbwr1	Npbwr1
...

The barcodes.tsv file contains the ids of each cell or observation and corresponds to the columns of matrix.mtx.

AAACCTGCAGTAACGG-1
AAACGGGAGAAGAAGC-1
AAACGGGAGACCGGAT-1
AAACGGGAGCGCTCCA-1
AAACGGGAGGACGAAA-1
AAACGGGAGGTACTCT-1
AAACGGGAGGTGCTTT-1
AAACGGGAGTCGAGTG-1
AAACGGGCATGGTCAT-1
AAAGATGAGCTTCGCG-1
...

For a csv file, the format is dense (observation columns (cells), feature rows (genes)):

"","A22.D042044.3_9_M.1.1","C5.D042044.3_9_M.1.1","D10.D042044.3_9_M.1.1","E13.D042044.3_9_M.1.1","F19.D042044.3_9_M.1.1","H2.D042044.3_9_M.1.1","I9.D042044.3_9_M.1.1",...
"0610005C13Rik",0,0,0,0,0,0,0,...
"0610007C21Rik",0,112,185,54,0,96,42,...
"0610007L01Rik",0,0,0,0,0,153,170,...
"0610007N19Rik",0,0,0,0,0,0,0,...
"0610007P08Rik",0,0,0,0,0,19,0,...
"0610007P14Rik",0,58,0,0,255,60,0,...
"0610007P22Rik",0,0,0,0,0,65,0,...
"0610008F07Rik",0,0,0,0,0,0,0,...
"0610009B14Rik",0,0,0,0,0,0,0,...
...

We also know where each cell came from, so we mark that down as well in a labels.csv file.

item,label
AAACCTGCAGTAACGG-1,Marrow
AAACGGGAGACCGGAT-1,Marrow
AAACGGGAGCGCTCCA-1,Marrow
AAACGGGAGGACGAAA-1,Marrow
AAACGGGAGGTACTCT-1,Marrow
...

This can be easily accomplished with sed:

cat barcodes.tsv | sed "s/-1/-1,Marrow/" | s/-2/etc... > labels.csv

For cellranger, note that the -1, -2, etc. postfixes denote the first, second, etc. label in the aggregation csv file used as input for cellranger aggr.
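
As an alternative to sed, a short Python sketch can build the same file, including the “item,label” header. The function name and the suffix-to-label mapping are hypothetical ("Thymus" for suffix 2 is an illustration only). Substitute the labels matching your own aggregation csv.

```python
import csv
import io


def make_labels(barcodes, suffix_to_label):
    """Return 'item,label' CSV text for barcodes with cellranger-style suffixes."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["item", "label"])
    for barcode in barcodes:
        # The suffix is whatever follows the last "-" in the barcode.
        suffix = barcode.rsplit("-", 1)[-1]
        writer.writerow([barcode, suffix_to_label[suffix]])
    return out.getvalue()


# Hypothetical mapping from aggregation suffix to label.
labels = make_labels(
    ["AAACCTGCAGTAACGG-1", "AAACGGGAGAAGAAGC-2"],
    {"1": "Marrow", "2": "Thymus"},
)
```

Writing the returned text to labels.csv gives a file usable with --labels-file.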

Default run

We can now run the too-many-cells algorithm on our data. The resulting cells with assigned clusters will be printed to stdout (don’t forget to use --normalization NoneNorm on preprocessed data, as stated here). While older versions had default filter thresholds for (MINCELL, MINFEATURE) counts, since v2.0.0.0 the default is now no filtering to account for multiple assay types.

too-many-cells make-tree \
    --matrix-path input \
    --labels-file labels.csv \
    --filter-thresholds "(250, 1)" \
    --draw-collection "PieRing" \
    --output out \
    > clusters.csv

Pruning tree

Large cell populations can result in a very large tree. What if we only want to see larger subpopulations rather than the large (inner nodes) and small (leaves)? We can use the --min-size 100 argument to set the minimum size of a leaf to 100 in this case. Alternatively, we can specify --smart-cutoff 4 in addition to --min-size 1 to set the minimum size of a node to $4 * \text{median absolute deviation (MAD)}$ of the nodes in the original tree. Varying the number of MADs varies the number of leaves in the tree. --smart-cutoff should be used in addition to --min-size, --max-proportion, --min-distance, or --min-distance-search to decide which cutoff variable to use. The value supplied to the cutoff variable is ignored when --smart-cutoff is specified. We’ll prune the tree for better visibility in this document.
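
As a rough illustration of the MAD-based cutoff, the Python sketch below computes the median absolute deviation of a list of node sizes and scales it by the number of MADs, following the description above literally. The node sizes and function names are hypothetical, and whether too-many-cells adds the median as an offset to this value is not stated here, so treat the result as a sketch of the idea only.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(values)|."""
    values = list(values)
    center = median(values)
    return median(abs(v - center) for v in values)


def smart_min_size(node_sizes, n_mads):
    """Cutoff as described in the text: n_mads * MAD of the node sizes."""
    return n_mads * mad(node_sizes)


sizes = [1, 2, 2, 4, 8, 16, 400]  # hypothetical node sizes
cutoff = smart_min_size(sizes, 4)
```

Because the MAD is based on medians, the single very large node in this toy example barely influences the cutoff.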

Note: the pruning arguments change the tree file, not just the plot, so be sure to output into a different directory.

Also, we do not need to recalculate the entire tree! We can just supply the previous results using --prior (we can also remove --matrix-path with --prior to speed things up, but miss out on some features if needed):

too-many-cells make-tree \
    --prior out \
    --labels-file labels.csv \
    --smart-cutoff 4 \
    --min-size 1 \
    --draw-collection "PieRing" \
    --output out_pruned \
    > clusters_pruned.csv

Pie charts

What if we want pie charts instead of showing each individual cell (the default)?

too-many-cells make-tree \
    --prior out \
    --labels-file labels.csv \
    --smart-cutoff 4 \
    --min-size 1 \
    --draw-collection "PieChart" \
    --output out_pruned \
    > clusters_pruned.csv

Node numbering

Now that we see the relationships between clusters and nodes in the dendrogram, how can we go back to the data – which nodes represent which node IDs in the data?

too-many-cells make-tree \
    --prior out \
    --labels-file labels.csv \
    --smart-cutoff 4 \
    --min-size 1 \
    --draw-collection "PieChart" \
    --draw-node-number \
    --output out_pruned \
    > clusters_pruned.csv

Branch width

We can also change the width of the nodes and branches, for instance if we want thinner branches:

too-many-cells make-tree \
    --prior out \
    --labels-file labels.csv \
    --smart-cutoff 4 \
    --min-size 1 \
    --draw-collection "PieChart" \
    --draw-max-node-size 40 \
    --output out_pruned \
    > clusters_pruned.csv

No scaling

We can remove all scaling for a normal tree and still control the branch widths:

too-many-cells make-tree \
    --prior out \
    --labels-file labels.csv \
    --smart-cutoff 4 \
    --min-size 1 \
    --draw-collection "PieChart" \
    --draw-max-node-size 40 \
    --draw-no-scale-nodes \
    --output out_pruned \
    > clusters_pruned.csv

How strong is each split? We can tell by drawing the modularity of the children on top of each node:

too-many-cells make-tree \
    --prior out \
    --labels-file labels.csv \
    --smart-cutoff 4 \
    --min-size 1 \
    --draw-collection "PieChart" \
    --draw-mark "MarkModularity" \
    --output out_pruned \
    > clusters_pruned.csv

Gene expression

What if we want to draw the gene expression onto the tree in another folder (requires --matrix-path, may take some time depending on matrix size. Defaults to all black if the feature name is not present in the matrix, so check the first column of the feature file)? Note: the feature names are from the genes.tsv or features.tsv.gz file. Usually, cellranger has Ensembl identifiers as the first column and gene symbol as the second column, so if you want to specify gene symbol, use --feature-column 2 (1 is default).

too-many-cells make-tree \
    --prior out \
    --matrix-path input \
    --labels-file labels.csv \
    --filter-thresholds "(250, 1)" \
    --smart-cutoff 4 \
    --min-size 1 \
    --feature-column 2 \
    --draw-leaf "DrawItem (DrawContinuous [\"Cd4\"])" \
    --output out_gene_expression \
    > clusters_pruned.csv

Notice that Cd4 is within a list ([]), so multiple features can be listed and the average of those values for each cell will be used. While this representation shows the expression of Cd4 in each cell and blends those levels together, due to the sparsity of single cell data these cells and their respective subtrees may be hard to see without additional processing. Let’s scale the saturation to more clearly see sections of the tree with our desired expression (when choosing other high and low colors with --draw-colors, scaling the saturation will only affect non-grayscale colors).

too-many-cells make-tree \
    --prior out \
    --matrix-path input \
    --labels-file labels.csv \
    --filter-thresholds "(250, 1)" \
    --smart-cutoff 4 \
    --min-size 1 \
    --feature-column 2 \
    --draw-leaf "DrawItem (DrawContinuous [\"Cd4\"])" \
    --draw-scale-saturation 10 \
    --output out_gene_expression \
    > clusters_pruned.csv

There, much better! Now it’s clearly enriched in the subtree containing the thymus, where we would expect many T cells to be. While this tree makes the expression a bit more visible, there is another tactic we can use. Instead of the continuous color spectrum of expression values, we can have a binary “high” and “low” expression. Here, we’ll continue to have the red and gray colors represent high and low expressions respectively using the --draw-colors argument. Note that this binary expression technique can be used for multiple features, hence it’s a list of features with cutoffs (Exact for specified cutoffs or MadMedian for how many MADs from the median) so you can be high in a gene and low in another gene, etc. for all possible combinations.

too-many-cells make-tree \
    --prior out \
    --matrix-path input \
    --labels-file labels.csv \
    --filter-thresholds "(250, 1)" \
    --smart-cutoff 4 \
    --min-size 1 \
    --feature-column 2 \
    --draw-leaf "DrawItem (DrawThresholdContinuous [(\"Cd4\", Exact 0), (\"Cd8a\", Exact 0)])" \
    --draw-colors "[\"#e41a1c\", \"#377eb8\", \"#4daf4a\", \"#eaeaea\"]" \
    --draw-scale-saturation 10 \
    --output out_gene_expression \
    > clusters_pruned.csv

Now we can see the expression of both Cd4 and Cd8a at the same time!

Diversity

We can also see an overview of the diversity of cell labels within each subtree and leaves.

too-many-cells make-tree \
    --prior out \
    --matrix-path input \
    --filter-thresholds "(250, 1)" \
    --labels-file labels.csv \
    --smart-cutoff 4 \
    --min-size 1 \
    --draw-leaf "DrawItem DrawDiversity" \
    --output out_diversity \
    > clusters_pruned.csv

Here, the deeper the red, the more diverse (a larger “effective number of cell states”) the cell labels in that group are. Note that the inner nodes are colored relative to themselves, while the leaves are colored relative to all leaves, so there are two different scales.

`interactive`

The interactive entry point has a basic GUI interface for quick plotting with a few features. We recommend limited use of this feature, however, as it can be quite slow at this stage, has fewer customizations, and requires specific dependencies.

too-many-cells interactive \
    --prior out \
    --labels-file labels.csv

`differential`

A main use of single cell clustering is to find differential genes between multiple groups of cells. The differential entry point aids in this endeavor by comparing groups of cells, by default with Kruskal-Wallis since v2.2.0.0, or with edgeR via --edger. Let’s find the differential genes between the liver group and all other cells. Consider our pruned tree from earlier:

We can see the id of each group with --draw-node-number.

We need to define two groups to compare. Well, it looks like node 98 defines the liver cluster. Then, since we don’t want 98 to be in the other group, we say that all other cells are within nodes 89 and 1. As a result, we end up with a tuple containing two lists: ([89, 1], [98]). Then our differential genes for (liver / others) can be found with differential (sent to stdout):

too-many-cells differential \
    --matrix-path input \
    --prior out_pruned \
    --filter-thresholds "(250, 1)" \
    -n "([89, 1], [98])" \
    > differential.csv

If we wanted to make the same comparison, but compare the liver subtree with liver cells from all other subtrees, we can use the --labels argument:

too-many-cells differential \
    --matrix-path input \
    --prior out_pruned \
    --labels-file labels.csv \
    --filter-thresholds "(250, 1)" \
    -n "([89, 1], [98])" \
    --labels "([\"Liver\"], [\"Liver\"])" \
    > differential_liver.csv

We can also look at the distribution of abundance for individual genes using the --features and --plot-output arguments.

Furthermore, we can compare each node to all other cells by specifying no nodes at all. The output file will contain the top --top-n genes for each node. We recommend using multiple OS threads here to speed up the process using +RTS -N${NUMOSTHREADS} (no number to use all cores). The following example will compare all nodes to all other cells using 8 OS threads:

too-many-cells differential \
    --matrix-path input \
    --prior out_pruned \
    --filter-thresholds "(250, 1)" \
    -n "([], [])" \
    --normalization "UQNorm" \
    +RTS -N8

`diversity`

Diversity is the measure of the “effective number of entities within a system”, originating from ecology (See Jost: Entropy and Diversity). Here, each cell is an organism and each cell label or cluster is a species, depending on the question. In ecology, the diversity index measures the effective number of species within a population such that the minimum is a diversity of 1 for a single dominant species up to maximum of the total number of species (evenly abundant). If our species is a cluster, then here the diversity is the effective number of cell states within a population (for labels, make-tree generates these results automatically in “diversity” columns). Say we have two populations and we generated the trees using make-tree into two different output folders, out1 and out2. We can find the diversity of each population using the diversity entry point.

too-many-cells diversity \
    --priors out1 \
    --priors out2 \
    -o out_diversity_stats

We can then find a simple plot of diversity in the output folder. In addition, we provide rarefaction curves, which compare the number of different cell states at each subsampling depth and are useful when the population sizes differ.
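
The “effective number” idea can be made concrete with a small Python sketch of the diversity index from the ecology literature (Hill numbers). This is an illustration only: the function name is hypothetical, and the order used by too-many-cells is not stated in this document, so order 1 (the exponential of Shannon entropy) is shown as the default purely for the example.

```python
import math
from collections import Counter


def diversity(labels, order=1.0):
    """Effective number of labels (Hill number) of the given order."""
    counts = Counter(labels)
    total = sum(counts.values())
    props = [c / total for c in counts.values()]
    if math.isclose(order, 1.0):
        # Limit as order -> 1: exponential of Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** order for p in props) ** (1.0 / (1.0 - order))


single = diversity(["a"] * 5)                 # one dominant label
even = diversity(["a", "b", "c", "d"])        # four evenly abundant labels
skewed = diversity(["a"] * 9 + ["b"])         # two labels, one rare
```

A single label gives a diversity of 1, four evenly abundant labels give 4, and the skewed case falls between 1 and 2, matching the minimum and maximum described above.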

`paths`

“Pseudotime” refers to the one dimensional relationship between cells, useful for looking at the ordering of cell states or labels. The implementation of pseudotime in a too-many-cells point-of-view is by finding the distance between all cells and the cells found in the longest path from the root in the tree. Then each cell has a distance from the “start” and thus we plot those distances.

too-many-cells paths\
    --prior out \
    --labels-file labels.csv \
    --bandwidth 3 \
    -o out_paths
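As a rough illustration of the idea, the sketch below takes a small tree, treats the deepest leaf (the end of the longest path from the root) as the “start”, and measures how many edges separate every other leaf from it. This is only one possible reading of the description above: the tree encoding, the function names, and the use of unweighted edge counts are all assumptions, and the actual paths implementation may differ.

```python
def parents_and_depths(children, root):
    """Walk the tree once, recording each node's parent and depth."""
    parent, depth = {root: None}, {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            parent[child] = node
            depth[child] = depth[node] + 1
            stack.append(child)
    return parent, depth


def tree_distance(a, b, parent, depth):
    """Number of edges between nodes a and b, through their common ancestor."""
    dist = 0
    while a != b:
        # Always step up from the deeper node (or from a on ties).
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
        dist += 1
    return dist


def leaf_distances(children, root):
    """Distance of every leaf from the deepest leaf (the assumed 'start')."""
    parent, depth = parents_and_depths(children, root)
    leaves = [node for node in depth if not children.get(node)]
    start = max(leaves, key=lambda node: depth[node])
    return {leaf: tree_distance(leaf, start, parent, depth) for leaf in leaves}


# Hypothetical binary tree: node 0 is the root; leaves are 2, 4, 5 and 6.
children = {0: [1, 2], 1: [3, 4], 3: [5, 6]}
print(leaf_distances(children, 0))
```

Cells would inherit the distance of the leaf they belong to, giving the one-dimensional ordering that is then plotted.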

Working with scATAC-seq data using `too-many-peaks`

For more information, check out the too-many-peaks walkthrough.

scATAC-seq is a powerful technology for quantifying chromatin accessibility for individual cells. too-many-cells now supports scATAC-seq to generate cell clade relationships from chromatin state information through too-many-peaks. All of the previous analyses used with gene-product features now work with genomic regions in the form chrN:START-END, where N is the chromosome number, START is the start of the region and END is the end base of the region.

Matrices in this format can be read from either CSV or matrix-market as above but with the correctly formatted features, or you can load in directly from a fragments.tsv.gz file in Cellranger format (tab delimited with each row being chrN\tSTART\tEND\tBARCODE\tCOUNT) making sure that the filename contains the fragments ending, such as t-all_fragments.tsv.gz. For example:

too-many-cells make-tree\
    -m ./t-all_fragments.tsv.gz \
    -Z "T-ALL" \
    -m ./control_fragments.tsv.gz \
    -Z "Control" \
    --filter-thresholds "(1000, 1)" \
    --binwidth 5000 \
    --lsa 50 \
    --normalization NoneNorm \
    --blacklist-regions-file Anshul_Hg19UltraHighSignalArtifactRegions.bed.gz \
    --draw-node-number \
    --draw-mark "MarkModularity" \
    --fragments-output \
    --labels-output \
    -o out \
    > out_leaves.csv

Note: We use --lsa and --normalization NoneNorm for latent semantic analysis dimensionality reduction as there are many features in scATAC-seq, so we try to overcome a potential issue where all cells are considered outliers. To blacklist known biased regions in the genome, we can call --blacklist-regions-file. The --fragments-output and --labels-output go hand-in-hand with -Z in order to keep the renamed barcodes and labels (found in the output folder). too-many-cells will binarize the data by default unless --no-binarize is specified. Lastly, we choose a binwidth using --binwidth to conform to a set of standard features across cells and samples.
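To illustrate what binning and binarizing mean for a fragments file, here is a minimal Python sketch that turns rows of chrN\tSTART\tEND\tBARCODE\tCOUNT into chrN:START-END features of a fixed width. It is a sketch under stated assumptions rather than the too-many-cells implementation: the function names are hypothetical, and assigning each fragment to the bin that contains its start position is a simplification chosen for the example.

```python
import csv
import gzip
import io
from collections import defaultdict


def bin_fragments(rows, binwidth=5000, binarize=True):
    """Map (chrom, start, end, barcode, count) rows to {barcode: {feature: value}}."""
    matrix = defaultdict(lambda: defaultdict(int))
    for chrom, start, _end, barcode, count in rows:
        # Assumption: a fragment belongs to the bin containing its start.
        bin_start = (int(start) // binwidth) * binwidth
        feature = f"{chrom}:{bin_start}-{bin_start + binwidth}"
        matrix[barcode][feature] += int(count)
    if binarize:
        # Keep only presence/absence of signal in each bin.
        return {b: {f: 1 for f in feats} for b, feats in matrix.items()}
    return {b: dict(feats) for b, feats in matrix.items()}


def read_fragments(path):
    """Yield rows from a gzipped, tab-delimited fragments file."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.reader(handle, delimiter="\t")


# Hypothetical in-memory rows standing in for a real fragments.tsv.gz.
example = io.StringIO(
    "chr1\t100\t250\tAAAC-1\t2\n"
    "chr1\t4900\t5100\tAAAC-1\t1\n"
    "chr1\t5200\t5400\tTTTG-1\t3\n"
)
rows = list(csv.reader(example, delimiter="\t"))
print(bin_fragments(rows, binwidth=5000))
```

Because every sample is mapped onto the same fixed-width bins, cells from different fragments files end up with a shared feature set.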

`peaks`

With scATAC-seq, we want to identify enriched locations in the genome for each newly found subpopulation of cells. The peaks entrypoint can collect the appropriate fragments for quantification and visualization of peaks.

too-many-cells peaks \
    -f ./out/fragments.tsv.gz \
    --prior ./out \
    --genome human.hg19.genome \
    --bedgraph \
    --labels-file ./out/labels.csv \
    --all-nodes \
    --peak-node "1" \
    --peak-node "5" \
    --peak-node-labels "(1, [\"Control\"])" \
    --peak-node-labels "(5, [\"T-ALL\"])" \
    -o out_peaks \
    +RTS -N6

Here, we will have our peaks in the specified output folder, along with many other files and folders:

File	Description
`out_peaks/cluster_fragments`	`fragments.tsv.gz` files for each node.
`out_peaks/cluster_bedgraphs`	Bedgraphs and bigwigs if specified using `--bedgraph` for track visualization uses.
`out_peaks/cluster_peaks/union.bdg`	Merged peaks across all requested nodes in bedgraph format.
`out_peaks/cluster_peaks/union_fragments.tsv.gz`	Merged peaks across all requested nodes in `fragments.tsv.gz` format.
`out_peaks/cluster_peaks/`	Folder containing merged peaks across nodes and peaks for each individual node in each folder.

--bedgraph enabled the cluster_bedgraphs folder, while --all-nodes requested peaks for all nodes, not just the leaves. However, when paired with --peak-node, we just look at the peaks for each node in the list (but --all-nodes is still required if looking at non-leaf nodes as well). Without --peak-node, this command would have found peaks for every node. Furthermore, --peak-node-labels filters cells in the requested node by their label. --genome tells the peak finding program where the genome file is (containing the effective genome sizes of chromosomes in the tab-delimited format chrN\tSIZE used in the MACS2 program). Here, the -f fragments.tsv.gz and labels.csv files came from the previous scATAC-seq section, where we automatically generated the correctly renamed barcodes and labels. Lastly, +RTS -N6 tells too-many-cells to use six cores for the calculation. These output files, especially the merged peak files, can be used for differential accessibility analysis as in scRNA-seq. This entrypoint is highly customizable, down to the exact command used for peak calling, so check out too-many-cells peaks -h for more information.

`motifs`

After differential accessibility using peaks, the result can be used to find motifs enriched in each node.

too-many-cells motifs \
    --diff-file ./diff_out.csv \
    --motif-genome hg19.fa \
    --top-n 1000 \
    --motif-command "homer/homer-4.9/bin/findMotifs.pl %s fasta %s" \
    -o motifs

In this example, we use the output of too-many-cells differential run on our merged peaks. --motif-genome points to the complete genome file used by our motif program of choice, and --top-n passes the top 1000 most differential peaks to that program. The default motif program is MEME, but we find HOMER to be much faster, so the command above uses --motif-command to call HOMER instead, making sure the %s placeholders for input and output are in the right locations (check too-many-cells motifs -h).

`classify`

To identify potential cell type candidates from sorted bulk data, too-many-cells classify uses cosine-similarity to provide scores for each bulk population. For example, we have a scATAC-seq experiment in ./mat. We also have known bulk ATAC-seq peak data of B cells in bedgraph files. We can score each cell with:

too-many-cells classify \
    --reference-file ./proB.bdg \
    --reference-file ./preB.bdg \
    --reference-file ./memoryB.bdg \
    --reference-file ./plasmaB.bdg \
    -m ./mat \
    --normalization "NoneNorm" \
    --blacklist-regions-file "mm9-blacklist.bed.gz" \
    > labels.csv

--reference-file is a list of bedgraphs here for each population. You can also specify a single reference file as an input matrix with each barcode being the label for the population, as bulk just has one sample. To use a single matrix, use --single-reference-matrix in addition to --reference-file to specify the file as a single reference matrix. The output will be identical to a normal too-many-cells labels.csv file, but with an additional column score which provides the value of the highest cosine-similarity label.
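The scoring idea can be sketched in a few lines of Python: compute the cosine similarity between a cell and each reference profile, then report the best label and its score. This is a minimal illustration, not the classify implementation; the function names, the sparse dictionary representation, and the toy feature values are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {feature: value}."""
    dot = sum(value * b.get(feature, 0.0) for feature, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(cell, references):
    """Return (label, score) for the reference most similar to the cell."""
    scores = {label: cosine_similarity(cell, ref) for label, ref in references.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical bulk profiles and one hypothetical cell over binned regions.
references = {
    "proB": {"chr1:0-5000": 1.0},
    "plasmaB": {"chr1:5000-10000": 1.0},
}
cell = {"chr1:0-5000": 1.0}
print(classify(cell, references))
```

The returned pair corresponds to the label and score columns described above.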

`spatial`

Spatial single-cell technologies allow us to measure not only the features of cells such as cell surface markers or transcriptomes, but also the spatial location of each individual cell in situ. These technologies, such as imaging mass cytometry and Visium, allow us to use various methods to quantify the spatial relationships between cell features and cell types. too-many-cells can report these relationships with the spatial entrypoint, making use of both AnnoSpat for cell type classification and spatstat for relationship quantification.

As an example, consider an imaging mass cytometry output containing two files, features.csv and spatial.csv. features.csv (can be any matrix format that too-many-cells accepts with -m) here is a matrix of cell rows and feature columns:

item,CD20,CD4,CD8,Foxp3,...
barcode1,0.1095368741640727,0.013183117496457954,0.19233368842522866,0.05579191206063343,...
barcode2,0.08268388046574766,0.003996753797330361,0.007560142177239592,0.0008473833902161547,...
...

spatial.csv is a file containing the locations of each cell, of the format item,sample,x,y, where item is the cell barcode, sample is the sample the barcode came from (for bulk processing to make sure there is segregation by sample), and x and y are the cell coordinates:

item,sample,x,y
barcode1,donor1,-493.99,496.08
barcode2,donor1,-479.629,496.641
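As a quick sanity check of this layout before running the spatial entrypoint, the following Python sketch reads a file with the item,sample,x,y header and groups the coordinates by sample. The function name and the in-memory example are assumptions for illustration; only the column names come from the format above.

```python
import csv
import io
from collections import defaultdict


def read_spatial(handle):
    """Group cell coordinates by sample: {sample: {item: (x, y)}}."""
    by_sample = defaultdict(dict)
    for row in csv.DictReader(handle):
        by_sample[row["sample"]][row["item"]] = (float(row["x"]), float(row["y"]))
    return dict(by_sample)


# In-memory copy of the example rows; a real file would be opened with open().
example = io.StringIO(
    "item,sample,x,y\n"
    "barcode1,donor1,-493.99,496.08\n"
    "barcode2,donor1,-479.629,496.641\n"
)
cells = read_spatial(example)
print(cells["donor1"]["barcode1"])
```

A missing column raises a KeyError and a non-numeric coordinate raises a ValueError, which surfaces formatting problems early.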

Using this information, we can relate the cells by their marker expression:

too-many-cells spatial \
  --matrix-transpose \
  -m total_normalized_features.csv \
  -j total_spatial.csv \
  -o tmc_mark_output \
  --mark "CD4" \
  --mark "CD20"

We use --matrix-transpose to make sure the barcodes of the feature matrix become the columns in this case, -o denotes the output folder for the analyses, and --mark denotes each feature we want to relate. If you want to see every pairwise comparison between all marks, use --mark "ALL" instead.

too-many-cells will output results into the tmc_mark_output folder, which contains a folder for each sample. Within each sample folder, there will be projections and relationships folders, the former containing an interactive visualization of the cell locations on the left with the cumulative distribution functions of each mark on the right. You can click and drag on these distributions to filter the cells on the left plot by their mark.

The relationships folder contains additional folders for pairwise comparisons of marks. Within each of these folders, there are the following files (for more information, check out spatstat):

File	Description
`basic_plot.csv`	Plot of each cell in situ.
`crosscorr.rds`	R object containing each cross-correlation function.
`cross_correlation_function.pdf`	The pairwise cross-correlation function for each mark.
`curve.csv`	The cross-correlation function in `csv` format.
`envelope.pdf`	The simulation envelope of the summary function.
`mark_correlation_function.pdf`	The mark correlation function of each mark.
`mark_variogram.pdf`	The mark variogram of each mark.
`stats.csv`	The various measures meant to summarize each cross-correlation function.

The stats.csv file contains multiple measures to summarize the functions:

Column	Description
`Var1`	The first mark for the curve.
`Var2`	The second mark for the curve.
`value`	The index for the location of the curve in the cross-correlation plot.
`meanCorr`	The mean value of the y-axis.
`maxCorr`	The maximum value of the y-axis.
`minCorr`	The minimum value of the y-axis.
`topMaxCorr`	The maximum value of the y-axis in the lower-quartile of $r$.
`topMeanCorr`	The mean value of the y-axis in the lower-quartile of $r$.
`negSwap`	The $r$ at which the y-axis first goes below 1.
`posSwap`	The $r$ at which the y-axis first goes above 1.
`longestPosLength`	The longest stretch of distance the function is above 1.
`longestNegLength`	The longest stretch of distance the function is below 1.
`maxPosWithVal`	maxCorr / maxPos ignoring the first value (which is usually 0).
`logMaxPosWithVal`	log(maxPosWithVal).
`maxPos`	The $r$ which resides at maxCorr.
`minPos`	The $r$ which resides at minCorr.
`label`	The label of the curve.
`n`	The sample size of cells with both marks.
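To show how a few of these columns relate to a curve, here is a small Python sketch that summarizes a function sampled at distances r with values y. It follows one reading of the column descriptions above and is not the code that produces stats.csv: the function name is hypothetical, only a subset of the columns is covered, and “first goes below/above 1” is interpreted as the first sampled r whose value is below/above 1.

```python
def summarize_curve(r, y):
    """Summarize a curve y(r) with a subset of the stats.csv measures."""
    max_corr, min_corr = max(y), min(y)
    # First sampled r where the curve is below / above 1 (None if never).
    neg_swap = next((ri for ri, yi in zip(r, y) if yi < 1), None)
    pos_swap = next((ri for ri, yi in zip(r, y) if yi > 1), None)
    return {
        "meanCorr": sum(y) / len(y),
        "maxCorr": max_corr,
        "minCorr": min_corr,
        "maxPos": r[y.index(max_corr)],
        "minPos": r[y.index(min_corr)],
        "negSwap": neg_swap,
        "posSwap": pos_swap,
    }


# Hypothetical curve: values above 1 at short range, then dropping below 1.
r = [0, 1, 2, 3]
y = [1.5, 1.25, 0.75, 1.0]
print(summarize_curve(r, y))
```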

The mark cross-correlation function may be used with discrete values as well, so instead of, for instance, cell surface expression, you could use cell types by passing in a labels file (used by any too-many-cells entrypoint) with -l:

too-many-cells spatial \
  --matrix-transpose \
  -m total_normalized_features.csv \
  -j total_spatial.csv \
  -o tmc_mark_output \
  -l labels_celltypes.csv \
  --mark "Helper T Cell" \
  --mark "B Cell"

You can even use AnnoSpat to predict cell types to use instead of a labels file with --annospat-marker-file (see the AnnoSpat documentation for this format).

`matrix-output`

A simple entrypoint to output the transformed matrix too-many-cells uses before clustering. Saves to --mat-output.

Advanced documentation

Each entry point has its own documentation accessible with -h, such as too-many-cells make-tree -h. The top-level help lists the available entry points:

too-many-cells -h

too-many-cells, Gregory W. Schwartz

Usage: too-many-cells (COMMAND | COMMAND | COMMAND)
  Clusters and analyzes single cell data.

Available options:
  -h,--help                Show this help text

Analyses using the single-cell matrix
  make-tree                Generate and plot the too-many-cells tree
  interactive              Interactive tree plotting (legacy, slow)
  differential             Find differential features between groups of nodes
  classify                 Classify single-cells based on reference profiles
  spatial                  Spatially analyze single-cells
  matrix-output            Transform the input matrix only

No single-cell matrix analyses
  diversity                Quantify the diversity and rarefaction curves of the
                           tree
  paths                    Infer pseudo-time information from the tree

too-many-peaks analyses for scATAC-seq
  peaks                    Find peaks in nodes for scATAC-seq
  motifs                   Find motifs from peaks for scATAC-seq

Demo

Check out an instructional example of using too-many-cells here when finished looking at the brief feature overview.


too-many-cells's Issues

trouble with running on windows 10

I am trying to run too-many-cells on Windows 10. I am trying to do it with Docker and used a lot of tips from previous issue #9, which was really helpful, but I seem to have hit a different error now. Any help would be greatly appreciated. Thanks!

PS C:\Users\acpan\too-many-cells> docker run -i --rm -v C:\Users\acpan\too-many-cells\filtered_feature_bc_matrix\:/filtered_feature_bc_matrix/ gregoryschwartz/too-many-cells:0.2.2.0 make-tree --no-filter --normalization NoneNorm --draw-max-node-size 40 --draw-max-leaf-node-size 70 --matrix-path /filtered_feature_bc_matrix/matrix.mtx --output /filtered_feature_bc_matrix/out --labels-file /filtered_feature_bc_matrix/labels.csv
Loading matrix [..........................................................]   0%too-many-cells: CsvParseException "parse error (endOfInput)"
CallStack (from HasCallStack):
  error, called at src/TooManyCells/Matrix/Load.hs:220:27 in too-many-cells-0.2.2.0-AeD17o0PQYl1Xrk9gmUYF3:TooManyCells.Matrix.Load
PS C:\Users\acpan\too-many-cells> docker info
Client:
 Debug Mode: false
 Plugins:
  app: Docker Application (Docker Inc., v0.8.0)
  buildx: Build with BuildKit (Docker Inc., v0.3.1-tp-docker)
  mutagen: Synchronize files with Docker Desktop (Docker Inc., testing)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 19.03.8
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.19.76-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 3
 Total Memory: 3.848GiB
 Name: docker-desktop
 ID: YF44:LHID:3WTQ:MWJL:RYGV:SOIO:E5PP:TAYH:VJWQ:7B2T:CNM7:2EJC
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 41
  Goroutines: 54
  System Time: 2020-05-28T20:45:40.026975263Z
  EventsListeners: 4
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

Error when running too-many-cells

Hi @GregorySchwartz, I use the image for too-many-cells, but when I run make-tree I get an error:

Loading matrix [] 0%too-many-cells: ./Data/Vector/Generic.hs:245 ((!)): index out of bounds (0,0)
CallStack (from HasCallStack):
error, called at ./Data/Vector/Internal/Check.hs:87:5 in vector-0.12.0.1-GGZqQZyzchy8YFPCF67wxL:Data.Vector.Internal.Check

When I use Docker, is an additional stack installation required?

kent-394 not supported

Hi there,

I was trying to install too-many-cells following your instructions using nix. However, I encountered this error, see attached image. I'd greatly appreciate it if you could let me know what might be causing this and how I might be able to fix it!

Best regards,
Michelle

macOS Nix missing glibcLocales

Resulting in incomplete build, see #27.

stack size error limit

Dear Gregory,

I'm trying to launch too-many-cells on my datasets when I get the following error:

Language.R.Interpreter: Cannot increase stack size limit.Try increasing your stack size limit manually:$ ulimit -s unlimited
too-many-cells: setResourceLimit: permission denied (Operation not permitted)

I do not have the rights to change the stack size limit myself (I'm working on a remote server, and the system admin does not want to change it either since it requires a reboot of the whole system).

It would be helpful for me if I could launch the tool without having to change this stack size limit.

Best regards,
Mamy Andrianteranagna

comparative analysis of clustering performance #37

Thanks for the reply (#37). I downloaded some PBMC data as my benchmark data set, with cell types as my true labels. As you say, ARI and Silhouette are unsuitable here. But because of the diversity, there seem to be too many clusters to evaluate, and I don't know how to use the measures you recommend or how many clusters (k) my result has. I looked at the link you provided, but I have only learned R and Python and can't understand purity.hs. Do you provide R or Python code, or another way, for my reference?

error, called at src/Data/Sparse/SpMatrix.hs:793:22 in sparse-linear-algebra-0.3.2-LJ5QWDyWqk8g1L9MLmVBX:Data.Sparse.SpMatrix

Hi, I am running 10X data in Docker on a Linux web server. The directory has the three Cell Ranger output files (matrix, barcodes, and features). I created labels.csv like this:
vip06@tpm5-desktop:~$ head ./labels.csv
AAACCCAAGAAACACT-1,day14
AAACCCAAGAAACCAT-1,day14
AAACCCAAGAAACCCA-1,day14
AAACCCAAGAAACCCG-1,day14
AAACCCAAGAAACCTG-1,day14
AAACCCAAGAAACGAA-1,day14
AAACCCAAGAAACGTC-1,day14
AAACCCAAGAAACTAC-1,day14
AAACCCAAGAAACTCA-1,day14

It showed the error below:

vip06@tpm5-desktop:~$ docker run -it --rm -v "/home/data/vip06:/home/data/vip06" gregoryschwartz/too-many-cells:2.0.0.0 make-tree --matrix-path /home/data/vip06/scdata/allraw/day14 --labels-file /home/data/vip06/labels.csv --draw-collection "PieRing" --output /home/data/vip06/result/
too-many-cells: matMat : incompatible matrix sizes((31053,6794880),(1671225,1)).............................................] 18%
CallStack (from HasCallStack):
error, called at src/Data/Sparse/SpMatrix.hs:793:22 in sparse-linear-algebra-0.3.2-LJ5QWDyWqk8g1L9MLmVBX:Data.Sparse.SpMatrix

Thanks a lot!

comparative analysis of clustering performance

Hi Dr. Schwartz,
TooManyCells seems to be a powerful and fast algorithm for cell clustering. For a comparative analysis of clustering performance and scalability, I used ARI and Silhouette to evaluate the accuracy of too-many-cells, but the results were not good, while the clumpiness looked good. Do you have a method or R code so that I can compare the two clustering methods and get a good result? (Just like what you did in Figure 3 of your Nature Methods paper.)
Great appreciation to your time and looking forward to your feedback and insights.

can't install the included projects

Dear authors,
When I use stack to install the included projects, like birch-beer,
it keeps giving me those errors:
-- While building package spectral-clustering-0.3.2.2 (scroll up to its section to see the error) using:
/home/xzy/.stack/setup-exe-cache/x86_64-linux/Cabal-simple_mPHDZzAJ_3.0.1.0_ghc-8.8.3 --builddir=.stack-work/dist/x86_64-linux/Cabal-3.0.1.0 build --ghc-options " -fdiagnostics-color=always"
Process exited with code: ExitFailure 1
I tried many ways to fix it, but all failed.
I will appreciate it if you tell me how to solve it.
Thank you for your time and energy.

TooManyCellsR

Hello,
I am trying to run the example code for tooManyCells function from the TooManyCellsR package in R.
I am using windows 10.
I have followed the docker workflow. I can confirm the container works:

PS C:\Windows\system32> docker run -it --rm -v "/home/username:/home/username" gregoryschwartz/too-many-cells:0.1.5.0 -h
too-many-cells, Gregory W. Schwartz. Clusters and analyzes single cell data.

Usage: too-many-cells (make-tree | interactive | differential | diversity |
                      paths)

Available options:
  -h,--help                Show this help text

Available commands:
  make-tree
  interactive
  differential
  diversity
  paths

In R I follow the example code from the TooManyCells function and specify the docker argument

library(TooManyCellsR)
input <- system.file("extdata", "mat.csv", package="TooManyCellsR")
inputLabels <- system.file("extdata", "labels.csv", package="TooManyCellsR")
df = read.csv(input, row.names = 1, header = TRUE)
mat = Matrix::Matrix(as.matrix(df), sparse = TRUE)
labelsDf = read.csv(inputLabels, header = TRUE)
res = tooManyCells(docker = "gregoryschwartz/too-many-cells:0.1.5.0",mat, labels = labelsDf
                    , args = c( "make-tree"
                                , "--no-filter"
                                , "--normalization", "NoneNorm"
                                , "--draw-max-node-size", "40"
                                , "--draw-max-leaf-node-size", "70"
                    )
)

I can create all the objects except res, which fails with the following error:

Error in wrap.url(file, load.image.internal) : File not found
In addition: Warning messages:
1: In dir.create(output, recursive = TRUE) : 'out' already exists
2: In system2("docker", args = c(dockerArgs, args, autoArgs), stdout = TRUE) :
  running command '"docker" run -i --rm -v C:\Users\pedri\AppData\Local\Temp\Rtmp8sTiO7:C:\Users\pedri\AppData\Local\Temp\Rtmp8sTiO7 -v C:\Users\pedri\Documents\out:C:\Users\pedri\Documents\out gregoryschwartz/too-many-cells:0.1.5.0 make-tree --no-filter --normalization NoneNorm --draw-max-node-size 40 --draw-max-leaf-node-size 70 --matrix-path C:\Users\pedri\AppData\Local\Temp\Rtmp8sTiO7 --output C:\Users\pedri\Documents\out --labels-file C:\Users\pedri\AppData\Local\Temp\Rtmp8sTiO7/labels.csv' had status 125

Any idea how to fix the issue?

Here is my session:

> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] TooManyCellsR_0.1.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3       lattice_0.20-38  png_0.1-7        tiff_0.1-5       grid_3.6.2       imager_0.42.1    magrittr_1.5     stringi_1.4.3   
 [9] rlang_0.4.2      Matrix_1.2-18    bmp_0.3          tools_3.6.2      stringr_1.4.0    purrr_0.3.3      igraph_1.2.4.2   jpeg_0.1-8.1    
[17] compiler_3.6.2   pkgconfig_2.0.3  readbitmap_0.1.5

`--dense` does not work

Thanks for adding the --dense option.
However, at the moment there still seems to be a problem.
On a dense matrix with ~70k cells and 50 dimensions, it runs for more than 30min with 32 cores and takes a vast amount of memory (~120GB).

On the other hand, without the dense option it completes within a couple of minutes.

Given that it is already pretty fast without dense, I'm not even sure if it's worth supporting this option.

Cheers,
Gregor

The pipeline crashed with error message

Error with install on CentOS7: inline-r-0.9.2 won't build

Hi,

The install won't complete because "pkg-config package 'libR' version >= 3.0 is required but it could not be found", an error which shows up a lot on the web, but I have no idea how to implement their fixes, I am not a Haskell person...

I assume it is because I did not install R as part of the prerequisites, since we manage a sitewide R installation which is NOT at /usr/bin/R. I have added "/n/apps/CentOS7/install/r-3.6.1/lib64/" to LD_LIBRARY_PATH, but to no effect.

Full error:

inline-r > configure
inline-r > Configuring inline-r-0.9.2...
inline-r > Cabal-simple_mPHDZzAJ_2.2.0.1_ghc-8.4.3: The pkg-config package 'libR' version
inline-r > >=3.0 is required but it could not be found.
inline-r >
-- While building package inline-r-0.9.2 using:
/home/apa/.stack/setup-exe-cache/x86_64-linux/Cabal-simple_mPHDZzAJ_2.2.0.1_ghc-8.4.3
--builddir=.stack-work/dist/x86_64-linux/Cabal-2.2.0.1
configure
--user
--package-db=clear
--package-db=global
--package-db=/home/apa/.stack/snapshots/x86_64-linux/5c704446e9d18780b137cf7126abfe344049f71b810d2ee121cad6c8264ecbb3/8.4.3/pkgdb
--libdir=/home/apa/.stack/snapshots/x86_64-linux/5c704446e9d18780b137cf7126abfe344049f71b810d2ee121cad6c8264ecbb3/8.4.3/lib
--bindir=/home/apa/.stack/snapshots/x86_64-linux/5c704446e9d18780b137cf7126abfe344049f71b810d2ee121cad6c8264ecbb3/8.4.3/bin
--datadir=/home/apa/.stack/snapshots/x86_64-linux/5c704446e9d18780b137cf7126abfe344049f71b810d2ee121cad6c8264ecbb3/8.4.3/share
--libexecdir=/home/apa/.stack/snapshots/x86_64-linux/5c704446e9d18780b137cf7126abfe344049f71b810d2ee121cad6c8264ecbb3/8.4.3/libexec
--sysconfdir=/home/apa/.stack/snapshots/x86_64-linux/5c704446e9d18780b137cf7126abfe344049f71b810d2ee121cad6c8264ecbb3/8.4.3/etc
--docdir=/home/apa/.stack/snapshots/x86_64-linux/5c704446e9d18780b137cf7126abfe344049f71b810d2ee121cad6c8264ecbb3/8.4.3/doc/inline-r-0.9.2
--htmldir=/home/apa/.stack/snapshots/x86_64-linux/5c704446e9d18780b137cf7126abfe344049f71b810d2ee121cad6c8264ecbb3/8.4.3/doc/inline-r-0.9.2
--haddockdir=/home/apa/.stack/snapshots/x86_64-linux/5c704446e9d18780b137cf7126abfe344049f71b810d2ee121cad6c8264ecbb3/8.4.3/doc/inline-r-0.9.2
--dependency=aeson=aeson-1.4.0.0-EAbp2GiwrvTH27nXdJzV0g
--dependency=base=base-4.11.1.0
--dependency=bytestring=bytestring-0.10.8.2
--dependency=containers=containers-0.5.11.0
--dependency=data-default-class=data-default-class-0.1.2.0-2kYzERBLX3wJoPfj7mwVvW
--dependency=deepseq=deepseq-1.4.3.0
--dependency=exceptions=exceptions-0.10.0-DmsI5QMvE6e6QgVkMINEKb
--dependency=inline-c=inline-c-0.6.1.0-FM9gF7RqOpoLWRlok3Pud0
--dependency=mtl=mtl-2.2.2
--dependency=pretty=pretty-1.1.3.6
--dependency=primitive=primitive-0.6.3.0-DaZpcxwJp2TGn8ITSgfI4C
--dependency=process=process-1.6.3.0
--dependency=reflection=reflection-2.1.4-ET4Qfoy5lmWBopRK3ezJIQ
--dependency=setenv=setenv-0.1.1.3-H1xmIqlPy4yIDquO6eJhBl
--dependency=singletons=singletons-2.4.1-FDzlisNNwplIrNjegYYDdD
--dependency=template-haskell=template-haskell-2.13.0.0
--dependency=text=text-1.2.3.0
--dependency=th-lift=th-lift-0.7.10-88ozaMeoe8eDZSlyIjheFa
--dependency=th-orphans=th-orphans-0.13.6-6mvRAE1wQLBDXpoe3PtgV3
--dependency=transformers=transformers-0.5.5.0
--dependency=unix=unix-2.7.2.2
--dependency=vector=vector-0.12.0.1-GGZqQZyzchy8YFPCF67wxL
--exact-configuration
--ghc-option=-fhide-source-paths

Process exited with code: ExitFailure 1

Prune parameters setting question

Dear @GregorySchwartz ,

I'm using TooManyCells on my dataset. It works great. But I have trouble understanding the prune parameters in the "docker run gregoryschwartz/too-many-cells:2.0.0.0 make-tree -h". Could you explain these prune parameters in more detail?

Also, could you give me some advice on improving the plots attached below to make the "small leaves" more biologically meaningful? It is hard to truly define cell subtypes for a non-model organism, and the choice is arbitrary in the standard Seurat workflows.

plot1 based on "--draw-collection "PieChart" --smart-cutoff 4 --draw-no-scale-nodes --draw-mark "MarkModularity" --draw-max-node-size 10" parameters
plot2 based on "--draw-collection "PieChart" --draw-no-scale-nodes --draw-mark "MarkModularity" --draw-max-node-size 10 --min-size 100" parameters.

Thanks a lot!

Expected runtime

Hello,

Great program so far! We are looking forward to our analysis using AnnoSpat and the spatial module in too-many-cells. I am wondering if it is typical for only one core to be used in the spatial processing? I am running a Docker container of v3.0.1.0 in WSL2 given 50GB of 64GB RAM with an 8 core i7-7700 CPU. The input is ~137,000 cells from CODEX imaging of 23 markers assigned using AnnoSpat:

docker run --memory=55g -v "$HOME:$HOME" \
    gregoryschwartz/too-many-cells:3.0.1.0 spatial \
    --matrix-transpose \
    -z QuantileNorm \
    -z TfIdfNorm \
    -m /home/smith6jt/AnnoSpat/measurements.csv \
    -j /home/smith6jt/AnnoSpat/spatial.csv \
    -o /home/smith6jt/outputdir/full_marker/tmc \
    -l /home/smith6jt/outputdir/full_marker/trte_labels_ELM_spleen.csv \
    --mark "ALL"

There are 31 expected cell types, and for now each relationship file is taking almost 30 minutes, so all combinations will take quite a long time for one sample at this pace. Perhaps there is something I can do to improve this?

Thanks!

Problem running on R

Good afternoon,

I am trying to run too-many-cells in R but I get the following error:

Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input

I get this error while running the example provided in the reference manual of the R package, so it should not be a file format problem.

This is what I tried to run:

input <- system.file("extdata", "mat.csv", package="TooManyCellsR")
inputLabels <- system.file("extdata", "labels.csv", package="TooManyCellsR")
df = read.csv(input, row.names = 1, header = TRUE)
mat = Matrix::Matrix(as.matrix(df), sparse = TRUE)
labelsDf = read.csv(inputLabels, header = TRUE)

tooManyCells(mat, args = c("make-tree"), labels = NULL,
output = "out", prior = NULL, docker = NULL, mounts = c())

Error in installing with docker

Hi,

I am having trouble installing with Docker. When I run docker pull gregoryschwartz/too-many-cells:2.0.0.0, I always get an error like this:

Error response from daemon: Get https://registry-1.docker.io/v2/gregoryschwartz/too-many-cells/manifests/2.0.0.0: Get https://auth.docker.io/token?scope=repository%3Agregoryschwartz%2Ftoo-many-cells%3Apull&service=registry.docker.io: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Thanks so much!

Font missing in docker container

Sorry for bothering you again...
In the "Painting tree" step the docker container fails because of a font missing:

> sudo docker run -it -v /storage/scratch/toomanycells:/scratch gregoryschwartz/too-many-cells:0.1.1.0 make-tree -m /scratch/harmony_toomanycells.csv --no-filter --normalization NoneNorm  --dense
Painting sketches [===================================>..............]  70%
Clumpiness requires labels for cells, skipping...
Painting tree [===========================================>..........]  80%
too-many-cells: /opt/build/.stack-work/install/x86_64-linux/lts-12.0/8.4.3/share/x86_64-linux-ghc-8.4.3/SVGFonts-1.6.0.3/fonts/LinLibertine.svg: openFile: does not exist (No such file or directory)

Docker run too-many-cells error

Hi,

I get this error immediately after running too-many-cells:

too-many-cells: readCreateProcess: R "-e" "cat(R.home())" "--quiet" "--slave" (exit -9): failed

Thanks,

Best,

Jia Li

How to modify the colors of the PieChart

Hello,
Thanks for your packages. After running make-tree, I get the plot. It's beautiful, but I want to know whether I can change the PieChart's colors to my own selected colors. I found the --draw-palette parameter, but I have to choose between Set1 | HSV | Ryb | Blues. Can I use custom colors?

index out of bounds error

Hi,

I'm experiencing an index-out-of-bounds error when running too-many-cells on a csv file using the docker container:

sudo docker run -it -v /storage/scratch/toomanycells:/scratch gregoryschwartz/too-many-cells:0.1.0.0 make-tree -m /scratch/test.csv 

Sketching tree [=================================>..................................................]  40%too-many-cells: ./Data/Vector/Generic.hs:245 ((!)): index out of bounds (0,0)                             
CallStack (from HasCallStack):
  error, called at ./Data/Vector/Internal/Check.hs:87:5 in vector-0.12.0.1-GGZqQZyzchy8YFPCF67wxL:Data.Vector.Internal.Check

Do you have an idea what might go wrong?

Here's the matrix: test.csv.gz. The data are the first 50 PC's on 2000 immune cells.

Cannot read draw-leaf error

I am using the docker container of too-many-cells v0.1.5.0 for my data. make-tree is working fine and I am happy with the results, but drawing gene expression on the tree gives a "Cannot read draw-leaf" error for any gene.
For example, in the command below I am trying to plot the expression of the MBP gene, which is expressed in the data and is listed in column 1 of the genes.tsv file (which has only one gene-name column). The paths to the prior, the matrix, and the labels are all correct, but I still do not understand the error. I also ran it without the "--draw-scale-saturation 10" parameter and got the same error.

Here is the error

Planning leaf colors [=====>..............................................] 10%too-many-cells: Cannot read draw-leaf.
CallStack (from HasCallStack):
error, called at app/Main.hs:329:36 in main:Main

Here is the code that I am running

docker run -v $INPATH:/data gregoryschwartz/too-many-cells:0.1.5.0 make-tree \
  --prior /data/snRNAseq_Processed_AD2019/out_cellType \
  --matrix-path /data/snRNAseq_Processed_AD2019 \
  --labels-file /data/snRNAseq_Processed_AD2019/labels_CellType.csv \
  --draw-leaf "DrawItem (DrawContinuous [\"MBP\"])" \
  --draw-scale-saturation 10 \
  --output /data/snRNAseq_Processed_AD2019/out_gene_expression \
  > ${INPATH}/snRNAseq_Processed_AD2019/clusters_cellType_Expr_pruned.csv

parse error (not enough input)

I am trying to run on Docker on a Windows 10 system. I am not sure whether my system is running out of memory or there is another issue. I suspect it is not only a memory issue, since running out of memory has not usually thrown an error for me in the past.

C:\Users\acpan
λ docker run -i --rm -v C:\Users\acpan\too-many-cells\too-many-cells\filtered_feature_bc_matrix:/filtered_feature_bc_matrix/ gregoryschwartz/too-many-cells:0.2.2.0 make-tree --no-filter --normalization NoneNorm --draw-max-node-size 40 --draw-max-leaf-node-size 70 --matrix-path /filtered_feature_bc_matrix/ --output /filtered_feature_bc_matrix/out/ --labels-file /filtered_feature_bc_matrix/labels.csv
Recording tree measurements [======================>......................] 50%too-many-cells: parse error (not enough input) at ""
CallStack (from HasCallStack):
error, called at src/BirchBeer/Load.hs:45:33 in birch-beer-0.2.2.0-GoLgVIKGefMG9Opfue2NME:BirchBeer.Load

C:\Users\acpan
λ docker info
Client:
Debug Mode: false

Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 1
Server Version: 19.03.8
Storage Driver: overlay2
Backing Filesystem:
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.19.76-linuxkit
Operating System: Docker Desktop
OSType: linux
Architecture: x86_64
CPUs: 9
Total Memory: 47.07GiB
Name: docker-desktop
ID: NX5S:K2SQ:E262:C2MZ:XI77:FYJP:HN54:MNWC:LENB:BKF5:HRBU:HYKF
Docker Root Dir: /var/lib/docker
Debug Mode: true
File Descriptors: 41
Goroutines: 54
System Time: 2020-07-27T17:53:04.461267527Z
EventsListeners: 4
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

too-many-cells not installed on MacOSX Sierra

I tried to install too-many-cells on MacOS X Sierra without success.

It fails while building gtk 0.14, which is required.

I also followed the instructions here: https://wiki.haskell.org/Gtk2Hs/Mac, i.e. I tried cabal install gtk -f have-quartz-gtk (although those instructions are for OS X Mavericks), but again it failed.

It automatically installs gtk 0.15.

Any suggestions?

Thank you in advance

Unable to use on my data with docker(lol)

Hi there, I am using too-many-cells on a Mac to analyze my own data. Since I was unable to install it with stack, I use docker instead.
My original data is 450MB, but no matter how much memory I give to docker, I can see all of it being used up, and then I get an error without any log info (lol).
But when I run docker on the example files, I get correct results.
Could there be any other reason for this kind of failure?

Prebuilt binaries for Linux

Hi @GregorySchwartz, I am trying to run too-many-cells on a Linux cluster where I don't have sudo access and docker is not a viable installation option. Can I use prebuilt Linux binaries of too-many-cells? I have not been able to use the docker approach or compile with stack on my OS X laptop either.

Package for Homebrew

Hi, Gregory. I maintain the Homebrew tap Brewsci/bio. Would you be interested in adding Too-many-cells to Brewsci/bio? If you wanted to open a draft PR, I'd be happy to answer there any questions you may run into. Homebrew supports macOS, Linux, and Windows 10 WSL.

Seurat SCtransform

Hello developers of too-many-cells!
Could I use SCTransform-normalized values as input? Would that contradict the assumptions of your algorithm?

stack install error on CentOS

Hello, I am trying to install too-many-cells on CentOS 6. I have ghc-8.4.4 and ghc-8.0.1 available on my system (I can change between the two easily) and the latest stack from their git repo.
This is what the install looks like with ghc-8.4.4:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.4.4
$ stack --version
Version 2.1.3, Git revision 636e3a759d51127df2b62f90772def126cdf6d1f (7735 commits) x86_64 hpack-0.31.2
$ stack install
No setup information found for ghc-8.4.3 on your platform.
This probably means a GHC bindist has not yet been added for OS key 'linux64-gmp4'.
Supported versions: ghc-7.8.4, ghc-7.10.2, ghc-7.10.3, ghc-8.0.1, ghc-8.0.2, ghc-8.2.1, ghc-8.2.2, ghc-8.4.2

and with ghc-8.0.1 (tried this one because it is listed in the 'Supported versions' line of the last error message):

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.0.1
$ stack --version
Version 2.1.3, Git revision 636e3a759d51127df2b62f90772def126cdf6d1f (7735 commits) x86_64 hpack-0.31.2
$ stack install
No setup information found for ghc-8.4.3 on your platform.
This probably means a GHC bindist has not yet been added for OS key 'linux64-gmp4'.
Supported versions: ghc-7.8.4, ghc-7.10.2, ghc-7.10.3, ghc-8.0.1, ghc-8.0.2, ghc-8.2.1, ghc-8.2.2, ghc-8.4.2

I'm not sure why the installation keeps referencing ghc-8.4.3. This version of GHC is not installed on my system and there shouldn't be any references to it anywhere.

Any assistance would be greatly appreciated. I understand that the Docker build is likely the easiest way around this issue, but using Docker is not a viable solution in my environment.
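
Stack takes the required GHC version from the resolver pinned in the project's stack.yaml, not from whichever ghc is first on the PATH. That would explain the repeated reference to ghc-8.4.3: build paths in other reports in this list pair lts-12.0 with 8.4.3. A minimal sketch for checking the pinned resolver from a source checkout; `stack_resolver` is a hypothetical helper, and it assumes the file uses a top-level `resolver:` key.

```shell
#!/bin/sh
# stack_resolver FILE   (hypothetical helper)
# Prints the value of the first top-level "resolver:" key in a stack.yaml file.
stack_resolver() {
    sed -n 's/^resolver:[[:space:]]*//p' "$1" | head -n 1
}

# Usage, from the too-many-cells source directory:
# stack_resolver stack.yaml
```

Stack's --system-ghc flag makes it use a GHC already on the PATH, but only when that GHC's version matches the one the resolver expects, so 8.4.4 or 8.0.1 would not satisfy a resolver that wants 8.4.3.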

Error in installation on Mac

Hi,

I am trying to install on Mac OS X and have completed all the steps in the tutorial. However, I am getting the following error.

stack install --flag gtk:have-quartz-gtk
Preparing to install GHC to an isolated location.
This will not interfere with any system-level installation.
Already downloaded.
Configuring GHC ...
Received ExitFailure 1 when running
Raw command: /Users/bibaswan/.stack/programs/x86_64-osx/ghc-8.4.3.temp/ghc-8.4.3/configure --prefix=/Users/bibaswan/.stack/programs/x86_64-osx/ghc-8.4.3/
Run from: /Users/bibaswan/.stack/programs/x86_64-osx/ghc-8.4.3.temp/ghc-8.4.3/

Thanks,

Bibaswan

Use PCA embedding as input data?

Hi @GregorySchwartz,

this tool looks pretty awesome, I'm curious to try it out!

From the documentation I understand that too-many-cells expects a (UMI)-count matrix.
Would the algorithm also work with embedded data (e.g. PCA)?

I am asking because I have a large single cell dataset integrated from different sources and I performed some data cleaning/batch effect removal steps that work on embedded data.

Cheers,
Gregor
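
A count matrix is non-negative, whereas a PCA embedding generally contains negative coordinates, so one thing worth checking before trying embedded input is how many entries fall below zero. A minimal sketch, assuming a comma-separated file with one header row and a first column of names (both are assumptions; adjust to the actual layout). `count_negative` is a hypothetical helper, and whether negative values are acceptable to the algorithm is a question for the maintainer.

```shell
#!/bin/sh
# count_negative FILE   (hypothetical helper)
# Counts numeric entries below zero in a CSV, skipping the header row
# and the first (name) column.
count_negative() {
    awk -F, '
        NR > 1 {
            for (i = 2; i <= NF; i++)
                if ($i + 0 < 0) n++
        }
        END { print n + 0 }
    ' "$1"
}

# Usage:
# count_negative embedding.csv
```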

Running too-many-cells with normalized data

Hi Greg,

I have normalized data from my scRNAseq analysis pipeline that I would like to use with too-many-cells for hierarchical clustering and creating a tree. How can I do that?

Thanks

Error in installation with stack

Hi,

I am getting an error in installation with stack. Please see the error message. I am using MacOS Mojave 10.14.4. Please let me know the solution.

WARNING: Ignoring out of range dependency (allow-newer enabled): monoid-extras-0.5. dual-tree requires: >=0.2 && <0.5
WARNING: Ignoring out of range dependency (allow-newer enabled): Cabal-2.2.0.1. gtk requires: >=1.24 && <2.1
WARNING: Ignoring out of range dependency (allow-newer enabled): Cabal-2.2.0.1. gio requires: >=1.24 && <2.1
WARNING: Ignoring out of range dependency (allow-newer enabled): transformers-compat-0.6.2. mmorph requires: >=0.3 && <0.6
WARNING: Ignoring out of range dependency (allow-newer enabled): streaming-commons-0.2.1.0. pipes-text requires: >=0.1 && <0.2
WARNING: Ignoring out of range dependency (allow-newer enabled): async-2.2.1. typed-spreadsheet requires: >=2.0 && <2.2
WARNING: Ignoring out of range dependency (allow-newer enabled): foldl-1.4.2. typed-spreadsheet requires: >=1.1 && <1.4
WARNING: Ignoring out of range dependency (allow-newer enabled): aeson-1.4.0.0. streaming-utils requires: >0.8 && <1.2
WARNING: Ignoring out of range dependency (allow-newer enabled): json-stream-0.4.2.3. streaming-utils requires: >0.4.0 && <0.4.2
WARNING: Ignoring out of range dependency (allow-newer enabled): resourcet-1.2.1. streaming-utils requires: >1.0 && <1.2
WARNING: Ignoring out of range dependency (allow-newer enabled): streaming-0.2.1.0. streaming-utils requires: >=0.1.4.0 && <0.1.4.8
WARNING: Ignoring out of range dependency (allow-newer enabled): streaming-bytestring-0.1.6. streaming-utils requires: >=0.1.4.0 && <0.1.4.8
WARNING: Ignoring out of range dependency (allow-newer enabled): streaming-commons-0.2.1.0. streaming-utils requires: >0.1.0 && <0.1.18
WARNING: Ignoring out of range dependency (allow-newer enabled): transformers-0.5.5.0. streaming-utils requires: >=0.4 && <0.5.3
hierarchical-spectral-clustering-0.3.0.0: configure
diagrams-1.4: download
hierarchical-spectral-clustering-0.3.0.0: build
inline-r-0.9.2: configure
gtk-0.14.7: configure
Could not find custom-setup dep: Cabal
diagrams-1.4: configure
diagrams-1.4: build
diagrams-1.4: copy/register
Progress 4/9

-- While building package hierarchical-spectral-clustering-0.3.0.0 using:
/Users/biba/.stack/setup-exe-cache/x86_64-osx/Cabal-simple_mPHDZzAJ_2.2.0.1_ghc-8.4.3 --builddir=.stack-work/dist/x86_64-osx/Cabal-2.2.0.1 build --ghc-options " -ddump-hi -ddump-to-file -fdiagnostics-color=always"
Process exited with code: ExitFailure 1
Logs have been written to: /Users/biba/Downloads/softwares/too-many-cells/.stack-work/logs/hierarchical-spectral-clustering-0.3.0.0.log

Configuring hierarchical-spectral-clustering-0.3.0.0...
clang: warning: argument unused during compilation: '-nopie' [-Wunused-command-line-argument]
Preprocessing library for hierarchical-spectral-clustering-0.3.0.0..
Building library for hierarchical-spectral-clustering-0.3.0.0..
[ 1 of 10] Compiling Math.Clustering.Hierarchical.Spectral.Types ( src/Math/Clustering/Hierarchical/Spectral/Types.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/Math/Clustering/Hierarchical/Spectral/Types.o )
[ 2 of 10] Compiling Math.Clustering.Hierarchical.Spectral.Load ( src/Math/Clustering/Hierarchical/Spectral/Load.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/Math/Clustering/Hierarchical/Spectral/Load.o )
[ 3 of 10] Compiling Math.Clustering.Hierarchical.Spectral.Utility ( src/Math/Clustering/Hierarchical/Spectral/Utility.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/Math/Clustering/Hierarchical/Spectral/Utility.o )
[ 4 of 10] Compiling Math.Clustering.Hierarchical.Spectral.Sparse ( src/Math/Clustering/Hierarchical/Spectral/Sparse.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/Math/Clustering/Hierarchical/Spectral/Sparse.o )
[ 5 of 10] Compiling Math.Clustering.Hierarchical.Spectral.Eigen.FeatureMatrix ( src/Math/Clustering/Hierarchical/Spectral/Eigen/FeatureMatrix.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/Math/Clustering/Hierarchical/Spectral/Eigen/FeatureMatrix.o )
[ 6 of 10] Compiling Math.Clustering.Hierarchical.Spectral.Eigen.AdjacencyMatrix ( src/Math/Clustering/Hierarchical/Spectral/Eigen/AdjacencyMatrix.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/Math/Clustering/Hierarchical/Spectral/Eigen/AdjacencyMatrix.o )
[ 7 of 10] Compiling Math.Clustering.Hierarchical.Spectral.Dense ( src/Math/Clustering/Hierarchical/Spectral/Dense.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/Math/Clustering/Hierarchical/Spectral/Dense.o )
[ 8 of 10] Compiling Math.Clustering.Hierarchical.Spectral.Test ( src/Math/Clustering/Hierarchical/Spectral/Test.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/Math/Clustering/Hierarchical/Spectral/Test.o )
[ 9 of 10] Compiling Math.Graph.Types ( src/Math/Graph/Types.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/Math/Graph/Types.o )
[10 of 10] Compiling Math.Graph.Components ( src/Math/Graph/Components.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/Math/Graph/Components.o )
ignoring (possibly broken) abi-depends field for packages
Preprocessing executable 'cluster-tree' for hierarchical-spectral-clustering-0.3.0.0..
Building executable 'cluster-tree' for hierarchical-spectral-clustering-0.3.0.0..
[1 of 1] Compiling Main             ( app/Main.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/cluster-tree/cluster-tree-tmp/Main.o )
Linking .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/cluster-tree/cluster-tree ...
clang: warning: argument unused during compilation: '-nopie' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-nopie' [-Wunused-command-line-argument]
Undefined symbols for architecture x86_64:
  "_iconv", referenced from:
      _hs_iconv in libHSbase-4.11.1.0.a(iconv.o)
     (maybe you meant: _base_GHCziIOziEncodingziIconv_iconvEncoding12_closure, _base_GHCziIOziEncodingziIconv_iconvEncoding1_info , _base_GHCziIOziEncodingziIconv_iconvEncoding3_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding4_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding10_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding18_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding10_info , _hs_iconv , _base_GHCziIOziEncodingziIconv_iconvEncoding5_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding17_bytes , _base_GHCziIOziEncodingziIconv_iconvEncoding14_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding12_info , _base_GHCziIOziEncodingziIconv_iconvEncoding15_info , _base_GHCziIOziEncodingziIconv_iconvEncoding18_info , _base_GHCziIOziEncodingziIconv_iconvEncoding13_bytes , _base_GHCziIOziEncodingziIconv_iconvEncoding16_info , _hs_iconv_open , _base_GHCziIOziEncodingziIconv_iconvEncoding7_info , _base_GHCziIOziEncodingziIconv_iconvEncoding6_info , _base_GHCziIOziEncodingziIconv_iconvEncoding_closure , _hs_iconv_close , _base_GHCziIOziEncodingziIconv_iconvEncoding2_info , _base_GHCziIOziEncodingziIconv_iconvEncoding2_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding8_bytes , _base_GHCziIOziEncodingziIconv_iconvEncoding9_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding16_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding4_info , _base_GHCziIOziEncodingziIconv_iconvEncoding15_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding11_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding6_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding7_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding1_closure , _base_GHCziIOziEncodingziIconv_iconvEncoding11_info , _base_GHCziIOziEncodingziIconv_iconvEncoding_info , _base_GHCziIOziEncodingziIconv_iconvEncoding14_info )
  "_iconv_open", referenced from:
      _hs_iconv_open in libHSbase-4.11.1.0.a(iconv.o)
     (maybe you meant: _hs_iconv_open)
  "_iconv_close", referenced from:
      _hs_iconv_close in libHSbase-4.11.1.0.a(iconv.o)
     (maybe you meant: _hs_iconv_close)
  "_locale_charset", referenced from:
      _localeEncoding in libHSbase-4.11.1.0.a(PrelIOUtils.o)
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
`gcc' failed in phase `Linker'. (Exit code: 1)

-- While building package gtk-0.14.7 using:
/private/var/folders/lx/y213srbn7dq_941s3qthqm7r0000gn/T/stack-2c47c2d5ae83912e/gtk-0.14.7/.stack-work/dist/x86_64-osx/Cabal-2.2.0.1/setup/setup --builddir=.stack-work/dist/x86_64-osx/Cabal-2.2.0.1 configure --with-ghc=/Users/biba/.stack/programs/x86_64-osx/ghc-8.4.3/bin/ghc --with-ghc-pkg=/Users/biba/.stack/programs/x86_64-osx/ghc-8.4.3/bin/ghc-pkg --user --package-db=clear --package-db=global --package-db=/Users/biba/.stack/snapshots/x86_64-osx/lts-12.0/8.4.3/pkgdb --package-db=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/pkgdb --libdir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/lib --bindir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/bin --datadir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/share --libexecdir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/libexec --sysconfdir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/etc --docdir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/doc/gtk-0.14.7 --htmldir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/doc/gtk-0.14.7 --haddockdir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/doc/gtk-0.14.7 --dependency=Cabal=Cabal-2.2.0.1 --dependency=array=array-0.5.2.0 --dependency=base=base-4.11.1.0 --dependency=bytestring=bytestring-0.10.8.2 --dependency=cairo=cairo-0.13.5.0-FlHMbXIytdkJ24AuhwTIIm --dependency=containers=containers-0.5.11.0 --dependency=gio=gio-0.13.4.1-uwR5pDtPrk9QgHfqqjx3G --dependency=glib=glib-0.13.6.0-8v3ZoQEqpf3Ib0SRPzybZI --dependency=gtk2hs-buildtools=gtk2hs-buildtools-0.13.4.0-BeTuALJn73yF2BRI2iGnJc --dependency=mtl=mtl-2.2.2 --dependency=pango=pango-0.13.5.0-9Ppj0iNVrww8InFTamPfKW 
--dependency=text=text-1.2.3.0 -fdeprecated -ffmode-binary -fhave-gio -f-have-quartz-gtk --exact-configuration
Process exited with code: ExitFailure 1
Logs have been written to: /Users/biba/Downloads/softwares/too-many-cells/.stack-work/logs/gtk-0.14.7.log

[1 of 2] Compiling Main             ( /private/var/folders/lx/y213srbn7dq_941s3qthqm7r0000gn/T/stack-2c47c2d5ae83912e/gtk-0.14.7/Setup.hs, /private/var/folders/lx/y213srbn7dq_941s3qthqm7r0000gn/T/stack-2c47c2d5ae83912e/gtk-0.14.7/.stack-work/dist/x86_64-osx/Cabal-2.2.0.1/setup/Main.o )
[2 of 2] Compiling StackSetupShim   ( /Users/biba/.stack/setup-exe-src/setup-shim-mPHDZzAJ.hs, /private/var/folders/lx/y213srbn7dq_941s3qthqm7r0000gn/T/stack-2c47c2d5ae83912e/gtk-0.14.7/.stack-work/dist/x86_64-osx/Cabal-2.2.0.1/setup/StackSetupShim.o )
Linking /private/var/folders/lx/y213srbn7dq_941s3qthqm7r0000gn/T/stack-2c47c2d5ae83912e/gtk-0.14.7/.stack-work/dist/x86_64-osx/Cabal-2.2.0.1/setup/setup ...
clang: warning: argument unused during compilation: '-nopie' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-nopie' [-Wunused-command-line-argument]
Configuring gtk-0.14.7...
clang: warning: argument unused during compilation: '-nopie' [-Wunused-command-line-argument]
setup: The pkg-config package 'gtk+-2.0' is required but it could not be
found.

-- While building package inline-r-0.9.2 using:
/Users/biba/.stack/setup-exe-cache/x86_64-osx/Cabal-simple_mPHDZzAJ_2.2.0.1_ghc-8.4.3 --builddir=.stack-work/dist/x86_64-osx/Cabal-2.2.0.1 configure --with-ghc=/Users/biba/.stack/programs/x86_64-osx/ghc-8.4.3/bin/ghc --with-ghc-pkg=/Users/biba/.stack/programs/x86_64-osx/ghc-8.4.3/bin/ghc-pkg --user --package-db=clear --package-db=global --package-db=/Users/biba/.stack/snapshots/x86_64-osx/lts-12.0/8.4.3/pkgdb --package-db=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/pkgdb --libdir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/lib --bindir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/bin --datadir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/share --libexecdir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/libexec --sysconfdir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/etc --docdir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/doc/inline-r-0.9.2 --htmldir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/doc/inline-r-0.9.2 --haddockdir=/Users/biba/Downloads/softwares/too-many-cells/.stack-work/install/x86_64-osx/lts-12.0/8.4.3/doc/inline-r-0.9.2 --dependency=aeson=aeson-1.4.0.0-EAbp2GiwrvTH27nXdJzV0g --dependency=base=base-4.11.1.0 --dependency=bytestring=bytestring-0.10.8.2 --dependency=containers=containers-0.5.11.0 --dependency=data-default-class=data-default-class-0.1.2.0-2kYzERBLX3wJoPfj7mwVvW --dependency=deepseq=deepseq-1.4.3.0 --dependency=exceptions=exceptions-0.10.0-DmsI5QMvE6e6QgVkMINEKb --dependency=inline-c=inline-c-0.6.1.0-FM9gF7RqOpoLWRlok3Pud0 --dependency=mtl=mtl-2.2.2 --dependency=pretty=pretty-1.1.3.6 --dependency=primitive=primitive-0.6.3.0-DaZpcxwJp2TGn8ITSgfI4C 
--dependency=process=process-1.6.3.0 --dependency=reflection=reflection-2.1.4-ET4Qfoy5lmWBopRK3ezJIQ --dependency=setenv=setenv-0.1.1.3-H1xmIqlPy4yIDquO6eJhBl --dependency=singletons=singletons-2.4.1-FDzlisNNwplIrNjegYYDdD --dependency=template-haskell=template-haskell-2.13.0.0 --dependency=text=text-1.2.3.0 --dependency=th-lift=th-lift-0.7.10-88ozaMeoe8eDZSlyIjheFa --dependency=th-orphans=th-orphans-0.13.6-6mvRAE1wQLBDXpoe3PtgV3 --dependency=transformers=transformers-0.5.5.0 --dependency=unix=unix-2.7.2.2 --dependency=vector=vector-0.12.0.1-GGZqQZyzchy8YFPCF67wxL --exact-configuration
Process exited with code: ExitFailure 1
Logs have been written to: /Users/biba/Downloads/softwares/too-many-cells/.stack-work/logs/inline-r-0.9.2.log

Configuring inline-r-0.9.2...
clang: warning: argument unused during compilation: '-nopie' [-Wunused-command-line-argument]
Cabal-simple_mPHDZzAJ_2.2.0.1_ghc-8.4.3: The pkg-config package 'libR' version
>=3.0 is required but it could not be found.

Thanks a lot in advance!!
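
Two of the failures in the log above are configure-time lookups: gtk-0.14.7 could not find the pkg-config package 'gtk+-2.0', and inline-r-0.9.2 could not find 'libR' version >=3.0. A minimal sketch for checking those lookups before rerunning stack; `missing_pkgconfig` is a hypothetical helper built on `pkg-config --exists`, and it does not address the separate `_iconv` linker error.

```shell
#!/bin/sh
# missing_pkgconfig MODULE...   (hypothetical helper)
# Prints each module that pkg-config cannot resolve; prints nothing if all are found.
missing_pkgconfig() {
    for pkg in "$@"; do
        pkg-config --exists "$pkg" 2>/dev/null || echo "$pkg"
    done
}

# Usage, with the two modules named in the log:
# missing_pkgconfig 'gtk+-2.0' 'libR >= 3.0'
```

An empty result means both modules are visible to pkg-config. Anything printed still needs to be installed or added to PKG_CONFIG_PATH.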

Error when using matrix-path with cellranger .gz files

I'm running this on the cluster, where our cluster admins created a singularity instance of the docker version of too-many-cells. I was able to start running this with an input for a .csv file, but ran into memory issues, so wanted to try on just a small subset with 2 samples. I tried loading the samples as shown in the workshop tutorial:

singularity run /share/data2/applications/singularity_images/too-many-cells.sif make-tree \
  --matrix-path /share/lab/me/scRNAseq/sample1/outs/filtered_feature_bc_matrix/ \
  --matrix-path /share/lab/me/scRNAseq/sample2/outs/filtered_feature_bc_matrix/ \
  --output outTest20200420 \
  > clustersTest.csv

and got the following error message:

Error in load(name, envir = .GlobalEnv) : 
  bad restore file magic number (file may be corrupted) -- no data loaded
Calls: sys.load.image -> load
In addition: Warning message:
file ‘.RData’ has magic number 'RDX3'
  Use of save versions prior to 2 is deprecated 
Execution halted
too-many-cells: readCreateProcess: R "-e" "cat(R.home())" "--quiet" "--slave" (exit 1): failed

The CellRanger data files are .gz files, but the documentation indicates that these can be read by too-many-cells. Please advise on what I should do to remedy this issue. I'm not trying to use the tool in R, but this seems like an R error.
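
The R messages name `.RData` and `sys.load.image`, the startup step in which R restores a saved workspace from the current working directory. 'RDX3' marks workspace format version 3, which only R 3.5.0 and later can read. A minimal sketch, assuming the cause is a stray `.RData` in the launch directory being picked up by an older R inside the image. `stash_rdata` is a hypothetical helper that moves the file aside without deleting it.

```shell
#!/bin/sh
# stash_rdata [DIR]   (hypothetical helper)
# Renames DIR/.RData to DIR/.RData.bak so that R no longer restores it at startup.
stash_rdata() {
    dir=${1:-.}
    if [ -f "$dir/.RData" ]; then
        mv "$dir/.RData" "$dir/.RData.bak"
        echo "moved $dir/.RData to $dir/.RData.bak"
    else
        echo "no .RData in $dir"
    fi
}

# Usage, in the directory the singularity command is launched from:
# stash_rdata .
```

Running the command from a different directory that has no `.RData` would test the same assumption without touching the file.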

Running the too-many-cells Docker image 'gregoryschwartz/too-many-cells' fails in the middle of the procedure

Hi @GregorySchwartz
Thanks for developing these great tools for single cell analysis.
The error is triggered when running the gregoryschwartz/too-many-cells docker image with the following command: 'docker run -it --rm -v //share/nas1/Data/Users/user/Personalization/20191206_ana/20200319_mainline/result/integrated/vst/subtypes/T_cells/TooManyCells/:/input_matrix gregoryschwartz/too-many-cells:0.2.2.0 make-tree --matrix-path /input_matrix/T_cells.csv -l /input_matrix/T_cells_ann.csv -o ./out'
The error is shown in the picture below:

I am utterly confused about why the run broke down in a middle step, at 70%, rather than at the beginning of the docker run.
Does that mean the input files I provided were correct? If so, what went wrong with my run?
Thanks.
Any advice would be appreciated.
System:
CentOS Linux release 7.3.1611 (Core)
The input file formats are displayed below:
cell annotation :

expression matrix:

nix version code error when running differential analysis between one and many

Hi,
My OS is Pop!_OS 19.10, which is a derivative of Ubuntu 19.10. After generating the tree, I tried to run the DE analysis as follows:

    too-many-cells differential \
    --matrix-path \
        /home/derek/research/Kim-Lab/traf6-tooManyCells/data/wt \
    --prior \
        /home/derek/research/Kim-Lab/traf6-tooManyCells/exp-2/out/ \
    --feature-column 2 \
    --labels-file \
        /home/derek/research/Kim-Lab/traf6-tooManyCells/data/ko-labels.csv \
    --nodes "([2], [10, 17, 22])" \
    > ko_2_vs_10_17_22_differential.csv

I got the following error:

too-many-cells: R Runtime Error: Error: package or namespace load failed for ‘edgeR’ in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):
 there is no package called ‘locfit’

But if I change the nodes to --nodes "([],[])", there will be no error.

I tried to fix the error by updating the beginning of default.nix as follows and reinstalling the package.

# default.nix
{ compilerVersion ? "ghc865", pkgsLink ? (builtins.fetchTarball https://github.com/NixOS/nixpkgs/archive/1e90c46c2d98f9391df79954a74d14f263cad729.tar.gz)}:
let
  # Packages
  config = { allowBroken = true;
             allowUnfree = true;
             packageOverrides = super: let self = super.pkgs; in {
                Renv = super.rWrapper.override {
                  packages = with self.rPackages; [
                    ggplot2
                    devtools
                    cowplot
                    dplyr
                    jsonlite
                    reshape2
                    locfit
                    limma
                    edgeR 
                  ];
                };
              };
           };

But I got an error:

too-many-cells: R Runtime Error: Error in library(edgeR) : there is no package called ‘edgeR’

If I remove 'limma' from the above default.nix, I get

too-many-cells: R Runtime Error: Error: package ‘limma’ required by ‘edgeR’ could not be found

For now I have to switch to docker, but is it possible to fix this? Thank you very much!

Error when installing with nix on CentOS 7

Hi,

I was trying to install too-many-cells with nix-env, but I encountered the following error.

...
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (gert)
post-installation fixup
shrinking RPATHs of ELF executables and libraries in /nix/store/h9as7005w58sizn57y1ps5hym37dr940-r-gert-1.0.2
shrinking /nix/store/h9as7005w58sizn57y1ps5hym37dr940-r-gert-1.0.2/library/gert/libs/gert.so
strip is /nix/store/8xiqj7nkx1z0mxhda9pcl8fa51zmxqd1-binutils-2.35.1/bin/strip
patching script interpreter paths in /nix/store/h9as7005w58sizn57y1ps5hym37dr940-r-gert-1.0.2
checking for references to /tmp/nix-build-r-gert-1.0.2.drv-0/ in /nix/store/h9as7005w58sizn57y1ps5hym37dr940-r-gert-1.0.2...
building '/nix/store/3n793cglv7dbwzpvw1rhnj4j1rry43si-diagrams-cairo-1.4.1.1.drv'...
setupCompilerEnvironmentPhase
Build with /nix/store/2ip3lwzqswai3zz33407h4b2q3har28p-ghc-8.10.4.
ghc-pkg: Couldn't open database /tmp/nix-build-diagrams-cairo-1.4.1.1.drv-0/setup-package.conf.d for modification: {handle: /tmp/nix-build-diagrams-cairo-1.4.1.1.drv-0/setup-package.conf.d/package.cache.lock}: hLock: invalid argument (Invalid argument)
error: builder for '/nix/store/3n793cglv7dbwzpvw1rhnj4j1rry43si-diagrams-cairo-1.4.1.1.drv' failed with exit code 1
error: 1 dependencies of derivation '/nix/store/9m6m0in321whhiijcj55lvfq97978lq0-too-many-cells-2.2.0.0.drv' failed to build

The nix version

nix-env (Nix) 2.4pre20210503_6d2553a

I am not familiar with nix, and could not find anything related through Google.

Could you please help me with this?

Best wishes,
Yiwei

Purity section in too-many-cells

Dear Gregory:
Thanks for building such a nice tool. I want to compare the accuracy of clustering algorithms by measuring how closely the clusters match the true labels, which is called 'cluster purity' in your article. I am a bit confused about this part, and I cannot find a corresponding parameter in the help page of too-many-cells.
I tried to run the source code, but I have not yet learned the programming language used for the 'purity' part, so it is difficult for me at present.
I would like to ask whether the too-many-cells pipeline has a purity parameter that I may have overlooked, or whether you have R code for 'purity' that you could provide for reference.
Thanks for your time!
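
For reference, the textbook definition of cluster purity assigns each cluster its most frequent true label and reports the fraction of cells that match: purity = (1/N) * sum over clusters k of max over labels j of |cluster k ∩ label j|. The sketch below computes that from a headerless two-column CSV with one `cluster,label` row per cell. This is the standard definition and not necessarily the exact measure used in the publication; the `purity` helper and the input layout are assumptions for illustration.

```shell
#!/bin/sh
# purity FILE   (hypothetical helper)
# FILE: headerless CSV, one row per cell: cluster,label
# Prints the fraction of cells carrying the majority label of their cluster.
purity() {
    awk -F, '
        { count[$1 SUBSEP $2]++; total++ }
        END {
            # Largest label count within each cluster.
            for (key in count) {
                split(key, parts, SUBSEP)
                cluster = parts[1]
                if (count[key] > best[cluster]) best[cluster] = count[key]
            }
            # Sum the per-cluster maxima and divide by the number of cells.
            for (cluster in best) matched += best[cluster]
            if (total > 0) printf "%.4f\n", matched / total
        }
    ' "$1"
}

# Usage:
# purity cluster_label_pairs.csv
```

Purity rises trivially as the number of clusters grows (one cell per cluster gives 1.0), so it is usually compared between clusterings of similar granularity.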

error message during installation of too-many-cells on ubuntu

I am trying to install too-many-cells on Ubuntu 19.04. I am getting this error message:
"-- While building package glib-0.13.6.0 using:
/tmp/stack24744/glib-0.13.6.0/.stack-work/dist/x86_64-linux-tinfo6/Cabal-2.2.0.1/setup/setup --builddir=.stack-work/dist/x86_64-linux-tinfo6/Cabal-2.2.0.1 build --ghc-options " -fdiagnostics-color=always"
Process exited with code: ExitFailure 1
Progress 211/248"
Could you please help me to solve the issue?
Thanks in advance,
Zeinab

Running a million cells failed in the docker

Hi Gregory, I installed the package via docker. I gave docker 15g of RAM, which is the maximum available. It runs well for 500,000 cells, but it fails for a million cells. Could you give me some help? I used the command below. Thank you.
docker run -it --rm -v "/home/username:/home/username" \
  -m 15g \
  gregoryschwartz/too-many-cells:2.0.0.0 make-tree \
  --matrix-path /home/username/path/to/input \
  --labels-file /home/username/path/to/labels.csv \
  --draw-collection "PieRing" \
  --output /home/username/path/to/out \
  > clusters.csv

Not able to find the output files

Hi Dr. Schwartz,

Great talk last week in the Stem Cell Club! TooManyCells seems to be a powerful and fast algorithm for cell clustering. I have relatively limited knowledge of informatics, and it's my first time using a language other than R, so please excuse me if my question is naive.

We have a Windows 7 machine in our lab, so I had to install Docker Toolbox and run everything in a VirtualBox VM using the command "docker-machine ssh default".
Then I followed the instructions on your website, trying to replicate the figures you have there. Please see below.

#download brain data
mkdir -p data/brain #Make directory
cd ./data/brain #Enter the directory
wget http://cf.10xgenomics.com/samples/cell-exp/3.0.0/neuron_1k_v3/neuron_1k_v3_filtered_feature_bc_matrix.tar.gz #Download the data
tar xvf neuron_1k_v3_filtered_feature_bc_matrix.tar.gz #Uncompress data
cd .. #go to upper folder directory

#download heart data
mkdir -p heart #Make the data directory
cd ./heart #Enter the directory
wget http://cf.10xgenomics.com/samples/cell-exp/3.0.0/heart_1k_v3/heart_1k_v3_filtered_feature_bc_matrix.tar.gz #Download the data
tar xvf heart_1k_v3_filtered_feature_bc_matrix.tar.gz #Uncompress data
cd ..
cd ..

#Prevent overlapping
#Backup barcodes
cp ./data/brain/filtered_feature_bc_matrix/barcodes.tsv{.gz,.gz.bk}
cp ./data/heart/filtered_feature_bc_matrix/barcodes.tsv{.gz,.gz.bk}
#Edit barcodes
cat ./data/heart/filtered_feature_bc_matrix/barcodes.tsv.gz.bk | gzip -d | sed "s/-1/-2/g" | gzip > ./data/heart/filtered_feature_bc_matrix/barcodes.tsv.gz #Now let's edit the heart barcodes to have -2 instead of -1.
cat ./data/heart/filtered_feature_bc_matrix/barcodes.tsv.gz | gzip -d | head

So up until these steps, everything is fine. Then I ran:
sudo docker run -it --rm -v "/home/docker:/home/docker" \
    gregoryschwartz/too-many-cells:0.2.2.0 make-tree \
    --matrix-path ./data/brain/filtered_feature_bc_matrix/ \
    --matrix-path ./data/heart/filtered_feature_bc_matrix/ \
    --output out > clusters1.csv
printf "./out/dendrogram.svg"

Here's where I have questions. Still no error messages, but I can't find or visualize "clusters.csv" or "dendrogram.svg" on my Windows computer.
-For "clusters.csv", I can see it's in "/home/docker/" folder, but I couldn't export the file to my local Windows computer.
-And for "dendrogram.svg", I was not able to find the file anywhere at all.

Therefore I have two questions,
(1) Is there a way I can export those files back to the Windows computer?
(2) For the annotated "clusters.csv" and "labels.csv", do you have a way or code to export those annotations back to the Seurat file so that I can compare the two clustering methods? (Just like what you did in Figure 6i of your Nature Methods paper.)

I greatly appreciate your time and look forward to your feedback and insights.

Regards,
Joey
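
For question (2), one way to compare two clusterings without moving data into Seurat is to cross-tabulate the assignments per cell. The Python sketch below is only an illustration and is not part of too-many-cells or Seurat. It assumes both assignments are already loaded as dictionaries keyed by barcode, for example from clusters.csv and from a hypothetical CSV export of the Seurat cluster identities; the function name and the toy barcodes are made up.

```python
from collections import Counter


def cross_tabulate(assignment_a, assignment_b):
    """Count cells for every (cluster in A, cluster in B) pair over shared barcodes."""
    shared = assignment_a.keys() & assignment_b.keys()
    return Counter((assignment_a[cell], assignment_b[cell]) for cell in shared)


# Hypothetical toy assignments keyed by barcode.
tmc = {"AAAC-1": "3", "AAAG-1": "3", "AAAT-1": "4", "AACC-1": "4"}
seurat = {"AAAC-1": "0", "AAAG-1": "0", "AAAT-1": "0", "AACC-1": "1"}

table = cross_tabulate(tmc, seurat)
for (a, b), n in sorted(table.items()):
    print(f"too-many-cells {a} / Seurat {b}: {n} cells")
```

Pairs with large counts show where the two methods agree; a cluster from one method spread over many clusters of the other shows where they differ.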

cannot open file 'Rplots.pdf'

Hello,

I've converted the Docker image to Singularity and after some experimentation got this error:

» ./run.bash                                               
Counting leaves [=========================================================>......................................]  60%cell,cluster,path
GATCACACACCCTGTT-1,0,0

Painting sketches [=================================================================>............................]  70%too-many-cells: R Runtime Error: Error in (function (file = if (onefile) "Rplots.pdf" else "Rplot%03d.pdf",  :
  cannot open file 'Rplots.pdf'

where "run.bash" consists of:

» cat run.bash                                             
#!/bin/bash -e

singularity run --pwd /opt/too-many-cells \
--bind /too-many-cells/Singularity/input:/input:ro \
--bind /too-many-cells/Singularity/output:/output:rw \
too-many-cells-0.2.2.2.img \
make-tree --matrix-path /input --labels-file /input/labels.csv --draw-collection "PieRing" --output /output

I suspect this has something to do with paths but I am not sure. R version inside the container shows 3.4.4:

Singularity too-many-cells-0.2.2.2.img:/opt/too-many-cells> which R
/usr/bin/R
Singularity too-many-cells-0.2.2.2.img:/opt/too-many-cells> R

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(ggplot2)
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggplot2_3.1.0

loaded via a namespace (and not attached):
 [1] colorspace_1.4-0 scales_1.0.0     compiler_3.4.4   plyr_1.8.4
 [5] lazyeval_0.2.1   withr_2.1.2      pillar_1.3.1     gtable_0.2.0
 [9] tibble_2.0.1     crayon_1.3.4     Rcpp_1.0.0       grid_3.4.4
[13] pkgconfig_2.0.2  rlang_0.3.1      munsell_0.5.0
>

And the "ggplot2" version is 3.1.0. If you have any ideas please let me know. Thank you very much!

Installation fails with nix-env ('__noChroot' set, but that's not allowed when 'sandbox' is 'true')

Hello Gregory,

Thank you for this very nice package.

I encounter this error when trying to install TooManyCells from nix:
nix-env -f default.nix -i too-many-cells
...
error: derivation '/nix/store/jizcclxhhmqaxm9cy1wcmmy5nz1h0mwp-homer-4.11.drv' has '__noChroot' set, but that's not allowed when 'sandbox' is 'true'

I got past this error by editing line 21 of deps/homer.nix to set __noChroot to false: __noChroot = false;

So then the homer installation is fine but the first test of the installation fails. I don't know if it is linked to the modification above.

[1 of 1] Compiling Main             ( tests/test-qq.hs, dist/build/test-qq/test-qq-tmp/Main.o, dist/build/test-qq/test-qq-tmp/Main.dyn_o )
Linking dist/build/test-qq/test-qq ...
Preprocessing test suite 'tests' for inline-r-0.10.4..
Building test suite 'tests' for inline-r-0.10.4..
[1 of 8] Compiling Test.Constraints ( tests/Test/Constraints.hs, dist/build/tests/tests-tmp/Test/Constraints.o, dist/build/tests/tests-tmp/Test/Constraints.dyn_o )
[2 of 8] Compiling Test.Event       ( tests/Test/Event.hs, dist/build/tests/tests-tmp/Test/Event.o, dist/build/tests/tests-tmp/Test/Event.dyn_o )
[3 of 8] Compiling Test.FunPtr      ( tests/Test/FunPtr.hs, dist/build/tests/tests-tmp/Test/FunPtr.o, dist/build/tests/tests-tmp/Test/FunPtr.dyn_o )
[4 of 8] Compiling Test.GC          ( tests/Test/GC.hs, dist/build/tests/tests-tmp/Test/GC.o, dist/build/tests/tests-tmp/Test/GC.dyn_o )
[5 of 8] Compiling Test.Matcher     ( tests/Test/Matcher.hs, dist/build/tests/tests-tmp/Test/Matcher.o, dist/build/tests/tests-tmp/Test/Matcher.dyn_o )
[6 of 8] Compiling Test.Regions     ( tests/Test/Regions.hs, dist/build/tests/tests-tmp/Test/Regions.o, dist/build/tests/tests-tmp/Test/Regions.dyn_o )
[7 of 8] Compiling Test.Vector      ( tests/Test/Vector.hs, dist/build/tests/tests-tmp/Test/Vector.o, dist/build/tests/tests-tmp/Test/Vector.dyn_o )
[8 of 8] Compiling Main             ( tests/tests.hs, dist/build/tests/tests-tmp/Main.o, dist/build/tests/tests-tmp/Main.dyn_o )
Linking dist/build/tests/tests ...
Preprocessing test suite 'test-shootout' for inline-r-0.10.4..
Building test suite 'test-shootout' for inline-r-0.10.4..
[1 of 2] Compiling Test.Scripts     ( tests/Test/Scripts.hs, dist/build/test-shootout/test-shootout-tmp/Test/Scripts.o, dist/build/test-shootout/test-shootout-tmp/Test/Scripts.dyn_o )
[2 of 2] Compiling Main             ( tests/test-shootout.hs, dist/build/test-shootout/test-shootout-tmp/Main.o, dist/build/test-shootout/test-shootout-tmp/Main.dyn_o )
Linking dist/build/test-shootout/test-shootout ...
running tests
Running 3 test suites...
Test suite test-qq: RUNNING...
Test suite test-qq: FAIL
Test suite logged to: dist/test/inline-r-0.10.4-test-qq.log
Test suite tests: RUNNING...
Test suite tests: PASS
Test suite logged to: dist/test/inline-r-0.10.4-tests.log
Test suite test-shootout: RUNNING...
Test suite test-shootout: PASS
Test suite logged to: dist/test/inline-r-0.10.4-test-shootout.log
2 of 3 test suites (2 of 3 test cases) passed.
error: builder for '/nix/store/x2wcyjx5hwfn2b28qmbrb6qfr410yy97-inline-r-0.10.4.drv' failed with exit code 1
error: 1 dependencies of derivation '/nix/store/m4i3lm899cid0m9jqqs0bhd2vsirsag9-too-many-cells-3.0.1.0.drv' failed to build

Thanks in advance,
Pacôme

error: "runInteractiveProcess:exec:does not exist"

Encountered the following error when running "make-tree":

"twopi: runInteractiveProcess: runInteractiveProcess: exec: does not exist (No such file or directory)"

The command I used was:
~/.local/bin/too-many-cells make-tree -m ~/too-many-cells/filtered_feature_bc_matrix --draw-collection "PieRing" -- output out >clusters.csv

Index out of bounds for TooManyPeaks

Hello,

Thank you for making this neat program!

I am attempting to run TooManyPeaks according to your vignette, however, I keep running into the following error message:

too-many-cells: ./Data/Vector/Generic.hs:257 ((!)): index out of bounds (0,0)
CallStack (from HasCallStack):
error, called at ./Data/Vector/Internal/Check.hs:87:5 in vector-0.12.2.0-2zHAlcbUq1I7DJzPuCRDoB:Data.Vector.Internal.Check

Here is my code:
too-many-cells make-tree \
    -m test.sorted2.tsv \
    --filter-thresholds "(1000, 1)" \
    --binwidth 5000 \
    --lsa 50 \
    --normalization NoneNorm \
    --blacklist-regions-file Anshul_Hg19UltraHighSignalArtifactRegions.bed.gz \
    --draw-node-number \
    --draw-mark "MarkModularity" \
    --fragments-output \
    --labels-output \
    -o out \
    > out_leaves.csv

I've attached the top and bottom rows of my tsv file here.

Also, I am wondering if it is possible to provide a labels.csv file, including barcode and cell type, for the TooManyPeaks option.

Thank you. I would really like to get this to work!

A problem for batch effect removed data

Hello, I have tried TooManyCells on a single case (four samples from the same case), and the result looks very nice. Now I want to try it on multiple cases.
However, when I ran make-tree on the cellranger output for 17 cases, I got a strange result. I suspect it is caused by batch effects, so I want to run make-tree on batch-effect-corrected data. I would like to know:

(1) The batch-effect-corrected data was generated by Harmony, which is integrated into the Seurat pipeline. The data is in .rds format and the Harmony result contains no expression data. In this case, how should I prepare the input data for make-tree?
(2) TooManyCells appears to be memory-intensive. In my first run (four samples from the same case, about 16,000 cells), I did not encounter any problems, but in my second run (17 cases, about 190,000 cells), 300 GB of memory was not enough: make-tree halted without any warning or error message. I am not sure whether my guess is correct, so I would appreciate your comments.

Thank you very much!

Normalization section in too-many-cells differential -h

Hi,

Thanks for building such a nice tool. But I am a bit confused about the normalization in the differential analysis component. The help page does not help much since it is basically identical to the help page of make-tree.

It seems that the too-many-cells pipeline runs two normalization algorithms: one before clustering and the other before DE analysis. By default, the normalization before clustering is TfIdfNorm and the normalization before DE analysis is NoneNorm, so that the program invokes edgeR's normalization algorithm. Is my understanding right?

I guess my question is to double-check the function of the --normalization argument in too-many-cells differential. May I assume it is used to choose the normalization algorithm before DE and that by default it should be NoneNorm?

Currently, the Normalization section in the help page of too-many-cells differential is identical to the one for make-tree, which makes me feel like I am choosing an algorithm for clustering.

Thanks for your time!

fail when downloading ucsc browser

I am using nix to install too-many-cells on my Ubuntu 20 server. After cloning the GitHub repository and running nix-env -f default.nix -i too-many-cells, I always get the following error. What should I do?

Installing library in /nix/store/qyzrysq5rvfjlcg5x608w9rjaa2zw7xh-modularity-0.2.1.1/lib/ghc-8.10.4/x86_64-linux-ghc-8.10.4/modularity-0.2.1.1-IqGSQGN5D2q2AdgZmiJt0X
post-installation fixup
shrinking RPATHs of ELF executables and libraries in /nix/store/qyzrysq5rvfjlcg5x608w9rjaa2zw7xh-modularity-0.2.1.1
shrinking /nix/store/qyzrysq5rvfjlcg5x608w9rjaa2zw7xh-modularity-0.2.1.1/lib/ghc-8.10.4/x86_64-linux-ghc-8.10.4/libHSmodularity-0.2.1.1-IqGSQGN5D2q2AdgZmiJt0X-ghc8.10.4.so
strip is /nix/store/8xiqj7nkx1z0mxhda9pcl8fa51zmxqd1-binutils-2.35.1/bin/strip
stripping (with command strip and flags -S) in /nix/store/qyzrysq5rvfjlcg5x608w9rjaa2zw7xh-modularity-0.2.1.1/lib
patching script interpreter paths in /nix/store/qyzrysq5rvfjlcg5x608w9rjaa2zw7xh-modularity-0.2.1.1
checking for references to /build/ in /nix/store/qyzrysq5rvfjlcg5x608w9rjaa2zw7xh-modularity-0.2.1.1...
shrinking RPATHs of ELF executables and libraries in /nix/store/2md5vngjf8cz3hmb8xh2kf985dbwfvim-modularity-0.2.1.1-doc
strip is /nix/store/8xiqj7nkx1z0mxhda9pcl8fa51zmxqd1-binutils-2.35.1/bin/strip
patching script interpreter paths in /nix/store/2md5vngjf8cz3hmb8xh2kf985dbwfvim-modularity-0.2.1.1-doc
checking for references to /build/ in /nix/store/2md5vngjf8cz3hmb8xh2kf985dbwfvim-modularity-0.2.1.1-doc...
building '/nix/store/7f2mn29swi2l69ffzrx8s5d9f913jq5h-source.drv'...

trying https://github.com/ucscGenomeBrowser/kent/archive/v404_base.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:02:00 --:--:-- 0
curl: (52) Empty reply from server
error: cannot download source from any mirror
error: builder for '/nix/store/7f2mn29swi2l69ffzrx8s5d9f913jq5h-source.drv' failed with exit code 1;
last 7 log lines:
>
> trying https://github.com/ucscGenomeBrowser/kent/archive/v404_base.tar.gz
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 0 0 0 0 0 0 0 0 --:--:-- 0:02:00 --:--:-- 0
> curl: (52) Empty reply from server
> error: cannot download source from any mirror
For full logs, run 'nix log /nix/store/7f2mn29swi2l69ffzrx8s5d9f913jq5h-source.drv'.
error: 1 dependencies of derivation '/nix/store/cv85fnqllsy0ps8a40wbdr5ggh3v9bvr-kent-404.drv' failed to build
error: 1 dependencies of derivation '/nix/store/x2i34nk8ljfaammqxgv5rmjg6by5wf5n-too-many-cells-2.2.1.0.drv' failed to build

Question: Does well separated color in the dendrogram mean that the items are distinct?

Hello,

Thanks for providing this interesting cell clustering method. I am using it to analyze the similarity between the spliced and unspliced count matrices from the same single-cell RNA-seq dataset, so the two count matrices are really two types of signal from the same sample. The result returned by TooManyCells is interesting, and I hope you can help me understand what it tells us. Thank you in advance!

I gave TooManyCells the spliced and unspliced raw count matrices and labeled them "S" for spliced and "U" for unspliced. The result shows that the items from the two matrices are well separated and have no overlap.

If possible, could you please help me understand the result? I have the following questions:

(1) Do I need to normalize or scale the two matrices before running TooManyCells?
(2) If giving the raw count matrices is the correct thing to do, does this result (S and U items are separated) mean that the two types of signals from the same sample are totally different?
(3) Why is the unspliced side of the tree larger than the spliced side? Does this mean that unspliced can separate the data better?
(4) If the result shows that the two types of signals are totally different, how could I show that both of them are biologically meaningful? For example, do you think that finding rare cell types from the items in each matrix separately will show that the two matrices are different and both biologically meaningful? Are there any other things I can do to show that they are both biologically meaningful?

Thanks so much! I am looking forward to your reply!

Best,
Dongze

Strange warnings occurred during the procedure

Hello, thanks for developing the great tool for single-cell analysis.

I encountered some strange warnings while running the following script:

docker run --rm -v /data:/data --name toomanycells gregoryschwartz/too-many-cells:2.0.0.0 make-tree \
    --matrix-path /data/share/scRNAseq/results/6N-total/outs/filtered_feature_bc_matrix \
    -Z 6N-total \
    --matrix-path /data/share/scRNAseq/results/6N-Bcell/outs/filtered_feature_bc_matrix \
    -Z 6N-Bcell \
    --labels-file /data/TooManyCells/test2.labels.csv \
    --filter-thresholds "(250, 1)" \
    --draw-collection "PieRing" \
    --output /data/TooManyCells/out2 \
    > /data/TooManyCells/clusters2.csv

--matrix-path is the output directory of cellranger.
-Z gives the sample name, which is the same as the label in the labels file, like this:

The test2.labels.csv contains all barcodes in 6N-total or 6N-Bcell samples.

The warnings are as follows:

[=======================>...................]  55%
Cell missing a label.
Warning: Problem in diversity, skipping cluster_diversity.csv output ...
[===========================>...............]  64%
Cell has no label: Id {unId = "AATCGGTTCTGCTGCT-1-6N-total"}
Warning: Problem in clumpiness, skipping clumpiness output ...
[===========================================] 100%

However, I can find this label AATCGGTTCTGCTGCT-1,6N-total in test2.labels.csv
Some of the results are missing, such as clumpiness.csv, clumpiness.pdf, and cluster_diversity.csv. Also, dendrogram.svg is entirely black and white, with no color to distinguish different samples (or labels?).

Could you tell me how to solve this? Thank you very much!
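
The warning reports the cell as "AATCGGTTCTGCTGCT-1-6N-total", while the labels file lists the barcode as "AATCGGTTCTGCTGCT-1". If the -Z sample name is appended to each barcode, as the warning suggests (an inference from the message above, not confirmed behavior), the items in the labels file would need the same suffix. A minimal Python sketch of that rewrite follows, assuming a two-column labels file with an `item,label` header row in which the label equals the sample name. The helper names are hypothetical.

```python
import csv
import io


def add_sample_suffix(rows):
    """Turn (barcode, label) rows into (barcode-label, label) rows, skipping blank rows."""
    return [(f"{row[0]}-{row[1]}", row[1]) for row in rows if row]


def rewrite_labels(text):
    """Rewrite the text of a labels CSV (header row first) with suffixed items."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(add_sample_suffix(reader))
    return out.getvalue()


example = "item,label\nAATCGGTTCTGCTGCT-1,6N-total\n"
print(rewrite_labels(example), end="")
```

If the suffix assumption holds, the rewritten item matches the identifier shown in the warning.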
