vals / umis

Tools for processing UMI RNA-tag data

License: MIT License

Shell 8.98% Python 90.82% Cython 0.21%

umis's Introduction

umis

umis provides tools for estimating expression in RNA-Seq data from protocols that sequence the end tags of transcripts and incorporate molecular tags to correct for amplification bias.

There are four steps in this process:

1. Formatting reads
2. Filtering noisy cellular barcodes
3. Pseudo-mapping to cDNAs
4. Counting molecular identifiers

1. Formatting reads

We want to strip out all non-biological segments of the sequenced reads for the sake of mapping, while also keeping this information for later use. The non-biological segments considered here are the cellular barcode (CB) and the molecular barcode (MB). So that the optional CB and the MB can be extracted later, they are put in the read header, in the following format.

@HWI-ST808:130:H0B8YADXX:1:1101:2088:2222:CELL_GGTCCA:UMI_CCCT
AGGAAGATGGAGGAGAGAAGGCGGTGAAAGAGACCTGTAAAAAGCCACCGN
+
@@@DDBD>=AFCF+<CAFHDECII:DGGGHGIGGIIIEHGIIIGIIDHII#

The command umis fastqtransform transforms a read (or a pair of reads) to this format based on a transform file. The transform file is a JSON file containing a Python-flavored regular expression for each read, written to extract the necessary components of the reads.
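As a rough illustration of what such a regular expression does, the sketch below applies a pattern with named groups to a single FASTQ record and rebuilds the header in the format shown above. The group names CB and MB follow the abbreviations used in this README; the pattern itself, the 6-base barcode and 4-base UMI layout, and the function name transform_record are assumptions made for the example, not the contents of a real transform file.

```python
import re

# Hypothetical read layout: 6-base cell barcode, 4-base UMI, then the cDNA tag.
READ_PATTERN = re.compile(
    r"@(?P<name>\S+).*\n"                      # header: keep the read name only
    r"(?P<CB>.{6})(?P<MB>.{4})(?P<seq>.*)\n"   # barcode, UMI, biological sequence
    r"\+.*\n"                                  # separator line
    r".{10}(?P<qual>.*)\n"                     # drop the qualities of CB + MB
)


def transform_record(record):
    """Move the barcode and UMI of one FASTQ record into the read header."""
    match = READ_PATTERN.match(record)
    if match is None:
        raise ValueError("record does not match the expected read layout")
    parts = match.groupdict()
    return "@{name}:CELL_{CB}:UMI_{MB}\n{seq}\n+\n{qual}\n".format(**parts)


record = (
    "@HWI-ST808:130:H0B8YADXX:1:1101:2088:2222 1:N:0\n"
    "GGTCCACCCTAGGAAGATGG\n"
    "+\n"
    "IIIIIIIIII@@@DDBD>=A\n"
)
print(transform_record(record), end="")
```

Raising an explicit error when the pattern does not match is a choice made for this sketch; it makes a mismatch between the pattern and the read layout visible immediately.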

2. Filtering noisy cellular barcodes

Not all cellular barcodes identified in the transformation will be real. Some will be low-abundance barcodes that do not represent an actual cell. Others will be barcodes that don't come from a set of known barcodes. The umis cb_filter command can be used to filter a transformed FASTQ file, dropping unknown barcodes. The --nedit option can be supplied to correct barcodes that are --nedit edits away from a known barcode. After barcode filtering, the umis cb_histogram command will generate a file of counts for each cellular barcode. This file can be used to find a count cut-off that separates high-abundance barcodes for downstream quantitation.
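The idea of correcting a barcode against a known list can be sketched as below. This is a simplified stand-in, not the code behind cb_filter: it only considers single-base substitutions (not insertions or deletions), and the function names are hypothetical.

```python
def substitution_neighbors(barcode, alphabet="ACGTN"):
    """Yield every sequence that differs from `barcode` by one substitution."""
    for i, original in enumerate(barcode):
        for base in alphabet:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]


def correct_barcode(barcode, known):
    """Return the matching known barcode, or None if there is no unique match.

    `known` should be a set so that each membership test is constant time.
    """
    if barcode in known:
        return barcode
    candidates = {n for n in substitution_neighbors(barcode) if n in known}
    if len(candidates) == 1:
        return candidates.pop()
    return None  # unknown, or ambiguous between several known barcodes


known_barcodes = {"GGTCCA", "AAAAAA"}
print(correct_barcode("GGTCCT", known_barcodes))
```

Generating the neighbors of the observed barcode and looking them up in a set keeps the cost per read independent of the size of the known-barcode list, which matters when that list has hundreds of thousands of entries.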

3. Pseudo-mapping to cDNAs

This is done by a pseudo-aligner, either Kallisto or RapMap. The SAM (or BAM) file output by these tools needs to be saved.

4. Counting molecular identifiers

The final step is to infer which cDNA was the origin of the tag a UMI was attached to. We use the pseudo-alignments to the cDNAs, and consider a tag assigned to a cDNA as partial evidence for a (cDNA, UMI) pairing. For actual counting, we only count unique UMIs for (gene, UMI) pairings with sufficient evidence.
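A minimal sketch of this evidence-based counting follows, under stated assumptions: each alignment of a read that hit n_hits targets contributes 1 / n_hits evidence, and a UMI is counted once for a (cell, gene) pair when its summed evidence reaches a threshold. The name minevidence mirrors a variable quoted in an issue further down this page; its default of 1.0 here, the input tuple layout, and the function name are assumptions for illustration.

```python
from collections import defaultdict


def count_umis(alignments, minevidence=1.0):
    """Count unique UMIs per (cell, gene) with enough supporting evidence.

    `alignments` is an iterable of (cell, umi, gene, n_hits) tuples, one per
    reported alignment; n_hits is how many targets that read aligned to.
    """
    evidence = defaultdict(float)
    for cell, umi, gene, n_hits in alignments:
        # A read that hits several targets is split evenly between them.
        evidence[(cell, gene, umi)] += 1.0 / n_hits

    counts = defaultdict(int)
    for (cell, gene, umi), weight in evidence.items():
        if weight >= minevidence:
            counts[(cell, gene)] += 1
    return dict(counts)


example = [
    ("c1", "AAAA", "g1", 1),
    ("c1", "AAAA", "g1", 1),  # duplicate read of the same molecule
    ("c1", "CCCC", "g1", 2),  # multi-mapped read, half evidence each
    ("c1", "CCCC", "g2", 2),
    ("c1", "GGGG", "g2", 1),
]
print(count_umis(example))
```

In this example the two identical reads collapse into one UMI for g1, and the multi-mapped UMI is dropped for both genes because neither pairing reaches the threshold.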

To count, use the command umis tagcount. This requires a SAM or BAM file as input.

By default, the read name is used to obtain the cell barcode and UMI sequences. Optionally, when using the --parse_tags option, the CR and UM BAM tags are used to extract the cell barcode and UMI, respectively.
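To make the read-name convention concrete, the sketch below pulls the barcode and UMI out of a name in the CELL_/UMI_ format shown in step 1. The pattern and the function name are assumptions for illustration and may differ from the expression umis uses internally.

```python
import re

# Matches the ":CELL_<barcode>:UMI_<umi>" suffix written in step 1.
NAME_PATTERN = re.compile(r":CELL_(?P<CB>[ACGTN]+):UMI_(?P<MB>[ACGTN]+)")


def parse_read_name(name):
    """Return (cell_barcode, umi) from a transformed read name."""
    match = NAME_PATTERN.search(name)
    if match is None:
        # Fail with a readable message instead of an AttributeError on None.
        raise ValueError("read name has no CELL_/UMI_ fields: %r" % name)
    return match.group("CB"), match.group("MB")


print(parse_read_name(
    "HWI-ST808:130:H0B8YADXX:1:1101:2088:2222:CELL_GGTCCA:UMI_CCCT"
))
```

Checking the match before calling group() means that a BAM file whose reads were never transformed produces a clear error rather than a failure deep inside the counting step.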

The recommended workflow is to map reads to cDNA, in which case the target name in the BAM will be a transcript ID. If the BAM has been mapped to a genome (e.g. with STAR) tagcount can use the optional GX BAM tag to get the gene name. In this case, use the option --gene_tags.

kallisto

The quantitation used in umis handles reads that could come from multiple transcripts by assigning a fractional count to each transcript and then filtering for a minimum count at the end. Many single-cell analyses use something similar to this type of counting, but it has drawbacks (see this paper). For more principled UMI quantification, see Kallisto. kallisto needs the files in a certain format: each cellular barcode has its own FASTQ file and a file that lists the UMI for each read. The umis kallisto command can reformat your fastq files to that format.

umis's Issues

During filtering noisy cellular barcodes

Is this step normally supposed to take a long time? I tried the multi-core option, but it doesn't work.

Adjusting for Poisson counting statistics

Hi,

I was wondering if you are adjusting for Poisson counting statistics when the molecular UMI barcode is short. Some of the recent UMI protocols (e.g. Muraro, Cell Systems 2016) use random UMI barcodes only 4 bases long to tag transcripts. This gives only 256 possible UMIs, so different transcripts will often have the same UMI barcode. The expected number of transcripts can thus be inferred from the number of observed UMIs with Poisson counting statistics. For example, if we see all 256 possible UMIs, we expect at least 1597 transcripts. I have data of this kind at the moment and I am wondering if I can use the bc_bioc pipeline for it. Thanks!

Alexander

PS: If this is not part of the pipeline, I am not sure if it is worth adding it, as future protocols will hopefully have longer barcodes anyways.
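The correction described in this issue can be sketched as follows, assuming each transcript draws one of K possible UMIs uniformly at random, so that observing k distinct UMIs suggests about -K * ln(1 - k / K) molecules. Capping k at K - 0.5 is an assumption made here so the estimate stays finite at saturation; with K = 256 it reproduces the figure of roughly 1597 quoted above. The function name is hypothetical and this is not part of umis.

```python
import math


def estimate_molecules(observed_umis, possible_umis):
    """Estimate the number of molecules from the number of distinct UMIs seen."""
    if not 0 <= observed_umis <= possible_umis:
        raise ValueError("observed_umis must lie between 0 and possible_umis")
    # Avoid log(0) when every possible UMI has been observed.
    k = min(observed_umis, possible_umis - 0.5)
    return -possible_umis * math.log(1.0 - k / possible_umis)


# 4-base UMIs give 4 ** 4 = 256 possible sequences.
print(round(estimate_molecules(256, 4 ** 4)))
```

For small k the estimate stays close to k itself, so the correction only matters once a noticeable fraction of the UMI space has been used.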

Set up travis

Hi @vals,

Do you think you could turn on Travis for this repo, or make me a co-owner so I can do it? It would be nice to run the tests so we can make sure everything is working before cutting releases and stuff.

Hope you are holding up well in the lockdown, hope your new gig is going well!

Smart-Seq2 data

Hi,

I am currently dealing with a version of Smart-Seq2 data which is 150x150bp and has, on read 2, positions 1-8 as the cellular barcode and positions 9-16 as the UMI. For pseudoalignment with kallisto, I'm wondering if you have any thoughts/experience on aligning with just the first or both reads. On one hand, I'd imagine that having both reads would increase mappability, but then on the other hand they would ultimately be collapsed to the same UMI.

Thanks,
Sahin

is there any function to extract fastq for certain cells passing the filter?

Hi, I used umis cb_histogram to calculate the read count for each cell and then chose a count cutoff based on that. For the downstream analysis, such as pseudo-alignment and UMI counting, we only need to deal with reads from those cells. Is there any function for this based on the umis cb_histogram results? Thanks.
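One way to approach this outside of umis is to reduce the histogram to the barcodes that pass the cutoff and use that set downstream. The sketch below assumes the histogram is a two-column, tab-separated file of barcode and count; that layout and the function name are assumptions, so check them against the actual cb_histogram output.

```python
import csv
import io


def barcodes_above_cutoff(handle, cutoff):
    """Return the set of barcodes whose read count is at least `cutoff`."""
    passing = set()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        barcode, count = row[0], int(row[1])
        if count >= cutoff:
            passing.add(barcode)
    return passing


# Stand-in for an opened histogram file.
histogram = io.StringIO("GGTCCA\t5000\nAAAAAA\t12\nCCGTTA\t2500\n")
print(sorted(barcodes_above_cutoff(histogram, cutoff=2500)))
```

The resulting set can be used to keep only reads whose CELL_ field is in it, or written back out as a smaller histogram file.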

Transforms don't understand multiple UMIs

Regexes work with CB1 and CB2, but not with UMIs. As there are nowadays libraries with dual UMIs like IDT's, it might be worthwhile to support UMI1 and UMI2 in transforms.

So slow when cb_filter

Hi,

I am using umis to analyze our single-cell data from 10x Genomics. The fastqtransform step is OK, but the cb_filter step is extremely slow.

I have downloaded the full cell barcode list (around 737k barcodes) from the Cell Ranger package. Do you have any ideas for speeding it up (around 14M reads)? I already tried multiple cores, but it did not improve much.

Thanks in advance!

demultiplexing

Hi Valentine,

What do you think about adding an option to demultiplex the barcodes into separate files, named by the barcode? We could also pass along a file of allowed barcodes to match and filter out non-matching barcodes as we go. I don't want to muck up your repo with functionality you weren't intending though.

kallisto translation

The kallisto output is mad confusing, I wrote some code to get back to something folks are used to looking at, it reformats the big matrix to have the set of transcripts as the row names instead of the index into the equivalence class matrix by going through the FASTA file and renaming them here:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/rnaseq/kallisto.py#L75

That seems like it might go well here, what do you think Valentine?

Merging similar UMIs?

I was wondering if you had considered merging UMIs that might be erroneous copies, eg as outlined in this blog post.

The use of UMIs [...] would work perfectly if it were not for base-calling errors, which erroneously create sequencing reads with the same genomic coordinates and UMIs and that are identical for the base at which the error occurred.

If I understand the current code correctly, it considers two barcodes as separate UMIs if they differ even by one base. Would it be useful to merge reads into the same UMI if they are 'nearly' identical, e.g. based on Hamming distance?
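A minimal sketch of the kind of merging asked about here, under simplifying assumptions: UMIs for one (cell, gene) pair are visited from most to least abundant, and each one is folded into the first already-kept UMI within the given Hamming distance. This greedy rule and the function names are illustrative only; they are neither the method from the linked blog post nor existing umis behaviour.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def merge_umis(umi_counts, max_distance=1):
    """Greedily merge low-count UMIs into similar, more abundant ones."""
    merged = {}
    # Most abundant first, so likely errors fold into their likely source.
    ordered = sorted(umi_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for umi, count in ordered:
        for parent in merged:
            if hamming(umi, parent) <= max_distance:
                merged[parent] += count
                break
        else:
            merged[umi] = count
    return merged


print(merge_umis({"AAAA": 10, "AAAT": 1, "CCCC": 5}))
```

With short UMIs, two genuinely different molecules can easily sit one mismatch apart, so any such merging trades removed sequencing errors against lost real molecules.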

Put in genemap format check when mis-formatted

At the moment a hard to understand error is all that shows if e.g. the number of columns in the genemap is wrong.
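A check along these lines could look like the sketch below. It assumes the genemap is a tab-delimited file with exactly two columns, transcript ID then gene name, as in the example genemap shown in a later issue on this page; the function name and error wording are hypothetical.

```python
import io


def read_genemap(handle):
    """Read a transcript-to-gene map, failing early on malformed lines."""
    mapping = {}
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                "genemap line %d has %d tab-separated column(s), expected 2: %r"
                % (lineno, len(fields), line)
            )
        transcript, gene = fields
        mapping[transcript] = gene
    return mapping


genemap = io.StringIO("ENST00000456328\tDDX11L1\nENST00000488147\tWASH7P\n")
print(read_genemap(genemap))
```

Reporting the line number and the offending line makes a space-delimited or three-column file easy to spot before any counting starts.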

error in umis cb_filter

I used umis fastqtransform to format the reads, and that worked.
The formatted reads are:

But when I use the umis cb_filter command to filter cell barcodes, I get an error:
command: umis cb_filter SRR1058003.fastq
error:

What happened?

barcode mismatch correction

Hi Valentine,

What do you think about sticking some code in there to optionally find barcodes 1 or 2 edit distances away from the known barcodes and correcting them? I think we can do up to two edit distances pretty efficiently and simply. I'm down for implementing it if that sounds ok to you.

Error running umis tagcount with 1.0.3 (working on 1.0.0)

Hello!

I had a script that was working when using umis version 1.0.0.

However, it now raises an error with version 1.0.3 when counting the tags.

The command is:
umis tagcount AACAGCT.bam AACAGCT.txt

The error I get (with 1.0.3):
Traceback (most recent call last):
  File "/services/tools/anaconda3/4.4.0/bin/umis", line 10, in <module>
    sys.exit(umis())
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/umis/umis.py", line 459, in tagcount
    from utils import weigh_evidence
ImportError: cannot import name 'weigh_evidence'

The version on the server is now 1.0.3 and I would like my script to work with the latest version.

Is there any bug you can fix and/or something I can do differently?

Thanks a lot,
Magali

ddseq SureCell

Hi,
Could you please provide a submission example of processing ddseq data with the umis fastqtransform command? I am not sure which flag to use for the barcodes.txt file:
https://github.com/vals/umis/tree/master/examples/SureCell

This is what I have so far:
umis fastqtransform transform.json K562_R1.fastq --fastq1out S2_out

Thank you

Error when running fasttagcount without cb_histogram

Hi,
I am currently testing umis for our 10XGenomics data. fasttagcount crashes when no cb_histogram is provided:
File "build/bdist.linux-x86_64/egg/umis/umis.py", line 829, in fasttagcount
AttributeError: 'NoneType' object has no attribute 'index'

There seems to be no check at that point whether cb_histogram was actually TRUE or whether cb_hist was not NONE.

Best, Thomas

valid cellular barcodes

What do you think about including valid cellular barcodes for the different chemistries in this repository? They can be a couple MB in size but it would be nice if everything you needed to run through 10x or whatever was all in one place instead of having to hunt around for the pieces.

MARS-Seq transform failing

As I'm still wrapping my head around the umis transform syntax, I thought I'd mention a MARS-Seq example that seems to fail with the current "transform.json" recipe. Any of the .fastq files extracted from this dataset's .sra files should replicate the issue.

tagcount with --genemap option

Hi,

I'm having a problem using umis tagcount with a genemap. I can run the following command without a genemap successfully:

umis tagcount pseudoalignments.bam sample1.gene.cb.counts.txt

But the addition of the --genemap option gives an error:

umis tagcount --genemap genemap.txt pseudoalignments.bam sample1.gene.cb.counts.txt

head: unrecognized option '--genemap'
INFO:umis.umis:Reading optional files
INFO:umis.umis:Tallying evidence
INFO:umis.umis:Processed 1000000 alignments, kept 791808.
INFO:umis.umis:208192 were filtered for being unmapped.
INFO:umis.umis:Processed 2000000 alignments, kept 1584028.
INFO:umis.umis:415972 were filtered for being unmapped.
INFO:umis.umis:Processed 3000000 alignments, kept 2379834.
INFO:umis.umis:620166 were filtered for being unmapped.
INFO:umis.umis:Processed 4000000 alignments, kept 3174835.
INFO:umis.umis:825165 were filtered for being unmapped.
INFO:umis.umis:Processed 5000000 alignments, kept 3969309.
INFO:umis.umis:1030691 were filtered for being unmapped.
INFO:umis.umis:Processed 6000000 alignments, kept 4760607.
INFO:umis.umis:1239393 were filtered for being unmapped.
INFO:umis.umis:Processed 7000000 alignments, kept 5552449.
INFO:umis.umis:1447551 were filtered for being unmapped.
INFO:umis.umis:Processed 8000000 alignments, kept 6349186.
INFO:umis.umis:1650814 were filtered for being unmapped.
INFO:umis.umis:Processed 9000000 alignments, kept 7143504.
INFO:umis.umis:1856496 were filtered for being unmapped.
INFO:umis.umis:Processed 10000000 alignments, kept 7937942.
INFO:umis.umis:2062058 were filtered for being unmapped.
INFO:umis.umis:Tally done - 1.07e+02s, 5,643,231 alns/min
INFO:umis.umis:Collapsing evidence
INFO:umis.umis:Writing evidence
Traceback (most recent call last):
  File "/usr/local/bin/python3.6/umis", line 11, in <module>
    load_entry_point('umis==1.0.3', 'console_scripts', 'umis')()
  File "/usr/lib/python3/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/umis/umis.py", line 653, in tagcount
    genes = expanded.ix[genes.index]
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 126, in __getitem__
    return self._getitem_axis(key, axis=axis)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1088, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1205, in _getitem_iterable
    raise_missing=False)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1161, in _get_listlike_indexer
    raise_missing=raise_missing)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1252, in _validate_read_indexer
    raise KeyError("{} not in index".format(not_found))
KeyError: "['***'] not in index"

The *** in the last line is a list of all of the gene names in my genemap file.

My genemap file looks like this (tab delimited):
ENST00000456328 DDX11L1
ENST00000450305 DDX11L1
ENST00000488147 WASH7P
ENST00000619216 MIR6859-1
ENST00000473358 RP11-34P13.3

And my bam file looks like this:
SRR8363321.42:CELL_CTCATTATCTTTACGT:UMI_ATAGGCTGTA 0 ENST00000379715 544 255 98M * 0 0 GTCTTCATCAAGAACAGACTATATACTAATTCCCACTAGAAGCTGTCCATGCCATACAGAAGATCTATTAAAAATGTTTTAAATGGAAAATGTACTCT AA<7<FJJJFJFFJJJ<-A-FJJJJA<<JFJJFFFFAFAJJ<FJJ7FJJJFJJF-<FFJFFFFAF-FJJFAJAJJJJFJJFJJJJFJ<-77FJJJF7A NH:i:1 ZW:f:1
SRR8363321.44:CELL_TCGGGACAGCCAGTAG:UMI_CTATTAGCCC 4 * 0 0 * * 0 0 TGCCTTGGCCTCCCAAAGTGTTGGGATTACAGGTGTGAGCCACCATGCCCGGCCAAGACATTTTATTACTAAGAGAATTGCAGTGTGCTATGAGGGTA AA<AFJF-FJJFJJJJJJAFFFAFAJJJJJJJJFJJJFJJJJJJJFJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJFJFJJJAJFFFFJJJJFJF<J
SRR8363321.45:CELL_GGGCACTGTGTCAATC:UMI_CTAACCTAAT 256 ENST00000323345 551 255 98M * 0 0 GTCGTAAAATGGGGGTCCCTTACTGCATTATCAAGGGAAAGGCAAGACTGGGACGTCTAGTCCACAGGAAGACCTGCACCACTGTCGCCTTCACACAG A7<AFJJJJJJJJ<F7A<JAFJFFJJJJJJJJJJJJJJJJJJJJJJJJJJJA<JJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJ NH:i:2 ZW:f:0.0285953

Thanks for the help,
maria

error when running umis fastqtransform

Got the following error when running SureCell example data:

umis fastqtransform ~/tmp/src/umis/examples/SureCell/transform.json ~/tmp/src/umis/examples/SureCell/K562_R1.fastq ~/tmp/src/umis/examples/SureCell/K562_R2.fastq
INFO:umis.umis:Detected triple cellular indexes.
INFO:umis.umis:Detected UMI.
Traceback (most recent call last):
  File "/users/xinli/.local/bin/umis", line 11, in <module>
    load_entry_point('umis', 'console_scripts', 'umis')()
  File "/users/xinli/anaconda2/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/users/xinli/anaconda2/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/users/xinli/anaconda2/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/users/xinli/anaconda2/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/users/xinli/anaconda2/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/users/xinli/tmp/src/umis/umis/umis.py", line 285, in fastqtransform
    sys.stdout.write(read_template.format(**read1_dict))
KeyError: 'CB'

cell barcode question

Hi,

Recently I tried to analyze scRNA-seq data produced by mcSCRB-seq. Because I already have a file of valid cell barcodes, I do not need to identify the cell barcodes in step 2. Is it possible for your tool to use the reference barcode file to extract the reads? I tried "umi_tools extract" with --whitelist=MY_REF_INDEX, but it still failed.
Could you give me some suggestions?

Thank you!

IOError: [Errno 28] No space left on device

Hi, I ran into a problem in the tagcount step when testing mixed mouse and human data (hg19_mm10). I've tried changing the elbow point and the running environment (40T available), but it didn't help. Here are my commands and the error details:

#tagcount
umis tagcount --cb_histogram selected-cb-histogram-2500.txt final_test.sam ./final_result_2500.txt

#error msg
INFO:umis.umis:Processed 1577000000 alignments, kept 940333728.
INFO:umis.umis:220970371 were filtered for being unmapped.
INFO:umis.umis:415695901 were filtered for not matching known barcodes.
INFO:umis.umis:Processed 1578000000 alignments, kept 941015260.
INFO:umis.umis:221101675 were filtered for being unmapped.
INFO:umis.umis:415883065 were filtered for not matching known barcodes.
INFO:umis.umis:Tally done - 9.72e+03s, 9,740,381 alns/min
INFO:umis.umis:Collapsing evidence
INFO:umis.umis:Writing evidence
Traceback (most recent call last):
  File "/mnt/data/txw/miniconda/envs/umisss/bin/umis", line 11, in <module>
    load_entry_point('umis==1.0.3', 'console_scripts', 'umis')()
  File "/mnt/data/txw/miniconda/envs/umisss/lib/python2.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/mnt/data/txw/miniconda/envs/umisss/lib/python2.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/mnt/data/txw/miniconda/envs/umisss/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/mnt/data/txw/miniconda/envs/umisss/lib/python2.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/mnt/data/txw/miniconda/envs/umisss/lib/python2.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/mnt/data/txw/miniconda/envs/umisss/lib/python2.7/site-packages/umis/umis.py", line 628, in tagcount
    out_handle.write(line)
IOError: [Errno 28] No space left on device

pip install not working

Hello,

I'm using umis on an administered server. The install does not work properly (see Closed Issue #55).

I know that the install can work with bioconda, but the admins do not want to use conda install, only pip.

Is there any way you can make the pip install work so we can have a working version on our server?

Thanks,
Magali

Fancier statistics for counting

Did you have something in mind about how to do something more 'correct' than weighting by number of hits? Something like the way Salmon does it at the end or something like that? Do you know how that works? I am not smart enough to understand it.

v1.0.8 yields KeyErrors when running test.sh

Running the test.sh file on the v1.0.8 version leads to KeyError in test11 and test14. Pertinent log output:

INFO:umis.umis:Transforming examples/STRT-Seq/dualindex_example_1.fastq.
INFO:umis.umis:Detected dual cellular indexes.
INFO:umis.umis:Detected dual UMI.
Traceback (most recent call last):
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/bin/umis", line 33, in <module>
    sys.exit(load_entry_point('umis', 'console_scripts', 'umis')())
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/mfansler/scratch/umis/umis/umis.py", line 299, in fastqtransform
    read1_dict['MB'] = read1_dict['MB1'] + read1_dict['MB2']
KeyError: 'MB1'

# ...

INFO:umis.umis:Transforming examples/Klein-inDrop/klein-v3_R1.fq.
INFO:umis.umis:Detected dual cellular indexes.
INFO:umis.umis:Detected dual UMI.
INFO:umis.umis:Detected sample.
Traceback (most recent call last):
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/bin/umis", line 33, in <module>
    sys.exit(load_entry_point('umis', 'console_scripts', 'umis')())
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/mfansler/scratch/umis/umis/umis.py", line 299, in fastqtransform
    read1_dict['MB'] = read1_dict['MB1'] + read1_dict['MB2']
KeyError: 'MB1'

# ...

Files tests/results/test11.fq and tests/correct/test11.fq differ
Files tests/results/test14.fq and tests/correct/test14.fq differ

Avoid expanding collapsed counts

Hi @roryk,

I don't know if you have noticed, but at some point during tagcounting, the RAM usage expands rapidly. This causes issues of e.g. cluster nodes quitting due to reaching RAM limits.

I haven't had time to profile this, but I suspect the culprit is this bit:

    evidence_query = 'evidence >= %f' % minevidence
    if positional:
        evidence_table.columns=['cell', 'gene', 'umi', 'pos', 'evidence']
        collapsed = evidence_table.query(evidence_query).groupby(['cell', 'gene'])['umi', 'pos'].size()

    else:
        evidence_table.columns=['cell', 'gene', 'umi', 'evidence']
        collapsed = evidence_table.query(evidence_query).groupby(['cell', 'gene'])['umi'].size()

    expanded = collapsed.unstack().T

The .unstack() here basically takes a sparse matrix and turns it into a dense matrix.

I think it makes more sense to just reformat the collapsed table to a COO sparse matrix, and save those values. At least when run using the --sparse flag.

Would you have any objection to this? Or do you know if the memory explosion is due to something else?
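The proposal can be sketched in plain Python as below: instead of unstacking, the collapsed (cell, gene) counts are written out as coordinate (COO) triplets, so memory scales with the number of non-zero entries. The dict input, the function name, and the genes-by-cells orientation (chosen to match collapsed.unstack().T) are assumptions for illustration; a real change would work on the pandas objects directly.

```python
def to_coo(collapsed):
    """Convert {(cell, gene): count} into COO triplets plus axis labels.

    Rows index genes and columns index cells, matching the orientation of
    `collapsed.unstack().T` without materialising the dense matrix.
    """
    genes = sorted({gene for _, gene in collapsed})
    cells = sorted({cell for cell, _ in collapsed})
    gene_index = {gene: i for i, gene in enumerate(genes)}
    cell_index = {cell: j for j, cell in enumerate(cells)}

    rows, cols, data = [], [], []
    for (cell, gene), count in sorted(collapsed.items()):
        rows.append(gene_index[gene])
        cols.append(cell_index[cell])
        data.append(count)
    return genes, cells, rows, cols, data


collapsed = {("c1", "g1"): 2, ("c2", "g2"): 1}
print(to_coo(collapsed))
```

The three parallel lists hold one entry per observed (cell, gene) pair, and the two label lists are enough to reconstruct the full matrix later if a dense view is needed.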

Can you include a License file for the umis repository, please?

Including a license file makes it clear if / how your code can be reused by others. Thanks a lot!

Ensure proper version

Hi @roryk!

Would it be possible to ensure that the 1.0.9 release and tag both point to the commit that fixed the versioning (#70)?

Thank you for being so responsive, we really appreciate it!

Incorrect version

The internal version used for output was not updated, so that umis version still reports 1.0.7.

umis/umis/umis.py

Line 27 in e6e9843

VERSION = "1.0.7"

tagcount error : AttributeError

I am right now using this command umis tagcount pseudoalignments.bam result.txt.
pseudoalignments.bam is output of kallisto.
However, I keep getting errors:

INFO:umis.umis:Reading optional files

INFO:umis.umis:Tallying evidence
Traceback (most recent call last):
  File "/home5/jyangbn/.conda/envs/py2/bin/umis", line 11, in <module>
    load_entry_point('umis==1.0.3', 'console_scripts', 'umis')()
  File "/home5/jyangbn/.conda/envs/py2/lib/python2.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home5/jyangbn/.conda/envs/py2/lib/python2.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home5/jyangbn/.conda/envs/py2/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home5/jyangbn/.conda/envs/py2/lib/python2.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home5/jyangbn/.conda/envs/py2/lib/python2.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home5/jyangbn/.conda/envs/py2/lib/python2.7/site-packages/umis/umis.py", line 583, in tagcount
    CB = match.group('CB')
AttributeError: 'NoneType' object has no attribute 'group'

I tried Python 3.6.7 and Python 2.7.15 with umis 1.0.3 (installed from conda); neither works. Could you please help me figure this out?

The FASTQ file is 20311_merged_transformed.fastq.gz from session E-MTAB-5480. I then ran kallisto quant -i reference/ercc.idx --pseudobam --single -l 180 -s 20 -o 20311_out E-MTAB-5480/20311_merged_transformed.fastq.gz to get the BAM file pseudoalignments.bam.
Because I already have the cell barcodes I need to keep, I start with umis tagcount instead of filtering. I am trying to extract the original molecular information (read counts), not UMI counts, for 10X data. This is my whole procedure.

Thanks for any information you may provide.

Multi-maps

Hello,

I'm testing the pipeline and I was wondering how multi-mapping reads are handled.

I'm currently using Salmon SAM output and RapMap, where a large fraction of the reads are non-uniquely assigned (which is expected).

Does the pipeline count only uniquely aligned reads, or will it count a multi-mapping read if the UMI for the region in question is unique? Will the sum of all counts be larger than the input?

Thanks!

Changing Fastq header format

I was speaking to Nuno at the Expression Atlas, who said our format for FASTQ headers is not compatible with the CASAVA standard.

I think it would make sense to change the header format to one similar to that described here: https://github.com/nunofonseca/fastq_utils

The biggest difference is that the original read name is kept at the end rather than the beginning, which makes the read follow the CASAVA standard and also makes some optimisations using the htslib API possible when parsing the header.

This would also allow us to use the fastq_utils script as an optional faster way to transform fastq files in protocols with simpler read topologies not requiring regular expressions (which are the majority of them).