umis's Introduction

umis

umis provides tools for estimating expression from RNA-Seq data generated by protocols that sequence the end tags of transcripts and incorporate molecular tags to correct for amplification bias.

There are four steps in this process.

  1. Formatting reads
  2. Filtering noisy cellular barcodes
  3. Pseudo-mapping to cDNAs
  4. Counting molecular identifiers

1. Formatting reads

For mapping, we want to strip all non-biological segments from the sequenced reads while keeping that information for later use. The non-biological segments considered here are the Cellular Barcode (CB) and the Molecular Barcode (MB). So that the optional CB and the MB can be extracted later, they are placed in the read header in the following format.

@HWI-ST808:130:H0B8YADXX:1:1101:2088:2222:CELL_GGTCCA:UMI_CCCT
AGGAAGATGGAGGAGAGAAGGCGGTGAAAGAGACCTGTAAAAAGCCACCGN
+
@@@DDBD>=AFCF+<CAFHDECII:DGGGHGIGGIIIEHGIIIGIIDHII#

The umis fastqtransform command transforms a read (or a pair of reads) into this format based on a transform file. The transform file is a JSON file containing a Python-flavored regular expression for each read, written to extract the necessary components of the reads.
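
As a rough illustration of how such a transform can work, the sketch below applies one named-group regular expression per read and writes the barcodes into the header. The group names (`name`, `CB`, `MB`, `seq`, `qual`), the barcode lengths, and the `transform_pair` helper are assumptions made for this example, not the definitive umis transform syntax.

```python
import re

# Hypothetical transform: read 1 carries a 6-base cell barcode followed by a
# 4-base UMI; read 2 carries the biological sequence. Lengths are made up.
TRANSFORM = {
    "read1": r"(?P<name>@[^\s]+).*\n(?P<CB>.{6})(?P<MB>.{4}).*\n\+.*\n.*\n",
    "read2": r"@.*\n(?P<seq>.*)\n\+.*\n(?P<qual>.*)\n",
}


def transform_pair(read1, read2, transform=TRANSFORM):
    """Return one FASTQ record with the CB and UMI moved into the header."""
    m1 = re.match(transform["read1"], read1)
    m2 = re.match(transform["read2"], read2)
    if m1 is None or m2 is None:
        raise ValueError("read does not match the transform")
    fields = {**m1.groupdict(), **m2.groupdict()}
    return "{name}:CELL_{CB}:UMI_{MB}\n{seq}\n+\n{qual}\n".format(**fields)


read1 = "@HWI-ST808:130:H0B8YADXX:1:1101:2088:2222 1:N:0\nGGTCCACCCTTTTT\n+\nIIIIIIIIIIIIII\n"
read2 = "@HWI-ST808:130:H0B8YADXX:1:1101:2088:2222 2:N:0\nAGGAAGATGG\n+\n@@@DDBD>=A\n"
record = transform_pair(read1, read2)
print(record, end="")
```

A read that does not match its expression raises an error here; a real tool might instead skip or log such reads.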

2. Filtering noisy cellular barcodes

Not all cellular barcodes identified in the transformation will be real. Some will be low-abundance barcodes that do not represent an actual cell. Others will be barcodes that do not come from the set of known barcodes. The umis cb_filter command can be used to filter a transformed FASTQ file, dropping unknown barcodes. The --nedit option can be supplied to correct barcodes that are within the given edit distance of a known barcode. After barcode filtering, the umis cb_histogram command generates a file of counts for each cellular barcode. This file can be used to choose a count cut-off that selects high-abundance barcodes for downstream quantitation.
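
One way to picture the filtering and correction step is shown below. This is a minimal sketch, not the cb_filter implementation: it only considers substitutions (Hamming distance 1), whereas an edit distance may also cover insertions and deletions, and all function names are hypothetical.

```python
from collections import Counter


def one_mismatch_neighbors(barcode, alphabet="ACGTN"):
    """Yield every sequence that differs from barcode at exactly one position."""
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]


def correct_barcode(barcode, known):
    """Return the matching known barcode, or None if the read should be dropped."""
    if barcode in known:
        return barcode
    hits = {n for n in one_mismatch_neighbors(barcode) if n in known}
    # Only correct when the choice is unambiguous.
    return hits.pop() if len(hits) == 1 else None


def barcode_histogram(barcodes, known):
    """Count reads per corrected barcode, skipping reads that cannot be rescued."""
    corrected = (correct_barcode(b, known) for b in barcodes)
    return Counter(b for b in corrected if b is not None)


known = {"GGTCCA", "TTAGGC"}
observed = ["GGTCCA", "GGTCCT", "TTAGGC", "AAAAAA", "GGTCCA"]
histogram = barcode_histogram(observed, known)
print(histogram.most_common())
```

Enumerating the neighbors of each observed barcode and looking them up in a set avoids comparing every read against every known barcode, which matters when the known list is large.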

3. Pseudo-mapping to cDNAs

This is done with a pseudo-aligner, either Kallisto or RapMap. The SAM (or BAM) file output by these tools needs to be saved.

4. Counting molecular identifiers

The final step is to infer which cDNA was the origin of the tag a UMI was attached to. We use the pseudo-alignments to the cDNAs, and treat a tag assigned to a cDNA as partial evidence for a (cDNA, UMI) pairing. For the actual counting, we count only unique UMIs for (gene, UMI) pairings with sufficient evidence.
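
The counting idea can be sketched as follows. This is an illustration under stated assumptions rather than the tagcount implementation: each alignment of a read is assumed to contribute 1/n_hits evidence, and the `transcript_to_gene` map, the input tuple layout, and the `min_evidence` threshold of 1.0 are all illustrative.

```python
from collections import defaultdict


def count_umis(alignments, transcript_to_gene, min_evidence=1.0):
    """Count unique UMIs per (cell, gene) whose summed evidence passes a threshold.

    alignments: iterable of (cell, umi, transcript, n_hits) tuples.
    """
    evidence = defaultdict(float)
    for cell, umi, transcript, n_hits in alignments:
        gene = transcript_to_gene.get(transcript, transcript)
        # Assumption: a read hitting n_hits targets gives 1/n_hits to each one.
        evidence[(cell, gene, umi)] += 1.0 / n_hits

    counts = defaultdict(int)
    for (cell, gene, umi), weight in evidence.items():
        if weight >= min_evidence:
            counts[(cell, gene)] += 1  # each surviving UMI counts once
    return dict(counts)


alignments = [
    ("CELL1", "AAAA", "tx1", 1),  # unique hit
    ("CELL1", "AAAA", "tx1", 1),  # duplicate of the same molecule
    ("CELL1", "CCCC", "tx1", 2),  # multi-mapped between two genes
    ("CELL1", "CCCC", "tx3", 2),
    ("CELL1", "GGGG", "tx1", 2),  # both hits fall in the same gene
    ("CELL1", "GGGG", "tx2", 2),
]
gene_map = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
counts = count_umis(alignments, gene_map)
print(counts)
```

In this toy input the duplicated UMI is counted once, the read split between two genes is dropped for lack of evidence, and the read whose two hits land in the same gene is kept.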

To count, use the command umis tagcount. This requires a SAM or BAM file as input.

By default, the cell barcode and UMI sequences are extracted from the read name. Optionally, with the --parse_tags option, the CR and UM BAM tags are used to extract the cell barcode and UMI, respectively.
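
Extracting the barcodes from a read name in the header format shown earlier can be sketched with a single regular expression. The pattern and the `parse_read_name` helper are assumptions for illustration; the actual pattern used by umis may differ.

```python
import re

# Assumed pattern: an optional CELL_ tag followed by a UMI_ tag in the name.
NAME_PATTERN = re.compile(r"(?::CELL_(?P<CB>[ACGTN]+))?:UMI_(?P<MB>[ACGTN]+)")


def parse_read_name(name):
    """Return (cell_barcode, umi); cell_barcode is None when no CELL_ tag exists."""
    match = NAME_PATTERN.search(name)
    if match is None:
        # Untransformed reads carry no tags, so fail with a clear message.
        raise ValueError("no UMI tag in read name: %r" % name)
    return match.group("CB"), match.group("MB")


cell, umi = parse_read_name("SRR8363321.42:CELL_CTCATTATCTTTACGT:UMI_ATAGGCTGTA")
print(cell, umi)
```

Checking the match before calling `group` turns a read name without tags into an explicit error instead of an attribute error on `None`.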

The recommended workflow is to map reads to cDNA, in which case the target name in the BAM will be a transcript ID. If the BAM has been mapped to a genome (e.g. with STAR) tagcount can use the optional GX BAM tag to get the gene name. In this case, use the option --gene_tags.

kallisto

The quantitation used in umis handles reads that could come from multiple transcripts by assigning a fractional count to each transcript and then filtering for a minimum count at the end. Many single-cell analyses use something similar to this type of counting, but it has drawbacks (see this paper). For more principled UMI quantification, see Kallisto. Kallisto needs its input in a certain format: each cellular barcode has its own FASTQ file plus a file that lists the UMI for each read. The umis kallisto command can reformat your FASTQ files into that format.
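
The per-barcode layout described here can be pictured with the sketch below, which groups transformed records by cell barcode and keeps a parallel list of UMIs. It works in memory only; the function name and return layout are hypothetical, and the file names actually written by umis kallisto are not shown.

```python
import re
from collections import defaultdict

TAGS = re.compile(r":CELL_(?P<CB>[ACGTN]+):UMI_(?P<MB>[ACGTN]+)")


def split_by_cell(fastq_text):
    """Group 4-line FASTQ records by cell barcode.

    Returns {barcode: (fastq_records, umis)}, where umis[i] is the UMI of
    the i-th record in fastq_records for that barcode.
    """
    lines = fastq_text.splitlines()
    per_cell = defaultdict(lambda: ([], []))
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        match = TAGS.search(header)
        if match is None:
            continue  # skip reads without barcode tags
        records, umis = per_cell[match.group("CB")]
        records.append("\n".join([header, seq, plus, qual]) + "\n")
        umis.append(match.group("MB"))
    return dict(per_cell)


fastq = (
    "@r1:CELL_GGTCCA:UMI_CCCT\nAGGA\n+\nIIII\n"
    "@r2:CELL_TTAGGC:UMI_AAAA\nCCGT\n+\nIIII\n"
    "@r3:CELL_GGTCCA:UMI_GGGG\nTTTT\n+\nIIII\n"
)
per_cell = split_by_cell(fastq)
for barcode, (records, umis) in sorted(per_cell.items()):
    print(barcode, len(records), umis)
```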

umis's Issues

Adjusting for Poisson counting statistics

Hi,

I was wondering whether you adjust for Poisson counting statistics when the molecular UMI barcode is short. Some recent UMI protocols (e.g. Muraro, Cell Systems 2016) use random UMI barcodes only 4 bases long to tag transcripts. That gives only 256 possible UMIs, so different transcripts will often carry the same UMI barcode. The expected number of transcripts can therefore be inferred from the number of observed UMIs using Poisson counting statistics. For example, if we see all 256 possible UMIs, we expect at least 1597 transcripts. I have data of this kind at the moment and I am wondering if I can use the bc_bioc pipeline for it. Thanks!

Alexander

PS: If this is not part of the pipeline, I am not sure if it is worth adding it, as future protocols will hopefully have longer barcodes anyways.
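
For reference, the correction described in this post can be written down in a few lines. This is a sketch of the standard occupancy (Poisson) approximation, not something taken from umis; the function name and the choice to cap the observed count at K - 0.5 are assumptions. That cap is one choice that reproduces the figure of 1597 quoted above for a saturated 4-base UMI.

```python
import math


def estimate_molecules(observed_umis, umi_length=4, saturation_cap=0.5):
    """Estimate the molecule count from the number of distinct UMIs observed.

    With K = 4**umi_length equally likely UMIs and n molecules, the expected
    number of distinct UMIs is about K * (1 - exp(-n / K)); solving for n
    gives n = -K * ln(1 - k / K).
    """
    total = 4 ** umi_length
    if not 0 <= observed_umis <= total:
        raise ValueError("observed_umis must be between 0 and %d" % total)
    # The estimate diverges at k == K, so cap k slightly below K (assumption).
    k = min(observed_umis, total - saturation_cap)
    return -total * math.log(1.0 - k / total)


low = estimate_molecules(10)        # few collisions expected
saturated = estimate_molecules(256)  # every possible UMI was seen
print(round(low, 2), round(saturated))
```

At low occupancy the estimate stays close to the observed count, and it grows quickly as the observed count approaches the number of possible UMIs.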

Set up travis

Hi @vals,

Do you think you could turn on Travis for this repo, or make me a co-owner so I can do it? It would be nice to run the tests so we can make sure everything is working before cutting releases and stuff.

Hope you are holding up well in the lockdown, hope your new gig is going well!

Smart-Seq2 data

Hi,

I am currently dealing with a version of Smart-Seq2 data which is 150x150bp and has, on read 2, positions 1-8 as the cellular barcode and positions 9-16 as the UMI. For pseudoalignment with kallisto, I'm wondering if you have any thoughts/experience on aligning with just the first or both reads. On one hand, I'd imagine that having both reads would increase mappability, but then on the other hand they would ultimately be collapsed to the same UMI.

Thanks,
Sahin

is there any function to extract fastq for certain cells passing the filter?

Hi, I used umis cb_histogram to calculate the read count for each cell and then chose a count cutoff based on that. For downstream analysis, such as pseudo-alignment and UMI counting, we only need the reads from those cells. Is there a function that does this based on the umis cb_histogram results? Thanks.
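
What is being asked for could look roughly like the sketch below. It is not an existing umis function: the helper names are hypothetical, and the histogram is assumed to hold one `barcode<TAB>count` pair per line.

```python
import re

CELL_TAG = re.compile(r":CELL_(?P<CB>[ACGTN]+)")


def barcodes_above_cutoff(histogram_lines, cutoff):
    """Return barcodes whose read count is at least cutoff.

    Assumes each line is 'barcode<TAB>count'.
    """
    keep = set()
    for line in histogram_lines:
        if not line.strip():
            continue
        barcode, count = line.split()[:2]
        if int(count) >= cutoff:
            keep.add(barcode)
    return keep


def filter_fastq(fastq_lines, keep):
    """Yield the 4-line records whose CELL_ tag is in keep."""
    record = []
    for line in fastq_lines:
        record.append(line)
        if len(record) == 4:
            match = CELL_TAG.search(record[0])
            if match and match.group("CB") in keep:
                yield from record
            record = []


histogram = ["GGTCCA\t1200", "TTAGGC\t3", "CCGGAA\t950"]
fastq = [
    "@r1:CELL_GGTCCA:UMI_CCCT", "AGGA", "+", "IIII",
    "@r2:CELL_TTAGGC:UMI_AAAA", "CCGT", "+", "IIII",
]
keep = barcodes_above_cutoff(histogram, cutoff=500)
kept_lines = list(filter_fastq(fastq, keep))
print(kept_lines)
```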

Transforms don't understand multiple UMIs

Regexes work with CB1 and CB2, but not with UMIs. As there are nowadays libraries with dual UMIs like IDT's, it might be worthwhile to support UMI1 and UMI2 in transforms.

So slow when cb_filter

Hi,

I am using umis to analyze our single-cell data from 10x Genomics. The fastqtransform step is OK, but the cb_filter step is extremely slow.

I have downloaded the full cell barcode list (around 737k barcodes) from the Cell Ranger package. Do you have any ideas for speeding it up (around 14M reads)? I already tried multiple cores, but it did not improve much.

Thanks in advance!

demultiplexing

Hi Valentine,

What do you think about adding an option to demultiplex the barcodes into separate files, named by the barcode? We could also pass along a file of allowed barcodes to match and filter out non-matching barcodes as we go. I don't want to muck up your repo with functionality you weren't intending though.

Merging similar UMIs?

I was wondering if you had considered merging UMIs that might be erroneous copies, eg as outlined in this blog post.

The use of UMIs [...] would work perfectly if it were not for base-calling errors, which erroneously create sequencing reads with the same genomic coordinates and UMIs and that are identical for the base at which the error occurred.

If I understand the current code correctly, it considers two barcodes to be separate UMIs if they differ by even one base. Would it be useful to merge reads into the same UMI if they are 'nearly' identical, e.g. based on Hamming distance?
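
A simple version of this idea, sketched below, folds each UMI into a more abundant UMI at Hamming distance 1. It illustrates the general approach only; it is neither the method from the blog post nor code from umis, and the greedy strategy and function names are assumptions.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def merge_umis(umi_counts, max_distance=1):
    """Greedily fold low-count UMIs into a more abundant near-identical UMI."""
    merged = Counter()
    # Visit the most abundant UMIs first so they become the merge targets.
    for umi, count in sorted(umi_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        target = next(
            (kept for kept in merged
             if len(kept) == len(umi) and hamming(kept, umi) <= max_distance),
            umi,
        )
        merged[target] += count
    return merged


counts = {"ACGT": 50, "ACGA": 2, "TTTT": 7, "TTTA": 1, "GGCC": 3}
merged = merge_umis(counts)
print(dict(merged))
```

With very short UMIs a large share of distinct molecules differ by a single base anyway, so merging at distance 1 can undercount; the threshold is a trade-off rather than a fixed rule.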

error in umis cb_filter

I used umis fastqtransform to format the reads and it worked.
The formatted reads look like this:
(screenshot not preserved)
But when I use the umis cb_filter command to filter the cell barcodes, I get an error.
Command: umis cb_filter SRR1058003.fastq
Error:
(screenshot not preserved)
What happened?

barcode mismatch correction

Hi Valentine,

What do you think about sticking some code in there to optionally find barcodes 1 or 2 edit distances away from the known barcodes and correcting them? I think we can do up to two edit distances pretty efficiently and simply. I'm down for implementing it if that sounds ok to you.

Error running umis tagcount with 1.0.3 (working on 1.0.0)

Hello!

I had a script that was working when using umis version 1.0.0.

However, it now raises an error with version 1.0.3 when counting the tags.

The command is:
umis tagcount AACAGCT.bam AACAGCT.txt

The error I get (with 1.0.3):
Traceback (most recent call last):
  File "/services/tools/anaconda3/4.4.0/bin/umis", line 10, in <module>
    sys.exit(umis())
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/umis/umis.py", line 459, in tagcount
    from utils import weigh_evidence
ImportError: cannot import name 'weigh_evidence'

The version on the server is now 1.0.3 and I would like my script to work with the latest version.

Is there any bug you can fix and/or something I can do differently?

Thanks a lot,
Magali

Error when running fasttagcount without cb_histogram

Hi,
I am currently testing umis for our 10XGenomics data. fasttagcount crashes when no cb_histogram is provided:
File "build/bdist.linux-x86_64/egg/umis/umis.py", line 829, in fasttagcount
AttributeError: 'NoneType' object has no attribute 'index'

There seems to be no check at that point whether cb_histogram was actually TRUE or whether cb_hist was not NONE.

Best, Thomas

valid cellular barcodes

What do you think about including valid cellular barcodes for the different chemistries in this repository? They can be a couple MB in size but it would be nice if everything you needed to run through 10x or whatever was all in one place instead of having to hunt around for the pieces.

MARS-Seq transform failing

As I'm still wrapping my head around umi's transform syntax, I thought I'd mention a MARS-Seq example that seems to fail with the current "transform.json" recipe. Any of the .fastq files extracted from this dataset's .sra files should replicate the issue.

tagcount with --genemap option

Hi,

I'm having a problem using umis tagcount with a genemap. I can run the following command without a gene map successfully:

umis tagcount pseudoalignments.bam sample1.gene.cb.counts.txt

But the addition of the --genemap option gives an error:

umis tagcount --genemap genemap.txt pseudoalignments.bam sample1.gene.cb.counts.txt

head: unrecognized option '--genemap'
INFO:umis.umis:Reading optional files
INFO:umis.umis:Tallying evidence
INFO:umis.umis:Processed 1000000 alignments, kept 791808.
INFO:umis.umis:208192 were filtered for being unmapped.
INFO:umis.umis:Processed 2000000 alignments, kept 1584028.
INFO:umis.umis:415972 were filtered for being unmapped.
INFO:umis.umis:Processed 3000000 alignments, kept 2379834.
INFO:umis.umis:620166 were filtered for being unmapped.
INFO:umis.umis:Processed 4000000 alignments, kept 3174835.
INFO:umis.umis:825165 were filtered for being unmapped.
INFO:umis.umis:Processed 5000000 alignments, kept 3969309.
INFO:umis.umis:1030691 were filtered for being unmapped.
INFO:umis.umis:Processed 6000000 alignments, kept 4760607.
INFO:umis.umis:1239393 were filtered for being unmapped.
INFO:umis.umis:Processed 7000000 alignments, kept 5552449.
INFO:umis.umis:1447551 were filtered for being unmapped.
INFO:umis.umis:Processed 8000000 alignments, kept 6349186.
INFO:umis.umis:1650814 were filtered for being unmapped.
INFO:umis.umis:Processed 9000000 alignments, kept 7143504.
INFO:umis.umis:1856496 were filtered for being unmapped.
INFO:umis.umis:Processed 10000000 alignments, kept 7937942.
INFO:umis.umis:2062058 were filtered for being unmapped.
INFO:umis.umis:Tally done - 1.07e+02s, 5,643,231 alns/min
INFO:umis.umis:Collapsing evidence
INFO:umis.umis:Writing evidence
Traceback (most recent call last):
  File "/usr/local/bin/python3.6/umis", line 11, in <module>
    load_entry_point('umis==1.0.3', 'console_scripts', 'umis')()
  File "/usr/lib/python3/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/umis/umis.py", line 653, in tagcount
    genes = expanded.ix[genes.index]
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 126, in __getitem__
    return self._getitem_axis(key, axis=axis)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1088, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1205, in _getitem_iterable
    raise_missing=False)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1161, in _get_listlike_indexer
    raise_missing=raise_missing)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1252, in _validate_read_indexer
    raise KeyError("{} not in index".format(not_found))
KeyError: "['***'] not in index"

The *** in the last line is a list of all of the gene names in my genemap file.

My genemap file looks like this (tab delimited):
ENST00000456328 DDX11L1
ENST00000450305 DDX11L1
ENST00000488147 WASH7P
ENST00000619216 MIR6859-1
ENST00000473358 RP11-34P13.3

And my bam file looks like this:
SRR8363321.42:CELL_CTCATTATCTTTACGT:UMI_ATAGGCTGTA 0 ENST00000379715 544 255 98M * 0 0 GTCTTCATCAAGAACAGACTATATACTAATTCCCACTAGAAGCTGTCCATGCCATACAGAAGATCTATTAAAAATGTTTTAAATGGAAAATGTACTCT AA<7<FJJJFJFFJJJ<-A-FJJJJA<<JFJJFFFFAFAJJ<FJJ7FJJJFJJF-<FFJFFFFAF-FJJFAJAJJJJFJJFJJJJFJ<-77FJJJF7A NH:i:1 ZW:f:1
SRR8363321.44:CELL_TCGGGACAGCCAGTAG:UMI_CTATTAGCCC 4 * 0 0 * * 0 0 TGCCTTGGCCTCCCAAAGTGTTGGGATTACAGGTGTGAGCCACCATGCCCGGCCAAGACATTTTATTACTAAGAGAATTGCAGTGTGCTATGAGGGTA AA<AFJF-FJJFJJJJJJAFFFAFAJJJJJJJJFJJJFJJJJJJJFJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJFJFJJJAJFFFFJJJJFJF<J
SRR8363321.45:CELL_GGGCACTGTGTCAATC:UMI_CTAACCTAAT 256 ENST00000323345 551 255 98M * 0 0 GTCGTAAAATGGGGGTCCCTTACTGCATTATCAAGGGAAAGGCAAGACTGGGACGTCTAGTCCACAGGAAGACCTGCACCACTGTCGCCTTCACACAG A7<AFJJJJJJJJ<F7A<JAFJFFJJJJJJJJJJJJJJJJJJJJJJJJJJJA<JJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJ NH:i:2 ZW:f:0.0285953

Thanks for the help,
maria

error when running umis fastqtransform

Got the following error when running SureCell example data:

umis fastqtransform ~/tmp/src/umis/examples/SureCell/transform.json ~/tmp/src/umis/examples/SureCell/K562_R1.fastq ~/tmp/src/umis/examples/SureCell/K562_R2.fastq
INFO:umis.umis:Detected triple cellular indexes.
INFO:umis.umis:Detected UMI.
Traceback (most recent call last):
  File "/users/xinli/.local/bin/umis", line 11, in <module>
    load_entry_point('umis', 'console_scripts', 'umis')()
  File "/users/xinli/anaconda2/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/users/xinli/anaconda2/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/users/xinli/anaconda2/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/users/xinli/anaconda2/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/users/xinli/anaconda2/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/users/xinli/tmp/src/umis/umis/umis.py", line 285, in fastqtransform
    sys.stdout.write(read_template.format(**read1_dict))
KeyError: 'CB'

cell barcode question

Hi,

Recently I tried to analyze scRNA-seq data produced by mcSCRB-seq. Because I already have a file of valid cell barcodes, I do not need to identify the cell barcodes in step 2. Is it possible for your tool to use the reference barcode file to extract the reads? I tried "umi_tools extract" with --whitelist=MY_REF_INDEX, but it still failed.
Could you give me some suggestions?

Thank you!

IOError: [Errno 28] No space left on device

Hi, I ran into a problem in the tagcount step when testing mixed mouse and human data (hg19_mm10). I tried changing the elbow point and the running environment (40T available), but it didn't help. Here are my commands and the error details:

#tagcount
umis tagcount --cb_histogram selected-cb-histogram-2500.txt final_test.sam ./final_result_2500.txt

#error msg
INFO:umis.umis:Processed 1577000000 alignments, kept 940333728.
INFO:umis.umis:220970371 were filtered for being unmapped.
INFO:umis.umis:415695901 were filtered for not matching known barcodes.
INFO:umis.umis:Processed 1578000000 alignments, kept 941015260.
INFO:umis.umis:221101675 were filtered for being unmapped.
INFO:umis.umis:415883065 were filtered for not matching known barcodes.
INFO:umis.umis:Tally done - 9.72e+03s, 9,740,381 alns/min
INFO:umis.umis:Collapsing evidence
INFO:umis.umis:Writing evidence
Traceback (most recent call last):
  File "/mnt/data/txw/miniconda/envs/umisss/bin/umis", line 11, in <module>
    load_entry_point('umis==1.0.3', 'console_scripts', 'umis')()
  File "/mnt/data/txw/miniconda/envs/umisss/lib/python2.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/mnt/data/txw/miniconda/envs/umisss/lib/python2.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/mnt/data/txw/miniconda/envs/umisss/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/mnt/data/txw/miniconda/envs/umisss/lib/python2.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/mnt/data/txw/miniconda/envs/umisss/lib/python2.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/mnt/data/txw/miniconda/envs/umisss/lib/python2.7/site-packages/umis/umis.py", line 628, in tagcount
    out_handle.write(line)
IOError: [Errno 28] No space left on device

pip install not working

Hello,

I'm using umis on an administered server. The install does not work properly (see Closed Issue #55).

I know that the install can work with bioconda, but the admins do not want to use conda install, only pip.

Is there any way you can make the pip install work so we can have a working version on our server?

Thanks,
Magali

Fancier statistics for counting

Did you have something in mind about how to do something more 'correct' than weighting by number of hits? Something like the way Salmon does it at the end or something like that? Do you know how that works? I am not smart enough to understand it.

v1.0.8 yields KeyErrors when running test.sh

Running the test.sh file on the v1.0.8 version leads to KeyError in test11 and test14. Pertinent log output:

INFO:umis.umis:Transforming examples/STRT-Seq/dualindex_example_1.fastq.
INFO:umis.umis:Detected dual cellular indexes.
INFO:umis.umis:Detected dual UMI.
Traceback (most recent call last):
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/bin/umis", line 33, in <module>
    sys.exit(load_entry_point('umis', 'console_scripts', 'umis')())
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/mfansler/scratch/umis/umis/umis.py", line 299, in fastqtransform
    read1_dict['MB'] = read1_dict['MB1'] + read1_dict['MB2']
KeyError: 'MB1'

# ...

INFO:umis.umis:Transforming examples/Klein-inDrop/klein-v3_R1.fq.
INFO:umis.umis:Detected dual cellular indexes.
INFO:umis.umis:Detected dual UMI.
INFO:umis.umis:Detected sample.
Traceback (most recent call last):
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/bin/umis", line 33, in <module>
    sys.exit(load_entry_point('umis', 'console_scripts', 'umis')())
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/mfansler/mambaforge-arm64/envs/test-umis/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/mfansler/scratch/umis/umis/umis.py", line 299, in fastqtransform
    read1_dict['MB'] = read1_dict['MB1'] + read1_dict['MB2']
KeyError: 'MB1'

# ...

Files tests/results/test11.fq and tests/correct/test11.fq differ
Files tests/results/test14.fq and tests/correct/test14.fq differ

Avoid expanding collapsed counts

Hi @roryk,

I don't know if you have noticed, but at some point during tagcounting, the RAM usage expands rapidly. This causes issues of e.g. cluster nodes quitting due to reaching RAM limits.

I haven't had time to profile this, but I suspect the culprit is this bit:

    evidence_query = 'evidence >= %f' % minevidence
    if positional:
        evidence_table.columns=['cell', 'gene', 'umi', 'pos', 'evidence']
        collapsed = evidence_table.query(evidence_query).groupby(['cell', 'gene'])['umi', 'pos'].size()

    else:
        evidence_table.columns=['cell', 'gene', 'umi', 'evidence']
        collapsed = evidence_table.query(evidence_query).groupby(['cell', 'gene'])['umi'].size()

    expanded = collapsed.unstack().T

The .unstack() here basically takes a sparse matrix and turns it into a dense matrix.

I think it makes more sense to just reformat the collapsed table to a COO sparse matrix, and save those values. At least when run using the --sparse flag.

Would you have any objection to this? Or do you know if the memory explosion is due to something else?
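
To make the suggestion concrete, here is a minimal sketch of emitting the collapsed counts as COO triplets instead of unstacking them. It uses plain Python rather than pandas, and the function name and the input shape (a mapping from (cell, gene) to a UMI count) are assumptions for illustration.

```python
def to_coo(collapsed):
    """Convert {(cell, gene): count} into COO triplets plus index labels.

    Returns (genes, cells, rows, cols, values) where values[i] is the count
    for gene genes[rows[i]] in cell cells[cols[i]]. Zero entries are never
    materialised, so memory scales with the number of nonzero counts.
    """
    cells = sorted({cell for cell, _ in collapsed})
    genes = sorted({gene for _, gene in collapsed})
    cell_index = {cell: i for i, cell in enumerate(cells)}
    gene_index = {gene: i for i, gene in enumerate(genes)}

    rows, cols, values = [], [], []
    for (cell, gene), count in sorted(collapsed.items()):
        rows.append(gene_index[gene])
        cols.append(cell_index[cell])
        values.append(count)
    return genes, cells, rows, cols, values


collapsed = {("cellA", "gene1"): 3, ("cellA", "gene3"): 1, ("cellB", "gene2"): 5}
genes, cells, rows, cols, values = to_coo(collapsed)
print(genes, cells)
print(list(zip(rows, cols, values)))
```

The three lists have the layout accepted by `scipy.sparse.coo_matrix((values, (rows, cols)), shape=(len(genes), len(cells)))`, with genes as rows to match the orientation of the transposed dense table.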

Ensure proper version

Hi @roryk!

Would it be possible to ensure that the 1.0.9 release and tag both point to the commit that fixed the versioning (#70)?

Thank you for being so responsive, we really appreciate it!

tagcount error : AttributeError

I am right now using this command umis tagcount pseudoalignments.bam result.txt.
pseudoalignments.bam is output of kallisto.
However, I keep getting errors:

INFO:umis.umis:Reading optional files

INFO:umis.umis:Tallying evidence
Traceback (most recent call last):
  File "/home5/jyangbn/.conda/envs/py2/bin/umis", line 11, in <module>
    load_entry_point('umis==1.0.3', 'console_scripts', 'umis')()
  File "/home5/jyangbn/.conda/envs/py2/lib/python2.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home5/jyangbn/.conda/envs/py2/lib/python2.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home5/jyangbn/.conda/envs/py2/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home5/jyangbn/.conda/envs/py2/lib/python2.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home5/jyangbn/.conda/envs/py2/lib/python2.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home5/jyangbn/.conda/envs/py2/lib/python2.7/site-packages/umis/umis.py", line 583, in tagcount
    CB = match.group('CB')
AttributeError: 'NoneType' object has no attribute 'group'

I tried Python 3.6.7 and Python 2.7.15 with umis 1.0.3 (installed from conda); neither of them works. Could you please help me figure this out?

The fastq file is 20311_merged_transformed.fastq.gz from session E-MTAB-5480. I then ran kallisto quant -i reference/ercc.idx --pseudobam --single -l 180 -s 20 -o 20311_out E-MTAB-5480/20311_merged_transformed.fastq.gz to get the BAM file pseudoalignments.bam.
Because I already have the cell barcodes I need to keep, I start with umis tagcount instead of filtering. I am trying to extract the original molecular information (read count) rather than the UMI count for 10X data. This is my whole process.

Thanks for any information you may provide.

Multi-maps

Hello,

I'm testing the pipeline and I was wondering how multi-mapping reads are handled.

I'm currently using Salmon SAM output and RapMap, where a large fraction of the reads are non-uniquely assigned (which is expected).

Does the pipeline count only uniquely aligned reads, or will it count a multi-mapping read if the UMI for the region in question is unique? Will the sum of all counts be larger than the input?

Thanks!

Changing Fastq header format

I was speaking to Nuno at the Expression Atlas, who said our format for Fastq headers is not compatible with the CASAVA standard.

I think it would make sense to change the header format to one similar to that described here: https://github.com/nunofonseca/fastq_utils

The biggest difference is that keeping the original read name at the end rather than beginning will make the read follow the CASAVA standard, but also some optimisations using the htslib API would be possible when parsing the header.

This would also allow us to use the fastq_utils script as an optional faster way to transform fastq files in protocols with simpler read topologies not requiring regular expressions (which are the majority of them).
