cdna_cupcake's Issues

get_abundance_post_collapse.py seems not compliant with isoseq3 format

Running the isoseq3 suite gives me a "cluster_report.csv" file in this format:

cluster_id,read_id,read_type
transcript/11530,m54064_180112_100315/68092545/ccs,FL
transcript/11530,m54064_180112_100315/35979778/ccs,FL
transcript/11530,m54064_180112_100315/20448150/ccs,FL

and a "collapsed.group.txt" in this format:

PB.1.1 transcript/1968,transcript/2407,transcript/3021,transcript/3170,transcript/5415
PB.1.2 transcript/22838
PB.2.1 transcript/3489,transcript/3803

Now running get_abundance_post_collapse.py gives me this error message:

$ srun get_abundance_post_collapse.py hq_isoforms.fasta.no5merge.collapsed /home/jleple/work/Isoseq3_CO1_ccs/POLISH9999/cluster_report.csv
Traceback (most recent call last):
  File "/usr/local/bioinfo/src/Miniconda/Miniconda3-4.4.10/envs/cogent-v3.6_env/bin/get_abundance_post_collapse.py", line 4, in <module>
    __import__('pkg_resources').run_script('cupcake==5.3', 'get_abundance_post_collapse.py')
  File "/usr/local/bioinfo/src/Miniconda/Miniconda3-4.4.10/envs/cogent-v3.6_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 661, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/bioinfo/src/Miniconda/Miniconda3-4.4.10/envs/cogent-v3.6_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1441, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/bioinfo/src/Miniconda/Miniconda3-4.4.10/envs/cogent-v3.6_env/lib/python2.7/site-packages/cupcake-5.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 435, in <module>
    get_abundance_post_collapse(args.collapse_prefix, args.cluster_report, args.collapse_prefix)
  File "/usr/local/bioinfo/src/Miniconda/Miniconda3-4.4.10/envs/cogent-v3.6_env/lib/python2.7/site-packages/cupcake-5.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 422, in get_abundance_post_collapse
    output_read_count_IsoSeq_csv(cid_info, cluster_report_csv, output_prefix + '.read_stat.txt')
  File "/usr/local/bioinfo/src/Miniconda/Miniconda3-4.4.10/envs/cogent-v3.6_env/lib/python2.7/site-packages/cupcake-5.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 148, in output_read_count_IsoSeq_csv
    raise Exception, "cluster_id {0} is not a valid cluster ID!".format(cid)
Exception: cluster_id transcript/0 is not a valid cluster ID!
srun: error: node130: task 0: Exited with exit code 1
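In case it is useful, here is a minimal workaround sketch (my own code, not a cupcake script) for counting FL reads per collapsed isoform while simply skipping cluster_ids that the collapse step dropped — those are exactly the ones that trigger "is not a valid cluster ID!":

```python
# Hypothetical workaround, not part of cupcake: tally FL reads per PB id,
# ignoring cluster_report.csv rows whose cluster_id never made it into the
# collapsed .group.txt file.
from csv import DictReader

def load_member_map(group_lines):
    """'PB.1.1 <tab> transcript/1968,transcript/2407' -> member id -> PB id."""
    member_to_pbid = {}
    for line in group_lines:
        pbid, members = line.split()
        for m in members.split(','):
            member_to_pbid[m] = pbid
    return member_to_pbid

def count_fl_reads(report_lines, member_to_pbid):
    """Tally FL reads per PB id from cluster_report.csv lines."""
    counts = {}
    for row in DictReader(report_lines):
        pbid = member_to_pbid.get(row['cluster_id'])
        if pbid is not None and row['read_type'] == 'FL':
            counts[pbid] = counts.get(pbid, 0) + 1
    return counts
```

This does not explain *why* transcript/0 is missing from the group file, but it at least lets the counting finish.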

Do you have a solution for me?

Thank you very much in advance

Jean-Charles

duplicate isoforms in chain_samples.py output?

Hi,
I used the command 'chain_samples.py chain.cfg norm_nfl' to chain the Iso-Seq collapsed results from two samples, but when I check the "all_samples.chained.gff" file, I find duplicate transcripts (PB.1.1 and PB.1.2):
chr1 PacBio transcript 70 4943 . + . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio exon 70 4943 . + . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio transcript 70 4943 . + . gene_id "PB.1"; transcript_id "PB.1.2";
chr1 PacBio exon 70 4943 . + . gene_id "PB.1"; transcript_id "PB.1.2";
chr1 PacBio transcript 1143 3686 . + . gene_id "PB.1"; transcript_id "PB.1.3";
chr1 PacBio exon 1143 2231 . + . gene_id "PB.1"; transcript_id "PB.1.3";
chr1 PacBio exon 2292 3686 . + . gene_id "PB.1"; transcript_id "PB.1.3";
chr1 PacBio transcript 1215 6942 . + . gene_id "PB.1"; transcript_id "PB.1.4";
chr1 PacBio exon 1215 2219 . + . gene_id "PB.1"; transcript_id "PB.1.4";
chr1 PacBio exon 5669 6942 . + . gene_id "PB.1"; transcript_id "PB.1.4";
Here are the chained IDs in 'all_samples.chained_ids.txt':
superPBID sample-1 sample-2
PB.1.1 PB.1.1 PB.1.1
PB.1.2 PB.1.1 PB.2.1
PB.1.3 NA PB.2.2
PB.1.4 NA PB.2.3

Here is the collapsed.gff file for sample-1:
chr1 PacBio transcript 70 4943 . + . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio exon 70 4943 . + . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio transcript 1348 3693 . - . gene_id "PB.2"; transcript_id "PB.2.1";
chr1 PacBio exon 1348 3693 . - . gene_id "PB.2"; transcript_id "PB.2.1";
And head lines in the collapsed.gff file for sample-2:
chr1 PacBio transcript 70 1020 . + . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio exon 70 1020 . + . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio transcript 1033 4943 . + . gene_id "PB.2"; transcript_id "PB.2.1";
chr1 PacBio exon 1033 4943 . + . gene_id "PB.2"; transcript_id "PB.2.1";
chr1 PacBio transcript 1143 3686 . + . gene_id "PB.2"; transcript_id "PB.2.2";
chr1 PacBio exon 1143 2231 . + . gene_id "PB.2"; transcript_id "PB.2.2";
chr1 PacBio exon 2292 3686 . + . gene_id "PB.2"; transcript_id "PB.2.2";
chr1 PacBio transcript 1215 6942 . + . gene_id "PB.2"; transcript_id "PB.2.3";
chr1 PacBio exon 1215 2219 . + . gene_id "PB.2"; transcript_id "PB.2.3";
chr1 PacBio exon 5669 6942 . + . gene_id "PB.2"; transcript_id "PB.2.3";

I cannot figure out whether there is an error in my data or command preparation. Can anyone explain this?
Thanks a lot!
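A quick way to spot such duplicates (a sketch of my own, not a cupcake script) is to hash each transcript's exon chain and report transcript_ids that share one:

```python
# Sketch (not part of cupcake): find transcript_ids in a collapsed/chained GFF
# that have identical exon chains on the same chromosome and strand.
import re
from collections import defaultdict

def find_duplicate_isoforms(gff_lines):
    exons = defaultdict(list)  # transcript_id -> [(chrom, start, end, strand)]
    for line in gff_lines:
        f = line.rstrip('\n').split('\t')
        if len(f) < 9 or f[2] != 'exon':
            continue
        tid = re.search(r'transcript_id "([^"]+)"', f[8]).group(1)
        exons[tid].append((f[0], int(f[3]), int(f[4]), f[6]))
    by_chain = defaultdict(list)
    for tid, chain in exons.items():
        by_chain[tuple(chain)].append(tid)
    return [tids for tids in by_chain.values() if len(tids) > 1]
```

Run on the all_samples.chained.gff above, this would group PB.1.1 and PB.1.2 together.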

Trouble w/ install/run of Cupcake

Hi Liz,

I installed Cupcake via the anaCogent env and can't get it to work. To my knowledge, the error I observe (see screenshot) means I am trying to run it with a different Python than the one used to build it, but how is this possible? I use the anaCogent Python (the white areas are my user and server names, masked).
cupcake_install

Error: Unknown format code 'f' for object of type 'str'

This is the command I used:
collapse_isoforms_by_sam.py --input $FASTQ --fq -s $SORTED_SAM -o ${OUT_DIR}/25675.polished_high_qv_consensus_isoforms_collapsed

The program ran for a while and I got the following error:

Traceback (most recent call last):
  File "/home/shenl03/anaconda2/bin/collapse_isoforms_by_sam.py", line 4, in <module>
    __import__('pkg_resources').run_script('cupcake==3.7', 'collapse_isoforms_by_sam.py')
  File "/home/shenl03/anaconda2/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 744, in run_script
  File "/home/shenl03/anaconda2/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 1499, in run_script
  File "/data/home/shenl03/anaconda2/lib/python2.7/site-packages/cupcake-3.7-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 244, in <module>
    main(args)
  File "/data/home/shenl03/anaconda2/lib/python2.7/site-packages/cupcake-3.7-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 190, in main
    for recs in iter:
  File "/home/shenl03/anaconda2/lib/python2.7/site-packages/cupcake-3.7-py2.7-linux-x86_64.egg/cupcake/tofu/branch/branch_simple2.py", line 81, in iter_gmap_sam
    ignored_fout.write("{0}\tCoverage {1:.3f} too low.\n".format(r.qID, r.qCoverage))
ValueError: Unknown format code 'f' for object of type 'str'

Please offer a fix!

collapsed.gff file is not consistent with collapsed.rep.fa file

Hello,

I've used collapse_isoforms_by_sam.py to remove redundant transcripts and get the results.

However, I found that the content in collapsed.gff file is not consistent with collapsed.rep.fa file.

Take 'PB.10113.15' as an example. As you can see in the following, the exonic length of this transcript is 2091 nt, but the sequence in the fasta file is 2154 nt long.

How can I deal with the difference between the gff file and the fasta file?

The gene structure annotation for this transcript is:
22 PacBio transcript 36253133 36267525 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36253133 36253219 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36254937 36254999 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36257083 36257136 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36257319 36257407 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36261596 36261722 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36265151 36266135 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36266840 36267525 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
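Summing the exon lengths from the GFF above (coordinates are 1-based and inclusive, so each exon contributes end − start + 1) does give 2091:

```python
# Exon coordinates of PB.10113.15 as listed in the GFF above.
exons = [(36253133, 36253219), (36254937, 36254999), (36257083, 36257136),
         (36257319, 36257407), (36261596, 36261722), (36265151, 36266135),
         (36266840, 36267525)]
# GFF intervals are 1-based inclusive, so each exon is end - start + 1 long.
length = sum(e - s + 1 for s, e in exons)
print(length)  # 2091
```

So the 63 nt discrepancy is between this exonic total and the fasta sequence.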

>PB.10113.15
ATTATACAGACGCATAACTGGAGGTGGGATCCACACAGCTCAGAACAGCTGGATCTTGCT
CAGTCTCTGCCAGGGGAAGATTCCTTGGAGGAGGCCCTGCAGCGACATGGAGGGAGCTGC
TTTGCTGAGAGTCTCTGTCCTCTGCATCTGGATGAGTGCACTTTTCCTTGGTGTGGGAGT
GAGGGCAGAGGAAGCTGGAGCGAGGGTGCAACAAAACGTTCCAAGTGGGACAGATACTGG
AGATCCTCAAAGTAAGCCCCTCGGTGACTGGGCTGCTGGCACCATGGACCCAGAGAGCAG
TATCTTTATTGAGGATGCCATTAAGTATTTCAAGGAAAAAGTGAGCACACAGAATCTGCT
ACTCCTGCTGACTGATAATGAGGCCTGGAACGGATTCGTGGCTGCTGCTGAACTGCCCAG
GAATGAGGCAGATGAGCTCCGTAAAGCTCTGGACAACCTTGCAAGACAAATGATCATGAA
AGACAAAAACTGGCACGATAAAGGCCAGCAGTACAGAAACTGGTTTCTGAAAGAGTTTCC
TCGGTTGAAAAGTGAGCTTGAGGATAACATAAGAAGGCTCCGTGCCCTTGCAGATGGGGT
TCAGAAGGTCCACAAAGGCACCACCATCGCCAGTGTGGTGTCTGGCTCTCTCAGCATTTC
CTCTGGCATCCTGACCCTCGTCGGCATGGGTCTGGCACCCTTCACAGAGGGAGGCAGCCT
TGTACTCTTGGAACCTGGGATGGAGTTGGGAATCACAGCAGCTTTGACCGGGATTACCAG
CAGTACCATAGACTACGGAAAGAAGTGGTGGACACAAGCCCAAGCCCACGACCTGGTCAT
CAAAAGCCTTGACAAATTGAAGGAGGTGAAGGAGTTTTTGGGTGAGAACATATCCAACTT
TCTTTCCTTAGCTGGCAATACTTACCAACTCACACGAGGCATTGGGAAGGACATCCGTGC
CCTCAGACGAGCCAGAGCCAATCTTCAGTCAGTACCGCATGCCTCAGCCTCACGCCCCCG
GGTCACTGAGCCAATCTCAGCTGAAAGCGGTGAACAGGTGGAGAGAGTTAATGAACCCAG
CATCCTGGAAATGAGCAGAGGAGTCAAGCTCACGAATGTGGCCCCTGTAAGCTTCTTTCT
TGTGCTGGATGTAGTCTACCTCGTGTACGAATCAAAGCACTTACATGAGGGGGCAAAGTC
AGAGACAGCTGAGGAGCTGAAGAAGGTGGCTCAGGAGCTGGAGGAGAAGCTAAACATTCT
CAACAATAATTATAAGATTCTGCAGGCGGACCAAGAACTGTGACCACAGGGCAGGGCAGC
CACCAGGAGAGATATGCCTGGCAGGGGCCAGGACAAAATGCAAACTTTTTTTTTTTCTGA
GACAGAGTCTTGCACTGTCGCCAAGTGGAGCTTGCAGTGAGCCGAGATATCGCCACTGCA
CTCCAGCCTGGGTGACAGAGCGAGACTCCATCTCAAAAAAAAAAAAAAAAAGAATATATT
GACGGAAGAATAGAGAGGAGGCTTGAAGGAACCAGCAATGAGAAGGCCAGGAAAAGAAAG
AGCTGAAAATGGAGAAAGCCCAAGAGTTAGAACAGTTGGATACAGGAGAAGAAACAGCGG
CTCCACTACAGACCCAGCCCCAGGTTCAATGTCCTCCGAAGAATGAAGTCTTTCCCTGGT
GATGGTCCCCTGCCCTGTCTTTCCAGCATCCACTCTCCCTTGTCCTCCTGGGGGCATATC
TCAGTCAGGCAGCGGCTTCCTGATGATGGTCGTTGGGGTGGTTGTCATGTGATGGGTCCC
CTCCAGGTTACTAAAGGGTGCATGTCCCCTGCTTGAACACTGAAGGGCAGGTGGTGGGCC
ATGGCCATGGTCCCCAGCTGAGGAGCAGGTGTCCCTGAGAACCCAAACTTCCCAGAGAGT
ATGTGAGAACCAACCAATGAAAACAGTCCCATCGCTCTTACCCGGTAAGTAAACAGTCAG
AAAATTAGCATGAAAGCAGTTTAGCATTGGGAGGAAGCTCAGATCTCTAGAGCTGTCTTG
TCGCCGCCCAGGATTGACCTGTGTGTAAGTCCCAATAAACTCACCTACTCATCAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCC

Install Issue and Chain Samples Question

Addition of cupcake2.tofu2 dependency to setup.py without associated files.

How am I meant to execute the chain_samples.py script? I suspect it needs to be installed and it doesn't appear in the setup.py. If I call it directly it is having trouble resolving the module:

Traceback (most recent call last):
  File "cDNA_Cupcake/cupcake/tofu/counting/chain_samples.py", line 9, in <module>
    from cupcake.tofu.counting import combine_abundance_across_samples as sp
ImportError: No module named counting

differences in output for preCluster.py depending on batchsize

I have an ISOSeq dataset which contains 71k sequences.
before running Cogent, I want to run the preCluster step.
However, if I play with the --num_seqs_per_batch parameter, I get different outputs, and I do not fully understand why.
With the batch size at 5000 I get ~7000 clusters, with also around ~7000 orphan reads.
Setting the batch size to 500000, I get ~23k orphans reads and 11k clusters.
I can see from the partitioning csv file that the clusters with the latter setting are smaller than those from the small batch size.
The clusters might get divided more with the run_mash script afterwards, but the orphans are untouched as far as I understand.

My question: does this really matter, and if so, how should I determine the batch size in run_preCluster.py?

Invalid cluster_id for abundance script

Hello @Magdoll,
Hope you are well.

I'm currently experiencing some challenges obtaining count information using get_abundance_post_collapse.py script after Isoseq3 due to errors with clusterID:

get_abundance_post_collapse.py ./Collapse_Isoform/Q21.collapsed cluster_report.csv
Traceback (most recent call last):
  File "/usr/bin/get_abundance_post_collapse.py", line 5, in <module>
    pkg_resources.run_script('cupcake==5.2', 'get_abundance_post_collapse.py')
  File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 540, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 1455, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/lib64/python2.7/site-packages/cupcake-5.2-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 435, in <module>
    get_abundance_post_collapse(args.collapse_prefix, args.cluster_report, args.collapse_prefix)
  File "/usr/lib64/python2.7/site-packages/cupcake-5.2-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 422, in get_abundance_post_collapse
    output_read_count_IsoSeq_csv(cid_info, cluster_report_csv, output_prefix + '.read_stat.txt')
  File "/usr/lib64/python2.7/site-packages/cupcake-5.2-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 148, in output_read_count_IsoSeq_csv
    raise Exception, "cluster_id {0} is not a valid cluster ID!".format(cid)
Exception: cluster_id transcript/0 is not a valid cluster ID!

However, the cluster report appears to be correct as outlined in the Isoseq3 github page:
cluster_id,read_id,read_type
transcript/0,m54082_180607_173058/60883723/ccs,FL
transcript/0,m54082_180607_173058/6161326/ccs,FL
transcript/0,m54082_180607_173058/27460031/ccs,FL
transcript/1,m54082_180607_173058/24838457/ccs,FL

I also tried modifying the collapsed result and cluster_report.csv; however, given that there was no prefix and the cid needed formatting, I wasn't able to proceed further.

As a side note, I also had errors when generating the cluster report itself with isoseq3_make_cluster_report.py on the polished.bam file:
Traceback (most recent call last):
  File "/mnt/data1/software/PacBio/cDNA_Cupcake/post_isoseq_cluster/isoseq3_make_cluster_report.py", line 35, in <module>
    make_cluster_report_from_polished_bam(args.polished_bam)
  File "/mnt/data1/software/PacBio/cDNA_Cupcake/post_isoseq_cluster/isoseq3_make_cluster_report.py", line 9, in make_cluster_report_from_polished_bam
    f = pysam.Samfile(polished_bam)
  File "pysam/libcalignmentfile.pyx", line 444, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 664, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='r') - is it SAM/BAM format? Consider opening with check_sq=False
However, this was resolved when I edited the script to use "f = pysam.Samfile(polished_bam, check_sq=False)".

  • I am wondering whether this has been observed before, or whether it suggests that there are problems with my original polished.bam file.

Any help will be greatly appreciated. Thank you so much!

Issue with install of latest commit

I have been trying to install the most recent commit, but have been running into errors. Running setup.py gives me

running install
running bdist_egg
running egg_info
creating cupcake.egg-info
writing cupcake.egg-info/PKG-INFO
writing dependency_links to cupcake.egg-info/dependency_links.txt
writing requirements to cupcake.egg-info/requires.txt
writing top-level names to cupcake.egg-info/top_level.txt
writing manifest file 'cupcake.egg-info/SOURCES.txt'
error: package directory 'cupcake2' does not exist

Tracing this issue shows the addition of requirements for "'cupcake2', 'cupcake2.io', 'cupcake2.ice2'" in the newest commit, but I can't find evidence of this package in the rest of the repository. I'm assuming there are files missing?

Everything installs fine prior to your commit a few days back.

Wrong gene classification

Hi:
I have found a problem with wrong gene classification.

2
In the picture, PB.583 has two parts: PB.583.1–PB.583.4 and PB.583.5–PB.583.8. It is obviously two genes.
I think the reason for the wrong gene classification is that PB.584 links the two parts of PB.583.
There are many such situations here, like this:
default

run fusion_finder.py on collapsed fastq and sam found no fusion event

Hi @Magdoll ,

I ran fusion_finder.py on both collapsed and uncollapsed data:

  • with uncollapsed data, the script found 25 fusion events
  • with collapsed data, the script found no fusion event

I cannot figure out why; in my understanding, the results should be the same with collapsed or uncollapsed data as input.

Thanks a lot.

--max_fuzzy_junction=0

Hi,

is it okay to run collapse_isoforms_by_sam.py with --max_fuzzy_junction set to 0 (zero)?

Thanks
A

filter_by_count failed

The data were generated by isoseq3 in SMRT Link 6.0. The headers look like:
@multiple_samples_HQ_transcript/0

filter_by_count.py generated this error:

Traceback (most recent call last):
  File "/usr/local/bin/filter_by_count.py", line 4, in <module>
    __import__('pkg_resources').run_script('cupcake==6.1', 'filter_by_count.py')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1504, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/filter_by_count.py", line 106, in <module>
    filter_by_count(args.input_prefix, output_prefix, args.min_count, args.dun_use_group_count)
  File "/usr/local/lib/python2.7/dist-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/filter_by_count.py", line 40, in filter_by_count
    fl_count, p_count = tmp.split('p')
ValueError: need more than 1 value to unpack
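The failing line appears to assume the legacy Iso-Seq cluster naming, where each group member carries f/p fields encoding FL and non-FL read counts (something like c0/f2p30/1234); isoseq3-style names such as multiple_samples_HQ_transcript/0 have no such segment, so the split('p') unpack fails. A sketch of the assumed parse (my own reconstruction, not the exact cupcake code):

```python
def parse_legacy_member(name):
    """Parse a legacy-style member name like 'c0/f2p30/1234' into
    (cluster_id, fl_count, nfl_count, length). An isoseq3-style name such
    as 'multiple_samples_HQ_transcript/0' raises ValueError here, which is
    the unpack error seen in the traceback above."""
    cid, tmp, length = name.split('/')
    fl_count, p_count = tmp.split('p')  # the line that fails upstream
    return cid, int(fl_count[1:]), int(p_count), int(length)
```

If this is right, the fix would be either regenerating the input with the expected naming or teaching the script about the new header style.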

filter_away_subset.py error

For sample S1, filter_away_subset.py finished without a problem, but for the second sample (S2) I get this error:

Traceback (most recent call last):
  File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/bin/filter_away_subset.py", line 4, in <module>
    __import__('pkg_resources').run_script('cupcake==3.1', 'filter_away_subset.py')
  File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 744, in run_script
  File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 1499, in run_script
  File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/filter_away_subset.py", line 150, in <module>
    main()
  File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/filter_away_subset.py", line 141, in main
    r = d[k]
KeyError: 'PB.1.1'

This is the command I used to run the script (in the anaCogent environment):

filter_away_subset.py all_sizes.quivered_hq_MPE_S2.fastq.collapsed

The input and generated output files are available here:
https://www.dropbox.com/sh/z4wps7y6gvyg4w2/AACDiAk4X-SyyoeWYwTqL0i2a?dl=0

abundance metric

Hi @Magdoll ,

The abundance file gives 6 measurements. Which one do you recommend using as a gene expression metric in downstream analysis? Or should I apply another normalization such as TPM, RPKM, or FPKM?

# count_fl: Number of associated FL reads
# count_nfl: Number of associated FL + unique nFL reads
# count_nfl_amb: Number of associated FL + unique nFL + weighted ambiguous nFL reads
# norm_fl: count_fl / total number of FL reads
# norm_nfl: count_nfl / total number of FL + unique nFL reads
# norm_nfl_amb: count_nfl_amb / total number of all reads
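If a length-corrected value is wanted instead of the norm_* fractions above, a TPM-style normalization can be computed from count_fl and the isoform lengths — a generic sketch, not a cupcake feature:

```python
def tpm(counts, lengths):
    """TPM-style values from per-isoform read counts and lengths (bp):
    counts are first converted to per-base rates, then scaled so the
    values sum to one million."""
    rate = {k: counts[k] / lengths[k] for k in counts}
    total = sum(rate.values())
    return {k: rate[k] / total * 1e6 for k in rate}
```

Isoform lengths are in the collapsed rep fasta, counts in the abundance file.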

Thanks a lot.

[feature] add per-exon identity filter to collapse_isoforms_by_sam.py

Currently, collapse_isoforms_by_sam.py accepts a -i identity threshold, but that is the average identity over the whole alignment.

I want to add a per-exon identity threshold to filter out cases where low-quality subregions cause bad alignments; these should be thrown out.

chr1    hg38    exon    247416157   247416193   100 +   .   ID=c14289.mrna1.exon1;Name=c14289;Parent=c14289.mrna1;Target=c14289 1 37 +
chr1    hg38    exon    247418053   247419077   99  +   .   ID=c14289.mrna1.exon2;Name=c14289;Parent=c14289.mrna1;Target=c14289 38 1062 +
chr1    hg38    exon    247423230   247423349   100 +   .   ID=c14289.mrna1.exon3;Name=c14289;Parent=c14289.mrna1;Target=c14289 1063 1182 +
chr1    hg38    exon    247423847   247425599   99  +   .   ID=c14289.mrna1.exon4;Name=c14289;Parent=c14289.mrna1;Target=c14289 1183 2935 +
chr1    hg38    exon    247434103   247434273   100 +   .   ID=c14289.mrna1.exon5;Name=c14289;Parent=c14289.mrna1;Target=c14289 2936 3106 +
chr1    hg38    exon    247442199   247442204   100 +   .   ID=c14289.mrna1.exon6;Name=c14289;Parent=c14289.mrna1;Target=c14289 3108 3113 +
chr1    hg38    exon    247442230   247442237   100 +   .   ID=c14289.mrna1.exon7;Name=c14289;Parent=c14289.mrna1;Target=c14289 3134 3141 +
chr1    hg38    exon    247444016   247444092   74  +   .   ID=c14289.mrna1.exon8;Name=c14289;Parent=c14289.mrna1;Target=c14289 3150 3210 +
chr1    hg38    exon    247444749   247444821   87  +   .   ID=c14289.mrna1.exon9;Name=c14289;Parent=c14289.mrna1;Target=c14289 3211 3275 +
chr1    hg38    exon    247448405   247448815   100 +   .   ID=c14289.mrna1.exon10;Name=c14289;Parent=c14289.mrna1;Target=c14289 3276 3686 +

In the example above, most exons have high identity (99–100%), but two exons have very low identity. This is a case where the collapse script should discard the alignment.
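The proposed filter could be sketched like this, using the identity column from the alignment above (the 95% threshold is just an illustrative choice, not a decided default):

```python
def flag_low_identity_exons(exon_identities, min_exon_identity=95):
    """Return the 1-based indices of exons whose identity falls below the
    threshold; a non-empty result means the whole alignment should be
    discarded, even if the average identity is high."""
    return [i for i, ident in enumerate(exon_identities, start=1)
            if ident < min_exon_identity]

# Identity values of the 10 exons shown above:
identities = [100, 99, 100, 99, 100, 100, 100, 74, 87, 100]
```

Here exons 8 and 9 would be flagged, so the alignment is rejected despite the high overall average.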

--Liz

ONT reads

Hi Elizabeth
I have Nanopore direct RNA-seq reads and mapped them to the reference using minimap2. I wonder if fusion_finder.py can handle these mapped ONT reads to find gene fusions.
Thanks
Edison

scrub_sample_GFF_junctions errors while parsing bed from summarize step

Running

scrub_sample_GFF_junctions.py --scrubbed_junction_file "${summary_prefix}.junction.bed" "${scrub_config}" "${summary_prefix}.junction_detail.txt" "${scrub_prefix}"

results in the following error:
Traceback (most recent call last):
  File "/usr/local/bin/scrub_sample_GFF_junctions.py", line 5, in <module>
    pkg_resources.run_script('cupcake==6.1', 'scrub_sample_GFF_junctions.py')
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 528, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1394, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/scrub_sample_GFF_junctions.py", line 279, in <module>
    tree = read_scrubbed_junction_to_tree(output_filename)
  File "/usr/local/lib/python2.7/dist-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/scrub_sample_GFF_junctions.py", line 112, in read_scrubbed_junction_to_tree
    chrom, left, right, strand = line.strip().split('\t')
ValueError: need more than 1 value to unpack

Indeed, the BED file produced by the summary step has 6 columns. Here are the first few lines:

track name=junctions description="summary" useScore=1
chr1 1013575 1013984 summary 3 +
chr1 1803416 1803876 summary 1 +
chr1 2391957 2395784 summary 3 +
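Until the reader is fixed, a workaround sketch (my own, not cupcake code) that tolerates the 6-column BED would skip the track header and pull the strand from column 6:

```python
def read_summary_junction_bed(lines):
    """Yield (chrom, left, right, strand) tuples from a 6-column junction
    BED, skipping the 'track ...' header the summarize step writes.
    Uses split() so both tab- and space-delimited lines work."""
    for line in lines:
        if line.startswith('track') or not line.strip():
            continue
        f = line.split()
        # BED columns: chrom, start, end, name, score, strand
        yield f[0], int(f[1]), int(f[2]), f[5]
```

This matches the 4 values the scrub script tries to unpack, just sourced from the right columns.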

Chaining script improvements: 5 merge, output fasta/fastq

Improvements to chain_samples.py

Top priority:

  • allow 5' merge; that is, if one isoform in sample A is a 5' truncated form of another in sample B, allow them to be chained as equivalent and select the longer form. Prerequisite: both sample A and B must already have gone through collapse using the 5merge option, otherwise the results can be incorrect.

  • output fasta/fastq files along with the current output of chained GFF, ids txt, and count.

Medium priority:

  • allow iterative chaining via pickled object

file format wrong

  File "/cupcake/tofu/compare_junctions.py", line 21, in compare_junctions
    strand = r1.strand
AttributeError: GMAPSAMRecord instance has no attribute 'strand'

installing error for Cupcake ToFU

Hi @Magdoll, when I tried to install Cupcake ToFU,
many warnings were generated during the python setup.py build step.
2017-12-20 9 22 16
2017-12-20 9 22 30

More warnings saying "unused function" were generated during the python setup.py install step,
and it ended up like this:
2017-12-20 9 19 36

I am using Mac OS X and a new Anaconda environment with Python 2.7, created like this:
conda create -n cupcake Anaconda python=2
conda install biopython
source activate cupcake

Please help...

gff header present, causing problems with chain_samples.py

During the sanity check, chain_samples.py uses GFF.collapseGFFReader to process the gff. This function fails if there is a header.

  File "/home/wrowell/anaconda2/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/chain_samples.py", line 19, in sample_sanity_check
    ids2 = [r.seqid for r in GFF.collapseGFFReader(gff_filename)]
  File "/home/wrowell/anaconda2/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/cupcake/io/GFF.py", line 386, in next
    return self.read()
  File "/home/wrowell/anaconda2/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/cupcake/io/GFF.py", line 543, in read
    assert raw[2] == 'transcript'
IndexError: list index out of range

New PB gff files have the following header:

##gff-version 3
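A minimal workaround sketch (not cupcake code): strip comment/header lines before handing the file to the reader, so the strict parser never sees the "##gff-version 3" line:

```python
def strip_gff_header(lines):
    """Drop '#'-prefixed header lines (e.g. '##gff-version 3') and blanks,
    so a strict reader that expects the first record to be a 'transcript'
    line does not hit the IndexError shown above."""
    return [l for l in lines if l.strip() and not l.lstrip().startswith('#')]
```

Alternatively, GFF.collapseGFFReader itself could skip lines starting with '#'.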

feature request: integrating short read evidence with Iso-Seq for isoform/fusion detection

Artifacts are present in Iso-Seq data, often due to library preparation (template switching, PCR chimeras); they are difficult to remove because they arise at the cDNA level (not at the sequencing or bioinformatics level) and introduce false positives for both discovery of novel isoforms and fusion genes.

Short reads and other sequencing platforms can serve as independent evidence to reduce these artifacts. A simple check is to align short-read data (using --all options to allow multi-mapping) to the collapsed Iso-Seq data, and have a script that removes/filters all Iso-Seq data that is not supported continuously at each base with a minimum coverage T by short reads (note: short reads often have trouble covering the ends, so the focus here is on the middle section).

This process should consist of several steps:

  • Align short reads to the collapsed, high-quality Iso-Seq data using fast short-read aligners (bowtie2, bwa). Care needs to be taken to allow for more errors and multi-mapping, since the Iso-Seq data will contain many isoforms that may share the same exons and we want them all to map.

  • Create a coverage map per Iso-Seq transcript (what is the best format to show this?).

  • Apply user-defined filtering criteria (how much to ignore at the ends, minimum base coverage at which intervals, etc.).

  • Output the filtered/reduced set of Iso-Seq data and a summary report.
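The core filtering step above could be sketched as follows (hypothetical parameters: end_ignore bases trimmed from each end, min_cov minimum short-read depth):

```python
def supported_by_short_reads(coverage, min_cov=3, end_ignore=50):
    """True if every base of the transcript's middle section (ignoring
    end_ignore bases at each end, where short-read coverage is unreliable)
    has short-read depth >= min_cov. coverage is a per-base depth list
    along the collapsed Iso-Seq transcript."""
    middle = coverage[end_ignore:len(coverage) - end_ignore]
    return bool(middle) and all(c >= min_cov for c in middle)
```

Transcripts failing this test would be reported as unsupported and dropped from the final set.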

summarize_sample_GFF_junctions does not work without genome filename

Running summarize_sample_GFF_junctions without GENOME_FILENAME results in the following error:

Traceback (most recent call last):
  File "/usr/local/bin/summarize_sample_GFF_junctions.py", line 5, in <module>
    pkg_resources.run_script('cupcake==6.1', 'summarize_sample_GFF_junctions.py')
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 528, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1394, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/summarize_sample_GFF_junctions.py", line 179, in <module>
    print >> sys.stder, "No genome file given. Ignore."
AttributeError: 'module' object has no attribute 'stder'

Looks like a basic typo: sys.stder should be sys.stderr.

fusion_finder.py assertion

Hi,

First, I used minimap2 to get hq_isoforms.sam, then ran fusion_finder.py with:

fusion_finder.py \                                                                                
    --input $fastq --fq \
    -s $sam \
    --dun-merge-5-shorter \
    -o $out_fn \
    --cluster_report_csv $report

but I got this error:

combo seen: (58, 59)
combo seen: (9, 10)
combo seen: (29, 30)
combo seen: (9, 10)
combo seen: (4, 8)
combo seen: (18, 19)
Traceback (most recent call last):
  File "/home/chenlingxi/mnt/chenlingxi/lib/tools/anaconda2/bin/fusion_finder.py", line 395, in <module>
    min_dist_between_loci=args.min_dist_between_loci)
  File "/home/chenlingxi/mnt/chenlingxi/lib/tools/anaconda2/bin/fusion_finder.py", line 347, in fusion_main
    assert pbid1.split('.')[1] == pbid2.split('.')[1]
AssertionError

How can I fix this? Thanks.

error in chaining fusion transcripts

Hi, @Magdoll

I ran into an error while chaining fusion transcripts:
"""
ERROR! more than one possible candidate in match_fusion_record! DEBUG.
MATCHED: ['PBfusion.114', 'PBfusion.210']
"""

Information on the two involved transcripts is given below.
"""
1 PacBio transcript 161285514 161288653 . - . gene_id "PBfusion.114"; transcript_id "PBfusion.114.1";
1 PacBio exon 161285514 161288653 . - . gene_id "PBfusion.114"; transcript_id "PBfusion.114.1";
2 PacBio transcript 7852284 7853099 . - . gene_id "PBfusion.114"; transcript_id "PBfusion.114.2";
2 PacBio exon 7852284 7853099 . - . gene_id "PBfusion.114"; transcript_id "PBfusion.114.2";

1 PacBio transcript 161285514 161288251 . - . gene_id "PBfusion.210"; transcript_id "PBfusion.210.1";
1 PacBio exon 161285514 161288251 . - . gene_id "PBfusion.210"; transcript_id "PBfusion.210.1";
2 PacBio transcript 7852289 7853099 . - . gene_id "PBfusion.210"; transcript_id "PBfusion.210.2";
2 PacBio exon 7852289 7853099 . - . gene_id "PBfusion.210"; transcript_id "PBfusion.210.2";
"""
PBfusion.114 and PBfusion.210 are two different fusion transcripts while PBfusion.114.2 and PBfusion.210.2 are identical. Is this the cause of this issue? And how can we solve this? Thanks.

Error when running get_abundance_post_collapse.py

I am in the process of running through the cupcake/tofu pipeline (following cogent) to remove redundant transcripts (in the absence of genomes), and I've gotten hung up at the get_abundance_post_collapse.py step. The script I used is as follows:

get_abundance_post_collapse.py hq_isoforms.fasta.no5merge.collapsed cluster_report.csv
My job fails with the following output:

Traceback (most recent call last):
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/bin/get_abundance_post_collapse.py", line 4, in
import('pkg_resources').run_script('cupcake==5.9', 'get_abundance_post_collapse.py')
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/lib/python2.7/site-packages/pkg_resources/init.py", line 661, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/lib/python2.7/site-packages/pkg_resources/init.py", line 1441, in run_script
exec(code, namespace, namespace)
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/lib/python2.7/site-packages/cupcake-5.9-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 319, in <module>
get_abundance_post_collapse(args.collapse_prefix, args.cluster_report, args.collapse_prefix)
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/lib/python2.7/site-packages/cupcake-5.9-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 306, in get_abundance_post_collapse
output_read_count_IsoSeq_csv(cid_info, cluster_report_csv, output_prefix + '.read_stat.txt')
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/lib/python2.7/site-packages/cupcake-5.9-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 177, in output_read_count_IsoSeq_csv
id=x, len=get_roi_len(x), is_fl='Y', stat=stat, pbid=pbid))
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/lib/python2.7/site-packages/cupcake-5.9-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 50, in get_roi_len
s, e, junk = seqid.split('/')[2].split('_')
ValueError: need more than 1 value to unpack

I'm guessing this is likely something very simple that I've done incorrectly, but I seem to have exhausted my troubleshooting abilities here (I think). Please let me know if you've run into this before!

Thanks,
Ryan
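The traceback bottoms out in get_roi_len, which assumes legacy CCS read IDs of the form movie/zmw/start_end_CCS; IsoSeq3 cluster reports name reads movie/zmw/ccs instead, so the three-way split fails. A minimal sketch of the mismatch (get_roi_len_safe and its None fallback are hypothetical illustrations, not cupcake code):

```python
# Sketch of why get_roi_len fails on IsoSeq3-style read IDs.
# Legacy CCS IDs look like "movie/12345/0_1000_CCS"; IsoSeq3 emits
# "movie/12345/ccs", so unpacking three fields raises ValueError.

def get_roi_len_legacy(seqid):
    # mirrors the failing line: s, e, junk = seqid.split('/')[2].split('_')
    s, e, junk = seqid.split('/')[2].split('_')
    return abs(int(e) - int(s))

def get_roi_len_safe(seqid):
    # hypothetical guard: the read length is simply not encoded in
    # IsoSeq3-style IDs, so return None instead of crashing
    fields = seqid.split('/')[2].split('_')
    if len(fields) != 3:
        return None
    s, e, _ = fields
    return abs(int(e) - int(s))

print(get_roi_len_legacy("m54064/68092545/0_1000_CCS"))       # 1000
print(get_roi_len_safe("m54064_180112_100315/68092545/ccs"))  # None
```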

Unexpected Chain Samples Results

Running:
chain_samples.py CONFIG_FILE count_nfl
I am getting results that don't seem to be in line with the IGV image of the same regions. In particular, I'm seeing ~1000 vs ~10 count differentials between samples for some regions via chain_samples, when the IGV image shows about 160 reads in both samples (almost all full-length). Furthermore, while I would assume that the count_fl results would be based on the number of input reads, I only have about 14,000 input reads, yet the chain_samples.py columns sum to around 140,000.

These results seem unexpected to me, is this a bug or am I misunderstanding something?

collapse_isoforms_by_sam.py generate long transcripts

Hi,
I used consensus.fasta to generate GFF files. This is the command line:

gmap -D /disk/luping/bam/iso-bam/all-cluster/evaluate_tofu2 -d FG -f samse --no-chimeras -n 1 -t 13 -z sense_force consensus.fasta >clearn-consensus.fasta.sam

sort -k 3,3 -k 4,4n clearn-consensus.fasta.sam >clearn-consensus.fasta.sort.sam

collapse_isoforms_by_sam.py --input consensus.fasta -s clearn-consensus.fasta.sort.sam -o no-fusion-check2-all

Then I get some very long transcripts, like PB.12987.1:

4	PacBio	transcript	7794173	8025705	.	-	.	gene_id "PB.12987"; transcript_id "PB.12987.1";
4	PacBio	exon	7794173	7794629	.	-	.	gene_id "PB.12987"; transcript_id "PB.12987.1";
4	PacBio	exon	8025556	8025705	.	-	.	gene_id "PB.12987"; transcript_id "PB.12987.1";
4	PacBio	transcript	7794174	7794835	.	-	.	gene_id "PB.12987"; transcript_id "PB.12987.2";
4	PacBio	exon	7794174	7794835	.	-	.	gene_id "PB.12987"; transcript_id "PB.12987.2";
4	PacBio	transcript	7794986	7800216	.	-	.	gene_id "PB.12987"; transcript_id "PB.12987.3";
4	PacBio	exon	7794986	7795627	.	-	.	gene_id "PB.12987"; transcript_id "PB.12987.3";
4	PacBio	exon	7795677	7800216	.	-	.	gene_id "PB.12987"; transcript_id "PB.12987.3";
4	PacBio	transcript	7794988	7800004	.	-	.	gene_id "PB.12987"; transcript_id "PB.12987.4";
4	PacBio	exon	7794988	7800004	.	-	.	gene_id "PB.12987"; transcript_id "PB.12987.4";

The long transcript covers many genes, like this:

shangchan.zip

What should I do to remove this transcript?
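One pragmatic workaround is to post-filter the collapsed GTF, dropping any transcript whose "transcript" feature spans more than a chosen number of genomic bases. A minimal sketch, assuming tab-delimited GTF lines; the 200 kb cutoff is an arbitrary example value, not a cupcake parameter:

```python
import re

MAX_SPAN = 200_000  # example cutoff in genomic bases, not a cupcake setting

def filter_long_transcripts(gtf_lines, max_span=MAX_SPAN):
    # First pass: record the genomic span of each transcript feature.
    spans = {}
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        m = re.search(r'transcript_id "([^"]+)"', fields[8])
        if m and fields[2] == "transcript":
            spans[m.group(1)] = int(fields[4]) - int(fields[3]) + 1
    # Second pass: keep only lines belonging to short-enough transcripts.
    kept = []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        m = re.search(r'transcript_id "([^"]+)"', fields[8])
        if m and spans.get(m.group(1), 0) <= max_span:
            kept.append(line)
    return kept
```

With the records above, PB.12987.1 spans ~231 kb and would be dropped, while PB.12987.2 (662 bp) survives.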

License info

Hi,

I want to create a Bioconda package for cDNA_Cupcake for easier distribution among researchers. However, I cannot do that unless the package has a license that allows redistribution.
Do you think it is possible for you to add a license to cDNA_Cupcake or ToFU2?
I have the same request for "Cogent" as well.

Thank you,
Natasha

A small question about collapse isoforms script

Hi,

Firstly, thanks for designing this great software. I'm now working on a project in which the script 'collapse_isoforms_by_sam.py' is used to reduce redundancy of the GMAP result.

The first time I ran the script, I used the default parameters for minimum coverage and minimum identity. The final result shows that I got 288,773 collapsed isoforms and 13,354 genes.

I thought the number of collapsed isoforms was small due to the high filter rate (42.75%), so I changed the coverage and identity values to 0.85 and 0.9 respectively. The final result shows 476,088 collapsed isoforms with a filter rate of 16.86%. However, this time I got only 1,573 genes based on the collapsed.group.txt file.

In other words, I got more isoforms but fewer genes after changing the two parameters.

I'm still a beginner at bioinformatics and it's a little hard for me to find the reason in the script, so I'm writing this issue to figure it out. My guess is that after changing the parameters I got more isoforms, and on the other hand the way of dividing gene groups changed as well.

I'm looking forward to your answers. Thank you very much.

collapse_isoforms_by_sam ValueError: Unknown format code 'f' for object of type 'str'

I've mapped the HQ PacBio sequences to the genome using GMAP (latest version), sorted the SAM file using samtools, and then run the collapse_isoforms_by_sam.py script in the anaconda environment (operating system: Ubuntu 14.04 Biolinux). The script works for a few seconds and then crashes with the following error message:

...
merged 1 down to 1 transcripts
alt. junction found at 15
alt. junction found at 285
alt. junction found at 577
alt. junction found at 926
alt. junction found at 1333
alt. junction found at 2473
alt. junction found at 3396
alt. junction found at 3611
alt. junction found at 4065
alt. junction found at 4244
alt. junction found at 4988
alt. junction found at 5216
alt. junction found at 5720
alt. junction found at 5970
alt. junction found at 6806
alt. junction found at 7161
merged 2 down to 1 transcripts
Traceback (most recent call last):
  File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/bin/collapse_isoforms_by_sam.py", line 4, in <module>
    __import__('pkg_resources').run_script('cupcake==3.1', 'collapse_isoforms_by_sam.py')
  File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 744, in run_script
  File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 1499, in run_script
  File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 244, in <module>
    main(args)
  File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 190, in main
    for recs in iter:
  File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/cupcake/tofu/branch/branch_simple2.py", line 81, in iter_gmap_sam
    ignored_fout.write("{0}\tCoverage {1:.3f} too low.\n".format(r.qID, r.qCoverage))
ValueError: Unknown format code 'f' for object of type 'str'
(anaCogent) adminuser@biolx[S1_allsize] 

I've tried to sort the sam file with picard-tools SamSort but get the same error.

I've uploaded the files to dropbox so you can try to reproduce the error (see console_commands.txt for commands used):
https://www.dropbox.com/sh/6n0q61rq62324hi/AAAtdGIH6O3Wyt3FlzYETUVSa?dl=0
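The root cause is that r.qCoverage can be the string "NA" when the SAM record carries no coverage information, and the {1:.3f} format spec only accepts numbers. A minimal sketch of the failure and a hypothetical guard (format_ignore_line is illustrative, not the shipped function):

```python
# r.qCoverage can be the string "NA" when GMAP reports no coverage,
# and "{1:.3f}" raises ValueError on a string argument.
def format_ignore_line(qID, qCoverage):
    # hypothetical guard around the failing write in iter_gmap_sam
    if isinstance(qCoverage, str):
        return "{0}\tCoverage {1} too low.\n".format(qID, qCoverage)
    return "{0}\tCoverage {1:.3f} too low.\n".format(qID, qCoverage)

print(format_ignore_line("read1", 0.123456))  # read1	Coverage 0.123 too low.
print(format_ignore_line("read2", "NA"))      # read2	Coverage NA too low.
```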

CupCake error duplicate ID

I combined two Iso-Seq cells and got an error for duplicate IDs, but I didn't find any duplicates in my polished.hq.fasta file. Where exactly is it finding the duplicate ID names?

thanks.

Traceback (most recent call last):
  File "/scinet01/gov/usda/ars/scinet/project/tgbs_analysis/anaconda3/envs/TofCogent/bin/collapse_isoforms_by_sam.py", line 4, in <module>
    __import__('pkg_resources').run_script('cupcake==6.1', 'collapse_isoforms_by_sam.py')
  File "/scinet01/gov/usda/ars/scinet/project/tgbs_analysis/anaconda3/envs/TofCogent/lib/python2.7/site-packages/pkg_resources/__init__.py", line 664, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/scinet01/gov/usda/ars/scinet/project/tgbs_analysis/anaconda3/envs/TofCogent/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1444, in run_script
    exec(code, namespace, namespace)
  File "/scinet01/gov/usda/ars/scinet/project/tgbs_analysis/anaconda3/envs/TofCogent/lib/python2.7/site-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 248, in <module>
    main(args)
  File "/scinet01/gov/usda/ars/scinet/project/tgbs_analysis/anaconda3/envs/TofCogent/lib/python2.7/site-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 177, in main
    check_ids_unique(args.input, is_fq=args.fq)
  File "/scinet01/gov/usda/ars/scinet/project/tgbs_analysis/anaconda3/envs/TofCogent/lib/python2.7/site-packages/cupcake-6.1-py2.7-linux-x86_64.egg/cupcake/tofu/utils.py", line 13, in check_ids_unique
    raise Exception, "Duplicate id {0} detected. Abort!".format(r.id)
Exception: Duplicate id  detected. Abort

Chain samples together ('chain_samples.py')

Hi,

I am trying to use chain_samples.py, but I couldn't find the count file (.count.txt) for the COUNT_FILENAME in the config file. The .count.txt is not generated by running collapse_isoforms_by_sam.py. I did try using the .collapsed.abundance.txt instead of the .count.txt after running get_abundance_post_collapse.py, but I got an error: "NameError: global name 'count_header' is not defined". I would appreciate it if you could let me know how to set the COUNT_FILENAME in the config file to run chain_samples.py. Thank you!

Taehee

Value_error

Hi,

I got this error message after running "collapse_isoforms_by_sam.py".
Could you please let me know what needs to be done here?

Thanks.

Traceback (most recent call last):
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/bin/collapse_isoforms_by_sam.py", line 4, in <module>
__import__('pkg_resources').run_script('cupcake==5.8', 'collapse_isoforms_by_sam.py')
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 658, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1438, in run_script
exec(code, namespace, namespace)
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/cupcake-5.8-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 245, in <module>
main(args)
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/cupcake-5.8-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 204, in main
collapse_fuzzy_junctions(f_good.name, f_txt.name, args.allow_extra_5exon, internal_fuzzy_max_dist=args.max_fuzzy_junction)
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/cupcake-5.8-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 149, in collapse_fuzzy_junctions
_size = get_fl_from_id(group_info[pbid])
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/cupcake-5.8-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 92, in get_fl_from_id
return sum(int(_id.split('/')[1].split('p')[0][1:]) for _id in members)
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/cupcake-5.8-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 92, in <genexpr>
return sum(int(_id.split('/')[1].split('p')[0][1:]) for _id in members)
ValueError: invalid literal for int() with base 10: '714.1'
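The failing line parses the FL read count out of cluster IDs shaped like i0_HQ_sample|cXXX/fNpM/length, where fN encodes N full-length reads; IDs that don't match this pattern raise exactly this ValueError. A hedged demo (both IDs below are made up for illustration):

```python
# get_fl_from_id expects cluster IDs like "i0_HQ_sample|c100/f2p30/1000",
# where "f2" means 2 full-length reads supported the cluster.
def fl_count(_id):
    # mirrors: int(_id.split('/')[1].split('p')[0][1:])
    return int(_id.split('/')[1].split('p')[0][1:])

print(fl_count("i0_HQ_sample|c100/f2p30/1000"))  # 2

# A made-up ID that doesn't fit the pattern reproduces the reported error:
try:
    fl_count("c100/f714.1p0")
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: '714.1'
```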

Error in iter_gmap_sam

The program ran through a few thousand transcripts before getting to this:

merged 1 down to 1 transcripts
merged 1 down to 1 transcripts
Traceback (most recent call last):
  File "/cluster/home/ifiddes/anaconda2/bin/collapse_isoforms_by_sam.py", line 4, in <module>
    __import__('pkg_resources').run_script('cupcake==2.1', 'collapse_isoforms_by_sam.py')
  File "/cluster/home/ifiddes/anaconda2/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 744, in run_script
  File "/cluster/home/ifiddes/anaconda2/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 1499, in run_script
  File "/cluster/home/ifiddes/anaconda2/lib/python2.7/site-packages/cupcake-2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 242, in <module>
    main(args)
  File "/cluster/home/ifiddes/anaconda2/lib/python2.7/site-packages/cupcake-2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 190, in main
    for recs in iter:
  File "/cluster/home/ifiddes/anaconda2/lib/python2.7/site-packages/cupcake-2.1-py2.7-linux-x86_64.egg/cupcake/tofu/branch/branch_simple2.py", line 77, in iter_gmap_sam
    ignored_fout.write("{0}\tCoverage {1:.3f} too low.\n".format(r.qID, r.qCoverage))
ValueError: Unknown format code 'f' for object of type 'str'

I saw this much earlier when I tried to use a STAR-aligned SAM file, but after looking through the code a bit it seemed that you rely on the identity and coverage metrics provided by GMAP, so I switched to a GMAP SAM. I was able to get through one dataset entirely, but then the next one hit this problem.

Saturation analysis from isoform level

Dear,

For most studies, saturation analysis has been conducted at the gene level.
However, I think saturation at the isoform level deserves more attention, doesn't it?

In my dataset, I've got 100,000 ROIs, 50,000 FLNC reads, and 18,000 HQ-polished isoforms.
I found that most HQ-polished isoforms are supported by only 1 FLNC read, and if I draw a curve showing the relationship between FLNC reads and HQ-polished isoforms, I get a completely unsaturated result.

Does that mean my data is completely unsaturated?
Or am I using the data in the wrong way?

Thanks a lot in advance!
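One way to ask the isoform-level question directly is a rarefaction curve: subsample the FLNC-to-isoform assignments at increasing depths and count distinct isoforms recovered. A minimal sketch, assuming you can build a read-to-isoform dict from your cluster report (the input format is an assumption, not a cupcake output):

```python
import random

def rarefaction(read_to_isoform, depths, seed=0):
    """Subsample FLNC reads at each depth and count distinct isoforms.

    read_to_isoform: dict mapping FLNC read name -> isoform id
    (an assumed input format; build it from your cluster report)."""
    rng = random.Random(seed)
    reads = list(read_to_isoform)
    curve = []
    for depth in depths:
        sample = rng.sample(reads, min(depth, len(reads)))
        curve.append((depth, len({read_to_isoform[r] for r in sample})))
    return curve
```

If the resulting curve is still climbing steeply at full depth, isoform discovery has not saturated; a plateau suggests additional FLNC reads would mostly hit isoforms you already have.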

Annotating unique isoforms

Hi @Magdoll,

First time user of Cupcake here.

I have two separate questions -

  1. I have Iso-Seq CCS runs from two separate SMRT cells for a single tissue. Do I combine the HQ_isoforms.fasta files of both runs and then run the collapse_isoforms scripts and the remaining downstream steps, or should I collapse isoforms individually, get count information, and then do the chaining?

  2. I have a reference genome and I have mapped the HQ isoforms to it, but I would also like to use the annotation file (GFF) to know the gene names. Is there a way to do this, or should I just use bedtools to intersect the GFF file after collapsing isoforms?

Many thanks!
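For the second question, a bedtools intersect on the collapsed GFF is the usual route; for small cases, plain interval overlap is enough to attach gene names. A hedged sketch (coordinates and gene names below are made up):

```python
# Attach gene names by simple interval overlap between a collapsed
# transcript and annotation genes. Coordinates and names are made up.
def overlapping_genes(transcript, genes):
    chrom, start, end = transcript
    # two intervals on the same chromosome overlap iff each starts
    # before the other ends
    return [name for (c, s, e, name) in genes
            if c == chrom and s <= end and start <= e]

genes = [("1", 100, 500, "geneA"), ("1", 600, 900, "geneB")]
print(overlapping_genes(("1", 450, 700), genes))  # ['geneA', 'geneB']
```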

fusion_finder.py error

What's wrong with this?

gmap -D /disk/luping/bam/iso-bam/all-cluster/evaluate_tofu2 -d FG -f samse -n 0 -t 13 -z sense_force consensus.fasta >consensus.fasta.sam
sort -k 3,3 -k 4,4n consensus.fasta.sam >consensus.fasta.sort.sam
[luping@centos pb-con-check]$ fusion_finder.py --input consensus.fasta -s consensus.fasta.sort.sam -o ./fusion
Traceback (most recent call last):
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/bin/fusion_finder.py", line 4, in <module>
    __import__('pkg_resources').run_script('cupcake==3.0', 'fusion_finder.py')
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1494, in run_script
    exec(code, namespace, namespace)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/cupcake-3.0-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fusion_finder.py", line 393, in <module>
    min_dist_between_loci=args.min_dist_between_loci)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/cupcake-3.0-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fusion_finder.py", line 301, in fusion_main
    o = merge_fusion_exons(v, max_fusion_point_dist=100, max_exon_end_dist=0, allow_extra_5_exons=allow_extra_5_exons)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/cupcake-3.0-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fusion_finder.py", line 207, in merge_fusion_exons
    if all(is_fusion_compatible(r1, r2, max_fusion_point_dist, max_exon_end_dist, allow_extra_5_exons) for r2 in r2s):
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/cupcake-3.0-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fusion_finder.py", line 207, in <genexpr>
    if all(is_fusion_compatible(r1, r2, max_fusion_point_dist, max_exon_end_dist, allow_extra_5_exons) for r2 in r2s):
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/cupcake-3.0-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fusion_finder.py", line 153, in is_fusion_compatible
    type = compare_junctions(r1, r2)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/cupcake-3.0-py2.7-linux-x86_64.egg/cupcake/tofu/compare_junctions.py", line 21, in compare_junctions
    strand = r1.strand
AttributeError: GMAPSAMRecord instance has no attribute 'strand'

collapse_isoforms_by_sam.py runtime error

Hi,

I have already updated to v5.8.

When I run collapse_isoforms_by_sam.py, the following error occurs:

$tail runtime.log
alt. junction found at 514
alt. junction found at 804
Traceback (most recent call last):
  File "/home/chenlingxi/mnt/chenlingxi/lib/tools/anaconda2/bin/collapse_isoforms_by_sam.py", line 245, in <module>
    main(args)
  File "/home/chenlingxi/mnt/chenlingxi/lib/tools/anaconda2/bin/collapse_isoforms_by_sam.py", line 190, in main
    for recs in iter: # recs is {'+': list of list of records, '-': list of list of records}
  File "/home/chenlingxi/mnt/chenlingxi/lib/tools/anaconda2/lib/python2.7/site-packages/cupcake/tofu/branch/branch_simple2.py", line 97, in iter_gmap_sam
    ignored_fout.write("{0}\tCoverage {1:.3f} too low.\n".format(r.qID, r.qCoverage))
ValueError: Unknown format code 'f' for object of type 'str'

There are already some outputs:

$ls
OC001T.hq_isoforms.collapsed.gff  OC001T.hq_isoforms.collapsed.group.txt  OC001T.hq_isoforms.ignored_ids.txt  runtime.log

How can I fix this? Thanks.
