magdoll / cDNA_Cupcake
Miscellaneous collection of Python and R scripts for processing Iso-Seq data
License: BSD 3-Clause Clear License
Running the isoseq3 suite gives me a "cluster_report.csv" file in this format:
cluster_id,read_id,read_type
transcript/11530,m54064_180112_100315/68092545/ccs,FL
transcript/11530,m54064_180112_100315/35979778/ccs,FL
transcript/11530,m54064_180112_100315/20448150/ccs,FL
and a "collapsed.group.txt" in this format:
PB.1.1 transcript/1968,transcript/2407,transcript/3021,transcript/3170,transcript/5415
PB.1.2 transcript/22838
PB.2.1 transcript/3489,transcript/3803
Now, running get_abundance_post_collapse.py gives me this error message:
$ srun get_abundance_post_collapse.py hq_isoforms.fasta.no5merge.collapsed /home/jleple/work/Isoseq3_CO1_ccs/POLISH9999/cluster_report.csv
Traceback (most recent call last):
File "/usr/local/bioinfo/src/Miniconda/Miniconda3-4.4.10/envs/cogent-v3.6_env/bin/get_abundance_post_collapse.py", line 4, in <module>
__import__('pkg_resources').run_script('cupcake==5.3', 'get_abundance_post_collapse.py')
File "/usr/local/bioinfo/src/Miniconda/Miniconda3-4.4.10/envs/cogent-v3.6_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 661, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/bioinfo/src/Miniconda/Miniconda3-4.4.10/envs/cogent-v3.6_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1441, in run_script
exec(code, namespace, namespace)
File "/usr/local/bioinfo/src/Miniconda/Miniconda3-4.4.10/envs/cogent-v3.6_env/lib/python2.7/site-packages/cupcake-5.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 435, in <module>
get_abundance_post_collapse(args.collapse_prefix, args.cluster_report, args.collapse_prefix)
File "/usr/local/bioinfo/src/Miniconda/Miniconda3-4.4.10/envs/cogent-v3.6_env/lib/python2.7/site-packages/cupcake-5.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 422, in get_abundance_post_collapse
output_read_count_IsoSeq_csv(cid_info, cluster_report_csv, output_prefix + '.read_stat.txt')
File "/usr/local/bioinfo/src/Miniconda/Miniconda3-4.4.10/envs/cogent-v3.6_env/lib/python2.7/site-packages/cupcake-5.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 148, in output_read_count_IsoSeq_csv
raise Exception, "cluster_id {0} is not a valid cluster ID!".format(cid)
Exception: cluster_id transcript/0 is not a valid cluster ID!
srun: error: node130: task 0: Exited with exit code 1
Do you have a solution for me?
Thank you very much in advance
Jean-Charles
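A quick way to diagnose this kind of mismatch is to check whether every member ID in collapsed.group.txt actually appears as a cluster_id in cluster_report.csv. A minimal diagnostic sketch, assuming only the two file layouts shown above (the function names and file handles are illustrative, not part of cupcake):

```python
import csv
import io

def load_group_members(fh):
    """Collect member IDs (e.g. transcript/11530) from collapsed.group.txt."""
    members = set()
    for line in fh:
        parts = line.split()  # PB.X.Y <tab> comma-separated member list
        if len(parts) >= 2:
            members.update(parts[1].split(","))
    return members

def load_cluster_ids(fh):
    """Collect cluster_id values from cluster_report.csv."""
    return {row["cluster_id"] for row in csv.DictReader(fh)}

def missing_ids(group_fh, report_fh):
    """Member IDs used by the collapse output but absent from the report."""
    return load_group_members(group_fh) - load_cluster_ids(report_fh)

# Toy example built from the snippets in this issue:
group_txt = "PB.1.2\ttranscript/22838\n"
report_csv = "cluster_id,read_id,read_type\ntranscript/11530,m54064_180112_100315/68092545/ccs,FL\n"
print(missing_ids(io.StringIO(group_txt), io.StringIO(report_csv)))  # -> {'transcript/22838'}
```

If the set is non-empty, the collapse input and the cluster report likely come from different runs (or the IDs were renamed in between), which would be consistent with the "is not a valid cluster ID" exception.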
Hi,
I used the command "chain_samples.py chain.cfg norm_nfl" to chain the collapsed Iso-Seq results from two samples. But when I check the "all_samples.chained.gff" file, I find that there are duplicate transcripts (PB.1.1 and PB.1.2):
chr1 PacBio transcript 70 4943 . + . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio exon 70 4943 . + . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio transcript 70 4943 . + . gene_id "PB.1"; transcript_id "PB.1.2";
chr1 PacBio exon 70 4943 . + . gene_id "PB.1"; transcript_id "PB.1.2";
chr1 PacBio transcript 1143 3686 . + . gene_id "PB.1"; transcript_id "PB.1.3";
chr1 PacBio exon 1143 2231 . + . gene_id "PB.1"; transcript_id "PB.1.3";
chr1 PacBio exon 2292 3686 . + . gene_id "PB.1"; transcript_id "PB.1.3";
chr1 PacBio transcript 1215 6942 . + . gene_id "PB.1"; transcript_id "PB.1.4";
chr1 PacBio exon 1215 2219 . + . gene_id "PB.1"; transcript_id "PB.1.4";
chr1 PacBio exon 5669 6942 . + . gene_id "PB.1"; transcript_id "PB.1.4";
Here are the chained IDs in 'all_samples.chained_ids.txt':
superPBID sample-1 sample-2
PB.1.1 PB.1.1 PB.1.1
PB.1.2 PB.1.1 PB.2.1
PB.1.3 NA PB.2.2
PB.1.4 NA PB.2.3
Here is the collapsed.gff file for sample-1:
chr1 PacBio transcript 70 4943 . + . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio exon 70 4943 . + . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio transcript 1348 3693 . - . gene_id "PB.2"; transcript_id "PB.2.1";
chr1 PacBio exon 1348 3693 . - . gene_id "PB.2"; transcript_id "PB.2.1";
And here are the head lines of the collapsed.gff file for sample-2:
chr1 PacBio transcript 70 1020 . + . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio exon 70 1020 . + . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio transcript 1033 4943 . + . gene_id "PB.2"; transcript_id "PB.2.1";
chr1 PacBio exon 1033 4943 . + . gene_id "PB.2"; transcript_id "PB.2.1";
chr1 PacBio transcript 1143 3686 . + . gene_id "PB.2"; transcript_id "PB.2.2";
chr1 PacBio exon 1143 2231 . + . gene_id "PB.2"; transcript_id "PB.2.2";
chr1 PacBio exon 2292 3686 . + . gene_id "PB.2"; transcript_id "PB.2.2";
chr1 PacBio transcript 1215 6942 . + . gene_id "PB.2"; transcript_id "PB.2.3";
chr1 PacBio exon 1215 2219 . + . gene_id "PB.2"; transcript_id "PB.2.3";
chr1 PacBio exon 5669 6942 . + . gene_id "PB.2"; transcript_id "PB.2.3";
I cannot figure out whether there is an error in my data or command preparation. Can anyone explain this?
Thanks a lot!
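To confirm the duplication, transcripts can be grouped by their exact exon chain; two transcript IDs sharing one chain is what all_samples.chained.gff shows for PB.1.1/PB.1.2. A small sketch (not part of cupcake) that does this for a collapsed-style GFF, assuming tab-delimited lines like those above:

```python
import re
from collections import defaultdict

def duplicate_exon_chains(gff_lines):
    """Group transcript IDs by (chrom, strand, exon-coordinate chain) and
    return the groups that contain more than one transcript."""
    exons = defaultdict(list)   # transcript_id -> [(start, end), ...]
    meta = {}                   # transcript_id -> (chrom, strand)
    for line in gff_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "exon":
            continue
        m = re.search(r'transcript_id "([^"]+)"', f[8])
        if not m:
            continue
        tid = m.group(1)
        exons[tid].append((int(f[3]), int(f[4])))
        meta[tid] = (f[0], f[6])
    chains = defaultdict(list)
    for tid, coords in exons.items():
        chrom, strand = meta[tid]
        chains[(chrom, strand, tuple(sorted(coords)))].append(tid)
    return [tids for tids in chains.values() if len(tids) > 1]
```

Whether chain_samples.py should have merged sample-1's PB.1.1 with both PB.1.1 and PB.2.1 of sample-2 is the real question; the sketch only verifies that the chained output indeed contains identical exon chains.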
This is the command I used:
collapse_isoforms_by_sam.py --input $FASTQ --fq -s $SORTED_SAM -o ${OUT_DIR}/25675.polished_high_qv_consensus_isoforms_collapsed
The program ran for a while and I got the following error:
Traceback (most recent call last):
File "/home/shenl03/anaconda2/bin/collapse_isoforms_by_sam.py", line 4, in <module>
__import__('pkg_resources').run_script('cupcake==3.7', 'collapse_isoforms_by_sam.py')
File "/home/shenl03/anaconda2/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 744, in run_script
File "/home/shenl03/anaconda2/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 1499, in run_script
File "/data/home/shenl03/anaconda2/lib/python2.7/site-packages/cupcake-3.7-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 244, in <module>
main(args)
File "/data/home/shenl03/anaconda2/lib/python2.7/site-packages/cupcake-3.7-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 190, in main
for recs in iter:
File "/home/shenl03/anaconda2/lib/python2.7/site-packages/cupcake-3.7-py2.7-linux-x86_64.egg/cupcake/tofu/branch/branch_simple2.py", line 81, in iter_gmap_sam
ignored_fout.write("{0}\tCoverage {1:.3f} too low.\n".format(r.qID, r.qCoverage))
ValueError: Unknown format code 'f' for object of type 'str'
Please offer a fix!
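The ValueError suggests r.qCoverage arrived as a string rather than a float on that record. Assuming nothing about cupcake internals beyond what the traceback shows, a defensive version of the failing format call would be (the function name is hypothetical):

```python
def format_low_coverage(qID, qCoverage):
    """Mimics the failing line in branch_simple2.py, but tolerates
    qCoverage arriving as a string (or missing entirely)."""
    try:
        cov = float(qCoverage)
        return "{0}\tCoverage {1:.3f} too low.\n".format(qID, cov)
    except (TypeError, ValueError):
        # Fall back to printing the raw value instead of crashing.
        return "{0}\tCoverage {1} too low.\n".format(qID, qCoverage)
```

The underlying cause may be a SAM record where the coverage field could not be computed (so the parser left it as a string); checking how the SAM was produced and sorted may be the real fix.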
Hello,
I've used collapse_isoforms_by_sam.py to remove redundant transcripts and get the results.
However, I found that the content in collapsed.gff file is not consistent with collapsed.rep.fa file.
Take 'PB.10113.15' as an example. As you can see below, the length of this transcript in the GFF is 2091, but the sequence in the FASTA file is 2154 nucleotides long.
How should I deal with the difference between the GFF file and the FASTA file?
The gene structure annotation of this transcript is:
22 PacBio transcript 36253133 36267525 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36253133 36253219 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36254937 36254999 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36257083 36257136 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36257319 36257407 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36261596 36261722 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36265151 36266135 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36266840 36267525 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
>PB.10113.15
ATTATACAGACGCATAACTGGAGGTGGGATCCACACAGCTCAGAACAGCTGGATCTTGCT
CAGTCTCTGCCAGGGGAAGATTCCTTGGAGGAGGCCCTGCAGCGACATGGAGGGAGCTGC
TTTGCTGAGAGTCTCTGTCCTCTGCATCTGGATGAGTGCACTTTTCCTTGGTGTGGGAGT
GAGGGCAGAGGAAGCTGGAGCGAGGGTGCAACAAAACGTTCCAAGTGGGACAGATACTGG
AGATCCTCAAAGTAAGCCCCTCGGTGACTGGGCTGCTGGCACCATGGACCCAGAGAGCAG
TATCTTTATTGAGGATGCCATTAAGTATTTCAAGGAAAAAGTGAGCACACAGAATCTGCT
ACTCCTGCTGACTGATAATGAGGCCTGGAACGGATTCGTGGCTGCTGCTGAACTGCCCAG
GAATGAGGCAGATGAGCTCCGTAAAGCTCTGGACAACCTTGCAAGACAAATGATCATGAA
AGACAAAAACTGGCACGATAAAGGCCAGCAGTACAGAAACTGGTTTCTGAAAGAGTTTCC
TCGGTTGAAAAGTGAGCTTGAGGATAACATAAGAAGGCTCCGTGCCCTTGCAGATGGGGT
TCAGAAGGTCCACAAAGGCACCACCATCGCCAGTGTGGTGTCTGGCTCTCTCAGCATTTC
CTCTGGCATCCTGACCCTCGTCGGCATGGGTCTGGCACCCTTCACAGAGGGAGGCAGCCT
TGTACTCTTGGAACCTGGGATGGAGTTGGGAATCACAGCAGCTTTGACCGGGATTACCAG
CAGTACCATAGACTACGGAAAGAAGTGGTGGACACAAGCCCAAGCCCACGACCTGGTCAT
CAAAAGCCTTGACAAATTGAAGGAGGTGAAGGAGTTTTTGGGTGAGAACATATCCAACTT
TCTTTCCTTAGCTGGCAATACTTACCAACTCACACGAGGCATTGGGAAGGACATCCGTGC
CCTCAGACGAGCCAGAGCCAATCTTCAGTCAGTACCGCATGCCTCAGCCTCACGCCCCCG
GGTCACTGAGCCAATCTCAGCTGAAAGCGGTGAACAGGTGGAGAGAGTTAATGAACCCAG
CATCCTGGAAATGAGCAGAGGAGTCAAGCTCACGAATGTGGCCCCTGTAAGCTTCTTTCT
TGTGCTGGATGTAGTCTACCTCGTGTACGAATCAAAGCACTTACATGAGGGGGCAAAGTC
AGAGACAGCTGAGGAGCTGAAGAAGGTGGCTCAGGAGCTGGAGGAGAAGCTAAACATTCT
CAACAATAATTATAAGATTCTGCAGGCGGACCAAGAACTGTGACCACAGGGCAGGGCAGC
CACCAGGAGAGATATGCCTGGCAGGGGCCAGGACAAAATGCAAACTTTTTTTTTTTCTGA
GACAGAGTCTTGCACTGTCGCCAAGTGGAGCTTGCAGTGAGCCGAGATATCGCCACTGCA
CTCCAGCCTGGGTGACAGAGCGAGACTCCATCTCAAAAAAAAAAAAAAAAAGAATATATT
GACGGAAGAATAGAGAGGAGGCTTGAAGGAACCAGCAATGAGAAGGCCAGGAAAAGAAAG
AGCTGAAAATGGAGAAAGCCCAAGAGTTAGAACAGTTGGATACAGGAGAAGAAACAGCGG
CTCCACTACAGACCCAGCCCCAGGTTCAATGTCCTCCGAAGAATGAAGTCTTTCCCTGGT
GATGGTCCCCTGCCCTGTCTTTCCAGCATCCACTCTCCCTTGTCCTCCTGGGGGCATATC
TCAGTCAGGCAGCGGCTTCCTGATGATGGTCGTTGGGGTGGTTGTCATGTGATGGGTCCC
CTCCAGGTTACTAAAGGGTGCATGTCCCCTGCTTGAACACTGAAGGGCAGGTGGTGGGCC
ATGGCCATGGTCCCCAGCTGAGGAGCAGGTGTCCCTGAGAACCCAAACTTCCCAGAGAGT
ATGTGAGAACCAACCAATGAAAACAGTCCCATCGCTCTTACCCGGTAAGTAAACAGTCAG
AAAATTAGCATGAAAGCAGTTTAGCATTGGGAGGAAGCTCAGATCTCTAGAGCTGTCTTG
TCGCCGCCCAGGATTGACCTGTGTGTAAGTCCCAATAAACTCACCTACTCATCAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCC
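For reference, the 2091 in the GFF is the sum of the exon lengths, while the FASTA record can be longer because unaligned terminal bases (for example a poly(A) stretch like the one ending this sequence) stay in the representative sequence but not in the alignment. A quick check using the exon coordinates above:

```python
def exon_chain_length(exon_coords):
    """Sum of exon lengths for 1-based inclusive (start, end) coordinates."""
    return sum(end - start + 1 for start, end in exon_coords)

# Exon coordinates of PB.10113.15 from the GFF above:
pb_10113_15 = [
    (36253133, 36253219), (36254937, 36254999), (36257083, 36257136),
    (36257319, 36257407), (36261596, 36261722), (36265151, 36266135),
    (36266840, 36267525),
]
print(exon_chain_length(pb_10113_15))  # -> 2091
```

So the two numbers measure different things (aligned exonic bases vs. full read sequence), and a 2154 - 2091 = 63 nt difference is plausible for soft-clipped ends.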
integrate fusion finding into cupcake
allow chaining of fusion transcripts
chain_samples.py skips a fixed number of header rows in *counts.txt as part of a sanity check. The number of header rows has changed, causing a KeyError:
ids3 = [r['pbid'] for r in DictReader(f, delimiter='\t')]
KeyError: 'pbid'
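A more robust sanity check would skip every leading '#' comment line instead of a fixed count before handing the file to DictReader. A sketch, assuming the counts file is '#'-prefixed comment lines followed by a tab-delimited table with a pbid column:

```python
import io
from csv import DictReader

def read_pbids(fh):
    """Skip leading '#' comment lines, then read the pbid column."""
    pos = fh.tell()
    line = fh.readline()
    while line.startswith("#"):
        pos = fh.tell()
        line = fh.readline()
    fh.seek(pos)  # rewind to the real header row
    return [r["pbid"] for r in DictReader(fh, delimiter="\t")]

counts = "# comment 1\n# comment 2\npbid\tcount_fl\nPB.1.1\t10\nPB.1.2\t3\n"
print(read_pbids(io.StringIO(counts)))  # -> ['PB.1.1', 'PB.1.2']
```

This way the check keeps working no matter how many comment lines the abundance step emits.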
Addition of cupcake2.tofu2 dependency to setup.py without associated files.
How am I meant to execute the chain_samples.py script? I suspect it needs to be installed, but it doesn't appear in setup.py. If I call it directly, it has trouble resolving the module:
Traceback (most recent call last):
File "cDNA_Cupcake/cupcake/tofu/counting/chain_samples.py", line 9, in <module>
from cupcake.tofu.counting import combine_abundance_across_samples as sp
ImportError: No module named counting
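chain_samples.py does a package-level import (from cupcake.tofu.counting import ...), so calling the file directly only works when the repository root is importable, for example via an installed egg or by putting the checkout on PYTHONPATH. The equivalent from inside Python, with the checkout path as a placeholder you must adjust:

```python
import sys

# Hypothetical checkout location -- replace with your clone of cDNA_Cupcake.
REPO_ROOT = "/path/to/cDNA_Cupcake"

if REPO_ROOT not in sys.path:
    sys.path.insert(0, REPO_ROOT)  # makes 'from cupcake.tofu.counting import ...' resolvable
print(sys.path[0])  # -> /path/to/cDNA_Cupcake
```

Equivalently, `export PYTHONPATH=/path/to/cDNA_Cupcake:$PYTHONPATH` in the shell before invoking the script should have the same effect.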
I have an ISOSeq dataset which contains 71k sequences.
before running Cogent, I want to run the preCluster step.
However, if I play with the --num_seqs_per_batch parameter, I get different outputs. I do not fully understand why this is.
With the batch size at 5000 I get ~7000 clusters, with also around ~7000 orphan reads.
Setting the batch size to 500000, I get ~23k orphans reads and 11k clusters.
I can see from the partitioning CSV file that the clusters from the latter setting are smaller compared to those from the small batch size.
The clusters might get divided further by the run_mash script afterwards, but the orphans are untouched, as far as I understand.
My question: does this really matter, and if so, how should the batch size be determined in run_preCluster.py?
Hello @Magdoll,
Hope you are well.
I'm currently experiencing some challenges obtaining count information using the get_abundance_post_collapse.py script after Isoseq3, due to errors with the cluster ID:
get_abundance_post_collapse.py ./Collapse_Isoform/Q21.collapsed cluster_report.csv
Traceback (most recent call last):
File "/usr/bin/get_abundance_post_collapse.py", line 5, in <module>
pkg_resources.run_script('cupcake==5.2', 'get_abundance_post_collapse.py')
File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 540, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 1455, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/lib64/python2.7/site-packages/cupcake-5.2-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 435, in <module>
get_abundance_post_collapse(args.collapse_prefix, args.cluster_report, args.collapse_prefix)
File "/usr/lib64/python2.7/site-packages/cupcake-5.2-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 422, in get_abundance_post_collapse
output_read_count_IsoSeq_csv(cid_info, cluster_report_csv, output_prefix + '.read_stat.txt')
File "/usr/lib64/python2.7/site-packages/cupcake-5.2-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 148, in output_read_count_IsoSeq_csv
raise Exception, "cluster_id {0} is not a valid cluster ID!".format(cid)
Exception: cluster_id transcript/0 is not a valid cluster ID!
However, the cluster report appears to be correct, as outlined on the Isoseq3 GitHub page:
cluster_id,read_id,read_type
transcript/0,m54082_180607_173058/60883723/ccs,FL
transcript/0,m54082_180607_173058/6161326/ccs,FL
transcript/0,m54082_180607_173058/27460031/ccs,FL
transcript/1,m54082_180607_173058/24838457/ccs,FL
I also tried modifying the collapsed result and cluster_report.csv; however, given that there was no prefix and the cid needs formatting, I wasn't able to proceed further.
As a side note, I also had errors when generating the cluster report itself with isoseq3_make_cluster_report.py on the polished.bam file:
Traceback (most recent call last):
File "/mnt/data1/software/PacBio/cDNA_Cupcake/post_isoseq_cluster/isoseq3_make_cluster_report.py", line 35, in <module>
make_cluster_report_from_polished_bam(args.polished_bam)
File "/mnt/data1/software/PacBio/cDNA_Cupcake/post_isoseq_cluster/isoseq3_make_cluster_report.py", line 9, in make_cluster_report_from_polished_bam
f = pysam.Samfile(polished_bam)
File "pysam/libcalignmentfile.pyx", line 444, in pysam.libcalignmentfile.AlignmentFile.__cinit__
File "pysam/libcalignmentfile.pyx", line 664, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='r') - is it SAM/BAM format? Consider opening with check_sq=False
However, this was resolved when I edited the script to use "f = pysam.Samfile(polished_bam, check_sq=False)".
Any help will be greatly appreciated. Thank you so much!
I have been trying to install the most recent commit, but have been running into errors. Running setup.py gives me
running install
running bdist_egg
running egg_info
creating cupcake.egg-info
writing cupcake.egg-info/PKG-INFO
writing dependency_links to cupcake.egg-info/dependency_links.txt
writing requirements to cupcake.egg-info/requires.txt
writing top-level names to cupcake.egg-info/top_level.txt
writing manifest file 'cupcake.egg-info/SOURCES.txt'
error: package directory 'cupcake2' does not exist
Tracing this issue shows the addition of requirements for "'cupcake2', 'cupcake2.io', 'cupcake2.ice2'" in the newest commit, but I can't find evidence of this package in the rest of the repository. I'm assuming there are files missing?
Everything installs fine prior to your commit a few days back.
Hi:
I've found a problem with wrong gene classification.
In the picture, PB.583 has two parts: PB.583.1, PB.583.2, PB.583.3, PB.583.4 and PB.583.5, PB.583.6, PB.583.7, PB.583.8. These are obviously two genes.
I think the reason for the wrong gene classification is that PB.584 links the two parts of PB.583.
There are many situations like this here:
Hi @Magdoll ,
I ran fusion_finder.py on both collapsed and uncollapsed data.
I cannot figure out why the results differ; in my understanding, they should be the same whether collapsed or uncollapsed data is used as input.
Thanks a lot.
Hi,
is it okay to run collapse_isoforms_by_sam.py with --max_fuzzy_junction set to 0 (zero)?
Thanks
A
The data was generated by isoseq3 in SMRT Link 6.0. The header looks like:
@multiple_samples_HQ_transcript/0
filter_by_count.py generated this error:
Traceback (most recent call last):
File "/usr/local/bin/filter_by_count.py", line 4, in <module>
__import__('pkg_resources').run_script('cupcake==6.1', 'filter_by_count.py')
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 719, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1504, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/filter_by_count.py", line 106, in <module>
filter_by_count(args.input_prefix, output_prefix, args.min_count, args.dun_use_group_count)
File "/usr/local/lib/python2.7/dist-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/filter_by_count.py", line 40, in filter_by_count
fl_count, p_count = tmp.split('p')
ValueError: need more than 1 value to unpack
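The failing line splits a token on 'p', which suggests the script expects isoseq1-style sequence IDs carrying embedded FL/non-FL counts (an fXpY segment), whereas the SMRT Link 6.0 / isoseq3 headers look like multiple_samples_HQ_transcript/0. A small sketch to check which naming convention a file uses before running the filter; the f\d+p\d+ pattern is an assumption inferred from the traceback, not confirmed cupcake behavior:

```python
import re

# Assumed isoseq1-style pattern with embedded counts, e.g. ".../f3p0/2000"
ISOSEQ1_STYLE = re.compile(r"f(\d+)p(\d+)")

def embedded_counts(seq_id):
    """Return (fl_count, nfl_count) if the ID embeds them, else None."""
    m = ISOSEQ1_STYLE.search(seq_id)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

print(embedded_counts("HQ_sample|cb7607_c21/f3p0/463"))      # -> (3, 0)
print(embedded_counts("multiple_samples_HQ_transcript/0"))   # -> None
```

If your headers fall in the None case, the counts must come from the cluster report via get_abundance_post_collapse.py rather than from the sequence IDs themselves.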
For the S1 sample, filter_away_subset.py finished without a problem, but for the second sample (S2) I get this error:
Traceback (most recent call last):
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/bin/filter_away_subset.py", line 4, in <module>
__import__('pkg_resources').run_script('cupcake==3.1', 'filter_away_subset.py')
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 744, in run_script
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 1499, in run_script
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/filter_away_subset.py", line 150, in <module>
main()
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/filter_away_subset.py", line 141, in main
r = d[k]
KeyError: 'PB.1.1'
This is the command I used to run the script (in the anaCogent environment):
filter_away_subset.py all_sizes.quivered_hq_MPE_S2.fastq.collapsed
the input and generated output files are available here:
https://www.dropbox.com/sh/z4wps7y6gvyg4w2/AACDiAk4X-SyyoeWYwTqL0i2a?dl=0
Hi @Magdoll ,
The abundance file gives six measurements; which one do you recommend using as the gene expression metric in downstream analysis? Or should I do other normalization such as TPM, RPKM, or FPKM?
# count_fl: Number of associated FL reads
# count_nfl: Number of associated FL + unique nFL reads
# count_nfl_amb: Number of associated FL + unique nFL + weighted ambiguous nFL reads
# norm_fl: count_fl / total number of FL reads
# norm_nfl: count_nfl / total number of FL + unique nFL reads
# norm_nfl_amb: count_nfl_amb / total number of all reads
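For what it's worth, the norm_* columns are already within-sample proportions of the counts above them; e.g. norm_fl is just each isoform's FL count over the total FL count. A toy illustration with made-up counts:

```python
def norm_fl(count_fl_per_isoform):
    """norm_fl as defined in the abundance header: count_fl / total FL reads."""
    total = sum(count_fl_per_isoform.values())
    return {pbid: c / total for pbid, c in count_fl_per_isoform.items()}

# Made-up FL counts for three isoforms:
fl_counts = {"PB.1.1": 30, "PB.1.2": 10, "PB.2.1": 60}
print(norm_fl(fl_counts)["PB.2.1"])  # -> 0.6
```

Length-aware measures such as TPM additionally require transcript lengths, which the abundance file itself does not carry.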
Thanks a lot.
Currently, collapse_isoforms_by_sam.py accepts a -i identity threshold, but that is the average identity over the whole alignment.
I want to add a per-exon identity threshold to filter out cases where low-quality / erroneous subregions cause bad alignments. These should be thrown out.
chr1 hg38 exon 247416157 247416193 100 + . ID=c14289.mrna1.exon1;Name=c14289;Parent=c14289.mrna1;Target=c14289 1 37 +
chr1 hg38 exon 247418053 247419077 99 + . ID=c14289.mrna1.exon2;Name=c14289;Parent=c14289.mrna1;Target=c14289 38 1062 +
chr1 hg38 exon 247423230 247423349 100 + . ID=c14289.mrna1.exon3;Name=c14289;Parent=c14289.mrna1;Target=c14289 1063 1182 +
chr1 hg38 exon 247423847 247425599 99 + . ID=c14289.mrna1.exon4;Name=c14289;Parent=c14289.mrna1;Target=c14289 1183 2935 +
chr1 hg38 exon 247434103 247434273 100 + . ID=c14289.mrna1.exon5;Name=c14289;Parent=c14289.mrna1;Target=c14289 2936 3106 +
chr1 hg38 exon 247442199 247442204 100 + . ID=c14289.mrna1.exon6;Name=c14289;Parent=c14289.mrna1;Target=c14289 3108 3113 +
chr1 hg38 exon 247442230 247442237 100 + . ID=c14289.mrna1.exon7;Name=c14289;Parent=c14289.mrna1;Target=c14289 3134 3141 +
chr1 hg38 exon 247444016 247444092 74 + . ID=c14289.mrna1.exon8;Name=c14289;Parent=c14289.mrna1;Target=c14289 3150 3210 +
chr1 hg38 exon 247444749 247444821 87 + . ID=c14289.mrna1.exon9;Name=c14289;Parent=c14289.mrna1;Target=c14289 3211 3275 +
chr1 hg38 exon 247448405 247448815 100 + . ID=c14289.mrna1.exon10;Name=c14289;Parent=c14289.mrna1;Target=c14289 3276 3686 +
In the example above, most exons have high identity (99-100%), but two exons have very low identity. This is a case where the collapse script should discard the alignment.
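A sketch of the proposed per-exon check, assuming tab-delimited GMAP GFF3 where column 6 (the score field) holds the per-exon percent identity, as in the example above:

```python
def low_identity_exons(gff3_lines, min_identity=90):
    """Yield (exon_id, identity) for exon rows whose score column
    (GMAP's per-exon percent identity) falls below min_identity."""
    for line in gff3_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "exon":
            continue
        identity = float(f[5])
        if identity < min_identity:
            yield f[8].split(";")[0], identity

example = [
    "chr1\thg38\texon\t247444016\t247444092\t74\t+\t.\tID=c14289.mrna1.exon8;Name=c14289;Parent=c14289.mrna1",
]
print(list(low_identity_exons(example)))  # -> [('ID=c14289.mrna1.exon8', 74.0)]
```

An alignment would then be discarded (or flagged) whenever this generator yields anything at the chosen threshold.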
--Liz
Hi Elizabeth
I have Nanopore direct RNA-seq reads, mapped to the reference using minimap2. I wonder if fusion_finder.py can handle these mapped ONT reads to find gene fusions.
Thanks
Edison
Running
scrub_sample_GFF_junctions.py --scrubbed_junction_file "${summary_prefix}.junction.bed" "${scrub_config}" "${summary_prefix}.junction_detail.txt" "${scrub_prefix}"
Results in the following error
Traceback (most recent call last):
File "/usr/local/bin/scrub_sample_GFF_junctions.py", line 5, in <module>
pkg_resources.run_script('cupcake==6.1', 'scrub_sample_GFF_junctions.py')
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 528, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1394, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/scrub_sample_GFF_junctions.py", line 279, in <module>
tree = read_scrubbed_junction_to_tree(output_filename)
File "/usr/local/lib/python2.7/dist-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/scrub_sample_GFF_junctions.py", line 112, in read_scrubbed_junction_to_tree
chrom, left, right, strand = line.strip().split('\t')
ValueError: need more than 1 value to unpack
Indeed, the BED file produced by the summary step has 6 columns. Here are the first few lines
track name=junctions description="summary" useScore=1
chr1 1013575 1013984 summary 3 +
chr1 1803416 1803876 summary 1 +
chr1 2391957 2395784 summary 3 +
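Until this is fixed upstream, one workaround is to reduce the 6-column BED from the summary step to the four fields (chrom, left, right, strand) that read_scrubbed_junction_to_tree unpacks, dropping the track header line. A hedged sketch (splitting on any whitespace so it tolerates both tab- and space-delimited input):

```python
def bed6_to_junction4(lines):
    """Convert BED6 junction lines to the 4-column chrom/left/right/strand
    layout; skips the 'track ...' header emitted by the summary step."""
    out = []
    for line in lines:
        if line.startswith("track"):
            continue
        f = line.split()  # chrom, start, end, name, score, strand
        if len(f) >= 6:
            out.append("\t".join([f[0], f[1], f[2], f[5]]))
    return out
```

Whether the scrub step was actually meant to consume a 4-column file (rather than the summary BED6) is worth confirming, so treat this as a stopgap.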
Improvements to chain_samples.py
Top priority:
allow 5' merge; that is, if one isoform in sample A is a 5' truncated form of another in sample B, allow them to be chained as equivalent and select the longer form. Prerequisite: both sample A and B must have already gone through collapse using the 5merge option, otherwise results can be incorrect.
output fasta/fastq files along with the current output of chained GFF, IDs txt, and counts.
Medium priority:
@Magdoll ,
If I have only FL reads, can I run subsampling to draw rarefaction curves after collapsing reads, without Iso-Seq clustering? I want to use tofu_wrap to cluster the FL reads, but I don't have the other files that are required. Is that OK?
File "/cupcake/tofu/compare_junctions.py", line 21, in compare_junctions
strand = r1.strand
AttributeError: GMAPSAMRecord instance has no attribute 'strand'
@Magdoll Hi, when I tried to install cupcake ToFu, many warnings were generated in the python setup.py build step.
More warnings saying "unused function" were generated in the python setup.py install step.
And it ended up like this:
I am using Mac OS X, with a new Anaconda environment with Python 2.7, created like this:
conda create -n cupcake anaconda python=2
conda install biopython
source activate cupcake
Please help...
During the sanity check, chain_samples.py uses GFF.collapseGFFReader to process the GFF. This function fails if there is a header:
File "/home/wrowell/anaconda2/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/chain_samples.py", line 19, in sample_sanity_check
ids2 = [r.seqid for r in GFF.collapseGFFReader(gff_filename)]
File "/home/wrowell/anaconda2/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/cupcake/io/GFF.py", line 386, in next
return self.read()
File "/home/wrowell/anaconda2/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/cupcake/io/GFF.py", line 543, in read
assert raw[2] == 'transcript'
IndexError: list index out of range
New PB gff files have the following header:
##gff-version 3
Artifacts are often present in Iso-Seq data due to library preparation (template switching, PCR chimeras). They are difficult to remove because they arise at the cDNA level (not at the sequencing or bioinformatics level) and introduce false positives for both novel isoform discovery and fusion genes.
Short reads and other sequencing platforms can serve as independent evidence to reduce these artifacts. A simple check is to align short-read data (using the --all option to allow multi-mapping) to the collapsed Iso-Seq data and have a script that removes/filters all Iso-Seq transcripts that are not supported continuously at each base with a minimum coverage T by short reads (note: short reads often have issues covering the ends, so the focus here is on the middle section).
This process should consist of several steps:
align the short reads to the collapsed, high-quality Iso-Seq data using fast short-read aligners (bowtie2, bwa). Care needs to be taken to allow for more errors and multi-mapping, since the Iso-Seq data will contain many isoforms that may share the same exons and we want them all to map.
create a coverage map per Iso-Seq transcript (what is the best format to show this?)
apply user-defined filtering criteria (how much to ignore at the ends, minimum base coverage at what interval, etc.)
output the filtered/reduced set of Iso-Seq data and a summary report
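The core of the filtering step above could be as simple as a per-base check over the middle section. A sketch, where end_ignore and min_cov stand for the user-defined criteria and the per-base coverage vector would come from the short-read alignment (e.g. samtools depth):

```python
def supported_by_short_reads(coverage, min_cov, end_ignore=100):
    """True if every base of the middle section (ignoring `end_ignore`
    bases at each end) is covered by at least `min_cov` short reads.
    Transcripts shorter than 2 * end_ignore are rejected outright."""
    middle = coverage[end_ignore:len(coverage) - end_ignore]
    return bool(middle) and min(middle) >= min_cov

cov = [0] * 100 + [12] * 800 + [1] * 100  # toy per-base coverage, 1 kb transcript
print(supported_by_short_reads(cov, min_cov=3))  # -> True
```

A real implementation would also need the multi-mapping-aware alignment described above, since uniquely-mapped coverage would unfairly penalize isoforms that share exons.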
Running summarize_sample_GFF_junctions.py without GENOME_FILENAME results in the following error:
Traceback (most recent call last):
File "/usr/local/bin/summarize_sample_GFF_junctions.py", line 5, in <module>
pkg_resources.run_script('cupcake==6.1', 'summarize_sample_GFF_junctions.py')
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 528, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1394, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/summarize_sample_GFF_junctions.py", line 179, in <module>
print >> sys.stder, "No genome file given. Ignore."
AttributeError: 'module' object has no attribute 'stder'
Looks like a basic typo: sys.stder should be sys.stderr.
Hi,
First, I used minimap2 to get hq_isoforms.sam, then I ran fusion_finder.py with:
fusion_finder.py \
--input $fastq --fq \
-s $sam \
--dun-merge-5-shorter \
-o $out_fn \
--cluster_report_csv $report
but I got this error:
combo seen: (58, 59)
combo seen: (9, 10)
combo seen: (29, 30)
combo seen: (9, 10)
combo seen: (4, 8)
combo seen: (18, 19)
Traceback (most recent call last):
File "/home/chenlingxi/mnt/chenlingxi/lib/tools/anaconda2/bin/fusion_finder.py", line 395, in <module>
min_dist_between_loci=args.min_dist_between_loci)
File "/home/chenlingxi/mnt/chenlingxi/lib/tools/anaconda2/bin/fusion_finder.py", line 347, in fusion_main
assert pbid1.split('.')[1] == pbid2.split('.')[1]
AssertionError
How can I fix this? Thanks.
Hi, @Magdoll
I ran into an error while chaining fusion transcripts:
"""
ERROR! more than one possible candidate in match_fusion_record! DEBUG.
MATCHED: ['PBfusion.114', 'PBfusion.210']
"""
Information on the two transcripts involved is given below.
"""
1 PacBio transcript 161285514 161288653 . - . gene_id "PBfusion.114"; transcript_id "PBfusion.114.1";
1 PacBio exon 161285514 161288653 . - . gene_id "PBfusion.114"; transcript_id "PBfusion.114.1";
2 PacBio transcript 7852284 7853099 . - . gene_id "PBfusion.114"; transcript_id "PBfusion.114.2";
2 PacBio exon 7852284 7853099 . - . gene_id "PBfusion.114"; transcript_id "PBfusion.114.2";
1 PacBio transcript 161285514 161288251 . - . gene_id "PBfusion.210"; transcript_id "PBfusion.210.1";
1 PacBio exon 161285514 161288251 . - . gene_id "PBfusion.210"; transcript_id "PBfusion.210.1";
2 PacBio transcript 7852289 7853099 . - . gene_id "PBfusion.210"; transcript_id "PBfusion.210.2";
2 PacBio exon 7852289 7853099 . - . gene_id "PBfusion.210"; transcript_id "PBfusion.210.2";
"""
PBfusion.114 and PBfusion.210 are two different fusion transcripts, while PBfusion.114.2 and PBfusion.210.2 are identical. Is this the cause of the issue? And how can we solve it? Thanks.
I am in the process of running through the cupcake/ToFu pipeline (following Cogent) to remove redundant transcripts (in the absence of a genome), and I've gotten hung up at the get_abundance_post_collapse.py step. The command I used is as follows:
get_abundance_post_collapse.py hq_isoforms.fasta.no5merge.collapsed cluster_report.csv
My job fails with the following output:
Traceback (most recent call last):
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/bin/get_abundance_post_collapse.py", line 4, in <module>
__import__('pkg_resources').run_script('cupcake==5.9', 'get_abundance_post_collapse.py')
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/lib/python2.7/site-packages/pkg_resources/__init__.py", line 661, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1441, in run_script
exec(code, namespace, namespace)
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/lib/python2.7/site-packages/cupcake-5.9-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 319, in <module>
get_abundance_post_collapse(args.collapse_prefix, args.cluster_report, args.collapse_prefix)
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/lib/python2.7/site-packages/cupcake-5.9-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 306, in get_abundance_post_collapse
output_read_count_IsoSeq_csv(cid_info, cluster_report_csv, output_prefix + '.read_stat.txt')
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/lib/python2.7/site-packages/cupcake-5.9-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 177, in output_read_count_IsoSeq_csv
id=x, len=get_roi_len(x), is_fl='Y', stat=stat, pbid=pbid))
File "/scg/apps/software/anaconda/3_5.0.1_20180125/envs/cdna_cupcake_20180921/lib/python2.7/site-packages/cupcake-5.9-py2.7-linux-x86_64.egg/EGG-INFO/scripts/get_abundance_post_collapse.py", line 50, in get_roi_len
s, e, junk = seqid.split('/')[2].split('_')
ValueError: need more than 1 value to unpack
I'm guessing that this is likely something very simple that I've done incorrectly, but I seem to have exhausted the extent of my troubleshooting ability (I think). Please let me know if you've run into this before!
Thanks,
Ryan
@Magdoll
I tried to install the latest cupcake ToFu using conda, but it produced an error (shown in the screenshot below).
Running:
chain_samples.py CONFIG_FILE count_nfl
I am getting results that don't seem to be in line with the IGV image of the same regions. In particular, I'm seeing ~1000 to ~10 count differentials between samples for some regions via chain_samples.py, when the IGV image shows about 160 reads in both samples (almost all full length). Furthermore, while I would assume that the count_fl results would be based on the number of input reads, I only have about 14,000 input reads, yet the chain_samples.py columns sum to around 140,000.
These results seem unexpected to me. Is this a bug, or am I misunderstanding something?
Hi @Magdoll ,
I used the scrub_sample_GFF_junctions.py script from the latest version of Cupcake, but an error occurred. Could you help me solve it? I'd appreciate it. I've attached a screenshot of the error below.
Hi:
I used consensus.fasta to generate GFF files. This is the command line:
gmap -D /disk/luping/bam/iso-bam/all-cluster/evaluate_tofu2 -d FG -f samse --no-chimeras -n 1 -t 13 -z sense_force consensus.fasta >clearn-consensus.fasta.sam
sort -k 3,3 -k 4,4n clearn-consensus.fasta.sam >clearn-consensus.fasta.sort.sam
collapse_isoforms_by_sam.py --input consensus.fasta -s clearn-consensus.fasta.sort.sam -o no-fusion-check2-all
Then I got some very long transcripts, like this one: PB.12987.1
4 PacBio transcript 7794173 8025705 . - . gene_id "PB.12987"; transcript_id "PB.12987.1";
4 PacBio exon 7794173 7794629 . - . gene_id "PB.12987"; transcript_id "PB.12987.1";
4 PacBio exon 8025556 8025705 . - . gene_id "PB.12987"; transcript_id "PB.12987.1";
4 PacBio transcript 7794174 7794835 . - . gene_id "PB.12987"; transcript_id "PB.12987.2";
4 PacBio exon 7794174 7794835 . - . gene_id "PB.12987"; transcript_id "PB.12987.2";
4 PacBio transcript 7794986 7800216 . - . gene_id "PB.12987"; transcript_id "PB.12987.3";
4 PacBio exon 7794986 7795627 . - . gene_id "PB.12987"; transcript_id "PB.12987.3";
4 PacBio exon 7795677 7800216 . - . gene_id "PB.12987"; transcript_id "PB.12987.3";
4 PacBio transcript 7794988 7800004 . - . gene_id "PB.12987"; transcript_id "PB.12987.4";
4 PacBio exon 7794988 7800004 . - . gene_id "PB.12987"; transcript_id "PB.12987.4";
The long transcript covers many genes.
How can I remove such transcripts?
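One stopgap is to drop transcripts whose genomic span exceeds a threshold from the collapsed GFF. A minimal sketch (the 200 kb threshold is a placeholder to be tuned per genome, and the matching entries in the group and abundance files would need removing too):

```python
import re

def long_transcript_ids(gff_lines, max_span=200000):
    """Collect transcript_ids whose 'transcript' feature spans more than
    max_span bp of genome (threshold is a placeholder)."""
    bad = set()
    for line in gff_lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 9 or fields[2] != 'transcript':
            continue
        span = int(fields[4]) - int(fields[3]) + 1
        if span > max_span:
            match = re.search(r'transcript_id "([^"]+)"', fields[8])
            if match:
                bad.add(match.group(1))
    return bad
```

Lines whose transcript_id lands in the returned set (and their exon lines) can then be excluded when writing the filtered GFF.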
Hi,
I want to create a bioconda package for cDNA_Cupcake for easier distribution among researchers. However, I cannot do that, since the package needs a license that allows redistribution.
Do you think it is possible for you to add a license for cDNA_Cupcake or ToFu2?
I have the same request for "Cogent" as well.
Thank you,
Natasha
Hi,
Firstly, thanks for designing this great software. I'm now working on a project in which the script 'collapse_isoforms_by_sam.py' is used to reduce redundancy in the GMAP result.
The first time I ran the script, I used the default minimum coverage and minimum identity parameters. The final result shows 288,773 collapsed isoforms and 13,354 genes.
I thought the number of collapsed isoforms was small due to the high filter rate (42.75%), so I changed the coverage and identity values to 0.85 and 0.9 respectively. This time I got 476,088 collapsed isoforms with a filter rate of 16.86%. However, based on the collapsed.group.txt file, I now have only 1,573 genes.
In other words, I got more isoforms but fewer genes after changing the two parameters.
I'm still a beginner in bioinformatics, and it's a little hard for me to work out the reason from the script, so I'm writing this issue to ask. My guess is that after changing the parameters I got more isoforms, and on the other hand the way gene groups are divided changed as well.
I'm looking forward to your answer. Thank you very much.
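Numbers like these can be sanity-checked by recounting isoforms and gene loci directly from collapsed.group.txt, since each line is one isoform and the PB.<gene>.<isoform> prefix encodes the locus. A minimal sketch:

```python
def count_isoforms_and_genes(group_lines):
    """Recount isoforms and gene loci from a collapsed.group.txt, where each
    line is '<PB.gene.isoform>' followed by its comma-separated member IDs."""
    genes = set()
    n_isoforms = 0
    for line in group_lines:
        pbid = line.split()[0]                     # e.g. "PB.12987.1"
        genes.add('.'.join(pbid.split('.')[:2]))   # gene locus, e.g. "PB.12987"
        n_isoforms += 1
    return n_isoforms, len(genes)
```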
I've mapped the HQ PacBio sequences to the genome using GMAP (latest version), sorted the SAM file using samtools, and then run the collapse_isoforms_by_sam.py script in the anaconda environment (operating system Ubuntu 14.04 Bio-Linux). The script runs for a few seconds and then crashes with the following error message:
...
merged 1 down to 1 transcripts
alt. junction found at 15
alt. junction found at 285
alt. junction found at 577
alt. junction found at 926
alt. junction found at 1333
alt. junction found at 2473
alt. junction found at 3396
alt. junction found at 3611
alt. junction found at 4065
alt. junction found at 4244
alt. junction found at 4988
alt. junction found at 5216
alt. junction found at 5720
alt. junction found at 5970
alt. junction found at 6806
alt. junction found at 7161
merged 2 down to 1 transcripts
Traceback (most recent call last):
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/bin/collapse_isoforms_by_sam.py", line 4, in <module>
__import__('pkg_resources').run_script('cupcake==3.1', 'collapse_isoforms_by_sam.py')
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 744, in run_script
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 1499, in run_script
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 244, in <module>
main(args)
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 190, in main
for recs in iter:
File "/home/adminuser/Documents/MarkoP/potato_pacbio/anaconda2/envs/anaCogent/lib/python2.7/site-packages/cupcake-3.1-py2.7-linux-x86_64.egg/cupcake/tofu/branch/branch_simple2.py", line 81, in iter_gmap_sam
ignored_fout.write("{0}\tCoverage {1:.3f} too low.\n".format(r.qID, r.qCoverage))
ValueError: Unknown format code 'f' for object of type 'str'
(anaCogent) adminuser@biolx[S1_allsize]
I've tried to sort the sam file with picard-tools SamSort but get the same error.
I've uploaded the files to dropbox so you can try to reproduce the error (see console_commands.txt for commands used):
https://www.dropbox.com/sh/6n0q61rq62324hi/AAAtdGIH6O3Wyt3FlzYETUVSa?dl=0
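The exception itself is a formatting detail: `{1:.3f}` requires a numeric value, and r.qCoverage comes through as a string (or None) when the SAM record lacks the coverage metric GMAP normally supplies. A sketch of the failure and a defensive cast, assuming string input (this is not the actual Cupcake code):

```python
def coverage_too_low_line(qid, qcov):
    # "{:.3f}" only accepts numbers; casting first avoids the crash when
    # qCoverage arrives as a string (sketch only, not the Cupcake code)
    return "{0}\tCoverage {1:.3f} too low.\n".format(qid, float(qcov))

# The failure mode seen in the traceback:
try:
    "{0:.3f}".format("0.123")
except ValueError as err:
    print(err)  # Unknown format code 'f' for object of type 'str'

print(coverage_too_low_line("myread", "0.123"))
```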
I combined two Iso-Seq cells and got an error about duplicate IDs, but I didn't find any duplicates in my polished.hq.fasta file. Where exactly is it finding the duplicate ID names?
Thanks.
Traceback (most recent call last):
File "/scinet01/gov/usda/ars/scinet/project/tgbs_analysis/anaconda3/envs/TofCogent/bin/collapse_isoforms_by_sam.py", line 4, in <module>
__import__('pkg_resources').run_script('cupcake==6.1', 'collapse_isoforms_by_sam.py')
File "/scinet01/gov/usda/ars/scinet/project/tgbs_analysis/anaconda3/envs/TofCogent/lib/python2.7/site-packages/pkg_resources/__init__.py", line 664, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/scinet01/gov/usda/ars/scinet/project/tgbs_analysis/anaconda3/envs/TofCogent/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1444, in run_script
exec(code, namespace, namespace)
File "/scinet01/gov/usda/ars/scinet/project/tgbs_analysis/anaconda3/envs/TofCogent/lib/python2.7/site-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 248, in <module>
main(args)
File "/scinet01/gov/usda/ars/scinet/project/tgbs_analysis/anaconda3/envs/TofCogent/lib/python2.7/site-packages/cupcake-6.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 177, in main
check_ids_unique(args.input, is_fq=args.fq)
File "/scinet01/gov/usda/ars/scinet/project/tgbs_analysis/anaconda3/envs/TofCogent/lib/python2.7/site-packages/cupcake-6.1-py2.7-linux-x86_64.egg/cupcake/tofu/utils.py", line 13, in check_ids_unique
raise Exception, "Duplicate id {0} detected. Abort!".format(r.id)
Exception: Duplicate id detected. Abort
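Note that uniqueness is usually judged on the ID token alone (the header text before the first whitespace), so headers that look different can still collide. A quick stdlib sketch for listing the offending IDs (hypothetical helper, not part of Cupcake):

```python
from collections import Counter

def duplicated_fasta_ids(handle):
    """Map each duplicated sequence ID to its count. The ID is the header
    token before the first whitespace, which is what uniqueness checks
    typically compare."""
    counts = Counter(line[1:].split()[0]
                     for line in handle if line.startswith('>'))
    return {sid: n for sid, n in counts.items() if n > 1}
```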
Hi,
I am trying to use chain_samples.py, but I couldn't find the count file (.count.txt) referenced by COUNT_FILENAME in the config file; no '.count.txt' is generated by collapse_isoforms_by_sam.py. I tried using the '.collapsed.abundance.txt' produced by get_abundance_post_collapse.py instead, but got an error: "NameError: global name 'count_header' is not defined". I would appreciate it if you could let me know how to set COUNT_FILENAME in the config file so that chain_samples.py runs. Thank you!
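For reference, a chain_samples.py config lists one SAMPLE line per sample plus the per-sample filenames to chain; a sketch with placeholder sample names and paths (COUNT_FILENAME should name whatever count/abundance file actually exists in each sample directory):

```
SAMPLE=sample1;/path/to/sample1/
SAMPLE=sample2;/path/to/sample2/

GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
```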
Taehee
Hi, @Magdoll ,
I have tried to install Cupcake, but when I run python setup.py install, it reports these errors:
So what can I do now to deal with this error? Thanks for any suggestions!
Hi,
I got this error message after running "collapse_isoforms_by_sam.py".
Could you please let me know what needs to be done here?
Thanks.
Traceback (most recent call last):
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/bin/collapse_isoforms_by_sam.py", line 4, in <module>
__import__('pkg_resources').run_script('cupcake==5.8', 'collapse_isoforms_by_sam.py')
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 658, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1438, in run_script
exec(code, namespace, namespace)
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/cupcake-5.8-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 245, in <module>
main(args)
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/cupcake-5.8-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 204, in main
collapse_fuzzy_junctions(f_good.name, f_txt.name, args.allow_extra_5exon, internal_fuzzy_max_dist=args.max_fuzzy_junction)
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/cupcake-5.8-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 149, in collapse_fuzzy_junctions
_size = get_fl_from_id(group_info[pbid])
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/cupcake-5.8-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 92, in get_fl_from_id
return sum(int(_id.split('/')[1].split('p')[0][1:]) for _id in members)
File "/home/joonlee3/bin/anaconda2/envs/anaCogent5.2/lib/python2.7/site-packages/cupcake-5.8-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 92, in <genexpr>
return sum(int(_id.split('/')[1].split('p')[0][1:]) for _id in members)
ValueError: invalid literal for int() with base 10: '714.1'
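The parse on line 92 assumes old cluster-style member IDs of the form <cluster>/f<FL>p<nFL>/<length>; the '714.1' in the error suggests the input FASTA carries IDs with a non-integer FL field. A sketch of the parse logic (the failing ID below is hypothetical, constructed to reproduce the message):

```python
def get_fl_from_id(members):
    # Expects old cluster-style member IDs: <cluster>/f<FL>p<nFL>/<length>,
    # e.g. "c1234/f3p0/2073" -> FL count 3
    return sum(int(_id.split('/')[1].split('p')[0][1:]) for _id in members)

print(get_fl_from_id(["c1234/f3p0/2073", "c99/f12p1/1500"]))  # 15

# A hypothetical ID with a non-integer FL field reproduces the crash:
try:
    get_fl_from_id(["c7607/f714.1p0/900"])
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: '714.1'
```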
Hi!
Before using Cogent and Cupcake to collapse redundant transcripts in the absence of a genome, what kind of data should I use?
After SMRT Link, I have the polished consensus FASTA, polished HQ FASTA, and polished LQ FASTA. Which FASTA file should be used for this collapsing job?
Thanks for any suggestion!
The program ran through a few thousand transcripts before getting to this:
merged 1 down to 1 transcripts
merged 1 down to 1 transcripts
Traceback (most recent call last):
File "/cluster/home/ifiddes/anaconda2/bin/collapse_isoforms_by_sam.py", line 4, in <module>
__import__('pkg_resources').run_script('cupcake==2.1', 'collapse_isoforms_by_sam.py')
File "/cluster/home/ifiddes/anaconda2/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 744, in run_script
File "/cluster/home/ifiddes/anaconda2/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 1499, in run_script
File "/cluster/home/ifiddes/anaconda2/lib/python2.7/site-packages/cupcake-2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 242, in <module>
main(args)
File "/cluster/home/ifiddes/anaconda2/lib/python2.7/site-packages/cupcake-2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 190, in main
for recs in iter:
File "/cluster/home/ifiddes/anaconda2/lib/python2.7/site-packages/cupcake-2.1-py2.7-linux-x86_64.egg/cupcake/tofu/branch/branch_simple2.py", line 77, in iter_gmap_sam
ignored_fout.write("{0}\tCoverage {1:.3f} too low.\n".format(r.qID, r.qCoverage))
ValueError: Unknown format code 'f' for object of type 'str'
I saw this much earlier when I tried to use a STAR-aligned SAM file, but after looking through the code it seemed that you rely on the identity and coverage metrics provided by GMAP, so I switched to a GMAP SAM. I got through one dataset entirely, but then the next one hit this problem.
Dear,
In most studies, saturation analysis has been conducted at the gene level.
However, I think saturation at the isoform level deserves more attention, doesn't it?
In my data set, I have 100,000 ROIs, 50,000 FLNC reads, and 18,000 HQ-polished isoforms.
I found that most HQ-polished isoforms are constructed from a single FLNC read, and if I draw a curve showing the relationship between FLNC reads and HQ-polished isoforms, I only get a completely unsaturated result.
Does this mean my data is completely unsaturated?
Or am I using the data in the wrong way?
Thanks a lot in advance!
@Magdoll
I've got the rarefaction file by running 'subsample.py' as you suggested, but I don't know what to do next to plot the rarefaction curve. I'm trying to use iNEXT, but the input files seem to be different. Could you give me some hints? Thank you so much.
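If the iNEXT input format doesn't line up, a rarefaction curve can also be computed directly from a read-to-isoform assignment (e.g. parsed out of cluster_report.csv): repeatedly subsample N reads and count the distinct isoforms they hit. A minimal sketch, with the input mapping assumed rather than taken from any particular Cupcake output:

```python
import random

def rarefaction_curve(read_to_isoform, depths, n_iter=10, seed=0):
    """For each depth, mean number of distinct isoforms observed when
    sampling that many reads without replacement. Input is a dict
    mapping read ID -> isoform ID."""
    rng = random.Random(seed)
    reads = list(read_to_isoform)
    curve = []
    for depth in depths:
        depth = min(depth, len(reads))
        total = 0
        for _ in range(n_iter):
            sample = rng.sample(reads, depth)
            total += len({read_to_isoform[r] for r in sample})
        curve.append(total / float(n_iter))
    return curve
```

The resulting (depth, mean isoforms) pairs can then be plotted with any plotting tool; a curve that is still rising steeply at full depth indicates unsaturated sampling.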
Hi @Magdoll,
First time user of Cupcake here.
I have two separate questions -
I have Iso-Seq CCS runs from two separate SMRT cells for a single tissue. Should I combine the HQ_isoforms.fasta of both runs and then run the collapse_isoforms scripts and the remaining downstream steps, or should I collapse isoforms individually, get count information, and then do the chaining?
I have a reference genome and have mapped the HQ isoforms to it, but I would also like to use the annotation file (GFF) to get the gene names. Is there a way to do this, or should I just use bedtools to intersect the GFF file after collapsing isoforms?
Many thanks!
What's wrong with this?
gmap -D /disk/luping/bam/iso-bam/all-cluster/evaluate_tofu2 -d FG -f samse -n 0 -t 13 -z sense_force consensus.fasta >consensus.fasta.sam
sort -k 3,3 -k 4,4n consensus.fasta.sam >consensus.fasta.sort.sam
[luping@centos pb-con-check]$ fusion_finder.py --input consensus.fasta -s consensus.fasta.sort.sam -o ./fusion
Traceback (most recent call last):
File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/bin/fusion_finder.py", line 4, in <module>
__import__('pkg_resources').run_script('cupcake==3.0', 'fusion_finder.py')
File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1494, in run_script
exec(code, namespace, namespace)
File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/cupcake-3.0-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fusion_finder.py", line 393, in <module>
min_dist_between_loci=args.min_dist_between_loci)
File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/cupcake-3.0-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fusion_finder.py", line 301, in fusion_main
o = merge_fusion_exons(v, max_fusion_point_dist=100, max_exon_end_dist=0, allow_extra_5_exons=allow_extra_5_exons)
File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/cupcake-3.0-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fusion_finder.py", line 207, in merge_fusion_exons
if all(is_fusion_compatible(r1, r2, max_fusion_point_dist, max_exon_end_dist, allow_extra_5_exons) for r2 in r2s):
File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/cupcake-3.0-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fusion_finder.py", line 207, in <genexpr>
if all(is_fusion_compatible(r1, r2, max_fusion_point_dist, max_exon_end_dist, allow_extra_5_exons) for r2 in r2s):
File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/cupcake-3.0-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fusion_finder.py", line 153, in is_fusion_compatible
type = compare_junctions(r1, r2)
File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/cupcake-3.0-py2.7-linux-x86_64.egg/cupcake/tofu/compare_junctions.py", line 21, in compare_junctions
strand = r1.strand
AttributeError: GMAPSAMRecord instance has no attribute 'strand'
Hi,
I have already updated to v5.8.
When I run collapse_isoforms_by_sam.py, this error occurred:
$tail runtime.log
alt. junction found at 514
alt. junction found at 804
Traceback (most recent call last):
File "/home/chenlingxi/mnt/chenlingxi/lib/tools/anaconda2/bin/collapse_isoforms_by_sam.py", line 245, in <module>
main(args)
File "/home/chenlingxi/mnt/chenlingxi/lib/tools/anaconda2/bin/collapse_isoforms_by_sam.py", line 190, in main
for recs in iter: # recs is {'+': list of list of records, '-': list of list of records}
File "/home/chenlingxi/mnt/chenlingxi/lib/tools/anaconda2/lib/python2.7/site-packages/cupcake/tofu/branch/branch_simple2.py", line 97, in iter_gmap_sam
ignored_fout.write("{0}\tCoverage {1:.3f} too low.\n".format(r.qID, r.qCoverage))
ValueError: Unknown format code 'f' for object of type 'str'
There are already some outputs:
$ls
OC001T.hq_isoforms.collapsed.gff OC001T.hq_isoforms.collapsed.group.txt OC001T.hq_isoforms.ignored_ids.txt runtime.log
How can I fix this? Thanks.