mortazavilab / swan_vis
A Python library to visualize and analyze long-read transcriptomes
Home Page: https://freese.gitbook.io/swan/
License: MIT License
Hello,
I want to ask how to keep all of a gene's transcripts as a light background when I plot a single transcript. I tried plotting the gene first and then the transcript, but the figure comes out as a mix of the two rather than showing the gene as a light background like in the tutorial. If I only plot the transcript after reading swan.p in, I cannot see the other transcripts in my plot. Thanks
Case:
Minimal working (or non-working) example:
sg = swan.SwanGraph('swan.p')
sg.plot_graph('SARS', indicate_novel=True, display=True, prefix='figures/SARS')
sg.plot_transcript_path('ENCODEHT000445064', indicate_novel=True, display=True, prefix='figures/SARS')
Hello Fairlie,
Thank you for the Swan package - love the visualisations! I just had a question regarding the novel splice junction visualisation, as the warning message "Novelty info not found for x. Transcripts without novelty information will be labelled 'Undefined'" appears every time I add a dataset.
I am using the mouse reference genome gtf as annotation gtf and my Iso-Seq gtf output from SQANTI as the dataset input (bypassing TALON).
Is the novelty aspect documented in the gtf or is there a specific file I need to include? Any guidance on this will be greatly appreciated!
Thank you,
Szi Kay
It seems that some transcripts present in 'all_talon_abundance_filtered.tsv' and 'all_talon_observedOnly.gtf' are not in the SwanGraph object when I load them. How could this be? Is there additional filtering that removes transcripts while I add them?
More specifically, I have 6,070 of 12,812 novel transcripts remaining after I load them into the Swan object.
I'm using the standard sg.add_transcriptome(talon_db, pass_list=pass_list)
and sg.add_abundance(ab_file)
as you suggest in the tutorial
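One way to pin down which transcripts were dropped is to diff the ID sets between the abundance file and the graph. A hedged sketch with mock data: real use would read the tsv and take the SwanGraph's transcript IDs from sg.t_df.index; 'annot_transcript_id' is the ID column TALON abundance files typically use, so verify the names against your own files.

```python
import pandas as pd

# Mock stand-ins: 'ab' mimics the TALON abundance table, 't_df' mimics
# the SwanGraph transcript table indexed by transcript ID.
ab = pd.DataFrame({'annot_transcript_id': ['ENST1', 'NOVEL1', 'NOVEL2']})
t_df = pd.DataFrame({'gname': ['G1', 'G1']}, index=['ENST1', 'NOVEL1'])

# Transcripts in the abundance file that never made it into the graph:
missing = sorted(set(ab['annot_transcript_id']) - set(t_df.index))
print(missing)  # ['NOVEL2']
```

Inspecting a few of the missing IDs (are they all novel? all single-exon? all on scaffolds absent from the annotation?) usually reveals which filter removed them.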
Hi,
I was wondering if there is tabular information on the relationship between splice donor/acceptor sites and the exons of genes. It would be very helpful when the gene is very long. Thanks
Hi
I've been an extremely satisfied user of swan_vis until now. The last time I used it was in mid-January, about six months ago. I tried to use it again and it seems to be broken.
I tried to re-install with pip and the installation failed. I used 'pip3' as another user suggested in a different issue, and that ran to completion, but when trying to import the package in Python it couldn't be found.
I then cloned the repository and installed it with 'pip3 install .'. That worked and I was able to load the package into Python, but when I ran 'sg = swan.SwanGraph()' I got "AttributeError: module 'swan_vis' has no attribute 'SwanGraph'". What gives?
ETA: Python 3.7.6 in a conda env
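When a module imports but lacks an attribute it should have, a common cause is a stale install or a local file/directory shadowing the package. A quick diagnostic sketch, demonstrated with a stdlib module (swap json for swan_vis in practice):

```python
# Check which file Python actually imported and which interpreter is
# running -- pip3 must install into the same interpreter you launch.
import json
import sys

print(json.__file__)           # path of the module really loaded
print(sys.executable)          # interpreter path; compare with `pip3 -V`
print(hasattr(json, 'loads'))  # the attribute check itself -> True
```

If the printed module path points at a leftover directory (or your own script named like the package) rather than site-packages, that explains the missing attribute.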
I have generated the GTF files from the TALON database. I am encountering the following error. Can you please have a look at it?
Novelty types between datasets conflict. Strongly consider using input from the same data source to reconcile these. Conflicting isoforms will be labelled "Ambiguous".
Traceback (most recent call last):
File "swan.p", line 21, in <module>
count_cols='vRG_370')
File "/u/home/a/ashokpat/.local/lib/python3.7/site-packages/swan_vis/swangraph.py", line 187, in add_dataset
self.update_ids()
File "/u/home/a/ashokpat/.local/lib/python3.7/site-packages/swan_vis/graph.py", line 120, in update_ids
self.dfs_to_dicts()
File "/u/home/a/ashokpat/.local/lib/python3.7/site-packages/swan_vis/graph.py", line 231, in dfs_to_dicts
self.t_df = self.t_df.to_dict('index')
File "/u/home/a/ashokpat/.local/lib/python3.7/site-packages/pandas/core/frame.py", line 1391, in to_dict
raise ValueError("DataFrame index must be unique for orient='index'.")
ValueError: DataFrame index must be unique for orient='index'.
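That ValueError comes straight from pandas: to_dict(orient='index') refuses a non-unique index, which is what Swan's dfs_to_dicts() hands it here. A minimal reproduction with hypothetical transcript IDs, plus one way to spot and drop the duplicates before they reach Swan:

```python
import pandas as pd

# Hypothetical transcript table with a duplicated ID 'tx1'.
t_df = pd.DataFrame({'gene': ['A', 'A', 'B']},
                    index=pd.Index(['tx1', 'tx1', 'tx2'], name='tid'))

try:
    t_df.to_dict('index')
except ValueError as e:
    print(e)  # DataFrame index must be unique for orient='index'.

# Find the offending IDs, then keep only the first occurrence of each:
print(t_df.index[t_df.index.duplicated()].tolist())  # ['tx1']
deduped = t_df[~t_df.index.duplicated(keep='first')]
print(sorted(deduped.to_dict('index')))  # ['tx1', 'tx2']
```

In practice the duplicate usually means the same transcript ID appears twice in one of the input GTFs or count tables, so checking those files for repeated IDs is the first step.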
Hi,
I've been trying to run Swan as a script on our HPC, and es_df = sg.find_es_genes(verbose=True)
and ir_df = sg.find_ir_genes(verbose=True)
take a very long time to run (48+ hours), as there are 40k+ edges to test. I've tried going down the multiprocessing
route but am not sure that's the best way.
I'm wondering if you have any advice on speeding this up, especially in an HPC environment?
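While that doesn't speed up the algorithm itself, caching the results to disk at least guarantees the 48-hour step only ever runs once per graph, which matters when HPC jobs get preempted. A sketch using pickle with a stand-in computation (the real call would be the sg.find_es_genes line shown in the comment):

```python
import os
import pickle
import tempfile

def cached(path, compute):
    """Return the pickled result at `path`, computing and saving it once."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

path = os.path.join(tempfile.mkdtemp(), 'es.pkl')

# Real use: es_df = cached(path, lambda: sg.find_es_genes(verbose=True))
first = cached(path, lambda: {'n_es_genes': 3})    # computes and saves
second = cached(path, lambda: {'n_es_genes': 99})  # loaded from disk instead
print(second)  # {'n_es_genes': 3}
```

Point `path` at persistent storage on the cluster rather than a temp dir so reruns across jobs hit the cache.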
Hey Fairlie, me again...
I would like to measure DIE between multiple conditions. At the moment only two conditions are allowed for the obs_conditions argument. In theory, the method you use for DIE from Joglekar et al. should allow comparison of more than two conditions. Right now I can compare each of my four conditions against the control and look at the overlap, but that is noisy and a worse approach than comparing all conditions to each other simultaneously.
Hi! I'm trying this tool out, and I found some problems. I hope you can help solve them!
In /master/scripts/driver.py, lines 14-28:
sg.add_dataset('HepG2_1', hep_1_gtf,
counts_file=ab_file,
count_cols='hepg2_1')
While running it, this error came out:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 sg.add_dataset('HepG2_1', hep_1_gtf,
2 counts_file=ab_file,
3 count_cols='hepg2_1')
TypeError: add_dataset() got an unexpected keyword argument 'counts_file'
I looked at the source code and found the API has changed. What should I do now?
Another problem with my own data:
/home/lengliang/lzy/mambaforge/lib/python3.9/site-packages/anndata/_core/anndata.py:798: UserWarning:
AnnData expects .obs.index to contain strings, but got values like:
[0, 1]
Inferred to be: integer
value_idx = self._prep_dim_index(value.index, attr)
... (code omitted)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 test = sg.de_gene_test(obs_col, obs_conditions=obs_conditions)
... (code omitted)
File ~/lzy/mambaforge/lib/python3.9/site-packages/dask/array/core.py:3142, in auto_chunks(chunks, shape, limit, dtype, previous_chunks)
3138 raise ValueError(
3139 "auto-chunking with dtype.itemsize == 0 is not supported, please pass in `chunks` explicitly"
3140 )
3141 print(limit , dtype.itemsize , largest_block)
-> 3142 size = (limit / dtype.itemsize / largest_block) ** (1 / len(autos))
3143 small = [i for i in autos if shape[i] < size]
3144 if small:
ZeroDivisionError: float division by zero
I found a similar warning in https://github.com/mortazavilab/swan_paper/blob/master/swan/swan_driver.ipynb, but there it is only a warning, whereas what I get is an error that stops execution. What should I do?
/Users/fairliereese/miniconda3/lib/python3.7/site-packages/dask/array/core.py:2622: RuntimeWarning: divide by zero encountered in true_divide
size = (limit / dtype.itemsize / largest_block) ** (1 / len(autos))
I noticed that in my diff expression files, there are no negative values for log2fc even when I do not impose filters. I use exactly the same code that you provide in your tutorial, except I do not filter, e.g. de_genes = sg.get_de_genes(obs_col, obs_conditions=obs_conditions)
. Not sure what's going wrong...
Also, another maybe more nuanced issue: with long-read seq there are a lot of cases where, for example, the total number of transcripts/reads for a particular gene is very low. Swan still runs diff expression tests on these, resulting in a lot of log2fc values that are huge and identical. In the tutorial on the swan_vis website you can see this even happens to you, see the +/-297.776029 log2fc numbers in your sample tables. As these are probably not so interesting, is there a reason they are still included in the Swan output?
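Until there is a built-in filter, dropping those degenerate rows post hoc is straightforward. A hedged sketch on a mock DE table; the column names ('gid', 'log2fc', 'mean_counts') and the threshold are illustrative, so match them to the columns your de_gene_test output actually contains:

```python
import pandas as pd

# Mock DE results: g2/g3 show the huge, identical log2fc values that
# near-zero counts produce.
de = pd.DataFrame({'gid': ['g1', 'g2', 'g3'],
                   'log2fc': [1.5, 297.776029, -297.776029],
                   'mean_counts': [120.0, 0.3, 0.2]})

# Keep only genes with enough expression to make the fold change meaningful:
min_counts = 10
filtered = de[de['mean_counts'] >= min_counts]
print(filtered['gid'].tolist())  # ['g1']
```

A count- or TPM-based floor like this is a common pre-filter before interpreting fold changes from any DE test.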
Hi,
I am keen to use swanvis to explore the results of running TALON on our single cell dataset. Unfortunately, when reading in the filtered abundance information into the swan graph, I get a memory error.
We have 9992 cells and 17,808 transcripts. The error is due to trying to create an array with shape (9992, 3033596). I am not sure what the 3033596 refers to. I tried increasing the memory allocation to 200 GB on the HPC but it still fails. Increasing the memory further does not work, as I am not granted the resources for my job on the HPC. Do you have any suggestions about how to get around this?
Many thanks,
Catherine
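For scale, a dense float64 array of the shape in the error above needs roughly 242 GB, which explains why even a 200 GB allocation fails; sparse storage is the usual way out for single-cell count matrices. A small illustration (whether Swan can be made to keep this matrix sparse internally is a question for the developers):

```python
from scipy import sparse

# Dense memory the reported array would need: shape (9992, 3033596), float64.
dense_bytes = 9992 * 3033596 * 8
print(round(dense_bytes / 1e9), 'GB')  # 242 GB

# A CSR matrix stores only its nonzero entries -- a tiny fraction of the
# dense footprint for sparse single-cell count data:
m = sparse.random(1000, 30000, density=0.001, format='csr', random_state=0)
print(m.data.nbytes < 1000 * 30000 * 8)  # True
```

Pseudobulking cells into a few hundred groups before loading is another practical workaround, since it shrinks the first dimension directly.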
Hello,
I have my data preprocessed with SQANTI3. If I don't want to use the talon.db, what data from SQANTI3 could replace it? If I only need the visualization of my transcripts and not the quantification, is the database still necessary? Thanks
Best
Xiaona
Hi,
I basically ran the script as presented on the website, but I received the error below. Any idea how to solve it? Thanks
Adding annotation to the SwanGraph
Traceback (most recent call last):
File "/Swan/Swan.py", line 24, in
sg.ass_transcriptome(data_gtf)
AttributeError: 'SwanGraph' object has no attribute 'ass_transcriptome'
Hi,
Thanks for a gorgeous tool!
I've been trying Swan out on my samples, but I seem to be encountering this error:
Adding annotation to the SwanGraph
Adding transcriptome to the SwanGraph
/users/k19022845/.local/lib/python3.8/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
Adding abundance for datasets NPCBC01, NPCBC02, NPCBC03, NPCBC04, NPCBC05... (and 31 more) to SwanGraph
Calculating TPM...
Calculating PI...
Traceback (most recent call last):
File "swan_trial.py", line 17, in <module>
sg.add_abundance(ab_file)
File "/users/k19022845/.local/lib/python3.8/site-packages/swan_vis/swangraph.py", line 571, in add_abundance
self.merge_adata_abundance(adata, how=how)
File "/users/k19022845/.local/lib/python3.8/site-packages/swan_vis/swangraph.py", line 424, in merge_adata_abundance
sg_adata.layers['pi'] = sparse.csr_matrix(calc_pi(sg_adata, self.t_df)[0].to_numpy())
File "/users/k19022845/.local/lib/python3.8/site-packages/swan_vis/utils.py", line 427, in calc_pi
df = df.pivot(columns=obs_col, index=id_col, values='pi')
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 8567, in pivot
return pivot(self, index=index, columns=columns, values=values)
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/reshape/pivot.py", line 540, in pivot
return indexed.unstack(columns_listlike) # type: ignore[arg-type]
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/series.py", line 4455, in unstack
return unstack(self, level, fill_value)
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/reshape/reshape.py", line 489, in unstack
unstacker = _Unstacker(
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/reshape/reshape.py", line 137, in __init__
self._make_selectors()
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/reshape/reshape.py", line 189, in _make_selectors
raise ValueError("Index contains duplicate entries, cannot reshape")
ValueError: Index contains duplicate entries, cannot reshape
I'm not entirely sure where the duplicate entry issue is coming from, so any advice on that would be great!
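That ValueError is pandas refusing to pivot when the same (index, columns) pair occurs more than once - here most likely a duplicated transcript ID in the abundance table, which would also explain the "Variable names are not unique" warning earlier in the log. A minimal reproduction plus a way to spot the culprits (column names are illustrative):

```python
import pandas as pd

# Mock abundance-derived table with transcript 't1' duplicated within
# the same dataset, which makes the (index, columns) pair non-unique.
df = pd.DataFrame({'tid': ['t1', 't1', 't2'],
                   'dataset': ['d1', 'd1', 'd1'],
                   'pi': [0.4, 0.6, 1.0]})

try:
    df.pivot(index='tid', columns='dataset', values='pi')
except ValueError as e:
    print(e)  # Index contains duplicate entries, cannot reshape

# List the IDs that appear more than once per dataset:
dupes = df[df.duplicated(subset=['tid', 'dataset'], keep=False)]
print(dupes['tid'].unique())  # ['t1']
```

Running the equivalent duplicated() check on the transcript ID column of the abundance file should identify which entries need deduplicating before add_abundance.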
Hi @fairliereese,
I'm following the tutorial (https://freese.gitbook.io/swan/tutorials/getting_started) for SWAN with my data but when I run:
adata_file = '/home/jupyter/talon_output/WGBR_20240330_transcript_adata.h5ad'
sg.add_adata(adata_file)
I get:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_9583/3669442909.py in <module>
1 adata_file = '/home/jupyter/talon_output/WGBR_20240330_transcript_adata.h5ad'
----> 2 sg.add_adata(adata_file)
AttributeError: 'SwanGraph' object has no attribute 'add_adata'
The functions:
sg = swan.SwanGraph()
sg.add_annotation(annot_gtf)
sg.add_transcriptome(data_gtf)
sg.add_abundance(ab_file)
All seem to work perfectly fine so far. Any help you can provide would be greatly appreciated.
Thanks,
wdg118
I am having the following error while looking for exon skipping and intron retention events. Other modules are working fine. Can you please look into it.
Thanks
es_genes = sg.find_es_genes()
print(es_genes[:5])
~/.local/lib/python3.7/site-packages/swan_vis/swangraph.py in find_es_genes(self)
1092 sub_G = self.G.subgraph(sub_nodes)
1093 sub_edges = list(sub_G.edges())
-> 1094 sub_edges = self.edge_df.loc[sub_edges]
1095 sub_edges = sub_edges.loc[sub_edges.edge_type == 'exon']
1096
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1422
1423 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1424 return self._getitem_axis(maybe_callable, axis=axis)
1425
1426 def _is_scalar_access(self, key: Tuple):
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1837 raise ValueError("Cannot index with multidimensional key")
1838
-> 1839 return self._getitem_iterable(key, axis=axis)
1840
1841 # nested tuple slicing
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1131 else:
1132 # A collection of keys
-> 1133 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1134 return self.obj._reindex_with_indexers(
1135 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1090
1091 self._validate_read_indexer(
-> 1092 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1093 )
1094 return keyarr, indexer
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1175 raise KeyError(
1176 "None of [{key}] are in the [{axis}]".format(
-> 1177 key=key, axis=self.obj._get_axis_name(axis)
1178 )
1179 )
KeyError: "None of [Index([(519693, 519694)], dtype='object', name='edge_id')] are in the [index]"
ir_genes = sg.find_ir_genes()
print(ir_genes[:5])
~/.local/lib/python3.7/site-packages/swan_vis/swangraph.py in find_ir_genes(self)
1019 sub_G = self.G.subgraph(sub_nodes)
1020 sub_edges = list(sub_G.edges())
-> 1021 sub_edges = self.edge_df.loc[sub_edges]
1022 sub_edges = sub_edges.loc[sub_edges.edge_type == 'intron']
1023
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1422
1423 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1424 return self._getitem_axis(maybe_callable, axis=axis)
1425
1426 def _is_scalar_access(self, key: Tuple):
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1837 raise ValueError("Cannot index with multidimensional key")
1838
-> 1839 return self._getitem_iterable(key, axis=axis)
1840
1841 # nested tuple slicing
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1131 else:
1132 # A collection of keys
-> 1133 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1134 return self.obj._reindex_with_indexers(
1135 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1090
1091 self._validate_read_indexer(
-> 1092 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1093 )
1094 return keyarr, indexer
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1175 raise KeyError(
1176 "None of [{key}] are in the [{axis}]".format(
-> 1177 key=key, axis=self.obj._get_axis_name(axis)
1178 )
1179 )
KeyError: "None of [Index([(521455, 521457)], dtype='object', name='edge_id')] are in the [index]"
It's a really nice package; I was looking for exactly this kind of visualization and isoform-switching tool for some time. However, I am having an issue when trying to use a TALON database: it's mainly an error about not finding hepg2_1 in the database. I think it's because of the following line. Can you look into it?
self.create_dfs_db(fname, annot, whitelist, 'hepg2_1')
Hi, your tutorial page mentions that users can use other tools that yield transcriptomes as input to Swan; however, in the "Adding transcript models from a GTF" step, the GTF file ("all_talon_observedOnly.gtf") seems to be generated by TALON. I wonder if you could suggest a way to do the same without needing TALON. Many thanks!
Hi
This is more of a recommendation than an issue. I'm really interested in intron retention and exon skipping for my research, and I find it a bit confusing how this analysis is currently set up in swan_vis. Running ir_genes, ir_transcripts, ir_edges = sg.find_ir_genes()
as in the tutorial gives you three lists that you cannot easily relate to one another. For any particular tuple in the ir_edges list, there is no way to know which transcript or gene it is associated with. Would it not make more sense to output a pandas DataFrame with three columns (gene, transcript, edge) instead of three lists?
Thanks for hearing me out!
Hi, first of all, thank you for developing this wonderful software package! I am trying to use Swan on a 2019 MacBook Pro (Catalina), and I came across an issue while installing it.
I tried installing the package with pip install swan_vis
as recommended, but I could not import the library into Python. I actually had to use pip3 install swan_vis
to get it to work properly.
Thanks,
Sam