mortazavilab / swan_vis
A Python library to visualize and analyze long-read transcriptomes
Home Page: https://freese.gitbook.io/swan/
License: MIT License
Hello,
I want to ask how to keep all of a gene's transcripts as a light background when I plot a single transcript. I tried plotting the gene first and then the transcript, but the figure comes out as a mix of the two rather than showing the gene as a light background like in the tutorial. If I only plot the transcript after reading swan.p in, I cannot see the other transcripts in my plot. Thanks
Case:
Minimal working (or non-working) example:
sg = swan.SwanGraph('swan.p')
sg.plot_graph('SARS', indicate_novel=True, display=True, prefix='figures/SARS')
sg.plot_transcript_path('ENCODEHT000445064', indicate_novel=True, display=True, prefix='figures/SARS')
Hello Fairlie,
Thank you for the Swan package - love the visualisations! I just had a question regarding the novel splice junction visualisation, as the warning message "Novelty info not found for x. Transcripts without novelty information will be labelled 'Undefined'" appears every time I add a dataset.
I am using the mouse reference genome gtf as annotation gtf and my Iso-Seq gtf output from SQANTI as the dataset input (bypassing TALON).
Is the novelty aspect documented in the gtf or is there a specific file I need to include? Any guidance on this will be greatly appreciated!
Thank you,
Szi Kay
It seems that some transcripts present in 'all_talon_abundance_filtered.tsv' and 'all_talon_observedOnly.gtf' are not in the SwanGraph object when I load them. How could this be? Is there additional filtering that removes transcripts while I add them?
More specifically, I have 6,070 of 12,812 novel transcripts remaining after I load them into the Swan object.
I'm using the standard sg.add_transcriptome(talon_db, pass_list=pass_list)
and sg.add_abundance(ab_file)
as you suggest in the tutorial
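One way to pin down which transcripts were dropped is to diff the ID sets between the abundance file and the graph. A hedged sketch with mock data: real use would read the tsv and take the SwanGraph's transcript IDs from sg.t_df.index; 'annot_transcript_id' is the ID column TALON abundance files typically use, so verify the names against your own files.

```python
import pandas as pd

# Mock stand-ins: 'ab' mimics the TALON abundance table, 't_df' mimics
# the SwanGraph transcript table indexed by transcript ID.
ab = pd.DataFrame({'annot_transcript_id': ['ENST1', 'NOVEL1', 'NOVEL2']})
t_df = pd.DataFrame({'gname': ['G1', 'G1']}, index=['ENST1', 'NOVEL1'])

# Transcripts in the abundance file that never made it into the graph:
missing = sorted(set(ab['annot_transcript_id']) - set(t_df.index))
print(missing)  # ['NOVEL2']
```

Inspecting a few of the missing IDs (are they all novel? all single-exon? all on scaffolds absent from the annotation?) usually reveals which filter removed them.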
Hi,
I was wondering if there is tabular information on the relationship between splice donor/acceptor sites and the exons of genes. It would be very helpful when the gene is very long. Thanks
Hi
I've been an extremely satisfied user of swan_vis until now. The last time I used it was in mid-January, about six months ago. I tried to use it again and it seems to be broken.
I tried to re-install with pip and the installation failed. I used 'pip3' as another user suggested in a different issue, and that ran to completion, but when trying to import the package in Python it couldn't be found.
I then cloned the repository and installed it with 'pip3 install .'. That worked and I was able to load the package into Python, but when I ran 'sg = swan.SwanGraph()' I got "AttributeError: module 'swan_vis' has no attribute 'SwanGraph'". What gives?
ETA: Python 3.7.6 in a conda env
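When a module imports but lacks an attribute it should have, a common cause is a stale install or a local file/directory shadowing the package. A quick diagnostic sketch, demonstrated with a stdlib module (swap json for swan_vis in practice):

```python
# Check which file Python actually imported and which interpreter is
# running -- pip3 must install into the same interpreter you launch.
import json
import sys

print(json.__file__)           # path of the module really loaded
print(sys.executable)          # interpreter path; compare with `pip3 -V`
print(hasattr(json, 'loads'))  # the attribute check itself -> True
```

If the printed module path points at a leftover directory (or your own script named like the package) rather than site-packages, that explains the missing attribute.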
I have generated the GTF files from the TALON database. I am encountering the following error. Can you please have a look at it?
Novelty types between datasets conflict. Strongly consider using input from the same data source to reconcile these. Conflicting isoforms will be labelled "Ambiguous".
Traceback (most recent call last):
File "swan.p", line 21, in <module>
count_cols='vRG_370')
File "/u/home/a/ashokpat/.local/lib/python3.7/site-packages/swan_vis/swangraph.py", line 187, in add_dataset
self.update_ids()
File "/u/home/a/ashokpat/.local/lib/python3.7/site-packages/swan_vis/graph.py", line 120, in update_ids
self.dfs_to_dicts()
File "/u/home/a/ashokpat/.local/lib/python3.7/site-packages/swan_vis/graph.py", line 231, in dfs_to_dicts
self.t_df = self.t_df.to_dict('index')
File "/u/home/a/ashokpat/.local/lib/python3.7/site-packages/pandas/core/frame.py", line 1391, in to_dict
raise ValueError("DataFrame index must be unique for orient='index'.")
ValueError: DataFrame index must be unique for orient='index'.
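That ValueError comes straight from pandas: to_dict(orient='index') refuses a non-unique index, which is what Swan's dfs_to_dicts() hands it here. A minimal reproduction with hypothetical transcript IDs, plus one way to spot and drop the duplicates before they reach Swan:

```python
import pandas as pd

# Hypothetical transcript table with a duplicated ID 'tx1'.
t_df = pd.DataFrame({'gene': ['A', 'A', 'B']},
                    index=pd.Index(['tx1', 'tx1', 'tx2'], name='tid'))

try:
    t_df.to_dict('index')
except ValueError as e:
    print(e)  # DataFrame index must be unique for orient='index'.

# Find the offending IDs, then keep only the first occurrence of each:
print(t_df.index[t_df.index.duplicated()].tolist())  # ['tx1']
deduped = t_df[~t_df.index.duplicated(keep='first')]
print(sorted(deduped.to_dict('index')))  # ['tx1', 'tx2']
```

In practice the duplicate usually means the same transcript ID appears twice in one of the input GTFs or count tables, so checking those files for repeated IDs is the first step.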
Hi,
I've been trying to run Swan as a script on our HPC, and es_df = sg.find_es_genes(verbose=True)
and ir_df = sg.find_ir_genes(verbose=True)
take a very long time to run (48+ hours), as there are 40k+ edges to test. I've tried going down the multiprocessing
route but am not sure that's the best way.
I'm wondering if you have any advice on speeding this up, especially in an HPC environment?
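While that doesn't speed up the algorithm itself, caching the results to disk at least guarantees the 48-hour step only ever runs once per graph, which matters when HPC jobs get preempted. A sketch using pickle with a stand-in computation (the real call would be the sg.find_es_genes line shown in the comment):

```python
import os
import pickle
import tempfile

def cached(path, compute):
    """Return the pickled result at `path`, computing and saving it once."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

path = os.path.join(tempfile.mkdtemp(), 'es.pkl')

# Real use: es_df = cached(path, lambda: sg.find_es_genes(verbose=True))
first = cached(path, lambda: {'n_es_genes': 3})    # computes and saves
second = cached(path, lambda: {'n_es_genes': 99})  # loaded from disk instead
print(second)  # {'n_es_genes': 3}
```

Point `path` at persistent storage on the cluster rather than a temp dir so reruns across jobs hit the cache.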
Hey Fairlie, me again...
I would like to measure DIE between multiple conditions. At the moment only two conditions are allowed for the obs_conditions argument. In theory, the method you use for DIE from Joglekar et al. should allow comparison of more than two conditions. Right now I can compare each of my four conditions against the control and look at the overlap, but that is noisy and a worse approach than comparing all conditions to each other simultaneously.
Hi! I'm trying this tool out, and I found some problems. I hope you can help solve them!
In /master/scripts/driver.py, lines 14-28:
sg.add_dataset('HepG2_1', hep_1_gtf,
counts_file=ab_file,
count_cols='hepg2_1')
While running it, this error came out:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 sg.add_dataset('HepG2_1', hep_1_gtf,
2 counts_file=ab_file,
3 count_cols='hepg2_1')
TypeError: add_dataset() got an unexpected keyword argument 'counts_file'
I looked at the source code and found the API has changed. What should I do now?
Another problem with my own data:
/home/lengliang/lzy/mambaforge/lib/python3.9/site-packages/anndata/_core/anndata.py:798: UserWarning:
AnnData expects .obs.index to contain strings, but got values like:
[0, 1]
Inferred to be: integer
value_idx = self._prep_dim_index(value.index, attr)
... (code omitted)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 test = sg.de_gene_test(obs_col, obs_conditions=obs_conditions)
... (code omitted)
File ~/lzy/mambaforge/lib/python3.9/site-packages/dask/array/core.py:3142, in auto_chunks(chunks, shape, limit, dtype, previous_chunks)
3138 raise ValueError(
3139 "auto-chunking with dtype.itemsize == 0 is not supported, please pass in `chunks` explicitly"
3140 )
3141 print(limit , dtype.itemsize , largest_block)
-> 3142 size = (limit / dtype.itemsize / largest_block) ** (1 / len(autos))
3143 small = [i for i in autos if shape[i] < size]
3144 if small:
ZeroDivisionError: float division by zero
I found a similar warning in https://github.com/mortazavilab/swan_paper/blob/master/swan/swan_driver.ipynb, but there it is only a warning, whereas what I get is an error that stops execution. What should I do?
/Users/fairliereese/miniconda3/lib/python3.7/site-packages/dask/array/core.py:2622: RuntimeWarning: divide by zero encountered in true_divide
size = (limit / dtype.itemsize / largest_block) ** (1 / len(autos))
I noticed that in my diff expression files, there are no negative values for log2fc even when I do not impose filters. I use exactly the same code that you provide in your tutorial, except I do not filter, e.g. de_genes = sg.get_de_genes(obs_col, obs_conditions=obs_conditions)
. Not sure what's going wrong...
Also, another maybe more nuanced issue: with long-read seq there are a lot of cases where, for example, the total number of transcripts/reads for a particular gene is very low. Swan still runs diff expression tests on these, resulting in a lot of log2fc values that are huge and identical. In the tutorial on the swan_vis website you can see this even happens to you, see the +/-297.776029 log2fc numbers in your sample tables. As these are probably not so interesting, is there a reason they are still included in the Swan output?
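Until there is a built-in filter, dropping those degenerate rows post hoc is straightforward. A hedged sketch on a mock DE table; the column names ('gid', 'log2fc', 'mean_counts') and the threshold are illustrative, so match them to the columns your de_gene_test output actually contains:

```python
import pandas as pd

# Mock DE results: g2/g3 show the huge, identical log2fc values that
# near-zero counts produce.
de = pd.DataFrame({'gid': ['g1', 'g2', 'g3'],
                   'log2fc': [1.5, 297.776029, -297.776029],
                   'mean_counts': [120.0, 0.3, 0.2]})

# Keep only genes with enough expression to make the fold change meaningful:
min_counts = 10
filtered = de[de['mean_counts'] >= min_counts]
print(filtered['gid'].tolist())  # ['g1']
```

A count- or TPM-based floor like this is a common pre-filter before interpreting fold changes from any DE test.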
Hi,
I am keen to use swanvis to explore the results of running TALON on our single cell dataset. Unfortunately, when reading in the filtered abundance information into the swan graph, I get a memory error.
We have 9992 cells and 17,808 transcripts. The error is due to trying to create an array with shape (9992, 3033596). I am not sure what the 3033596 refers to. I tried increasing the memory allocation to 200 GB on the HPC but it still fails. Increasing the memory further does not work, as I am not granted the resources for my job on the HPC. Do you have any suggestions about how to get around this?
Many thanks,
Catherine
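For scale, a dense float64 array of the shape in the error above needs roughly 242 GB, which explains why even a 200 GB allocation fails; sparse storage is the usual way out for single-cell count matrices. A small illustration (whether Swan can be made to keep this matrix sparse internally is a question for the developers):

```python
from scipy import sparse

# Dense memory the reported array would need: shape (9992, 3033596), float64.
dense_bytes = 9992 * 3033596 * 8
print(round(dense_bytes / 1e9), 'GB')  # 242 GB

# A CSR matrix stores only its nonzero entries -- a tiny fraction of the
# dense footprint for sparse single-cell count data:
m = sparse.random(1000, 30000, density=0.001, format='csr', random_state=0)
print(m.data.nbytes < 1000 * 30000 * 8)  # True
```

Pseudobulking cells into a few hundred groups before loading is another practical workaround, since it shrinks the first dimension directly.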
Hello,
I have my data preprocessed with SQANTI3. If I don't want to use the talon.db, what data from SQANTI3 could replace it? If I only need the visualization of my transcripts and not the quantification, is the database still necessary? Thanks
Best
Xiaona
Hi,
I basically ran the script as presented on the website, but I received the error below. Any idea how to solve it? Thanks
Adding annotation to the SwanGraph
Traceback (most recent call last):
File "/Swan/Swan.py", line 24, in
sg.ass_transcriptome(data_gtf)
AttributeError: 'SwanGraph' object has no attribute 'ass_transcriptome'
Hi,
Thanks for a gorgeous tool!
I've been trying Swan out on my samples, but I seem to be encountering this error:
Adding annotation to the SwanGraph
Adding transcriptome to the SwanGraph
/users/k19022845/.local/lib/python3.8/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
Adding abundance for datasets NPCBC01, NPCBC02, NPCBC03, NPCBC04, NPCBC05... (and 31 more) to SwanGraph
Calculating TPM...
Calculating PI...
Traceback (most recent call last):
File "swan_trial.py", line 17, in <module>
sg.add_abundance(ab_file)
File "/users/k19022845/.local/lib/python3.8/site-packages/swan_vis/swangraph.py", line 571, in add_abundance
self.merge_adata_abundance(adata, how=how)
File "/users/k19022845/.local/lib/python3.8/site-packages/swan_vis/swangraph.py", line 424, in merge_adata_abundance
sg_adata.layers['pi'] = sparse.csr_matrix(calc_pi(sg_adata, self.t_df)[0].to_numpy())
File "/users/k19022845/.local/lib/python3.8/site-packages/swan_vis/utils.py", line 427, in calc_pi
df = df.pivot(columns=obs_col, index=id_col, values='pi')
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 8567, in pivot
return pivot(self, index=index, columns=columns, values=values)
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/reshape/pivot.py", line 540, in pivot
return indexed.unstack(columns_listlike) # type: ignore[arg-type]
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/series.py", line 4455, in unstack
return unstack(self, level, fill_value)
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/reshape/reshape.py", line 489, in unstack
unstacker = _Unstacker(
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/reshape/reshape.py", line 137, in __init__
self._make_selectors()
File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/reshape/reshape.py", line 189, in _make_selectors
raise ValueError("Index contains duplicate entries, cannot reshape")
ValueError: Index contains duplicate entries, cannot reshape
I'm not entirely sure where the duplicate entry issue is coming from, so any advice on that would be great!
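That ValueError is pandas refusing to pivot when the same (index, columns) pair occurs more than once - here most likely a duplicated transcript ID in the abundance table, which would also explain the "Variable names are not unique" warning earlier in the log. A minimal reproduction plus a way to spot the culprits (column names are illustrative):

```python
import pandas as pd

# Mock abundance-derived table with transcript 't1' duplicated within
# the same dataset, which makes the (index, columns) pair non-unique.
df = pd.DataFrame({'tid': ['t1', 't1', 't2'],
                   'dataset': ['d1', 'd1', 'd1'],
                   'pi': [0.4, 0.6, 1.0]})

try:
    df.pivot(index='tid', columns='dataset', values='pi')
except ValueError as e:
    print(e)  # Index contains duplicate entries, cannot reshape

# List the IDs that appear more than once per dataset:
dupes = df[df.duplicated(subset=['tid', 'dataset'], keep=False)]
print(dupes['tid'].unique())  # ['t1']
```

Running the equivalent duplicated() check on the transcript ID column of the abundance file should identify which entries need deduplicating before add_abundance.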
Hi @fairliereese,
I'm following the tutorial (https://freese.gitbook.io/swan/tutorials/getting_started) for SWAN with my data but when I run:
adata_file = '/home/jupyter/talon_output/WGBR_20240330_transcript_adata.h5ad'
sg.add_adata(adata_file)
I get:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_9583/3669442909.py in <module>
1 adata_file = '/home/jupyter/talon_output/WGBR_20240330_transcript_adata.h5ad'
----> 2 sg.add_adata(adata_file)
AttributeError: 'SwanGraph' object has no attribute 'add_adata'
The functions:
sg = swan.SwanGraph()
sg.add_annotation(annot_gtf)
sg.add_transcriptome(data_gtf)
sg.add_abundance(ab_file)
All seem to work perfectly fine so far. Any help you can provide would be greatly appreciated.
Thanks,
wdg118
I am having the following error while looking for exon skipping and intron retention events. Other modules are working fine. Can you please look into it.
Thanks
es_genes = sg.find_es_genes()
print(es_genes[:5])
~/.local/lib/python3.7/site-packages/swan_vis/swangraph.py in find_es_genes(self)
1092 sub_G = self.G.subgraph(sub_nodes)
1093 sub_edges = list(sub_G.edges())
-> 1094 sub_edges = self.edge_df.loc[sub_edges]
1095 sub_edges = sub_edges.loc[sub_edges.edge_type == 'exon']
1096
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1422
1423 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1424 return self._getitem_axis(maybe_callable, axis=axis)
1425
1426 def _is_scalar_access(self, key: Tuple):
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1837 raise ValueError("Cannot index with multidimensional key")
1838
-> 1839 return self._getitem_iterable(key, axis=axis)
1840
1841 # nested tuple slicing
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1131 else:
1132 # A collection of keys
-> 1133 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1134 return self.obj._reindex_with_indexers(
1135 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1090
1091 self._validate_read_indexer(
-> 1092 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1093 )
1094 return keyarr, indexer
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1175 raise KeyError(
1176 "None of [{key}] are in the [{axis}]".format(
-> 1177 key=key, axis=self.obj._get_axis_name(axis)
1178 )
1179 )
KeyError: "None of [Index([(519693, 519694)], dtype='object', name='edge_id')] are in the [index]"
ir_genes = sg.find_ir_genes()
print(ir_genes[:5])
~/.local/lib/python3.7/site-packages/swan_vis/swangraph.py in find_ir_genes(self)
1019 sub_G = self.G.subgraph(sub_nodes)
1020 sub_edges = list(sub_G.edges())
-> 1021 sub_edges = self.edge_df.loc[sub_edges]
1022 sub_edges = sub_edges.loc[sub_edges.edge_type == 'intron']
1023
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1422
1423 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1424 return self._getitem_axis(maybe_callable, axis=axis)
1425
1426 def _is_scalar_access(self, key: Tuple):
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1837 raise ValueError("Cannot index with multidimensional key")
1838
-> 1839 return self._getitem_iterable(key, axis=axis)
1840
1841 # nested tuple slicing
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1131 else:
1132 # A collection of keys
-> 1133 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1134 return self.obj._reindex_with_indexers(
1135 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1090
1091 self._validate_read_indexer(
-> 1092 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1093 )
1094 return keyarr, indexer
~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1175 raise KeyError(
1176 "None of [{key}] are in the [{axis}]".format(
-> 1177 key=key, axis=self.obj._get_axis_name(axis)
1178 )
1179 )
KeyError: "None of [Index([(521455, 521457)], dtype='object', name='edge_id')] are in the [index]"
It's a really nice package; I was looking for exactly this kind of visualization and isoform-switching tool for some time. However, I am having an issue when trying to use a TALON database: it's mainly an error about not finding hepg2_1 in the database. I think it's because of the following line. Can you look into it?
self.create_dfs_db(fname, annot, whitelist, 'hepg2_1')
Hi, your tutorial page mentions that users can use other tools that yield transcriptomes as input to Swan; however, in the "Adding transcript models from a GTF" step, the GTF file ("all_talon_observedOnly.gtf") seems to be generated by TALON. I wonder if you could suggest a way to do the same without needing TALON. Many thanks!
Hi
This is more of a recommendation than an issue. I'm really interested in intron retention and exon skipping for my research, and I find it a bit confusing how this analysis is currently set up in swan_vis. Running ir_genes, ir_transcripts, ir_edges = sg.find_ir_genes()
as in the tutorial gives you three lists that you cannot easily relate to one another. For any particular tuple in the ir_edges list, there is no way to know which transcript or gene it is associated with. Would it not make more sense to output a pandas DataFrame with three columns (gene, transcript, edge) instead of three lists?
Thanks for hearing me out!
Hi, first of all, thank you for developing this wonderful software package! I am trying to use Swan on a 2019 MacBook Pro (Catalina), and I came across an issue while installing it.
I tried installing the package with pip install swan_vis
as recommended, but I could not import the library into Python. I actually had to use pip3 install swan_vis
to get it to work properly.
Thanks,
Sam