
recan's Introduction


Recan

Recan web version

Requirements

Installation

Usage example

Some notes on usage

Automated tests

Example datasets

References

recan citations

Recan

recan [9] is a Python package for constructing genetic distance plots to explore and discover recombination events in viral genomes. This method has previously been implemented in the desktop tools RAT [1], Simplot [2] and RDP4 [8].

Recan web version

A Django-based web version of recan is currently under development: https://github.com/babinyurii/recan_gui

It is available at: http://yuriyb.pythonanywhere.com/

Requirements

To use recan, you will need:

  • Python 3
  • Biopython
  • plotly
  • pandas
  • Jupyter notebook

Installation

To install the package via pip, run:

$ pip install recan

If you are going to use recan in JupyterLab, follow the instructions to install the JupyterLab Plotly renderer.

Usage example

The package is intended to be used in Jupyter notebook.
Import the Simgen class from the recan package:

from recan.simgen import Simgen

Create an object of the Simgen class. To initialize the object, pass your alignment in FASTA format as an argument:

sim_obj = Simgen("./datasets/hbv_C_Bj_Ba.fasta")

The input data are taken from the article by Sugauchi et al. (2002) [3]. This paper describes a recombination event observed in hepatitis B virus isolates.

An object of the Simgen class has the method get_info(), which shows information about the alignment.

sim_obj.get_info()
index:	sequence id:
0	AB048704.1_genotype_C_
1	AB033555.1_Ba
2	AB010291.1_Bj
alignment length:  3215

We have three sequences in our alignment. The Simgen class is based on the MultipleSequenceAlignment class of the Biopython library, so we treat the alignment as an array with n_samples rows and n_features columns, where the samples are the sequences themselves and the features are the columns of nucleotides in the alignment. The index identifies the sequence. Note that indices start at 0.

After you've created the object, you can draw the similarity plot by calling the simgen() method of the Simgen object. Pass the following parameters to the method:

  • window: sliding window size, i.e. the number of nucleotides the sliding window spans. The default is 500.
  • shift: the step by which the window slides downstream along the alignment. The default is 250.
  • pot_rec: the index of the potential recombinant. All the other sequences are plotted as a function of their distance to that sequence. Use the get_info() method to look up the indices, especially if your alignment has many sequences.
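
To make the roles of window, shift and pot_rec concrete, here is a minimal pure-Python sketch of a sliding-window similarity scan. It is not recan's implementation: the function names (window_similarity, scan) and the toy sequences are hypothetical, and reporting each window at its midpoint is an assumption made for illustration.

```python
def window_similarity(seq_a, seq_b):
    """Fraction of identical positions between two equal-length strings."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def scan(alignment, pot_rec, window=500, shift=250):
    """Slide a window along the alignment and compare every sequence
    with the potential recombinant (the sequence at index pot_rec)."""
    length = len(alignment[0])
    reference = alignment[pot_rec]
    positions = []
    similarities = {i: [] for i in range(len(alignment)) if i != pot_rec}
    for start in range(0, length - window + 1, shift):
        end = start + window
        # Assumption: each window is reported at its midpoint.
        positions.append(start + window // 2)
        for i in similarities:
            value = window_similarity(reference[start:end], alignment[i][start:end])
            similarities[i].append(value)
    return positions, similarities


# Toy alignment (hypothetical data): sequence 1 matches sequence 0 in the
# first half and sequence 2 in the second half.
toy = [
    "AAAAAAAAAAAAAAAA",
    "AAAAAAAACCCCCCCC",
    "CCCCCCCCCCCCCCCC",
]
positions, sims = scan(toy, pot_rec=1, window=4, shift=4)
print(positions)
print(sims)
```

A smaller shift gives more points and a smoother curve, while a larger window averages over more columns and blurs sharp transitions.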

The isolate of genotype Ba is a recombinant between viruses of genotype C and genotype Bj. Let's plot it, setting genotype Ba as the potential recombinant:

sim_obj.simgen(window=200, shift=50, pot_rec=1)

hbv_1

The potential recombinant is not shown in the plot, as the distances are calculated relative to it. The higher the plotted value (i.e. the closer to 1), the more similar the sequence is to the potential recombinant, and vice versa.

We can see the typical 'crossover' of the distances, which indicates a possible recombination event. The distance of one isolate 'drops down', whereas the distance of the other stays the same or even gets closer to the potential recombinant. This abrupt drop shows that recombination could have taken place.

The picture from the article is shown below. It is simply turned upside down relative to our plot: instead of a drop in distance, we see the distance rising. Here Bj 'goes away' from genotype C, whereas Ba keeps the same distance.

Ba_Bj_C

By default, the simgen() method plots the whole alignment. After initial exploration, we can take a closer look at a particular region by passing the region parameter to simgen(), which slices the alignment. region must be a tuple or a list of two integers: the start and the end position of the alignment slice.

# region = (start, end)
sim_obj.simgen(window=200, shift=50, pot_rec=1, region=(1000, 2700))

hbv_slice_1

To customize the plot, or just to export and store the data, use the get_data() method. It returns a pandas DataFrame with sequences as samples and the distances at the given points as features.

sim_obj.get_data()

hbv_df_example

If the optional parameter df is set to False, get_data() returns a tuple containing a list of ticks and a dictionary of lists. Each dictionary key is a sequence id, and the list under each key contains the corresponding distances.

positions, data = sim_obj.get_data(df=False)
print(positions)
[1050, 1100, 1150, 1200, 1250, 1300, 1350, 1400, 1450, 1500, 1550, 1600, 1650, 1700, 1750, 1800, 1850, 1900, 1950, 2000, 2050, 2100, 2150, 2200, 2250, 2300, 2350, 2400, 2450, 2500, 2550, 2600, 2650, 2700]

print(data)
{'AB048704.1_genotype_C_': [0.88, 0.935, 0.925, 0.955, 0.955, 0.965, 0.95, 0.935, 0.94, 0.92, 0.9299999999999999, 0.945, 0.925, 0.945, 0.96, 0.95, 0.975, 0.9733333333333334, 0.96, 0.96], 'AB010291.1_Bj': [0.98, 0.975, 0.97, 0.97, 0.965, 0.95, 0.91, 0.88, 0.85, 0.83, 0.825, 0.865, 0.885, 0.9299999999999999, 0.98, 0.97, 0.98, 0.9733333333333334, 0.96, 0.96]}
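
The tuple returned with df=False is easy to post-process without pandas. As an illustration, the sketch below reports the positions at which the sequence closest to the potential recombinant changes, which is where the curves cross. The helper find_crossovers and the toy values are hypothetical and not part of recan; the sketch assumes every list in the dictionary has the same length as the list of positions.

```python
def find_crossovers(positions, data):
    """Return (position, previous_closest, new_closest) for every window
    at which the sequence with the highest value changes."""
    crossovers = []
    previous = None
    for i, position in enumerate(positions):
        # Sequence id with the highest value in this window
        # (on a tie, the first id in the dictionary wins).
        closest = max(data, key=lambda seq_id: data[seq_id][i])
        if previous is not None and closest != previous:
            crossovers.append((position, previous, closest))
        previous = closest
    return crossovers


# Toy values (hypothetical), shaped like the output of get_data(df=False).
positions = [100, 150, 200, 250, 300]
data = {
    "seq_C": [0.88, 0.90, 0.95, 0.96, 0.90],
    "seq_Bj": [0.98, 0.97, 0.85, 0.83, 0.97],
}
crossovers = find_crossovers(positions, data)
print(crossovers)
```

Such crossings are only candidates for breakpoints: with a window-based measure the true breakpoint can lie anywhere within roughly one window of the reported position.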

Once you have the data, you can easily customize the plot using your favourite plotting library:

df = sim_obj.get_data()  # rows are sequence ids, as used by df.loc below

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

fig_dist1 = plt.figure(figsize=(20, 8))
plt.plot(df.loc["AB048704.1_genotype_C_", : ], lw=7, alpha=0.7, label="AB048704.1_genotype_C_")
plt.plot(df.loc["AB010291.1_Bj", : ], lw=7, alpha=0.7, label="AB010291.1_Bj")

plt.ylim(0.75, 1.05)
plt.title("similarity distance plot", fontsize=25)
plt.ylabel("distance relative to Ba", fontsize=20)
plt.xlabel("nucleotide position", fontsize=20)
plt.xticks(fontsize=15) 
plt.yticks(fontsize=15)

plt.axvline(1750, alpha=0.5, color="red", lw=3,
            linestyle="dashed", label="putative recombination break points")
plt.axvline(2250, alpha=0.5, color="red", lw=3,
            linestyle="dashed"  )

plt.legend(prop={"size":20})
plt.show()

hbv_matplotlib

The simgen() method has an optional parameter dist, which selects the method used to calculate the pairwise distance. Its default value is pdist, so simgen() calculates the simple pairwise distance.

Parameters for distance calculation methods:

  • pdist : pairwise distance (default)
  • jcd : Jukes-Cantor distance
  • k2p : Kimura 2-parameter distance
  • td : Tamura distance

sim_obj.simgen(window=200, shift=50, pot_rec=1, region=(1000, 2700), dist='k2p')
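
For orientation, the textbook formulas behind three of these options can be sketched in a few lines of Python. This is not recan's code: the function names are hypothetical, the sketch assumes non-empty, equal-length, uppercase sequences containing only A, C, G and T, and it assumes the sequences are similar enough for the logarithms to be defined. How recan turns these distances into the plotted values is not shown here, and the Tamura distance is omitted.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def p_distance(seq_a, seq_b):
    """Proportion of sites at which the two sequences differ."""
    diffs = sum(a != b for a, b in zip(seq_a, seq_b))
    return diffs / len(seq_a)


def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3)."""
    p = p_distance(seq_a, seq_b)
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq_a, seq_b):
    """Kimura 2-parameter distance:
    d = -1/2 * ln(1 - 2P - Q) - 1/4 * ln(1 - 2Q),
    where P and Q are the proportions of transitions and transversions."""
    transitions = 0
    transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    n = len(seq_a)
    prop_ts = transitions / n
    prop_tv = transversions / n
    return -0.5 * math.log(1 - 2 * prop_ts - prop_tv) - 0.25 * math.log(1 - 2 * prop_tv)


# Toy pair (hypothetical): one C->T transition in ten sites.
seq_a = "ACGTACGTAC"
seq_b = "ACGTACGTAT"
print(p_distance(seq_a, seq_b), jukes_cantor(seq_a, seq_b), kimura_2p(seq_a, seq_b))
```

Both corrected distances are slightly larger than the raw proportion of differences, because they account for multiple substitutions at the same site.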

To save the distance data in CSV format, use the save_data() method:

sim_obj.save_data(out_name="hbv_distance_data")

If the input file contains about 20 or 30 sequences and their names are long, the legend may hide the plot. To analyze many sequences at once, it is better to use short, concise sequence names. Like this:

hbv_short_names

To illustrate what typical breakpoints may look like, here are some examples of previously described recombination events in the genomes of different viruses. The FASTA alignments used are available in the datasets folder.

Putative recombination events in the 145,000 bp genome of lumpy skin disease virus [4]:

lsdv

Recombination in HIV genome [5]: hiv

HCV intergenotype recombinant 2k/1b [6]: hcv

Norovirus recombinant isolate [7]: norovirus

Some notes on usage

  • the optimal window size is about 200-250 bp; the optimal window shift is typically about 50-150 bp
  • distance calculation now skips degenerate nucleotides and gaps, so they do not influence the distance values
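
One way to picture the second note is a pairwise distance that only counts columns in which both sequences carry an unambiguous base. The sketch below is a hypothetical illustration, not recan's implementation. A window made up entirely of gaps or ambiguous bases leaves nothing to compare, so the denominator has to be checked; returning None in that case is an assumption made here.

```python
VALID = set("ACGT")


def p_distance_skipping(seq_a, seq_b):
    """p-distance over the columns where both sequences have A, C, G or T.

    Returns None when no column is comparable."""
    compared = 0
    diffs = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in VALID or b not in VALID:
            continue  # skip gaps and degenerate nucleotides
        compared += 1
        if a != b:
            diffs += 1
    if compared == 0:
        return None  # nothing to compare; avoids a division by zero
    return diffs / compared


# Six comparable columns, one difference; N/R and -/T are skipped.
partial = p_distance_skipping("ACGTN-AC", "ACGARTAC")
# No comparable column at all.
empty = p_distance_skipping("NNNN", "ACGT")
print(partial, empty)
```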

Automated tests

To verify the installation, go to the recan/test/ folder and run:

$ pytest test.py

Example datasets

To download the datasets use the following link: https://drive.google.com/drive/folders/1v2lg5yUDFw_fgSiulsA1uFeuzoGz0RjH?usp=sharing

References

  1. Recombination Analysis Tool (RAT): a program for the high-throughput detection of recombination. Bioinformatics, Volume 21, Issue 3, 1 February 2005, Pages 278–281, https://doi.org/10.1093/bioinformatics/bth500
  2. https://sray.med.som.jhmi.edu/SCRoftware/simplot/
  3. Hepatitis B Virus of Genotype B with or without Recombination with Genotype C over the Precore Region plus the Core Gene. Fuminaka Sugauchi et al. JOURNAL OF VIROLOGY, June 2002, p. 5985–5992. 10.1128/JVI.76.12.5985-5992.2002 https://jvi.asm.org/content/76/12/5985
  4. Sprygin A, Babin Y, Pestova Y, Kononova S, Wallace DB, Van Schalkwyk A, et al. (2018) Analysis and insights into recombination signals in lumpy skin disease virus recovered in the field. PLoS ONE 13(12): e0207480. https://doi.org/10.1371/journal.pone.0207480
  5. Liitsola, K., Holm K., Bobkov, A., Pokrovsky, V., Smolskaya,T., Leinikki,P., Osmanov,S. and Salminen,M. (2000) An AB recombinant and its parental HIV type 1 strains in the area of the former Soviet Union: low requirements for sequence identity in recombination. UNAIDS Virus Isolation Network. AIDS Res. Hum. Retroviruses, 16, 1047–1053.
  6. Smith, D. B., Bukh, J., Kuiken, C., Muerhoff, A. S., Rice, C. M., Stapleton, J. T., & Simmonds, P. (2014). Expanded classification of hepatitis C virus into 7 genotypes and 67 subtypes: Updated criteria and genotype assignment web resource. Hepatology, 59(1), 318–327. https://doi.org/10.1002/hep.26744
  7. Jiang,X., Espul,C., Zhong,W.M., Cuello,H. and Matson,D.O. (1999) Characterization of a novel human calicivirus that may be a naturally occurring recombinant. Arch. Virol., 144, 2377–2387.
  8. Martin, D. P., Murrell, B., Golden, M., Khoosal, A., & Muhire, B. (2015). RDP4: Detection and analysis of recombination patterns in virus genomes. Virus Evolution, 1(1), 1–5. https://doi.org/10.1093/ve/vev003
  9. Babin, Y., (2020). Recan: Python tool for analysis of recombination events in viral genomes. Journal of Open Source Software, 5(49), 2014. https://doi.org/10.21105/joss.02014

recan citations

  1. Zimerman RA, Ferrareze PAG, Cadegiani FA, Wambier CG, Fonseca DdN, de Souza AR, Goren A, Rotta LN, Ren Z and Thompson CE (2022) Comparative Genomics and Characterization of SARS-CoV-2 P.1 (Gamma) Variant of Concern From Amazonas, Brazil. Front. Med. 9:806611. doi: 10.3389/fmed.2022.806611
  2. Python Data Analysis Techniques in Administrative Information Integration Management System. In: Proceedings of the 4th International Conference on Big Data Analytics for Cyber-Physical System in Smart City, Volume 2. April 2023. DOI: 10.1007/978-981-99-1157-8_35

recan's People

Contributors

babinyurii, emollier


recan's Issues

JoSS Review: Statement of need

In conjunction with the review of your package to JoSS (available at openjournals/joss-reviews#2014), here is an issue for you to address for your submission.

My only comment here is that a clear statement of the problem that recan solves is missing. I do see that it is meant for researchers who use Python and that it has some advantages over RDP4 and RAT. However, a concise statement of need seems absent.

Distance Calculation

As it stands, simgen only allows pdist and k2p as its distance formula, and your program doesn't allow the user to define their own, such as a Jukes-Cantor or Tajima-Nei distance.

Would it be possible to allow simgen to take a function as an argument for its dist= kwarg?
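
A minimal sketch of what such an interface could look like, assuming a hypothetical helper resolve_distance that accepts either a built-in name or a user-supplied function of two sequences. Neither the helper nor the registry is part of recan.

```python
def p_distance(seq_a, seq_b):
    """Proportion of sites at which the two sequences differ."""
    diffs = sum(a != b for a, b in zip(seq_a, seq_b))
    return diffs / len(seq_a)


# Hypothetical registry of built-in methods, keyed by name.
BUILT_IN = {"pdist": p_distance}


def resolve_distance(dist="pdist"):
    """Return a distance function for a built-in name or a callable."""
    if callable(dist):
        return dist
    try:
        return BUILT_IN[dist]
    except KeyError:
        raise ValueError(f"unknown distance method: {dist!r}") from None


def hamming(seq_a, seq_b):
    """User-defined example: raw count of differing sites."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


by_name = resolve_distance("pdist")
custom = resolve_distance(hamming)
print(by_name("ACGT", "ACGA"), custom("ACGT", "ACGA"))
```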

JoSS Review: Automated tests

In conjunction with the review of your package to JoSS (available at openjournals/joss-reviews#2014), here is an issue for you to address for your submission.

Regarding automated testing, I appreciate that you already have some in place. However, there are some points to be made:

  1. You do not detail how to run the automated tests in the instructions/README
  2. One of the industry leaders in running automated tests is pytest, and if a naive user were to attempt to run your test suite using it, the naming conventions you have in place for the test folder and files would inhibit test discovery:
    • Consider renaming test/ to tests/ and test.py to recan_test.py or basic_test.py
  3. Your tests require the user to run tests from the test/ folder itself. Running the automated tests from the top-level directory causes tests to fail. If this behavior is desired, please detail it in the instructions for running the tests.
  4. Lastly, with regard to test coverage:
$ pytest --cov=recan .
================================================= test session starts ==================================================
platform linux -- Python 3.6.7, pytest-5.3.2, py-1.8.1, pluggy-0.12.0
rootdir: /.../recan/recan
plugins: cov-2.8.1
collected 13 items

recan_test.py .............                                                                                      [100%]       
=================================================== warnings summary ===================================================
/.../miniconda3/envs/recan/lib/python3.6/site-packages/Bio/Alphabet/__init__.py:26
  /.../miniconda3/envs/recan/lib/python3.6/site-packages/Bio/Alphabet/__init__.py:26: PendingDeprecationWarning:

  We intend to remove or replace Bio.Alphabet in 2020, ideally avoid using it explicitly in your code. Please get in touch if you will be adversely affected by this. https://github.com/biopython/biopython/issues/2046

-- Docs: https://docs.pytest.org/en/latest/warnings.html

----------- coverage: platform linux, python 3.6.7-final-0 -----------
Name                                  Stmts   Miss  Cover
---------------------------------------------------------
/.../recan/recan/recan/__init__.py       0      0   100%
/.../recan/recan/recan/simgen.py       147     48    67%
---------------------------------------------------------
TOTAL                                   147     48    67%

============================================ 13 passed, 1 warning in 3.23s =============================================

JoSS Review: README Documentation

In conjunction with the review of your package to JoSS (available at openjournals/joss-reviews#2014), here is an issue for you to address for your submission.

While your GitHub repo contains a README file, this file is not included in your package on PyPI. As it stands, any person looking for your package through that venue will not be able to view your package details unless they were to navigate to the repo itself. Consider packaging your README with your PyPI package: https://packaging.python.org/guides/making-a-pypi-friendly-readme/

k2p: ZeroDivisionError: float division by zero

Hi developer,

I'm trying to run recan to detect a potential recombination event. However, after I set dist='k2p', the program crashed, while the default pdist runs normally. Could you please help figure it out?

Here is the running output:

Traceback (most recent call last):
  File "test_recan.py", line 6, in <module>
    sim_obj.simgen(window=200, shift=20, pot_rec=1,  dist='k2p')
  File "/home/zjl/anaconda3/lib/python3.7/site-packages/recan/simgen.py", line 241, in simgen
    self._distance = self._move_window(window, pot_rec, shift, dist)
  File "/home/zjl/anaconda3/lib/python3.7/site-packages/recan/simgen.py", line 117, in _move_window
    distance = self._K2Pdistance(seq1, seq2)
  File "/home/zjl/anaconda3/lib/python3.7/site-packages/recan/simgen.py", line 153, in _K2Pdistance
    p = float(ts_count) / length
ZeroDivisionError: float division by zero

My MSA sequences contain some N bases and degenerate bases. Is this the reason? How should I prepare the input alignment? Thanks.

JoSS Review: Quality of writing

In conjunction with the review of your package to JoSS (available at openjournals/joss-reviews#2014), here is an issue for you to address for your submission.

This is potentially the most difficult issue to bring up. While the implementation of recan is tremendous work, the readability of this submission could use significant revisions.

I planned on making a series of revision notes to pass on to the author but, after looking at the list, I will refrain as I know what it feels like to have a giant list of revisions to come from a reviewer.

Generally speaking, this manuscript would benefit from improvements in grammar and sentence structure.

JoSS Review: State of the field

In conjunction with the review of your package to JoSS (available at openjournals/joss-reviews#2014), here is an issue for you to address for your submission.

While recan allows the plotting of genetic distances (much like RDP4 and RAT), and it does so interactively and ad hoc, it foregoes the ability to detect potential recombination events. This is one of the primary reasons others use those programs.

Are there any future plans to implement potential recombination event detection into recan?

JoSS Review: Example Usage

In conjunction with the review of your package to JoSS (available at openjournals/joss-reviews#2014), here is an issue for you to address for your submission.

While I appreciate the wealth of examples that were added to the documentation, none of these examples explain what is being visualized thoroughly. A lay-person, or early scientist in the field, may look at these plots and not understand where the events are occurring and what they look like. Consider adding brief annotations to the screen shots explaining what/where these events are.

JoSS Review: Community Guidelines

In conjunction with the review of your package to JoSS (available at openjournals/joss-reviews#2014), here is an issue for you to address for your submission.

In the spirit of Open Source, programs submitted to JoSS should iterate on the concept of developing a Community of Practice. Therefore, most submissions should contain a set of Community Guidelines. These guidelines should detail how to contribute, how to report issues, and how to seek support.

These can easily be detailed in a CONTRIBUTING.md file at the top-level of the package and consider setting up Issue/PR templates for your repository.

Copyright statement

Hi,

I'm trying to package this for Debian, and I'm now writing the copyright information.

Copyright (c) 2018 The Python Packaging Authority (PyPA)

I found the copyright statement above in LICENCE.txt, and it seems incorrect. (Is The Python Packaging Authority (PyPA) just a template or placeholder?)

Could you please let me know whether this statement is correct? If not, could you please fix it?

Thank you~

ValueError: Sequences must all be the same length

Describe the bug
When we added two distinct virus sequences, we got an error saying that the sequences must all be the same length.

Here is our code

from recan.simgen import Simgen
sim_obj = Simgen("./virus_genome/all_virus_genome.fasta")

Here is the error we have received.

~/anaconda3/lib/python3.7/site-packages/Bio/Align/__init__.py in _append(self, record, expected_length)
589 # raise ValueError("New sequence is not of length %i"
590 # % self.get_alignment_length())
--> 591 raise ValueError("Sequences must all be the same length")
592
593 # Using not self.alphabet.contains(record.seq.alphabet) needs fixing

ValueError: Sequences must all be the same length
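
Biopython raises this error when the records in the file differ in length, which usually means the file holds unaligned sequences rather than a multiple sequence alignment. Before passing a file to Simgen, the lengths can be checked with a few lines of plain Python. The helper fasta_lengths below is a hypothetical sketch, not part of recan or Biopython; it assumes sequence ids are unique.

```python
from collections import Counter


def fasta_lengths(lines):
    """Map each FASTA header to the total length of its sequence."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip() or "unnamed"
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


# Toy input (hypothetical); for a real file use:
#     with open(path) as handle:
#         lengths = fasta_lengths(handle)
example = [">seq1", "ACGT", "ACGT", ">seq2", "ACG-", "ACGT", ">seq3", "ACGTAC"]
lengths = fasta_lengths(example)

# Report the records whose length differs from the most common one.
expected = Counter(lengths.values()).most_common(1)[0][0]
mismatched = {name: n for name, n in lengths.items() if n != expected}
print(lengths)
print(mismatched)
```

If any record is reported as mismatched, the sequences need to be aligned with a multiple sequence alignment tool before they are passed to Simgen.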

JoSS Review: Performance

In conjunction with the review of your package to JoSS (available at openjournals/joss-reviews#2014), here is an issue for you to address for your submission.

In your paper, you have stated that one of the advantages of recan over other methods of recombination detection (e.g. RAT and RDP4) is speed. However, I do not think the speed of recan should be compared to that of the aforementioned programs as the recombination auto-search function (which consumes a lot of run-time and is the main focus of RAT and RDP4) has not been implemented in recan.

save_data: TypeError: 'NoneType' object is not subscriptable

Describe the bug
When using the save_data method I'm getting an error saying TypeError: 'NoneType' object is not subscriptable

Full error:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    sim_obj.save_data(out="csv", out_name="test_data")
  File "/home/jeremy/.local/lib/python3.8/site-packages/recan/simgen.py", line 293, in save_data
    df = pd.DataFrame(data=self._distance, index=self._ticks[1:]).T
TypeError: 'NoneType' object is not subscriptable

To Reproduce

from recan.simgen import Simgen
sim_obj = Simgen("test.fasta")
sim_obj.save_data(out="csv", out_name="test_data")

Expected behavior
CSV file containing plot data is written

