jumitti / tfinder

Python script to quickly extract promoter and terminator regions with the NCBI API. Search for individual patterns or transcription factor responsive elements using a manual sequence (IUPAC) or the JASPAR API.

Home Page: https://tfinder-ipmc.streamlit.app/

License: MIT License

Python 100.00%
api dna dna-sequences iupac ncbi transcription-factors responsive-elements promoter-sequences python transcription-factor-binding-site

tfinder's Introduction

TFinder 🧬🔍

Overview

TFinder is an easy-to-use Python web tool for identifying Transcription Factor Binding Sites (TFBSs) and Individual Motifs (IMs). Using the NCBI API, it extracts either the promoter or the terminator region of a gene from a simple query by NCBI gene name or ID. It supports simultaneous analysis across five species for an unlimited number of genes. The tool searches for TFBSs and IMs in different formats, including IUPAC codes and JASPAR entries, and also allows the generation and use of a Position Weight Matrix (PWM). Results are returned as a table and as a graph showing the relevance of the TFBSs and IMs and their location relative to the Transcription Start Site (TSS) or gene end. The results can also be sent to the user by email, which facilitates subsequent analysis and data sharing.

TFinder is written in Python and is freely available on GitHub under the MIT license (https://github.com/Jumitti/TFinder). It can also be accessed as a web application implemented in Streamlit at https://tfinder-ipmc.streamlit.app.

DOI: https://doi.org/10.21203/rs.3.rs-3782387/v1

Description

Transcription factors (TFs) are proteins that bind to DNA to regulate gene expression. They specifically recognize a nucleotide sequence called a transcription factor binding site (TFBS) in the promoter and terminator regions of genes. The search for these TFBSs is an empirical discipline in genomics and a key step before functional validation of a TFBS by gel shift assays (EMSA) and chromatin immunoprecipitation (ChIP), which examine the interaction between a TF and DNA (Jayaram, Usvyat and R. Martin 2016).

The in silico search for TFBSs can be tedious and time-consuming at various stages, especially for novices in the discipline. First, it is necessary to retrieve the promoter or terminator nucleotide sequence of a gene. This step can be achieved using several databases such as NCBI, UCSC and Ensembl, but they are neither intuitive nor user-friendly. Next, after identifying the promoter sequence of interest, one may use TF databases such as JASPAR (Castro-Mondragon et al. 2022) and TRANSFAC (Matys 2006), but they have their limitations. For example, these platforms do not allow the search for the TFBSs of an unreferenced TF and may be subject to a fee. Other tools such as PROMO (Farre 2003), TFBIND (Tsunoda and Takagi 1999) and TFsitescan make it possible to find all the TFs binding to a nucleotide sequence; nevertheless, they all use the JASPAR and TRANSFAC databases and do not allow the use of personal TFs and their TFBSs. Moreover, these tools are rather archaic and not very user-friendly. Only MEME allows a search with your "own" TFBS (Bailey et al. 2015). MEME has a large tool library but is a niche software suite. Within it, FIMO is the tool most similar to TFinder (Grant, Bailey and Noble 2011).

TFinder is an intuitive, easy-to-use, fast, free and open-source analysis tool that handles both sequence retrieval and TFBS search in a single place. TFinder allows the analysis of an unlimited number of genes; the selection of up to five different species (human, mouse, rat, drosophila, zebrafish); the choice and examination of either promoter or terminator gene regions; the configuration of an upstream/downstream window of sequence analysis; and the search for TFBSs in different formats, including IUPAC code, a JASPAR ID or a Position Weight Matrix. TFinder searches for TFBSs on the sense and antisense strands and also considers the complementary forms. The software takes care of all these steps quickly.
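
To illustrate what an IUPAC search on both strands involves, here is a minimal sketch in Python. It is not TFinder's own code: the function names (`find_motif`, `iupac_to_regex`, `reverse_complement`) are hypothetical, and it only covers the motif and its reverse complement, not the other complementary forms mentioned above.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement of every IUPAC code (for example R = A/G pairs with Y = C/T).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif: str) -> str:
    """Return the reverse complement of an IUPAC motif."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def iupac_to_regex(motif: str) -> str:
    """Translate an IUPAC motif into a regular expression."""
    return "".join(f"[{IUPAC[base]}]" for base in motif.upper())


def find_motif(sequence: str, motif: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, strand, matched text) for every hit.

    A lookahead is used so that overlapping occurrences are all reported.
    Hits on the "-" strand are reported at their position on the given
    sequence, with the text as it reads on that sequence.
    """
    sequence = sequence.upper()
    hits = []
    patterns = {"+": motif, "-": reverse_complement(motif)}
    for strand, pattern in patterns.items():
        regex = re.compile(f"(?=({iupac_to_regex(pattern)}))")
        for match in regex.finditer(sequence):
            hits.append((match.start(), strand, match.group(1)))
    return sorted(hits)
```

For example, `find_motif("AAGATAAGG", "WGATAR")` reports one hit on the sense strand, while `find_motif("CTTATCT", "WGATAR")` reports one hit on the antisense strand. A palindromic motif is reported once per strand in this sketch.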

How to install/use

No installation is required. You can access the web application at https://tfinder-ipmc.streamlit.app/.

A beta version of TFinder also exists.

Functions

Browser compatibility

  • Opera GX
  • Chrome (also Chromium)
  • Safari
  • Edge
  • Mozilla
  • Phone

Gene regulatory regions Extractor

  • Extract multiple regulatory regions (promoter/terminator) in FASTA format using an ENTREZ_GENE_ID or NCBI Gene Name (NCBI API)

  • Extract regions for a given splice variant (NM, XM, NR, XR), or for all splice variants of a gene name or ENTREZ_GENE_ID

  • Species: Human 🙋🏼‍♂️, Mouse 🖱, Rat 🐀, Drosophila 🦟, Zebrafish 🐟

  • Set Upstream and Downstream distances from the Transcription Start Site (TSS) and Gene End

  • Mode "Advance": extract both the promoter and terminator regions of the same gene for several species
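
The Upstream/Downstream setting can be pictured as a coordinate window around the TSS or the gene end. The sketch below is an illustration under stated assumptions, not the extractor's actual code: `region_window` is a hypothetical helper, `upstream` and `downstream` are taken as positive distances in base pairs, and a gene is assumed to lie on the minus strand when its start coordinate is greater than its stop coordinate (the names `chrstart` and `chrstop` are borrowed from the traceback quoted in the issues further down this page).

```python
def region_window(chrstart: int, chrstop: int, upstream: int, downstream: int,
                  region: str = "promoter") -> tuple[int, int, str]:
    """Return (start, end, strand) of the genomic window, with start <= end.

    Assumption: chrstart > chrstop means the gene is on the minus strand,
    so "upstream" points towards higher coordinates.
    """
    if region not in ("promoter", "terminator"):
        raise ValueError("region must be 'promoter' or 'terminator'")
    strand = "+" if chrstart <= chrstop else "-"
    # The promoter is anchored on the TSS, the terminator on the gene end.
    anchor = chrstart if region == "promoter" else chrstop
    if strand == "+":
        return anchor - upstream, anchor + downstream, strand
    # Minus strand: mirror the window; the fetched sequence would then
    # need to be reverse-complemented before analysis.
    return anchor - downstream, anchor + upstream, strand
```

With a plus-strand gene spanning 10000 to 15000, `region_window(10000, 15000, 2000, 500)` gives `(8000, 10500, "+")`. Exact off-by-one conventions (0-based or 1-based, inclusive or exclusive ends) depend on the API used and are not settled by this sketch.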

Individual Motif Finder

  • Support multiple DNA sequences in FASTA format
  • Find Individual Motif occurrences (such as TFBSs, restriction enzyme sites, specific patterns... whatever you want)
  • Support IUPAC code
  • Generate a PWM for an Individual Motif
  • Support transcription factor PWMs from JASPAR (JASPAR API)
  • Calculation of the distance of the found element to TSS or Gene End
  • Relative Score calculation:

[Image: relscore equation]

  • p-value: 1,000,000 random sequences with the same length as the responsive element are generated, based on the proportions of A, T, G and C in the analysed sequence. The p-value is the number of random sequences whose relative score is greater than or equal to the relative score of the element found, divided by the total number of random sequences generated

[Image: relscore equation]

  • Interactive graph
  • Download results as Excel (.xlsx)
  • Export results via e-mail

[Screenshots: interactive graph in the web UI]
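
The relative score and p-value items in the list above can be sketched as follows. The equation image is not available on this page, so the sketch assumes the commonly used definition of a relative score, (score - minimum score) / (maximum score - minimum score); TFinder's exact formula may differ. The matrix values, the function names and the reduced default number of random sequences (10,000 instead of 1,000,000, for speed) are illustrative assumptions.

```python
import random

# Toy 4-position weight matrix (hypothetical values, not a JASPAR entry).
PWM = {
    "A": [0.9, 0.1, 0.1, 0.7],
    "C": [0.0, 0.1, 0.8, 0.1],
    "G": [0.1, 0.7, 0.1, 0.1],
    "T": [0.0, 0.1, 0.0, 0.1],
}


def raw_score(site: str, pwm: dict) -> float:
    """Sum the matrix weight of each base of the site at its position."""
    return sum(pwm[base][i] for i, base in enumerate(site.upper()))


def relative_score(site: str, pwm: dict) -> float:
    """Rescale the raw score to [0, 1] between the worst and best site."""
    length = len(pwm["A"])
    best = sum(max(pwm[b][i] for b in "ACGT") for i in range(length))
    worst = sum(min(pwm[b][i] for b in "ACGT") for i in range(length))
    return (raw_score(site, pwm) - worst) / (best - worst)


def empirical_p_value(site: str, sequence: str, pwm: dict,
                      n_random: int = 10_000, seed: int = 0) -> float:
    """Fraction of random sites scoring at least as high as the found site.

    Random sites have the same length as the found site and are drawn
    according to the A/C/G/T proportions of the analysed sequence.
    """
    rng = random.Random(seed)
    sequence = sequence.upper()
    weights = [sequence.count(base) for base in "ACGT"]
    observed = relative_score(site, pwm)
    hits = 0
    for _ in range(n_random):
        candidate = "".join(rng.choices("ACGT", weights=weights, k=len(site)))
        if relative_score(candidate, pwm) >= observed:
            hits += 1
    return hits / n_random
```

With this toy matrix, the consensus site `AGCA` has a relative score of 1 and a small p-value, while the worst possible site `TATT` has a relative score of 0 and a p-value close to 1, since every random site scores at least as high.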

Working on...

  • Cleaning code
  • Fixing bugs
  • Improvements...

More

Report an issue/bug 🆘 ➡️ Click here

Want to talk? 🙋🏼‍♂️ -> Chat Room

Banner was generated with Adobe Firefly

Artwork made by Minniti Pauline

Credit & Licence & Citation

Copyright (c) 2023 Minniti Julien.

This software is distributed under an MIT licence. Please consult the LICENSE file for more details.

PREPRINT: https://www.researchsquare.com/article/rs-3782387/v1

tfinder's Issues

[BUG] Wrong Gene ID leads to an error and not a warning

Describe the bug
Wrong Gene ID leads to an error and not a warning

To Reproduce
Steps to reproduce the behavior:

  1. Enter an invalid ID (4847)
  2. Extract promoter or terminator
  3. See error
Traceback (most recent call last):
  File "/home/adminuser/venv/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
  File "/mount/src/tfinder/TFinder-v1.py", line 108, in <module>
    allapp_page()
  File "/mount/src/tfinder/navigation/allapp.py", line 67, in allapp_page
    aio_page()
  File "/mount/src/tfinder/navigation/aio.py", line 251, in aio_page
    all_slice_forms=True if all_variants else False).find_sequences()
                                                     ^^^^^^^^^^^^^^^^

  File "/mount/src/tfinder/tfinder/__init__.py", line 131, in find_sequences
    gene_name, chraccver, chrstart, chrstop, species_API = NCBIdna.get_gene_info(entrez_id)
                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mount/src/tfinder/tfinder/__init__.py", line 222, in get_gene_info
    return gene_name, _, _, _, _
                      ^
NameError: name '_' is not defined
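
The traceback ends in a NameError because `_` is used as a value in the `return` statement without ever having been assigned. One possible way to turn such a failure into a warning is sketched below. It is not the project's actual fix: `GeneInfo`, `GeneNotFoundError`, `parse_gene_info`, `safe_lookup` and the layout of the `record` dictionary are all hypothetical, and only the field names `gene_name`, `chraccver`, `chrstart` and `chrstop` are taken from the traceback.

```python
import warnings
from typing import NamedTuple, Optional


class GeneInfo(NamedTuple):
    gene_name: str
    chraccver: str
    chrstart: int
    chrstop: int
    species: str


class GeneNotFoundError(LookupError):
    """Raised when a gene ID has no usable genomic record."""


def parse_gene_info(entrez_id: str, record: Optional[dict]) -> GeneInfo:
    """Build a GeneInfo from an already-fetched record (hypothetical layout)."""
    if not record or "chraccver" not in record:
        # Fail with an explicit, catchable error instead of returning
        # placeholder names that were never defined.
        raise GeneNotFoundError(f"No genomic record found for gene ID {entrez_id}")
    return GeneInfo(
        gene_name=record["name"],
        chraccver=record["chraccver"],
        chrstart=int(record["chrstart"]),
        chrstop=int(record["chrstop"]),
        species=record["species"],
    )


def safe_lookup(entrez_id: str, record: Optional[dict]) -> Optional[GeneInfo]:
    """Return the gene information, or None with a warning if it is missing."""
    try:
        return parse_gene_info(entrez_id, record)
    except GeneNotFoundError as error:
        warnings.warn(str(error))
        return None
```

In a Streamlit page, the `warnings.warn` call could be swapped for `st.warning(...)` so that the message appears in the interface while the remaining gene IDs are still processed.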

Crash report

Describe the bug
When searching for patterns in too many sequences, the site disconnects and goes down.
I tried 100 and then 50 sequences, and in both cases the site crashed.

To Reproduce
Steps to reproduce the behavior:

  1. Search for a pattern in 50 sequences and wait until the site crashes

PWM detection issue? [HELP] [QUESTION]

Describe the bug
When using the Individual Motif Finder, it appears to detect binding sites that don't match the PWM.

Here are two results from a search.
1518 | aatAAATCAGAGCTAaag | 0.769912 | + | → | n.d. | n.d | n.d
463 | gtcAAACTAAAGGACcgg | 0.769912 | + | → | n.d. | n.d | n.d

The G (7th Position) and C (10th position) are absolutely required in the PWM. So not sure why site 463 is found?

PWM = MA0451.1

Seq
atatcccaaggccgcaaagtcaacaagtcggcagcaaatttccctttgtccggcgatgtgttttttttttagccataactcgctgcattgtttgggccaagtttttcttctgccaaattgcggagatgatgcggggattatgcgctgattgcgtgcaattatggacatcctgcgaggccccgaggaacttcctgctaaatcctttcatccgcctacagaacccctttgtgtcccgttcgccgggagtccttgacgggtccttcgactattcgcttacagcagcttgcgtaaaatttcataaccctacgagcggctcttccgcggaatccctggcattatcctttttacctcttgccaatccgttggctaaaaaacggcttcgacttccgcgtaactgctggacaacaaagacaaaaaacggcgaaaggacggcgatttccaggtagcattgcgaattccgtcaaactaaaggaccggttatataacgggtttatatggccagaatctctgcatctccacgaccgccagaagctgcgtaaaactgcaggctctgttttgatttctgcaacttcagttaattgcccgggatggccagcaattgccggcaattataaaacagcgcagatgtgactcagcttccatatctaactctatatctcatgccgaaaatcGagggtggggagcggaggggcggggtgcgtgggtgacttgcctgccagggaaagggggcgggggttcagcgggtgataaatgtgcgtgatttggaatgaatgcgcatcgattaaaaccgcagggcaatcaatttagcgccttttacgccaaattggctcgtacacaaccaattaatgtcagcgggtgaactgacaccatcgcccaccaccgcatcccccttCcccctgttggccatccacccccgaaaaacaattacaacaacgaagacaagcagagggactgctgcagattccgctcaataaacctccaataaagcgaatccagcgtgaggcgtcgacgtctaattgctgttaactcgtcaactaggagaacgctccatcctcgccgttgtgcggctccttggacgcctgattaaacggattggagatgcgaggtgtacagtcgagcctccgtaagggcaaccaaaagtaaaaaacatcgactatttgaaatacaaagttttatatgtacatataatttatcaggctccggatgtaacttaattaaaacatttccttttcataaaatattgctagctgatagctgctcaaaagaacaataaaggtaataaattatgtttgcttgcaaacaattttcaatcaaaaaagtatgcgttccatcttagttaataattaattacctggataaagacttttgaaacatatcatagcgtttctttgcatattcaatactaaccaattttttataaatgAagttacaccgtttgtcgtcttgtcaagtagtatcttcacaataagtataatacagaatcaagatagtaaaataaaacaaaaaaCcgtgtgaataaatcagagctaaagacgtcggac

[BUG] Sequence extraction does not work if I have already extracted and edit manually

Describe the bug
TFinder can extract sequences for analysis, and it also lets you use your own sequences.

However, if you first extract sequences and then modify them manually (or paste in a personal sequence), you can no longer extract a sequence. The extraction messages appear, but nothing is extracted.

To Reproduce
Steps to reproduce the behavior:

  1. Extract sequences
  2. Modify or replace sequence manually
  3. Extract sequence again
  4. Extraction messages appear, but the sequence is not displayed in the corresponding section

Temporary solution
This bug is caused by the way Streamlit Cloud runs the script. With each modification it restarts the script from top to bottom. This bug is known but I haven't found a solution yet.

However, if you change a parameter, the button works again and extracts perfectly.

[FEATURE] multiple TF search

Is it possible to integrate a search for multiple TFs, possibly all of those in the TRANSFAC database? I am particularly interested in the human ones.
