tcc's Introduction

Feature extraction in snoRNAs using mathematical approach

The number of available biological sequences has grown significantly in recent years, driven by scientific discoveries about the genetic code of living beings, and this has produced a huge volume of data. As a result, new computational methods have been developed to analyze and extract information from these genetic sequences. Machine learning (ML) methods have shown wide applicability in bioinformatics and have proven essential for selecting useful information from the secondary structures of genomes, refining their techniques on a mathematical basis rather than on the standard biological model of analysis. This work therefore analyzes mathematical models for feature extraction, focusing on techniques that proved effective at classifying C/D box snoRNAs in vertebrate and invertebrate organisms, with an F-score of 98%, and at classifying H/ACA box snoRNAs, with an F-score of 95%. Algorithms such as numerical mapping with the Fourier transform and complex networks scored above 90% when classifying C/D box and H/ACA box snoRNAs in genetic sequences from Homo sapiens, platypus, Gallus gallus, nematodes, Drosophila and Leishmania, which makes them promising and useful methods for feature extraction in non-coding RNA (ncRNA) molecules of the snoRNA class.
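
For readers unfamiliar with the metric, the F-score quoted above combines precision and recall into a single number. The sketch below assumes the balanced form (F1, beta = 1), which the text does not state explicitly; the function name and the counts are hypothetical and are not taken from the project's results.

```python
def f_score(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical confusion-matrix counts, for illustration only.
print(f_score(tp=90, fp=10, fn=30))
```

With these counts precision is 0.90 and recall is 0.75, so the balanced F-score is 9/11, roughly 0.82.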

Author

Dependencies

  • Python (>=3.10.6)
  • pip (>=22.0.2)
  • Biopython
  • igraph
  • NumPy
  • pandas
  • SciPy
  • scikit-learn
  • Matplotlib
  • seaborn
  • Requests

Mathematical Approaches used in Extraction

  • Numerical Mapping with Fourier Transform (Real and Z-Curve)
  • Entropy (Tsallis and Shannon)
  • Complex Networks
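
As a rough illustration of the first two approaches, the sketch below maps a sequence to numbers (real and Z-curve representations), takes the power spectrum of the mapped signal, and computes Shannon and Tsallis entropies over k-mer frequencies. It is a minimal sketch under assumptions, not the project's implementation: the function names, the numeric values in REAL_MAP, the choice of k = 2 and q = 2, and the example sequence are all hypothetical. It uses only the standard library so it runs without the dependencies above; with NumPy installed, numpy.fft.fft would be the natural replacement for the naive transform.

```python
import cmath
import math
from collections import Counter

# Assumed real-number mapping; other conventions assign the values differently.
REAL_MAP = {"A": 1.5, "C": 0.5, "G": -0.5, "T": -1.5}


def real_mapping(seq):
    """Map each nucleotide to a real number (U is treated as T)."""
    return [REAL_MAP[base] for base in seq.upper().replace("U", "T")]


def z_curve(seq):
    """Cumulative Z-curve coordinates (x, y, z); assumes only A, C, G, T/U."""
    x = y = z = 0
    xs, ys, zs = [], [], []
    for base in seq.upper().replace("U", "T"):
        x += 1 if base in "AG" else -1  # purine vs pyrimidine
        y += 1 if base in "AC" else -1  # amino vs keto
        z += 1 if base in "AT" else -1  # weak vs strong hydrogen bonds
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


def power_spectrum(signal):
    """Squared magnitude of the discrete Fourier transform (naive O(n^2) DFT)."""
    n = len(signal)
    return [
        abs(sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, v in enumerate(signal))) ** 2
        for k in range(n)
    ]


def kmer_probabilities(seq, k):
    """Relative frequency of each overlapping k-mer in the sequence."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [count / len(kmers) for count in Counter(kmers).values()]


def shannon_entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs)


def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy, defined here for q != 1."""
    return (1 - sum(p ** q for p in probs)) / (q - 1)


seq = "AUGGCUACGUAGCUAGCUAA"  # hypothetical sequence, not from the project data
spectrum = power_spectrum(real_mapping(seq))
probs = kmer_probabilities(seq, k=2)
features = [
    max(spectrum),
    sum(spectrum) / len(spectrum),
    shannon_entropy(probs),
    tsallis_entropy(probs),
]
print(features)
```

Summary statistics of the spectrum and the entropy values can then serve as fixed-length feature vectors for sequences of different lengths, which is what a classifier needs.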

Description of scripts and classifier

  • In the scripts folder

    • feature_extraction.sh: Automates the feature extraction stage for all samples (positive, negative and validation data).
    • extract_sequences_count.sh: Counts the number of sequences in each sample.
    • extract_average_data.py: Computes the average amount of data to extract from each family defined in the pre-processing phase. The goal is to balance the positive and negative samples so that they hold a similar number of sequences.
    • rfam.py: Sends a GET request to the Rfam repository to retrieve a snoRNA family and writes the result to a file with the fasta extension.
    • shuffle.py: Shuffles the pyrimidines and purines of a sequence according to a parameter "k" (similar to a number of codons): the sequence is shuffled up to the k-th codon.
  • In the classifier folder

    • train.py: The learning algorithm itself. Trains and tests on all samples, including the extracted validation dataset.
    • utils.py: Helper functions to plot graphs and to compute the deviation, standard deviation, arithmetic mean, etc.
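
The description of shuffle.py leaves some room for interpretation. One possible reading, sketched below, is that the nucleotides inside the first k codons (the first 3·k positions) are shuffled while the rest of the sequence is left untouched. The function name, the seed parameter and the example sequence are hypothetical, and the actual script may behave differently.

```python
import random


def shuffle_prefix(seq, k, seed=None):
    """Shuffle the nucleotides of the first k codons; keep the remainder as is."""
    cut = min(3 * k, len(seq))  # a codon is three nucleotides
    head = list(seq[:cut])
    random.Random(seed).shuffle(head)  # local generator, global state untouched
    return "".join(head) + seq[cut:]


original = "AUGGCCAUUGUA"  # hypothetical sequence
print(shuffle_prefix(original, k=2, seed=0))
```

Shuffling preserves the nucleotide composition of the affected region, which is why shuffled sequences are a common source of negative samples.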

Setting up

$ git clone https://github.com/marcos-c1/tcc 
$ cd tcc/classifier
$ pip install -r requirements.txt
$ apt-get -y install python3-igraph
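
After installation, a quick check such as the one below can confirm that the listed dependencies are importable. This is only a convenience sketch and not part of the repository: the helper name is hypothetical, and the list uses the usual import names of the packages (Bio for Biopython, sklearn for scikit-learn).

```python
import importlib.util
import sys

# Usual import names for the packages listed under Dependencies.
REQUIRED = ["Bio", "igraph", "numpy", "pandas", "scipy",
            "sklearn", "matplotlib", "seaborn", "requests"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info < (3, 10):
    print("Python 3.10 or newer is expected")
print("Missing modules:", missing_modules(REQUIRED) or "none")
```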

Graduation Project

The monograph is available as TCC.pdf.

tcc's People

Contributors

  • marcos-c1

tcc's Issues

Negative set

Generate a negative set for the snoRNA groups:

  1. C/D box
     • Median of the data
     • Full data
  2. H/ACA box
     • Median of the data
     • Full data
