utils_hiv's Introduction

HIV Project util functions

This Python package gathers functions and classes that were used to study resistance mutations and to search for potentially resistance-associated mutations in HIV-1 reverse transcriptase sequences using machine learning methods.
You can read about this project here and here

Module description

This package is divided into 5 submodules:

DRM utils

This submodule contains functions that return different subsets of drug resistance mutations (DRMs), e.g. NRTI DRMs, NNRTI DRMs, accessory DRMs, SDRMs, etc. Each function returns a list of the selected DRMs.

data utils

This submodule contains functions and classes to pre-process the encoded dataset before model training. You can remove features corresponding to known DRMs, remove sequences that have DRMs, balance target classes by sub-sampling or over-sampling, or create cross-validation folds.
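As an illustration of class balancing by sub-sampling, here is a minimal standard-library sketch. The function name `undersample` and the toy data are hypothetical and are not taken from the package, whose own helpers may work differently (for example on pandas DataFrames).

```python
import random
from collections import defaultdict


def undersample(rows, labels, seed=0):
    """Randomly drop rows from the larger classes so that every class
    ends up with as many rows as the smallest class.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Every class is reduced to the size of the smallest one.
    target = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # keep the original row order

    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy one-hot rows with an imbalanced binary target (4 zeros, 2 ones).
rows = [[0, 1], [1, 0], [1, 1], [0, 0], [1, 0], [0, 1]]
labels = [1, 0, 0, 0, 0, 1]

balanced_rows, balanced_labels = undersample(rows, labels)
print(balanced_labels)
```

Over-sampling follows the same pattern, drawing with replacement from the smaller classes instead of discarding rows from the larger ones.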

learning utils

This submodule contains functions and classes for working with the classifiers used during the study, including custom classifiers based on Fisher exact tests. It provides functions to train classifiers, get their predictions, and extract their coefficients / weights.

param utils

This submodule contains functions useful for the generation and selection of the best hyper-parameter set via random search.
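The idea of random search can be sketched in a few lines of standard-library Python. Everything named here (`sample_params`, `random_search`, the search space and the toy scoring function) is a hypothetical stand-in; in practice the score would come from cross-validating a classifier with the sampled parameters.

```python
import random


def sample_params(space, n_iter, seed=0):
    """Draw n_iter random hyper-parameter sets from a discrete search space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


def random_search(space, score_fn, n_iter=20, seed=0):
    """Return the sampled parameter set with the highest score."""
    best_params, best_score = None, float("-inf")
    for params in sample_params(space, n_iter, seed):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space for a regularised linear classifier.
space = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}


def toy_score(params):
    # Stand-in for a cross-validated performance score.
    bonus = 0.5 if params["penalty"] == "l1" else 0.0
    return -abs(params["C"] - 1.0) + bonus


best_params, best_score = random_search(space, toy_score, n_iter=50)
print(best_params, best_score)
```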

metrics

This submodule contains a set of custom performance metrics that we devised to account for class imbalance and for the differing importance given to false positives (more important) and false negatives (less important).
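The custom metrics themselves are not described here. As a generic illustration of weighting false positives more heavily than false negatives, the sketch below uses the standard F-beta score with beta < 1; this is a swapped-in textbook metric, not one of the package's own, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f_beta(y_true, y_pred, beta=0.5):
    """F-beta score. With beta < 1, a false positive costs more than a
    false negative; beta = 1 gives the usual F1 score."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denominator = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return (1 + beta ** 2) * tp / denominator if denominator else 0.0


y_true = [1, 1, 1, 0, 0, 0]
one_false_positive = [1, 1, 1, 1, 0, 0]
one_false_negative = [1, 1, 0, 0, 0, 0]

# With beta = 0.5 the prediction containing a false positive scores lower.
print(f_beta(y_true, one_false_positive, beta=0.5))
print(f_beta(y_true, one_false_negative, beta=0.5))
```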

independent scripts

Additionally, two useful scripts are present.

compute_fisher_values.py

This script computes p-values for Fisher exact tests comparing the prevalence of mutations with respect to a binary trait such as RTI treatment status or the presence/absence of any DRM. It outputs a table with one row per considered mutation, containing the raw p-value as well as p-values corrected for multiple testing with the Bonferroni, Benjamini-Hochberg and Benjamini-Yekutieli methods. This script was used to generate the table utils_hiv/data/fisher_p_values.tsv.
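To make the test concrete, here is a minimal standard-library sketch of a two-sided Fisher exact test on a 2x2 contingency table (mutation present/absent against treated/naive). It is not the script's own code; the function name and the counts are made up for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are at most as likely as the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability of observing x in the top-left cell given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(p_value, 1.0)


# Illustrative counts only: rows = mutation present / absent,
# columns = treated / naive sequences.
p = fisher_exact_two_sided(8, 2, 1, 5)
print(round(p, 5))
```

For tables with very large counts the integer-to-float division can overflow, so a library routine such as the one in scipy is preferable in practice.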

data_encoder.py

This script creates the one-hot encoded dataset from HIVDB files and an additional metadata file.
To run this script you need the PrettyRTAA_naive.tsv and PrettyRTAA_treated.tsv files, generated by submitting the naive.fa and treated.fa FASTA alignments to the HIVDB sequence program. The HIVDB program also outputs ResistanceSummary_naive.tsv and ResistanceSummary_treated.tsv, which the script needs as well.
The script lets you specify starting and ending positions.
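The following toy sketch shows what one-hot encoding of aligned amino-acid sequences between two positions can look like. It works on plain strings rather than HIVDB files; the function name, the `position_residue` feature naming and the example sequences are all assumptions made for illustration, not the script's actual behaviour.

```python
def one_hot_encode(sequences, start=1, end=None):
    """One-hot encode aligned sequences between two 1-based, inclusive positions.

    sequences: dict mapping a sequence name to an aligned amino-acid string.
    Returns (feature_names, table), where table maps each sequence name to a
    list of 0/1 values, one per feature.
    """
    length = len(next(iter(sequences.values())))
    end = length if end is None else end

    # One feature per (position, residue) pair observed in the data.
    features = sorted(
        {
            (pos, seq[pos - 1])
            for seq in sequences.values()
            for pos in range(start, end + 1)
        }
    )
    feature_names = [f"{pos}_{residue}" for pos, residue in features]

    table = {
        name: [1 if seq[pos - 1] == residue else 0 for pos, residue in features]
        for name, seq in sequences.items()
    }
    return feature_names, table


# Made-up aligned fragments, encoded for positions 2 to 3 only.
sequences = {"s1": "PISK", "s2": "PIAK", "s3": "PLSK"}
feature_names, table = one_hot_encode(sequences, start=2, end=3)
print(feature_names)
print(table["s2"])
```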

data files

These files are in utils_hiv/data and are used by submodules.

DRM files

NRTI.tab and NNRTI.tab are local copies of HIVDB files (1, 2).
mutation_characteristic.tab is used by the DRM_utils submodule and contains known DRMs with their type (NRTI, NNRTI, Other) and their SDRM status. It was obtained through the HIVDB program and hand-curated. The accessory/primary role of each mutation was determined from the HIVDB program comments.

consensus.fa

This file contains the reference sequences for the main HIV-1 subtypes present in our datasets. These sequences were obtained from the Los Alamos HIV sequence database. They are used to determine which features to remove when encoding sequences.

fisher_p_values.tsv

This file contains the results of Fisher exact tests for all mutations in the datasets with respect to treatment status or DRM presence/absence, with raw p-values and p-values corrected for multiple testing. These p-values are used to build our "Fisher classifiers".
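For reference, the three multiple-testing corrections named in this README (Bonferroni, Benjamini-Hochberg and Benjamini-Yekutieli) can be sketched with the standard library as follows. This is the textbook formulation with hypothetical function names, not necessarily how the file was produced; statsmodels, listed in the dependencies, provides equivalent routines.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues, dependence_factor=1.0):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    The p-value of rank i (ascending) is scaled by m / i, then a running
    minimum from the largest rank down enforces monotonicity.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        scaled = pvalues[i] * m * dependence_factor / rank
        running_min = min(running_min, scaled)
        adjusted[i] = running_min
    return adjusted


def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli: Benjamini-Hochberg with the extra harmonic
    factor c(m) = 1 + 1/2 + ... + 1/m, valid under arbitrary dependence."""
    m = len(pvalues)
    c_m = sum(1.0 / k for k in range(1, m + 1))
    return benjamini_hochberg(pvalues, dependence_factor=c_m)


# Made-up raw p-values for four mutations.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
print(benjamini_yekutieli(raw))
```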

dependencies

This module depends on the following python packages:

  • python 3.7.6
  • pandas 0.25.3
  • scikit-learn 0.20.3
  • biopython 1.74
  • statsmodels 0.9.0
  • category_encoders 1.3.0
  • scipy 1.4.1
  • numpy 1.18.1
