
Bioinformatics 2020: FastSK: fast and accurate sequence classification by making gkm-svm faster and scalable.

Home Page: https://fastsk.readthedocs.io/en/master/

License: Apache License 2.0



FastSK: fast sequence analysis with gapped string kernels (Fast-GKM-SVM)

This repository provides improved algorithms for gkm-svm string kernel calculations. We provide a C++ implementation of the algorithm and a Python wrapper (packaged as a Python library) around it. Our package provides fast and accurate training of SVM classifiers and regressors for gkm string kernel based sequence analysis.

FastSK is built on a novel, fast algorithm for the gapped k-mer kernel, together with pybind11 and LIBSVM.
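The gapped k-mer idea can be illustrated with a brute-force sketch: count the pairs of length-g substrings (one from each sequence) that differ in at most m positions. This toy version is quadratic in the sequence lengths, which is exactly the cost FastSK's counting decomposition avoids; it is an illustration of the kernel's meaning, not the package's algorithm:

```python
def gkm_kernel_bruteforce(x, y, g=6, m=2):
    """Toy gapped k-mer kernel value between two sequences:
    the number of pairs of length-g substrings (one from x, one
    from y) whose Hamming distance is at most m. Brute force,
    O(|x| * |y| * g) time."""
    total = 0
    for i in range(len(x) - g + 1):
        for j in range(len(y) - g + 1):
            mismatches = sum(a != b for a, b in zip(x[i:i + g], y[j:j + g]))
            if mismatches <= m:
                total += 1
    return total

# Example: similarity between two short DNA sequences
k = gkm_kernel_bruteforce("ACGACGTGACGA", "AGTCCCAAATCC", g=6, m=2)
```

FastSK computes the same kind of quantity by decomposing it into independent counting operations over the possible mismatch positions, which is what makes larger g and m tractable.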

More details on the algorithms and results are in our paper: Bioinformatics 2020

Prerequisites

  • Python 3.6+
  • setuptools version 42 or greater (run pip install --upgrade setuptools)
  • pybind11 (run pip install pybind11)

Installation via Pip Install (Linux and MacOS)

Option 1: install from PyPI

pip install fastsk

Option 2: clone this repository and install from source:

git clone https://github.com/QData/FastSK.git
cd FastSK
pip install -r requirements.txt
pip install .

The pip installation of FastSK has been tested successfully on CentOS, Red Hat, and macOS.

Python Version Tutorial

Example Jupyter notebook

  • 'docs/2demo/fastDemo.ipynb'

You can verify that the fastsk library is installed correctly in a Python shell:

from fastsk import FastSK

## Create a kernel object: g = feature length, m = max mismatches,
## t = number of threads, approx = use the fast approximation algorithm
fastsk = FastSK(g=10, m=6, t=1, approx=True)
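The approx=True flag selects FastSK's Monte Carlo approximation, which estimates the sum of independent counting operations by sampling terms rather than enumerating all of them. The following is a toy sketch of that estimation idea only, not the package's kernel code:

```python
import random

def mc_estimate_sum(terms, n_samples, seed=0):
    """Estimate sum(terms) by uniform sampling with replacement:
    len(terms) * (mean of sampled terms) is an unbiased estimator
    of the full sum, and it converges as n_samples grows."""
    rng = random.Random(seed)
    sampled = [rng.choice(terms) for _ in range(n_samples)]
    return len(terms) * sum(sampled) / n_samples

terms = list(range(100))   # stand-in for per-mismatch-position counts
exact = sum(terms)         # 4950
approx = mc_estimate_sum(terms, 2000)
```

In FastSK the sampled terms are the per-mismatch-position counting operations, and sampling stops once the running estimate converges.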

Example Python usage script (assuming you have cloned the FastSK repository):

cd test
python run_check.py 

Experimental Results, Baselines, Utility Codes and Setup

  • All datasets we used are provided in the data subfolder
  • All scripts we used to generate results are under the results subfolder

Grid Search for FastSK and gkm-svm baseline

To run a grid search over the hyperparameter space (g, m, and C) and find the optimal parameters, use, for example, this utility script:

cd results/
python run_gridsearch.py
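The search loop itself is simple: evaluate every valid (g, m, C) combination and keep the best. The sketch below uses a hypothetical evaluate callback standing in for "compute the FastSK kernel, train an SVM, return validation AUC"; only the loop structure is the point:

```python
from itertools import product

def grid_search(evaluate, gs, ms, Cs):
    """Return the (g, m, C) triple with the highest score from
    evaluate(g, m, C). Combinations with m >= g are skipped, since
    the number of mismatch positions must be less than the feature
    length."""
    best, best_score = None, float("-inf")
    for g, m, C in product(gs, ms, Cs):
        if m >= g:
            continue
        score = evaluate(g, m, C)
        if score > best_score:
            best, best_score = (g, m, C), score
    return best, best_score

# Toy scoring function peaked at (g=10, m=6, C=1) for demonstration
best, score = grid_search(
    lambda g, m, C: -(g - 10) ** 2 - (m - 6) ** 2 - abs(C - 1),
    gs=range(4, 16), ms=range(0, 10), Cs=[0.01, 0.1, 1, 10])
print(best)  # (10, 6, 1)
```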

When comparing with Deep Learning baselines

  • You need to have PyTorch installed:
pip install torch torchvision
  • One utility script runs all datasets with hyperparameter tuning of the charCNN, each configuration repeated with 5 random seeds:
cd results/neural_nets
python run_cnn_hyperTrTune.py
  • We provide many other utility scripts to help users run CNN and RNN baselines
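Character-level baselines like the charCNN consume sequences as one-hot matrices rather than kernels. A minimal encoder sketch (the alphabet and unknown-character handling here are illustrative assumptions, not the repository's exact preprocessing):

```python
def one_hot_encode(seq, alphabet="ACGT"):
    """Map a sequence to a list of one-hot vectors over the given
    alphabet; characters outside the alphabet map to the all-zeros
    vector."""
    index = {c: i for i, c in enumerate(alphabet)}
    vectors = []
    for c in seq:
        v = [0] * len(alphabet)
        if c in index:
            v[index[c]] = 1
        vectors.append(v)
    return vectors
```

A CNN baseline would stack these vectors into a (sequence length × alphabet size) input matrix.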

Some of our experimental results comparing FastSK with baselines w.r.t. performance and speed.

Some of our experimental results comparing FastSK with character-based convolutional neural nets (CharCNN) when varying training size.

To Do:

  • a detailed user document, with example input files, output files, code, and perhaps a user group where people can post their questions

Citations

If you find this tool useful, please cite us!

@article{fast-gkm-svm,
    author = {Blakely, Derrick and Collins, Eamon and Singh, Ritambhara and Norton, Andrew and Lanchantin, Jack and Qi, Yanjun},
    title = "{FastSK: fast sequence analysis with gapped string kernels}",
    journal = {Bioinformatics},
    volume = {36},
    number = {Supplement_2},
    pages = {i857-i865},
    year = {2020},
    month = {12},
    abstract = "{Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task's alphabet size. In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK. Supplementary data are available at Bioinformatics online.}",
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btaa817},
    url = {https://doi.org/10.1093/bioinformatics/btaa817},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/36/Supplement\_2/i857/35337038/btaa817.pdf},
}

Legacy: if you prefer the executable built from the pure C++ source code (without the Python or R wrapper)

  • you can clone this repository:
git clone --recursive https://github.com/QData/FastSK.git

then run

cd FastSK
make

A fastsk executable will be installed to the bin directory, which you can use for kernel computation and inference. For example:

./bin/fastsk -g 10 -m 6 -C 1 -t 1 -a data/EP300.train.fasta data/EP300.test.fasta

This will run the approximate kernel algorithm on the EP300 TFBS dataset using a feature length of g = 10 with up to m = 6 mismatches. It will then train and evaluate an SVM classifier with the SVM parameter C = 1.
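The FASTA datasets encode the class label in each header line (e.g. >1 for a positive example, >0 for a negative one, as in the sample inputs shown in the issues below). A minimal parser sketch for this labeled-FASTA convention (a convenience helper, not part of the FastSK API):

```python
def read_labeled_fasta(path):
    """Parse a FASTA file whose header lines carry integer class
    labels, e.g. '>1' for positive and '>0' for negative examples.
    Returns parallel lists of sequences and labels; multi-line
    sequence records are joined."""
    labels, seqs, current = [], [], []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if current:
                    seqs.append("".join(current))
                    current = []
                labels.append(int(line[1:]))
            else:
                current.append(line)
    if current:
        seqs.append("".join(current))
    return seqs, labels
```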

fastsk's People

Contributors: dblakely, ec3bd, jacklanchantin, k-ivey, kevinsunofficial, qiyanjun, rs3zz, yuy898


fastsk's Issues

Regression mode?

Hi, if we have continuous data rather than a binary classification task, is there any easy way to run this as a support vector regression (SVR) problem rather than typical SVM? Or is this package intended purely as a classifier? Thanks!

iGaKCo core dumps in some cases

I ran iGaKCo as below and it core dumped:

$ ./iGakco -g 10 -m 8 -k kernel.txt -h 1 temp.fasta temp.fasta dict.txt labels.txt 
Input file : temp.fasta
Dictionary size = 5 (+1 for uknown character)
Reading temp.fasta
Read 2 strings of length = 12
numF=6, sumLen=24
(10,2): 6 features
Weights (hm):45 36 28 21 15 10 6 3 1 
Computing mismatch profiles using 8 threads...
*** Error in `./iGakco': free(): invalid size: 0x00007f40e4000c70 ***
*** Error in `./iGakco': free(): invalid size: 0x00007f40ec000c70 ***
*** Error in `./iGakco': free(): invalid size: 0x00007f40e0000c70 ***
*** Error in `./iGakco': free(): invalid size: 0x00007f40d4000c70 ***
*** Error in `./iGakco': free(): invalid size: 0x00007f40d8000c70 ***
======= Backtrace: =========
======= Backtrace: =========
======= Backtrace: =========
======= Backtrace: =========
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f40f373737a]
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f40f372e7e5]
*** Error in `./iGakco': free(): invalid size: 0x00007f40c8000c70 ***
Aborted (core dumped)

Input file looks like this:

>1
ACGACGTGACGA
>0
AGTCCCAAATCC

dict.txt is:

A
C
G
T


Obtain k-mer sequences and weights

Is there a way to retrieve a list of the most contributing features (i.e., k-mer sequences) and their associated weights? I have attempted to pull them by using the sklearn.feature_extraction module to no avail. However, I did notice (with my incredibly limited familiarity with c++) that there appears to be the tokenized feature array objects being passed to fastsk_kernel.cpp. Is this a potential source of extraction? Any help would be greatly appreciated!

What do 0 and 1 mean in the training data?

Hi,

I am trying to use FastSK, but a little confused about how to prepare the training dataset.

For example, I want to train a model of ATAC-seq in K562 cells, should the positive data be real ATAC-seq data and negative data be randomized seqs?

Does 0 represent positive seqs and 1 represent randomized negative seqs ?

Thanks a lot.
