
bert-dna's Introduction

bert-dna

Incorporating Pre-training of Deep Bidirectional Transformers and Convolutional Neural Networks to Interpret DNA Sequences

Recently, language representation models have drawn a lot of attention in the natural language processing (NLP) field due to their remarkable results. Among them, Bidirectional Encoder Representations from Transformers (BERT) has proven to be a simple yet powerful language model that achieved new state-of-the-art performance. BERT adopted contextualized word embeddings to capture the semantics of words in the context in which they appear. In this study, we present a novel technique, BERT-DNA, which incorporates the BERT-base multilingual model into bioinformatics to interpret the information in DNA sequences. We treated DNA sequences as sentences and transformed them into fixed-length, meaningful vectors in which each nucleotide is represented by a 768-dimensional vector. We observed that our BERT-based features improved sensitivity, specificity, accuracy, and MCC by 5-10% or more compared with the current state-of-the-art features in bioinformatics. Moreover, further experiments show that deep learning (represented here by convolutional neural networks) holds more potential for learning BERT features than traditional machine learning techniques. In conclusion, we suggest that BERT and deep convolutional neural networks could open a new avenue in bioinformatics modeling using sequence information.
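
As a rough illustration of this idea, the following is a minimal sketch using the Hugging Face transformers library; it is not the pipeline used in this repository (which relies on Google's original BERT scripts, as described in the steps below). A DNA sequence is treated as a sentence of single-nucleotide tokens and fed to the BERT-base multilingual model, which yields one 768-dimensional contextual vector per nucleotide:

    # Minimal sketch (assumes the Hugging Face "transformers" package as a
    # stand-in for the original Google BERT scripts used in this repository).
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = BertModel.from_pretrained("bert-base-multilingual-cased")
    model.eval()

    # Treat the DNA sequence as a "sentence" of single-nucleotide tokens.
    sequence = "ACGTAGCTAGGCTA"
    inputs = tokenizer(" ".join(sequence), return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # One 768-dimensional contextual embedding per token ([CLS]/[SEP] included).
    print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768)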

Dependencies

Prediction step-by-step:

Step 1

Use "extract_seq.py" file to generate JSON files

  • python extract_seq.py
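
For context, the kind of preprocessing this step performs might look like the following hypothetical sketch, which reads a FASTA file and writes one whitespace-separated sequence per line for the BERT feature extractor. File names and output format are assumptions; the actual extract_seq.py may differ.

    # Hypothetical sketch only -- the real extract_seq.py is in this repository
    # and its input/output format may differ (the step above mentions JSON files).
    # Assumes a FASTA input ("positive.fasta" is an assumed file name) and writes
    # one whitespace-separated sequence per line for the BERT feature extractor.
    from Bio import SeqIO  # Biopython

    with open("sequences.txt", "w") as out:
        for record in SeqIO.parse("positive.fasta", "fasta"):
            # BERT treats each nucleotide as a token, so separate them with spaces.
            out.write(" ".join(str(record.seq).upper()) + "\n")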

Step 2

Use the command in "bert2json.txt" to train the BERT model and extract features.
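
The exact command is kept in "bert2json.txt"; for orientation only, a typical invocation of Google BERT's extract_features.py (which writes one JSON record per input sequence) looks roughly like this, where all paths and parameter values are assumptions:

    # Assumed paths and parameters -- consult bert2json.txt for the exact command.
    python extract_features.py \
      --input_file=sequences.txt \
      --output_file=features.jsonl \
      --vocab_file=multi_cased_L-12_H-768_A-12/vocab.txt \
      --bert_config_file=multi_cased_L-12_H-768_A-12/bert_config.json \
      --init_checkpoint=multi_cased_L-12_H-768_A-12/bert_model.ckpt \
      --layers=-1 \
      --max_seq_length=128 \
      --batch_size=8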

Step 3

Use "jsonl2csv.py" to transfrom JSON to CSV files:

  • python jsonl2csv.py json_file csv_file
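
For reference, each line of the feature extractor's JSONL output holds one record with a per-token "features" list. A minimal sketch of the kind of flattening this step performs could look like this; the actual jsonl2csv.py may lay out the columns differently:

    # Hypothetical sketch of flattening BERT JSONL output into one CSV row per
    # sequence; the real jsonl2csv.py in this repository may differ.
    # Usage: python jsonl2csv.py json_file csv_file
    import csv
    import json
    import sys

    jsonl_path, csv_path = sys.argv[1], sys.argv[2]

    with open(jsonl_path) as fin, open(csv_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for line in fin:
            record = json.loads(line)
            row = []
            for token_feature in record["features"]:
                # layers[0]["values"] holds the 768-dimensional vector of the
                # requested layer (e.g. --layers=-1) for this token.
                row.extend(token_feature["layers"][0]["values"])
            writer.writerow(row)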

Step 4

Use "6mAtraining.py" to train the CNN model on the CSV files.
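
Purely as an illustration of this step, a minimal Keras sketch of training a 1D CNN on the extracted features is shown below; the input shape, file names, architecture, and hyperparameters are assumptions, and 6mAtraining.py should be consulted for the actual model:

    # Illustrative sketch only -- see 6mAtraining.py for the model actually used.
    # Assumes the CSV holds one flattened (seq_len x 768) feature row per sample
    # and that binary labels are stored separately; file names are assumptions.
    import numpy as np
    from tensorflow.keras import layers, models

    seq_len = 41                                         # assumed window length
    X = np.loadtxt("train_features.csv", delimiter=",")  # assumed file name
    y = np.loadtxt("train_labels.csv", delimiter=",")    # assumed file name
    X = X.reshape((-1, seq_len, 768))                    # one 768-d vector per nucleotide

    model = models.Sequential([
        layers.Conv1D(32, kernel_size=5, activation="relu", input_shape=(seq_len, 768)),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(64, kernel_size=3, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=10, batch_size=32, validation_split=0.1)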

bert-dna's People

Contributors

khanhlee


bert-dna's Issues

Feature Extraction Code

Dear authors, I am trying to reproduce your results, but it appears that the feature extraction code (e.g., k-mer, DAC, DCC) has not been uploaded; instead, the CSV files are read directly.
I would appreciate it if you could upload it so that I can generate the CSV files and run the program.

Data files

@khanhlee, are you able to share the uniprot.fasta and vocabs.txt files? Could you explain how these files are obtained?
