dna2vec's Introduction

dna2vec

dna2vec is an open-source library for training distributed representations of variable-length k-mers.

For more information, please refer to the paper: dna2vec: Consistent vector representations of variable-length k-mers

Installation

Note that this implementation has only been tested on Python 3.5.3, but we welcome contributions and bug reports that make it more accessible.

  1. Clone the dna2vec repository: git clone https://github.com/pnpnpn/dna2vec
  2. Install Python dependencies: pip3 install -r requirements.txt
  3. Test the installation: python3 ./scripts/train_dna2vec.py -c configs/small_example.yml

Training dna2vec embeddings

  1. Download hg38 from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz. This will take a while as it's 938MB.
  2. Untar with tar -zxvf hg38.chromFa.tar.gz. You should see FASTA files for chromosome 1 to 22: chr1.fa, chr2.fa, ..., chr22.fa.
  3. Move the 22 FASTA files to folder inputs/hg38/
  4. Start the training with: python3 ./scripts/train_dna2vec.py -c configs/hg38-20161219-0153.yml
  5. Wait for a couple of days ...
  6. Once the training is done, there should be a dna2vec-<ID>.w2v and a corresponding dna2vec-<ID>.txt file in your results/ directory.
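Steps 2 and 3 above can be scripted. The following is only a minimal sketch: the helper names are hypothetical, and the directory the tarball extracts into is not stated here, so it is passed in as `src_dir`.

```python
import shutil
from pathlib import Path


def autosome_fasta_names():
    """Return the 22 file names chr1.fa ... chr22.fa."""
    return ["chr{}.fa".format(i) for i in range(1, 23)]


def move_autosomes(src_dir, dst_dir="inputs/hg38"):
    """Move chr1.fa..chr22.fa from src_dir into dst_dir.

    src_dir is an assumption: point it at wherever tar placed the files.
    Returns the list of file names that were actually moved.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in autosome_fasta_names():
        path = src / name
        if path.exists():
            shutil.move(str(path), str(dst / name))
            moved.append(name)
    return moved
```

Other FASTA files in the archive (anything not named chr1.fa to chr22.fa) are left where they are.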

Reading pretrained dna2vec

You can read the pretrained dna2vec vectors in pretrained/dna2vec-*.w2v using the MultiKModel class in dna2vec/multi_k_model.py. For example:

from dna2vec.multi_k_model import MultiKModel

filepath = 'pretrained/dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v'
mk_model = MultiKModel(filepath)

You can fetch the vector representation of AAA with:

>>> mk_model.vector('AAA')
array([ 0.023137  ,  0.156295 , ...

Compute the cosine distance between two k-mers via dna2vec:

>>> mk_model.cosine_distance('AAA', 'GCT')
0.14546435594464155
>>> mk_model.cosine_distance('AAA', 'AAAA')
0.89000147450211231
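For reference, the cosine of the angle between two vectors can be computed by hand as sketched below. This is a generic illustration, not the code inside MultiKModel; whether `cosine_distance` returns this value or one minus it is not stated here, so compare against the library output before relying on either reading. The function name `cosine_similarity` is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Works on plain lists as well as NumPy arrays, since it only iterates.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

With the model loaded as above, `cosine_similarity(mk_model.vector('AAA'), mk_model.vector('GCT'))` would be the corresponding call.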

FAQ

Does the pre-trained dna2vec data (w2v file) cover all k-mers?

The pre-trained data should cover all k-mers for 3 ≤ k ≤ 8, that is, all 4**k possible k-mers for each k:

>>> [len(mk_model.model(k).vocab) for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
>>> [4**k for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
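The same check can be done k-mer by k-mer rather than by count. The sketch below enumerates every k-mer over the alphabet ACGT and reports which ones are absent from a given vocabulary; `all_kmers` and `missing_kmers` are hypothetical helper names, and `vocab` is any iterable of k-mer strings.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Return all len(alphabet)**k k-mers in lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def missing_kmers(vocab, k):
    """Return the k-mers of length k that do not appear in vocab."""
    present = set(vocab)
    return [kmer for kmer in all_kmers(k) if kmer not in present]
```

For a fully covered k, `missing_kmers(mk_model.model(k).vocab, k)` should come back empty, assuming `vocab` iterates over k-mer strings as the length check above suggests.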

Contribute

I would love for you to fork this project and send me a pull request. Please contribute.

License

This software is licensed under the MIT license.

dna2vec's People

Contributors

aldro61, dependabot[bot], pnpnpn

dna2vec's Issues

Pretrained set?

Hi,
What genome/sequence was the pretraining set done on? Can you make this available? I am running some initial experiments and would rather not lose time to training dna2vec for my proof of concept.

Thank you!

length longest string I can encode

Hi,
I would like to know what parameters I should use to get the vector representation of a string of length 45.
At the moment I cannot go beyond 25.

AttributeError: 'Word2Vec' object has no attribute 'wv'

Description:

python3 ./scripts/train_dna2vec.py -c configs/small_example.yml

Then:

File "./scripts/train_dna2vec.py", line 55, in write_vec
    self.model.wv.save_word2vec_format(out_filename, binary=False)
AttributeError: 'Word2Vec' object has no attribute 'wv'

Exception:

Traceback (most recent call last):
  File "./scripts/train_dna2vec.py", line 142, in <module>
    main()
  File "./scripts/train_dna2vec.py", line 139, in main
    run_main(args, inputs, out_fileroot)
  File "./scripts/train_dna2vec.py", line 88, in run_main
    learner.write_vec()
  File "./scripts/train_dna2vec.py", line 55, in write_vec
    self.model.wv.save_word2vec_format(out_filename, binary=False)
AttributeError: 'Word2Vec' object has no attribute 'wv'

env: using pip install -r requirements.txt

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
_openmp_mutex             4.5                       2_gnu    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
arrow                     0.8.0                    pypi_0    pypi
biopython                 1.68                     pypi_0    pypi
boto                      2.46.1                   pypi_0    pypi
bz2file                   0.98                     pypi_0    pypi
bzip2                     1.0.8                h7f98852_4    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
ca-certificates           2022.6.15            ha878542_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
certifi                   2022.6.15                pypi_0    pypi
chardet                   3.0.4                    pypi_0    pypi
configargparse            0.11.0                   pypi_0    pypi
gensim                    0.13.2                   pypi_0    pypi
idna                      2.7                      pypi_0    pypi
ld_impl_linux-64          2.36.1               hea4e1c9_2    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libffi                    3.4.2                h7f98852_5    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgcc-ng                 12.1.0              h8d9b700_16    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgomp                   12.1.0              h8d9b700_16    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libnsl                    2.0.0                h7f98852_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libstdcxx-ng              12.1.0              ha89aaad_16    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libuuid                   2.32.1            h7f98852_1000    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libzlib                   1.2.12               h166bdaf_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
logbook                   1.0.0                    pypi_0    pypi
ncurses                   6.3                  h27087fc_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
numpy                     1.16.0                   pypi_0    pypi
openssl                   1.1.1p               h166bdaf_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
pep8                      1.7.0                    pypi_0    pypi
pip                       21.2.4             pyhd8ed1ab_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
pluggy                    0.4.0                    pypi_0    pypi
py                        1.4.33                   pypi_0    pypi
pytest                    3.0.7                    pypi_0    pypi
python                    3.6.15          hb7a2778_0_cpython    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
python-dateutil           2.6.0                    pypi_0    pypi
readline                  8.1.2                h0f457ee_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
requests                  2.20.0                   pypi_0    pypi
scipy                     0.19.0                   pypi_0    pypi
setuptools                36.4.0                   py36_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
six                       1.10.0                   pypi_0    pypi
smart-open                1.5.1                    pypi_0    pypi
sqlite                    3.39.0               h4ff8645_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
tk                        8.6.12               h27826a3_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
tox                       2.7.0                    pypi_0    pypi
tox-pyenv                 1.0.3                    pypi_0    pypi
tzdata                    2022a                h191b570_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
urllib3                   1.24.3                   pypi_0    pypi
virtualenv                15.1.0                   pypi_0    pypi
wheel                     0.37.1             pyhd8ed1ab_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
xz                        5.2.5                h516909a_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
zlib                      1.2.12               h166bdaf_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
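The environment above lists gensim 0.13.2, and the traceback shows that its `Word2Vec` object has no `wv` attribute. One possible workaround is a small compatibility shim, sketched below. It assumes (not verified here) that such older gensim releases expose `save_word2vec_format` directly on the model; the helper name `keyed_vectors` is hypothetical.

```python
def keyed_vectors(model):
    """Return model.wv when it exists, otherwise the model itself.

    Assumption: releases without `wv` provide save_word2vec_format
    on the Word2Vec object directly.
    """
    return getattr(model, "wv", model)


# Hypothetical use inside write_vec in scripts/train_dna2vec.py:
# keyed_vectors(self.model).save_word2vec_format(out_filename, binary=False)
```

Installing a gensim release that provides `wv` would avoid the need for the shim altogether.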

Pre-image / component mapping?

Hi,
Can you please make it explicit how to obtain a pre-image from a mapped vector?
Additionally, can you explain how the components v_j of the vectors in V are related to the sequence components s_i in the sequence space S?

Best wishes

Incorrect embedding dimension after training

I want to use dna2vec for the E. coli genome.
When I set 2<=k<=8, I got (86479,100);
When I set 3<=k<=8, I got (86614,100), but the correct dimension should be (87360,100), since $87360+16=4^2+4^3+4^4+4^5+4^6+4^7+4^8$.
So I don't know why I got 2 different results.
I also checked every k-mer length from 2 to 8 and found that the dimension is correct from 2 to 7.
However, for k=8, the dimension is (64450,100) rather than (65536,100), and $65536-64450 != 87360-86614$.
This is horrible! The numbers do not match anywhere.
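The expected totals quoted in this issue can be reproduced with a short calculation, and the number of distinct k-mers that actually occur in a sequence can be counted directly. This is only a sketch with hypothetical helper names; it does not establish why the trained vocabulary is smaller (k-mers absent from the genome and a minimum-count threshold in training are both possibilities that would need checking).

```python
def expected_vocab_size(k_low, k_high):
    """Total number of possible k-mers over ACGT for k_low <= k <= k_high."""
    return sum(4 ** k for k in range(k_low, k_high + 1))


def distinct_kmers(seq, k):
    """Set of distinct k-mers in seq, skipping windows with non-ACGT letters."""
    seq = seq.upper()
    valid = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    }


# expected_vocab_size(3, 8) is 87360 and expected_vocab_size(2, 8) is 87376.
# The reported shortfalls are 65536 - 64450 = 1086 for k=8 alone
# and 87360 - 86614 = 746 for the combined 3..8 model.
```

Comparing `len(distinct_kmers(genome, 8))` with 65536 would show whether every 8-mer occurs in the input at all.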

installation/training fails unless run from scripts folder

$ python3 ./scripts/train_dna2vec.py -c configs/small_example.yml
Traceback (most recent call last):
  File "./scripts/train_dna2vec.py", line 12, in <module>
    from attic_util.time_benchmark import Benchmark
ImportError: No module named 'attic_util'

this is executed from ~/dna2vec

The reason for this is that in train_dna2vec.py the relative paths to attic_util and dna2vec are appended to sys.path. Idiosyncratically, Python resolves the '../' relative to the folder that the script was called from.

The workaround is easy: just call the script from within ./scripts.

For a cleaner implementation, though, it might be better to consider using an egg or some other setup that allows attic_util and dna2vec to be imported from elsewhere.
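Another option is to resolve the repository root from the script's own location instead of the current working directory. The sketch below assumes the script lives one level below the repository root (as in ./scripts/train_dna2vec.py); the function names are hypothetical and this is not the project's actual code.

```python
import os
import sys


def repo_root_from_script(script_path):
    """Return the parent of the directory that contains script_path."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.dirname(script_dir)


def add_repo_root_to_path(script_path):
    """Put the repository root at the front of sys.path (once) and return it."""
    root = repo_root_from_script(script_path)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

Calling `add_repo_root_to_path(__file__)` at the top of the script, before the `attic_util` and `dna2vec` imports, would make those imports independent of where the script is launched from.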

mm10

Do you have pretrained vectors for mm10?

Thanks

dna2vec against large dataset

We are trying to run dna2vec against a large db (the ncbi nt dataset) which has ~47m sequences in it. Do you know of any issues with doing this (aside from it taking a really long time)?

I am seeing that the PROGRESS messages report we are on sentence 105m, but we are still on epoch #1. I think we should be on epoch 3 based on the progress messages.

Do you have any thoughts on why this might be the case?

Encoding longer sequences

Is there already any implemented function to encode longer sequences (such as sequencing reads) using their k-mers embeddings?
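One common baseline, sketched here as an assumption rather than a feature of dna2vec, is to slide a window of length k over the read and average the k-mer vectors. `vector_fn` is any callable mapping a k-mer string to a vector (for example `mk_model.vector`); the helper names are hypothetical.

```python
def sliding_kmers(seq, k):
    """Return the overlapping k-mers of seq, in order (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mean_kmer_embedding(seq, k, vector_fn):
    """Average the vectors of all overlapping k-mers in seq.

    Returns a plain list of floats with the same dimension as the
    vectors produced by vector_fn.
    """
    kmers = sliding_kmers(seq, k)
    if not kmers:
        raise ValueError("sequence is shorter than k")
    vectors = [list(vector_fn(kmer)) for kmer in kmers]
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
```

Averaging discards k-mer order, so it is a starting point rather than a faithful encoding of the read; k-mers containing letters outside the model's vocabulary (such as N) would need to be filtered first.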
