pnpnpn / dna2vec

dna2vec: Consistent vector representations of variable-length k-mers

License: MIT License

Python 100.00%

computational-biology bioinformatics word-embeddings embeddings machine-learning neural-network word2vec ml python nlp

dna2vec's Introduction

dna2vec

Dna2vec is an open-source library to train distributed representations of variable-length k-mers.

For more information, please refer to the paper: dna2vec: Consistent vector representations of variable-length k-mers

Installation

Note that this implementation has only been tested on Python 3.5.3, but we welcome contributions and bug reports to make it more accessible.

Clone the dna2vec repository: git clone https://github.com/pnpnpn/dna2vec
Install Python dependencies: pip3 install -r requirements.txt
Test the installation: python3 ./scripts/train_dna2vec.py -c configs/small_example.yml

Training dna2vec embeddings

Download hg38 from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz. This will take a while as it's 938MB.
Untar with tar -zxvf hg38.chromFa.tar.gz. You should see FASTA files for chromosome 1 to 22: chr1.fa, chr2.fa, ..., chr22.fa.
Move the 22 FASTA files to folder inputs/hg38/
Start the training with: python3 ./scripts/train_dna2vec.py -c configs/hg38-20161219-0153.yml
Wait for a couple of days ...
Once the training is done, there should be a dna2vec-<ID>.w2v and a corresponding dna2vec-<ID>.txt file in your results/ directory.
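
Before starting a multi-day run, it may be worth confirming that the inputs are where the steps above put them. The following is a minimal sketch, not part of dna2vec: the helper name missing_chromosome_files is hypothetical, and it only assumes the folder inputs/hg38/ and the file names chr1.fa to chr22.fa mentioned above.

```python
from pathlib import Path


def missing_chromosome_files(input_dir):
    """Return the expected chr1.fa ... chr22.fa names that are absent from input_dir."""
    expected = ["chr{}.fa".format(i) for i in range(1, 23)]
    folder = Path(input_dir)
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    # "inputs/hg38" is the folder named in the training steps above.
    missing = missing_chromosome_files("inputs/hg38")
    if missing:
        print("Missing FASTA files:", ", ".join(missing))
    else:
        print("All 22 FASTA files are in place.")
```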

Reading pretrained dna2vec

You can read the pretrained dna2vec vectors in pretrained/dna2vec-*.w2v using the MultiKModel class in dna2vec/multi_k_model.py. For example:

from dna2vec.multi_k_model import MultiKModel

filepath = 'pretrained/dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v'
mk_model = MultiKModel(filepath)

You can fetch the vector representation of AAA with:

>>> mk_model.vector('AAA')
array([ 0.023137  ,  0.156295 , ...

Compute the cosine distance between two k-mers via dna2vec:

>>> mk_model.cosine_distance('AAA', 'GCT')
0.14546435594464155
>>> mk_model.cosine_distance('AAA', 'AAAA')
0.89000147450211231
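
For readers who want to see the underlying arithmetic, here is a small pure-Python sketch of cosine similarity and of the "one minus similarity" distance. The function names are illustrative only. The sketch does not assert which of the two conventions MultiKModel.cosine_distance follows; check dna2vec/multi_k_model.py if the distinction matters for your use.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_minus_cosine(u, v):
    """A common definition of cosine distance: 0.0 for parallel vectors, 2.0 for opposite ones."""
    return 1.0 - cosine_similarity(u, v)
```

Both functions accept plain lists as well as the arrays returned by mk_model.vector.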

FAQ

Does the pre-trained dna2vec data (`w2v` file) cover all k-mers?

The pre-trained data should cover all k-mers for 3 ≤ k ≤ 8:

>>> [len(mk_model.model(k).vocab) for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
>>> [4**k for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
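
The count comparison above can be extended to name the k-mers that are absent from a vocabulary. This is a sketch under the assumption that the vocabulary can be treated as a container of strings supporting the in operator; the helper names all_kmers and missing_kmers are hypothetical.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Enumerate every k-mer over the alphabet; there are len(alphabet)**k of them."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


def missing_kmers(vocab, k):
    """Return the k-mers of length k that do not appear in vocab."""
    return [kmer for kmer in all_kmers(k) if kmer not in vocab]
```

An empty result from missing_kmers for every k from 3 to 8 would confirm full coverage, not just matching counts.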

Contribute

I would love for you to fork this project and send me a pull request. Please contribute.

License

This software is licensed under the MIT license.

dna2vec's Issues

How to train word embedding with dimensions less than 100

Pretrained set?

Hi,
What genome/sequence was the pretraining set done on? Can you make this available? I am running some initial experiments and would rather not lose time to training dna2vec for my proof of concept.

Thank you!

Length of the longest string I can encode

Hi,
I would like to know which parameters I should use to get the vector representation of a string of length 45.
At the moment I cannot go beyond 25.

AttributeError: 'Word2Vec' object has no attribute 'wv'

Description:

python3 ./scripts/train_dna2vec.py -c configs/small_example.yml

Then:

File "./scripts/train_dna2vec.py", line 55, in write_vec
    self.model.wv.save_word2vec_format(out_filename, binary=False)
AttributeError: 'Word2Vec' object has no attribute 'wv'

Exception:

Traceback (most recent call last):
  File "./scripts/train_dna2vec.py", line 142, in <module>
    main()
  File "./scripts/train_dna2vec.py", line 139, in main
    run_main(args, inputs, out_fileroot)
  File "./scripts/train_dna2vec.py", line 88, in run_main
    learner.write_vec()
  File "./scripts/train_dna2vec.py", line 55, in write_vec
    self.model.wv.save_word2vec_format(out_filename, binary=False)
AttributeError: 'Word2Vec' object has no attribute 'wv'

env: using pip install -r requirements.txt

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
_openmp_mutex             4.5                       2_gnu    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
arrow                     0.8.0                    pypi_0    pypi
biopython                 1.68                     pypi_0    pypi
boto                      2.46.1                   pypi_0    pypi
bz2file                   0.98                     pypi_0    pypi
bzip2                     1.0.8                h7f98852_4    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
ca-certificates           2022.6.15            ha878542_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
certifi                   2022.6.15                pypi_0    pypi
chardet                   3.0.4                    pypi_0    pypi
configargparse            0.11.0                   pypi_0    pypi
gensim                    0.13.2                   pypi_0    pypi
idna                      2.7                      pypi_0    pypi
ld_impl_linux-64          2.36.1               hea4e1c9_2    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libffi                    3.4.2                h7f98852_5    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgcc-ng                 12.1.0              h8d9b700_16    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgomp                   12.1.0              h8d9b700_16    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libnsl                    2.0.0                h7f98852_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libstdcxx-ng              12.1.0              ha89aaad_16    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libuuid                   2.32.1            h7f98852_1000    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libzlib                   1.2.12               h166bdaf_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
logbook                   1.0.0                    pypi_0    pypi
ncurses                   6.3                  h27087fc_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
numpy                     1.16.0                   pypi_0    pypi
openssl                   1.1.1p               h166bdaf_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
pep8                      1.7.0                    pypi_0    pypi
pip                       21.2.4             pyhd8ed1ab_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
pluggy                    0.4.0                    pypi_0    pypi
py                        1.4.33                   pypi_0    pypi
pytest                    3.0.7                    pypi_0    pypi
python                    3.6.15          hb7a2778_0_cpython    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
python-dateutil           2.6.0                    pypi_0    pypi
readline                  8.1.2                h0f457ee_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
requests                  2.20.0                   pypi_0    pypi
scipy                     0.19.0                   pypi_0    pypi
setuptools                36.4.0                   py36_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
six                       1.10.0                   pypi_0    pypi
smart-open                1.5.1                    pypi_0    pypi
sqlite                    3.39.0               h4ff8645_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
tk                        8.6.12               h27826a3_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
tox                       2.7.0                    pypi_0    pypi
tox-pyenv                 1.0.3                    pypi_0    pypi
tzdata                    2022a                h191b570_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
urllib3                   1.24.3                   pypi_0    pypi
virtualenv                15.1.0                   pypi_0    pypi
wheel                     0.37.1             pyhd8ed1ab_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
xz                        5.2.5                h516909a_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
zlib                      1.2.12               h166bdaf_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
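
The environment listing shows gensim 0.13.2, and the traceback says that this Word2Vec object has no wv attribute, so the script and the installed gensim appear to disagree about where save_word2vec_format lives. A version-tolerant sketch, assuming only that the method exists either on model.wv or on the model itself (the helper name save_vectors_text is hypothetical and is not the script's actual code):

```python
def save_vectors_text(model, out_filename):
    """Write vectors in word2vec text format, with or without a `wv` attribute.

    Newer gensim releases expose the vectors under model.wv; older ones
    provide save_word2vec_format directly on the model.
    """
    target = getattr(model, "wv", model)
    target.save_word2vec_format(out_filename, binary=False)
```

Installing a gensim release that matches the script is the other option; which release that is cannot be determined from the report alone.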

Pre-image / component mapping?

Hi,
Can you please make it explicit how to obtain a pre-image from a mapped vector?
Additionally, can you explain how the components v_j of the vectors in V are related to the sequence components s_i in the sequence space S?

Best wishes

DeprecationWarning("Deprecated. Use model.wv.save_word2vec_format instead."

scripts/train_dna2vec.py line 55

Incorrect embedding dimension after training

I want to use dna2vec for the E. coli genome.
When I set 2<=k<=8, I got (86479,100).
When I set 3<=k<=8, I got (86614,100), whereas the correct shape should be (87360,100), since $87360=4^3+4^4+4^5+4^6+4^7+4^8$ and $87360+16=4^2+4^3+4^4+4^5+4^6+4^7+4^8$.
So I don't know why I got two different results.
I also checked each k from 2 to 8 and found that the dimension is correct for k from 2 to 7.
However, for k=8, the shape is (64450,100) rather than (65536,100), and $65536-64450 \neq 87360-86614$.
This is very confusing: none of these numbers match up.
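
The $4^k$ totals are upper bounds: they are reached only if every possible k-mer actually occurs in the training input. One possible, unverified explanation for a smaller vocabulary is that some 8-mers never occur (or occur too rarely) in the data seen during training. The sketch below, with hypothetical helper names, counts the distinct k-mers present in a sequence so that the observed count can be compared with the theoretical one.

```python
def distinct_kmers(sequence, k, alphabet="ACGT"):
    """Collect the distinct k-mers in sequence, skipping windows with other symbols such as N."""
    allowed = set(alphabet)
    seq = sequence.upper()
    seen = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= allowed:
            seen.add(kmer)
    return seen


def expected_vocab_size(k_low, k_high):
    """Upper bound on the vocabulary size: the sum of 4**k for k_low <= k <= k_high."""
    return sum(4 ** k for k in range(k_low, k_high + 1))
```

Here expected_vocab_size(3, 8) is 87360 and expected_vocab_size(2, 8) is 87376, matching the totals in the report.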

Increasing the embedding vector from 100 to 200

Hello, is there a way to increase the dimension of the vectors?
For example, by padding each vector from 100 to 200 dimensions with zeros?
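
Zero-padding itself is straightforward, as the sketch below shows (the helper name zero_pad is hypothetical). Note that the added zeros contribute nothing to dot products or norms, so cosine similarities between padded vectors are unchanged; padding only helps when a downstream component requires a fixed input width.

```python
def zero_pad(vector, target_dim):
    """Return a copy of vector extended with trailing zeros up to target_dim entries."""
    values = [float(x) for x in vector]
    if len(values) > target_dim:
        raise ValueError("vector is already longer than target_dim")
    return values + [0.0] * (target_dim - len(values))
```

Getting 200 informative dimensions would require retraining with a larger embedding size rather than padding.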

ImportError: cannot import name 'Mapping' from 'collections'

I am using Python 3.10, and the training step fails when I test the installation.
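
Python 3.10 removed the abstract base class aliases such as Mapping from the collections module; they now live only in collections.abc. The issue does not say which package performs the failing import, so the following is only a sketch of the compatible import pattern that the affected code would need:

```python
try:
    # Location of the abstract base classes since Python 3.3.
    from collections.abc import Mapping
except ImportError:
    # Fallback for very old interpreters that predate collections.abc.
    from collections import Mapping


def is_mapping(obj):
    """True for dict-like objects, regardless of where Mapping was imported from."""
    return isinstance(obj, Mapping)
```

If the import sits in a pinned third-party dependency, running under the Python version the project was tested on (3.5.3, per the installation notes) avoids the problem without code changes.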

Why does training fail even though I followed all the steps and saw no errors?

Nothing happens. The script ends after a moment, and there is no result file in the directory. I downloaded the .fa files and extracted them to the input folder, and I set up the environment exactly as described. What is the most likely cause?

installation/training fails unless run from scripts folder

$ python3 ./scripts/train_dna2vec.py -c configs/small_example.yml
Traceback (most recent call last):
  File "./scripts/train_dna2vec.py", line 12, in <module>
    from attic_util.time_benchmark import Benchmark
ImportError: No module named 'attic_util'

This is executed from ~/dna2vec.

The reason is that in train_dna2vec.py the relative paths to attic_util and dna2vec are appended to sys.path. Idiosyncratically, Python resolves the '../' relative to the folder that the script was called from.

The workaround is easy: just call the script from within ./scripts.

For a cleaner implementation, though, it might be better to consider using an egg or some other setup that allows attic_util and dna2vec to be imported from elsewhere.
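
A lighter alternative to packaging is to resolve the path relative to the script file instead of the current working directory. This is a sketch, assuming that attic_util and dna2vec sit in the repository root, one level above scripts/; the helper name add_repo_root_to_path is hypothetical.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(script_file):
    """Put the parent of the script's folder on sys.path, independent of the working directory."""
    repo_root = Path(script_file).resolve().parent.parent
    root_str = str(repo_root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return repo_root
```

Calling add_repo_root_to_path(__file__) at the top of the script, before the project imports, would make the command work from any directory.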

mm10

Do you have pretrained vectors for mm10?

Thanks

dna2vec against large dataset

We are trying to run dna2vec against a large db (the ncbi nt dataset) which has ~47m sequences in it. Do you know of any issues with doing this (aside from it taking a really long time)?

I am seeing that the PROGRESS messages report we are on sentence 105m, but we are still on epoch #1. I think we should be on epoch 3 based on the progress messages.

Do you have any thoughts on why this might be the case?

Encoding longer sequences

Is there already any implemented function to encode longer sequences (such as sequencing reads) using their k-mers embeddings?
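
One common baseline, not something the README states the library provides, is to average the embeddings of all k-mers in a sliding window over the read. The sketch below assumes a lookup callable that maps a k-mer string to its vector, for example mk_model.vector from the README; the function names are hypothetical.

```python
def sliding_kmers(sequence, k):
    """All overlapping k-mers of sequence, in order (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mean_kmer_embedding(sequence, k, lookup):
    """Average the vectors returned by lookup over every k-mer of sequence."""
    kmers = sliding_kmers(sequence, k)
    if not kmers:
        raise ValueError("sequence is shorter than k")
    vectors = [lookup(kmer) for kmer in kmers]
    dim = len(vectors[0])
    return [sum(vec[j] for vec in vectors) / len(vectors) for j in range(dim)]
```

For example, mean_kmer_embedding(read, 6, mk_model.vector) yields one fixed-length vector per read. Averaging discards the order of the k-mers, so it is a starting point rather than a faithful encoding.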

model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs) TypeError: load() missing 1 required positional argument: 'fname_or_handle'

DeprecationWarning: Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.

dna2vec/dna2vec/multi_k_model.py

Line 17 in 8d033e9

self.aggregate = word2vec.Word2Vec.load_word2vec_format(filepath, binary=False)
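
The replacement named in the warning is available in gensim releases that ship KeyedVectors. A minimal sketch of the equivalent load follows; the wrapper name load_vectors is hypothetical, and the rest of MultiKModel may need further changes that are not shown here.

```python
def load_vectors(filepath):
    """Load word2vec text-format vectors via the non-deprecated gensim API."""
    # Imported lazily so that merely defining this helper does not require gensim.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(filepath, binary=False)
```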