
hazyresearch / hyena-dna


Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena

Home Page: https://arxiv.org/abs/2306.15794

License: Apache License 2.0

C++ 1.83% Cuda 0.74% Python 1.68% C 0.01% CSS 0.05% JavaScript 0.10% HTML 2.29% CMake 0.07% Makefile 0.01% Assembly 87.03% PHP 0.27% Pawn 4.87% POV-Ray SDL 1.03% Dockerfile 0.01%
Topics: foundation-models, genomics, language-models

hyena-dna's People

Contributors

awaelchli, cbirchsy, eltociear, exnx, jondeaton, miking98


hyena-dna's Issues

Some doubts about downstream tasks

Hello, first of all, thank you for your open-source contributions and the detailed README file!

I'm an undergraduate student who has just started exploring deep learning. My research focus is on DNA tokenization, but there are very few datasets available. My idea is to use prompt learning to tackle this task.

Given the limited research in this area, I've come across the possibility of using DNA tokenization as a downstream task for your model. However, after carefully reading your paper and the repository's README file, especially the "More advanced stuff below" section, I find it challenging to understand all the content due to my limited expertise. I'm still unsure whether DNA tokenization can be used as a downstream task for your model. Is this possible?

I would greatly appreciate it if you could provide some advice or guidance on this.

I understand that this may not be within your obligations, so if you're too busy to respond, please feel free to close this issue. Thank you for taking the time to consider my request!

Nucleotide vs codons

Have you ever considered using codons instead of single-nucleotide prediction? That is, converting the sequence to codon codes, training on those, and trying to predict the next codon? (A small sketch of the conversion is below.)
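For illustration, here is a minimal sketch of that conversion in Python (this is not something the repo implements; the mapping and ordering are arbitrary choices, and it assumes an A/C/G/T-only sequence):

    from itertools import product

    # map each of the 4^3 = 64 possible codons to its own token id
    CODON_TO_ID = {"".join(c): i for i, c in enumerate(product("ACGT", repeat=3))}

    def to_codon_ids(seq: str):
        seq = seq[: len(seq) - len(seq) % 3]  # drop any trailing partial codon
        return [CODON_TO_ID[seq[i:i + 3]] for i in range(0, len(seq), 3)]

    print(to_codon_ids("ATGGCCTAA"))  # [14, 37, 48] with the ordering above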

MCC value problem on Nucleotide Transformer

I really appreciate your work, but when I train the model on the Nucleotide Transformer downstream task, no MCC result is returned. The details are as follows; could you please help me solve it? Thank you very much!
Traceback (most recent call last):
File "/root/code/dnabert3/hyena-dna/train.py", line 692, in main
train(config)
File "/root/code/dnabert3/hyena-dna/train.py", line 673, in train
trainer.fit(model)
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
self.fit_loop.run()
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.on_advance_end()
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 295, in on_advance_end
self.trainer._call_callback_hooks("on_train_epoch_end")
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1380, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 312, in on_train_epoch_end
self._save_topk_checkpoint(trainer, monitor_candidates)
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 367, in _save_topk_checkpoint
raise MisconfigurationException(m)
lightning_lite.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='val/mcc') could not find the monitored key in the returned metrics: ['trainer/loss', 'trainer/epoch', 'val/accuracy', 'val/loss', 'train/accuracy', 'train/loss', 'epoch', 'step']. HINT: Did you call log('val/mcc', value) in the LightningModule?
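For reference, the error just means that no metric named val/mcc was ever logged during validation. In this repo the metric list normally comes from the Hydra task config, so the real fix may simply be enabling an MCC metric there; the following is only a generic Lightning sketch (using torchmetrics, with illustrative names) of what the checkpoint callback expects to find:

    import pytorch_lightning as pl
    from torchmetrics.classification import MatthewsCorrCoef

    class LitClassifier(pl.LightningModule):
        def __init__(self, model, num_classes):
            super().__init__()
            self.model = model
            self.val_mcc = MatthewsCorrCoef(task="multiclass", num_classes=num_classes)

        def validation_step(self, batch, batch_idx):
            x, y = batch
            logits = self.model(x)
            self.val_mcc.update(logits.argmax(dim=-1), y)

        def on_validation_epoch_end(self):
            # the key must match ModelCheckpoint(monitor='val/mcc') exactly
            self.log("val/mcc", self.val_mcc.compute(), prog_bar=True)
            self.val_mcc.reset()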

Unable to download the Human Reference Genome data

Unable to download the hg38 dataset: AccessDeniedException: 403 user does not have serviceusage.services.use access to the Google Cloud project. Permission 'serviceusage.services.use' denied on resource (or it may not exist).
Is there any way we could download/access the dataset? Thanks.

Model output changes randomly

First of all, thanks for making this great tool available!

I was trying to obtain embeddings for some sequences, and noticed that the model output changed when calling it repeatedly with the same input:

tokenizer = CharacterTokenizer(
        characters=['A', 'C', 'G', 'T', 'N'],
        model_max_length=32768 + 2,
        add_special_tokens=False,
        padding_side='left',
    )

model = HyenaDNAPreTrainedModel.from_pretrained(
            './checkpoints',
            'hyenadna-small-32k-seqlen',
            download=True,
            config=None,
            device="cpu",
            use_head=False,
            n_classes=2,
        )

sequence = "ATCG"

model(torch.LongTensor(tokenizer(sequence)["input_ids"]).unsqueeze(0))[0,0,:5]
# [-0.6169,  1.0377,  0.0526, -1.0487,  0.9169]
model(torch.LongTensor(tokenizer(sequence)["input_ids"]).unsqueeze(0))[0,0,:5]
# [-0.5084,  0.6375,  0.4707, -0.8912,  1.1417]
model(torch.LongTensor(tokenizer(sequence)["input_ids"]).unsqueeze(0))[0,0,:5]
# [-0.5762,  0.3669,  0.1919, -0.7438,  1.0702]

I would have expected the model to be deterministic... what am I missing here?
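One likely explanation (an assumption, not confirmed by the maintainers here) is that the model is still in training mode, so dropout is re-sampled on every call. A minimal sketch of a deterministic forward pass:

    import torch

    model.eval()  # disables dropout and other train-time behaviour
    with torch.inference_mode():
        ids = torch.LongTensor(tokenizer(sequence)["input_ids"]).unsqueeze(0)
        out1 = model(ids)[0, 0, :5]
        out2 = model(ids)[0, 0, :5]
    print(torch.allclose(out1, out2))  # expected True once eval mode is set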

Questions on running as module

I have completed the conda environment setup, and the command in the "Quick Entry point" section, python -m train wandb=null experiment=hg38/genomic_benchmark_scratch, runs correctly.

But when I run the command in the "In-context Learning" section, something goes wrong, as shown below. Could you help me figure out what's wrong? Thank you for your time!

(hyena-dna) zhguo@Dell:~/git/hyena-dna$ python -m evals/soft_prompting_genomics
/home/zhguo/app/miniconda/envs/hyena-dna/bin/python: No module named evals/soft_prompting_genomics

nucleotide finetuning

When I ran python -m train wandb=null experiment=hg38/nucleotide_transformer dataset_name=enhancer dataset.max_length=500 model.layer.l_max=1026, the following error occurred:
Could not override 'dataset_name'.
To append to your config use +dataset_name=enhancer
Key 'dataset_name' is not in struct
full_key: dataset_name
object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Segmentation fault (core dumped)

Genome build versions are inconsistent in reference and chromatin profile (DeepSEA benchmark)

I am attempting to replicate the validation process for the DeepSEA benchmark. The original DeepSEA version is hg19, while the reference genome is hg38. I've noticed that liftover is available in the source code, specifically within the ChromatinProfileDataset class. However, using this liftover functionality seems to be restricted unless I directly use the ChromatinProfileDataset.

Within the class ChromatinProfile, arguments for ChromatinProfileDataset are:

ref_genome_version = self.ref_genome_version
coords_target_path = f'{self.data_path}/{split}_{self.ref_genome_version}_coords_targets.csv'

This code forces the genome version of the reference and dataset to be the same.

My question is whether it's possible to introduce flexibility into the package or provide an updated version of the DeepSEA benchmark that supports the hg38 reference genome?
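For context, a standalone hg19-to-hg38 conversion outside of ChromatinProfileDataset could look like the sketch below (pyliftover is just one choice of tool, an assumption here; the repo's internal liftover path may differ):

    from pyliftover import LiftOver

    lo = LiftOver("hg19", "hg38")  # fetches the UCSC chain file on first use
    hits = lo.convert_coordinate("chr1", 1000000)  # [(chrom, pos, strand, score)] or [] if unmapped
    if hits:
        chrom, pos, strand, _ = hits[0]
        print(chrom, pos, strand)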

Error in Pretraining on Human Genome

Hello, I was trying to follow your directions on pretraining on the human genome (as a test before I try to pretrain on my own data) and I keep getting this error:

RuntimeError: Trying to resize storage that is not resizable

The first time it happened after training epoch 40 and the second time after training epoch 60. Do you know what the error could be?

Thanks for any help. I do not seem to have any problems with fine-tuning.

Thanks,
LeAnn

Epoch 60: 95%|▉| 135/142 [00:15<00:00, 8.94it/s, loss=1.17, val/loss=1.170, val/num_tokens=1.37e+8, val/perplexity=3.230, test/loss=1.170, test/num_tokens=1.2e+8, test/perplex...]
Error executing job with overrides: ['wandb=null', 'experiment=hg38/hg38_hyena', 'model.d_model=128', 'model.n_layer=2', 'dataset.batch_size=256', 'train.global_batch_size=256', 'dataset.max_length=1024', 'optimizer.lr=6e-4', 'trainer.devices=1']
Traceback (most recent call last):
File "/uufs/chpc.utah.edu/common/home/sundar-group2/PHAGE/MODELS/P100_HYENA/hyena-dna/train.py", line 691, in main
train(config)
File "/uufs/chpc.utah.edu/common/home/sundar-group2/PHAGE/MODELS/P100_HYENA/hyena-dna/train.py", line 672, in train
trainer.fit(model)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
self.fit_loop.run()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.on_advance_end()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 251, in on_advance_end
self._run_validation()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 310, in _run_validation
self.val_loop.run()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 121, in advance
batch = next(data_fetcher)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in next
return self.fetching_function()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 265, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
batch = next(iterator)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in next
data = self._next_data()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
return self._process_data(data)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
data.reraise()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
return collate(batch, collate_fn_map=default_collate_fn_map)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 143, in collate
return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed] # Backwards compatibility.
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 143, in
return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed] # Backwards compatibility.
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 120, in collate
return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/utils/collate.py", line 162, in collate_tensor_fn
out = elem.new(storage).resize
(len(batch), *list(elem.size()))

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

The tokenizer's bug in Huggingface

The code in the tokenizer has a bug; it seems to be missing the [CLS] token.

The code on Hugging Face is:

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        sep = [self.sep_token_id]
        # cls = [self.cls_token_id]
        result = token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

but I think it should be

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
        result = cls + token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

according to the code on GitHub.

How to correctly provide padding tokens to forward pass of pretrained model?

Hi there, thanks for this repo and the pretrained models.

I have a question on batching sequences of varying length. I've found the padding token and tokenizer to work effectively, but I see no input of an attention mask to the forward pass of the model.

I've tried passing a padded sequence, e.g. padded with 4s as output by the tokenizer, and a non-padded sequence. The resulting embeddings of at least the last few tokens are very different between these two examples.

The common pattern is to also provide an attention mask. I tried passing it like model(input_ids, attn_mask=attn_mask), but that isn't how the model is set up. I looked through the source code and can't find where an attention mask mechanism would fit.

Is there a way to batch sequences of varying length and how should I do this?
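Since the forward pass takes no attention mask, one workaround (a sketch under the assumption that the pad token id is 4, as observed from the tokenizer, and that the model returns [batch, length, d_model] hidden states) is to exclude padded positions only when pooling the embeddings. Note that this does not stop pad tokens from influencing other positions inside the model itself:

    import torch

    PAD_ID = 4  # assumption: pad token id produced by the tokenizer

    def masked_mean_pool(model, input_ids):
        mask = (input_ids != PAD_ID).unsqueeze(-1).float()  # [B, L, 1]
        with torch.inference_mode():
            hidden = model(input_ids)                        # [B, L, D]
        summed = (hidden * mask).sum(dim=1)                  # zero out pad positions
        counts = mask.sum(dim=1).clamp(min=1)
        return summed / counts                               # [B, D]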

git-lfs missing from container

Hi!

Thanks for this very cool repository, the preprint is very cool, too!

I am a total beginner with the relevant machine learning libraries, but I think git-lfs is missing from the hyena-dna Docker container. Admittedly, I did something a bit unusual with it because I can't execute Docker on the HPC: I converted it to Singularity. Nevertheless, I'd expect to be able to call git-lfs from there too, and it seems to be missing.

Here's what I did:

# build image from existing docker container
singularity build hyena-dna.sif docker://hyenadna/hyena-dna-public:latest

# can not use the hyena-dna inside the container because it tries to write into the same folder, therefore using a local clone, container only holds the dependencies
git clone https://github.com/HazyResearch/hyena-dna.git
cd hyena-dna

# SINGULARITYENV_CUDA_VISIBLE_DEVICES=1 says "use GPU_1" (it will otherwise use GPU_0 by default)
SINGULARITYENV_CUDA_VISIBLE_DEVICES=1 singularity exec --nv ~/images/hyena-dna.sif python -m train wandb=null experiment=hg38/genomic_benchmark_scratch # works like charm! 

SINGULARITYENV_CUDA_VISIBLE_DEVICES=1 singularity exec --nv ~/images/hyena-dna.sif python -m huggingface # fails

Error:

Using device: cuda
git: 'lfs' is not a git command. See 'git --help'.

The most similar command is
	log
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/opt/conda/lib/python3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/home/hoffk83/images/git/hyena-dna/huggingface.py", line 251, in <module>
    inference_single()
  File "/home/hoffk83/images/git/hyena-dna/huggingface.py", line 209, in inference_single
    model = HyenaDNAPreTrainedModel.from_pretrained(
  File "/home/hoffk83/images/git/hyena-dna/huggingface.py", line 106, in from_pretrained
    config = json.load(open(os.path.join(pretrained_model_name_or_path, 'config.json')))
FileNotFoundError: [Errno 2] No such file or directory: './checkpoints/hyenadna-small-32k-seqlen/config.json'
(base) hoffk83@vision-05:~/images/git/hyena-dna$ ls ./checkpoints/hyenadna-small-32k-seqlen/config.json
ls: cannot access './checkpoints/hyenadna-small-32k-seqlen/config.json': No such file or directory

I think it could be fixed by adding git-lfs to the requirements.txt, and then rebuilding the container.
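As a workaround that avoids git-lfs inside the container entirely, the checkpoints can also be fetched with the huggingface_hub Python API (a sketch; the exact repo id under the LongSafari organisation is an assumption here):

    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="LongSafari/hyenadna-small-32k-seqlen",
        local_dir="./checkpoints/hyenadna-small-32k-seqlen",
    )
    print(local_dir)  # should now contain config.json and the weight files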

Pretraining runtimes from the paper

Hi! Great work, and also great youtube presentation, thanks for making that public.

I have a question about the runtimes. Table A.2 says that pre-training took 80 min for the model with 1.6M parameters. When I pretrain the 3.3M-parameter model (input size 16k, 3 Hyena layers, embedding dim 256) on my own dataset, it takes around 16 hours for a dataset with only 21,000 samples. Is anything wrong with my setup? Could you please specify more explicitly what data size went into Table A.2, i.e. how many samples of which sequence length with what batch size?
And, if it's possible to tell, what share of the nucleotides of the human genome did the pretrained model (say, the one with 32k context) end up seeing?

Thank you for the nice work!
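As a rough way to frame the last question, the number of nucleotides seen is just batch size x sequence length x optimizer steps; the sketch below uses placeholder numbers only (none of them are the paper's actual settings):

    global_batch_size = 256      # placeholder
    seq_len = 32_768             # placeholder (the 32k-context model)
    steps = 20_000               # placeholder

    tokens_seen = global_batch_size * seq_len * steps
    human_genome_bp = 3.1e9      # ~3.1 Gbp haploid reference
    print(f"tokens seen: {tokens_seen:.2e}")
    print(f"genome equivalents: {tokens_seen / human_genome_bp:.1f}x")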

Could this HyenaDNA model be used for a pure language task?

Could this HyenaDNA model be used for a pure language task? Of course with some changes, such as a tokenizer for language! And maybe some other things ? Which other things would that be?

If this can be done, then that would give an enormous advantage of being able to work with a giant context size while still having acceptable (or even very good) performance for training and inference! Am I correct here?

Also I saw a mention of the HyenaDNA model being able of in-context learning, which is a very important prerequisite of such a model!

I have not read the paper, but could you show a table comparing the pros and cons of a standard Transformer and a HyenaDNA model?

But my main question is:
Could this HyenaDNA model be used for a pure language task? And how exactly to go about actually implementing that? What would be gained in comparison with using a conventional Transformer for that?

Thank you for this amazing development! What a time to be alive!

Happy New Year for you and the entire team of HyenaDNA!

Next token prediction - head code location / config to pass

Hello! Great work and thanks for the opensource! I'm trying to check the pretraining and see the next token prediction on a dataset we have.

I find the standalone model probably better for loading pretrained weights and changing the data loading for this purpose. But in the standalone model, there is no head provided for next token prediction. It is said to be in the main HyenaDNA code, but it is a bit hard for me to find where it lies and how I could modify the standalone .py for it. Can you help me locate this part of the code, or explain how to modify the config file for this purpose?
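For what it's worth, a generic next-token head on top of the standalone backbone could look like the sketch below (this is not the repo's actual LM head; it assumes the backbone is built with use_head=False and returns [batch, length, d_model] hidden states):

    import torch.nn as nn
    import torch.nn.functional as F

    class NextTokenHead(nn.Module):
        def __init__(self, backbone, d_model, vocab_size):
            super().__init__()
            self.backbone = backbone
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        def forward(self, input_ids):
            hidden = self.backbone(input_ids)   # [B, L, d_model]
            return self.lm_head(hidden)         # [B, L, vocab_size]

    def lm_loss(logits, input_ids):
        # predict token t+1 from everything up to position t
        shift_logits = logits[:, :-1, :].contiguous()
        shift_labels = input_ids[:, 1:].contiguous()
        return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                               shift_labels.view(-1))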

Also, I'm not an expert in genome study, so I'm not familiar with the data structure if it is assumed to be well known. Since I don't have access to the main data you use, but only to some other single-chromosome genome sequences, can you let me know the data structure so I can generate the correct files? (The .fa file, I think, consists of a header line and a sequence line for each record, and the .bed file holds the start and end position information?)

Thanks!

hyenaDNA for regression?

I would like to use the pre-trained HyenaDNA model for a regression task. I took your code from the Google Colab and have adapted it successfully for a classification task on the same dataset. However, when I modify the code for regression (essentially just setting n_classes = 1 and changing the dataloader to not encode the labels), training runs without errors but the training loss does not decrease at all over time.

Have you tried to use HyenaDNA for a regression task before? I can provide more code if you expect it to work for this type of task.
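A minimal sketch of the regression setup described above (assumptions: the model was built with use_head=True and n_classes=1, and the targets are continuous values). The most common pitfall is passing integer labels; MSE needs float targets with the same shape as the predictions, and normalising the targets often helps when the loss stays flat:

    import torch.nn as nn

    criterion = nn.MSELoss()

    def regression_step(model, optimizer, seqs, targets):
        preds = model(seqs).squeeze(-1)           # [B]
        loss = criterion(preds, targets.float())  # targets must be float, shape [B]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()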

CUDA out of memory with hyena-1m on A100-80G

Hello,

Thank you for sharing such great work!
On an A100-80G, I tried to use hyena-1m on a species classification task but got a "CUDA out of memory" error.
Here is my training command:
python -m train wandb=null experiment=hg38/species dataset.species=[human,mouse,hippo,pig,lemur] train.global_batch_size=256 optimizer.lr=6e-5 trainer.devices=4 dataset.batch_size=1 dataset.max_length=1000000 dataset.species_dir=/data/species_cls/ model.layer.l_max=1000002 model.d_model=256 model.n_layer=8 trainer.max_epochs=150 decoder.mode=last train.pretrained_model_path=null train.pretrained_model_state_hook=null

My understanding is that an A100-80G should be able to train the 1M-length model. Is there anything extra I should be aware of?

Pre-training on local genome data

Hi Authors,
This work is interesting and worth following.
Can you offer an example of how to pre-train a model on local genome data, say, a species dataset like human or mouse?
Thanks!

ICL question

Hi! Thanks for the great work!
I'm currently investigating your ICL solutions (soft tokens and instruction tuning) and can't understand what "shots" actually means and what it is for. Can you please add more details?
Thanks

Human Reference Genome questions

Hello, thank you very much for open-sourcing your work. I have a couple of questions about the Human Genome Reference dataset.

[Q1] In your instructions for downloading Human Reference Genome you mention:

First step is download the Human Reference Genome data. It's comprised of 2 files, 1 with all the sequences (the .fasta file), and with the intervals we use (.bed file).
However, you'll need to have a GCP account to download the exact files we used (from the Enformer), and it cost a little to download. At some point we'll try to upload somewhere to share that data.

As far as I know, Enformer used a mix of the Basenji dataset and the Human Reference Genome. Specifically, in their paper they say:

We modified the Basenji2 dataset by extending the input sequence to 196,608bp from the original 131,072bp using the hg38 reference genome.

In the filenames for Human Reference Genome you also have reference to Basenji, i.e.

Download fasta (.fa format) file (of the entire human genome) into hyena-dna/data/hg38. ~24 chromosomes in the whole genome (merged into 1 file), each chromosome is a continuous sequence, basically
gsutil -u hai-gcp-hippo cp gs://basenji_barnyard/hg38.ml.fa.gz ./ && gunzip hg38.ml.fa.gz

The HyenaDNA paper does not contain any reference to Basenji, so I wonder how these two datasets relate to each other. Do you train on a mix of them, as Enformer does, or only on the Human Reference Genome? Also, in the Appendix you mention:

For pretraining, we use a single human reference genome (Genome Reference Consortium, 2013), and leverage the training and validation intervals (start and end) from (Avsec et al., 2021).

Does this imply that you use exactly the same data as Enformer, with the same train/val splits?

[Q2] Since accessing the data on GCP incurs a cost, you mentioned plans to make the data more accessible. Do you have a timeline for this? Or maybe there is a script to convert the original data into the Enformer format?

CUFFT-type error when running huggingface.py to generate embeddings

Hello,
I am using a slightly modified version of the huggingface.py script to generate embeddings from fasta files. I am using the largest model (1 Mb window size) and running it on an A100 80 GB.

I just added a loop at the end of huggingface.py which loads fasta files and gets embeddings:

for record in records:
    print(record.id)
    sequence = str(record.seq)[0:max_length]
    tok_seq = tokenizer(sequence)
    tok_seq = tok_seq["input_ids"]  # grab ids

    # place on device, convert to tensor
    tok_seq = torch.LongTensor(tok_seq).unsqueeze(0)  # unsqueeze for batch dim
    tok_seq = tok_seq.to(device)

    # prep model and forward
    model.to(device)
    model.eval()
    with torch.inference_mode():
        embeddings = model(tok_seq)

However, after a few hundred iterations I get the following CUFFT error, which seems related to out of memory issues:

Traceback (most recent call last):
  File "huggingface_1Mbp.py", line 271, in <module>
    embeddings = model(tok_seq)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 914, in forward
    hidden_states = self.backbone(input_ids, position_ids=position_ids)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 728, in forward
    hidden_states, residual = layer(hidden_states, residual)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 530, in forward
    hidden_states = self.mixer(hidden_states, **mixer_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 288, in forward
    v = self.filter_fn(v, l_filter, k=k[o], bias=bias[o])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 222, in forward
    y = fftconv(x, k, bias)
  File "/home/hyena-dna/standalone_hyenadna.py", line 53, in fftconv
    k_f = torch.fft.rfft(k, n=fft_size) / fft_size
RuntimeError: cuFFT error: CUFFT_ALLOC_FAILED

So I was wondering: is there a way to flush the memory between iterations in order to prevent this kind of error?
Thanks!
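A sketch of the same loop with explicit cleanup between records (variable names follow the snippet above; whether this fully avoids the CUFFT allocation failure is not guaranteed): move the model to the device once, drop references to the previous output, and release cached CUDA blocks so the FFT workspace can be re-allocated.

    import gc
    import torch

    model.to(device)
    model.eval()

    for record in records:
        tok_seq = torch.LongTensor(tokenizer(str(record.seq)[:max_length])["input_ids"])
        tok_seq = tok_seq.unsqueeze(0).to(device)

        with torch.inference_mode():
            embeddings = model(tok_seq)

        # ... move embeddings to CPU / save them here ...

        del embeddings, tok_seq
        gc.collect()
        torch.cuda.empty_cache()  # returns cached blocks to the driver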

ImportError: dropout_add_layer_norm is not installed

Hi,

Great work! When I was trying to run python -m train wandb=null experiment=hg38/genomic_benchmark_scratch, I encountered the following error:

Error executing job with overrides: ['wandb=null', 'experiment=hg38/genomic_benchmark_scratch']
Traceback (most recent call last):
File "/home/fankunjie/toxic_gene/hyena-dna/train.py", line 691, in main
train(config)
File "/home/fankunjie/toxic_gene/hyena-dna/train.py", line 653, in train
model = SequenceLightningModule(config)
File "/home/fankunjie/toxic_gene/hyena-dna/train.py", line 148, in init
self.setup() ## Added by KS
File "/home/fankunjie/toxic_gene/hyena-dna/train.py", line 172, in setup
self.model = utils.instantiate(registry.model, self.hparams.model)
File "/home/fankunjie/toxic_gene/hyena-dna/src/utils/config.py", line 104, in instantiate
return obj()
File "/home/fankunjie/toxic_gene/hyena-dna/src/models/sequence/dna_embedding.py", line 36, in init
self.backbone = LMBackbone(
File "/home/fankunjie/toxic_gene/hyena-dna/src/models/sequence/long_conv_lm.py", line 305, in init
raise ImportError("dropout_add_layer_norm is not installed")
ImportError: dropout_add_layer_norm is not installed

I followed exactly the same steps mentioned in the README, except for the installation of flash attention. I had trouble installing flash attention, and eventually installed it successfully using pip install flash-attn --no-build-isolation, following this post: Dao-AILab/flash-attention#246.

Thanks for your help!

Chromatin Preprocessing

I'm working with the Chromatin portion of the project; however, I am running into issues with the preprocessing. Can you provide any more details on how you generated the initial data? I have followed Sei and DeepSEA, but I seem to be missing the train_hg38_coords_targets.csv file. There have been several versions of Sei and DeepSEA, so I'm currently going through the old versions, but when I follow the previous steps I still seem to be missing that file.

Language model inference

Thank you for your great work!

While the Huggingface examples seem to be about embeddings, I would love to do inference with the language modeling head. I'm particularly interested in doing variant effect prediction using the log-likelihood of a sequence (as in this protein paper). It would also be helpful if you could implement the AutoModelForCausalLM API.
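For reference, once a causal LM head is available, the per-sequence log-likelihood is just the sum of per-token log-probabilities; a generic sketch (assuming logits of shape [batch, length, vocab_size] aligned with the input ids):

    import torch.nn.functional as F

    def sequence_log_likelihood(logits, input_ids):
        shift_logits = logits[:, :-1, :]
        shift_labels = input_ids[:, 1:]
        log_probs = F.log_softmax(shift_logits, dim=-1)
        token_ll = log_probs.gather(-1, shift_labels.unsqueeze(-1)).squeeze(-1)  # [B, L-1]
        return token_ll.sum(dim=-1)                                              # [B]

    # variant effect score: log-likelihood(alt sequence) - log-likelihood(ref sequence)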

Trouble reproducing Genomics Benchmark Result

Hello,

Thank you for the great work and repo!

I am trying to reproduce the HyenaDNA column of Table 4.1 (GenomicsBenchmark). I am using the weights from LongSafari/hyenadna-tiny-1k-seqlen. However, I am unable to reproduce the results.

Would it be possible for you to specify which hyperparameters from Table A.3 (screenshot below) were used for each of the datasets in this benchmark?
[screenshot of Table A.3 hyperparameters omitted]

Clarifying the models available on HF

Hi,

On the LongSafari HF space there appear to be 2 copies of each model, one with -hf at the end of the name and one without.

I was wondering what the difference is between these models (other than one being compatible with AutoModel), because despite the names being the same and the variables in the config files looking almost identical (i.e., the same d_model and n_layers), they have very different numbers of parameters. For example:

Which version of these models corresponds to the ones used in the paper experiments? If I am not mistaken, it should be the first one (i.e., the one without -hf in the name)?

Flash Attention 2

Hi team Hazy Research,
It's just a matter of time before you get this question, but is HyenaDNA going to use Flash Attention 2 instead of 1? The improvements listed in the repo for v2 seem pretty significant, but v1 is what's linked in HyenaDNA.
I also see that you work with the Flash Attention team, based on the author/contributor list, so it probably won't be long until we see this change...

Pre-trained model for genomic benchmarks

Hello. We've been working through replicating the results from your arXiv paper to better understand HyenaDNA and how we can use it for our own purposes. Using the hyena-dna-nt6 Docker image, we were able to successfully replicate the "Nucleotide Transformer" experiments. However, we've been unable to replicate the "Genomic Benchmark" experiments because the pretrained_model_path in the configs/experiment/hg38/genomic_benchmark.yaml file is set to a path that doesn't exist in the container: /local-scratch/nigam/projects/mwornow/projects/safari-internal/outputs/2023-04-14/2_128_1024.ckpt.

Is this file available somewhere that we can access it? This appears to be the file that's generated by the Quick Entry point experiment, but we would like to use a pre-trained model if possible before generating our own for this first round of validation.

Thanks!

cannot run hg38_hyena_seqlen_warmup_reload

Hi,

Thanks for your great work! I am trying to run the hg38/hg38_hyena_seqlen_warmup_reload.yaml experiment and got the following error message:

[error screenshot omitted]

I did some initial searching on this issue and found this. I set monitor: test/loss and it still doesn't work. But I have no problem running hg38/hg38_hyena.yaml.

Do you have any insights on this issue? Is it related to the sequence length warmup callback? I can run hg38/hg38_hyena.yaml without this callback. I am using pytorch_lightning v1.8.6.

Predicting probability vectors of equal length to input sequence

Hi Eric and HazyResearch team

Thank you for providing such exciting research to the public!

I am currently interested in whether HyenaDNA fine-tuning can operate without the binary classification decoder.

The reason is that I wish to predict at least two probabilities per input nucleotide, for instance:
input: [A, T, C, G, ...]
output: [ [0.8, 0, 0, 0], [0, 0, 0, 0.7] ]

May I seek your advice on whether this is possible, and if so, how to do so?
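A generic sketch of what such a per-nucleotide head could look like (not the repo's decoder; it assumes the backbone is built with use_head=False so it returns one embedding per input token):

    import torch
    import torch.nn as nn

    class PerTokenHead(nn.Module):
        def __init__(self, backbone, d_model, n_outputs=4):
            super().__init__()
            self.backbone = backbone
            self.proj = nn.Linear(d_model, n_outputs)

        def forward(self, input_ids):
            hidden = self.backbone(input_ids)         # [B, L, d_model]
            return torch.sigmoid(self.proj(hidden))   # [B, L, n_outputs] probabilities per position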

Thank you for your time.

Best Regards
WY

Addition to Transformers

First of all, excellent work! This model holds so much promise!

Is this model already in Hugging Face Transformers? If not, I would like to volunteer to add this model to the Hugging Face Transformers framework so it can be used by researchers easily.

Thanks!

Question about HyenaDNA working

Hello,

I have been using HyenaDNA to pre-train on my custom dataset, and I wanted some insight into how it works based on the output logs from the pre-training run. For every epoch, after updating the model weights based on the training data, the tool evaluates performance on the validation data by calculating val/loss. In the process, it loads validation data from two different dataloaders (Validation Dataloader 0 and Validation Dataloader 1). I provided the test data separately from the validation data, and after varying the train : validation : test splits, I found that the size of Validation Dataloader 1 changes in proportion to the size of the test set. Why does this happen, given that the test set is separate from the validation set and that the tool reports a separate test/loss value after every epoch?

pretrain model weights miss pos_emb

Hi exnx,

I'm sorry to bother you. I want to use the pre-trained model weights you provided to fine-tune on downstream tasks, but I found that some model parameters are missing. Is there anything wrong with my approach?

Some refactoring of the model parameter names was done, using the code you provided, before loading the model weights.

Here is the code in more detail:

pretrained_model_name = 'hyenadna-small-32k-seqlen'

max_lengths = {
    'hyenadna-tiny-1k-seqlen': 1024,
    'hyenadna-small-32k-seqlen': 32768,
    'hyenadna-medium-160k-seqlen': 160000,
    'hyenadna-medium-450k-seqlen': 450000,  # T4 up to here
    'hyenadna-large-1m-seqlen': 1_000_000,  # only A100 (paid tier)
}

max_length = max_lengths[pretrained_model_name]  # auto selects

# data settings:
use_padding = True
rc_aug = False  # reverse complement augmentation
add_eos = False  # add end of sentence token

# we need these for the decoder head, if using
use_head = True
n_classes = 1  # not used for embeddings only

# you can override with your own backbone config here if you want,
# otherwise we'll load the HF one in None
backbone_cfg = json.load(open(f'hungging_face_models/hyenadna/{pretrained_model_name}/config.json', 'r'))

model = hyenadna.HyenaDNAModel(**backbone_cfg, use_head=use_head, n_classes=n_classes)
pretrain_w = torch.load(f'hungging_face_models/hyenadna/{pretrained_model_name}/pytorch_model.bin')


RuntimeError Traceback (most recent call last)
Cell In[12], line 1
----> 1 model.load_state_dict(pretrain_w)

File /Data/luokai/biotools/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py:1667, in Module.load_state_dict(self, state_dict, strict)
1662 error_msgs.insert(
1663 0, 'Missing key(s) in state_dict: {}. '.format(
1664 ', '.join('"{}"'.format(k) for k in missing_keys)))
1666 if len(error_msgs) > 0:
-> 1667 raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
1668 self.__class__.__name__, "\n\t".join(error_msgs)))
1669 return _IncompatibleKeys(missing_keys, unexpected_keys)

RuntimeError: Error(s) in loading state_dict for HyenaDNAModel:
Missing key(s) in state_dict: "backbone.layers.0.mixer.filter_fn.pos_emb.z", "backbone.layers.0.mixer.filter_fn.pos_emb.t", "backbone.layers.0.mixer.filter_fn.implicit_filter.3.freq", "backbone.layers.0.mixer.filter_fn.implicit_filter.5.freq", "backbone.layers.0.mixer.filter_fn.modulation.deltas", "backbone.layers.1.mixer.filter_fn.pos_emb.z", "backbone.layers.1.mixer.filter_fn.pos_emb.t", "backbone.layers.1.mixer.filter_fn.implicit_filter.3.freq", "backbone.layers.1.mixer.filter_fn.implicit_filter.5.freq", "backbone.layers.1.mixer.filter_fn.modulation.deltas", "backbone.layers.2.mixer.filter_fn.pos_emb.z", "backbone.layers.2.mixer.filter_fn.pos_emb.t", "backbone.layers.2.mixer.filter_fn.implicit_filter.3.freq", "backbone.layers.2.mixer.filter_fn.implicit_filter.5.freq", "backbone.layers.2.mixer.filter_fn.modulation.deltas", "backbone.layers.3.mixer.filter_fn.pos_emb.z", "backbone.layers.3.mixer.filter_fn.pos_emb.t", "backbone.layers.3.mixer.filter_fn.implicit_filter.3.freq", "backbone.layers.3.mixer.filter_fn.implicit_filter.5.freq", "backbone.layers.3.mixer.filter_fn.modulation.deltas", "head.output_transform.weight", "head.output_transform.bias".

Thanks,
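One workaround sketch, under the assumption that the missing entries are buffers the filter recomputes at construction time (plus the freshly initialised head) rather than learned weights, is to load non-strictly and inspect what was skipped:

    missing, unexpected = model.load_state_dict(pretrain_w, strict=False)
    print("missing:", missing)        # inspect: expected to be only recomputed buffers + the new head
    print("unexpected:", unexpected)  # inspect: ideally empty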

Cuda out of memory for huggingface pre-trained model on A100-80GB

I'm using the standalone_hyenadna.py script and loading the pre-trained weights of the large-1m model from Hugging Face in order to have standalone code similar to the Colab.
When performing fine-tuning and testing on the dummy_mouse_enhancers_ensembl dataset from Genomic Benchmark with a sequence max_length of 3400, I get a "CUDA out of memory" error on an A100 GPU.
As you suggest in the README, I tried to modify the downloaded config.json file found in checkpoints_path/hyenadna-large-1m-seqlen by setting the following fields to True:
checkpoint_mixer: True
checkpoint_mlp: True
But now I'm getting the error:
scratch_dict[key] = pretrained_dict[key_loaded]
KeyError: 'model.backbone.layers.0.mixer.layer.in_proj.weight'
As you suggest, I tried to toggle these params on/off in order to find a working combination, but I either get this key error or the CUDA out-of-memory error.

It seems to me that since the pre-trained model loaded from Hugging Face was probably trained with those flags set to False, there is now a configuration mismatch.
Am I missing something? How can I work with ultra-long sequences without getting memory errors when using pre-trained models downloaded from Hugging Face?

Thanks in advance for your response and for your valuable contribution to this research field!

Sanity Checking DataLoader error

Hi

I am running the quick start command to check my hyena-dna installation:
python -m train wandb=null experiment=hg38/genomic_benchmark_scratch
and I got the error report below. Does anyone know how to fix this?

Sanity Checking DataLoader 0:   0%|                                                                      | 0/2 [00:00<?, ?it/s]
CUDA Error: invalid device function /PATH/hyena-dna/flash-attention/csrc/layer_norm/ln_fwd_kernels.cuh 271

Best
Dié

flash-attention installation

Hi All,

Thank you for building this tool; we are excited to try it. I'm having an issue with the flash-attention installation.

I have torch installed, but with pip install -e . I'm getting a module error.

cd flash-attention/
git submodule update --init
pip install -e .

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Obtaining file:///home/rohan/rohan/hyena-dna/flash-attention
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... error
ERROR: Command errored out with exit status 1:
command: /home/rohan/anaconda3/bin/python /home/rohan/anaconda3/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_editable /tmp/tmpdk8hywoz
cwd: /home/rohan/rohan/hyena-dna/flash-attention
Complete output (17 lines):
Traceback (most recent call last):
File "/home/rohan/anaconda3/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in
main()
File "/home/rohan/anaconda3/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/home/rohan/anaconda3/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 144, in get_requires_for_build_editable
return hook(config_settings)
File "/tmp/pip-build-env-n_q31tvm/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 450, in get_requires_for_build_editable
return self.get_requires_for_build_wheel(config_settings)
File "/tmp/pip-build-env-n_q31tvm/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/tmp/pip-build-env-n_q31tvm/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-n_q31tvm/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 338, in run_setup
exec(code, locals())
File "", line 13, in
ModuleNotFoundError: No module named 'torch'
