Comments (5)
thanks!
from hyena-dna.
And what is the difference from the Hugging Face trainer provided in the colab? When I use the colab to finetune HyenaDNA on the Nucleotide Transformer dataset, the performance is a bit low.
In your first post, the correct flag is dataset.dataset_name. You forgot the prefix dataset before dataset_name.
Regarding your second post, the main differences are:
- lack of a cosine decay learning rate scheduler (very important)
- lack of optimizer parameter groups, which allow different hyperparameters per layer (e.g., the Hyena layer requires weight decay to be 0). Also very important!
- using gradient clipping (with a value of 1.0)
- automatic mixed precision handling in PyTorch Lightning (e.g., not available on colab)
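The first three items in the list above can be approximated in a plain PyTorch loop. The following is a minimal sketch, not the main repo's implementation: the helper names `build_optimizer` and `train_steps` are hypothetical, and selecting the no-weight-decay parameters by a substring of their name (`"bias"` by default) is an assumption. For HyenaDNA you would need to check the model's actual parameter names and pass the ones belonging to the Hyena layers.

```python
import torch
from torch import nn, optim


def build_optimizer(model, lr=6e-4, weight_decay=0.1, no_decay_keys=("bias",)):
    """AdamW with two parameter groups: one with weight decay, one without.

    no_decay_keys is a hypothetical name filter; adapt it to the real
    parameter names of the layers that must not be decayed.
    """
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if any(key in name for key in no_decay_keys):
            no_decay.append(param)
        else:
            decay.append(param)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return optim.AdamW(groups, lr=lr)


def train_steps(model, batches, loss_fn, total_steps, max_grad_norm=1.0):
    """Run one optimizer step per batch with cosine decay and gradient clipping."""
    optimizer = build_optimizer(model)
    # Cosine decay from the initial learning rate down to 0 over total_steps.
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    lrs = []
    for x, y in batches:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        # Clip the global gradient norm to max_grad_norm before the update.
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
        optimizer.step()
        scheduler.step()
        lrs.append(scheduler.get_last_lr()[0])
    return optimizer, lrs
```

In this sketch the scheduler is stepped once per batch, so `total_steps` should be the number of epochs times the number of batches per epoch. Mixed precision is left out to keep the example short.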
Thanks. I used the colab file to finetune on a DNABERT-2 dataset, in the same way as for the Nucleotide Transformer dataset, and I found the MCC was a bit low. I can't find the error. Could you help me take a look?
import os

import torch
from torch import nn, optim
from torch.utils.data import DataLoader

# HyenaDNAPreTrainedModel, HyenaDNAModel, CharacterTokenizer, train and test
# are assumed to be defined earlier in the colab notebook; SupervisedDataset
# is assumed to come from the DNABERT-2 code.


def run_train():
    # experiment settings:
    num_epochs = 80  # ~100 seems fine
    max_length = 500  # max len of sequence of dataset (of what you want)
    use_padding = True
    data_path = './DNABERT_2/eval/GUE/EMP/H3'
    batch_size = 256
    learning_rate = 6e-4  # good default for Hyena
    rc_aug = True  # reverse complement augmentation
    add_eos = False  # add end of sentence token
    weight_decay = 0.1

    # for fine-tuning, only the 'tiny' model can fit on colab
    pretrained_model_name = 'hyenadna-tiny-1k256d-seqlen'  # use None if training from scratch

    # we need these for the decoder head, if using
    use_head = True
    n_classes = 1

    # you can override with your own backbone config here if you want,
    # otherwise we'll load the HF one by default
    backbone_cfg = None

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print("Using device:", device)

    # instantiate the model (pretrained here)
    if pretrained_model_name in ['hyenadna-tiny-1k256d-seqlen']:
        # use the pretrained Huggingface wrapper instead
        model = HyenaDNAPreTrainedModel.from_pretrained(
            './checkpoints',
            pretrained_model_name,
            download=False,
            config=backbone_cfg,
            device=device,
            use_head=use_head,
            n_classes=n_classes,
        )
    # from scratch
    else:
        model = HyenaDNAModel(**backbone_cfg, use_head=use_head, n_classes=n_classes)

    # create tokenizer
    tokenizer = CharacterTokenizer(
        characters=['A', 'C', 'G', 'T', 'N'],  # add DNA characters, N is uncertain
        model_max_length=max_length + 2,  # to account for special tokens, like EOS
        add_special_tokens=False,  # we handle special tokens elsewhere
        padding_side='left',  # since HyenaDNA is causal, we pad on the left
    )

    # create datasets
    ds_train = SupervisedDataset(tokenizer=tokenizer,
                                 data_path=os.path.join(data_path, 'train.csv'),
                                 kmer=-1)
    ds_test = SupervisedDataset(tokenizer=tokenizer,
                                data_path=os.path.join(data_path, 'test.csv'),
                                kmer=-1)
    train_loader = DataLoader(ds_train, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(ds_test, batch_size=batch_size, shuffle=False)

    # loss function
    loss_fn = nn.BCEWithLogitsLoss()
    # loss_fn = nn.MSELoss()

    # create optimizer
    optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)

    model.to(device)

    for epoch in range(num_epochs):
        train(model, device, train_loader, optimizer, epoch, loss_fn)
        test(model, device, test_loader, loss_fn)
        # note: this extra step is redundant if train() already steps the optimizer
        optimizer.step()


if __name__ == "__main__":
    run_train()
As mentioned, the colab is missing a lot of what is needed to get competitive results; it is mainly for educational purposes. To get good results, you'll need to use the main repo for finetuning. The hyperparameters also matter a lot, and the right ones are something only you can find by running sweeps on the actual datasets.
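A hyperparameter sweep of the kind mentioned above can be as simple as a grid search. The sketch below is only an illustration under stated assumptions: `sweep` is a hypothetical helper, `run_finetune` stands for whatever function trains the model and returns a validation metric (for example MCC), and `fake_run` is a stand-in scoring function so the example runs without a GPU or dataset.

```python
import itertools


def sweep(run_finetune, learning_rates, weight_decays):
    """Try every (lr, weight_decay) pair and return results sorted best-first."""
    results = []
    for lr, wd in itertools.product(learning_rates, weight_decays):
        # run_finetune is assumed to return a score where higher is better.
        score = run_finetune(lr=lr, weight_decay=wd)
        results.append({"lr": lr, "weight_decay": wd, "score": score})
    return sorted(results, key=lambda r: r["score"], reverse=True)


def fake_run(lr, weight_decay):
    """Stand-in for a real fine-tuning run (hypothetical scoring rule)."""
    return -abs(lr - 6e-4) - 0.001 * weight_decay


if __name__ == "__main__":
    for row in sweep(fake_run, [1e-4, 6e-4, 1e-3], [0.0, 0.1]):
        print(row)
```

Replacing `fake_run` with a real training function turns this into an actual sweep; the grid values shown are placeholders, not recommended settings.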
Unfortunately, we're not able to support finetuning on your own datasets. We mainly support reproducing results from the paper.
But maybe this new docker image will help, which has the Nucleotide Transformer datasets and the exact launch commands and hyperparameters used to get best performance. The environment, pretrained weights, datasets, launch commands are all inside the Docker image, you just need to pull and launch it. Perhaps you can "reverse engineer" those settings for what works best for you on your own datasets.
docker pull hyenadna/hyena-dna-nt6:latest
docker run --gpus all -it -p80:3000 hyenadna/hyena-dna-nt6 /bin/bash
This will land you inside /wdr, which has a file named launch_commands_nucleotide_transformer with all the launch commands for the 18 NT datasets.