when I ran python -m train wandb=null experiment=hg38/nucleotide_transformer dataset_n

nucleotide finetuning about hyena-dna HOT 5 CLOSED

hazyresearch commented on May 12, 2024

nucleotide finetuning

from hyena-dna.

Comments (5)

wawpaopao commented on May 12, 2024

thanks!

wawpaopao commented on May 12, 2024

And what's the difference compared with the Hugging Face trainer provided in the colab? When I use the colab to fine-tune HyenaDNA on the Nucleotide Transformer dataset, the performance is a bit low.

exnx commented on May 12, 2024

In your first post, the correct flag is: dataset.dataset_name. You forgot the prefix dataset before dataset_name.
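The command-line syntax looks like Hydra-style dotted overrides. Assuming that is the case, here is a minimal sketch of why the prefix matters: the key has to be addressed by its full path inside the nested config. The `apply_override` helper, the config shape, and the dataset name `my_dataset` are all hypothetical, and this is a simplified model, not Hydra's implementation.

```python
def apply_override(cfg, override):
    """Apply one 'a.b.c=value' override to a nested dict, rejecting unknown keys."""
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for part in parents:
        node = node[part]  # KeyError if an intermediate group is missing
    if leaf not in node:
        raise KeyError(f"'{dotted_key}' is not a key in the config")
    node[leaf] = value
    return cfg


# Hypothetical config shape and dataset name, for illustration only.
cfg = {"wandb": "default", "dataset": {"dataset_name": "placeholder"}}

# With the prefix, the override reaches the nested key.
apply_override(cfg, "dataset.dataset_name=my_dataset")
print(cfg["dataset"]["dataset_name"])

# Without the prefix, there is no top-level 'dataset_name' key to override.
try:
    apply_override(cfg, "dataset_name=my_dataset")
except KeyError as err:
    print("rejected:", err)
```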

Regarding your second post, the main differences are:

lack of a cosine decay learning rate scheduler (very important)
lack of optimizer parameter groups, which allow for different hyperparameters per layer (e.g., the Hyena layer requires weight decay to be 0). Also very important!
use of gradient clipping (with a value of 1.0) in the main repo
automatic mixed precision handling in PyTorch Lightning (e.g., not available on colab)
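For anyone who wants to try adding the first three items to a plain PyTorch loop, here is a minimal sketch, not the main repo's implementation. The `build_param_groups` helper, the toy model, and the `"filter"` name used to pick the zero-weight-decay parameters are hypothetical. Clipping by gradient norm is also an assumption, since the list above only gives the value 1.0. Mixed precision is left out.

```python
import torch
from torch import nn


def build_param_groups(model, weight_decay, no_decay_keywords):
    """Split parameters by name into a decayed group and a weight_decay=0 group."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if any(key in name for key in no_decay_keywords):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Toy stand-in model; "filter" is a hypothetical name for the layer needing weight_decay=0.
model = nn.ModuleDict({"filter": nn.Linear(8, 8), "head": nn.Linear(8, 1)})
num_steps = 10

optimizer = torch.optim.AdamW(build_param_groups(model, 0.1, ("filter",)), lr=6e-4)
# Cosine decay of the learning rate from 6e-4 towards 0 over num_steps optimizer steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
loss_fn = nn.BCEWithLogitsLoss()

lrs = []
for step in range(num_steps):
    x = torch.randn(4, 8)                    # random stand-in batch
    y = torch.randint(0, 2, (4, 1)).float()  # random binary labels
    optimizer.zero_grad()
    logits = model["head"](model["filter"](x))
    loss = loss_fn(logits, y)
    loss.backward()
    # Clip the global gradient norm to 1.0 before the update (assumed clipping mode).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # stepped per batch here; per-epoch stepping is another option
    lrs.append(scheduler.get_last_lr()[0])
```

In a real run, `T_max` would be the total number of training steps, and the keyword tuple would need to match the actual parameter names of the model being fine-tuned.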

wawpaopao commented on May 12, 2024

Thanks. I used the colab file to fine-tune on a DNABERT-2 dataset, in the same way as for the Nucleotide Transformer dataset, and I found the MCC was a bit low. I don't know what the error is... could you help me take a look?
import os

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

# HyenaDNAPreTrainedModel, HyenaDNAModel and CharacterTokenizer come from the HyenaDNA colab;
# SupervisedDataset, train and test are assumed to be defined elsewhere in the notebook.


def run_train():
    # experiment settings:
    num_epochs = 80  # ~100 seems fine
    max_length = 500  # max len of sequence of dataset (of what you want)
    use_padding = True
    data_path = './DNABERT_2/eval/GUE/EMP/H3'

    batch_size = 256
    learning_rate = 6e-4  # good default for Hyena
    rc_aug = True  # reverse complement augmentation
    add_eos = False  # add end of sentence token
    weight_decay = 0.1

    # for fine-tuning, only the 'tiny' model can fit on colab
    pretrained_model_name = 'hyenadna-tiny-1k256d-seqlen'  # use None if training from scratch

    # we need these for the decoder head, if using
    use_head = True
    n_classes = 1  # single logit, paired with BCEWithLogitsLoss below

    # you can override with your own backbone config here if you want,
    # otherwise we'll load the HF one by default
    backbone_cfg = None

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print("Using device:", device)

    # instantiate the model (pretrained here)
    if pretrained_model_name in ['hyenadna-tiny-1k256d-seqlen']:
        # use the pretrained Huggingface wrapper instead
        model = HyenaDNAPreTrainedModel.from_pretrained(
            './checkpoints',
            pretrained_model_name,
            download=False,
            config=backbone_cfg,
            device=device,
            use_head=use_head,
            n_classes=n_classes,
        )

    # from scratch
    else:
        model = HyenaDNAModel(**backbone_cfg, use_head=use_head, n_classes=n_classes)

    # create tokenizer
    tokenizer = CharacterTokenizer(
        characters=['A', 'C', 'G', 'T', 'N'],  # add DNA characters, N is uncertain
        model_max_length=max_length + 2,  # to account for special tokens, like EOS
        add_special_tokens=False,  # we handle special tokens elsewhere
        padding_side='left',  # since HyenaDNA is causal, we pad on the left
    )

    # create datasets
    ds_train = SupervisedDataset(tokenizer=tokenizer,
                                 data_path=os.path.join(data_path, 'train.csv'),
                                 kmer=-1)
    ds_test = SupervisedDataset(tokenizer=tokenizer,
                                data_path=os.path.join(data_path, 'test.csv'),
                                kmer=-1)
    train_loader = DataLoader(ds_train, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(ds_test, batch_size=batch_size, shuffle=False)

    # loss function
    loss_fn = nn.BCEWithLogitsLoss()
    # loss_fn = nn.MSELoss()

    # create optimizer (one parameter group, same weight decay for every layer)
    optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)

    model.to(device)

    for epoch in range(num_epochs):
        train(model, device, train_loader, optimizer, epoch, loss_fn)
        test(model, device, test_loader, loss_fn)
        # note: likely redundant if train() already calls optimizer.step() for each batch
        optimizer.step()


if __name__ == "__main__":
    run_train()
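Since the metric in question is MCC, here is a small sketch of how it can be computed from the single-logit outputs that `n_classes = 1` with `BCEWithLogitsLoss` produces, which may help rule out an evaluation bug. The helper names `logits_to_labels` and `matthews_corrcoef` and the example values are hypothetical. A logit above 0 corresponds to a sigmoid probability above 0.5.

```python
import math


def logits_to_labels(logits, threshold=0.0):
    """Turn raw logits into 0/1 predictions; logit > 0 means sigmoid(logit) > 0.5."""
    return [1 if z > threshold else 0 for z in logits]


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels given as 0/1 integers."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up labels and logits, for illustration only.
y_true = [1, 1, 0, 0]
example_logits = [2.0, -1.0, -0.5, 0.3]
y_pred = logits_to_labels(example_logits)
print(y_pred, matthews_corrcoef(y_true, y_pred))
```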

exnx commented on May 12, 2024

As mentioned, the colab is missing a lot of what is needed to get competitive results. The colab is mainly for educational purposes. To get good results, you'll need to use the main repo for fine-tuning. The hyperparameters will also matter a lot, and the right values are something only you can find by running sweeps of different hyperparameters on the actual datasets.
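To make "running sweeps" concrete, here is a minimal grid-search sketch. Everything in it is hypothetical: the `grid_sweep` helper, the search values, and especially `toy_trial`, which is a placeholder objective standing in for a real fine-tuning run that returns a validation metric.

```python
import itertools


def grid_sweep(search_space, run_trial):
    """Evaluate every combination in search_space and return the best one."""
    names = list(search_space)
    results = []
    for values in itertools.product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        results.append((config, run_trial(config)))
    best_config, best_score = max(results, key=lambda item: item[1])
    return best_config, best_score, results


def toy_trial(config):
    # Placeholder objective, NOT a real training run: it simply prefers
    # learning_rate=1e-4 and weight_decay=0.0 so the sweep has something to find.
    return -abs(config["learning_rate"] - 1e-4) - abs(config["weight_decay"] - 0.0)


# Hypothetical search values, for illustration only.
search_space = {
    "learning_rate": [6e-4, 1e-4, 6e-5],
    "weight_decay": [0.0, 0.1],
}
best_config, best_score, results = grid_sweep(search_space, toy_trial)
print(best_config, best_score)
```

In practice `run_trial` would launch one fine-tuning run with the given config and return its validation score.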

Unfortunately, we're not able to support you in fine-tuning on your own datasets. We mainly support reproducing results from the paper.

But maybe this new docker image will help, which has the Nucleotide Transformer datasets and the exact launch commands and hyperparameters used to get best performance. The environment, pretrained weights, datasets, launch commands are all inside the Docker image, you just need to pull and launch it. Perhaps you can "reverse engineer" those settings for what works best for you on your own datasets.

docker pull hyenadna/hyena-dna-nt6:latest 
docker run --gpus all -it -p80:3000 hyenadna/hyena-dna-nt6 /bin/bash

This will land you inside /wdr, which has a file named launch_commands_nucleotide_transformer with all the launch commands for the 18 NT datasets.
