abhishekkrthakur / tez

Tez is a super-simple and lightweight Trainer for PyTorch. It also comes with many utils that you can use to tackle over 90% of deep learning projects in PyTorch.

License: Apache License 2.0


tez's Introduction

Tez: a simple pytorch trainer

NOTE: Currently, we are not accepting any pull requests! All PRs will be closed. If you want a feature or something doesn't work, please create an issue.

tez (तेज़ / تیز) means sharp, fast & active. This is a simple, to-the-point library to make your PyTorch training easy.

This library is currently at an early stage, so there might be breaking changes.

The idea behind tez is simple:

keep things as simple as possible
make it as customizable as possible
clean code
faster prototyping
production ready

Currently, tez supports CPU, single-GPU, multi-GPU and TPU training. More coming soon!

Using tez is super easy. We don't want you to be far away from PyTorch, so you do everything on your own and just use tez to make a few things simpler.


tez's Issues

Suggestion: Object detection example

A few object detection competitions are running on Kaggle, so I believe this is a good time to create a video and/or code example using Tez for object detection. It would help many people and would also help popularize the library.

🔥 Require Tez + Apple M1 GPU Support

Here is the full traceback:

Any plan on implementing PyTorch Lightning !!!

Hello, @abhishekkrthakur thanks for this awesome library.
Do you have any plan on including PyTorch Lightning in this library?
Cheers.

How do you pass the parameters when using the "ReduceLROnPlateau" scheduler?

val_loss is required as a parameter; how can I construct it? Thank you.

.

I tried to follow along and got this error: "__init__() got an unexpected keyword argument 'resize'".

Saving oofs file when fitting

When we train a model by calling model.fit, it would be nice if we could save the validation-set predictions on the fly as OOF (out-of-fold) CSV files.
Please look into it.

AttributeError: module 'tez' has no attribute 'Model'

NotADirectoryError

NotADirectoryError Traceback (most recent call last)
in
----> 1 train_dataset[0]

/opt/conda/lib/python3.7/site-packages/tez/datasets/image_classification.py in __getitem__(self, item)
39 targets = self.targets[item]
40 if self.backend == "pil":
---> 41 image = Image.open(self.image_paths[item])
42 if self.resize is not None:
43 image = image.resize(

/opt/conda/lib/python3.7/site-packages/PIL/Image.py in open(fp, mode, formats)
2889
2890 if filename:
-> 2891 fp = builtins.open(filename, "rb")
2892 exclusive_fp = True
2893

NotADirectoryError: [Errno 20] Not a directory: '../input/cassava-leaf-disease-classification/train.csv/1724663202.jpg'

Small error in image_classification.py

If augmentations is None, we get an error because the variable augmented is referenced before assignment:
UnboundLocalError: local variable 'augmented' referenced before assignment

        elif self.backend == "cv2":
            image = cv2.imread(self.image_paths[item])
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            if self.resize is not None:
                image = cv2.resize(
                    image,
                    (self.resize[1], self.resize[0]),
                    interpolation=cv2.INTER_CUBIC,
                )
            if self.augmentations is not None:
                augmented = self.augmentations(image=image)
                image = augmented["image"]

If the indentation is fixed, this error goes away.
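A minimal sketch of the guarded pattern, assuming an albumentations-style callable that is invoked as augmentations(image=...) and returns a dict with an "image" key, as in the snippet above. The helper name apply_augmentations and the toy_flip transform are hypothetical stand-ins, not part of tez.

```python
def apply_augmentations(image, augmentations=None):
    """Return the augmented image, or the input unchanged when no augmentations are given."""
    if augmentations is not None:
        # Both the call and the dict lookup stay inside the guard,
        # so `augmented` is never read unless it was assigned.
        augmented = augmentations(image=image)
        image = augmented["image"]
    return image


def toy_flip(image):
    # Hypothetical stand-in for an augmentation pipeline: reverses each row.
    return {"image": [row[::-1] for row in image]}


pixels = [[1, 2, 3], [4, 5, 6]]
unchanged = apply_augmentations(pixels)          # no augmentations: passes through
flipped = apply_augmentations(pixels, toy_flip)  # augmentations applied
```

The key point is only that the line reading augmented["image"] sits at the same indentation level as the line that assigns augmented.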

ValueError: operands could not be broadcast together with shapes (256,256,4) (3,) (256,256,4)

I am trying to use this package, and it throws the error below. I am using the same pipeline as in the cassava leaf disease problem, but on a different dataset where the image size is (256, 256).

Could you please help here?

Downloading: "https://github.com/lukemelas/EfficientNet-PyTorch/releases/download/1.0/efficientnet-b4-6ed6700e.pth" to /root/.cache/torch/hub/checkpoints/efficientnet-b4-6ed6700e.pth
100%
74.4M/74.4M [00:00<00:00, 107MB/s]

Loaded pretrained weights for efficientnet-b4
0%| | 0/51 [00:00<?, ?it/s]

ValueError Traceback (most recent call last)
in ()
11 epochs=10,
12 callbacks=[es],
---> 13 fp16=True,
14 )
15 model.save("model.bin")

6 frames
/usr/local/lib/python3.6/dist-packages/tez/model/model.py in fit(self, train_dataset, valid_dataset, train_sampler, valid_sampler, device, epochs, train_bs, valid_bs, n_jobs, callbacks, fp16)
295 self.train_state = enums.TrainingState.EPOCH_START
296 self.train_state = enums.TrainingState.TRAIN_EPOCH_START
--> 297 train_loss = self.train_one_epoch(self.train_loader, device)
298 self.train_state = enums.TrainingState.TRAIN_EPOCH_END
299 if self.valid_loader:

/usr/local/lib/python3.6/dist-packages/tez/model/model.py in train_one_epoch(self, data_loader, device)
176 losses = AverageMeter()
177 tk0 = tqdm(data_loader, total=len(data_loader))
--> 178 for b_idx, data in enumerate(tk0):
179 self.train_state = enums.TrainingState.TRAIN_STEP_START
180 loss, metrics = self.train_one_step(data, device)

/usr/local/lib/python3.6/dist-packages/tqdm/std.py in __iter__(self)
1102 fp_write=getattr(self.fp, 'write', sys.stderr.write))
1103
-> 1104 for obj in iterable:
1105 yield obj
1106 # Update and possibly print the progressbar.

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in __next__(self)
433 if self._sampler_iter is None:
434 self._reset()
--> 435 data = self._next_data()
436 self._num_yielded += 1
437 if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
1083 else:
1084 del self._task_info[idx]
-> 1085 return self._process_data(data)
1086
1087 def _try_put_index(self):

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in _process_data(self, data)
1109 self._try_put_index()
1110 if isinstance(data, ExceptionWrapper):
-> 1111 data.reraise()
1112 return data
1113

/usr/local/lib/python3.6/dist-packages/torch/_utils.py in reraise(self)
426 # have message field
427 raise self.exc_type(message=msg)
--> 428 raise self.exc_type(msg)
429
430

ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/tez/datasets/image_classification.py", line 48, in __getitem__
augmented = self.augmentations(image=image)
File "/usr/local/lib/python3.6/dist-packages/albumentations/core/composition.py", line 171, in __call__
data = t(**data)
File "/usr/local/lib/python3.6/dist-packages/albumentations/core/transforms_interface.py", line 38, in __call__
res[key] = target_function(arg, **dict(params, **target_dependencies))
File "/usr/local/lib/python3.6/dist-packages/albumentations/augmentations/transforms.py", line 808, in apply
return F.normalize(image, self.mean, self.std, self.max_pixel_value)
File "/usr/local/lib/python3.6/dist-packages/albumentations/augmentations/functional.py", line 93, in normalize
img -= mean
ValueError: operands could not be broadcast together with shapes (256,256,4) (3,) (256,256,4)
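The shapes in the message suggest that the images have four channels (for example RGBA PNGs) while the normalization mean has three entries. A minimal sketch under that assumption; to_rgb is a hypothetical helper, not part of tez or albumentations, and the mean values are only illustrative.

```python
import numpy as np


def to_rgb(image):
    """Drop the alpha channel of an H x W x 4 array; pass other arrays through."""
    if image.ndim == 3 and image.shape[2] == 4:
        return image[:, :, :3]
    return image


rgba = np.zeros((256, 256, 4), dtype=np.float32)           # same shape as in the error
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)   # three-channel mean

rgb = to_rgb(rgba)
normalized = rgb - mean  # broadcasts (256, 256, 3) against (3,)
```

When loading with PIL directly, Image.open(path).convert("RGB") has the same effect of discarding the alpha channel.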

Resuming the training through checkpoint with tez

Hi,

I am wondering if it is possible to pick up a saved model and resume/continue the training with tez. I am new to pytorch. Here is what I tried:

class Bert(tez.Model):

    def __init__(self, num_classes, num_train_steps=None):

        super().__init__()
        self.bert = transformers.BertModel.from_pretrained(
            'bert-base-uncased',
            return_dict=False
        )

        if config.RETRAINING: # set to True
            self.bert.load(
                'demo.bin',
                device='cuda'
            )

        self.bert_drop = nn.Dropout(0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, num_classes)

and it doesn't work. I am not sure what I am missing. I found this for pytorch:

https://discuss.pytorch.org/t/loading-a-saved-model-for-continue-training/17244

but I am not sure how to use this together with tez.
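A minimal sketch in plain PyTorch of resuming from a checkpoint, assuming the checkpoint is a dict with "state_dict" and "optimizer" keys (the layout used by the loading snippets elsewhere on this page). The tiny linear model and the temporary file are stand-ins; how tez itself wires this into fit is not shown here.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One training step so the optimizer has state worth saving.
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

path = os.path.join(tempfile.mkdtemp(), "demo.bin")
torch.save(
    {"state_dict": model.state_dict(), "optimizer": optimizer.state_dict()},
    path,
)

# Later: rebuild the same architecture, then restore weights and optimizer state.
resumed = nn.Linear(4, 2)
resumed_optimizer = torch.optim.SGD(resumed.parameters(), lr=0.1, momentum=0.9)
checkpoint = torch.load(path, map_location="cpu")
resumed.load_state_dict(checkpoint["state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer"])
```

In the question's code, load is called on self.bert, which is the Hugging Face BertModel rather than the whole model. Assuming demo.bin was written for the full model, restoring the state on the full model after it has been constructed is likely the closer match.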

Can it work without CUDA?

I get an error when I run the code with a CPU configuration.

Traceback (most recent call last):
File "c:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\recommender.py", line 88, in <module>
train()
File "c:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\recommender.py", line 82, in train
model.fit(
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\tez\model\model.py", line 309, in fit
self._init_model(
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\tez\model\model.py", line 93, in _init_model
self.to(self.device)
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\torch\nn\modules\module.py", line 852, in to
return self._apply(convert)
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\torch\nn\modules\module.py", line 530, in _apply
module._apply(fn)
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\torch\nn\modules\module.py", line 552, in _apply
param_applied = fn(param)
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\torch\nn\modules\module.py", line 850, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\torch\cuda\__init__.py", line 166, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
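The assertion is raised when something is moved to "cuda" on a CPU-only PyTorch build. A minimal sketch of choosing the device at runtime; the small linear layer is a stand-in, and passing the result on as model.fit(..., device=device) assumes a tez version whose fit accepts a device argument, as in the fit signature shown in an earlier traceback on this page.

```python
import torch

# Fall back to the CPU when this PyTorch build or machine has no CUDA support.
device = "cuda" if torch.cuda.is_available() else "cpu"

layer = torch.nn.Linear(3, 1).to(device)  # safe on both CPU-only and CUDA builds
```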

Text classification examples - Tokenizer is defined twice

The tokenizer is defined both in the model and the dataset in the BERT text classification examples.

multi_class.py, line 50:
self.tokenizer = transformers.BertTokenizer.from_pretrained( "bert-base-uncased", do_lower_case=True )

Saving validation score

Is it possible to save a list of the validation scores (per epoch or per batch) after training? The console output on my server usually gets deleted, but I really need the validation scores to compare models. It would be very convenient if I could get them in a single file, for example.
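One way to keep the scores is to append them to a CSV file as training progresses. A minimal sketch using only the standard library; append_scores, the file name, and the metric values are hypothetical and made up for illustration, and where to call it from (for example, an epoch-end hook) depends on the tez version in use.

```python
import csv
import os
import tempfile


def append_scores(path, epoch, scores):
    """Append one row of metrics for an epoch; write the header on first use."""
    fieldnames = ["epoch"] + sorted(scores)
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow({"epoch": epoch, **scores})


log_path = os.path.join(tempfile.mkdtemp(), "valid_scores.csv")
append_scores(log_path, 1, {"valid_loss": 0.52, "valid_accuracy": 0.81})  # made-up values
append_scores(log_path, 2, {"valid_loss": 0.43, "valid_accuracy": 0.88})  # made-up values

# Read the file back to compare runs later.
with open(log_path, newline="") as f:
    rows = list(csv.DictReader(f))
```

Because each epoch is appended as soon as it finishes, the file survives even if the console output is lost or the run is interrupted.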

Error when running the example code

When I run the example code:

accelerate launch imdb_sentiment_classification.py

after a few epochs I get this error:

INFO:tez.callbacks.early_stopping:EarlyStopping counter: 4/5
[train] accuracy=0.9915, loss=0.0269 [valid] accuracy=0.8953, loss=0.4287 [e=5 steps=2112]                                                                                                 
 30%|████████████████████████████████▍                                                                           | 2112/7040 [05:45<06:40, 12.32it/s, accuracy=0.991, epoch=5, loss=0.0269]2022-09-17 07:55:02,832 INFO EarlyStopping counter: 5/5
INFO:tez.callbacks.early_stopping:EarlyStopping counter: 5/5
 30%|████████████████████████████████▍                                                                           | 2112/7040 [05:47<13:31,  6.07it/s, accuracy=0.991, epoch=5, loss=0.0269]




[E ProcessGroupNCCL.cpp:719] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34532, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808970 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34532, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808984 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34532, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809275 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34532, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809275 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34532, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808970 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34532, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808984 milliseconds before timing out.
Traceback (most recent call last):
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/resource_sharer.py", line 138, in _serve
    with self._listener.accept() as conn:
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 470, in accept
    deliver_challenge(c, self._authkey)
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 745, in deliver_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 113654 closing signal SIGTERM
Traceback (most recent call last):
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/resource_sharer.py", line 138, in _serve
    with self._listener.accept() as conn:
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 470, in accept
    deliver_challenge(c, self._authkey)
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 745, in deliver_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 2 (pid: 113655) of binary: /root/miniconda3/envs/lightning/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/lightning/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
imdb_sentiment_classification.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2022-09-17_08:25:22
  host      : dy-a100-779-tlzrv
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 113656)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 113656
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-09-17_08:25:22
  host      : dy-a100-779-tlzrv
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 113655)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 113655
=======================================================
Traceback (most recent call last):
  File "/root/miniconda3/envs/lightning/bin/accelerate", line 33, in <module>
    sys.exit(load_entry_point('accelerate==0.12.0.dev0', 'console_scripts', 'accelerate')())
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/accelerate-0.12.0.dev0-py3.9.egg/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/accelerate-0.12.0.dev0-py3.9.egg/accelerate/commands/launch.py", line 734, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/accelerate-0.12.0.dev0-py3.9.egg/accelerate/commands/launch.py", line 374, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '4', 'imdb_sentiment_classification.py']' returned non-zero exit status 1.

Can I set the device ID of the GPUs?

I have multiple GPUs on my computer and want to train the model on specific ones. Can I set the device ID?
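Independently of tez, the CUDA_VISIBLE_DEVICES environment variable restricts which GPUs a process can see. A minimal sketch, assuming the variable is set before CUDA is first initialised; the index "1" is only an example, and whether tez offers its own device-ID option is not covered here.

```python
import os

# Must be set before the first CUDA initialisation (safest: before importing torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only physical GPU 1 to this process

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
```

Inside the process the selected GPU is then addressed as cuda:0. The same variable can be set on the command line when launching the training script.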

Explain Tez model using SHAP

import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
import shap
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()

# set model decoder to true

model.config.is_decoder=True

# set text-generation params under task_specific_params

model.config.task_specific_params["text-generation"] = {
    "do_sample": True,
    "max_length": 50,
    "temperature": 0.7,
    "top_k": 50,
    "no_repeat_ngram_size": 2
}
# Define initial text
s = ['I enjoy walking with my cute dog']
# Create an explainer object
explainer = shap.Explainer(model,tokenizer)

This works for a transformers model.

When I try the same with a Tez model, it gives an error.

class Classifier(tez.Model):
    def __init__(self, num_train_steps, num_classes):
        super().__init__()
        self.bert = transformers.SqueezeBertModel.from_pretrained("squeezebert/squeezebert-uncased")
        self.bert_drop = nn.Dropout(0.3)
        self.out = nn.Linear(768, num_classes)
        self.num_train_steps = num_train_steps
        self.step_scheduler_after = "batch"

    def fetch_optimizer(self):
        param_optimizer = list(self.named_parameters())
        no_decay = ["bias", "LayerNorm.bias"]
        optimizer_parameters = [
            {
                "params": [
                    p for n, p in param_optimizer if not any(nd in n for nd in no_decay)
                ],
                "weight_decay": 0.001,
            },
            {
                "params": [
                    p for n, p in param_optimizer if any(nd in n for nd in no_decay)
                ],
                "weight_decay": 0.0,
            },
        ]
        opt = AdamW(optimizer_parameters, lr=3e-5)
        return opt

    def fetch_scheduler(self):
        sch = get_linear_schedule_with_warmup(
            self.optimizer, num_warmup_steps=0, num_training_steps=self.num_train_steps
        )
        return sch

    def loss(self, outputs, targets):
        if targets is None:
            return None
        return nn.BCEWithLogitsLoss()(outputs, targets.float())

    def monitor_metrics(self, outputs, targets):
        if targets is None:
            return {}

        outputs = torch.sigmoid(outputs)
        outputs = outputs.cpu().detach().numpy()
        targets = targets.cpu().detach().numpy()

        fpr_micro, tpr_micro, _ = metrics.roc_curve(targets.ravel(), outputs.ravel())
        auc_micro = metrics.auc(fpr_micro, tpr_micro)
        return {"auc": auc_micro}

    def forward(self, ids, mask, targets=None):
        o_2 = self.bert(ids, attention_mask=mask)[1]
        b_o = self.bert_drop(o_2)
        output = self.out(b_o)
        loss = self.loss(output, targets)
        acc = self.monitor_metrics(output, targets)
        return output, loss, acc

n_train_steps = 13565
model = Classifier(n_train_steps, 28)
optimizer = model.fetch_optimizer()
checkpoint = torch.load('C:/Users/Jay/Downloads/model (1).bin', map_location=torch.device('cpu'))
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])

explainer = shap.Explainer(model, tokenizer)
shap_values = explainer('The waiter shook his head in horror and left.')

Error: TypeError: forward() missing 1 required positional argument: 'mask'

Can you please let me know how we can resolve this?

ValueError: only one element tensors can be converted to Python scalars

Hi, I'm trying to run the example code for sentiment classification but it gives me this error. It looks like the issue arises due to the way tensor size is calculated.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-2a229acd2f86> in <module>
    171         valid_dataset=valid_dataset,
    172         callbacks=[es],
--> 173         config=config,
    174     )

~/anaconda3/envs/bertopic/lib/python3.7/site-packages/tez/model/tez.py in fit(self, train_dataset, valid_dataset, config, **kwargs)
    455         for _ in range(self.config.epochs):
    456             self.train_state = enums.TrainingState.EPOCH_START
--> 457             self.train(self.train_loader, losses)
    458             if self.valid_loader and self.config.val_strategy == "epoch":
    459                 self.validate(self.valid_loader)

~/anaconda3/envs/bertopic/lib/python3.7/site-packages/tez/model/tez.py in train(self, data_loader, losses)
    399             self.train_state = enums.TrainingState.TRAIN_STEP_START
    400             loss, metrics = self.train_step(data)
--> 401             losses, monitor = self._update_loss_metrics(losses, loss, metrics, data_loader)
    402             self.train_state = enums.TrainingState.TRAIN_STEP_END
    403             if self.valid_loader and self.config.val_strategy == "batch":

~/anaconda3/envs/bertopic/lib/python3.7/site-packages/tez/model/tez.py in _update_loss_metrics(self, losses, loss, metrics, data_loader)
    366     def _update_loss_metrics(self, losses, loss, metrics, data_loader):
    367         if self.model_state == enums.ModelState.TRAIN:
--> 368             losses.update(loss.item() * self.config.gradient_accumulation_steps, data_loader.batch_size)
    369         else:
    370             losses.update(loss.item(), data_loader.batch_size)

ValueError: only one element tensors can be converted to Python scalars

How can we access the input_ids/attention mask in each train batch loop?

I tried using a train step callback but I am not sure how to get access to the dataloader input_ids and attention mask during each train step. Is this possible?

BTW Thanks for the library!

zero_grad for accumulation_steps = 1 not working as expected

As far as I know, the normal flow is to call zero_grad for each batch and then run the forward pass. Looking at the code, this is not what happens when accumulation_steps = 1 and batch = 1: the first forward pass runs without a preceding zero_grad.

I tried to reproduce it, and it behaves as described above.

Also, I think we can fix this by removing the condition in tez.py on lines 330 and 331.

How to give the metric to the ReduceLROnPlateau scheduler

I added it in fetch_optimizer, but it throws this error:
TypeError: step() missing 1 required positional argument: 'metrics'
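Unlike most PyTorch schedulers, ReduceLROnPlateau needs the monitored value passed to step(). A minimal sketch in plain PyTorch; the linear model and the validation losses are stand-ins. How tez forwards the metric is version dependent, although another issue on this page sets self.step_scheduler_metric = "valid_auc", which suggests the old Model class reads the metric name from that attribute.

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=0
)

val_losses = [1.0, 1.0, 1.0]  # stand-in validation losses that never improve
for val_loss in val_losses:
    # The monitored metric is a required argument of step().
    scheduler.step(val_loss)

lr = optimizer.param_groups[0]["lr"]  # reduced because the loss plateaued
```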

Saving after training an epoch

How can I save the model after each training epoch? I use the fit method for 5 epochs and do not really understand how to save after each one, not only after the last one.

how to use multi gpu

Hello, how to specify multiple GPUs during model fit?

Not able to converge in local laptop.

Hello Again,

I created a pipeline for a classification problem in a Colab notebook with PyTorch version 1.7.0+cu101, and the model converges very well there. But if I run the same pipeline on my local laptop, with PyTorch version 1.7.1, the accuracy never goes above 10%.

I have no clue what I am missing here. Do you have any idea?

NOTE: This is old Model class and is deprecated. It will no longer be maintained! Please use version > 0.5.1. Its much better and supports multi-gpu training too!

I have 0.7.2 installed via pip and I am getting the version error message. Is that expected?

$ pip list | grep tez
tez                        0.7.2

Load models

Thanks for this great project.

I started using Tez in a Kaggle competition and found that model loading is not handled properly when training and inferring on different devices (GPU -> CPU). This is one possible solution:

    def load(self, model_path, device="cuda"):
        self.device = device
        if next(self.parameters()).device != self.device:
            self.to(self.device)
        model_dict = torch.load(model_path, map_location=torch.device(device))
        self.load_state_dict(model_dict["state_dict"])

Applying metrics after the epoch

Dears, I am using tez to classify melanoma images (Kaggle SIIM binary classification). With wtfml it is possible to get AUC ~ 0.85. With tez, I am only getting AUC ~ 0.6. I saw that this happens in tez when using metrics.roc_auc_score(...) inside the monitor_metrics method. This raises ValueError exceptions, which must be handled by returning auc = 0.5 (the error occurs when the batch contains only one class).

In wtfml, metrics.roc_auc_score(...) is used only after Engine.evaluate. In this case, the data always contain two classes (because the stratified K-fold split ensures that).

I am wondering if it is possible, in tez, to apply metrics.roc_auc_score(...) only after the epoch, and not on each batch of size train_bs. That way the data would always contain two classes, avoiding the ValueError exceptions.

PS.

In the class definition __init__ I am using:
    self.step_scheduler_after = "epoch"
    self.step_scheduler_metric = "valid_auc"
In the monitor_metrics method:
    try:
        auc = metrics.roc_auc_score(targets, outputs.ravel())
    except ValueError:
        auc = 0.5
    return {"auc": auc}
My model.fit is defined as:
    model.fit(train_dataset, valid_dataset=valid_dataset, train_bs=32, valid_bs=16,
              device="cuda", epochs=50, callbacks=[es], fp16=False, n_jobs=2)
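The idea of scoring once per epoch can be sketched without tez: collect the scores and labels from every batch, then compute the AUC on the full set. The EpochAUC class below is hypothetical and uses a small pure-Python pairwise AUC for self-containment; in practice metrics.roc_auc_score would be called once on the collected arrays instead.

```python
class EpochAUC:
    """Collects scores and labels over an epoch and computes ROC AUC once at the end."""

    def __init__(self):
        self.scores = []
        self.labels = []

    def update(self, batch_scores, batch_labels):
        self.scores.extend(float(s) for s in batch_scores)
        self.labels.extend(int(t) for t in batch_labels)

    def compute(self):
        pos = [s for s, t in zip(self.scores, self.labels) if t == 1]
        neg = [s for s, t in zip(self.scores, self.labels) if t == 0]
        if not pos or not neg:
            raise ValueError("AUC is undefined when only one class is present")
        # Fraction of (positive, negative) pairs ranked correctly; ties count half.
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))


meter = EpochAUC()
meter.update([0.9, 0.8], [1, 1])    # single-class batch: per-batch AUC is undefined
meter.update([0.3, 0.85], [0, 0])   # the other class arrives in a later batch
epoch_auc = meter.compute()
```

Each batch here contains a single class, so a per-batch AUC would fail on both, while the epoch-level value is well defined. The pairwise loop is quadratic and meant only as an illustration.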

Is it possible to set variable Lr per epoch

@abhishekkrthakur I find this framework great and easy to use. As I am fairly new to it, I was wondering whether there is a way to use a variable learning rate during training, for example a different one for every epoch.

Also, is there a way to continue training from a particular epoch if, say, the local system crashed or was interrupted during training?

Suggestion: plotting train/valid loss and metrics

logging in text file

Hi Abhishek, thanks for making such a wonderful library. Is there a way to log the validation score to a text file that is saved as output at the end?
context -> https://www.kaggle.com/abhishek/tez-pawpular-training

Documentation improvement - How is tez faster?

Great to see a nice Pytorch training library.

I think it would help users to show what kind of performance improvements come out of the box with Tez. For example, comparing how fp16 is enabled in tez vs vanilla PyTorch could be informative, or just a quick list of optimisations that are easy to do with Tez, such as fp16.

Add citation link.

Please add citation link to the repository in APA format.

how can i use multi gpus for training?

Metrics update

I think there is a small bug here:
losses.update(loss.item(), data_loader.batch_size)
The last batch can be smaller than the batch size defined in the DataLoader, so the averaged metrics are computed with a small error.
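A small worked example of the effect, using a hypothetical WeightedAverage class as a stand-in for the library's loss meter. The losses and batch sizes are made up: two full batches of 4 samples and a final batch with 1 sample.

```python
class WeightedAverage:
    """Running average where each value is weighted by the number of samples it covers."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.total += value * n
        self.count += n

    @property
    def avg(self):
        return self.total / self.count if self.count else 0.0


batches = [(0.5, 4), (0.5, 4), (2.0, 1)]  # (mean loss of batch, actual batch size)

nominal = WeightedAverage()  # always weights by the configured batch size
actual = WeightedAverage()   # weights by the real number of samples in the batch
for batch_loss, batch_len in batches:
    nominal.update(batch_loss, 4)
    actual.update(batch_loss, batch_len)
```

Here nominal.avg comes out as 1.0 while actual.avg is 6/9, about 0.667, because the single-sample batch is over-weighted. In a training loop the real size can be read from the batch itself, for example targets.size(0) for a tensor.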

Getting error while importing enums from tez.

Traceback (most recent call last):
File "/content/tez/tez/model/model.py", line 12, in <module>
from tez import enums
File "/content/tez/tez/model/tez.py", line 11, in <module>
from tez import enums
ImportError: cannot import name 'enums' from 'tez' (/content/tez/tez/model/tez.py)

Waiting for positive reply.

I want this recommender training model please.

AttributeError while importing tez

When I try to import tez with "import tez", I get this error:
AttributeError: module 'torch.optim.lr_scheduler' has no attribute 'SAVE_STATE_WARNING'

README.md should reference the video uploaded on the YT channel

Currently, the README.md has no link to the YT video uploaded on Abhishek's YT channel. Adding a link might help beginners get an idea of how to use the library since the video has explanations for the library itself.

PR not created because of,
'
NOTE: Currently, we are not accepting any pull requests! All PRs will be closed. If you want a feature or something doesn't work, please create an issue.
'

Small error in model.py

Hi! Love this library.
In tez/model/model.py there is probably a mistake in line 90:

self.train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=train_bs,
    num_workers=n_jobs,
    sampler=valid_sampler,
    shuffle=True,
)

I guess train_sampler is meant to be used here, not valid_sampler.
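A minimal sketch of what the corrected call could look like, in plain PyTorch with a toy dataset; this is an illustration, not the library's actual patch. It also shows a related constraint: DataLoader rejects shuffle=True whenever a sampler is supplied.

```python
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

train_dataset = TensorDataset(torch.arange(10).float().unsqueeze(1))
train_sampler = SequentialSampler(train_dataset)

train_loader = DataLoader(
    train_dataset,
    batch_size=4,
    num_workers=0,
    sampler=train_sampler,  # the training sampler, not valid_sampler
    shuffle=False,          # must not be True when a sampler is given
)
num_batches = len(train_loader)

# Passing both a sampler and shuffle=True is rejected by DataLoader.
try:
    DataLoader(train_dataset, sampler=train_sampler, shuffle=True)
    conflict = False
except ValueError:
    conflict = True
```

So the quoted call works only while the sampler is None; once a sampler is passed, shuffling has to be left to the sampler itself.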

Error in Multiclass TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str

/usr/local/lib/python3.7/dist-packages/torch/cuda/amp/grad_scaler.py:116: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.")
0%| | 0/2939 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/torch/cuda/amp/autocast_mode.py:118: UserWarning: torch.cuda.amp.autocast only affects CUDA ops, but CUDA is not available. Disabling.
warnings.warn("torch.cuda.amp.autocast only affects CUDA ops, but CUDA is not available. Disabling.")

TypeError Traceback (most recent call last)
in ()
143 epochs=3,
144 callbacks=[tb_logger, es],
--> 145 fp16=True,
146 )
147 model.save("model.bin")

8 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in dropout(input, p, training, inplace)
1074 if p < 0.0 or p > 1.0:
1075 raise ValueError("dropout probability has to be between 0 and 1, " "but got {}".format(p))
-> 1076 return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
1077
1078

TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str

Issue while using Auc metric on imbalanced dataset like melanoma(ValueError: Only one class present in y_true. ROC AUC score is not defined in that case)

This problem occurs because the metric is calculated on each batch while training is running.

I found this explanation on Stack Overflow:

You cannot have an ROC curve without both positive and negative examples in your dataset. With only one class in the dataset, you cannot measure your false-positive rate, and therefore cannot plot an ROC curve. This is why you get this error message.

How to handle this problem?