tez's Introduction

Tez: a simple pytorch trainer

NOTE: Currently, we are not accepting any pull requests! All PRs will be closed. If you want a feature or something doesn't work, please create an issue.

tez (तेज़ / تیز) means sharp, fast & active. It is a simple, to-the-point library that makes your pytorch training easy.

This library is currently at an early stage, so there might be breaking changes.

The idea behind tez is simple:

  • keep things as simple as possible
  • make it as customizable as possible
  • clean code
  • faster prototyping
  • production ready

Currently, tez supports CPU, single-GPU, multi-GPU and TPU training. More coming soon!

Using tez is super-easy. We don't want you to be far away from pytorch. So, you do everything on your own and just use tez to make a few things simpler.

tez's People

Contributors

abhishekkrthakur, philschmid

tez's Issues

Suggestion: Object detection example

A few object detection competitions are currently running on Kaggle, so I believe this is a good time to create a video and/or code example that uses Tez for object detection. It would help many people and would also be a way to popularize the library.

.

I tried to follow along and got this error: "__init__() got an unexpected keyword argument 'resize'".

Saving oofs file when fitting

When we train a model by calling model.fit, it would be nice to be able to save the validation-set predictions during the run as out-of-fold (OOF) CSV files.
Please look into it.

NotADirectoryError

NotADirectoryError Traceback (most recent call last)
in
----> 1 train_dataset[0]

/opt/conda/lib/python3.7/site-packages/tez/datasets/image_classification.py in getitem(self, item)
39 targets = self.targets[item]
40 if self.backend == "pil":
---> 41 image = Image.open(self.image_paths[item])
42 if self.resize is not None:
43 image = image.resize(

/opt/conda/lib/python3.7/site-packages/PIL/Image.py in open(fp, mode, formats)
2889
2890 if filename:
-> 2891 fp = builtins.open(filename, "rb")
2892 exclusive_fp = True
2893

NotADirectoryError: [Errno 20] Not a directory: '../input/cassava-leaf-disease-classification/train.csv/1724663202.jpg'
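The path in the error message has an image file name appended to train.csv, which suggests the CSV path was used where the image directory was expected. The sketch below shows the difference in plain Python. The helper `build_image_paths` is hypothetical, and the folder name `train_images` is an assumption about the dataset layout, so adjust both to your setup.

```python
import os

# Paths mirroring the traceback; IMAGE_DIR is an assumed folder name.
DATA_DIR = "../input/cassava-leaf-disease-classification"
CSV_PATH = os.path.join(DATA_DIR, "train.csv")
IMAGE_DIR = os.path.join(DATA_DIR, "train_images")


def build_image_paths(image_ids, image_dir):
    """Join each image file name onto the directory that holds the images."""
    return [os.path.join(image_dir, image_id) for image_id in image_ids]


# Wrong: the CSV file is treated as a directory, as in the error above.
wrong = build_image_paths(["1724663202.jpg"], CSV_PATH)
# Right: the file name is joined onto the image directory.
right = build_image_paths(["1724663202.jpg"], IMAGE_DIR)

print(wrong[0])
print(right[0])
```

Printing `image_paths[0]` before building the dataset is a quick way to catch this kind of mistake.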

Small error in image_classification.py

If augmentations is None, we get an error because the variable augmented is referenced before assignment:
UnboundLocalError: local variable 'augmented' referenced before assignment

elif self.backend == "cv2":
    image = cv2.imread(self.image_paths[item])
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    if self.resize is not None:
        image = cv2.resize(
            image,
            (self.resize[1], self.resize[0]),
            interpolation=cv2.INTER_CUBIC,
        )
    if self.augmentations is not None:
        augmented = self.augmentations(image=image)
        image = augmented["image"]

If the indentation is fixed, this error goes away.

ValueError: operands could not be broadcast together with shapes (256,256,4) (3,) (256,256,4)

I am trying to use this package, and it throws the error below. I am using the same pipeline as for the cassava leaf disease problem, but on a different dataset where the image size is (256, 256).

Could you please help?

Downloading: "https://github.com/lukemelas/EfficientNet-PyTorch/releases/download/1.0/efficientnet-b4-6ed6700e.pth" to /root/.cache/torch/hub/checkpoints/efficientnet-b4-6ed6700e.pth
100%
74.4M/74.4M [00:00<00:00, 107MB/s]

Loaded pretrained weights for efficientnet-b4
0%| | 0/51 [00:00<?, ?it/s]

ValueError Traceback (most recent call last)
in ()
11 epochs=10,
12 callbacks=[es],
---> 13 fp16=True,
14 )
15 model.save("model.bin")

6 frames
/usr/local/lib/python3.6/dist-packages/tez/model/model.py in fit(self, train_dataset, valid_dataset, train_sampler, valid_sampler, device, epochs, train_bs, valid_bs, n_jobs, callbacks, fp16)
295 self.train_state = enums.TrainingState.EPOCH_START
296 self.train_state = enums.TrainingState.TRAIN_EPOCH_START
--> 297 train_loss = self.train_one_epoch(self.train_loader, device)
298 self.train_state = enums.TrainingState.TRAIN_EPOCH_END
299 if self.valid_loader:

/usr/local/lib/python3.6/dist-packages/tez/model/model.py in train_one_epoch(self, data_loader, device)
176 losses = AverageMeter()
177 tk0 = tqdm(data_loader, total=len(data_loader))
--> 178 for b_idx, data in enumerate(tk0):
179 self.train_state = enums.TrainingState.TRAIN_STEP_START
180 loss, metrics = self.train_one_step(data, device)

/usr/local/lib/python3.6/dist-packages/tqdm/std.py in iter(self)
1102 fp_write=getattr(self.fp, 'write', sys.stderr.write))
1103
-> 1104 for obj in iterable:
1105 yield obj
1106 # Update and possibly print the progressbar.

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in next(self)
433 if self._sampler_iter is None:
434 self._reset()
--> 435 data = self._next_data()
436 self._num_yielded += 1
437 if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
1083 else:
1084 del self._task_info[idx]
-> 1085 return self._process_data(data)
1086
1087 def _try_put_index(self):

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in _process_data(self, data)
1109 self._try_put_index()
1110 if isinstance(data, ExceptionWrapper):
-> 1111 data.reraise()
1112 return data
1113

/usr/local/lib/python3.6/dist-packages/torch/_utils.py in reraise(self)
426 # have message field
427 raise self.exc_type(message=msg)
--> 428 raise self.exc_type(msg)
429
430

ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/tez/datasets/image_classification.py", line 48, in getitem
augmented = self.augmentations(image=image)
File "/usr/local/lib/python3.6/dist-packages/albumentations/core/composition.py", line 171, in call
data = t(**data)
File "/usr/local/lib/python3.6/dist-packages/albumentations/core/transforms_interface.py", line 38, in call
res[key] = target_function(arg, **dict(params, **target_dependencies))
File "/usr/local/lib/python3.6/dist-packages/albumentations/augmentations/transforms.py", line 808, in apply
return F.normalize(image, self.mean, self.std, self.max_pixel_value)
File "/usr/local/lib/python3.6/dist-packages/albumentations/augmentations/functional.py", line 93, in normalize
img -= mean
ValueError: operands could not be broadcast together with shapes (256,256,4) (3,) (256,256,4)
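The last line of the traceback says that an array of shape (256,256,4) is being combined with a mean of shape (3,). One plausible reading, which is an assumption because the issue does not show the data, is that the images carry a fourth (alpha) channel while the normalization statistics cover three channels. The sketch below re-implements the broadcasting rule in plain Python to show why 4 against 3 fails while 3 against 3 works. The helper name `broadcast_shape` is hypothetical.

```python
def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes.

    Shapes are aligned from the right; each pair of dimensions must be
    equal, or one of them must be 1.
    """
    n = max(len(a), len(b))
    # Pad the shorter shape with leading 1s so both have the same length.
    a = (1,) * (n - len(a)) + tuple(a)
    b = (1,) * (n - len(b)) + tuple(b)
    result = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"cannot broadcast {a} with {b}")
    return tuple(result)


# Three-channel image against a three-value mean: compatible.
print(broadcast_shape((256, 256, 3), (3,)))

# Four-channel image against a three-value mean: incompatible.
try:
    broadcast_shape((256, 256, 4), (3,))
except ValueError as err:
    print("error:", err)
```

If an alpha channel is indeed the cause, converting each image to three channels before augmentation, for example with PIL's `Image.convert("RGB")`, would be one way to avoid the error.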

Resuming the training through checkpoint with tez

Hi,

I am wondering if it is possible to pick up a saved model and resume/continue the training with tez. I am new to pytorch. Here is what I tried:

class Bert(tez.Model):

    def __init__(self, num_classes, num_train_steps=None):

        super().__init__()
        self.bert = transformers.BertModel.from_pretrained(
            'bert-base-uncased',
            return_dict=False
        )

        if config.RETRAINING: # set to True
            self.bert.load(
                'demo.bin',
                device='cuda')

        self.bert_drop = nn.Dropout(0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, num_classes)

and it doesn't work. I am not sure what I am missing. I found this for pytorch:

https://discuss.pytorch.org/t/loading-a-saved-model-for-continue-training/17244

but I am not sure how to use this together with tez.

Can it work without CUDA

I get an error when I run the code with a CPU configuration.

Traceback (most recent call last):
File "c:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\recommender.py", line 88, in
train()
File "c:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\recommender.py", line 82, in train
model.fit(
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\tez\model\model.py", line 309, in fit
self._init_model(
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\tez\model\model.py", line 93, in _init_model
self.to(self.device)
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\torch\nn\modules\module.py", line 852, in to
return self._apply(convert)
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\torch\nn\modules\module.py", line 530, in _apply
module._apply(fn)
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\torch\nn\modules\module.py", line 552, in apply
param_applied = fn(param)
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\torch\nn\modules\module.py", line 850, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "C:\Users\Hemanth\Desktop\Data Analytics analyticvidya\recommender system\venv\lib\site-packages\torch\cuda\__init__.py", line 166, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
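The traceback ends in `self.to(self.device)`, so the model is being moved to a CUDA device on a PyTorch build without CUDA support. A minimal sketch of choosing the device string before training, assuming the device is passed to fit as in other snippets on this page (`device="cuda"`); the helper `pick_device` is hypothetical and not part of tez.

```python
def pick_device(preferred="cuda"):
    """Return `preferred` if it is usable, otherwise fall back to "cpu"."""
    if preferred != "cuda":
        return preferred
    try:
        import torch  # only needed to query CUDA availability

        cuda_ok = torch.cuda.is_available()
    except ImportError:
        # Without PyTorch installed there is no CUDA to use either.
        cuda_ok = False
    return "cuda" if cuda_ok else "cpu"


device = pick_device()
print(device)
```

The resulting string could then be passed wherever the training code currently hard-codes "cuda".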

Text classification examples - Tokenizer is defined twice

The tokenizer is defined both in the model and the dataset in the BERT text classification examples.

multi_class.py, line 50:
self.tokenizer = transformers.BertTokenizer.from_pretrained( "bert-base-uncased", do_lower_case=True )

Saving validation score

Is it possible to save a list of the validation scores (per epoch or per batch) after training? The console output on my server usually gets deleted, but I really need the validation scores to compare models. It would be very convenient to get them in a single file, for example.
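One way to keep the scores is to collect them in a list yourself and write them out as CSV. The sketch below uses only the standard library. How the scores are collected from tez is left open; the `history` list, its column names, and its values are illustrative placeholders, and `write_history` is a hypothetical helper.

```python
import csv
import io


def write_history(history, fileobj):
    """Write a list of per-epoch score dicts to an open file as CSV."""
    fieldnames = list(history[0].keys())
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(history)


# Placeholder values for illustration only, not real results.
history = [
    {"epoch": 1, "valid_loss": 0.5, "valid_accuracy": 0.8},
    {"epoch": 2, "valid_loss": 0.4, "valid_accuracy": 0.85},
]

# An in-memory buffer keeps the example self-contained;
# in practice use open("valid_scores.csv", "w", newline="").
buffer = io.StringIO()
write_history(history, buffer)
print(buffer.getvalue())
```

Appending to the list once per epoch and rewriting the file each time means the scores survive even if the run is interrupted.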

Error when running the example code

When I run the example code:

accelerate launch   imdb_sentiment_classification.py

after some epochs I get this error:

INFO:tez.callbacks.early_stopping:EarlyStopping counter: 4/5
[train] accuracy=0.9915, loss=0.0269 [valid] accuracy=0.8953, loss=0.4287 [e=5 steps=2112]                                                                                                 
 30%|████████████████████████████████▍                                                                           | 2112/7040 [05:45<06:40, 12.32it/s, accuracy=0.991, epoch=5, loss=0.0269]2022-09-17 07:55:02,832 INFO EarlyStopping counter: 5/5
INFO:tez.callbacks.early_stopping:EarlyStopping counter: 5/5
 30%|████████████████████████████████▍                                                                           | 2112/7040 [05:47<13:31,  6.07it/s, accuracy=0.991, epoch=5, loss=0.0269]




[E ProcessGroupNCCL.cpp:719] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34532, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808970 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34532, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808984 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34532, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809275 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34532, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809275 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34532, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808970 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34532, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808984 milliseconds before timing out.
Traceback (most recent call last):
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/resource_sharer.py", line 138, in _serve
    with self._listener.accept() as conn:
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 470, in accept
    deliver_challenge(c, self._authkey)
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 745, in deliver_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 113654 closing signal SIGTERM
Traceback (most recent call last):
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/resource_sharer.py", line 138, in _serve
    with self._listener.accept() as conn:
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 470, in accept
    deliver_challenge(c, self._authkey)
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 745, in deliver_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/lightning/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 2 (pid: 113655) of binary: /root/miniconda3/envs/lightning/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/lightning/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
imdb_sentiment_classification.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2022-09-17_08:25:22
  host      : dy-a100-779-tlzrv
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 113656)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 113656
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-09-17_08:25:22
  host      : dy-a100-779-tlzrv
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 113655)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 113655
=======================================================
Traceback (most recent call last):
  File "/root/miniconda3/envs/lightning/bin/accelerate", line 33, in <module>
    sys.exit(load_entry_point('accelerate==0.12.0.dev0', 'console_scripts', 'accelerate')())
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/accelerate-0.12.0.dev0-py3.9.egg/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/accelerate-0.12.0.dev0-py3.9.egg/accelerate/commands/launch.py", line 734, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/lightning/lib/python3.9/site-packages/accelerate-0.12.0.dev0-py3.9.egg/accelerate/commands/launch.py", line 374, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '4', 'imdb_sentiment_classification.py']' returned non-zero exit status 1.

Explain Tez model using SHAP

import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
import shap
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()

# set model decoder to true
model.config.is_decoder = True

# set text-generation params under task_specific_params
model.config.task_specific_params["text-generation"] = {
    "do_sample": True,
    "max_length": 50,
    "temperature": 0.7,
    "top_k": 50,
    "no_repeat_ngram_size": 2
}

# Define initial text
s = ['I enjoy walking with my cute dog']

# Create an explainer object
explainer = shap.Explainer(model, tokenizer)

The code above uses a transformers model.

When I try the same with a Tez model, it gives an error.

class Classifier(tez.Model):
    def __init__(self, num_train_steps, num_classes):
        super().__init__()
        self.bert = transformers.SqueezeBertModel.from_pretrained("squeezebert/squeezebert-uncased")
        self.bert_drop = nn.Dropout(0.3)
        self.out = nn.Linear(768, num_classes)
        self.num_train_steps = num_train_steps
        self.step_scheduler_after = "batch"

    def fetch_optimizer(self):
        param_optimizer = list(self.named_parameters())
        no_decay = ["bias", "LayerNorm.bias"]
        optimizer_parameters = [
            {
                "params": [
                    p for n, p in param_optimizer if not any(nd in n for nd in no_decay)
                ],
                "weight_decay": 0.001,
            },
            {
                "params": [
                    p for n, p in param_optimizer if any(nd in n for nd in no_decay)
                ],
                "weight_decay": 0.0,
            },
        ]
        opt = AdamW(optimizer_parameters, lr=3e-5)
        return opt

    def fetch_scheduler(self):
        sch = get_linear_schedule_with_warmup(
            self.optimizer, num_warmup_steps=0, num_training_steps=self.num_train_steps
        )
        return sch

    def loss(self, outputs, targets):
        if targets is None:
            return None
        return nn.BCEWithLogitsLoss()(outputs, targets.float())

    def monitor_metrics(self, outputs, targets):
        if targets is None:
            return {}

        outputs = torch.sigmoid(outputs)
        outputs = outputs.cpu().detach().numpy()
        targets = targets.cpu().detach().numpy()

        fpr_micro, tpr_micro, _ = metrics.roc_curve(targets.ravel(), outputs.ravel())
        auc_micro = metrics.auc(fpr_micro, tpr_micro)
        return {"auc": auc_micro}

    def forward(self, ids, mask, targets=None):
        o_2 = self.bert(ids, attention_mask=mask)[1]
        b_o = self.bert_drop(o_2)
        output = self.out(b_o)
        loss = self.loss(output, targets)
        acc = self.monitor_metrics(output, targets)
        return output, loss, acc

n_train_steps = 13565
model = Classifier(n_train_steps, 28)
optimizer = model.fetch_optimizer()
checkpoint = torch.load('C:/Users/Jay/Downloads/model (1).bin', map_location=torch.device('cpu'))
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])

explainer = shap.Explainer(model, tokenizer)
shap_values = explainer('The waiter shook his head in horror and left.')

Error : TypeError: forward() missing 1 required positional argument: 'mask'

Can you please let me know how we can resolve this?

ValueError: only one element tensors can be converted to Python scalars

Hi, I'm trying to run the example code for sentiment classification but it gives me this error. It looks like the issue arises due to the way tensor size is calculated.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-2a229acd2f86> in <module>
    171         valid_dataset=valid_dataset,
    172         callbacks=[es],
--> 173         config=config,
    174     )

~/anaconda3/envs/bertopic/lib/python3.7/site-packages/tez/model/tez.py in fit(self, train_dataset, valid_dataset, config, **kwargs)
    455         for _ in range(self.config.epochs):
    456             self.train_state = enums.TrainingState.EPOCH_START
--> 457             self.train(self.train_loader, losses)
    458             if self.valid_loader and self.config.val_strategy == "epoch":
    459                 self.validate(self.valid_loader)

~/anaconda3/envs/bertopic/lib/python3.7/site-packages/tez/model/tez.py in train(self, data_loader, losses)
    399             self.train_state = enums.TrainingState.TRAIN_STEP_START
    400             loss, metrics = self.train_step(data)
--> 401             losses, monitor = self._update_loss_metrics(losses, loss, metrics, data_loader)
    402             self.train_state = enums.TrainingState.TRAIN_STEP_END
    403             if self.valid_loader and self.config.val_strategy == "batch":

~/anaconda3/envs/bertopic/lib/python3.7/site-packages/tez/model/tez.py in _update_loss_metrics(self, losses, loss, metrics, data_loader)
    366     def _update_loss_metrics(self, losses, loss, metrics, data_loader):
    367         if self.model_state == enums.ModelState.TRAIN:
--> 368             losses.update(loss.item() * self.config.gradient_accumulation_steps, data_loader.batch_size)
    369         else:
    370             losses.update(loss.item(), data_loader.batch_size)

ValueError: only one element tensors can be converted to Python scalars

zero_grad for accumulation_steps = 1 not working as expected

As far as I know, the normal execution flow is to call zero_grad for each batch and then run the forward pass. In the code, however, this does not happen when accumulation_steps = 1 and batch = 1: the first forward pass runs without zero_grad being called beforehand.

I tried to reproduce it and observed the behavior described above.

Also, I think we can fix this by removing the condition on lines 330 and 331 of tez.py.

Saving after training an epoch

How can I save the model after each training epoch? I use the fit method for 5 epochs and do not really understand how to save after each one, not only after the last one.

Not able to converge on local laptop.

Hello Again,

I created a pipeline for a classification problem in a Colab notebook with pytorch version 1.7.0+cu101, and the model converges very well there. But if I run the same pipeline on my local laptop, with pytorch version 1.7.1, the accuracy never goes above 10%.

I have no clue what I am missing here. Do you have any idea?

Load models

Thanks for this great project.

I started using Tez in a Kaggle competition and found that model loading is not handled properly when training and inference run on different devices (GPU -> CPU). Here is one possible solution:

    def load(self, model_path, device="cuda"):
        self.device = device
        if next(self.parameters()).device != self.device:
            self.to(self.device)
        model_dict = torch.load(model_path, map_location=torch.device(device))
        self.load_state_dict(model_dict["state_dict"])

Applying metrics after the epoch

Hi all, I am using tez to classify melanoma images (Kaggle SIIM binary classification). With wtfml it is possible to get AUC ~ 0.85. With tez, I am only getting AUC ~ 0.6. In tez, this happens when metrics.roc_auc_score(...) is used inside the monitor_metrics method. It raises ValueError exceptions, which must be handled by returning auc = 0.5 (the error occurs when the data contains only one class).

In wtfml, metrics.roc_auc_score(...) is called only after Engine.evaluate. In that case the data always contains both classes (the stratified K-fold split guarantees that).

I am wondering if it is possible, in tez, to apply metrics.roc_auc_score(...) only after the epoch, and not on each batch of size train_bs. That way the data will always contain both classes, avoiding the ValueError exceptions.

PS.

  1. In the class definition init I am using:
    self.step_scheduler_after = "epoch"
    self.step_scheduler_metric = "valid_auc"
  2. In the monitor_metrics method:
    try: auc = metrics.roc_auc_score(targets, outputs.ravel())
    except ValueError: auc = 0.5
    return {"auc": auc}
  3. My model.fit is defined as:
    model.fit(train_dataset, valid_dataset=valid_dataset, train_bs=32, valid_bs=16,
    device="cuda", epochs=50, callbacks=[es], fp16=False, n_jobs=2)
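The epoch-level idea can be sketched without tez: accumulate targets and scores batch by batch, and compute AUC once over everything collected. The class `EpochAUC` is hypothetical and not part of tez or wtfml. It uses a pure-Python pairwise definition of AUC (the fraction of positive/negative pairs ranked correctly, ties counted as half) so the example runs without scikit-learn; in practice `metrics.roc_auc_score` could be called on the accumulated lists instead.

```python
class EpochAUC:
    """Accumulate binary targets and scores, then compute AUC once."""

    def __init__(self):
        self.targets = []
        self.scores = []

    def update(self, targets, scores):
        # Called once per batch with plain Python lists.
        self.targets.extend(targets)
        self.scores.extend(scores)

    def compute(self):
        pos = [s for t, s in zip(self.targets, self.scores) if t == 1]
        neg = [s for t, s in zip(self.targets, self.scores) if t == 0]
        if not pos or not neg:
            return None  # AUC is undefined with a single class
        # Count correctly ordered (positive, negative) pairs; ties count half.
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))


meter = EpochAUC()
meter.update([0, 0], [0.1, 0.4])   # batch with only negatives
meter.update([1, 0], [0.8, 0.3])   # mixed batch
meter.update([1, 1], [0.35, 0.9])  # batch with only positives
auc = meter.compute()
print(auc)
```

Single-class batches no longer matter here, because the score is computed only once both classes have been seen across the epoch.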

Is it possible to set variable Lr per epoch

@abhishekkrthakur I find this framework great and easy to use. As I am fairly new to it, I was wondering whether there is a way to use a variable learning rate during training, for example a different one for every epoch.

Also, is there a way to continue training from a particular epoch if, say, the local system crashed or was interrupted during the training process?
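For the per-epoch learning rate question, one common PyTorch pattern is a function that maps the epoch index to a multiplier of the base learning rate. The sketch below shows only that function in plain Python; the schedule values and the name `lr_factor` are made up for illustration. Such a function can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`. Whether returning that scheduler from `fetch_scheduler` with `step_scheduler_after = "epoch"` gives the desired behavior in tez is an assumption based on the other snippets on this page.

```python
# Hypothetical schedule: (first_epoch, multiplier) pairs sorted by epoch.
SCHEDULE = [(0, 1.0), (3, 0.1), (6, 0.01)]


def lr_factor(epoch, schedule=SCHEDULE):
    """Return the multiplier of the base learning rate for this epoch."""
    factor = schedule[0][1]
    for start, multiplier in schedule:
        # The last entry whose start epoch has been reached wins.
        if epoch >= start:
            factor = multiplier
    return factor


base_lr = 3e-5
for epoch in range(8):
    print(epoch, base_lr * lr_factor(epoch))
```

With this schedule, epochs 0 to 2 use the base rate, epochs 3 to 5 use a tenth of it, and later epochs use a hundredth.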

Documentation improvement - How is tez faster?

Great to see a nice Pytorch training library.

I think it would help users to show what kind of performance improvements come out of the box with Tez. For example, comparing how fp16 is enabled in tez vs vanilla pytorch could be informative, or just a quick list of optimisations that are easy to do with Tez, such as fp16.

Metrics update

I think there is a small bug here:
losses.update(loss.item(), data_loader.batch_size)
The last batch can be smaller than the batch size defined in the DataLoader, so the averaged metrics will be slightly off.
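The effect is easy to see with a small running-average class. The `AverageMeter` below is a minimal stand-in written for this example, not tez's own implementation, and the loss values are made up. It compares weighting each batch by its actual size against weighting by the nominal batch size.

```python
class AverageMeter:
    """Keep a running average weighted by sample count."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, value, n=1):
        # `value` is a per-batch mean, `n` the number of samples behind it.
        self.sum += value * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count if self.count else 0.0


# 10 samples with a nominal batch size of 4: batches of 4, 4 and 2.
batches = [(1.0, 4), (1.0, 4), (4.0, 2)]  # (mean loss, actual batch size)
nominal_batch_size = 4

correct = AverageMeter()
biased = AverageMeter()
for mean_loss, actual_size in batches:
    correct.update(mean_loss, actual_size)
    biased.update(mean_loss, nominal_batch_size)

print(correct.avg)  # weighted by the real number of samples
print(biased.avg)   # over-weights the short last batch
```

Passing the real number of samples in each batch, rather than `data_loader.batch_size`, removes the bias.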

Getting error while importing enums from tez.

Traceback (most recent call last):
File "/content/tez/tez/model/model.py", line 12, in
from tez import enums
File "/content/tez/tez/model/tez.py", line 11, in
from tez import enums
ImportError: cannot import name 'enums' from 'tez' (/content/tez/tez/model/tez.py)

Waiting for positive reply.

AttributeError while importing tez

When I try to import tez with "import tez", I get this error:
AttributeError: module 'torch.optim.lr_scheduler' has no attribute 'SAVE_STATE_WARNING'

README.md should reference the video uploaded on the YT channel

Currently, the README.md has no link to the YT video uploaded on Abhishek's YT channel. Adding a link might help beginners get an idea of how to use the library since the video has explanations for the library itself.

PR not created because of,
'
NOTE: Currently, we are not accepting any pull requests! All PRs will be closed. If you want a feature or something doesn't work, please create an issue.
'

Small error in model.py

Hi! Love this library.
In tez/model/model.py there is probably a mistake in line 90:

self.train_loader = torch.utils.data.DataLoader(
                train_dataset,
                batch_size=train_bs,
                num_workers=n_jobs,
                sampler=valid_sampler,
                shuffle=True,
            )

I guess train_sampler is meant to be used here, not valid_sampler.

Error in Multiclass TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str

/usr/local/lib/python3.7/dist-packages/torch/cuda/amp/grad_scaler.py:116: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.")
0%| | 0/2939 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/torch/cuda/amp/autocast_mode.py:118: UserWarning: torch.cuda.amp.autocast only affects CUDA ops, but CUDA is not available. Disabling.
warnings.warn("torch.cuda.amp.autocast only affects CUDA ops, but CUDA is not available. Disabling.")

TypeError Traceback (most recent call last)
in ()
143 epochs=3,
144 callbacks=[tb_logger, es],
--> 145 fp16=True,
146 )
147 model.save("model.bin")

8 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in dropout(input, p, training, inplace)
1074 if p < 0.0 or p > 1.0:
1075 raise ValueError("dropout probability has to be between 0 and 1, " "but got {}".format(p))
-> 1076 return VF.dropout(input, p, training) if inplace else _VF.dropout(input, p, training)
1077
1078

TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str

Issue while using Auc metric on imbalanced dataset like melanoma(ValueError: Only one class present in y_true. ROC AUC score is not defined in that case)

This problem occurs because the metric is calculated on the fly, batch by batch.

I found this explanation on Stack Overflow:

You cannot have an ROC curve without both positive and negative examples in your dataset. With only one class in the dataset, you cannot measure your false-positive rate, and therefore cannot plot an ROC curve. This is why you get this error message.

How to handle this problem?
