dusty-nv / pytorch-classification

Training of image classification models with PyTorch

Python 100.00%

pytorch-classification's Issues

ValueError: Invalid backend: ''nccl''

I am trying to run single-node, multi-GPU training, but I get this error. It happens even though NCCL is installed. It is installed under /usr, so I don't think a PATH setting is needed. Do you know why this happens?

train.py -a resnet50 --dist-url 'tcp://127.0.0.1:9999' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 --model-dir=models/cat_dog data/cat_dog
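
One possible reading of the message, offered as an assumption because the post does not say which shell was used: the doubled quotes in `''nccl''` suggest the quote characters reached the script as part of the value, which can happen in shells that do not strip single quotes (Windows cmd, for example). The sketch below shows how a script could normalize such a value before using it. `normalize_backend`, `check_backend`, and `KNOWN_BACKENDS` are hypothetical names for illustration and are not part of train.py.

```python
# Backend names accepted here for illustration; a given PyTorch build
# may support only some of them.
KNOWN_BACKENDS = {"gloo", "mpi", "nccl"}


def normalize_backend(value):
    """Strip whitespace and literal quote characters around a backend name."""
    return value.strip().strip("'\"").lower()


def check_backend(value):
    """Return the cleaned backend name, or raise if it is not recognized."""
    name = normalize_backend(value)
    if name not in KNOWN_BACKENDS:
        raise ValueError("Invalid backend: {!r}".format(value))
    return name


if __name__ == "__main__":
    # A value that arrived with its quotes still attached
    print(check_backend("'nccl'"))
```

If the quotes are the cause, passing `--dist-backend nccl` without quotes may be enough. Separately, `torch.distributed.is_nccl_available()` reports whether the installed PyTorch build includes NCCL support.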

Evaluation/Testing script

Hey, the training script works perfectly fine on my server. But when I try to test the trained model on the same server machine (x86-64), it fails. I am using a custom script that I wrote for testing:
```python
import multiprocessing

import torch
import torch.utils.data as data
from sklearn.metrics import confusion_matrix
from torchvision import datasets, transforms

EVAL_DIR = "/home/ajithbalakrishnan/vijnalabs/My_Learning/my_workspace/pytorch-image-classification_1/dataset/gps_lock/"
EVAL_MODEL = "/home/ajithbalakrishnan/vijnalabs/My_Learning/my_workspace/pytorch-image-classification_1/checkpoint.pth.tar"

# Load the checkpoint file written by the training script
model = torch.load(EVAL_MODEL)
model.eval()

num_cpu = multiprocessing.cpu_count()
bs = 8

# Resize, center-crop, and normalize with the usual ImageNet mean/std
eval_transform = transforms.Compose([
    transforms.Resize(size=256),
    transforms.CenterCrop(size=224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])])

eval_dataset = datasets.ImageFolder(root=EVAL_DIR, transform=eval_transform)
eval_loader = data.DataLoader(eval_dataset, batch_size=bs, shuffle=True,
                              num_workers=num_cpu, pin_memory=True)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

num_classes = len(eval_dataset.classes)
dsize = len(eval_dataset)

class_names = ["baseballdiamond", "forest", "golfcourse", "harbor",
               "overpass", "river", "storagetanks"]

# Accumulate predictions and ground-truth labels on the CPU
predlist = torch.zeros(0, dtype=torch.long, device='cpu')
lbllist = torch.zeros(0, dtype=torch.long, device='cpu')

correct = 0
total = 0
with torch.no_grad():
    for images, labels in eval_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)

        total += labels.size(0)
        correct += (predicted == labels).sum().item()

        predlist = torch.cat([predlist, predicted.view(-1).cpu()])
        lbllist = torch.cat([lbllist, labels.view(-1).cpu()])

overall_accuracy = 100 * correct / total
print('Accuracy of the network on the {:d} test images: {:.2f}%'.format(
    dsize, overall_accuracy))

conf_mat = confusion_matrix(lbllist.numpy(), predlist.numpy())
print('Confusion Matrix')
print('-' * 16)
print(conf_mat, '\n')

class_accuracy = 100 * conf_mat.diagonal() / conf_mat.sum(1)
print('Per class accuracy')
print('-' * 18)
# int(label) only works if the class folders have numeric names
for label, accuracy in zip(eval_dataset.classes, class_accuracy):
    class_name = class_names[int(label)]
    print('Accuracy of class %8s : %0.2f %%' % (class_name, accuracy))
```
Please help me here.

Error:
```
Traceback (most recent call last):
  File "eval.py", line 30, in <module>
    model.eval()
AttributeError: 'dict' object has no attribute 'eval'
```
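
The traceback indicates that `torch.load` returned a dict rather than a model object, which is what happens when a checkpoint stores the weights instead of the pickled module. A minimal sketch of one way to handle that follows. It assumes, without confirmation from the post, that the weights sit under a `state_dict` key (as in the PyTorch ImageNet example) and that the architecture name and class count are known. `extract_state_dict` and `load_model_for_eval` are hypothetical helper names.

```python
from collections import OrderedDict


def extract_state_dict(checkpoint):
    """Return bare model weights from a checkpoint dict.

    Assumption: weights are stored under 'state_dict'; if that key is
    missing, the dict itself is treated as the state dict.
    """
    state = checkpoint.get("state_dict", checkpoint)
    cleaned = OrderedDict()
    for key, value in state.items():
        # Models wrapped in DataParallel / DistributedDataParallel
        # prefix every parameter name with "module."
        if key.startswith("module."):
            key = key[len("module."):]
        cleaned[key] = value
    return cleaned


def load_model_for_eval(path, arch, num_classes):
    """Rebuild the network, load the saved weights, and set eval mode."""
    # Imported lazily so the helper above stays usable without PyTorch
    import torch
    import torchvision.models as models

    checkpoint = torch.load(path, map_location="cpu")
    model = models.__dict__[arch](num_classes=num_classes)
    model.load_state_dict(extract_state_dict(checkpoint))
    model.eval()
    return model
```

In the script above this would replace the `torch.load` / `model.eval()` pair, for example `model = load_model_for_eval(EVAL_MODEL, "resnet18", 7).to(device)`, where `"resnet18"` and `7` are placeholders for whatever was actually trained.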

Classification: Data-preprocessing for Much Higher Accuracy and Confidence Level

Hi Dusty, I'm using a Jetson Nano 2GB with the classification pipeline. I struggled with accuracy and confidence for quite some time. I'm trying to classify 3 classes, and most of the time the prediction was wrong, or right but with low confidence. I knew it was not related to training, because training reported an Acc@1 of 97.xx.

I once suspected the model conversion, but there was almost nothing I could do about it.

In the end I reckoned that the preprocessing of the inference data and of the training data might differ, so I tried resizing the image and cropping its center before feeding it to the network, and things improved drastically!

This is what I changed in imagenet.py:

...

# process frames until the user exits
while True:
  # capture the next image
  img_input = input.Capture()

  img_intermediate = jetson.utils.cudaAllocMapped(width=img_input.width/img_input.height*224,
                                                  height=224,
                                                  format=img_input.format)

  # rescale the image (the dimensions are taken from the image capsules)
  jetson.utils.cudaResize(img_input, img_intermediate)

  # crop a centered 224x224 region: (left, top, right, bottom)
  crop_roi = ((img_intermediate.width - 224)/2, 0, 224 + (img_intermediate.width - 224)/2, 224)
  img = jetson.utils.cudaAllocMapped(width=224,
                                     height=224,
                                     format=img_intermediate.format)

  jetson.utils.cudaCrop(img_intermediate, img, crop_roi)

  # classify the image
  class_id, confidence = net.Classify(img)

...

I'm not sure why it worked so well in your S3E3 video while mine needed this tweak, and I'd like to understand why. Was it because of a different version of the code, or a different machine (Jetson Nano vs. something else)?
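
The resize-then-center-crop arithmetic from the snippet above can be checked on its own, without a camera or jetson.utils. This is a minimal sketch: `center_crop_geometry` is a hypothetical helper, it uses integer math, and the handling of portrait images is an addition that the original snippet does not cover.

```python
def center_crop_geometry(width, height, size=224):
    """Compute the resized dimensions and the centered crop ROI.

    The shorter side is scaled to `size`, then a size x size square is
    cut from the middle. The ROI is (left, top, right, bottom).
    """
    if width >= height:
        # Landscape or square: height becomes `size`, width scales with it
        new_w, new_h = width * size // height, size
    else:
        # Portrait: width becomes `size` (not handled in the original post)
        new_w, new_h = size, height * size // width

    left = (new_w - size) // 2
    top = (new_h - size) // 2
    return (new_w, new_h), (left, top, left + size, top + size)


if __name__ == "__main__":
    # A 1280x720 camera frame
    print(center_crop_geometry(1280, 720))
```

For a 1280x720 frame this gives an intermediate image of 398x224 and a crop ROI of (87, 0, 311, 224), which are the values one would pass to `cudaAllocMapped` and `cudaCrop` in the loop above.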

Not able to retrain on mobilenet_v2 network

Hello,

I am trying to retrain a classification model (mobilenet_v1 or mobilenet_v2), following the instructions at:

https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-cat-dog.md

The help output of train.py lists the supported architectures as follows:

  --model-dir MODEL_DIR
                        path to desired output directory for saving model
                        checkpoints (default: current directory)
  -a ARCH, --arch ARCH  model architecture: alexnet | densenet121 |
                        densenet161 | densenet169 | densenet201 | googlenet |
                        inception_v3 | mnasnet0_5 | mnasnet0_75 | mnasnet1_0 |
                        mnasnet1_3 | mobilenet_v2 | resnet101 | resnet152 |
                        resnet18 | resnet34 | resnet50 | resnext101_32x8d |
                        resnext50_32x4d | shufflenet_v2_x0_5 |
                        shufflenet_v2_x1_0 | shufflenet_v2_x1_5 |
                        shufflenet_v2_x2_0 | squeezenet1_0 | squeezenet1_1 |
                        vgg11 | vgg11_bn | vgg13 | vgg13_bn | vgg16 | vgg16_bn
                        | vgg19 | vgg19_bn | wide_resnet101_2 |
                        wide_resnet50_2 (default: resnet18)

We can see that mobilenet_v2 is listed as supported. But when I execute the command below:

python3 train.py -a mobilenet_v2 --model-dir=models/cat_dog data/cat_dog

I get the following error:

Use GPU: 0 for training
=> dataset classes:  2 ['cat', 'dog']
=> using pre-trained model 'mobilenet_v2'
Traceback (most recent call last):
  File "train.py", line 506, in <module>
    main()
  File "train.py", line 135, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "train.py", line 205, in main_worker
    model = reshape_model(model, args.arch, num_classes)
  File "/jetson-inference/python/training/classification/reshape.py", line 55, in reshape_model
    print("classifier reshaping not supported for " + args.arch)
NameError: name 'args' is not defined

Can anybody please guide me on how to retrain a MobileNet classification network for NVIDIA Jetson devices?
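
Two things seem to be going on in the traceback. The fallback branch of `reshape_model` references `args.arch`, but `args` does not exist inside that function (the call passes `args.arch` as a positional argument), hence the NameError. And reaching that branch at all suggests this version of reshape.py has no case for mobilenet_v2. Below is a minimal sketch of how such a function could look. It is not the repository's actual code, the parameter name `arch` is assumed, and `FINAL_LAYER` and `final_layer_location` are hypothetical names. The layer locations are those of torchvision's ResNet (`fc`) and MobileNetV2 (`classifier[1]`).

```python
# Where torchvision keeps the final linear layer: (attribute, index).
# An index of None means the attribute itself is the layer.
FINAL_LAYER = {
    "resnet": ("fc", None),
    "mobilenet_v2": ("classifier", 1),
}


def final_layer_location(arch):
    """Look up the final layer for an architecture name, or raise."""
    for prefix, location in FINAL_LAYER.items():
        if arch.startswith(prefix):
            return location
    # Use the local `arch` parameter here, not `args.arch`
    raise ValueError("classifier reshaping not supported for " + arch)


def reshape_model(model, arch, num_classes):
    """Replace the final linear layer so it outputs `num_classes` logits."""
    attr, index = final_layer_location(arch)

    # Imported lazily so the lookup above works without PyTorch installed
    import torch.nn as nn

    if index is None:
        old = getattr(model, attr)
        setattr(model, attr, nn.Linear(old.in_features, num_classes))
    else:
        container = getattr(model, attr)
        container[index] = nn.Linear(container[index].in_features, num_classes)
    return model
```

Raising instead of printing is a design choice in this sketch: it stops training early rather than continuing with a 1000-class output layer on a 2-class dataset.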
