dusty-nv / pytorch-classification
Training of image classification models with PyTorch
I am trying to run single-node multi-GPU training, but I get an error. This happens even though NCCL is installed; it is installed under /usr, so I don't think a PATH setting is the problem. Do you know why this happens? The command I run is:
train.py -a resnet50 --dist-url 'tcp://127.0.0.1:9999' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 --model-dir=models/cat_dog data/cat_dog
Hey, the training script works perfectly fine on my server. But when I try to test the trained model on the same server machine (x86-64), it fails. This is the custom script that I wrote for testing:
`
import numpy as np
import torch
import torchvision
from torchvision import datasets, models, transforms
import torch.utils.data as data
import multiprocessing
from sklearn.metrics import confusion_matrix
import torch.nn as nn
import torch.optim as optim
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.utils.data.distributed

EVAL_DIR = "/home/ajithbalakrishnan/vijnalabs/My_Learning/my_workspace/pytorch-image-classification_1/dataset/gps_lock/"
EVAL_MODEL = '/home/ajithbalakrishnan/vijnalabs/My_Learning/my_workspace/pytorch-image-classification_1/checkpoint.pth.tar'

model = torch.load(EVAL_MODEL)
model.eval()

num_cpu = multiprocessing.cpu_count()
bs = 8

eval_transform = transforms.Compose([
    transforms.Resize(size=256),
    transforms.CenterCrop(size=224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])])

eval_dataset = datasets.ImageFolder(root=EVAL_DIR, transform=eval_transform)
eval_loader = data.DataLoader(eval_dataset, batch_size=bs, shuffle=True,
                              num_workers=num_cpu, pin_memory=True)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

num_classes = len(eval_dataset.classes)
dsize = len(eval_dataset)

class_names = ["baseballdiamond", "forest", "golfcourse", "harbor", "overpass", "river", "storagetanks"]

predlist = torch.zeros(0, dtype=torch.long, device='cpu')
lbllist = torch.zeros(0, dtype=torch.long, device='cpu')

correct = 0
total = 0
with torch.no_grad():
    for images, labels in eval_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        predlist = torch.cat([predlist, predicted.view(-1).cpu()])
        lbllist = torch.cat([lbllist, labels.view(-1).cpu()])

overall_accuracy = 100 * correct / total
print('Accuracy of the network on the {:d} test images: {:.2f}%'.format(dsize,
      overall_accuracy))

conf_mat = confusion_matrix(lbllist.numpy(), predlist.numpy())
print('Confusion Matrix')
print('-' * 16)
print(conf_mat, '\n')

class_accuracy = 100 * conf_mat.diagonal() / conf_mat.sum(1)
print('Per class accuracy')
print('-' * 18)
for label, accuracy in zip(eval_dataset.classes, class_accuracy):
    class_name = class_names[int(label)]
    print('Accuracy of class %8s : %0.2f %%' % (class_name, accuracy))
`
Please help me here.
Error:
`
Traceback (most recent call last):
  File "eval.py", line 30, in <module>
    model.eval()
AttributeError: 'dict' object has no attribute 'eval'
`
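A likely cause of the error above: `torch.load` returned the checkpoint dictionary that the training script saved, not a model object, so there is no `eval()` method to call. A minimal sketch of one way to rebuild the model is below. It assumes the checkpoint has `'arch'` and `'state_dict'` keys (as in the PyTorch ImageNet example that this kind of training script is usually based on), and that the network is a ResNet-style torchvision model whose final layer is sized by `num_classes`. The names `strip_module_prefix` and `load_classifier` are hypothetical helpers, not part of the repository.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict):
    """Drop a leading 'module.' that nn.DataParallel / DDP adds to parameter names."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len("module."):] if key.startswith("module.") else key
        cleaned[new_key] = value
    return cleaned


def load_classifier(checkpoint_path, num_classes, device="cpu"):
    """Rebuild a torchvision model from a training checkpoint (hypothetical helper)."""
    # Imported here so the pure-Python helper above works without PyTorch installed.
    import torch
    import torchvision.models as models

    checkpoint = torch.load(checkpoint_path, map_location=device)

    # Assumption: the checkpoint is a dict with 'arch' and 'state_dict' entries.
    arch = checkpoint["arch"]
    model = models.__dict__[arch](num_classes=num_classes)
    model.load_state_dict(strip_module_prefix(checkpoint["state_dict"]))

    # Move to the target device and switch to inference mode.
    return model.to(device).eval()
```

In the script above, `model = torch.load(EVAL_MODEL)` would then become something like `model = load_classifier(EVAL_MODEL, num_classes, device)`, placed after `num_classes` and `device` are defined. If the key names differ in your checkpoint, printing `checkpoint.keys()` shows what was actually saved.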
Hi Dusty, I'm using a Jetson Nano 2GB with the classification pipeline. I struggled with the accuracy and confidence level for quite some time. I'm trying to classify 3 classes, and most of the time the prediction was wrong, or occasionally right but with low confidence. I knew that it was NOT related to training, because after training it shows Acc@1 97.xx.
I once suspected that the model conversion was the issue, but there was almost nothing I could do about that.
In the end I reckoned that the preprocessing of the inference data and the training data might be different, so I tried resizing the image and cropping its center before feeding it to the network, and things improved DRASTICALLY!
This is what I changed in imagenet.py:
...
# process frames until the user exits
while True:
    # capture the next image
    img_input = input.Capture()

    img_intermediate = jetson.utils.cudaAllocMapped(width=img_input.width/img_input.height*224,
                                                    height=224,
                                                    format=img_input.format)

    # rescale the image (the dimensions are taken from the image capsules)
    jetson.utils.cudaResize(img_input, img_intermediate)

    crop_roi = ((img_intermediate.width - 224)/2, 0, 224 + (img_intermediate.width - 224)/2, 224)
    img = jetson.utils.cudaAllocMapped(width=224,
                                       height=224,
                                       format=img_intermediate.format)
    jetson.utils.cudaCrop(img_intermediate, img, crop_roi)

    # classify the image
    class_id, confidence = net.Classify(img)
...
I'm not sure why it worked so well in your S3E3 video while mine needed this tweak, and I'd like to know. Was it because of different versions of the code, or different machines (Jetson Nano vs something else)?
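The resize-then-center-crop arithmetic in the snippet above can be checked on its own, without a camera or the Jetson libraries. The sketch below is only an illustration: `center_crop_plan` is a hypothetical helper, it scales the shorter side to the target size (so it also covers portrait input, which the snippet above does not), and it uses integer pixel values throughout.

```python
def center_crop_plan(width, height, size=224):
    """Return ((new_w, new_h), (left, top, right, bottom)) for a center crop.

    The shorter image side is scaled to `size`, keeping the aspect ratio,
    then a size x size square is taken from the middle of the result.
    """
    short = min(width, height)
    # Scale both sides by size / short, rounding to whole pixels.
    new_w = max(size, round(width * size / short))
    new_h = max(size, round(height * size / short))

    # Center the square crop inside the resized image.
    left = (new_w - size) // 2
    top = (new_h - size) // 2
    return (new_w, new_h), (left, top, left + size, top + size)


if __name__ == "__main__":
    resized, roi = center_crop_plan(1280, 720)
    print("resize to", resized, "then crop", roi)
```

For a 1280x720 frame this gives an intermediate image of 398x224 and a crop ROI of (87, 0, 311, 224), which matches the landscape case handled in the modified imagenet.py.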
Hello,
I am trying to retrain a classification model (mobilenet_v1 or mobilenet_v2), following the instructions at:
https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-cat-dog.md
The help output of train.py lists the supported architectures as below:
  --model-dir MODEL_DIR
                        path to desired output directory for saving model
                        checkpoints (default: current directory)
  -a ARCH, --arch ARCH  model architecture: alexnet | densenet121 |
                        densenet161 | densenet169 | densenet201 | googlenet |
                        inception_v3 | mnasnet0_5 | mnasnet0_75 | mnasnet1_0 |
                        mnasnet1_3 | mobilenet_v2 | resnet101 | resnet152 |
                        resnet18 | resnet34 | resnet50 | resnext101_32x8d |
                        resnext50_32x4d | shufflenet_v2_x0_5 |
                        shufflenet_v2_x1_0 | shufflenet_v2_x1_5 |
                        shufflenet_v2_x2_0 | squeezenet1_0 | squeezenet1_1 |
                        vgg11 | vgg11_bn | vgg13 | vgg13_bn | vgg16 | vgg16_bn
                        | vgg19 | vgg19_bn | wide_resnet101_2 |
                        wide_resnet50_2 (default: resnet18)
We can see that mobilenet_v2 is listed as supported. But when I execute the command below:
python3 train.py -a mobilenet_v2 --model-dir=models/cat_dog data/cat_dog
I get the error below:
Use GPU: 0 for training
=> dataset classes: 2 ['cat', 'dog']
=> using pre-trained model 'mobilenet_v2'
Traceback (most recent call last):
  File "train.py", line 506, in <module>
    main()
  File "train.py", line 135, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "train.py", line 205, in main_worker
    model = reshape_model(model, args.arch, num_classes)
  File "/jetson-inference/python/training/classification/reshape.py", line 55, in reshape_model
    print("classifier reshaping not supported for " + args.arch)
NameError: name 'args' is not defined
Can anybody please guide me on how to retrain a MobileNet classification network for NVIDIA Jetson devices?
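Reading the traceback, two things seem to be going on: `reshape_model` has no reshaping branch for `mobilenet_v2`, and its fallback message refers to `args`, which is not defined inside that function, so a `NameError` is raised instead of the intended message. The sketch below reproduces that pattern and shows one possible fix. The parameter name `arch` and both function names are assumptions made for illustration, since the actual body of reshape.py is not shown here. The `mobilenet_v2` branch relies on torchvision's MobileNetV2 keeping its final `Linear` layer at `classifier[1]`.

```python
def reshape_model_buggy(model, arch, num_classes):
    # Mirrors the failing line from the traceback: 'args' is not a parameter
    # of this function, so the lookup raises NameError.
    print("classifier reshaping not supported for " + args.arch)
    return model


def reshape_model_fixed(model, arch, num_classes):
    """Hypothetical fix: handle mobilenet_v2 and use the local 'arch' name."""
    if arch == "mobilenet_v2":
        # Imported here so the fallback path runs without PyTorch installed.
        import torch.nn as nn

        # torchvision's MobileNetV2 classifier is Sequential(Dropout, Linear).
        in_features = model.classifier[1].in_features
        model.classifier[1] = nn.Linear(in_features, num_classes)
        print("=> reshaped mobilenet_v2 classifier to %d classes" % num_classes)
    else:
        print("classifier reshaping not supported for " + arch)
    return model
```

This only addresses the reshaping step during training. Whether the rest of the pipeline (ONNX export and TensorRT inference on the Jetson) handles MobileNetV2 is a separate question that this sketch does not answer.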