
xview2_solution's Introduction

xView2 2nd place solution

Data preparation

Download and extract the dataset from https://xview2.org/dataset and place it in the data directory (or any other directory).

Generate Masks

Run python generate_polygons.py --input data/train to generate pixel masks from the JSON label files.
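
For readers who want to see what the mask generation roughly involves, here is a minimal sketch (not the repository's actual generate_polygons.py) that rasterizes the WKT building polygons from an xBD-style label JSON into a pixel mask; shapely and OpenCV are assumed to be available.

import json

import cv2
import numpy as np
from shapely import wkt

def polygons_to_mask(label_json_path, height=1024, width=1024):
    # Hypothetical sketch only; the repository's generate_polygons.py may differ.
    # xBD label files store building footprints as WKT polygons under features/xy.
    mask = np.zeros((height, width), dtype=np.uint8)
    with open(label_json_path) as f:
        label = json.load(f)
    for feature in label["features"]["xy"]:
        polygon = wkt.loads(feature["wkt"])
        coords = np.array(list(polygon.exterior.coords)).round().astype(np.int32)
        cv2.fillPoly(mask, [coords], 255)  # binary mask: 255 = building footprint
    return mask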

Training

The Dockerfile contains all the required libraries.

Most of the training hyperparameters are defined in JSON files; see the configs directory.

Other parameters are passed directly to the training scripts.

The localization and classification networks are trained separately:

  • train_localization.py - trains the binary segmentation (localization) models. By default the O0 opt level (pure FP32) is used for Apex because the loss was unstable in mixed precision.
  • train.py - trains the classification models. By default the O1 opt level (mixed precision) is used for Apex, as the multiclass FocalLossWithDice loss is stable in mixed precision (see the Apex sketch after this list).
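
As a concrete illustration of the two opt levels, a minimal Apex setup looks roughly like the following. This is a sketch assuming the standard apex.amp API (Apex is installed via the Dockerfile); the toy model and loss are placeholders, not the repository's training loop.

import torch
from apex import amp

# Toy stand-ins for a real segmentation network and batch.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# O0 = pure FP32 (used for localization, where the loss was unstable);
# O1 = mixed precision (used for classification with FocalLossWithDice).
model, optimizer = amp.initialize(model, optimizer, opt_level="O0")

images = torch.randn(2, 3, 256, 256, device="cuda")
masks = torch.rand(2, 1, 256, 256, device="cuda")
loss = torch.nn.functional.binary_cross_entropy_with_logits(model(images), masks)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()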

Architectures

For the localization network, an ordinary U-Net-like architecture was used with pretrained DPN92 and DenseNet161 encoders (see models/unet.py for the U-Net zoo).

[Localization network architecture diagram]

For classification, a Siamese U-Net with shared encoder weights was used (see models/siamese_unet.py for the Siamese U-Net zoo).

[Siamese U-Net architecture diagram]
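
To make the shared-encoder idea concrete, here is a minimal hypothetical sketch of a Siamese segmentation model (not the code in models/siamese_unet.py): the pre- and post-disaster images are passed through the same encoder, and the two feature maps are concatenated before decoding.

import torch
import torch.nn as nn

class SiameseSegmentation(nn.Module):
    # Minimal sketch of a Siamese model with a shared encoder; the real
    # implementation lives in models/siamese_unet.py.
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # one set of weights, applied to both images
        self.decoder = decoder  # consumes the concatenated features

    def forward(self, pre_image, post_image):
        pre_features = self.encoder(pre_image)
        post_features = self.encoder(post_image)  # same weights as above
        fused = torch.cat([pre_features, post_features], dim=1)
        return self.decoder(fused)

# Example with toy encoder/decoder modules:
model = SiameseSegmentation(
    encoder=nn.Conv2d(3, 16, kernel_size=3, padding=1),
    decoder=nn.Conv2d(32, 5, kernel_size=3, padding=1),  # 5 classes incl. background
)
out = model(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))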

xview2_solution's People

Contributors

selimsef


xview2_solution's Issues

Multi-GPU training fails in train_localization.py

Unlike multi-GPU training with "train.py", localization training with "train_localization.py" gives me the following issue when I try to use multiple GPUs. With a single GPU it works, but I run out of memory on only one GeForce RTX 2080 Ti GPU.

CUDA_VISIBLE_DEVICES=3,4 python train_localization.py --folds-csv folds.csv --config configs/se50_loc.json --logdir logs --predictions predictions --data-dir /datasets/xView2/train_tier3_combined --gpu 3,4 --output-dir weights
bottleneck  1280 256
bottleneck  704 192
bottleneck  384 128
bottleneck  128 64
Selected optimization level O0:  Pure FP32 training.

Defaults for this optimization level are:
enabled                : True
opt_level              : O0
cast_model_type        : torch.float32
patch_torch_functions  : False
keep_batchnorm_fp32    : None
master_weights         : False
loss_scale             : 1.0
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O0
cast_model_type        : torch.float32
patch_torch_functions  : False
keep_batchnorm_fp32    : None
master_weights         : False
loss_scale             : dynamic
Freezing encoder!!!
  0%|                                                                                                                                                                              | 0/1266 [00:09<?, ?it/s]
Traceback (most recent call last):
  File "train_localization.py", line 302, in <module>
    main()
  File "train_localization.py", line 191, in main
    args.local_rank)
  File "train_localization.py", line 268, in train_epoch
    out_mask = model(imgs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/space/export/data/azim_se/xView2_second_place/models/unet.py", line 390, in forward
    x = stage(x)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
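
For context, the traceback goes through torch.nn.DataParallel, whose standard usage keeps the wrapped model and the input batch on the default device (cuda:0) and lets DataParallel scatter the batch across the visible GPUs. A minimal sketch of that pattern is shown below; it is for orientation only, not a confirmed fix for this issue.

import torch
import torch.nn as nn

# Standard nn.DataParallel pattern: model and inputs start on cuda:0,
# DataParallel replicates the module and scatters the batch each forward pass.
model = nn.DataParallel(nn.Conv2d(3, 1, kernel_size=3, padding=1).cuda())
imgs = torch.randn(4, 3, 256, 256).cuda()
out_mask = model(imgs)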

`mask_loss` is missing in the softmax.json configs

Hi, @selimsef
I want to replicate your training for classification. From what I understood, the x_softmax.json files are used with train.py. However, train.py expects a mask_loss entry in the configuration, but it is missing from all the softmax config files. Am I missing something here?

Weights

Would you mind publishing trained weights?

XViewSingleDataset KeyError: 'nondamage'

Hello selimsef,
Thank you for sharing your work.
I was trying to run your code on my computer, but got a KeyError in XViewSingleDataset's __init__ function at this line:
nondamage = df[(df['fold'] != fold) & (df['nondamage'] == True)]['id'].tolist()
Error output:

Traceback (most recent call last):
  File "train.py", line 357, in <module>
    main()
  File "train.py", line 123, in main
    normalize=conf["input"].get("normalize", None))
  File "/opt/share1003/public/yaomf/codes/xview2-2/dataset/xview_dataset.py", line 28, in __init__
    nondamage = df[(df['fold'] != fold) & (df['nondamage'] == True)]['id'].tolist()
  File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 1969, in __getitem__
    return self._getitem_column(key)
  File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 1976, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/lib/python2.7/dist-packages/pandas/core/generic.py", line 1091, in _get_item_cache
    values = self._data.get(item)
  File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 3211, in get
    loc = self.items.get_loc(item)
  File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)
  File "pandas/index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)
  File "pandas/hashtable.pyx", line 668, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)
  File "pandas/hashtable.pyx", line 676, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)
KeyError: 'nondamage'

I found that df is loaded from a CSV file, and the CSV file included in your code has only two columns, id and fold, without 'nondamage'.

Do I need to generate a CSV file in another format in order to run your code on the xView2 dataset?
I have downloaded the dataset and generated the masks using your script.

Thank you!
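
One way to get past the KeyError while experimenting (a workaround sketch only, not the author's intended preprocessing; how the nondamage flag should actually be computed is not documented in this thread) is to add the missing boolean column to the folds CSV before training:

import pandas as pd

df = pd.read_csv("folds.csv")  # the bundled CSV has only 'id' and 'fold'
if "nondamage" not in df.columns:
    # Placeholder value that marks no tile as damage-free; the proper values
    # would flag tiles whose masks contain no damaged buildings.
    df["nondamage"] = False
df.to_csv("folds.csv", index=False)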

OSError: [Errno 24] Too many open files - Issue with tqdm in the validation phase

Hi,

During the validation step, I am facing the following error, which is related to tqdm. As I am training on both the train and tier3 sets, it seems a parameter in the tqdm library is overflowing.

Traceback (most recent call last):
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 319, in reduce_storage
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/reduction.py", line 191, in DupFd
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/reduction.py", line 176, in send_handle
    with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/socket.py", line 460, in fromfd
    nfd = dup(fd)
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
  File "train_localization.py", line 301, in <module>
    main()
  File "train_localization.py", line 206, in main
    predictions_dir=preds_dir)
  File "train_localization.py", line 214, in evaluate_val
    dice = validate(model, data_loader=data_val, predictions_dir=predictions_dir)
  File "train_localization.py", line 241, in validate
    for sample in tqdm(data_loader):
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/tqdm/std.py", line 1127, in __iter__
    for obj in iterable:
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 576, in __next__
    idx, batch = self._get_batch()
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 553, in _get_batch
    success, data = self._try_get_batch()
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 276, in rebuild_storage_fd
    fd = df.detach()
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError

I tried the following to solve it:

1. I increased ulimit -n from 1024 to 4096.
2. In train.py and train_localization.py, in the function def validate(net, data_loader, predictions_dir), I replaced

for sample in tqdm(data_loader):

with

with tqdm(data_loader) as samples:
    for i, sample in enumerate(samples):

Neither of these solved the problem.

I was wondering whether you would have a solution for this. I am still looking around for a solution.
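
A workaround that is commonly suggested for this error in multi-worker PyTorch data loading (not confirmed by the repository author) is to switch the tensor sharing strategy from file descriptors to the file system, for example near the top of train_localization.py:

import torch.multiprocessing as mp

# Commonly suggested workaround for "OSError: [Errno 24] Too many open files"
# when DataLoader workers pass many tensors between processes via file
# descriptors. Not confirmed as the fix for this repository.
mp.set_sharing_strategy("file_system")

Reducing the DataLoader's num_workers also lowers the number of open file descriptors.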
