
xview2_solution's Introduction

xView2 2nd place solution

Data preparation

Download and extract the dataset from https://xview2.org/dataset and place it in the data directory (or any other directory).

Generate Masks

Run python generate_polygons.py --input data/train to generate pixel masks from the JSON label files.
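
For readers who want to see what the mask generation roughly involves, here is a minimal sketch (not the repository's actual generate_polygons.py) that rasterizes the WKT building polygons from an xBD-style label JSON into a pixel mask; shapely and OpenCV are assumed to be available.

import json

import cv2
import numpy as np
from shapely import wkt

def polygons_to_mask(label_json_path, height=1024, width=1024):
    # Hypothetical sketch only; the repository's generate_polygons.py may differ.
    # xBD label files store building footprints as WKT polygons under features/xy.
    mask = np.zeros((height, width), dtype=np.uint8)
    with open(label_json_path) as f:
        label = json.load(f)
    for feature in label["features"]["xy"]:
        polygon = wkt.loads(feature["wkt"])
        coords = np.array(list(polygon.exterior.coords)).round().astype(np.int32)
        cv2.fillPoly(mask, [coords], 255)  # binary mask: 255 = building footprint
    return mask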

Training

The Dockerfile contains all the required libraries.

Most of the training hyperparameters are defined in JSON files; see the configs directory.

Other parameters are passed directly to the training scripts.

The localization and classification networks are trained separately:

  • train_localization.py - trains the binary segmentation (localization) models. By default the O0 opt level (pure FP32) is used for Apex because the loss was unstable in mixed precision.
  • train.py - trains the classification models. By default the O1 opt level (mixed precision) is used for Apex, as the multiclass FocalLossWithDice loss is stable in mixed precision (see the Apex sketch after this list).
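
As a concrete illustration of the two opt levels, a minimal Apex setup looks roughly like the following. This is a sketch assuming the standard apex.amp API (Apex is installed via the Dockerfile); the toy model and loss are placeholders, not the repository's training loop.

import torch
from apex import amp

# Toy stand-ins for a real segmentation network and batch.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# O0 = pure FP32 (used for localization, where the loss was unstable);
# O1 = mixed precision (used for classification with FocalLossWithDice).
model, optimizer = amp.initialize(model, optimizer, opt_level="O0")

images = torch.randn(2, 3, 256, 256, device="cuda")
masks = torch.rand(2, 1, 256, 256, device="cuda")
loss = torch.nn.functional.binary_cross_entropy_with_logits(model(images), masks)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()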

Architectures

For the localization network, an ordinary U-Net-like architecture was used with pretrained DPN92 and DenseNet161 encoders (see models/unet.py for the U-Net zoo).

[Localization network architecture diagram]

For classification, a Siamese U-Net with shared encoder weights was used (see models/siamese_unet.py for the Siamese U-Net zoo).

[Siamese U-Net architecture diagram]
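
To make the shared-encoder idea concrete, here is a minimal hypothetical sketch of a Siamese segmentation model (not the code in models/siamese_unet.py): the pre- and post-disaster images are passed through the same encoder, and the two feature maps are concatenated before decoding.

import torch
import torch.nn as nn

class SiameseSegmentation(nn.Module):
    # Minimal sketch of a Siamese model with a shared encoder; the real
    # implementation lives in models/siamese_unet.py.
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # one set of weights, applied to both images
        self.decoder = decoder  # consumes the concatenated features

    def forward(self, pre_image, post_image):
        pre_features = self.encoder(pre_image)
        post_features = self.encoder(post_image)  # same weights as above
        fused = torch.cat([pre_features, post_features], dim=1)
        return self.decoder(fused)

# Example with toy encoder/decoder modules:
model = SiameseSegmentation(
    encoder=nn.Conv2d(3, 16, kernel_size=3, padding=1),
    decoder=nn.Conv2d(32, 5, kernel_size=3, padding=1),  # 5 classes incl. background
)
out = model(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))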

xview2_solution's People

Contributors

selimsef


xview2_solution's Issues

Multi-GPU training fails in train_localization.py

Unlike multi-GPU training with "train.py", localization training with "train_localization.py" gives me the following issue when I try to use multiple GPUs. With a single GPU it works, but I run out of memory on only one GeForce RTX 2080 Ti GPU.

CUDA_VISIBLE_DEVICES=3,4 python train_localization.py --folds-csv folds.csv --config configs/se50_loc.json --logdir logs --predictions predictions --data-dir /datasets/xView2/train_tier3_combined --gpu 3,4 --output-dir weights
bottleneck  1280 256
bottleneck  704 192
bottleneck  384 128
bottleneck  128 64
Selected optimization level O0:  Pure FP32 training.

Defaults for this optimization level are:
enabled                : True
opt_level              : O0
cast_model_type        : torch.float32
patch_torch_functions  : False
keep_batchnorm_fp32    : None
master_weights         : False
loss_scale             : 1.0
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O0
cast_model_type        : torch.float32
patch_torch_functions  : False
keep_batchnorm_fp32    : None
master_weights         : False
loss_scale             : dynamic
Freezing encoder!!!
  0%|                                                                                                                                                                              | 0/1266 [00:09<?, ?it/s]
Traceback (most recent call last):
  File "train_localization.py", line 302, in <module>
    main()
  File "train_localization.py", line 191, in main
    args.local_rank)
  File "train_localization.py", line 268, in train_epoch
    out_mask = model(imgs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/space/export/data/azim_se/xView2_second_place/models/unet.py", line 390, in forward
    x = stage(x)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
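
For context, the traceback goes through torch.nn.DataParallel, whose standard usage keeps the wrapped model and the input batch on the default device (cuda:0) and lets DataParallel scatter the batch across the visible GPUs. A minimal sketch of that pattern is shown below; it is for orientation only, not a confirmed fix for this issue.

import torch
import torch.nn as nn

# Standard nn.DataParallel pattern: model and inputs start on cuda:0,
# DataParallel replicates the module and scatters the batch each forward pass.
model = nn.DataParallel(nn.Conv2d(3, 1, kernel_size=3, padding=1).cuda())
imgs = torch.randn(4, 3, 256, 256).cuda()
out_mask = model(imgs)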

`mask_loss` is missing in the softmax.json configs

Hi, @selimsef
I want to replicate your training for classification. From what I understood, the x_softmax.json files are used with train.py. However, train.py expects a mask_loss entry in the configuration, but it is missing from all the softmax config files. Am I missing something here?

Weights

Would you mind publishing trained weights?

XViewSingleDataset KeyError: 'nondamage'

Hello selimsef,
Thank you for sharing your work.
I was trying to run your code on my computer, but got a KeyError in XViewSingleDataset's __init__ function at this line:
nondamage = df[(df['fold'] != fold) & (df['nondamage'] == True)]['id'].tolist()
Error output:

Traceback (most recent call last):
  File "train.py", line 357, in <module>
    main()
  File "train.py", line 123, in main
    normalize=conf["input"].get("normalize", None))
  File "/opt/share1003/public/yaomf/codes/xview2-2/dataset/xview_dataset.py", line 28, in __init__
    nondamage = df[(df['fold'] != fold) & (df['nondamage'] == True)]['id'].tolist()
  File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 1969, in __getitem__
    return self._getitem_column(key)
  File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 1976, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/lib/python2.7/dist-packages/pandas/core/generic.py", line 1091, in _get_item_cache
    values = self._data.get(item)
  File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 3211, in get
    loc = self.items.get_loc(item)
  File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)
  File "pandas/index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)
  File "pandas/hashtable.pyx", line 668, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)
  File "pandas/hashtable.pyx", line 676, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)
KeyError: 'nondamage'

I found that df is loaded from a CSV file, and the CSV file included in your code has only two columns, id and fold, without 'nondamage'.

Do I need to generate a CSV file in another format in order to run your code on the xView2 dataset?
I have downloaded the dataset and generated the masks using your script.

Thank you!
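
One way to get past the KeyError while experimenting (a workaround sketch only, not the author's intended preprocessing; how the nondamage flag should actually be computed is not documented in this thread) is to add the missing boolean column to the folds CSV before training:

import pandas as pd

df = pd.read_csv("folds.csv")  # the bundled CSV has only 'id' and 'fold'
if "nondamage" not in df.columns:
    # Placeholder value that marks no tile as damage-free; the proper values
    # would flag tiles whose masks contain no damaged buildings.
    df["nondamage"] = False
df.to_csv("folds.csv", index=False)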

OSError: [Errno 24] Too many open files - Issue with tqdm in the validation phase

Hi,

During the validation step, I am facing the following error, which is related to tqdm. As I am training on both the train and tier3 sets, it seems a parameter in the tqdm library is overflowing.

Traceback (most recent call last):
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 319, in reduce_storage
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/reduction.py", line 191, in DupFd
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/reduction.py", line 176, in send_handle
    with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/socket.py", line 460, in fromfd
    nfd = dup(fd)
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
  File "train_localization.py", line 301, in <module>
    main()
  File "train_localization.py", line 206, in main
    predictions_dir=preds_dir)
  File "train_localization.py", line 214, in evaluate_val
    dice = validate(model, data_loader=data_val, predictions_dir=predictions_dir)
  File "train_localization.py", line 241, in validate
    for sample in tqdm(data_loader):
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/tqdm/std.py", line 1127, in __iter__
    for obj in iterable:
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 576, in __next__
    idx, batch = self._get_batch()
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 553, in _get_batch
    success, data = self._try_get_batch()
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 276, in rebuild_storage_fd
    fd = df.detach()
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/majid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError

I tried the following to solve it:

1. I increased ulimit -n from 1024 to 4096.
2. In train.py and train_localization.py, in the function def validate(net, data_loader, predictions_dir), I replaced

for sample in tqdm(data_loader):

with

with tqdm(data_loader) as samples:
    for i, sample in enumerate(samples):

Neither of these solved the problem.

I was wondering whether you would have a solution for this. I am still looking around for a solution.
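
A workaround that is commonly suggested for this error in multi-worker PyTorch data loading (not confirmed by the repository author) is to switch the tensor sharing strategy from file descriptors to the file system, for example near the top of train_localization.py:

import torch.multiprocessing as mp

# Commonly suggested workaround for "OSError: [Errno 24] Too many open files"
# when DataLoader workers pass many tensors between processes via file
# descriptors. Not confirmed as the fix for this repository.
mp.set_sharing_strategy("file_system")

Reducing the DataLoader's num_workers also lowers the number of open file descriptors.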
