Recently, I have been trying to reproduce your results on the VOT2016 benchmark. Using the provided result files, the computed EAO is 0.2905.
Below is the evaluation script I used. Is there anything I am missing?
function sfc_vot
% SFC_VOT  VOT-toolkit entry point for the SiamFC (fully-convolutional Siamese) tracker.
%
% Integrates SiamFC with the VOT evaluation protocol: reads the
% initialization region from the toolkit, computes exemplar features once
% on the first frame, then for every subsequent frame evaluates a 3-scale
% search pyramid and reports the estimated bounding box back through the
% VOT handle.
%
% NOTE(review): requires the SiamFC project functions on the MATLAB path
% (startup, env_paths_tracking, load_pretrained, remove_layers_from_prefix,
% get_subwindow_tracking, make_scale_pyramid, tracker_eval, vot) and a
% GPU-enabled MatConvNet build (p.gpus = 1) — confirm for your setup.

% *************************************************************
% VOT: Always call exit command at the end to terminate Matlab!
% *************************************************************
cleanup = onCleanup(@() exit());
% *************************************************************
% VOT: Set random seed to a different value every time.
% *************************************************************
RandStream.setGlobalStream(RandStream('mt19937ar', 'Seed', sum(clock)));
% *************************************************************
% SFC: Set tracking parameters
% *************************************************************
p.numScale = 3;            % number of scales searched per frame
p.scaleStep = 1.0375;      % geometric step between adjacent scales
p.scalePenalty = 0.9745;   % penalty applied to non-central scales
p.scaleLR = 0.59;          % damping factor for scale update
p.responseUp = 16;         % upsampling the small 17x17 response helps with the accuracy
p.windowing = 'cosine';    % to penalize large displacements
p.wInfluence = 0.176;      % windowing influence (in convex sum)
p.net = '2016-08-17.net.mat';
%% execution, visualization, benchmark
p.gpus = 1;
p.fout = -1;
%% Params from the network architecture, have to be consistent with the training
p.exemplarSize = 127;      % input z size
p.instanceSize = 255;      % input x size (search region)
p.scoreSize = 17;          % raw score map side before upsampling
p.totalStride = 8;
p.contextAmount = 0.5;     % context amount for the exemplar
p.subMean = false;
%% SiamFC prefix and ids
p.prefix_z = 'a_';         % used to identify the layers of the exemplar
p.prefix_x = 'b_';         % used to identify the layers of the instance
p.prefix_join = 'xcorr';
p.prefix_adj = 'adjust';
p.id_feat_z = 'a_feat';
p.id_score = 'score';
% -------------------------------------------------------------------------------------------------
startup;
% Get environment-specific default paths.
p = env_paths_tracking(p);
% Load ImageNet Video statistics (used for optional mean subtraction / crop
% normalization); tracking proceeds without them if the file is absent.
if exist(p.stats_path, 'file')
    stats = load(p.stats_path);
else
    warning('No stats found at %s', p.stats_path);
    stats = [];
end
% Load two copies of the pre-trained network
net_z = load_pretrained([p.net_base_path p.net], p.gpus);
net_x = load_pretrained([p.net_base_path p.net], []);
% Divide the net in 2:
% exemplar branch (used only once per video) computes features for the target
remove_layers_from_prefix(net_z, p.prefix_x);
remove_layers_from_prefix(net_z, p.prefix_join);
remove_layers_from_prefix(net_z, p.prefix_adj);
% instance branch computes features for search region x and cross-correlates with z features
remove_layers_from_prefix(net_x, p.prefix_z);
zFeatId = net_z.getVarIndex(p.id_feat_z);
scoreId = net_x.getVarIndex(p.id_score);
% **********************************
% VOT: Get initialization data
% **********************************
[handle, first_image, region] = vot('rectangle');
% If the provided region is a polygon, convert it to its axis-aligned
% bounding box [x, y, w, h]; otherwise just round the rectangle so that
% width/height stay consistent with the rounded corners.
if numel(region) > 4
    x1 = round(min(region(1:2:end)));
    x2 = round(max(region(1:2:end)));
    y1 = round(min(region(2:2:end)));
    y2 = round(max(region(2:2:end)));
    region = round([x1, y1, x2 - x1, y2 - y1]);
else
    region = round([round(region(1)), round(region(2)), ...
        round(region(1) + region(3)) - round(region(1)), ...
        round(region(2) + region(4)) - round(region(2))]);
end
% BUG FIX: the assignment was missing its semicolon, echoing the region to
% stdout on every run — unintended output that can interfere with the
% VOT/TraX protocol stream and pollutes logs.
irect = region;
% Internal state is [row, col] (y, x); the +1 accounts for MATLAB 1-indexing.
targetPosition = [irect(2) + (1 + irect(4)) / 2, irect(1) + (1 + irect(3)) / 2];
targetSize = [irect(4), irect(3)];
startFrame = 1;
% get the first frame of the video
im = gpuArray(single(imread(first_image)));
% if grayscale repeat one channel to match filters size
if size(im, 3) == 1
    im = repmat(im, [1 1 3]);
end
% per-channel average, used to pad crops that fall outside the image
avgChans = gather([mean(mean(im(:,:,1))) mean(mean(im(:,:,2))) mean(mean(im(:,:,3)))]);
% Exemplar region: target plus context, mapped to a square of side s_z.
wc_z = targetSize(2) + p.contextAmount*sum(targetSize);
hc_z = targetSize(1) + p.contextAmount*sum(targetSize);
s_z = sqrt(wc_z*hc_z);
scale_z = p.exemplarSize / s_z;
% initialize the exemplar
[z_crop, ~] = get_subwindow_tracking(im, targetPosition, [p.exemplarSize p.exemplarSize], [round(s_z) round(s_z)], avgChans);
if p.subMean
    z_crop = bsxfun(@minus, z_crop, reshape(stats.z.rgbMean, [1 1 3]));
end
% Search-region side in original image coordinates.
d_search = (p.instanceSize - p.exemplarSize)/2;
pad = d_search/scale_z;
s_x = s_z + 2*pad;
% arbitrary scale saturation
min_s_x = 0.2*s_x;
max_s_x = 5*s_x;
switch p.windowing
    case 'cosine'
        window = single(hann(p.scoreSize*p.responseUp) * hann(p.scoreSize*p.responseUp)');
    case 'uniform'
        window = single(ones(p.scoreSize*p.responseUp, p.scoreSize*p.responseUp));
end
% make the window sum 1
window = window / sum(window(:));
% Symmetric scale factors around 1, e.g. [1/step, 1, step] for numScale = 3.
scales = (p.scaleStep .^ ((ceil(p.numScale/2)-p.numScale) : floor(p.numScale/2)));
% evaluate the offline-trained network for exemplar z features (done once)
net_z.eval({'exemplar', z_crop});
z_features = net_z.vars(zFeatId).value;
z_features = repmat(z_features, [1 1 1 p.numScale]);
% start tracking
i = startFrame;
while true
    % **********************************
    % VOT: Get next frame
    % **********************************
    [handle, image] = handle.frame(handle);
    if isempty(image)
        break;
    end
    if i > startFrame
        % load new frame on GPU
        im = gpuArray(single(imread(image)));
        % if grayscale repeat one channel to match filters size
        if size(im, 3) == 1
            im = repmat(im, [1 1 3]);
        end
        scaledInstance = s_x .* scales;
        scaledTarget = [targetSize(1) .* scales; targetSize(2) .* scales];
        % extract scaled crops for search region x at previous target position
        x_crops = make_scale_pyramid(im, targetPosition, scaledInstance, p.instanceSize, avgChans, stats, p);
        % evaluate the offline-trained network for instance x features
        [newTargetPosition, newScale] = tracker_eval(net_x, round(s_x), scoreId, z_features, x_crops, targetPosition, window, p);
        targetPosition = gather(newTargetPosition);
        % scale damping and saturation
        s_x = max(min_s_x, min(max_s_x, (1-p.scaleLR)*s_x + p.scaleLR*scaledInstance(newScale)));
        targetSize = (1-p.scaleLR)*targetSize + p.scaleLR*[scaledTarget(1,newScale) scaledTarget(2,newScale)];
    else
        % at the first frame output position and size passed as input (ground truth)
    end
    i = i + 1;
    % Convert internal [y, x] center + [h, w] size to VOT [x, y, w, h].
    oTargetPosition = targetPosition; % .* frameSize ./ newFrameSize;
    oTargetSize = targetSize;         % .* frameSize ./ newFrameSize;
    region = [oTargetPosition([2,1]) - oTargetSize([2,1])/2, oTargetSize([2,1])];
    % **********************************
    % VOT: Report position for frame
    % **********************************
    handle = handle.report(handle, region);
end
% **********************************
% VOT: Output the results
% **********************************
handle.quit(handle);
end