
int-lab-book's Introduction


Project

In contrast with computer vision, biological vision is characterized by an anisotropic sensor (the retina) and by the ability to move the gaze to different locations in the visual scene through ocular saccades. To better understand how the human eye analyzes visual scenes, a bio-inspired artificial vision model was recently proposed by Daucé et al. (2020) [1]. The goal of this master's internship is to compare the results obtained by Daucé et al. with some of the more classical attentional computer vision models, such as the spatial transformer network [2], in which the visual input undergoes a foveal deformation.

Computational graph of a foveated spatial transformer network

  • This module is used in the POLO_ATN network; a minimal sketch of the underlying idea is given below. (figure: foveated spatial transformer module)
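Conceptually, the module replaces the regular sampling grid of a standard spatial transformer with a fixed log-polar grid, so that resolution drops with eccentricity. Below is a minimal PyTorch sketch of such a grid fed to `grid_sample`; the grid construction, the radii, and the function name `logpolar_grid` are illustrative assumptions, and the actual filter bank also carries an orientation (theta) dimension and Gaussian receptive fields.

```python
import math
import torch
import torch.nn.functional as F

def logpolar_grid(n_ecc=6, n_azimuth=16, r_min=0.05, r_max=1.0):
    """Fixed log-polar sampling grid in normalized [-1, 1] coordinates (a sketch)."""
    # Eccentricities spaced logarithmically, azimuths spaced uniformly.
    r = torch.exp(torch.linspace(math.log(r_min), math.log(r_max), n_ecc))
    phi = torch.linspace(0.0, 2.0 * math.pi, n_azimuth + 1)[:-1]
    rr, pp = torch.meshgrid(r, phi, indexing="ij")
    # grid_sample expects a (N, H_out, W_out, 2) grid of (x, y) coordinates.
    grid = torch.stack([rr * torch.cos(pp), rr * torch.sin(pp)], dim=-1)
    return grid.unsqueeze(0)

x = torch.randn(1, 1, 128, 128)                         # a 128x128 input frame
fov = F.grid_sample(x, logpolar_grid(), align_corners=False)
print(fov.shape)                                        # torch.Size([1, 1, 6, 16])
```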

Results

The Generic Spatial Transformer Network vs. the What Pathway [1]

Exploring the 28x28 Noisy MNIST dataset.

Taking a look at a few examples from the dataset:

(figure: samples from the 28x28 noisy MNIST dataset, no shift)

STN_28x28

  • Spatial transformer: 2 convolutional layers in the localization network (ConvNet); grid sampler without downscaling (28x28-pixel output) → (affine transformation) = 6 parameters. A minimal sketch is given after this list.

  • Training for 160 epochs with SGD, learning rate of 0.01 without decay; every 10 epochs, the shift standard deviation is incremented by 1, going from 0 to 15.
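For reference, here is a minimal sketch of such a 28x28 spatial transformer, assuming PyTorch; the layer widths, kernel sizes, and the class name `STN28` are illustrative assumptions rather than the repo's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN28(nn.Module):
    """28x28 spatial transformer: 2-conv localization net, full affine (6 params)."""

    def __init__(self):
        super().__init__()
        # Localization network: two convolutional layers.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        # Regressor for the 6 affine parameters, initialized to the identity.
        self.fc_loc = nn.Sequential(
            nn.Flatten(), nn.Linear(10 * 3 * 3, 32), nn.ReLU(), nn.Linear(32, 6),
        )
        self.fc_loc[-1].weight.data.zero_()
        self.fc_loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.fc_loc(self.loc(x)).view(-1, 2, 3)
        # Grid sampler without downscaling: the output stays at 28x28.
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

stn = STN28()
out = stn(torch.randn(4, 1, 28, 28))
print(out.shape)  # torch.Size([4, 1, 28, 28])
```

In training, this module would sit in front of the what-pathway classifier and be optimized end-to-end with SGD (lr 0.01), while the dataset's shift standard deviation is ramped from 0 to 15 as described above.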

Training statistics:

(figure: training statistics, STN 28x28)

Performance

  • Overall results: central accuracy of 88% and general accuracy of 43%, compared with 84% and 34%, respectively, for the generic what pathway.

Accuracy map comparison with the generic what pathway from the paper, using the same training parameters:

(figures: accuracy map of the spatial transformer network vs. accuracy map of the generic what pathway [1])

A test on a noisy dataset with a shift standard deviation of 7:

(figure: test results, shift standard deviation = 7)

Spatial Transformer Networks vs. the What/Where Pathway [1]

Exploring the 128x128 Noisy MNIST dataset [1].

Taking a look at a few examples:

(figure: samples from the 128x128 noisy, shifted MNIST dataset)

STN_128x128

  • Spatial transformer: 4 convolutional layers in the localization network (ConvNet); grid sampler without downscaling (128x128-pixel output) → (affine transformation) = 6 parameters

Training for 110 epochs with an initial learning rate of 0.01 that decays by a factor of 10 every 30 epochs; every 10 epochs the standard deviation of the eccentricity is increased, and the contrast is varied over the last 20 epochs. A sketch of this schedule is given below.
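A minimal sketch of that schedule using PyTorch's built-in `StepLR` follows; the stand-in model and the dummy loss are placeholders, and only the optimizer, decay factor, and step size come from the description above.

```python
import torch

model = torch.nn.Linear(1, 1)  # stand-in for the 128x128 STN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Learning rate decays by a factor of 10 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(110):
    # Curriculum (not shown here): every 10 epochs the eccentricity std of the
    # dataset is increased; the contrast is varied over the last 20 epochs.
    loss = model(torch.zeros(1, 1)).pow(2).mean()   # dummy training step
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
    if epoch % 30 == 0:
        print(epoch, optimizer.param_groups[0]["lr"])  # 0.01, 0.001, ...
```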

(figure: training statistics, STN 128x128)

After transformation with an STN:

(figure: 128x128 inputs after transformation by the STN)

Performance when the contrast varies between 30% and 70% and the digit is shifted by 40 pixels (the maximum amount):

(figure: performance at varying contrast, STN 128x128)

ATN

  • Spatial transformer: 4 convolutional layers in the localization network (ConvNet); grid sampler with downscaling (28x28-pixel output) → (attention) = 3 parameters. A sketch of this parametrization is given below.
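Below is a minimal sketch of how three attention parameters can drive a downscaling grid sampler, assuming the triplet is an isotropic zoom plus an (x, y) translation; this parametrization and the helper name `attention_to_affine` are illustrative assumptions, not necessarily the repo's exact code.

```python
import torch
import torch.nn.functional as F

def attention_to_affine(params):
    """Map (zoom, tx, ty) to a 2x3 affine matrix: isotropic scale plus
    translation, no rotation or shear (assumed 'attention' parametrization)."""
    theta = torch.zeros(params.size(0), 2, 3, device=params.device)
    theta[:, 0, 0] = params[:, 0]   # zoom on x
    theta[:, 1, 1] = params[:, 0]   # same zoom on y
    theta[:, 0, 2] = params[:, 1]   # x translation
    theta[:, 1, 2] = params[:, 2]   # y translation
    return theta

x = torch.randn(8, 1, 128, 128)     # batch of 128x128 noisy frames
params = torch.rand(8, 3)           # would come from the localization ConvNet
theta = attention_to_affine(params)
# Grid sampler *with* downscaling: the output is a 28x28 glimpse.
grid = F.affine_grid(theta, [8, 1, 28, 28], align_corners=False)
glimpse = F.grid_sample(x, grid, align_corners=False)
print(glimpse.shape)                # torch.Size([8, 1, 28, 28])
```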

Training for 110 epochs with an initial learning rate of 0.01 that decays by half every 10 epochs; every 10 epochs the standard deviation of the eccentricity is increased, and the contrast is varied over the last 20 epochs.

(figure: training statistics, ATN)

After transformation with an ATN (an STN parametrized for attention); the digit is shifted by 40 pixels to check whether the network can catch it:

(figure: 128x128 inputs after transformation by the ATN)

Performance when the contrast is 30% and the digit is shifted by 40 pixels (the maximum amount):

(figure: performance at 30% contrast, ATN)

POLO_ATN

  • Spatial transformer: 2 fully connected layers in the localization network (feed-forward net); grid sampler with downscaling (28x28-pixel output) → (fixed attention) = 2 parameters

Polar-logarithmic compression: the filters were placed on a [theta=8, eccentricity=6, azimuth=16] grid, i.e. 768 dimensions, providing a compression of ~95%; the original what/where model had 2880 filters, with a lower compression rate of ~83%.
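A quick sanity check of these numbers (plain arithmetic, not code from the repo):

```python
# Log-polar filter bank size and compression relative to the 128x128 input.
theta_bins, ecc_bins, azimuth_bins = 8, 6, 16
polo_dims = theta_bins * ecc_bins * azimuth_bins   # 768 dimensions
full_dims = 128 * 128                              # 16384 input pixels

print(1 - polo_dims / full_dims)   # ~0.953 -> ~95% compression (POLO_ATN)
print(1 - 2880 / full_dims)        # ~0.824 -> the ~83% quoted for what/where
```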

(figure: samples from the polar-logarithmic (POLO) transformed dataset)

Training for 110 epochs with an initial learning rate of 0.005 that decays by half every 10 epochs; every 10 epochs the standard deviation of the eccentricity is increased, and the contrast is varied over the last 20 epochs.

(figure: training statistics, POLO_ATN)

After transformation with a POLO_ATN; the digit is shifted by 40 pixels to check whether the network can catch it:

(figure: inputs after transformation by the POLO_ATN)

Benchmark

Accuracy comparison of the spatial transformer networks with the What/Where model, as a function of the contrast and the eccentricity of the digit on the screen.

(figure: benchmark accuracy comparison)

References

[1] Emmanuel Daucé, Pierre Albiges, Laurent U. Perrinet; A dual foveal-peripheral visual processing model implements efficient saccade selection. Journal of Vision 2020;20(8):22.

[2] Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu; Spatial Transformer Networks. arXiv:1506.02025

int-lab-book's People

Contributors

dabane-ghassan, laurentperrinet


int-lab-book's Issues

Adam Vs. SGD

While investigating why the transformer applies roughly the same transformation to all of the data, I found that changing the optimizer yields different information at the output, and the display problem may come from that. A sketch of the two optimizer configurations is given after the parameter list below.

Training parameters:

  • Noisy dataset
  • shift_std = 0
  • 20 epochs (instead of 5, since SGD is slower)
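For reference, the change under test amounts to swapping the optimizer, roughly as follows (a sketch with a stand-in model; the learning rates shown are illustrative, not taken from the repo's configuration):

```python
import torch

model = torch.nn.Linear(1, 1)   # stand-in for the spatial transformer

# Configuration A: Adam, the original setup in this notebook.
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3)

# Configuration B: SGD, trained for 20 epochs instead of 5 since it is slower.
opt_sgd = torch.optim.SGD(model.parameters(), lr=1e-2)
```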

Displaying the results (testing the network with shift_std=5):

With Adam
(figure: test results with Adam)

With SGD
(figure: test results with SGD)

A problem with Adam?

Transformer not being trained on a noisy 28x28 MNIST

The loss stays at ~2.3 (≈ ln 10, i.e. chance level for a 10-class cross-entropy) and the accuracy stays at exactly 11% over 20 epochs.

Some hypotheses of why that may be:

  1. The noise arguments used for the dataset are the same ones that were used to train the what pathway in the original paper (taken from the args in the .json file); they are considered very extreme, and the transformer cannot crop correctly.
  2. Maybe the architecture of the transformer needs to be tweaked.
