
Comments (3)

fmassa commented on June 23, 2024

Hi,

Question 1

I want to know what is considered a positional encoding when working with images.

Positional encoding takes an (x, y) coordinate in [0, 1] and converts it into a vector of 256 elements. The encodings for x and y are computed in the same way, so for the sake of simplicity let's only look at the x part.
In

y_embed = not_mask.cumsum(1, dtype=torch.float32)
x_embed = not_mask.cumsum(2, dtype=torch.float32)

we create an image-shaped tensor which is similar in spirit to meshgrid, but which supports images of different sizes (read: masks) within a batch. This way we have a grid of (x, y) coordinates, which we then normalize so that they lie between 0 and 1 (in this case we also scale by 2 * pi, but that's a detail):
if self.normalize:
    eps = 1e-6
    y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
    x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale
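
To make this concrete, here is a tiny self-contained sketch (a toy example, not the library code) of what those two steps produce for a single image whose last column is padding:

import math
import torch

# Toy example: one image of height 2 and width 3, last column is padding.
mask = torch.tensor([[[False, False, True],
                      [False, False, True]]])      # True marks padded pixels
not_mask = ~mask

y_embed = not_mask.cumsum(1, dtype=torch.float32)  # running row index over valid pixels
x_embed = not_mask.cumsum(2, dtype=torch.float32)  # running column index over valid pixels

# Normalize by the last row/column value so valid coordinates lie in (0, 1],
# then scale by 2 * pi.
eps, scale = 1e-6, 2 * math.pi
y_embed = y_embed / (y_embed[:, -1:, :] + eps) * scale
x_embed = x_embed / (x_embed[:, :, -1:] + eps) * scale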

Then, in
pos_x = x_embed[:, :, :, None] / dim_t
pos_y = y_embed[:, :, :, None] / dim_t
pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
we apply the standard sine embedding in a vectorized fashion to x and y separately, and then concatenate the two results, yielding the spatial positional embedding.
The positional embeddings only depend on the feature map shapes and the masks (as there could be padding between different images), and not on the content of the feature maps.
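
For completeness, here is a hedged sketch of the glue around those lines, continuing the toy example above and assuming the default num_pos_feats=128 and temperature=10000; the dim_t construction and the final cat/permute are an approximation of the overall flow, not copied verbatim:

# Continuing the sketch above (x_embed / y_embed already computed and normalized).
num_pos_feats, temperature = 128, 10000   # assumed defaults: 128 channels each for x and y
dim_t = torch.arange(num_pos_feats, dtype=torch.float32)
dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode='floor') / num_pos_feats)

pos_x = x_embed[:, :, :, None] / dim_t    # (B, H, W, 128)
pos_y = y_embed[:, :, :, None] / dim_t
pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)

# Concatenate the y and x halves and move channels first: (B, 256, H, W).
pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)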

Question 2

How do you calculate masks when using images in transformers?

Those are calculated in

mask = torch.ones((b, h, w), dtype=torch.bool, device=device)

Basically, every position that corresponds to the zero padding added so that the images in a batch have the same size is filled with True in the mask.
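
For illustration, here is a rough sketch of that idea (not the actual DETR utility) for batching two images of different sizes:

import torch

# Two images of different spatial sizes; pad both to the max H and W and
# record which positions are padding.
imgs = [torch.rand(3, 20, 30), torch.rand(3, 24, 25)]
b = len(imgs)
h = max(im.shape[1] for im in imgs)   # 24
w = max(im.shape[2] for im in imgs)   # 30

batch = torch.zeros((b, 3, h, w))
mask = torch.ones((b, h, w), dtype=torch.bool)   # start: everything is padding
for i, im in enumerate(imgs):
    _, ih, iw = im.shape
    batch[i, :, :ih, :iw] = im        # copy the image into the top-left corner
    mask[i, :ih, :iw] = False         # real pixels -> False, padding stays True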

I believe I have answered your questions, and as such I'm closing the issue, but let us know if you have further questions.


saahiluppal commented on June 23, 2024

Perfect explanation!


vigneshgig commented on June 23, 2024

(quoting @fmassa's explanation above)

Hi @fmassa, I have one doubt. For the sine positional encoding, what is the input format? Is tensor_list.mask made of 0s and 1s, where 1 is the bounding box area and 0 is outside the bbox, and the positional embedding is computed from that mask? Is that right?
I have implemented a positional encoding for my project to extract spatial positional features. Currently I just use a one-hot encoding by dividing the image into a grid: if the bounding box overlaps a grid cell it is set to one, otherwise zero, and so on (roughly like the sketch below). But I came across this sine positional encoding and am planning to add it. If possible, could you please explain the difference between the one-hot grid encoding and this positional encoding?
Thanks
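
Roughly, my current grid-based one-hot encoding looks like this simplified sketch (the grid size and the overlap rule here are just illustrative choices):

import torch

# Split a W x H image into a G x G grid and set a cell to 1 if the
# bounding box overlaps it, 0 otherwise.
def one_hot_grid(box, img_w, img_h, G=8):
    x0, y0, x1, y1 = box                           # bbox corners in pixels
    grid = torch.zeros(G, G)
    cell_w, cell_h = img_w / G, img_h / G
    for gy in range(G):
        for gx in range(G):
            cx0, cy0 = gx * cell_w, gy * cell_h    # cell corners
            cx1, cy1 = cx0 + cell_w, cy0 + cell_h
            if x0 < cx1 and x1 > cx0 and y0 < cy1 and y1 > cy0:
                grid[gy, gx] = 1.0                 # bbox overlaps this cell
    return grid.flatten()                          # G*G hard, discrete position code

print(one_hot_grid((40, 10, 90, 60), img_w=160, img_h=120))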

