
Comments (3)

fmassa commented on June 23, 2024

Hi,

Question 1

I want to know what is considered a positional encoding when working with images.

Positional encoding takes an (x, y) coordinate in [0, 1] and converts it into a vector of 256 elements. The encodings for x and y are computed in the same way, so for the sake of simplicity let's only look at the x part.
In

y_embed = not_mask.cumsum(1, dtype=torch.float32)
x_embed = not_mask.cumsum(2, dtype=torch.float32)

we create an image-shaped tensor which is similar in spirit to meshgrid, but which supports images of different sizes (read: masks) within a batch. This way we have a grid of (x, y) coordinates, which we then normalize so that they lie between 0 and 1 (in this case we also scale by 2 * pi, but that's a detail):
if self.normalize:
    eps = 1e-6
    y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
    x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale
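
To make this concrete, here is a tiny self-contained sketch (a toy example, not the library code) of what those two steps produce for a single image whose last column is padding:

import math
import torch

# Toy example: one image of height 2 and width 3, last column is padding.
mask = torch.tensor([[[False, False, True],
                      [False, False, True]]])      # True marks padded pixels
not_mask = ~mask

y_embed = not_mask.cumsum(1, dtype=torch.float32)  # running row index over valid pixels
x_embed = not_mask.cumsum(2, dtype=torch.float32)  # running column index over valid pixels

# Normalize by the last row/column value so valid coordinates lie in (0, 1],
# then scale by 2 * pi.
eps, scale = 1e-6, 2 * math.pi
y_embed = y_embed / (y_embed[:, -1:, :] + eps) * scale
x_embed = x_embed / (x_embed[:, :, -1:] + eps) * scale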

Then, in
pos_x = x_embed[:, :, :, None] / dim_t
pos_y = y_embed[:, :, :, None] / dim_t
pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
we apply the standard sine embedding in a vectorized fashion to x and y separately, and then concatenate the two results, yielding the spatial positional embedding.
The positional embeddings only depend on the feature map shapes and the masks (as there could be padding between different images), and not on the content of the feature maps.
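
For completeness, here is a hedged sketch of the glue around those lines, continuing the toy example above and assuming the default num_pos_feats=128 and temperature=10000; the dim_t construction and the final cat/permute are an approximation of the overall flow, not copied verbatim:

# Continuing the sketch above (x_embed / y_embed already computed and normalized).
num_pos_feats, temperature = 128, 10000   # assumed defaults: 128 channels each for x and y
dim_t = torch.arange(num_pos_feats, dtype=torch.float32)
dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode='floor') / num_pos_feats)

pos_x = x_embed[:, :, :, None] / dim_t    # (B, H, W, 128)
pos_y = y_embed[:, :, :, None] / dim_t
pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)

# Concatenate the y and x halves and move channels first: (B, 256, H, W).
pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)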

Question 2

How do you calculate masks when using images in transformers?

Those are calculated in

mask = torch.ones((b, h, w), dtype=torch.bool, device=device)

Basically, every position that corresponds to the zero padding added so that the images in a batch have the same size is filled with True in the mask.
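
For illustration, here is a rough sketch of that idea (not the actual DETR utility) for batching two images of different sizes:

import torch

# Two images of different spatial sizes; pad both to the max H and W and
# record which positions are padding.
imgs = [torch.rand(3, 20, 30), torch.rand(3, 24, 25)]
b = len(imgs)
h = max(im.shape[1] for im in imgs)   # 24
w = max(im.shape[2] for im in imgs)   # 30

batch = torch.zeros((b, 3, h, w))
mask = torch.ones((b, h, w), dtype=torch.bool)   # start: everything is padding
for i, im in enumerate(imgs):
    _, ih, iw = im.shape
    batch[i, :, :ih, :iw] = im        # copy the image into the top-left corner
    mask[i, :ih, :iw] = False         # real pixels -> False, padding stays True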

I believe I have answered your questions, and as such I'm closing the issue, but let us know if you have further questions.


saahiluppal commented on June 23, 2024

Perfect explanation!


vigneshgig commented on June 23, 2024

(quoting @fmassa's explanation above)

Hi @fmassa, I have one doubt. For the sine positional encoding, what is the input format? Is tensor_list.mask made of 0s and 1s, where 1 is the bounding box area and 0 is outside the bbox, and the positional embedding is computed from that mask? Is that right?
I have implemented a positional encoding for my project to extract spatial positional features. Currently I just use a one-hot encoding by dividing the image into a grid: if the bounding box overlaps a grid cell it is set to one, otherwise zero, and so on (roughly like the sketch below). But I came across this sine positional encoding and am planning to add it. If possible, could you please explain the difference between the one-hot grid encoding and this positional encoding?
Thanks
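
Roughly, my current grid-based one-hot encoding looks like this simplified sketch (the grid size and the overlap rule here are just illustrative choices):

import torch

# Split a W x H image into a G x G grid and set a cell to 1 if the
# bounding box overlaps it, 0 otherwise.
def one_hot_grid(box, img_w, img_h, G=8):
    x0, y0, x1, y1 = box                           # bbox corners in pixels
    grid = torch.zeros(G, G)
    cell_w, cell_h = img_w / G, img_h / G
    for gy in range(G):
        for gx in range(G):
            cx0, cy0 = gx * cell_w, gy * cell_h    # cell corners
            cx1, cy1 = cx0 + cell_w, cy0 + cell_h
            if x0 < cx1 and x1 > cx0 and y0 < cy1 and y1 > cy0:
                grid[gy, gx] = 1.0                 # bbox overlaps this cell
    return grid.flatten()                          # G*G hard, discrete position code

print(one_hot_grid((40, 10, 90, 60), img_w=160, img_h=120))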

