Comments (6)
@mli0603 see Section 4.2, 'Importance of positional encodings', in the paper for the architectural choices on where to pass positional encodings. There is also Table 3, in which the second row corresponds to the vanilla Transformer, where we pass positional encodings once at the transformer input; that is the variant you are referring to (also used in the demo colab). As we explain in the text, passing the encodings directly in attention leads to a significant performance boost.
from detr.
In the code you pointed to, the positional encodings are added in the first line of the function, see https://github.com/facebookresearch/detr/blob/master/models/transformer.py#L154. Can you elaborate?
@szagoruyko The residual connection is what I am referring to. If you look at line 157, the residual connection is made from the original input instead of the position-encoded input. I.e., currently the code reads
src = src + self.dropout1(src2)
while I think both this paper and the original Transformer paper describe it as
src = q + self.dropout1(src2)
where q is the input with the positional encoding added.
Does this clear things up? Sorry if my previous description was confusing.
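To make the two residual variants concrete, here is a minimal sketch assuming a standard `torch.nn.MultiheadAttention`. The function names `self_attn_block` and `self_attn_block_alt` are illustrative, not DETR's actual code; the first mirrors what line 157 does today, the second the variant asked about here.

```python
import torch
from torch import nn

def self_attn_block(src, pos, attn, dropout):
    # What the DETR code currently does: query/key carry the positional
    # encoding, but the residual is taken from the raw features `src`.
    q = k = src + pos
    src2 = attn(q, k, value=src)[0]
    return src + dropout(src2)  # residual from src

def self_attn_block_alt(src, pos, attn, dropout):
    # The variant discussed above: residual taken from the
    # position-encoded input q instead.
    q = k = src + pos
    src2 = attn(q, k, value=src)[0]
    return q + dropout(src2)  # residual from q
```

With dropout disabled, the two outputs differ by exactly `pos`, since the attention output itself is identical in both variants; only the residual path changes.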
@szagoruyko Another issue I see is that the positional encoding is added to src for every encoder/decoder layer in the for loop (https://github.com/facebookresearch/detr/blob/master/models/transformer.py#L76) by with_pos_embed (https://github.com/facebookresearch/detr/blob/master/models/transformer.py#L154). Is this necessary? In the paper, the positional encoding is only added once, which makes more sense to me.
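The per-layer re-injection being questioned can be sketched as follows. The `with_pos_embed` helper matches the linked line; the `encoder` loop is a simplified stand-in for DETR's encoder, not the actual implementation.

```python
def with_pos_embed(tensor, pos):
    # Re-adds the (fixed) positional encoding; a no-op when pos is None.
    return tensor if pos is None else tensor + pos

def encoder(src, pos, layers):
    # The same pos tensor is re-injected at the input of every layer,
    # rather than being added once before layer 0.
    out = src
    for layer in layers:
        out = layer(out, pos=pos)
    return out
```

The "once at the input" variant from the paper's Table 3 would instead compute `src = src + pos` a single time and run the loop without `pos`.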
@szagoruyko Thanks for your comment on the pos encoding! I am sorry, I completely misunderstood that paragraph in the paper. It makes sense now.
I still wonder whether there is an explicit design choice behind taking the residual connection from the image features src directly rather than from the position-encoded features q (my first question above). Maybe it is just too minor to make a difference? I really appreciate it.
That paragraph now makes much more sense. There also seems to be another deviation from the original Transformer: you apply the positional encoding only to the key and query, but not to the value. I did not see in the paper whether that choice also improves performance.
Even in the original Transformer paper, the value output from the encoder does not get the position-embedding treatment, so it makes sense to avoid it for all the values. Since we add the position (to k, q) in every layer, it is consistent not to add it to the value in any of them.
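The key/query-only choice can be sketched as a small module, again assuming a standard `torch.nn.MultiheadAttention`; the class name is illustrative, not from the DETR codebase.

```python
import torch
from torch import nn

class EncoderSelfAttn(nn.Module):
    # Positional encodings enter only the attention *addresses*
    # (query and key). The *content* being aggregated (value) stays
    # position-free, so the output is a weighted sum of raw features.
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)

    def forward(self, src, pos):
        q = k = src + pos                     # where to attend
        out, _ = self.attn(q, k, value=src)   # what to mix: raw src
        return out
```

Here `pos` shifts the attention weights but never leaks into the mixed content, which is the consistency argument made above.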