tombert's People

Contributors

jefferyyu


tombert's Issues

TomBERT(all-text)

I have read your code, but I could not find the part for TomBERT (all-text). Is this variant not included in the code?

about representations of intra-modality & inter-modality

Hi author,
I'm confused by the representations described in the paper as "intra-modality dynamics including target-text and target-image alignments and inter-modality dynamics, i.e., text-image alignments". In my understanding, intra-modality dynamics should refer to interactions within the same modality, such as text-to-text, rather than target(text)-image; similarly, inter-modality dynamics should refer to interactions between different modalities, such as text-image. Is there something I have missed?
Thank you for your help!
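
For reference, the distinction drawn in this question can be made concrete. In the common usage the question appeals to, intra-modality attention computes queries and keys/values within one modality (e.g., text-to-text), while inter-modality (cross-modal) attention takes queries from one modality and keys/values from the other (e.g., text-to-image). Below is a minimal, generic PyTorch sketch of that reading; it is not TomBERT's actual module, and the shapes, the nn.MultiheadAttention layer, and the tensor names are purely illustrative.

import torch
import torch.nn as nn

hidden = 768
text = torch.randn(1, 10, hidden)    # 10 text token representations (illustrative)
image = torch.randn(1, 49, hidden)   # 49 image region representations, already projected

attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)

# intra-modality (in the question's sense): query, key, value all from the same modality
text_self, _ = attn(text, text, text)          # text attends to text

# inter-modality / cross-modal: query from one modality, key/value from the other
text_to_image, _ = attn(text, image, image)    # text attends to image regions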

Question on Multimodal attention mask

Hi authors, thank you for the well-written paper and detailed documentation of your work. I have a question regarding the multimodal attention mask.

Under TomBERT/my_bert/mm_modelling.py:

class MBertForMMSequenceClassification(PreTrainedBertModel):
    """BERT model for classification with text and image inputs, pooling-1+text (MBERT I)
    """

    def __init__(self, config, num_labels=2, pooling="cls"):
        super(MBertForMMSequenceClassification, self).__init__(config)
        self.num_labels = num_labels
        self.pooling = pooling
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.vismap2text = nn.Linear(2048, config.hidden_size)
        #self.img_attention = BertLayer(config)
        self.comb_attention = MultimodalEncoder(config)
        if pooling == "cls":
            self.text_pooler = BertText1Pooler(config)
            self.classifier = nn.Linear(config.hidden_size, num_labels)
        elif pooling == "first":
            self.img_pooler = BertPooler(config)
            self.classifier = nn.Linear(config.hidden_size, num_labels)
        else:
            self.text_pooler = BertText1Pooler(config)
            self.img_pooler = BertPooler(config)
            self.classifier = nn.Linear(config.hidden_size * 2, num_labels)

        self.apply(self.init_bert_weights)

    def forward(self, input_ids, s2_input_ids, visual_embeds_att, token_type_ids=None, s2_type_ids=None,
                attention_mask=None, s2_mask=None, added_attention_mask=None, labels=None, copy_flag=False):
        # Concatenate Bert-based Text, Text-Aware Image and Image-Aware Text
        sequence_output, pooled_output = self.bert(input_ids, token_type_ids, attention_mask,
                                                   output_all_encoded_layers=False)

        # apply entity-based attention mechanism to obtain different image representations
        vis_embed_map = visual_embeds_att.view(-1, 2048, 49).permute(0, 2, 1)  # self.batch_size, 49, 2048
        vis_pooled_output, _ = vis_embed_map.max(1)  # self.batch_size, 2048
        converted_vis_embed_map = self.vismap2text(vis_pooled_output)  # self.batch_size, hidden_dim

        transpose_img_embed = converted_vis_embed_map.unsqueeze(1)
        text_img_output = torch.cat((transpose_img_embed, sequence_output), dim=1)

        comb_attention_mask = added_attention_mask[:, 48:]  # only the first dimension is for image
        extended_attention_mask = comb_attention_mask.unsqueeze(1).unsqueeze(2)
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
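For context, here is a minimal, self-contained sketch of the tensor shapes in the excerpt above. It assumes visual_embeds_att holds 2048-channel, 7x7 (= 49 regions) ResNet feature maps, which is what the view(-1, 2048, 49) call implies; batch_size and hidden_size are illustrative values, not taken from the repo's config.

import torch
import torch.nn as nn

batch_size, hidden_size = 2, 768                          # illustrative values
visual_embeds_att = torch.randn(batch_size, 2048, 7, 7)   # assumed ResNet feature maps

# 49 region vectors of dimension 2048 per image
vis_embed_map = visual_embeds_att.view(-1, 2048, 49).permute(0, 2, 1)
print(vis_embed_map.shape)            # torch.Size([2, 49, 2048])

# max-pool over the 49 regions: one 2048-d vector per image
vis_pooled_output, _ = vis_embed_map.max(1)
print(vis_pooled_output.shape)        # torch.Size([2, 2048])

# project to BERT's hidden size and add a length-1 sequence dimension,
# so a single pooled image "token" can be prepended to the text sequence
vismap2text = nn.Linear(2048, hidden_size)
transpose_img_embed = vismap2text(vis_pooled_output).unsqueeze(1)
print(transpose_img_embed.shape)      # torch.Size([2, 1, 768])

In other words, this variant prepends only one pooled image position to the text tokens, which is relevant to the index-48 question below.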
I am trying to understand the extended_attention_mask here.

  • So far, I understand that added_attention_mask is [1] * (len(input_ids) + 49), followed by zero-padding to length 113 (64 values for the input_ids mask and 49 values for the image patches).
  • Since input_ids has a different length for each example, this mask varies per input. So why does comb_attention_mask start specifically from index 48?
  • And why is extended_attention_mask multiplied by -10000.0?

Hope to hear from you, thanks so much!
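
A minimal sketch of the masking arithmetic asked about in the last two bullets. The layout below (49 image-patch slots followed by 64 text slots, 113 in total) is only one plausible reading of the first bullet and should be checked against the repo's data loader; the concrete lengths and the variable names outside the excerpt are illustrative. What is standard, and not specific to this repo, is the (1.0 - mask) * -10000.0 trick: instead of multiplying attention probabilities by a 0/1 mask, a large negative bias is added to the raw attention scores so that softmax itself assigns near-zero weight to masked positions.

import torch

# Assumed layout (not verified against the repo): 49 image-patch slots, then 64 text slots
num_img, num_text, text_len = 49, 64, 5          # text_len: hypothetical real token count
added_attention_mask = torch.zeros(1, num_img + num_text)
added_attention_mask[:, :num_img] = 1.0                      # image slots valid
added_attention_mask[:, num_img:num_img + text_len] = 1.0    # real text tokens valid

# [:, 48:] keeps one image slot plus the 64 text slots -> length 65, matching the
# combined sequence built in forward(): 1 pooled image "token" + 64 text tokens
comb_attention_mask = added_attention_mask[:, 48:]
print(comb_attention_mask.shape)                 # torch.Size([1, 65])

# Standard BERT-style additive mask: 1 -> 0.0 (keep), 0 -> -10000.0 (mask out),
# broadcast to (batch, 1, 1, seq_len) so it can be added to the attention scores
extended = (1.0 - comb_attention_mask.unsqueeze(1).unsqueeze(2)) * -10000.0

# Adding -10000 before softmax drives the weight on padded positions to ~0
scores = torch.zeros(1, 1, 1, comb_attention_mask.size(-1))  # dummy attention logits
probs = torch.softmax(scores + extended, dim=-1)
print(probs[0, 0, 0, :3])    # valid positions share the probability mass
print(probs[0, 0, 0, -3:])   # padded positions get ~0

Under this reading, slicing from index 48 (rather than 49) keeps exactly one of the original 49 image positions as the mask slot for the single pooled image representation, but whether that is the intended interpretation is best confirmed by the authors.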
