tombert's People

Contributors

jefferyyu


tombert's Issues

TomBERT(all-text)

I have read your code, but I could not find the part for TomBERT (all-text). Is this variant not included in the code?

about representations of intra-modality & inter-modality

Hi author,
I'm confused by the representations described in the paper as "intra-modality dynamics including target-text and target-image alignments and inter-modality dynamics, i.e., text-image alignments". In my understanding, intra-modality dynamics should refer to interactions within the same modality, such as text-to-text, rather than target(text)-image; similarly, inter-modality dynamics should refer to interactions between different modalities, such as text-image. Is there something I have missed?
Thank you for your help!
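
For reference, the distinction drawn in this question can be made concrete. In the common usage the question appeals to, intra-modality attention computes queries and keys/values within one modality (e.g., text-to-text), while inter-modality (cross-modal) attention takes queries from one modality and keys/values from the other (e.g., text-to-image). Below is a minimal, generic PyTorch sketch of that reading; it is not TomBERT's actual module, and the shapes, the nn.MultiheadAttention layer, and the tensor names are purely illustrative.

import torch
import torch.nn as nn

hidden = 768
text = torch.randn(1, 10, hidden)    # 10 text token representations (illustrative)
image = torch.randn(1, 49, hidden)   # 49 image region representations, already projected

attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)

# intra-modality (in the question's sense): query, key, value all from the same modality
text_self, _ = attn(text, text, text)          # text attends to text

# inter-modality / cross-modal: query from one modality, key/value from the other
text_to_image, _ = attn(text, image, image)    # text attends to image regions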

Question on Multimodal attention mask

Hi authors, thank you for the well-written paper and detailed documentation of your work. I have a question regarding the multimodal attention mask.

Under TomBERT/my_bert/mm_modelling.py:

class MBertForMMSequenceClassification(PreTrainedBertModel):
    """BERT model for classification with text and image inputs, pooling-1+text (MBERT I)
    """

    def __init__(self, config, num_labels=2, pooling="cls"):
        super(MBertForMMSequenceClassification, self).__init__(config)
        self.num_labels = num_labels
        self.pooling = pooling
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.vismap2text = nn.Linear(2048, config.hidden_size)
        #self.img_attention = BertLayer(config)
        self.comb_attention = MultimodalEncoder(config)
        if pooling == "cls":
            self.text_pooler = BertText1Pooler(config)
            self.classifier = nn.Linear(config.hidden_size, num_labels)
        elif pooling == "first":
            self.img_pooler = BertPooler(config)
            self.classifier = nn.Linear(config.hidden_size, num_labels)
        else:
            self.text_pooler = BertText1Pooler(config)
            self.img_pooler = BertPooler(config)
            self.classifier = nn.Linear(config.hidden_size * 2, num_labels)

        self.apply(self.init_bert_weights)

    def forward(self, input_ids, s2_input_ids, visual_embeds_att, token_type_ids=None, s2_type_ids=None,
                attention_mask=None, s2_mask=None, added_attention_mask=None, labels=None, copy_flag=False):
        # Concatenate Bert-based Text, Text-Aware Image and Image-Aware Text
        sequence_output, pooled_output = self.bert(input_ids, token_type_ids, attention_mask,
                                                   output_all_encoded_layers=False)

        # apply entity-based attention mechanism to obtain different image representations
        vis_embed_map = visual_embeds_att.view(-1, 2048, 49).permute(0, 2, 1)  # self.batch_size, 49, 2048
        vis_pooled_output, _ = vis_embed_map.max(1)  # self.batch_size, 2048
        converted_vis_embed_map = self.vismap2text(vis_pooled_output)  # self.batch_size, hidden_dim

        transpose_img_embed = converted_vis_embed_map.unsqueeze(1)
        text_img_output = torch.cat((transpose_img_embed, sequence_output), dim=1)

        comb_attention_mask = added_attention_mask[:, 48:]  # only the first dimension is for image
        extended_attention_mask = comb_attention_mask.unsqueeze(1).unsqueeze(2)
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
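For context, here is a minimal, self-contained sketch of the tensor shapes in the excerpt above. It assumes visual_embeds_att holds 2048-channel, 7x7 (= 49 regions) ResNet feature maps, which is what the view(-1, 2048, 49) call implies; batch_size and hidden_size are illustrative values, not taken from the repo's config.

import torch
import torch.nn as nn

batch_size, hidden_size = 2, 768                          # illustrative values
visual_embeds_att = torch.randn(batch_size, 2048, 7, 7)   # assumed ResNet feature maps

# 49 region vectors of dimension 2048 per image
vis_embed_map = visual_embeds_att.view(-1, 2048, 49).permute(0, 2, 1)
print(vis_embed_map.shape)            # torch.Size([2, 49, 2048])

# max-pool over the 49 regions: one 2048-d vector per image
vis_pooled_output, _ = vis_embed_map.max(1)
print(vis_pooled_output.shape)        # torch.Size([2, 2048])

# project to BERT's hidden size and add a length-1 sequence dimension,
# so a single pooled image "token" can be prepended to the text sequence
vismap2text = nn.Linear(2048, hidden_size)
transpose_img_embed = vismap2text(vis_pooled_output).unsqueeze(1)
print(transpose_img_embed.shape)      # torch.Size([2, 1, 768])

In other words, this variant prepends only one pooled image position to the text tokens, which is relevant to the index-48 question below.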
I am trying to understand the extended_attention_mask here.

  • So far, I understand that added_attention_mask is [1] * (len(input_ids) + 49), followed by zero-padding to length 113 (64 values for the input_ids mask and 49 values for the image patches).
  • Since input_ids has a different length for each example, this mask varies per input. So why does comb_attention_mask start specifically from index 48?
  • And why is extended_attention_mask multiplied by -10000.0?

Hope to hear from you, thanks so much!
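
A minimal sketch of the masking arithmetic asked about in the last two bullets. The layout below (49 image-patch slots followed by 64 text slots, 113 in total) is only one plausible reading of the first bullet and should be checked against the repo's data loader; the concrete lengths and the variable names outside the excerpt are illustrative. What is standard, and not specific to this repo, is the (1.0 - mask) * -10000.0 trick: instead of multiplying attention probabilities by a 0/1 mask, a large negative bias is added to the raw attention scores so that softmax itself assigns near-zero weight to masked positions.

import torch

# Assumed layout (not verified against the repo): 49 image-patch slots, then 64 text slots
num_img, num_text, text_len = 49, 64, 5          # text_len: hypothetical real token count
added_attention_mask = torch.zeros(1, num_img + num_text)
added_attention_mask[:, :num_img] = 1.0                      # image slots valid
added_attention_mask[:, num_img:num_img + text_len] = 1.0    # real text tokens valid

# [:, 48:] keeps one image slot plus the 64 text slots -> length 65, matching the
# combined sequence built in forward(): 1 pooled image "token" + 64 text tokens
comb_attention_mask = added_attention_mask[:, 48:]
print(comb_attention_mask.shape)                 # torch.Size([1, 65])

# Standard BERT-style additive mask: 1 -> 0.0 (keep), 0 -> -10000.0 (mask out),
# broadcast to (batch, 1, 1, seq_len) so it can be added to the attention scores
extended = (1.0 - comb_attention_mask.unsqueeze(1).unsqueeze(2)) * -10000.0

# Adding -10000 before softmax drives the weight on padded positions to ~0
scores = torch.zeros(1, 1, 1, comb_attention_mask.size(-1))  # dummy attention logits
probs = torch.softmax(scores + extended, dim=-1)
print(probs[0, 0, 0, :3])    # valid positions share the probability mass
print(probs[0, 0, 0, -3:])   # padded positions get ~0

Under this reading, slicing from index 48 (rather than 49) keeps exactly one of the original 49 image positions as the mask slot for the single pooled image representation, but whether that is the intended interpretation is best confirmed by the authors.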
