transformers-embedder discussion from riccorl

Faster indices computation for build_offsets

Feature proposal

When building the word_ids mask for text pairs, we can find the index of the last subword token of the first sentence and update the second part accordingly. The current implementation is somewhat slow because it relies on for loops. It could be made more efficient by vectorizing the function so that it computes the offsets batch-wise. I used NumPy for this, but I'm confident it can be implemented with PyTorch too.

Code snippet

from transformers import AutoTokenizer
import numpy as np

model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

sents = [
    ("Today I'm using BERT embeddings.", "What's the wheather like in London?"),
    ("My Name is Luke.", "What's your name?")
]

# Tokenize the text pairs
inputs = tokenizer(sents, return_special_tokens_mask=True)

# Select the offset indices (last token before the first [SEP] of each pair)
idxs = np.argwhere(np.diff(np.concatenate(inputs.special_tokens_mask)) == 1)[::2].squeeze()

# Obtain the batch word_ids
word_ids = np.concatenate([inputs.word_ids(i) for i in range(len(inputs.input_ids))])

offsets = word_ids[idxs].astype(int)
print(offsets)
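
The snippet above needs the tokenizer files to be available. As a minimal sketch that can be checked offline, the same computation can be run on hard-coded inputs and compared against a plain loop. The masks and word ids below are copied from the detailed explanation; the function names `offsets_loop` and `offsets_vectorized` are hypothetical and not part of transformers-embedder, and using `np.flatnonzero` instead of `np.argwhere(...).squeeze()` is my own tweak so that a batch with a single pair still yields a 1-D index array.

```python
import numpy as np

# Values copied from the example batch (two text pairs, bert-base-cased).
special_tokens_mask = [
    [1] + [0] * 13 + [1] + [0] * 10 + [1],
    [1] + [0] * 5 + [1] + [0] * 6 + [1],
]
word_ids = [
    [None, 0, 1, 2, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, None,
     0, 1, 2, 3, 4, 4, 5, 6, 7, 8, None],
    [None, 0, 1, 2, 3, 4, None, 0, 1, 2, 3, 4, 5, None],
]


def offsets_loop(special_tokens_mask, word_ids):
    """Loop-based reference: find the first [SEP] after [CLS] in each pair."""
    offsets = []
    for mask, ids in zip(special_tokens_mask, word_ids):
        first_sep = mask.index(1, 1)  # start at 1 to skip [CLS] at position 0
        offsets.append(ids[first_sep - 1])  # word id of the token before [SEP]
    return offsets


def offsets_vectorized(special_tokens_mask, word_ids):
    """Batch-wise variant working on the concatenated batch."""
    flat_mask = np.concatenate(special_tokens_mask)
    # dtype=object because the word ids contain None for special tokens.
    flat_ids = np.concatenate([np.asarray(w, dtype=object) for w in word_ids])
    # A 0 -> 1 step marks the token right before a [SEP]; every other step
    # belongs to the [SEP] that closes the first text of a pair.
    idxs = np.flatnonzero(np.diff(flat_mask) == 1)[::2]
    return flat_ids[idxs].astype(int)


print(offsets_loop(special_tokens_mask, word_ids))
print(offsets_vectorized(special_tokens_mask, word_ids))
```

Both functions should agree on this batch. The vectorized variant assumes every example is a text pair laid out as [CLS] A [SEP] B [SEP]; single sentences or tokenizers with a different special-token layout would break the `[::2]` selection.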

Detailed explanation

# [CLS] and [SEP] are marked with 1s in the `special_tokens_mask`
>>> inputs.special_tokens_mask
[
   [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
   [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
]

# Take the differences between consecutive elements to locate the offsets:
>>> np.diff(np.concatenate(inputs.special_tokens_mask))
array([-1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1, -1,  0,  0,    <- "1s" at indexes: 13, 24, 31, 38
        0,  0,  0,  0,  0,  0,  0,  1,  0, -1,  0,  0,  0,  0,  1, -1,  0,
        0,  0,  0,  0,  1])

# The 1s alternate between the middle [SEP] and the closing [SEP] of each pair
# (the closing [SEP] followed by the next [CLS] gives a 0). Keep the middle [SEP]s only, i.e. every other 1 starting from the first.

>>> idxs = np.argwhere(np.diff(np.concatenate(inputs.special_tokens_mask)) == 1)[::2].squeeze()
>>> idxs
array([13, 31])   <- offset indices (i.e. position of the last token of the first text of each pair in the concatenated array)
                       i.e. subword lengths are: 13 - 0 = 13 for the first pair and 31 - 26 = 5 for the second

# Get the word_ids. Unfortunately, the transformers library does not expose them as a batch-level attribute the way it does for `special_tokens_mask`, so they are collected per example.
>>> word_ids = np.concatenate([inputs[i].word_ids for i in range(len(inputs.input_ids))])
>>> word_ids
array([None, 0, 1, 2, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, None, 0, 1, 2, 3, 4,      <- offsets are: 7 and 4
      4, 5, 6, 7, 8, None, None, 0, 1, 2, 3, 4, None, 0, 1, 2, 3, 4, 5,
      None], dtype=object)

# Select the sentence offsets:
>>> word_ids[idxs].astype(int)
array([7, 4])
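
Since the proposal mentions that PyTorch should work too, here is a rough sketch of the same steps with torch tensors. It is not taken from the library: the inputs are hard-coded from the walkthrough above, and because tensors cannot hold `None`, special tokens get a `-1` placeholder (an assumption of this sketch).

```python
import torch

# Values copied from the walkthrough above (two text pairs, bert-base-cased).
special_tokens_mask = [
    [1] + [0] * 13 + [1] + [0] * 10 + [1],
    [1] + [0] * 5 + [1] + [0] * 6 + [1],
]
word_ids = [
    [None, 0, 1, 2, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, None,
     0, 1, 2, 3, 4, 4, 5, 6, 7, 8, None],
    [None, 0, 1, 2, 3, 4, None, 0, 1, 2, 3, 4, 5, None],
]

# Flatten the batch; None (special tokens) becomes -1 so the tensor stays integer.
flat_mask = torch.tensor([m for row in special_tokens_mask for m in row])
flat_ids = torch.tensor([-1 if w is None else w for row in word_ids for w in row])

# Differences between consecutive elements, written with slicing.
steps = flat_mask[1:] - flat_mask[:-1]

# Positions of the 0 -> 1 steps; every other one precedes a middle [SEP].
rises = torch.nonzero(steps == 1).flatten()
idxs = rises[::2]

offsets = flat_ids[idxs]
print(offsets.tolist())  # [7, 4]
```

The flattening here still uses Python list comprehensions, so this only moves the index search onto tensors; a fully batched version would need padded inputs, which is not covered here.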
