How to generate embeddings for new candidates? about blink HOT 6 OPEN

abhinavkulkarni commented on May 24, 2024

How to generate embeddings for new candidates?

Comments (6)

abhinavkulkarni commented on May 24, 2024

Thanks to @ledw-2 and others from other issues, I was able to recreate the embeddings for existing entities (in entity.jsonl) from their Wikipedia title and description, and I verified that they match those in all_entities.t7 to six decimal places.
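
The comparison described above could be sketched roughly as follows. This is only an illustration, assuming the recomputed embeddings and the corresponding rows of all_entities.t7 are both available as float tensors of the same shape; the function name embeddings_match and the 1e-6 tolerance are hypothetical choices, not part of BLINK.

```python
import torch

def embeddings_match(recomputed, reference, atol=1e-6):
    """Return True if both tensors have the same shape and agree within atol."""
    if recomputed.shape != reference.shape:
        return False
    # rtol=0 so that only the absolute difference is checked.
    return torch.allclose(recomputed, reference, rtol=0.0, atol=atol)

# Toy stand-ins: in practice `reference` would be rows loaded from
# all_entities.t7 and `recomputed` the freshly encoded embeddings.
reference = torch.tensor([[0.1234567, -0.7654321], [0.5, 0.25]])
recomputed = reference + 5e-7
print(embeddings_match(recomputed, reference))
```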

Given a new entity's title and description, here is how to generate the token ids from which its embedding is computed:

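import json
import torch
from blink.biencoder.biencoder import load_biencoder
from blink.biencoder.data_process import get_candidate_representation

# Load the biencoder model and its params, just like in main_dense.py.
# `args` is assumed to provide the biencoder_config and biencoder_model paths.
with open(args.biencoder_config) as json_file:
    biencoder_params = json.load(json_file)
    biencoder_params["path_to_model"] = args.biencoder_model
biencoder = load_biencoder(biencoder_params)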

# Read 10 entities from entity.jsonl
entities = []
count = 10
with open('./models/entity.jsonl') as f:
    for i, line in enumerate(f):
        entity = json.loads(line)
        entities.append(entity)
        if i == count-1:
            break

# Get token_ids corresponding to candidate title and description
tokenizer = biencoder.tokenizer
max_seq_length = biencoder_params["max_cand_length"]  # length limit on the candidate side
ids = []

for entity in entities:
    candidate_desc = entity['text']
    candidate_title = entity['title']
    cand_tokens = get_candidate_representation(
        candidate_desc, 
        tokenizer, 
        max_seq_length, 
        candidate_title=candidate_title
    )

    token_ids = cand_tokens["ids"]
    ids.append(token_ids)

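# Stack the per-entity id lists into one tensor with a row per entity and save it.
# `path` is an output file of your choice; it is not defined in this snippet.
ids = torch.tensor(ids)
torch.save(ids, path)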

Pass the file in which these ids are saved as the --saved_cand_ids argument of scripts/generate_candidates.py, which encodes them into embeddings.

Thanks to the FB team for this awesome project!

ledw-2 commented on May 24, 2024

@abhinavkulkarni Thanks for the comments! I hope you find this project useful. 😃

amelieyu1989 commented on May 24, 2024

It seems that we have to update and regenerate the whole entity.jsonl file in order to get the .t7 file.

abhinavkulkarni commented on May 24, 2024

@amelieyu1989: No. If entity.jsonl has N entities, then all_entities.t7 is a torch tensor with N rows. So you can append the new entity to entity.jsonl, load the tensor, add a row for it, and save it again.
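
A minimal sketch of the "add a row" step, using toy tensors in place of the real files. The helper name append_entity_rows is hypothetical, and the sketch assumes the new rows are embeddings produced by the biencoder (not token ids) with the same dimension as the existing matrix.

```python
import torch

def append_entity_rows(all_embeddings, new_embeddings):
    """Stack new entity embeddings under the existing N x d matrix."""
    if new_embeddings.dim() == 1:
        new_embeddings = new_embeddings.unsqueeze(0)  # single entity -> 1 x d
    if new_embeddings.shape[1] != all_embeddings.shape[1]:
        raise ValueError("embedding dimensions differ")
    new_embeddings = new_embeddings.to(all_embeddings.dtype)
    return torch.cat((all_embeddings, new_embeddings), dim=0)

# Toy example: 3 existing entities, 1 new entity, embedding dimension 4.
old = torch.zeros(3, 4)
new = torch.ones(4)
combined = append_entity_rows(old, new)
print(combined.shape)  # torch.Size([4, 4])
```

With the real data, the existing matrix would come from torch.load on all_entities.t7 and the result would be written back with torch.save. The new entity's line should be appended to entity.jsonl in the same position, so that row i of the tensor still corresponds to line i of the file.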

amelieyu1989 commented on May 24, 2024

I see. You mean I could compute new_encode_list = torch.cat((old_encode_list, new_entities_tokens))?
Could you share the code if possible?

lentikr commented on May 24, 2024

@abhinavkulkarni
Thank you for providing the code and assistance! I used the code you provided to generate a file called entity_token_ids_128.t7, which contains entity representations. Next, I should use the generate_candidates.py file to generate embeddings for the entities. Could you please advise me on how to set the parameters (especially --batch_size, --chunk_start and --chunk_end)?

I guess it may be as follows.

python generate_candidates.py --path_to_model_config models/biencoder_wiki_large.json --path_to_model models/biencoder_wiki_large.bin --entity_dict_path models/entity1.jsonl --encoding_save_file_dir models --saved_cand_ids models/entity_token_ids_128.t7 --batch_size 512 --chunk_start 0 --chunk_end 1000000
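
For choosing the chunk arguments, the following sketch may help. It assumes, without confirmation from this thread, that --chunk_start and --chunk_end select a half-open slice [chunk_start, chunk_end) of the entity list, so that a large entity file can be encoded in several runs. The helper chunk_ranges and the entity count of 2,500,000 are hypothetical and only for illustration.

```python
def chunk_ranges(num_entities, chunk_size):
    """Yield (chunk_start, chunk_end) pairs that together cover all entities."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, num_entities, chunk_size):
        # The last chunk is clipped to the number of entities.
        yield start, min(start + chunk_size, num_entities)

# Hypothetical entity count, split into chunks of one million.
for start, end in chunk_ranges(2_500_000, 1_000_000):
    print(f"--chunk_start {start} --chunk_end {end}")
```

Under that assumption, a file with fewer entities than --chunk_end would be covered by a single run starting at 0. The batch size mainly trades speed against memory, so a smaller value is the usual fallback if the encoder runs out of memory.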
