Comments (6)

abhinavkulkarni commented on May 24, 2024

Thanks to @ledw-2 and others in related issues, I was able to recreate the embeddings for existing entities (in entity.jsonl) from their Wikipedia titles and descriptions, and verified that they match those in all_entities.t7 to the 6th decimal place.
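One way to run that kind of check, as a minimal sketch: compare the recomputed vectors against the corresponding rows of the stored matrix with an absolute tolerance of 1e-6. The helper name `rows_match` and the stand-in tensors are hypothetical; in practice `stored` would come from `torch.load` on all_entities.t7 and `recomputed` from the biencoder's candidate encoder.

```python
import torch


def rows_match(recomputed, stored, atol=1e-6):
    """Return True if every recomputed row equals the stored row within atol."""
    # Hypothetical helper: assumes row i of `recomputed` corresponds to
    # row i of `stored` (same order as the lines of entity.jsonl).
    reference = stored[: recomputed.shape[0]]
    return torch.allclose(recomputed, reference, rtol=0.0, atol=atol)


# Stand-in data; real embeddings would be loaded from all_entities.t7.
stored = torch.arange(12, dtype=torch.float64).reshape(4, 3)
recomputed = stored[:2] + 5e-7  # tiny numerical noise, below the tolerance

close_enough = rows_match(recomputed, stored)
too_far = rows_match(recomputed + 1e-3, stored)
print(close_enough, too_far)
```

Setting `rtol=0.0` makes the comparison purely absolute, which is what "matches to the 6th decimal place" implies.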

Given a new entity title and its description, here's how to generate the token ids from which its embedding is computed:

import json

import torch

# Import paths as used in main_dense.py
from blink.biencoder.biencoder import load_biencoder
from blink.biencoder.data_process import get_candidate_representation

# Load biencoder model and biencoder params just like in main_dense.py
# (args is the parsed command-line namespace from that script)
with open(args.biencoder_config) as json_file:
    biencoder_params = json.load(json_file)
    biencoder_params["path_to_model"] = args.biencoder_model
biencoder = load_biencoder(biencoder_params)

# Read the first 10 entities from entity.jsonl
entities = []
count = 10
with open('./models/entity.jsonl') as f:
    for i, line in enumerate(f):
        entity = json.loads(line)
        entities.append(entity)
        if i == count - 1:
            break

# Get token_ids corresponding to candidate title and description
tokenizer = biencoder.tokenizer
max_seq_length = biencoder_params["max_cand_length"]
ids = []

for entity in entities:
    candidate_desc = entity['text']
    candidate_title = entity['title']
    cand_tokens = get_candidate_representation(
        candidate_desc,
        tokenizer,
        max_seq_length,
        candidate_title=candidate_title
    )

    token_ids = cand_tokens["ids"]
    ids.append(token_ids)

# Save the token ids; path is an output file of your choice
ids = torch.tensor(ids)
torch.save(ids, path)

The file in which these ids are saved should be passed in the --saved_cand_ids param of scripts/generate_candidates.py.

Thanks to the FB team for this awesome project!

from blink.

ledw-2 commented on May 24, 2024

@abhinavkulkarni Thanks for the comments! I hope you find this project useful. 😃

amelieyu1989 commented on May 24, 2024

It seems that we have to update and regenerate the whole entity.jsonl file in order to get the .t7 file.

abhinavkulkarni commented on May 24, 2024

@amelieyu1989: No. If entity.jsonl has N entities, then all_entities.t7 is a torch tensor with N rows. So you can add an additional entity to entity.jsonl, load the torch matrix, add a row, and re-save it.
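The first half of that step, adding an entity to the JSONL file, could look roughly like this. It is a sketch only: `append_entity` is a hypothetical helper, the demo writes to a throwaway file instead of models/entity.jsonl, and it assumes the existing file ends with a newline. Real records may carry more fields; `title` and `text` are simply the two that the snippet earlier in this thread reads.

```python
import json
import os
import tempfile


def append_entity(jsonl_path, title, text):
    """Append one entity record as a JSON line; return its 0-based row index."""
    # Count existing lines so the caller knows which embedding row
    # the new entity will correspond to.
    with open(jsonl_path, encoding="utf-8") as f:
        row_index = sum(1 for _ in f)
    with open(jsonl_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"title": title, "text": text}) + "\n")
    return row_index


# Demo on a temporary file rather than the real entity file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "entity.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"title": "A", "text": "first"}) + "\n")
    new_row = append_entity(demo_path, "B", "second")
    with open(demo_path, encoding="utf-8") as f:
        titles = [json.loads(line)["title"] for line in f]

print(new_row, titles)
```

The returned index matters because the embedding row for the new entity has to sit at the same position in the matrix as its line in the file.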

amelieyu1989 commented on May 24, 2024

I see. You mean I could get my new_encode_list = torch.cat((old_encode_list, new_entities_tokens))?
Could you share the code if possible?
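A minimal sketch of that concatenation, with one caveat: since all_entities.t7 holds embeddings, the rows being appended should be encoded vectors of the same width, not raw token ids. The helper name `append_rows`, the tensor shapes, and the output file name are all hypothetical stand-ins.

```python
import os
import tempfile

import torch


def append_rows(old_encodings, new_encodings):
    """Stack new entity embeddings under the existing matrix (row-wise)."""
    if old_encodings.shape[1] != new_encodings.shape[1]:
        raise ValueError("embedding widths differ")
    return torch.cat((old_encodings, new_encodings), dim=0)


old = torch.zeros(5, 4)  # stand-in for the N x D matrix from all_entities.t7
new = torch.ones(2, 4)   # stand-in for the embeddings of 2 new entities
combined = append_rows(old, new)

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "all_entities_extended.t7")  # hypothetical name
    torch.save(combined, out_path)
    reloaded = torch.load(out_path)

print(combined.shape)
```

Appending at the end keeps the existing row indices valid, so the new rows line up with entities appended at the end of entity.jsonl.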

lentikr commented on May 24, 2024

@abhinavkulkarni
Thank you for providing the code and assistance! I used the code you provided to generate a file called entity_token_ids_128.t7, which contains the entity token ids. Next, I should use generate_candidates.py to generate embeddings for the entities. Could you please advise me on how to set the parameters (especially --batch_size, --chunk_start and --chunk_end)?

I guess it may be as follows.

python generate_candidates.py --path_to_model_config models/biencoder_wiki_large.json --path_to_model models/biencoder_wiki_large.bin --entity_dict_path models/entity1.jsonl --encoding_save_file_dir models --saved_cand_ids models/entity_token_ids_128.t7 --batch_size 512 --chunk_start 0 --chunk_end 1000000
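If, as the flag names suggest, --chunk_start and --chunk_end select a half-open slice [start, end) of the entity list (an assumption, not confirmed anywhere in this thread), then a large entity file could be encoded in several runs. The hypothetical helper `chunk_ranges` below only computes the boundaries for such runs; it does not call the script.

```python
def chunk_ranges(num_entities, chunk_size):
    """Yield (start, end) pairs covering range(num_entities) in half-open chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, num_entities, chunk_size):
        # The last chunk is clipped to the number of entities.
        yield start, min(start + chunk_size, num_entities)


# Example: 2500 entities encoded in chunks of at most 1000.
ranges = list(chunk_ranges(num_entities=2500, chunk_size=1000))
for start, end in ranges:
    print(f"--chunk_start {start} --chunk_end {end}")
```

Under the same assumption, a single run with --chunk_start 0 and a --chunk_end at or beyond the number of entities would cover the whole file.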
