Comments (6)
Thanks to @ledw-2 and others from other issues, I was able to recreate the embeddings for existing entities (in entity.jsonl) using their Wikipedia title and description, and I verified that they match those in all_entities.t7 up to the 6th decimal place.
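A check of this kind can be sketched as follows. This is a minimal illustration with toy tensors standing in for the recomputed embeddings and the matching rows of all_entities.t7; the helper name and the tolerance of 1e-6 are assumptions chosen to mirror "up to the 6th decimal place", not code from the repository.

```python
import torch


def matches_to_six_decimals(recomputed: torch.Tensor, reference: torch.Tensor) -> bool:
    """Return True if every element differs by at most 1e-6 in absolute value."""
    # rtol=0.0 makes this a purely absolute comparison.
    return torch.allclose(recomputed, reference, rtol=0.0, atol=1e-6)


# Toy stand-ins for real embeddings (hypothetical values).
reference = torch.tensor([[0.1234567, -0.7654321], [0.5, 0.25]])
close_copy = reference + 3e-7  # differs only beyond the 6th decimal place
far_copy = reference + 1e-4    # differs at the 4th decimal place

print(matches_to_six_decimals(close_copy, reference))  # True
print(matches_to_six_decimals(far_copy, reference))    # False
```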
Given a new entity title and its description, here's how to generate its embeddings:
import json

import torch
from blink.biencoder.biencoder import load_biencoder
from blink.biencoder.data_process import get_candidate_representation

# Load biencoder model and biencoder params just like in main_dense.py
# (`args` is the parsed command-line namespace used there)
with open(args.biencoder_config) as json_file:
    biencoder_params = json.load(json_file)
    biencoder_params["path_to_model"] = args.biencoder_model
biencoder = load_biencoder(biencoder_params)

# Read 10 entities from entity.jsonl
entities = []
count = 10
with open('./models/entity.jsonl') as f:
    for i, line in enumerate(f):
        entity = json.loads(line)
        entities.append(entity)
        if i == count - 1:
            break

# Get token_ids corresponding to candidate title and description
tokenizer = biencoder.tokenizer
max_context_length, max_cand_length = biencoder_params["max_context_length"], biencoder_params["max_cand_length"]
max_seq_length = max_cand_length
ids = []

for entity in entities:
    candidate_desc = entity['text']
    candidate_title = entity['title']
    cand_tokens = get_candidate_representation(
        candidate_desc,
        tokenizer,
        max_seq_length,
        candidate_title=candidate_title
    )

    token_ids = cand_tokens["ids"]
    ids.append(token_ids)

# One row of token ids per entity
ids = torch.tensor(ids)
# `path` is the output file for the token ids; set it to a location of your choice
torch.save(ids, path)
The file in which these ids are saved should be passed via the --saved_cand_ids param of scripts/generate_candidates.py.
Thanks to the FB team for this awesome project!
from blink.
@abhinavkulkarni Thanks for the comments! I hope you find this project useful.
It seems that we have to update and regenerate the whole entity.jsonl file in order to get the .t7 file.
@amelieyu1989: No. If entity.jsonl has N entities, then the all_entities.t7 file is a torch tensor with N rows. So you can add the additional entity to the entity.jsonl file, load the torch matrix, append a row, and resave it.
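The load, append, and resave step described above could look roughly like this. It is a sketch under assumptions: the helper name `append_rows` and the file names are hypothetical, the toy tensors stand in for the real encodings, and it presumes the new entity is appended as the last line of entity.jsonl so that row order keeps matching.

```python
import os
import tempfile

import torch


def append_rows(saved_path: str, new_rows: torch.Tensor, out_path: str) -> torch.Tensor:
    """Load a saved 2-D tensor, append new_rows at the bottom, and save the result."""
    old = torch.load(saved_path, map_location="cpu")
    if old.dim() != 2 or new_rows.dim() != 2 or old.shape[1] != new_rows.shape[1]:
        raise ValueError("expected 2-D tensors with the same number of columns")
    # Concatenate along dim 0 so each entity stays one row.
    combined = torch.cat((old, new_rows.to(old.dtype)), dim=0)
    torch.save(combined, out_path)
    return combined


# Toy stand-ins: 3 existing entities and 1 new one, 4 dimensions each.
tmp_dir = tempfile.mkdtemp()
old_path = os.path.join(tmp_dir, "old_encodings.t7")  # hypothetical file name
new_path = os.path.join(tmp_dir, "new_encodings.t7")  # hypothetical file name
torch.save(torch.zeros(3, 4), old_path)

combined = append_rows(old_path, torch.ones(1, 4), new_path)
reloaded = torch.load(new_path, map_location="cpu")
print(tuple(reloaded.shape))  # (4, 4)
```

Writing to a separate output file keeps the original encodings untouched until the combined tensor has been checked.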
I see. You mean I could get my new_encode_list = torch.cat((old_encode_list, new_entities_tokens))?
Could you share the code if possible?
@abhinavkulkarni
Thank you for providing the code and assistance! I used the code you provided to generate a file called entity_token_ids_128.t7, which contains entity representations. Next, I should use the generate_candidates.py file to generate embeddings for the entities. Could you please advise me on how to set the parameters (especially --batch_size, --chunk_start and --chunk_end)?
I guess it may be as follows:
python generate_candidates.py --path_to_model_config models/biencoder_wiki_large.json --path_to_model models/biencoder_wiki_large.bin --entity_dict_path models/entity1.jsonl --encoding_save_file_dir models --saved_cand_ids models/entity_token_ids_128.t7 --batch_size 512 --chunk_start 0 --chunk_end 1000000
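If the entity set is too large to encode in one run, one way to pick the chunk bounds is sketched below. It rests on an unverified assumption: that --chunk_start and --chunk_end select a half-open range of rows [start, end) from the saved token ids. The helper `chunk_ranges` and the numbers are hypothetical and only illustrate the arithmetic.

```python
def chunk_ranges(num_rows: int, chunk_size: int) -> list:
    """Split num_rows into consecutive half-open (start, end) ranges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_rows))
        for start in range(0, num_rows, chunk_size)
    ]


# Hypothetical example: 2,500,000 entities encoded in chunks of 1,000,000.
ranges = chunk_ranges(2_500_000, 1_000_000)
for start, end in ranges:
    # Assumed meaning of the flags: encode rows start..end-1 in this run.
    print(f"--chunk_start {start} --chunk_end {end}")
```

Under that same assumption, a single run with --chunk_start 0 and --chunk_end at least as large as the number of entities would cover everything at once.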