Comments (3)
Leaving a comment since it might be helpful for someone in the future who wants to create their own trie from the current Wikipedia version.
I used wikimapper to get all Wikipedia titles, since that is easier than parsing the XML dump. Then I used an SQL query to get all titles and redirects. As far as I can tell, you only need to filter out disambiguation pages (titles containing "(disambiguation)") and subsections (titles containing "/").
select wikipedia_title from
where wikidata_id is not NULL AND
wikipedia_title is not NULL AND
wikipedia_title != "" AND
wikipedia_title NOT LIKE '%(disambiguation)%' AND
wikipedia_title NOT LIKE '%/%'
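For anyone who wants to try this, here is a minimal sketch of how the query might be run from Python with the standard `sqlite3` module. It assumes the wikimapper index is a SQLite file and that its table is named `mapping`; the table name is missing from the query as posted, so treat it as an assumption and check your own index. `load_titles`, `demo_connection` and the toy rows are hypothetical and only there to make the example runnable.

```python
import sqlite3

# Assumption: the table is called `mapping` (the posted query omits the table name).
QUERY = """
SELECT wikipedia_title FROM mapping
WHERE wikidata_id IS NOT NULL
  AND wikipedia_title IS NOT NULL
  AND wikipedia_title != ''
  AND wikipedia_title NOT LIKE '%(disambiguation)%'
  AND wikipedia_title NOT LIKE '%/%'
"""


def load_titles(conn):
    """Return the filtered titles as a list of strings."""
    return [row[0] for row in conn.execute(QUERY)]


def demo_connection():
    """Build a tiny in-memory table with the assumed schema, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mapping "
        "(wikipedia_id INTEGER, wikipedia_title TEXT, wikidata_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO mapping VALUES (?, ?, ?)",
        [
            (1, "Berlin", "Q64"),                    # kept
            (2, "Mercury_(disambiguation)", "Q1"),   # dropped: disambiguation page
            (3, "Foo/Bar", "Q2"),                    # dropped: contains "/"
            (4, "No_item", None),                    # dropped: no Wikidata id
            (5, "", "Q3"),                           # dropped: empty title
        ],
    )
    return conn


titles = load_titles(demo_connection())
print(titles)
```

With a real index you would pass `sqlite3.connect(path_to_your_index)` to `load_titles` instead of the demo connection.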
The code by HuiBinR works fine for creating the Trie.
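To illustrate the idea for readers who have not seen that code, here is a rough sketch of a prefix tree over token-id sequences. It is neither HuiBinR's code nor the trie class shipped with GENRE; `SimpleTrie` and `toy_encode` are hypothetical names, and `toy_encode` is a stand-in for a real tokenizer (one id per character, with 0 and 1 as made-up start and end markers).

```python
class SimpleTrie:
    """Nested-dict prefix tree over sequences of integer token ids."""

    def __init__(self, sequences=()):
        self.root = {}
        for seq in sequences:
            self.add(seq)

    def add(self, seq):
        """Insert one token-id sequence."""
        node = self.root
        for tok in seq:
            node = node.setdefault(tok, {})

    def get(self, prefix):
        """Return the token ids that may follow `prefix` (empty list if none)."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node)


def toy_encode(title):
    # Hypothetical tokenizer stand-in: 0 = start marker, 1 = end marker.
    return [0] + [ord(c) for c in title] + [1]


trie = SimpleTrie(toy_encode(t) for t in ["Berlin", "Bern"])

# After the prefix "Ber", only "l" (Berlin) or "n" (Bern) can follow.
allowed = trie.get(toy_encode("Ber")[:-1])
print([chr(t) for t in allowed])
```

During constrained decoding, a lookup like `trie.get(prefix)` is what restricts the next token to continuations that still lead to a valid title.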
from genre.
I used no filtering when creating the KILT trie, but I built it from the KILT knowledge source (http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json), not from the Wikipedia dump directly. Your code is probably extracting many other page titles (maybe from special or deprecated pages) that should not be there. Indeed, 14M titles is too many: as you can see from my screenshot of https://www.wikipedia.org, as of today there are 6.2M pages, so less than half of what you are extracting.
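As a rough sketch of what collecting titles from the knowledge source could look like, assuming the file holds one JSON object per line with a `wikipedia_title` field (worth checking against the KILT documentation): `iter_titles` is a hypothetical helper, and the two sample records are made up so the snippet runs without the full download.

```python
import io
import json


def iter_titles(lines):
    """Yield the `wikipedia_title` of each JSON-lines record, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        page = json.loads(line)
        title = page.get("wikipedia_title")  # assumed field name
        if title:
            yield title


# Made-up sample standing in for the real (very large) file.
sample = io.StringIO(
    '{"wikipedia_id": "1", "wikipedia_title": "Berlin"}\n'
    "\n"
    '{"wikipedia_id": "2", "wikipedia_title": "Bern"}\n'
)
kilt_titles = list(iter_titles(sample))
print(kilt_titles)
```

For the real file, pass an open file handle (for example `open("kilt_knowledgesource.json", encoding="utf-8")`) so the records are streamed line by line rather than loaded into memory at once.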
Hi @bablf! It's very kind of you to share the SQL for filtering KILT titles. I am wondering if you are doing the same for entity linking usage.