mandarjoshi90 / triviaqa
Code for the TriviaQA reading comprehension dataset
Home Page: http://nlp.cs.washington.edu/triviaqa/
License: Apache License 2.0
I found that files in triviaqa/utils
have import errors:
ModuleNotFoundError: No module named 'utils.utils'; 'utils' is not a package
In dataset_utils.py
the import should be:
import utils
instead of:
import utils.utils
import utils
convert_to_squad_format.py
also has many such errors:
def get_text(qad, domain):
...
return utils.utils.get_file_contents(local_file, encoding='utf-8')
The correct way would be to call get_file_contents as:
def get_text(qad, domain):
...
return utils.get_file_contents(local_file, encoding='utf-8')
After correcting dataset_utils.py
as shown above and moving the convert_to_squad_format.py
file from /triviaqa/utils
to the main /triviaqa directory,
the error message disappears.
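The behaviour described above can be reproduced in isolation. The sketch below is only an illustration and does not use the repository's code: it builds a throwaway folder named demo_utils (a hypothetical stand-in for utils) containing demo_utils.py, then imports the dotted name twice, once with the parent directory on sys.path and once with the folder itself on sys.path. It assumes Python 3, where a folder without __init__.py can still be imported as a namespace package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

NAME = "demo_utils"  # hypothetical stand-in for the repo's "utils"


def build_layout(root):
    """Create <root>/demo_utils/demo_utils.py, mirroring utils/utils.py."""
    folder = Path(root) / NAME
    folder.mkdir()
    (folder / f"{NAME}.py").write_text("def get_file_contents():\n    return 'ok'\n")
    return folder


def try_import(search_dir, dotted_name):
    """Import dotted_name with search_dir first on sys.path.

    Returns None on success, or the error message on failure.
    """
    # Forget earlier imports so every call resolves the name from scratch.
    for key in [k for k in sys.modules if k == NAME or k.startswith(NAME + ".")]:
        del sys.modules[key]
    importlib.invalidate_caches()
    sys.path.insert(0, str(search_dir))
    try:
        importlib.import_module(dotted_name)
        return None
    except ModuleNotFoundError as exc:
        return str(exc)
    finally:
        sys.path.remove(str(search_dir))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        folder = build_layout(root)
        # Parent directory on the path: the folder acts as a package.
        print(try_import(root, f"{NAME}.{NAME}"))
        # The folder itself on the path: the name resolves to the .py file.
        print(try_import(folder, f"{NAME}.{NAME}"))
```

If this matches what happens in the repository, the dotted import only fails when a script is started from inside the utils folder, which would explain why moving the script to the top-level directory makes the error go away.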
Does the Apache 2.0 license also apply to the data, or just the code? If not, what is the dataset's license? Thanks
Hi,
I tried the command given by mandarjoshi90 in #3 and got the error below:
python -m utils.convert_to_squad_format --triviaqa_file triviaqa/qa/wikipedia-dev.json --squad_file squad --wikipedia_dir triviaqa/evidence/wikipedia/ --web_dir triviaqa/evidence/web/
2%|█▎ | 230/14229 [00:01<01:29, 156.84it/s]
Traceback (most recent call last):
File "C:\Users\Vamshi\Anaconda3\envs\PythonGPU\lib\runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "C:\Users\Vamshi\Anaconda3\envs\PythonGPU\lib\runpy.py", line 85, in run_code
exec(code, run_globals)
File "C:\Users\Vamshi_Workspace\triviaqa-master\triviaqa-master\utils\convert_to_squad_format.py", line 110, in
convert_to_squad_format(args.triviaqa_file, args.squad_file)
File "C:\Users\Vamshi_Workspace_\triviaqa-master\triviaqa-master\utils\convert_to_squad_format.py", line 67, in convert_to_squad_format
text = get_text(qad, qad['Source'])
File "C:\Users\Vamshi_Workspace_\triviaqa-master\triviaqa-master\utils\convert_to_squad_format.py", line 12, in get_text
return utils.utils.get_file_contents(local_file, encoding='utf-8')
File "C:\Users\Vamshi_Workspace_\triviaqa-master\triviaqa-master\utils\utils.py", line 10, in get_file_contents
with open(filename, encoding=encoding) as f:
OSError: [Errno 22] Invalid argument: "triviaqa/evidence/wikipedia/Who's_on_First?.txt"
Can someone help me resolve it?
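One likely reading of this error is that the file name contains a question mark, which Windows does not allow in file names, so the evidence file cannot be opened under that name. The sketch below is not part of the repository; the helper name reserved_chars_in is hypothetical. It simply lists which reserved characters a given file name contains.

```python
# Characters that Windows does not allow in file names.
WINDOWS_RESERVED_CHARS = '<>:"/\\|?*'


def reserved_chars_in(filename):
    """Return the reserved characters found in a single file name (not a full path)."""
    return sorted(set(filename) & set(WINDOWS_RESERVED_CHARS))


if __name__ == "__main__":
    print(reserved_chars_in("Who's_on_First?.txt"))  # ['?']
    print(reserved_chars_in("White.txt"))            # []
```

Running a check like this over the file names referenced in the question file would show how many evidence documents are affected on a Windows machine.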
I am trying to fine-tune a RoBERTa (or BERT) model on TriviaQA, using the question-answering example from Hugging Face transformers. Before training, I converted TriviaQA to SQuAD format using convert_to_squad_format.py
When I ran the fine-tuning, there was an error while loading the data, as follows:
File "run_squad.py", line 826, in
main()
File "run_squad.py", line 768, in main
train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
File "run_squad.py", line 447, in load_and_cache_examples
examples = processor.get_train_examples(args.data_dir, filename=args.train_file)
File "/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py", line 578, in get_train_examples
return self._create_examples(input_data, "train")
File "/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py", line 605, in _create_examples
title = entry["title"]
KeyError: 'title'
It seems the conversion leaves out part of the data that the SQuAD format requires.
Could you comment on what may be wrong or how to handle this issue?
Here are the run-time parameters for run_squad:
python run_squad.py \
--model_type roberta \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/squad-wikipedia-train-4096.json \
--predict_file $SQUAD_DIR/squad-wikipedia-dev-4096.json \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 4096 \
--doc_stride 128 \
--output_dir ro_tri_st_debug_squad/ \
--fp16 \
--per_gpu_eval_batch_size 1 \
--per_gpu_train_batch_size 1 \
--gradient_accumulation_steps 8
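The traceback shows the processor reading entry["title"] for each entry, so one possible workaround is to add a placeholder title to every entry of the converted file before training. The sketch below is not from the repository or from transformers; the function names and the empty default title are assumptions, and it assumes the converted JSON keeps its entries in a top-level "data" list, as SQuAD files do.

```python
import json


def add_missing_titles(squad, default_title=""):
    """Give every entry in squad["data"] a "title" key if it lacks one."""
    for entry in squad.get("data", []):
        entry.setdefault("title", default_title)
    return squad


def patch_file(in_path, out_path):
    """Read a converted file, add placeholder titles, and write the result."""
    with open(in_path, encoding="utf-8") as f:
        squad = json.load(f)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(add_missing_titles(squad), f)
```

This only addresses the KeyError shown above; the traceback does not tell whether other fields expected by the processor are missing as well.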
I tried to extract the dataset ("rc" version) that I downloaded from http://nlp.cs.washington.edu/triviaqa/ , but the downloaded file appears to be corrupted and cannot be extracted.
Below is the log:
$ wget http://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz
--2022-05-26 23:42:05-- http://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz
Resolving nlp.cs.washington.edu (nlp.cs.washington.edu)... 128.208.3.120, 2607:4000:200:12::78
Connecting to nlp.cs.washington.edu (nlp.cs.washington.edu)|128.208.3.120|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2665779500 (2.5G) [application/x-gzip]
Saving to: ‘triviaqa-rc.tar.gz’
12% [==============> ] 341,223,498 117KB/s in 5m 10s
2022-05-26 23:47:22 (1.05 MB/s) - Connection closed at byte 341223498. Retrying.
--2022-05-26 23:47:23-- (try: 2) http://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz
Connecting to nlp.cs.washington.edu (nlp.cs.washington.edu)|128.208.3.120|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 2665779500 (2.5G), 2324556002 (2.2G) remaining [application/x-gzip]
Saving to: ‘triviaqa-rc.tar.gz’
100%[+++++++++++++++=======================================================================================================>] 2,665,779,500 12.2MB/s in 3m 46s
2022-05-26 23:51:11 (9.82 MB/s) - ‘triviaqa-rc.tar.gz’ saved [2665779500/2665779500]
$ tar xf triviaqa-rc.tar.gz
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
I also downloaded it directly through the web browser, but I still cannot extract the dataset.
Is the dataset hosted at http://nlp.cs.washington.edu/triviaqa/ currently valid?
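To tell a damaged download apart from a problem with the local tar program, the archive can be read end to end without extracting it. The sketch below is only a rough check under stated assumptions: archive_is_readable is a hypothetical helper, it relies on Python's standard tarfile module, and it reports whether every regular file in the archive can be read, not whether the contents match what the server intended to send.

```python
import tarfile
import zlib


def archive_is_readable(path, chunk_size=1 << 20):
    """Return True if every regular file in the tar archive can be read in full."""
    try:
        # "r:*" lets tarfile detect the compression (gzip, bz2, xz or none).
        with tarfile.open(path, mode="r:*") as tar:
            for member in tar:
                if not member.isfile():
                    continue
                stream = tar.extractfile(member)
                # Read and discard the data so truncation or corruption surfaces.
                while stream.read(chunk_size):
                    pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        return False
```

If this returns False for a file whose size matches the server's reported length, downloading again in a single uninterrupted transfer may be worth trying, since the log above shows the first attempt was cut off and resumed.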
Hi,
I was trying to download the unfiltered questions from the website.
I ran the command
wget http://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz
When I then tried to extract it using the command
tar xvzf triviaqa-unfiltered.tar.gz
It gave me the following error:
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Extraction with gunzip
failed in the same way. Could you say what compression or file type this file actually uses?
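The message "not in gzip format" means the file does not begin with the gzip signature. A quick way to confirm that, sketched below with a hypothetical helper name, is to compare the first two bytes of the file with the gzip magic number 0x1f 0x8b.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

If the check fails, printing the first few hundred bytes of the file can show what was actually saved, for example an error page instead of the archive; that possibility is a guess and is not confirmed by the report above.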
Hi! I want to convert the dataset to SQuAD format, but it is not clear to me which files I need to use. I tried the command below, but got an error.
root@843d9505135f:/home/trivia/utils# python convert_to_squad_format.py --triviaqa_file dataset/wikipedia-dev.json --squad_file squad --wikipedia_dir dataset/wikipdia-dev.json --web_dir dataset/web-dev.json
0%| | 0/14229 [00:00<?, ?it/s]
Traceback (most recent call last):
File "convert_to_squad_format.py", line 110, in
convert_to_squad_format(args.triviaqa_file, args.squad_file)
File "convert_to_squad_format.py", line 67, in convert_to_squad_format
text = get_text(qad, qad['Source'])
File "convert_to_squad_format.py", line 12, in get_text
return utils.get_file_contents(local_file, encoding='utf-8')
File "/home/trivia/utils/utils.py", line 10, in get_file_contents
with open(filename, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/wikipdia-dev.json/White.txt'
Could you explain in more detail how to do the conversion?
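The path in the error, 'dataset/wikipdia-dev.json/White.txt', suggests that the script appends an evidence file name to whatever is passed as --wikipedia_dir, so that option appears to expect a directory of .txt files rather than a JSON file. The command in the earlier issue on this page passes triviaqa/evidence/wikipedia/ and triviaqa/evidence/web/ for these two options. The sketch below is a hypothetical pre-flight check, not part of the repository, that reports any option value that is not a directory.

```python
import os


def check_evidence_dirs(wikipedia_dir, web_dir):
    """Return a list of messages for evidence options that are not directories."""
    problems = []
    for flag, path in (("--wikipedia_dir", wikipedia_dir), ("--web_dir", web_dir)):
        if not os.path.isdir(path):
            problems.append(f"{flag} should be a directory of .txt files, got: {path}")
    return problems


if __name__ == "__main__":
    # Values taken from the failing command above.
    for message in check_evidence_dirs("dataset/wikipdia-dev.json", "dataset/web-dev.json"):
        print(message)
```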
Can you please add an open source license (preferably Apache 2.0)? This is mainly important for the evaluation script.
The help text for the sample_size argument has an error: it states the same thing as the help for the random seed.
Also, what is sample_size used for?