mandarjoshi90 / triviaqa

Code for the TriviaQA reading comprehension dataset

Home Page: http://nlp.cs.washington.edu/triviaqa/

License: Apache License 2.0

Python 100.00%

nlp question-answering reading-comprehension machine-reading acl2017 triviaqa

triviaqa's Issues

Wrong imports in triviaqa/utils

Files in triviaqa/utils contain imports that fail with:
ModuleNotFoundError: No module named 'utils.utils'; 'utils' is not a package

In dataset_utils.py, the import should be:

import utils

instead of:

import utils.utils
import utils

convert_to_squad_format.py also contains many such errors, for example:

def get_text(qad, domain):
    ...
    return utils.utils.get_file_contents(local_file, encoding='utf-8')

The correct way would be to call get_file_contents as:

def get_text(qad, domain):
    ...
    return utils.get_file_contents(local_file, encoding='utf-8')

After correcting dataset_utils.py as shown above and moving convert_to_squad_format.py from /triviaqa/utils to the top-level /triviaqa directory, the error message disappears.
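The "'utils' is not a package" message is what Python reports when a script inside utils/ is run directly, so that the name utils resolves to utils.py instead of the utils/ directory. The following is a minimal sketch that reproduces this in a throwaway directory. The file script.py, the stub body of get_file_contents, and the presence of an __init__.py are assumptions made for illustration, not the repository's actual files.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def reproduce_import_error():
    """Build a throwaway utils/ package and run a script in it two ways."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        pkg = root / "utils"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        # Hypothetical stand-in for utils/utils.py.
        (pkg / "utils.py").write_text(
            "def get_file_contents():\n    return 'ok'\n"
        )
        # Hypothetical stand-in for a script that lives inside utils/.
        (pkg / "script.py").write_text(
            "import utils.utils\nprint(utils.utils.get_file_contents())\n"
        )
        # Run as a module from the parent directory: 'utils' is the package.
        as_module = subprocess.run(
            [sys.executable, "-m", "utils.script"],
            cwd=str(root), capture_output=True, text=True,
        )
        # Run as a plain script from inside utils/: 'utils' is now utils.py.
        as_script = subprocess.run(
            [sys.executable, "script.py"],
            cwd=str(pkg), capture_output=True, text=True,
        )
        return as_module, as_script


if __name__ == "__main__":
    ok, failed = reproduce_import_error()
    print("as module:", ok.returncode)
    print("as script:", failed.returncode)
    print(failed.stderr)
```

If this matches the situation in the issue, running the script from the parent directory with python -m may avoid the error without editing the imports.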

Dataset license

Does the Apache 2.0 license also apply to the data, or just the code? If not, what is the dataset's license? Thanks

#4 (comment)

Error when converting to SQuAD format

Hi,

I tried the command given by mandarjoshi90 in #3 and got the error below:

python -m utils.convert_to_squad_format --triviaqa_file triviaqa/qa/wikipedia-dev.json --squad_file squad --wikipedia_dir triviaqa/evidence/wikipedia/ --web_dir triviaqa/evidence/web/
2%|█▎ | 230/14229 [00:01<01:29, 156.84it/s]
Traceback (most recent call last):
File "C:\Users\Vamshi\Anaconda3\envs\PythonGPU\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\Vamshi\Anaconda3\envs\PythonGPU\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\Vamshi_Workspace\triviaqa-master\triviaqa-master\utils\convert_to_squad_format.py", line 110, in <module>
convert_to_squad_format(args.triviaqa_file, args.squad_file)
File "C:\Users\Vamshi_Workspace_\triviaqa-master\triviaqa-master\utils\convert_to_squad_format.py", line 67, in convert_to_squad_format
text = get_text(qad, qad['Source'])
File "C:\Users\Vamshi_Workspace_\triviaqa-master\triviaqa-master\utils\convert_to_squad_format.py", line 12, in get_text
return utils.utils.get_file_contents(local_file, encoding='utf-8')
File "C:\Users\Vamshi_Workspace_\triviaqa-master\triviaqa-master\utils\utils.py", line 10, in get_file_contents
with open(filename, encoding=encoding) as f:
OSError: [Errno 22] Invalid argument: "triviaqa/evidence/wikipedia/Who's_on_First?.txt"

Can someone help me resolve it?
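The file name in the error contains a question mark, which Windows does not allow in file names, and that is a likely cause of the Errno 22 above. The sketch below only detects and replaces such characters. The helper names are hypothetical, and replacing the character helps only if the evidence files on disk were extracted under the same substituted names, which is an assumption.

```python
# Characters that Windows does not allow in file names.
WINDOWS_RESERVED_CHARS = set('<>:"/\\|?*')


def find_reserved_chars(filename):
    """Return the sorted reserved characters that occur in a file name."""
    return sorted({ch for ch in filename if ch in WINDOWS_RESERVED_CHARS})


def windows_safe_name(filename, replacement="_"):
    """Replace every reserved character in a file name (not a full path)."""
    return "".join(
        replacement if ch in WINDOWS_RESERVED_CHARS else ch for ch in filename
    )


if __name__ == "__main__":
    name = "Who's_on_First?.txt"
    print(find_reserved_chars(name))
    print(windows_safe_name(name))
```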

Fine-tuning on TriviaQA using RoBERTa/BERT

I am trying to fine-tune a RoBERTa (or BERT) model on TriviaQA, using the question-answering example from Hugging Face transformers. Before training, I converted TriviaQA to SQuAD format with convert_to_squad_format.py.

When I ran the fine-tuning, loading the data failed with the following error:

File "run_squad.py", line 826, in <module>
main()
File "run_squad.py", line 768, in main
train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
File "run_squad.py", line 447, in load_and_cache_examples
examples = processor.get_train_examples(args.data_dir, filename=args.train_file)
File "/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py", line 578, in get_train_examples
return self._create_examples(input_data, "train")
File "/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py", line 605, in _create_examples
title = entry["title"]
KeyError: 'title'

It seems the conversion leaves out part of the data required by the SQuAD format.

Could you comment on what may be wrong or how to handle this issue?

Here are the run-time parameters for run_squad:
python run_squad.py \
--model_type roberta \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/squad-wikipedia-train-4096.json \
--predict_file $SQUAD_DIR/squad-wikipedia-dev-4096.json \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 4096 \
--doc_stride 128 \
--output_dir ro_tri_st_debug_squad/ \
--fp16 \
--per_gpu_eval_batch_size 1 \
--per_gpu_train_batch_size 1 \
--gradient_accumulation_steps 8
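The traceback shows the processor reading entry["title"], so one possible workaround is to add a placeholder title to every entry of the converted file before training. This is a rough sketch, assuming the converted file uses the usual SQuAD layout with a top-level "data" list and that an empty title is acceptable to the training script. The function names are hypothetical.

```python
import json


def add_missing_titles(squad_data, default_title=""):
    """Add a "title" key to every entry that lacks one; return how many were added."""
    added = 0
    for entry in squad_data.get("data", []):
        if "title" not in entry:
            entry["title"] = default_title
            added += 1
    return added


def patch_file(in_path, out_path):
    """Read a SQuAD-style JSON file, add placeholder titles, and write a copy."""
    with open(in_path, encoding="utf-8") as f:
        data = json.load(f)
    added = add_missing_titles(data)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return added
```

Writing to a separate output path keeps the original converted file untouched in case the assumption about its layout is wrong.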

Unable to extract downloaded dataset

I tried to extract the dataset ("rc" version) that I downloaded from http://nlp.cs.washington.edu/triviaqa/, but the downloaded file appears to be corrupted and cannot be extracted.

Below is the log:

$ wget http://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz
--2022-05-26 23:42:05--  http://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz
Resolving nlp.cs.washington.edu (nlp.cs.washington.edu)... 128.208.3.120, 2607:4000:200:12::78
Connecting to nlp.cs.washington.edu (nlp.cs.washington.edu)|128.208.3.120|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2665779500 (2.5G) [application/x-gzip]
Saving to: ‘triviaqa-rc.tar.gz’
12% [==============>                                                                                                          ] 341,223,498  117KB/s   in 5m 10s 
2022-05-26 23:47:22 (1.05 MB/s) - Connection closed at byte 341223498. Retrying.
--2022-05-26 23:47:23--  (try: 2)  http://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz
Connecting to nlp.cs.washington.edu (nlp.cs.washington.edu)|128.208.3.120|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 2665779500 (2.5G), 2324556002 (2.2G) remaining [application/x-gzip]
Saving to: ‘triviaqa-rc.tar.gz’
100%[+++++++++++++++=======================================================================================================>] 2,665,779,500 12.2MB/s   in 3m 46s 
2022-05-26 23:51:11 (9.82 MB/s) - ‘triviaqa-rc.tar.gz’ saved [2665779500/2665779500]

$ tar xf triviaqa-rc.tar.gz
tar: Skipping to next header
tar: Exiting with failure status due to previous errors

I also downloaded the file directly from the website, but I still cannot extract it.
Is the dataset hosted at http://nlp.cs.washington.edu/triviaqa/ still valid?
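Before retrying the extraction, it may help to check whether the compressed stream itself is damaged. The log does not show where the damage comes from, so this is only a diagnostic sketch: it reads the whole gzip stream, which makes Python verify the trailing checksum, similar in spirit to gzip -t. The function name is hypothetical.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as f:
            # Read to the end so the CRC and length trailer get checked.
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers bad headers and checksums, EOFError covers truncation,
        # zlib.error covers corrupted compressed data.
        return False


if __name__ == "__main__":
    print(gzip_is_intact("triviaqa-rc.tar.gz"))
```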

Unable to extract the data from the website

Hi,
I was trying to download the unfiltered questions from the website.
I ran the command
wget http://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz
When I then tried to extract it with the command
tar xvzf triviaqa-unfiltered.tar.gz
I got the following error:

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Extraction with gunzip failed in the same way. Could you tell me the compression type or the correct file type of this file?
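A "not in gzip format" message means the file does not start with the gzip signature. A small sketch for checking the first bytes of the file, which can show whether it is, for example, an HTML page or an uncompressed tar archive; the function name is hypothetical.

```python
GZIP_MAGIC = b"\x1f\x8b"  # the two bytes that open every gzip file


def sniff_file(path, n=64):
    """Return (starts_with_gzip_magic, first_n_bytes) for a file."""
    with open(path, "rb") as f:
        head = f.read(n)
    return head[:2] == GZIP_MAGIC, head


if __name__ == "__main__":
    is_gzip, head = sniff_file("triviaqa-unfiltered.tar.gz")
    print(is_gzip)
    print(head)
```

If the first bytes turn out to be readable text, the download most likely saved something other than the archive.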

How to convert TriviaQA to SQuAD format?

Hi! I want to convert the dataset to SQuAD format, but it is not clear to me which files I need to use. I tried the command below, but got an error.

root@843d9505135f:/home/trivia/utils# python convert_to_squad_format.py --triviaqa_file dataset/wikipedia-dev.json --squad_file squad --wikipedia_dir dataset/wikipdia-dev.json --web_dir dataset/web-dev.json
0%| | 0/14229 [00:00<?, ?it/s]
Traceback (most recent call last):
File "convert_to_squad_format.py", line 110, in <module>
convert_to_squad_format(args.triviaqa_file, args.squad_file)
File "convert_to_squad_format.py", line 67, in convert_to_squad_format
text = get_text(qad, qad['Source'])
File "convert_to_squad_format.py", line 12, in get_text
return utils.get_file_contents(local_file, encoding='utf-8')
File "/home/trivia/utils/utils.py", line 10, in get_file_contents
with open(filename, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/wikipdia-dev.json/White.txt'

Could you explain in more detail how to do the conversion?
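The failing path 'dataset/wikipdia-dev.json/White.txt' shows that the value of --wikipedia_dir is joined with an evidence file name, so that argument apparently needs to be a directory of evidence text files and not a JSON file. Here is a small pre-flight check along those lines, as a sketch. The function name is hypothetical, and treating --web_dir the same way is an assumption based on its name.

```python
import os


def check_conversion_paths(triviaqa_file, wikipedia_dir, web_dir):
    """Return a list of problems found in the converter's path arguments."""
    problems = []
    if not os.path.isfile(triviaqa_file):
        problems.append(f"--triviaqa_file is not a file: {triviaqa_file}")
    for flag, path in (("--wikipedia_dir", wikipedia_dir), ("--web_dir", web_dir)):
        if not os.path.isdir(path):
            problems.append(f"{flag} is not a directory: {path}")
    return problems


if __name__ == "__main__":
    for problem in check_conversion_paths(
        "dataset/wikipedia-dev.json",
        "dataset/wikipdia-dev.json",
        "dataset/web-dev.json",
    ):
        print(problem)
```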

Open Source License

Can you please add an open source license (preferably Apache 2.0)? This is mainly important for the evaluation script.

convert_to_squad_format sample_size purpose

The help text for this argument is wrong: it says the same thing as the one for the random seed.
Also, what is sample_size used for?
