
triviaqa's People

Contributors

mandarjoshi90

triviaqa's Issues

Wrong imports in triviaqa/utils

The files in triviaqa/utils contain imports that fail with:
ModuleNotFoundError: No module named 'utils.utils'; 'utils' is not a package

In dataset_utils.py, the import should be:

import utils

Instead of:

import utils.utils
import utils

convert_to_squad_format.py also contains many such errors, for example:

def get_text(qad, domain):
    ...
    return utils.utils.get_file_contents(local_file, encoding='utf-8')

The correct way would be to call get_file_contents as:

def get_text(qad, domain):
    ...
    return utils.get_file_contents(local_file, encoding='utf-8')

After correcting dataset_utils.py as shown above and moving convert_to_squad_format.py from /triviaqa/utils to the top-level /triviaqa directory, the error message disappears.

Dataset license

Does the Apache 2.0 license also apply to the data, or just the code? If not, what is the dataset's license? Thanks

#4 (comment)

Error when converting to SQuAD format

Hi,

I tried the command that mandarjoshi90 gave in #3 and got the error below:

python -m utils.convert_to_squad_format --triviaqa_file triviaqa/qa/wikipedia-dev.json --squad_file squad --wikipedia_dir triviaqa/evidence/wikipedia/ --web_dir triviaqa/evidence/web/
2%|█▎ | 230/14229 [00:01<01:29, 156.84it/s]
Traceback (most recent call last):
File "C:\Users\Vamshi\Anaconda3\envs\PythonGPU\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\Vamshi\Anaconda3\envs\PythonGPU\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\Vamshi_Workspace\triviaqa-master\triviaqa-master\utils\convert_to_squad_format.py", line 110, in <module>
convert_to_squad_format(args.triviaqa_file, args.squad_file)
File "C:\Users\Vamshi_Workspace_\triviaqa-master\triviaqa-master\utils\convert_to_squad_format.py", line 67, in convert_to_squad_format
text = get_text(qad, qad['Source'])
File "C:\Users\Vamshi_Workspace_\triviaqa-master\triviaqa-master\utils\convert_to_squad_format.py", line 12, in get_text
return utils.utils.get_file_contents(local_file, encoding='utf-8')
File "C:\Users\Vamshi_Workspace_\triviaqa-master\triviaqa-master\utils\utils.py", line 10, in get_file_contents
with open(filename, encoding=encoding) as f:
OSError: [Errno 22] Invalid argument: "triviaqa/evidence/wikipedia/Who's_on_First?.txt"

Can someone help me resolve it?
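
One plausible cause, which the thread does not confirm, is that the evidence file name contains `?`, a character Windows does not allow in file names, so `open()` rejects the path with Errno 22. The sketch below only diagnoses such names. `windows_invalid_chars` and `windows_safe_name` are hypothetical helpers, not part of the triviaqa repository, and renaming files would also require the lookup in `get_text` to use the same mapping.

```python
# Characters that Windows does not allow in file names.
WINDOWS_RESERVED_CHARS = set('<>:"/\\|?*')


def windows_invalid_chars(filename):
    """Return the sorted reserved characters found in a bare file name (no directory part)."""
    return sorted(set(filename) & WINDOWS_RESERVED_CHARS)


def windows_safe_name(filename, replacement="_"):
    """Hypothetical helper: replace reserved characters so the name is valid on Windows."""
    return "".join(replacement if ch in WINDOWS_RESERVED_CHARS else ch for ch in filename)


# The file name from the traceback above.
print(windows_invalid_chars("Who's_on_First?.txt"))
print(windows_safe_name("Who's_on_First?.txt"))
```

Running the check over the `Filename` values in the question file before conversion would show how many evidence documents are affected on a Windows machine.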

Finetuning TriviaQA using RoBERTa/BERT

I am trying to fine-tune a RoBERTa (or BERT) model on TriviaQA using the question-answering example from Hugging Face transformers. Before training, I converted TriviaQA to SQuAD format with convert_to_squad_format.py.

When I ran the fine-tuning, loading the data failed with the following error:

File "run_squad.py", line 826, in <module>
main()
File "run_squad.py", line 768, in main
train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
File "run_squad.py", line 447, in load_and_cache_examples
examples = processor.get_train_examples(args.data_dir, filename=args.train_file)
File "/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py", line 578, in get_train_examples
return self._create_examples(input_data, "train")
File "/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py", line 605, in _create_examples
title = entry["title"]
KeyError: 'title'

It seems the conversion omits part of the data required by the SQuAD format.

Could you comment on what may be wrong or how to handle this issue?

Here are the run-time parameters for run_squad:
python run_squad.py \
--model_type roberta \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/squad-wikipedia-train-4096.json \
--predict_file $SQUAD_DIR/squad-wikipedia-dev-4096.json \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 4096 \
--doc_stride 128 \
--output_dir ro_tri_st_debug_squad/ \
--fp16 \
--per_gpu_eval_batch_size 1 \
--per_gpu_train_batch_size 1 \
--gradient_accumulation_steps 8
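
The traceback above fails at `entry["title"]`, which suggests each entry in the converted file's `data` list has no `title` key. A minimal workaround sketch follows. It assumes the converted file is a JSON object with a top-level `data` list. `add_missing_titles` and `patch_file` are hypothetical helpers, and an empty string is an arbitrary placeholder title. Whether the processor needs other fields that the converter also omits is not established here.

```python
import json


def add_missing_titles(squad, default_title=""):
    """Add a 'title' key to every entry of squad['data'] that lacks one.

    Returns the number of entries that were changed.
    """
    changed = 0
    for entry in squad.get("data", []):
        if "title" not in entry:
            entry["title"] = default_title
            changed += 1
    return changed


def patch_file(in_path, out_path):
    """Read a SQuAD-style JSON file, add missing titles, and write the result."""
    with open(in_path, encoding="utf-8") as f:
        squad = json.load(f)
    changed = add_missing_titles(squad)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(squad, f)
    return changed


# Tiny in-memory example shaped like the assumed converter output.
example = {"data": [{"paragraphs": [{"context": "c", "qas": []}]}], "version": "1.0"}
print(add_missing_titles(example))
```

Calling `patch_file` on the train and dev files before running run_squad.py would leave the originals untouched if a different output path is used.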

Unable to extract downloaded dataset

I tried to extract the dataset ("rc" version) that I downloaded from http://nlp.cs.washington.edu/triviaqa/, but the downloaded file appears to be corrupted and cannot be extracted.

Below is the log:

$ wget http://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz
--2022-05-26 23:42:05--  http://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz
Resolving nlp.cs.washington.edu (nlp.cs.washington.edu)... 128.208.3.120, 2607:4000:200:12::78
Connecting to nlp.cs.washington.edu (nlp.cs.washington.edu)|128.208.3.120|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2665779500 (2.5G) [application/x-gzip]
Saving to: ‘triviaqa-rc.tar.gz’
12% [==============>                                                                                                          ] 341,223,498  117KB/s   in 5m 10s 
2022-05-26 23:47:22 (1.05 MB/s) - Connection closed at byte 341223498. Retrying.
--2022-05-26 23:47:23--  (try: 2)  http://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz
Connecting to nlp.cs.washington.edu (nlp.cs.washington.edu)|128.208.3.120|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 2665779500 (2.5G), 2324556002 (2.2G) remaining [application/x-gzip]
Saving to: ‘triviaqa-rc.tar.gz’
100%[+++++++++++++++=======================================================================================================>] 2,665,779,500 12.2MB/s   in 3m 46s 
2022-05-26 23:51:11 (9.82 MB/s) - ‘triviaqa-rc.tar.gz’ saved [2665779500/2665779500]

$ tar xf triviaqa-rc.tar.gz
tar: Skipping to next header
tar: Exiting with failure status due to previous errors

I also downloaded the file directly from the website, but still could not extract it.
Is the dataset hosted at http://nlp.cs.washington.edu/triviaqa/ still valid?
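
The log above shows the transfer was interrupted and then resumed, so it may be worth checking the archive's integrity before extraction. The sketch below decompresses the whole gzip stream and reports whether it completes without error. It says nothing about why a file is damaged, and `gzip_is_intact` is a hypothetical helper name.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses without error.

    A missing file, a truncated stream, a bad checksum, or non-gzip data
    all make this return False.
    """
    try:
        with gzip.open(path, "rb") as f:
            # Read and discard the data chunk by chunk to keep memory use low.
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False
```

For example, `gzip_is_intact("triviaqa-rc.tar.gz")` returning False would point to a damaged download rather than a problem with the tar command.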

Unable to extract the data from the website

Hi,
I was trying to download the unfiltered questions from the website.
I ran the command
wget http://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz
When I then tried to extract it using the command
tar xvzf triviaqa-unfiltered.tar.gz
I got the following error:

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Extraction with gunzip failed similarly. Could you tell me the compression type or the correct file type of this file?
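
A "not in gzip format" message means the file does not start with the gzip magic bytes. The sketch below guesses the type from the first bytes, which can distinguish a real gzip archive from, say, an HTML error page saved under a .tar.gz name. `sniff_file_type` is a hypothetical helper, and the labels it returns are illustrative only.

```python
def sniff_file_type(path):
    """Guess a file's type from its leading bytes; returns a short label."""
    with open(path, "rb") as f:
        head = f.read(512)
    if head[:2] == b"\x1f\x8b":
        # Every gzip stream starts with these two magic bytes.
        return "gzip"
    if head[257:262] == b"ustar":
        # POSIX and GNU tar headers carry this marker at offset 257.
        return "tar (uncompressed)"
    if head.lstrip()[:1] == b"<":
        return "html or xml (possibly an error page)"
    if not head:
        return "empty"
    return "unknown"
```

If the result is "tar (uncompressed)", extracting with `tar xvf` (without `z`) is the thing to try. Any other label suggests downloading the file again.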

How to convert TriviaQA to SQuAD format?

Hi! I want to convert the dataset to SQuAD format, but it is not clear to me which files to use. I tried the command below, but got an error.

root@843d9505135f:/home/trivia/utils# python convert_to_squad_format.py --triviaqa_file dataset/wikipedia-dev.json --squad_file squad --wikipedia_dir dataset/wikipdia-dev.json --web_dir dataset/web-dev.json
0%| | 0/14229 [00:00<?, ?it/s]
Traceback (most recent call last):
File "convert_to_squad_format.py", line 110, in <module>
convert_to_squad_format(args.triviaqa_file, args.squad_file)
File "convert_to_squad_format.py", line 67, in convert_to_squad_format
text = get_text(qad, qad['Source'])
File "convert_to_squad_format.py", line 12, in get_text
return utils.get_file_contents(local_file, encoding='utf-8')
File "/home/trivia/utils/utils.py", line 10, in get_file_contents
with open(filename, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/wikipdia-dev.json/White.txt'

Could you tell me in more detail how to do the conversion?
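
The failing path 'dataset/wikipdia-dev.json/White.txt' shows that the script appends a document file name to the `--wikipedia_dir` value, so that flag appears to expect a directory of evidence .txt files. The command in the "Error when converting to SQuAD format" issue passes triviaqa/evidence/wikipedia/ and triviaqa/evidence/web/ there. The sketch below is a pre-flight check under that assumption. `check_evidence_dirs` is a hypothetical helper, not part of the repository.

```python
import os


def check_evidence_dirs(wikipedia_dir, web_dir):
    """Return a list of human-readable problems; an empty list means both look usable."""
    problems = []
    for flag, path in (("--wikipedia_dir", wikipedia_dir), ("--web_dir", web_dir)):
        if not os.path.isdir(path):
            problems.append(
                f"{flag} should be a directory of evidence .txt files, got: {path}"
            )
    return problems


# The values from the command above; the result depends on the local file system.
for problem in check_evidence_dirs("dataset/wikipdia-dev.json", "dataset/web-dev.json"):
    print(problem)
```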

Open Source License

Can you please add an open source license (preferably Apache 2.0)? This is mainly important for the evaluation script.
