facebookresearch / neuraldb

Database Reasoning Over Text project for ACL paper

License: Apache License 2.0

Python 93.25% Shell 6.75%

neuraldb's Introduction

Database Reasoning over Text

This repository contains the code for the Database Reasoning Over Text paper, to appear at ACL 2021. The work was carried out in collaboration with James Thorne, Majid Yazdani, Marzieh Saeidi, Fabrizio Silvestri, Sebastian Riedel, and Alon Halevy.

Data

The completed NeuralDB datasets can be downloaded here and are released under a CC BY-SA 3.0 license.

The dataset includes entity names from Wikidata, which are released under a CC BY-SA 3.0 license. It also includes sentences from the KELM corpus, which is released under the CC BY-SA 2.0 license.

Repository Structure

The repository is structured in three sub-folders:

Tools for mapping the KELM data to Wikidata identifiers are provided in the dataset-construction folder.
The information retrieval system for the support set generator is provided in the ssg folder.
The models for Neural SPJ, the baseline retrieval (TF-IDF and DPR), and evaluation scripts are provided in the modelling folder.

Instructions for running each component are provided in the README files in the respective sub-folders.

Setup

All sub-folders were set up with one Python environment per folder. Requirements for each environment can be installed by running a pip install:

pip install -r requirements.txt

In the dataset-construction and modelling folders, the src folder should be included in the Python path:

export PYTHONPATH=src
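
Where setting an environment variable is inconvenient (for example in a notebook), a similar effect can be sketched from inside Python. This is only an illustration and not part of the repository: the helper name `add_src_to_path` is hypothetical, and it assumes the given folder is one of the sub-folders that contains `src`.

```python
import sys
from pathlib import Path


def add_src_to_path(folder: str = ".") -> str:
    """Prepend <folder>/src to sys.path, roughly mirroring `export PYTHONPATH=src`.

    `add_src_to_path` is a hypothetical helper, not a function from the repository.
    """
    src = str((Path(folder) / "src").resolve())
    if src not in sys.path:
        # Insert at the front so modules under src are found first.
        sys.path.insert(0, src)
    return src


if __name__ == "__main__":
    print(add_src_to_path("."))
```

Unlike the environment variable, this only affects the current interpreter session, so it has to run before any import from `src`.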

License

The code in this repository is released under the Apache 2.0 license.

neuraldb's Issues

Regarding the Wikidata mongo dump

Hey, regarding this:

mongorestore --archive --gzip < mongo_wikidata_dump.gz

Where can I get the file mongo_wikidata_dump.gz?

Project dependencies may have API risk issues

Hi, in NeuralDB, inappropriate dependency version constraints can cause risks.

Below are the dependencies whose version constraints the project currently uses:

tqdm
pymongo
numpy

The version constraint == introduces a risk of dependency conflicts because the permitted range is too strict.
Leaving no upper bound, or using *, introduces a risk of missing-API errors because the latest version of a dependency may have removed some APIs.

After further analysis of this project:
The version constraint of dependency tqdm can be changed to >=4.42.0,<=4.64.0.
The version constraint of dependency pymongo can be changed to >=2.9,<=4.1.1.

These changes should reduce dependency conflicts as much as possible
while allowing the newest versions that do not cause call errors in the project.
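
As a rough way to see whether a local environment falls inside the ranges suggested above, the check can be sketched with the standard library alone. The ranges are copied from this issue and have not been verified against the repository; `parse_version`, `in_range` and `check_installed` are hypothetical helper names, and the version parsing is deliberately simplified (numeric `major.minor.patch` only, so pre-release tags are ignored).

```python
from importlib import metadata
from itertools import takewhile

# Ranges quoted in the issue text; they are the issue author's suggestion,
# not constraints confirmed by the maintainers.
SUGGESTED_RANGES = {
    "tqdm": ("4.42.0", "4.64.0"),
    "pymongo": ("2.9", "4.1.1"),
}


def parse_version(text: str, width: int = 3) -> tuple:
    """Turn '4.42' into (4, 42, 0); non-numeric suffixes such as 'rc1' are dropped."""
    numbers = []
    for piece in text.split(".")[:width]:
        digits = "".join(takewhile(str.isdigit, piece))
        numbers.append(int(digits) if digits else 0)
    # Pad with zeros so that '2.9' and '2.9.0' compare as equal.
    return tuple(numbers + [0] * (width - len(numbers)))


def in_range(version: str, low: str, high: str) -> bool:
    """Inclusive check corresponding to a '>=low,<=high' constraint."""
    return parse_version(low) <= parse_version(version) <= parse_version(high)


def check_installed(ranges: dict = SUGGESTED_RANGES) -> dict:
    """Report, per package, whether the installed version is inside its range."""
    report = {}
    for name, (low, high) in ranges.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        status = "ok" if in_range(installed, low, high) else "outside suggested range"
        report[name] = f"{installed}: {status}"
    return report


if __name__ == "__main__":
    for package, result in check_installed().items():
        print(package, "->", result)
```

For anything beyond a quick sanity check, a full version parser would be the safer choice, since the simplified comparison here does not follow every rule of Python's version specification.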

The current project calls all of the following methods.

Methods called from tqdm

itertools.product
tqdm.tqdm

Methods called from pymongo

pymongo.MongoClient
pymongo.UpdateOne

All methods called

loaded.extend
global_obs.append
search_key.split
qh2_filtered.append
r.split
json.load.keys
pathlib.Path
transformers.integrations.deepspeed_init
special_tokens.extend
instance.strip.rstrip
write_updates
os.environ.get
collections.defaultdict.items
torch.utils.data.DataLoader
self.context_tokenizer.items
get_longest
self.RankArgs
fact.split.detok.detokenize.replace.replace.replace
pathlib.Path.exists
s.startswith
try_recovery
collections.defaultdict
tmp_heights.append
idx.by_idx.append
logging.getLogger.info
self.callback_handler.on_prediction_step
model
maybe_split
matplotlib.pyplot.xticks
nltk.tokenize.treebank.TreebankWordDetokenizer.detokenize
random.choice.replace
map_triples_to_facts
f.write
resolve_redirect
grp.tuple.all_unique.append
collections.defaultdict.add
all_series.append
collections.Counter.most_common
torch.stack
item.replace.strip.replace
neuraldb.modelling.neuraldb_trainer.NeuralDBTrainer.save_metrics
EvalPredictionWithMetadata
get_unit
name.replace.replace.replace
resolve_later_ref.append
read_questions_into_dict.items
torch.utils.data.sampler.WeightedRandomSampler
collections.defaultdict.values
all_grams.extend
matplotlib.pyplot.savefig
sampled.extend
refs.split.replace
convert_numeric_hypothesis
hdr.replace
name.replace.replace.replace.replace.replace.replace.replace
json.load.get
ax.plot
map
question_template.split
self.model.generate
context_outputs.pooler_output.T.question_outputs.pooler_output.torch.matmul.cpu
get_generator
sizes.append
r.final_templates.keys
relation_id.additional_subjects.get.keys.set.union
name.replace.replace
float
Exception
glob.glob
collections.defaultdict.keys
generate_positive_question
list.append
v.to
matplotlib.pyplot.plot
subj.by_relation.extend
self.tokenizer.convert_tokens_to_ids
next_actions.tolist.tolist
long_questions.extend
new_search.append
stripped_template.replace.replace
is_valid_file
population.pop
find_matches
r.source_mutations.keys
generate_answers
resolve_first_ref
self.maybe_tokenize_db
result.strip
self.database_reader.load_instances
loaded.append
self.model.prepare_decoder_input_ids_from_labels
logging.getLogger.debug
relation_id.additional_objects.get.keys.set.union
question_template.split.strip
aggr.update
numpy.concatenate
pandas.pivot_table
logging.getLogger
transformers.integrations.is_deepspeed_zero3_enabled
f.read
transformers.trainer_utils.is_main_process
hasattr
subject_name.q.replace.replace
pydash.get
pandas.DataFrame
join
normalize_subject
resolve_later_ref.split
object_id.startswith
model.half.to.half
added_instances.append
matplotlib.pyplot.subplots
subj.by_subject.extend
rel.subj.by_sub_rel.append
join_decoded
sro.startswith
resolve_later_ref
feature.items
hdr.replace.replace
os.path.exists
bool
build_questions_for_db
batch.append
k.replace
os.path.isdir
tmp_rels.append
instance.strip.rstrip.replace.replace.split
merge_type
read_csv.items
self._prepare_inputs
rel.split.split_by_relation.extend
torch.nn.Softmax
partition_questions
statement.replace
all
predicted.split
answer_sizes.append
dataset.append
print
threshold.cos_scores.np.nonzero.squeeze
map_triples_to_facts.keys
torch.no_grad
transformers.trainer_utils.denumpify_detensorize.pop
isinstance
collections.OrderedDict
found_sro.hf.set.union
generate_derivations.append
TFIDFRetriever.closest_docs
answer.strip
query.active_questions.append
search_toks.index
element.split
tokenizer.pad_token.tokenizer.pad_token.tokenizer.eos_token.tokenizer.eos_token.tokenizer.bos_token.tokenizer.bos_token.label.replace.replace.replace.strip
nltk.word_tokenize
itertools.chain
ordering.index
subject_name.question_template.replace.replace
str
all_losses.mean.item
clean_title
actual.split
transformers.trainer_pt_utils.find_batch_size
dev_examples.append
partition_subject
k.strip.answers.append
derivation.strip.split
json.dumps
question_template.replace.replace
qh1_filtered.append
stds.extend
sentence_transformers.util.pytorch_cos_sim
self._wrap_model
operator.itemgetter
outputs.outputs.dict.outputs.isinstance.mean
sample_databases
final_questions.append
random.uniform
logging.getLogger.warning
range
subj.by_object.extend
evaluate_ndb_with_ssg
os.getenv
repr
o.by_object.append
statement.replace.replace
context_tokens.append
pandas.set_option
itertools.repeat
get_instances_from_file
precision
ax.fill_between
csv.DictReader
linearize
transformers.HfArgumentParser.parse_args_into_dataclasses
item.replace.strip
tokenizer.pad_token.tokenizer.pad_token.tokenizer.eos_token.tokenizer.eos_token.tokenizer.bos_token.tokenizer.bos_token.label.replace.replace.replace.strip.split
matplotlib.pyplot.legend
dpr.context_model.eval
logging.getLogger.critical
all_metadata.extend
self.collection.find_one
wikidata_common.wikpedia.Wikipedia
medium_questions.extend
tokenizer.batch_decode
group_derivations
b.count
copy.copy
process_lists
key.startswith
setuptools.find_packages
json.loads.split
partition_relation
q.strip
numpy.mean
question_template.replace
object_name.replace
TFIDFRetriever.lookup
question.split.strip
features.keys
nltk.ngrams
fact.split.detok.detokenize.replace.replace.split
tokenizer.add_special_tokens
transformers.DPRQuestionEncoder.from_pretrained
collections.Counter.update
list
model.half.to
kvp.split
hdr.replace.replace.replace.replace.replace
v.torch.LongTensor.to
new_states.append
object_name.replace.replace
template.keys.set.difference
search_str.clean.strip
example.append
read_csv
fact.split.detok.detokenize.replace.replace
torch.matmul
dataclasses.field
query_obj.range.set.difference
majority_vote
ndb_data.util.log_helper.setup_logging
numpy.argmax
tmp_positive_answers.append
neuraldb.modelling.neuraldb_trainer.NeuralDBTrainer.train
self._process_query
v.len.by_len.append
logging.root.addHandler
generate_facts_for_db.append
transformers.DPRContextEncoder.from_pretrained
wikidata_common.wikpedia.Wikipedia.resolve_redirect
set
ndb_data.wikidata_common.wikidata.Wikidata
prediction.replace.split
filter
neuraldb.modelling.neuraldb_trainer.NeuralDBTrainer.save_state
neuraldb.modelling.neuraldb_trainer.NeuralDBTrainer.save_model
setup_logging
min
lookup_entity
all_subjects.set.union
self.subsampler.maybe_drop_sample
generate_joins_extra
neuraldb.modelling.neuraldb_trainer.NeuralDBTrainer.log_metrics
tokenizer.eos_token.tokenizer.eos_token.tokenizer.bos_token.tokenizer.bos_token.label.replace.replace.replace
subj.by_sub_rel.append
round
sampled.append
neuraldb.dataset.neuraldb_parser.NeuralDBParser.load_instances
sampled_fact.strip
dpr.question_model.eval
in_dict.items
dict_flatten
self._prepare_inputs.items
collection.insert_many
pymongo.UpdateOne
numpy.count_nonzero
num_fact_used.append
get_bool_breakdown
reader_cls.read
sentence_transformers.losses.ContrastiveLoss
transformers.DPRContextEncoderTokenizer.from_pretrained
neuraldb.evaluation.scoring_functions.average_score
numpy.std
sorted
read_dump
matplotlib.pyplot.hlines
v.split.strip.strip
q.by_qid.append
find_longest_match
ssg_data.append
tokenizer.eos_token.tokenizer.eos_token.tokenizer.bos_token.tokenizer.bos_token.pred.replace.replace.replace
generate_joins_filter
transformers.DPRQuestionEncoderTokenizer.from_pretrained
matplotlib.pyplot.ylabel
self.context_model
DPRRetriever.lookup
i.values
transformers.trainer_pt_utils.nested_truncate
ssg_output.remove
relation_id.extra_subjects.get.keys
matplotlib.pyplot.style.use
db.split.rsplit
neuraldb.evaluation.scoring_functions.breakdown_score
delattr
qid.split
open.read
self.tokenizer.pad.append
tokenizer.pad_token.tokenizer.pad_token.tokenizer.eos_token.tokenizer.eos_token.tokenizer.bos_token.tokenizer.bos_token.pred.replace.replace.replace.strip.split
local_obs.append
neuraldb.dataset.instance_generator.subsampler.Subsampler
fact.split.detok.detokenize.replace
torch.cat
prediction.split.replace
transformers.AutoTokenizer.from_pretrained.tokenize
scoring_function
self.tokenizer.tokenize
short_questions.extend
self.features.keys
numpy.sum
self._generate
self._process_answer
transformers.DPRQuestionEncoder.from_pretrained.to
tok.startswith
k.strip
math.pow
subject.get
numpy.concatenate.mean
generate_negative_bool
q.startswith
s.strip
additional_ids.difference.difference
question.strip
instance.strip.rstrip.replace
random.choices
matplotlib.pyplot.xlabel
ndb_data.wikidata_common.wikidata.Wikidata.get_by_id_or_uri
set.append
reader_cls
normalize_subject.replace
generate_derivations
self.concatenate_answer
transformers.trainer_utils.denumpify_detensorize.keys
actual.join_decoded.lower
rel.by_sub_rel.append
state.copy.append
self.context_tokenizer
self.prediction_step
transformers.AutoTokenizer.from_pretrained.decode
check_match
derivation.strip.startswith
self.compute_metrics
random.choice
derivation.rsplit
len.keys
inputs.outputs.self.label_smoother.mean
make_symmetric
ndb_data.construction.make_database_initial.normalize_subject
transformers.AutoModelForSeq2SeqLM.from_pretrained.resize_token_embeddings
super.__init__
matplotlib.pyplot.title
question_template.replace.replace.replace
self.collection.find
get_size_bin
b.strip.first_bit.strip
template.keys
claim.pydash.get.values
ef.write
q_idx.db_idx.questions_answers.append
len
random.random
numpy.min
prediction.replace.replace
k.rel_avgs.append
get_bool_ans
argparse.ArgumentParser.parse_args
prediction.split
json.loads.replace
neuraldb.dataset.seq2seq_dataset.Seq2SeqDataset
ValueError
prediction.split.replace.replace
self.validation_file.split
all_stds.append
tmp_derivations.append
sentence_transformers.evaluation.BinaryClassificationEvaluator.from_input_examples
torch.cuda.amp.autocast
try_numeric
self.test_file.split
generate_joins
self.instance_generator.generate
clean.append
name.replace.replace.replace.replace.replace.replace
neuraldb.dataset.data_collator_seq2seq.DataCollatorForSeq2SeqAllowMetadata
context_outputs.pooler_output.T.question_outputs.pooler_output.torch.matmul.cpu.detach.numpy.argsort
pred.replace
key.added_q_type_bin.append
by_subj.keys.set.difference
logging.StreamHandler
next
extended_question_answers.append
sentence_transformers.SentenceTransformer
self.answer_delimiter.join
numpy.where
wikidata_common.wikidata.Wikidata.find_custom
torch.zeros
logging.getLogger.error
instance.update
format
collection.find
super.compute_loss
context_outputs.pooler_output.T.question_outputs.pooler_output.torch.matmul.cpu.detach
ndb_data.wikidata_common.kelm.KELMMongo
logging.root.setLevel
get_indexable
load_experiment
drqascripts.retriever.build_tfidf_lines.OnlineTfidfDocRanker
extra_negative_facts.append
collections.Counter
ndb_data.wikidata_common.kelm.KELMMongo.find_entity_rel
torch.LongTensor
any
pydash.get.items
sum
setuptools.setup
wikidata_common.wikidata.Wikidata
model.half.to.eval
clean
second.nested.n_count.add
lengths.append
hyp.original_for.append
self._nested_gather
question_template.replace.replace.replace.replace.replace
os.path.basename
index_dump
bulks.append
matplotlib.pyplot.fill_between
zip
search_key.result.n_count.add
json.loads.append
generate_db_facts
self._pad_tensors_to_max_len
singleton_questions.extend
elem.to
transformers.AutoTokenizer.from_pretrained
statement.replace.replace.replace
compute_f1
set.add
transformers.AutoTokenizer.from_pretrained.encode
bz2.open
similarity.normalized_levenshtein.NormalizedLevenshtein
argparse.ArgumentParser.add_argument
transformers.trainer_utils.EvalLoopOutput
generate_derivations.extend
s.copy
context_outputs.pooler_output.T.question_outputs.pooler_output.torch.matmul.cpu.detach.numpy.argsort.tolist
generate_hypotheses
hdr.replace.replace.replace.replace
sentence_transformers.SentencesDataset
retokenize
self.context_delimiter.join
out_file.write
self._maybe_sample
label.replace
json.loads
transformers.set_seed
self.tokenizer.encode_plus
type
question.strip.replace
neuraldb.evaluation.scoring_functions.f1
json.load.items
additional_ids.difference.update
sentence_transformers.InputExample
o.startswith
random.shuffle
transformers.AutoConfig.from_pretrained
sentence_transformers.SentenceTransformer.encode
re.match
prediction.split.replace.replace.lower
math.floor
additional_subjects.keys.set.union
derivation.split
datasets.tqdm
self._pad_across_processes
instance.questions.append
name.replace
question.replace
self.tokenizer.as_target_tokenizer
question_template.replace.replace.replace.replace
recall
read_databases
candidate_negatives_1.append
plot.append
question_template.startswith
numpy.max
nltk.tokenize.treebank.TreebankWordDetokenizer
ctx.insert
isinstance.items
self._load_instances
local_f.append
qbin.qtype.all_questions_binned.append
generate_facts_for_db
outputs.outputs.dict.outputs.isinstance.mean.detach
get_numeric_value
tmp_types.append
normalize_subject.split
self.question_types.values
random.choice.startswith
context_outputs.pooler_output.T.question_outputs.pooler_output.torch.matmul.cpu.detach.numpy
collection.bulk_write
ndb_data.wikidata_common.kelm.KELMMongo.close
transformers.trainer_utils.denumpify_detensorize
ssg_utils.read_NDB
instance.split.TreebankWordDetokenizer.detokenize.replace.replace
data_files.items
instance.copy.strip
super
generator_cls
subject_name.question.replace.replace
r.final_templates.keys.set.difference
self.tokenizer.add_tokens
dataset.extend
all_experiments.append
plot.sort
self.num_examples
json.dump
object.get
self._prepare_inputs.pop
v.split.strip
collection.estimated_document_count
expt.update
neuraldb.dataset.neuraldb_parser.NeuralDBParser
q.replace
keys.split
final_sets.append
db.extend
tmp_fact_ids.append
hyp.extra_kelm_for.append
self.train_file.split
kwargs.get
argparse.ArgumentParser.error
numpy.nonzero
similarity.normalized_levenshtein.NormalizedLevenshtein.similarity
argparse.ArgumentParser
config_kwargs.update
refs.split.split
startptr.toks.join.clean.split
keys.split.strip
db_idx.to_add.append
ndb_data.generation.question_to_db.generate_answers
tuple
transformers.utils.logging.set_verbosity_info
shutil.rmtree
additional_objects.keys.set.union
pandas.pivot_table.to_records
NotImplementedError
a.strip
super.prediction_step
itertools.product
tqdm.tqdm
derivation.strip
subject_name.islower
random.choice.split
post_process_instances
property.property_entity.append
matplotlib.pyplot.show
os.unlink
final_period
self.tokenizer.decode
instance.strip.rstrip.replace.replace
swap_so
question.split
v.strip.answers.append
re.match.group
outputs.outputs.dict.outputs.isinstance.mean.detach.repeat
relation_id.additional_objects.get.keys
line.rstrip
instance.split.TreebankWordDetokenizer.detokenize.replace
self.tokenizer.pad
numpy.cumsum
copy.copy.extend
tuple.startswith
partition_idx
partition_subject.keys
functools.reduce
states.pop.copy
hyp.hypotheses_facts.append
transformers.utils.logging.enable_default_handler
others.append
read_questions_into_dict
self.maybe_decorate_with_metadata
obj.startswith
v.split.strip.split
neuraldb.modelling.neuraldb_trainer.NeuralDBTrainer.evaluate
object_name.islower
cos_scores.cpu.cpu
transformers.trainer_pt_utils.nested_concat
json.load
b.strip
self.question_tokenizer
series.extend
partition_subject_relation
transformers.trainer_pt_utils.nested_numpify
get_file_stats
bring_extra_facts
pymongo.MongoClient
read_csv.lower
json.loads.strip
postprocess_text
numpy.percentile
transformers.utils.logging.enable_explicit_format
os.listdir
random.sample
os.makedirs
derivation.split.strip
hf.keys
self.label_smoother
sentence_transformers.SentenceTransformer.fit
is_valid_folder
read_questions_into_dict.keys
tokenizer.pad_token.tokenizer.pad_token.tokenizer.eos_token.tokenizer.eos_token.tokenizer.bos_token.tokenizer.bos_token.pred.replace.replace.replace.strip
TFIDFRetriever
transformers.trainer_utils.get_last_checkpoint
collections.Counter.items
subject_name.modifier.is_subject.q.replace.replace
transformers.DPRContextEncoder.from_pretrained.to
out.extend
subject_name.fact.replace.replace
extract_operator
inputs.outputs.self.label_smoother.mean.detach
ssg_utils.create_dataset
self.question_tokenizer.items
lookup_relation
start.toks.join.startswith
r.by_relation.append
derv.split
max
derv.tokenizer.encode.tokenizer.decode.strip
tokenizer.bos_token.tokenizer.bos_token.label.replace.replace
int
remove_lst.append
random.randint
transformers.AutoModelForSeq2SeqLM.from_pretrained
neuraldb.modelling.neuraldb_trainer.NeuralDBTrainer
example.self.tokenizer.convert_tokens_to_ids.self.tokenizer.decode.split
open
set.update
derivation.rsplit.strip
hdr.replace.replace.replace
enumerate
k.startswith
pandas.DataFrame.select_dtypes
states.pop
weights.append
v.items
DPRRetriever
name.replace.replace.replace.replace.replace
tmp_questions.append
of.write
main
tokenizer.bos_token.tokenizer.bos_token.pred.replace.replace
predicted.join_decoded.lower
unit_uri.replace
relation_id.additional_subjects.get.keys
flatten_dicts
transformers.HfArgumentParser
self.maybe_tokenize_answer
convert_comparable
self._prepend_prediction_type_answer
self.question_model
q.q_heights.append
train_examples.append
name.replace.replace.replace.replace
dict
set.items
evaluation_metrics
pydash.get.values
virtual_features.extend
s.by_subject.append
batch_update.append
relation_id.extra_objects.get.keys
name.replace.replace.replace.replace.replace.replace.replace.replace
neuraldb.util.log_helper.setup_logging
search_toks.append
wikidata_common.wikidata.Wikidata.get_by_id_or_uri

@developer
Could you please help me check this issue?
May I open a pull request to fix it?
Thank you very much.

questions about spj_rand

1. What does spj_rand mean? Is spj_rand the same as ssg+spj? If not, how can I run ssg+spj?

2. I'm trying to reproduce the results of the ACL paper. I have executed 'bash scripts/experiments_ours.sh v2.4_25' and 'bash scripts/experiments_baseline.sh v2.4_25'. After executing 'python -m neuraldb.final_scoring', I got these results:

The results here cover only minmax/set/bool/count. How can I output the results related to Atomic and join?

Performance problem combination of SSG + SPJ

Hey,

My team and I are facing an issue with the combination of SSG and SPJ. We trained the SSG and the SPJ, and each performs well when evaluated separately. However, as soon as we test the NRD end to end, performance drops sharply. We get 0.55 precision and 0.87 recall for the SSG and a 0.89 F1 score for the SPJ, but only 0.131 for SSG + SPJ combined. Based on the table in the Neural Databases article, we expected better results. Do you have any idea why this is happening?

Why can't we predict the other types of questions for SSG + SPJ ... ?

Thanks :)
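
To put the figures in this issue on a common scale, the SSG precision and recall can be combined into an F1 score. The sketch below assumes the reported SSG figures are precision 0.55 and recall 0.87, and it uses the textbook harmonic-mean definition of F1; whether the repository's scoring functions compute F1 in exactly this way is not confirmed here, and `f1_score` is a hypothetical helper name.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumed reading of the SSG figures reported in the issue.
ssg_f1 = f1_score(0.55, 0.87)

if __name__ == "__main__":
    print(round(ssg_f1, 3))  # 0.674
```

This does not by itself explain the drop to 0.131. It only shows that, under this reading, the SSG's own F1 is already noticeably lower than the SPJ's 0.89.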

How to obtain <path_to_mapped_kelm>.jsonl

Hi, regarding this:

1.2 KELM

Importing the mappings from KELM is quite fast and is done through running the kelm_data.py script
python -m ndb_data.data_import.kelm_data <path_to_mapped_kelm>.jsonl

How do I obtain the <path_to_mapped_kelm>.jsonl? I don't think it is described in the README. I guess it has something to do with the map_kelm.py script, but I wasn't able to figure out the details. For example, downloading KELM from https://github.com/google-research-datasets/KELM-corpus and feeding it into either map_kelm.py or kelm_data.py doesn't seem to work.

What am I missing?

FiD Missing

Line 51 in modelling/neuraldb/run.py has the following import:
from neuraldb.modelling.fusion_in_decoder import T5MergeForConditionalGeneration
Is it possible to add it to the repository?

HuggingFace model for inference testing

Hey team,

Is there a chance of releasing the final fine-tuned models, on HuggingFace or elsewhere, for inference testing on the downloaded NeuralDB dataset?

Thanks.

"kelm_file.jsonl" is missing

Hey, I downloaded the Mongo Wikidata dump from Google Drive and restored it successfully. The database was created with the collection "wiki_graph", with the indexes wikidata_id, english_name, english_wiki, and sitelinks.title, and the collection "wiki_redirect", with the index title.

I am stuck at section 1.2 of the README in the dataset-construction folder, having followed all the preceding steps. Could you please tell me where to download "kelm_file.jsonl" so that I can reproduce the results?
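
For anyone wanting to confirm that a restored dump has the same collections and indexes as listed in this issue, a minimal sketch follows. It works on the dictionary returned by pymongo's `Collection.index_information()`, so the checking logic can be tried without a running server. The expected index names are taken from the issue text only; `indexed_fields` and `missing_indexes` are hypothetical helper names, and the database name is not stated in the issue, so it is left to the reader.

```python
# Collections and indexed fields as described in the issue; not verified
# against the actual dump.
EXPECTED_INDEXES = {
    "wiki_graph": ["wikidata_id", "english_name", "english_wiki", "sitelinks.title"],
    "wiki_redirect": ["title"],
}


def indexed_fields(index_info: dict) -> set:
    """Collect field names from a `Collection.index_information()` result.

    The result maps index names to dicts whose "key" entry is a list of
    (field, direction) pairs.
    """
    fields = set()
    for spec in index_info.values():
        for field, _direction in spec["key"]:
            fields.add(field)
    return fields


def missing_indexes(index_info: dict, expected: list) -> list:
    """Return the expected fields that no existing index covers."""
    present = indexed_fields(index_info)
    return [field for field in expected if field not in present]


if __name__ == "__main__":
    # Hand-written stand-in for db["wiki_redirect"].index_information().
    sample = {
        "_id_": {"v": 2, "key": [("_id", 1)]},
        "title_1": {"v": 2, "key": [("title", 1)]},
    }
    print(missing_indexes(sample, EXPECTED_INDEXES["wiki_redirect"]))  # []
```

With a live connection, the `sample` dictionary would be replaced by the result of calling `index_information()` on each collection obtained through `pymongo.MongoClient`.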