kge's Introduction

LibKGE: A knowledge graph embedding library

LibKGE is a PyTorch-based library for efficient training, evaluation, and hyperparameter optimization of knowledge graph embeddings (KGE). It is highly configurable, easy to use, and extensible. Other KGE frameworks are listed below.

The key goal of LibKGE is to foster reproducible research into (as well as meaningful comparisons between) KGE models and training methods. As we argue in our ICLR 2020 paper (see video), the choice of training strategy and hyperparameters strongly influences model performance, often more so than the model class itself. LibKGE aims to provide clean implementations of training, hyperparameter optimization, and evaluation strategies that can be used with any model. Every potential knob or heuristic implemented in the framework is exposed explicitly via well-documented configuration files (e.g., see here and here). LibKGE also provides the most common KGE models, and new ones can be easily added (contributions welcome!).

For link prediction tasks, rule-based systems such as AnyBURL are a competitive alternative to KGE.

UPDATE: LibKGE now includes GraSH, an efficient multi-fidelity hyperparameter optimization algorithm for large-scale KGE models. See here for an example of how to use it.

Quick start

# retrieve and install project in development mode
git clone https://github.com/uma-pi1/kge.git
cd kge
pip install -e .

# download and preprocess datasets
cd data
sh download_all.sh
cd ..

# train an example model on toy dataset (you can omit '--job.device cpu' when you have a gpu)
kge start examples/toy-complex-train.yaml --job.device cpu

Table of contents

  1. Features
  2. Results and pretrained models
  3. Using LibKGE
  4. Currently supported KGE models
  5. Extending LibKGE
  6. FAQ
  7. Known issues
  8. Changelog
  9. Other KGE frameworks
  10. How to cite

Features

  • Training
    • Training types: negative sampling, 1vsAll, KvsAll
    • Losses: binary cross entropy (BCE), Kullback-Leibler divergence (KL), margin ranking (MR), squared error (SE)
    • All optimizers and learning rate schedulers of PyTorch supported and can be chosen individually for different parameters (e.g., different for entity and for relation embeddings)
    • Learning rate warmup
    • Early stopping
    • Checkpointing
    • Stop (e.g., via Ctrl-C) and resume at any time
    • Automatic memory management to support large batch sizes (see config key train.subbatch_auto_tune)
  • Hyperparameter tuning
    • Grid search, manual search, quasi-random search (using Ax), Bayesian optimization (using Ax)
    • Resource-efficient multi-fidelity search for large graphs (using GraSH)
    • Highly parallelizable (multiple CPUs/GPUs on single machine)
    • Stop and resume at any time
  • Evaluation
    • Entity ranking metrics: Mean Reciprocal Rank (MRR), HITS@k with/without filtering
    • Drill-down by: relation type, relation frequency, head or tail
  • Extensive logging and tracing
    • Detailed progress information about training, hyper-parameter tuning, and evaluation is recorded in machine readable formats
    • Quick export of all/selected parts of the traced data into CSV or YAML files to facilitate analysis
  • KGE models
  • Embedders
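
As an illustration of the entity ranking metrics listed above, the following minimal sketch computes MRR and HITS@k from a list of ranks. The function names and the example ranks are made up for this illustration; this is not LibKGE's evaluation code.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries; ranks are 1-based."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Hypothetical ranks of the correct entity for four test queries.
ranks = [1, 2, 4, 10]
mrr = mean_reciprocal_rank(ranks)    # (1 + 1/2 + 1/4 + 1/10) / 4
hits_at_3 = hits_at_k(ranks, k=3)    # two of the four ranks are <= 3
print(f"MRR={mrr:.4f} Hits@3={hits_at_3:.2f}")
```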

Results and pretrained models

We list some example results (filtered MRR and HITS@k on test data) obtained with LibKGE below. These results are obtained by running automatic hyperparameter search as described here.

These results are not necessarily the best results that can be achieved using LibKGE, but they are comparable in that a common experimental setup (and equal amount of work) has been used for hyperparameter optimization for each model. Since we use filtered MRR for model selection, our results may not be indicative of the achievable model performance for other validation metrics (such as HITS@10, which has been used for model selection elsewhere).

We report performance numbers on the entire test set, including the triples that contain entities not seen during training. This is not done consistently throughout existing KGE implementations: some frameworks remove unseen entities from the test set, which leads to a perceived increase in performance (e.g., roughly add +3pp to our WN18RR MRR numbers for this method of evaluation).
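
To make the term "filtered" concrete, here is a small sketch of how a filtered rank differs from a raw rank. The entities, scores, and the helper `filtered_rank` are hypothetical, and ties are resolved optimistically for simplicity; LibKGE's own tie handling may differ.

```python
from typing import Dict, Set


def filtered_rank(scores: Dict[str, float], answer: str, other_true: Set[str]) -> int:
    """Rank of `answer` among all candidates, skipping other known-true answers."""
    target = scores[answer]
    higher = sum(
        1
        for entity, score in scores.items()
        if entity != answer and entity not in other_true and score > target
    )
    return higher + 1


# Hypothetical scores for the query (Germany, contains city, ?).
scores = {"Berlin": 0.9, "Paris": 0.8, "Munich": 0.7, "Rome": 0.1}

# Raw rank: every higher-scoring candidate counts against the answer.
raw = filtered_rank(scores, "Munich", other_true=set())
# Filtered rank: Berlin is also a correct answer, so it is ignored.
filtered = filtered_rank(scores, "Munich", other_true={"Berlin"})
print(raw, filtered)
```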

We also provide pretrained models for these results. Each pretrained model is given in the form of a LibKGE checkpoint, which contains the model as well as additional information (such as the configuration being used). See the documentation below on how to use checkpoints.

FB15K-237 (Freebase)

Model     MRR    Hits@1  Hits@3  Hits@10  Config file  Pretrained model
RESCAL    0.356  0.263   0.393   0.541    config.yaml  1vsAll-kl
TransE    0.313  0.221   0.347   0.497    config.yaml  NegSamp-kl
DistMult  0.343  0.250   0.378   0.531    config.yaml  NegSamp-kl
ComplEx   0.348  0.253   0.384   0.536    config.yaml  NegSamp-kl
ConvE     0.339  0.248   0.369   0.521    config.yaml  1vsAll-kl
RotatE    0.333  0.240   0.368   0.522    config.yaml  NegSamp-bce

WN18RR (Wordnet)

Model     MRR    Hits@1  Hits@3  Hits@10  Config file  Pretrained model
RESCAL    0.467  0.439   0.480   0.517    config.yaml  KvsAll-kl
TransE    0.228  0.053   0.368   0.520    config.yaml  NegSamp-kl
DistMult  0.452  0.413   0.466   0.530    config.yaml  KvsAll-kl
ComplEx   0.475  0.438   0.490   0.547    config.yaml  1vsAll-kl
ConvE     0.442  0.411   0.451   0.504    config.yaml  KvsAll-kl
RotatE    0.478  0.439   0.494   0.553    config.yaml  NegSamp-bce

FB15K (Freebase)

Model     MRR    Hits@1  Hits@3  Hits@10  Config file  Pretrained model
RESCAL    0.644  0.544   0.708   0.824    config.yaml  NegSamp-kl
TransE    0.676  0.542   0.787   0.875    config.yaml  NegSamp-bce
DistMult  0.841  0.806   0.863   0.903    config.yaml  1vsAll-kl
ComplEx   0.838  0.807   0.856   0.893    config.yaml  1vsAll-kl
ConvE     0.825  0.781   0.855   0.896    config.yaml  KvsAll-bce
RotatE    0.783  0.727   0.820   0.877    config.yaml  NegSamp-kl

WN18 (Wordnet)

Model     MRR    Hits@1  Hits@3  Hits@10  Config file  Pretrained model
RESCAL    0.948  0.943   0.951   0.956    config.yaml  1vsAll-kl
TransE    0.553  0.315   0.764   0.924    config.yaml  NegSamp-bce
DistMult  0.941  0.932   0.948   0.954    config.yaml  1vsAll-kl
ComplEx   0.951  0.947   0.953   0.958    config.yaml  KvsAll-kl
ConvE     0.947  0.943   0.949   0.953    config.yaml  1vsAll-kl
RotatE    0.946  0.943   0.948   0.953    config.yaml  NegSamp-kl

Yago3-10 (YAGO)

LibKGE supports large datasets such as Yago3-10 (123k entities) and Wikidata5M (4.8M entities). The results given below were found by automatic hyperparameter search with a similar search space as above, but with some values fixed (training with shared negative sampling, embedding dimension: 128, batch size: 1024, optimizer: Adagrad, regularization: weighted). The Yago3-10 result was obtained by training 30 pseudo-random configurations for 20 epochs, and then rerunning the configuration that performed best on validation data for 400 epochs.

Model    MRR    Hits@1  Hits@3  Hits@10  Config file  Pretrained model
ComplEx  0.551  0.476   0.596   0.682    config.yaml  NegSamp-kl

Wikidata5M (Wikidata)

We report two results for Wikidata5M. The first result was found by the same automatic hyperparameter search as described for Yago3-10, but we limited the final training to 200 epochs. The second result was obtained with significantly less resource consumption by using the multi-fidelity GraSH search.

Model    Search + budget     Final training  MRR    Hits@1  Hits@3  Hits@10  Config file  Pretrained model
ComplEx  Random, 600 epochs  200 epochs      0.301  0.245   0.331   0.397    config.yaml  NegSamp-kl
ComplEx  GraSH, 192 epochs   64 epochs       0.300  0.247   0.328   0.390    config.yaml  -

Freebase

GraSH was also applied to Freebase, one of the largest benchmarking datasets containing 86M entities. The reported results were obtained by combining GraSH with distributed training implemented in Dist-KGE. The respective config files can be found in the GraSH repository as their execution is not yet supported in LibKGE.

Model    MRR    Hits@1  Hits@3  Hits@10
ComplEx  0.594  0.511   0.667   0.726
RotatE   0.613  0.578   0.637   0.669
TransE   0.553  0.520   0.571   0.614

CoDEx

CoDEx is a Wikidata-based KG completion benchmark. The results here have been obtained using the automatic hyperparameter search used for the Freebase and WordNet datasets, but with fewer epochs and Ax trials for CoDEx-M and CoDEx-L. See the CoDEx paper (EMNLP 2020) for details.

CoDEx-S
Model    MRR    Hits@1  Hits@3  Hits@10  Config file
RESCAL   0.404  0.293   0.4494  0.623    config.yaml
TransE   0.354  0.219   0.4218  0.634    config.yaml
ComplEx  0.465  0.372   0.5038  0.646    config.yaml
ConvE    0.444  0.343   0.4926  0.635    config.yaml
TuckER   0.444  0.339   0.4975  0.638    config.yaml
CoDEx-M
Model    MRR    Hits@1  Hits@3  Hits@10  Config file
RESCAL   0.317  0.244   0.3477  0.456    config.yaml
TransE   0.303  0.223   0.3363  0.454    config.yaml
ComplEx  0.337  0.262   0.3701  0.476    config.yaml
ConvE    0.318  0.239   0.3551  0.464    config.yaml
TuckER   0.328  0.259   0.3599  0.458    config.yaml
CoDEx-L
Model    MRR    Hits@1  Hits@3  Hits@10  Config file
RESCAL   0.304  0.242   0.3313  0.419    config.yaml
TransE   0.187  0.116   0.2188  0.317    config.yaml
ComplEx  0.294  0.237   0.3179  0.400    config.yaml
ConvE    0.303  0.240   0.3298  0.420    config.yaml
TuckER   0.309  0.244   0.3395  0.430    config.yaml

Using LibKGE

LibKGE supports training, evaluation, and hyperparameter tuning of KGE models. The settings for each task can be specified with a configuration file in YAML format or on the command line. The default values and usage for available settings can be found in config-default.yaml as well as the model- and embedder-specific configuration files (such as lookup_embedder.yaml).

Train a model

First create a configuration file such as:

job.type: train
dataset.name: fb15k-237

train:
  optimizer: Adagrad
  optimizer_args:
    lr: 0.2

valid:
  every: 5
  metric: mean_reciprocal_rank_filtered

model: complex
lookup_embedder:
  dim: 100
  regularize_weight: 0.8e-7

To begin training, run one of the following:

# Store the file as `config.yaml` in a new folder of your choice. Then initiate or resume
# the training job using:
kge resume <folder>

# Alternatively, store the configuration anywhere and use the start command
# to create a new folder
#   <kge-home>/local/experiments/<date>-<config-file-name>
# with that config and start training there.
kge start <config-file>

# In both cases, configuration options can be modified on the command line, too: e.g.,
kge start <config-file> --job.device cuda:0 --train.optimizer Adam

Various checkpoints (including model parameters and configuration options) will be created during training. These checkpoints can be used to resume training (or any other job type such as hyperparameter search jobs).
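
The dotted keys used in the configuration file (such as job.type) and on the command line (such as --job.device) address entries in a nested configuration. The following sketch shows one way such overrides could be applied to a nested dictionary. The helpers `set_dotted` and `apply_overrides` are hypothetical and only illustrate the idea; they are not LibKGE's configuration code, and values are kept as plain strings here.

```python
from typing import Any, Dict, List


def set_dotted(config: Dict[str, Any], key: str, value: Any) -> None:
    """Set config['a']['b'] = value for key 'a.b', creating levels as needed."""
    *parents, leaf = key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


def apply_overrides(config: Dict[str, Any], argv: List[str]) -> None:
    """Apply '--key value' pairs, e.g. ['--job.device', 'cuda:0']."""
    for flag, value in zip(argv[0::2], argv[1::2]):
        set_dotted(config, flag.lstrip("-"), value)


# Nested form of a small part of the example configuration above.
config = {"job": {"type": "train"}, "train": {"optimizer": "Adagrad"}}
apply_overrides(config, ["--job.device", "cuda:0", "--train.optimizer", "Adam"])
print(config)
```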

Resume training

All of LibKGE's jobs can be interrupted (e.g., via Ctrl-C) and resumed (from one of its checkpoints). To resume a job, use:

kge resume <folder>

# Change the device when resuming
kge resume <folder> --job.device cuda:1

By default, the last checkpoint file is used. The filename of the checkpoint can be overridden using --checkpoint.

Evaluate a trained model

To evaluate a trained model, run the following:

# Evaluate a model on the validation split
kge valid <folder>

# Evaluate a model on the test split
kge test <folder>

By default, the checkpoint file named checkpoint_best.pt (which stores the best validation result so far) is used. The filename of the checkpoint can be overridden using --checkpoint.

Hyperparameter optimization

LibKGE supports various forms of hyperparameter optimization such as grid search, random search, Bayesian optimization, or resource-efficient multi-fidelity search. The search type and search space are specified in the configuration file.

For example, you may use Ax for SOBOL (quasi-random) search and Bayesian optimization. The following config file defines a search of 10 SOBOL trials (arms) followed by 20 Bayesian optimization trials:

job.type: search
search.type: ax

dataset.name: wnrr
model: complex
valid.metric: mean_reciprocal_rank_filtered

ax_search:
  num_trials: 30
  num_sobol_trials: 10  # remaining trials are Bayesian
  parameters:
    - name: train.batch_size
      type: choice
      values: [256, 512, 1024]
    - name: train.optimizer_args.lr
      type: range
      bounds: [0.0003, 1.0]
    - name: train.type
      type: fixed
      value: 1vsAll

For large graph datasets such as Wikidata5m, you may use GraSH, which enables resource-efficient hyperparameter optimization. A full documentation of the GraSH functionality, useful search configs, and obtained results can be found in the accompanying repository. The following example config defines a search of 64 randomly generated trials with a search budget equivalent to only 3 full training runs on the whole dataset:

job.type: search
search.type: grash_search

dataset.name: wikidata5m
model: complex
valid.metric: mean_reciprocal_rank_filtered

grash_search:
  num_trials: 64 # initial number of randomly generated trials
  search_budget: 3 # in terms of full training runs on the whole dataset
  eta: 4 # reduction factor - only keep 1/eta best-performing trials per round
  variant: combined # low-fidelity approximation technique - combined = epoch + graph reduction
  parameters:
    - name: train.batch_size
      type: choice
      values: [256, 512, 1024]
    - name: train.optimizer_args.lr
      type: range
      bounds: [0.0003, 1.0]
    - name: train.type
      type: fixed
      value: 1vsAll
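
To get a feeling for the num_trials, eta, and search_budget settings, the sketch below computes a round-by-round schedule in the style of standard successive halving. It assumes that the budget is split evenly across rounds and, within a round, evenly across the surviving trials. This is an assumption made for illustration; the exact procedure used by GraSH is described in its repository. The function `halving_schedule` is hypothetical.

```python
from typing import List, Tuple


def halving_schedule(num_trials: int, eta: int, search_budget: float) -> List[Tuple[int, float]]:
    """Return (trials in round, budget per trial) for each round.

    Budget is measured in full training runs on the whole dataset.
    """
    if num_trials < 2 or eta < 2:
        raise ValueError("need at least two trials and eta >= 2")
    round_sizes = []
    remaining = num_trials
    while remaining > 1:                 # stop once a single trial is left
        round_sizes.append(remaining)
        remaining //= eta                # keep only 1/eta of the trials
    budget_per_round = search_budget / len(round_sizes)
    return [(size, budget_per_round / size) for size in round_sizes]


schedule = halving_schedule(num_trials=64, eta=4, search_budget=3)
for round_no, (size, budget) in enumerate(schedule, start=1):
    print(f"round {round_no}: {size} trials, {budget:.4f} full runs each")
```

Under these assumptions, early rounds evaluate many trials very cheaply, and only the last few trials receive a substantial share of the budget.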

Trials can be run in parallel across several devices:

# Run 4 trials in parallel evenly distributed across two GPUs
kge resume <folder> --search.device_pool cuda:0,cuda:1 --search.num_workers 4

# Run 3 trials in parallel with per-GPU capacity (here: one on cuda:0, two on cuda:1)
kge resume <folder> --search.device_pool cuda:0,cuda:1,cuda:1 --search.num_workers 3

Export and analyze logs and checkpoints

Extensive logs are stored as YAML files (hyperparameter search, training, validation). LibKGE provides convenience methods to export the log data to CSV.

kge dump trace <folder>

The command above yields CSV output such as this output for a training job or this output for a search job. Additional configuration options or metrics can be added to the CSV files as needed (using a keys file).

Information about a checkpoint (such as the configuration that was used, training loss, validation metrics, or explored hyperparameter configurations) can also be exported from the command line (as YAML):

kge dump checkpoint <checkpoint>

Configuration files can also be dumped in various formats.

# dump just the configuration options that are different from the default values
kge dump config <config-or-folder-or-checkpoint>

# dump the configuration as is
kge dump config <config-or-folder-or-checkpoint> --raw

# dump the expanded config including all configuration keys
kge dump config <config-or-folder-or-checkpoint> --full

Help and other commands

# help on all commands
kge --help

# help on a specific command
kge dump --help

Use a pretrained model in an application

Using a model trained with LibKGE is straightforward. In the following example, we load a checkpoint and predict the most suitable object for two subject-relation pairs: ('Dominican Republic', 'has form of government', ?) and ('Mighty Morphin Power Rangers', 'is tv show with actor', ?).

import torch
from kge.model import KgeModel
from kge.util.io import load_checkpoint

# download link for this checkpoint given under results above
checkpoint = load_checkpoint('fb15k-237-rescal.pt')
model = KgeModel.create_from(checkpoint)

s = torch.Tensor([0, 2,]).long()             # subject indexes
p = torch.Tensor([0, 1,]).long()             # relation indexes
scores = model.score_sp(s, p)                # scores of all objects for (s,p,?)
o = torch.argmax(scores, dim=-1)             # index of highest-scoring objects

print(o)
print(model.dataset.entity_strings(s))       # convert indexes to mentions
print(model.dataset.relation_strings(p))
print(model.dataset.entity_strings(o))

# Output (slightly revised for readability):
#
# tensor([8399, 8855])
# ['Dominican Republic'        'Mighty Morphin Power Rangers']
# ['has form of government'    'is tv show with actor']
# ['Republic'                  'Johnny Yong Bosch']

For other scoring functions (score_sp, score_po, score_so, score_spo), see KgeModel.

Use your own dataset

To use your own dataset, create a subfolder mydataset (= dataset name) in the data folder. You can use your dataset later by specifying dataset.name: mydataset in your job's configuration file.

Each dataset is described by a dataset.yaml file, which needs to be stored in the mydataset folder. After performing the quickstart instructions, have a look at the provided toy example under data/toy/dataset.yaml. The configuration keys and file formats are documented here.

Your data can be automatically preprocessed and converted into the format required by LibKGE. Here are the relevant steps for the toy dataset:

# download
curl -O http://web.informatik.uni-mannheim.de/pi1/kge-datasets/toy.tar.gz
tar xvf toy.tar.gz

# preprocess
python preprocess/preprocess_default.py toy
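
Conceptually, preprocessing maps entity and relation names to integer indexes, which is what the models operate on (compare the index tensors in the pretrained-model example). The sketch below illustrates this idea on a few in-memory lines. It assumes tab-separated subject, relation, object triples and uses a hypothetical helper `index_triples`; the actual file formats and the behavior of the preprocessing script are described in the dataset documentation.

```python
from typing import Dict, List, Tuple


def index_triples(
    lines: List[str],
) -> Tuple[Dict[str, int], Dict[str, int], List[Tuple[int, int, int]]]:
    """Assign consecutive indexes to entities/relations in order of appearance."""
    entity_ids: Dict[str, int] = {}
    relation_ids: Dict[str, int] = {}
    triples: List[Tuple[int, int, int]] = []
    for line in lines:
        s, p, o = line.rstrip("\n").split("\t")
        s_id = entity_ids.setdefault(s, len(entity_ids))
        p_id = relation_ids.setdefault(p, len(relation_ids))
        o_id = entity_ids.setdefault(o, len(entity_ids))
        triples.append((s_id, p_id, o_id))
    return entity_ids, relation_ids, triples


# Hypothetical raw triples (assumed format: subject<TAB>relation<TAB>object).
raw = ["a\tr1\tb", "b\tr2\tc", "a\tr1\tc"]
entity_ids, relation_ids, triples = index_triples(raw)
print(entity_ids, relation_ids, triples)
```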

Currently supported KGE models

LibKGE currently implements the KGE models listed in features.

The examples folder contains some configuration files as examples of how to train these models.

We welcome contributions to expand the list of supported models! Please see CONTRIBUTING for details and feel free to initially open an issue.

Extending LibKGE

LibKGE can be extended with new training, evaluation, or search jobs as well as new models and embedders.

KGE models implement the KgeModel class and generally consist of a KgeEmbedder to associate each subject, relation and object to an embedding and a KgeScorer to score triples given their embeddings. All these base classes are defined in kge_model.py.
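
The division of labor between embedders and a scorer can be pictured with the toy sketch below. All class names here (ToyLookupEmbedder, ToyDistMultScorer, ToyModel) are made up for illustration and use plain Python lists instead of tensors; they do not reproduce the interfaces of KgeModel, KgeEmbedder, or KgeScorer.

```python
from typing import List


class ToyLookupEmbedder:
    """Maps an integer index to its embedding vector."""

    def __init__(self, vectors: List[List[float]]):
        self.vectors = vectors

    def embed(self, index: int) -> List[float]:
        return self.vectors[index]

    def embed_all(self) -> List[List[float]]:
        return self.vectors


class ToyDistMultScorer:
    """Scores a triple as sum_i s_i * p_i * o_i (DistMult-style)."""

    def score(self, s: List[float], p: List[float], o: List[float]) -> float:
        return sum(a * b * c for a, b, c in zip(s, p, o))


class ToyModel:
    """Combines two embedders and a scorer."""

    def __init__(self, entities: ToyLookupEmbedder, relations: ToyLookupEmbedder,
                 scorer: ToyDistMultScorer):
        self.entities, self.relations, self.scorer = entities, relations, scorer

    def score_spo(self, s: int, p: int, o: int) -> float:
        return self.scorer.score(
            self.entities.embed(s), self.relations.embed(p), self.entities.embed(o)
        )

    def score_sp(self, s: int, p: int) -> List[float]:
        """Scores of all entities as objects for the query (s, p, ?)."""
        s_emb, p_emb = self.entities.embed(s), self.relations.embed(p)
        return [self.scorer.score(s_emb, p_emb, o) for o in self.entities.embed_all()]


model = ToyModel(
    ToyLookupEmbedder([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    ToyLookupEmbedder([[2.0, 3.0]]),
    ToyDistMultScorer(),
)
scores = model.score_sp(s=2, p=0)
best_object = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_object)
```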

KGE jobs perform training, evaluation, and hyper-parameter search. The relevant base classes are Job, TrainingJob, EvaluationJob, and SearchJob.

To add a component, say mycomp (= a model, embedder, or job) with implementation MyClass, you need to:

  1. Create a configuration file mycomp.yaml. You may store this file directly in the LibKGE module folders (e.g., <kge-home>/kge/model/) or in your own module folder. If you plan to contribute your code to LibKGE, we suggest developing directly in the LibKGE module folders. If you just want to play around or publish your code separately from LibKGE, use your own module.

  2. Define all required options for your component, their default values, and their types in mycomp.yaml. We suggest following LibKGE's core philosophy and defining every option that can influence the outcome of an experiment in this way. Pay attention to integer (0) vs. float (0.0) values; e.g., float_option: 0 is incorrect because it is interpreted as an integer.

  3. Implement MyClass in a module of your choice. In mycomp.yaml, add key mycomp.class_name with value MyClass. If you follow LibKGE's directory structure (mycomp.yaml for configuration and mycomp.py for implementation), then ensure that MyClass is imported in __init__.py (e.g., as done here).

  4. To use your component in an experiment, register your module via the modules key and its configuration via the import key in the experiment's configuration file. See config-default.yaml for a description of those keys. For example, in myexp_config.yaml, add:

    modules: [ kge.job, kge.model, kge.model.embedder, mymodule ]
    import: [ mycomp ]
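
The integer-versus-float pitfall from step 2 can be illustrated with a small sketch of a strict type check in which the type of the default value determines which values are accepted later. The helper `matches_declared_type` is hypothetical and only mimics the idea; it is not LibKGE's validation code.

```python
def matches_declared_type(default, value) -> bool:
    """Accept a value only if it has exactly the type of the declared default."""
    return type(value) is type(default)


# A default written as `float_option: 0` is parsed as an integer ...
defaults_wrong = {"float_option": 0}
# ... whereas `float_option: 0.0` is parsed as a float.
defaults_right = {"float_option": 0.0}

# Later, an experiment tries to set float_option to 0.5.
rejected = not matches_declared_type(defaults_wrong["float_option"], 0.5)
accepted = matches_declared_type(defaults_right["float_option"], 0.5)
print(rejected, accepted)
```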

FAQ

Are the configuration options documented somewhere?

Yes, see config-default.yaml as well as the configuration files for each component listed above.

Are the command line options documented somewhere?

Yes, try kge --help. You may also obtain help for subcommands, e.g., try kge dump --help or kge dump trace --help.

LibKGE runs out of memory. What can I do?

  • For training, set train.subbatch_auto_tune to true (equivalent result, less memory but slower).
  • For evaluation, set entity_ranking.chunk_size to, say, 10000 (equivalent result, less memory but slightly slower, the more so the smaller the chunk size).
  • Change hyperparameters (non-equivalent result): e.g., decrease the batch size, use negative sampling, or use fewer samples.
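
To see why a smaller chunk size can give an equivalent result with less memory, consider the following sketch. It ranks one entity by scoring the candidates in slices, so that only one slice of scores is held at a time. The names `rank_in_chunks` and `score_slice` and the scores are hypothetical; this is not LibKGE's evaluation code.

```python
from typing import Callable, List

# Hypothetical scores of six candidate entities for one query.
ALL_SCORES = [0.3, 0.9, 0.1, 0.7, 0.5, 0.8]


def score_slice(start: int, end: int) -> List[float]:
    """Stand-in for scoring the candidates start..end-1 with a model."""
    return ALL_SCORES[start:end]


def rank_in_chunks(score_fn: Callable[[int, int], List[float]], num_entities: int,
                   true_index: int, chunk_size: int) -> int:
    """Rank of entity `true_index`, scoring at most `chunk_size` entities at once."""
    true_score = score_fn(true_index, true_index + 1)[0]
    higher = 0
    for start in range(0, num_entities, chunk_size):
        end = min(start + chunk_size, num_entities)
        chunk = score_fn(start, end)          # only this slice is materialized
        higher += sum(1 for score in chunk if score > true_score)
    return higher + 1


full = rank_in_chunks(score_slice, 6, true_index=4, chunk_size=6)
chunked = rank_in_chunks(score_slice, 6, true_index=4, chunk_size=2)
print(full, chunked)
```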

Known issues

Changelog

See here.

Other KGE frameworks

Other KGE frameworks:

KGE projects for publications that also implement a few models:

PRs to this list are welcome.

How to cite

Please cite the following publication to refer to the experimental study about the impact of training methods on KGE performance:

@inproceedings{
  ruffinelli2020you,
  title={You {CAN} Teach an Old Dog New Tricks! On Training Knowledge Graph Embeddings},
  author={Daniel Ruffinelli and Samuel Broscheit and Rainer Gemulla},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=BkxSmlBFvr}
}

If you use LibKGE, please cite the following publication:

@inproceedings{
  libkge,
  title="{L}ib{KGE} - {A} Knowledge Graph Embedding Library for Reproducible Research",
  author={Samuel Broscheit and Daniel Ruffinelli and Adrian Kochsiek and Patrick Betz and Rainer Gemulla},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year={2020},
  url={https://www.aclweb.org/anthology/2020.emnlp-demos.22},
  pages = "165--174",
}

kge's People

Contributors

adrianks, cthoyt, fniesel, kenkoko, mayo42, nluedema, nzteb, ppoffice, rgemulla, rufex2001, sailera19, samuelbroscheit, sanxing-chen, sfschouten, tillgeissler, tsafavi, unmeshvrije, vonvogelstein

kge's Issues

label smoothing broken

The default value of -1.0 in config-default (which disables it) is no longer accepted. This breaks all configurations that do not explicitly set it. It is also unclear how to disable label smoothing right now.

Probably introduced by 2b082b7

Improve scalability of unweighted regularization

Right now, weighted regularization (which regularizes batch entities) scales better than unweighted regularization (which regularizes all entities) when negative sampling is used.

Whether we regularize only the batch entities or all entities, and whether regularization is unweighted or weighted, should be configurable independently.

Use entry_points to make vanity CLI script

Rather than providing kge.py, you can use the console_scripts entry point to define a CLI that automatically gets installed and made available in the shell with pip install .

Would be happy to send a PR!

Computation of penalty terms

Commit e667cf9 changes the way penalties are interpreted for many models. The penalty term is currently computed only once per embedding, but with this change it's computed twice if subject and object embedder are the same (a common case). Instead of calling penalty twice, the code should check whether they are the same and, if so, call penalty only once.
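
A minimal sketch of the suggested check, using a made-up ToyEmbedder class and a hypothetical `total_penalty` helper rather than the actual model code:

```python
from typing import List


class ToyEmbedder:
    """Stand-in for an embedder with an L2-style penalty term."""

    def __init__(self, weights: List[float]):
        self.weights = weights

    def penalty(self) -> float:
        return sum(w * w for w in self.weights)


def total_penalty(subject_embedder: ToyEmbedder, object_embedder: ToyEmbedder) -> float:
    """Sum the penalties, counting a shared embedder only once."""
    total = subject_embedder.penalty()
    if object_embedder is not subject_embedder:   # identity check, not equality
        total += object_embedder.penalty()
    return total


shared = ToyEmbedder([1.0, 2.0])
other = ToyEmbedder([3.0])
print(total_penalty(shared, shared), total_penalty(shared, other))
```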

Add weighted regularization

Add possibility to regularize each entity/relation embedding proportional to its inverse frequency in the training data in lookup_embedder.penalty().

This may be controlled with a Boolean option lookup_embedder.regularize_weighted or so (default: False).

Technically, LookupEmbedder should take an additional argument vocab_weights, defaulting to None.
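
A rough sketch of what this could look like, with plain lists instead of tensors. The helper names are hypothetical, and giving weight 0 to entries that never occur in the training data is just one possible choice:

```python
from collections import Counter
from typing import List, Optional


def inverse_frequency_weights(indexes: List[int], vocab_size: int) -> List[float]:
    """Weight 1/frequency per index; indexes that never occur get weight 0."""
    counts = Counter(indexes)
    return [1.0 / counts[i] if counts[i] > 0 else 0.0 for i in range(vocab_size)]


def l2_penalty(embeddings: List[List[float]],
               vocab_weights: Optional[List[float]] = None) -> float:
    """Squared L2 norm per embedding, optionally scaled by vocab_weights."""
    if vocab_weights is None:                     # unweighted regularization
        vocab_weights = [1.0] * len(embeddings)
    return sum(
        weight * sum(x * x for x in embedding)
        for weight, embedding in zip(vocab_weights, embeddings)
    )


# Entity 0 occurs twice in the training data, entity 1 once, entity 2 never.
weights = inverse_frequency_weights([0, 0, 1], vocab_size=3)
embeddings = [[2.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
print(weights, l2_penalty(embeddings), l2_penalty(embeddings, weights))
```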

Let auto search output best run

Currently, only a parameter estimate and a metric estimate are output. It would be helpful to also output information about the best actual run.

Cannot set different dropout, initialization (or, for Tucker, embedding sizes) for entities and relations with current config style

I propose the following schema.

model:
  type: complex               
  class_name: ComplEx
  entity_embedder: <lookup_embedder>
  relation_embedder: <lookup_embedder>

  grid_search_tied_attributes: [
      [ 'model.entity_embedder.dim', 'model.relation_embedder.dim' ],
      [ 'model.entity_embedder.sparse', 'model.relation_embedder.sparse' ],
      [ 'model.entity_embedder.normalize', 'model.relation_embedder.normalize' ],
      [ 'model.entity_embedder.initialize', 'model.relation_embedder.initialize' ],
  ]

where CONFIG_ITEM: <NAME> triggers copying the NAME config from the core into CONFIG_ITEM, e.g. <lookup_embedder> from core into entity_embedder, so automatically expanded this looks like

model:
  type: complex               
  class_name: ComplEx
  entity_embedder:
    type: lookup_embedder 
    dim: 100                    # entity dimensionality or [ entity, relation ] dimensionality
    initialize: normal          # xavier, uniform, normal
    initialize_arg: 0.1         # gain for Xavier, range for uniform, stddev for Normal
    dropout: 0.                 # dropout used for embeddings
    sparse: False               # ??
    normalize: ''               # alternatively: normalize '', L2
  relation_embedder: 
    type: lookup_embedder 
    dim: 100                    # entity dimensionality or [ entity, relation ] dimensionality
    initialize: normal          # xavier, uniform, normal
    initialize_arg: 0.1         # gain for Xavier, range for uniform, stddev for Normal
    dropout: 0.                 # dropout used for embeddings
    sparse: False               # ??
    normalize: ''               # alternatively: normalize '', L2

  grid_search_tied_attributes: [
      [ 'model.entity_embedder.dim', 'model.relation_embedder.dim' ],
      [ 'model.entity_embedder.sparse', 'model.relation_embedder.sparse' ],
      [ 'model.entity_embedder.normalize', 'model.relation_embedder.normalize' ],
      [ 'model.entity_embedder.initialize', 'model.relation_embedder.initialize' ],
  ]

This also relieves us from having many if-else branches, e.g., the "if embedder == 'lookup'" ...

grid_search_tied_attributes tells grid search that if we search over values for model.entity_embedder.dim, then we copy/tie them to model.relation_embedder.dim. This is what I currently do in my code.
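
The expansion of `<NAME>` placeholders could be sketched as follows. The function `expand_references` and the reduced `core` dictionary are hypothetical and only illustrate the proposal; each placeholder receives its own copy, so entity and relation embedders can then be configured independently.

```python
import copy
from typing import Any, Dict


def expand_references(config: Dict[str, Any], core: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Replace values of the form '<name>' by a copy of core[name] plus a 'type' key."""
    expanded: Dict[str, Any] = {}
    for key, value in config.items():
        if isinstance(value, dict):
            expanded[key] = expand_references(value, core)
        elif (isinstance(value, str) and len(value) > 2
              and value.startswith("<") and value.endswith(">")):
            name = value[1:-1]
            expanded[key] = {"type": name, **copy.deepcopy(core[name])}
        else:
            expanded[key] = value
    return expanded


# Reduced, hypothetical core configuration of the lookup embedder.
core = {"lookup_embedder": {"dim": 100, "dropout": 0.0}}
config = {
    "model": {
        "type": "complex",
        "entity_embedder": "<lookup_embedder>",
        "relation_embedder": "<lookup_embedder>",
    }
}
expanded = expand_references(config, core)
expanded["model"]["entity_embedder"]["dim"] = 200   # does not affect relations
print(expanded)
```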

Serialize dataset and indexes

When loading a dataset, LibKGE parses the text files holding the raw data and creates indexes. This may take a while.

To speed things up, datasets and indexes should be pickled to the dataset's folder when loaded for the first time. On subsequent loads, we can directly use the pickled files, which should be much faster.
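
A minimal sketch of such a cache, using a hypothetical helper `load_cached` and a made-up file name:

```python
import os
import pickle
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def load_cached(cache_path: str, build: Callable[[], T]) -> T:
    """Return the pickled object at cache_path; build and store it if missing."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    obj = build()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f)
    return obj


calls = []


def parse_raw_files():
    """Stand-in for the slow parsing of the raw text files."""
    calls.append("parsed")
    return {"train": [(0, 0, 1), (1, 1, 2)]}


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "train.pckl")       # hypothetical file name
    first = load_cached(path, parse_raw_files)      # parses and writes the cache
    second = load_cached(path, parse_raw_files)     # served from the pickle

print(first == second, len(calls))
```

A real implementation would also need to detect stale caches, for example by comparing modification times of the raw files and the pickled file.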

fixed_parameters may be ignored

The new fixed_parameters feature below is only applied when num_sobol_trials is set. Perhaps the best approach is to change our implementation to always create the generation_strategy manually (i.e., also when num_sobol_trials=-1).

fixed_parameters: []

Improve seeding with ax

Make the SOBOL seed configurable as ax_search.sobol_seed, defaulting to 0.

Also, seed 0 is currently used twice in the code. It looks like that's for different purposes, which is problematic (not sure).

Verify sparse regularization

The sparse regularization implementation in lookup embedders looks fishy: it seems to assume that embed is only called once per batch.

It may also lead to incorrect results for 1toN, where every entity is embedded but only the entities/relations in the batch triples should be regularized.

A better design would be to pass the set of used entities to the penalty function (e.g., as kw arguments so_indexes and p_indexes); the lookup embedder can then pick it up there.

Code review of negative sampling

Sampler:

  • Merge _sample and call (no need to have both)
  • Use config keys negative_sampling.sampler (for the type) and negative_sampling.num_e. We can add num_{s,r,o} later in case we really need it. Document the keys in the config file.
  • The sampling API does not look ideal. I suggest giving it a batch directly (n x 3 tensor); it should then output the samples in an (n x (3+2*num_e)) tensor. This allows for a faster implementation.

Negative sampling job:

  • Do not call model.prepare_job and set is_prepared (done in superclass)
  • For collate, I suggest columns s,p,o,neg_s,neg_o (n x (3+2*num_e) tensor). This would then match the sampler when changed as above. Also, no labels are needed any more (each row has exactly one positive and 2*num_e negatives).
  • As for computing the loss with ranking, it may be most efficient to use a custom loss function.

Loss:

  • Move NotImplementedError to constructor.
  • I suggest having compute_1toN and compute_negative_sampling functions (with different arguments and separate documentation).
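
The proposed row layout s,p,o,neg_s,neg_o can be sketched as follows, with nested lists standing in for the n x (3+2*num_e) tensor. The helper `add_negatives` is hypothetical and samples uniformly, so it may occasionally draw the true subject or object.

```python
import random
from typing import List


def add_negatives(batch: List[List[int]], num_entities: int, num_e: int,
                  rng: random.Random) -> List[List[int]]:
    """Extend each (s, p, o) row by num_e negative subjects and num_e negative objects."""
    rows = []
    for s, p, o in batch:
        neg_s = [rng.randrange(num_entities) for _ in range(num_e)]
        neg_o = [rng.randrange(num_entities) for _ in range(num_e)]
        rows.append([s, p, o] + neg_s + neg_o)
    return rows


batch = [[0, 0, 1], [2, 1, 3]]                     # n = 2 positive triples
rows = add_negatives(batch, num_entities=10, num_e=4, rng=random.Random(0))
for row in rows:
    print(row[:3], row[3:7], row[7:])              # positive, neg_s, neg_o
```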

Config bug: Trying to read type from embedder leads to exception for inverse_relations_model

When model is set to 'model': 'inverse_relations_model' the following code

if config.get(self.configuration_key + ".relation_embedder.type") == 'projection_embedder':

in sd_rescal throws the error

Traceback (most recent call last):
  File "/home/sbrosche/anaconda3/lib/python3.7/concurrent/futures/process.py", line 232, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/sbrosche/PycharmProjects/kge/kge/job/search.py", line 126, in _run_train_job
    job = Job.create(train_job_config, search_job.dataset, parent_job=search_job)
  File "/home/sbrosche/PycharmProjects/kge/kge/job/job.py", line 48, in create
    job = TrainingJob.create(config, dataset, parent_job)
  File "/home/sbrosche/PycharmProjects/kge/kge/job/train.py", line 88, in create
    return TrainingJob1toN(config, dataset, parent_job)
  File "/home/sbrosche/PycharmProjects/kge/kge/job/train.py", line 396, in __init__
    super().__init__(config, dataset, parent_job)
  File "/home/sbrosche/PycharmProjects/kge/kge/job/train.py", line 34, in __init__
    self.model = KgeModel.create(config, dataset)
  File "/home/sbrosche/PycharmProjects/kge/kge/model/kge_model.py", line 324, in create
    model = getattr(module, class_name)(config, dataset, configuration_key)
  File "/home/sbrosche/PycharmProjects/kge/kge/model/inverse_relations_model.py", line 33, in __init__
    config, alt_dataset, self.configuration_key + ".base_model"
  File "/home/sbrosche/PycharmProjects/kge/kge/model/kge_model.py", line 324, in create
    model = getattr(module, class_name)(config, dataset, configuration_key)
  File "/home/sbrosche/PycharmProjects/kge/kge/model/sd_rescal.py", line 224, in __init__
    if config.get(self.configuration_key + ".relation_embedder.type") == 'projection_embedder':
  File "/home/sbrosche/PycharmProjects/kge/kge/config.py", line 43, in get
    result = result[name]
KeyError: 'type'

Revert 26a4d2a

Revert 26a4d2a. This commit creates a configuration mess, and all possible initializers and options need to be specified a priori. An alternative to obtain this behavior (why is it needed, though?) might be to use

initialize_args.x -> to pass along option x
initialize_args.normal_.x -> to pass along option x only when initialize is normal

This way, initializers and their options do not need to be listed in the config file (and they should not be listed there).
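
One way to read this proposal is sketched below. The function `resolve_initialize_args` is hypothetical, and letting initializer-specific options override general ones is an assumption of this sketch.

```python
from typing import Any, Dict


def resolve_initialize_args(initialize: str, initialize_args: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the options to pass to the chosen initializer.

    Plain entries apply to every initializer; a nested dict keyed by an
    initializer name applies only when that initializer is selected.
    """
    resolved = {k: v for k, v in initialize_args.items() if not isinstance(v, dict)}
    specific = initialize_args.get(initialize)
    if isinstance(specific, dict):
        resolved.update(specific)      # assumption: specific options win
    return resolved


# initialize_args.x = 1, initialize_args.normal_.x = 2, initialize_args.normal_.std = 0.1
initialize_args = {"x": 1, "normal_": {"x": 2, "std": 0.1}}
for_normal = resolve_initialize_args("normal_", initialize_args)
for_uniform = resolve_initialize_args("uniform_", initialize_args)
print(for_normal, for_uniform)
```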

Config does not work as expected

sd_rescal_tucker3.yaml:

import: [lookup_embedder, projection_embedder]

sd_rescal_tucker3:
  class_name: SparseDiagonalRescal
  blocks: -1
  block_size: -1
  entity_embedder:
    type: lookup_embedder
    dim: -1  # determine automatically
    +++: +++
  relation_embedder:
    type: projection_embedder
    dim: -1  # determine automatically
    +++: +++

toy-sdrescal-tucker3.yaml

import: sd_rescal_tucker3
model: sd_rescal_tucker3

sd_rescal_tucker3:
  class_name: SparseDiagonalRescal
  blocks: 4
  block_size: 16
  entity_embedder:
    initialize: auto_initialization
  relation_embedder:
    initialize: auto_initialization
    dim: -1  # determine automatically

This throws an error

           relation_embedder = ".base_embedder.relation_embedder"

...

            config.set(
                self.configuration_key + relation_embedder + ".initialize",
                "normal_",
                log=True,
            )

Error:

Traceback (most recent call last):
  File "/home/samuel/PycharmProjects/kge/kge.py", line 200, in <module>
    job = Job.create(config, dataset)
  File "/home/samuel/PycharmProjects/kge/kge/job/job.py", line 48, in create
    job = TrainingJob.create(config, dataset, parent_job)
  File "/home/samuel/PycharmProjects/kge/kge/job/train.py", line 86, in create
    return TrainingJob1toN(config, dataset, parent_job)
  File "/home/samuel/PycharmProjects/kge/kge/job/train.py", line 394, in __init__
    super().__init__(config, dataset, parent_job)
  File "/home/samuel/PycharmProjects/kge/kge/job/train.py", line 32, in __init__
    self.model = KgeModel.create(config, dataset)
  File "/home/samuel/PycharmProjects/kge/kge/model/kge_model.py", line 322, in create
    model = getattr(module, class_name)(config, dataset, configuration_key)
  File "/home/samuel/PycharmProjects/kge/kge/model/sd_rescal.py", line 261, in __init__
    log=True,
  File "/home/samuel/PycharmProjects/kge/kge/config.py", line 143, in set
    create = create or "+++" in data[splits[i]]
KeyError: 'base_embedder'

It's not clear why this doesn't work.

Bug resuming ConvE

resume-conve-bug.yaml.zip

I attach a ConvE file which can be used to reproduce the resuming bug. Start with:

kge start resume-conve-bug.yaml --folder experiments/resume-conve-bug

Then stop after the first trial has finished, and resume with:

kge resume experiments/resume-conve-bug/config.yaml

Two folders named data

One in the root folder, one inside the kge folder. Perhaps rename the one inside kge to kge/dataset?

Trace every job with a unique run id

4233993 introduced a git commit field into the configuration. We should remove this field from there.

The field is misleading in the configuration because (i) our code does not ensure that exactly this commit is actually used and (ii) a model may be resumed with a different commit.

I suggest clearly separating configuration and environment:

  • Give every job a unique id. Add this id to each trace entry automatically.

  • Whenever a job starts, add a trace entry which has the job id, type, etc. as well as other environment variables (start time, git commit, user, machine, ...). This way, we can keep track of changing environments (e.g., when a job is resumed).
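
A minimal sketch of the two points above, assuming trace entries are plain dictionaries. `TracedJob` and its field names are hypothetical and only illustrate the separation of configuration and environment.

```python
import os
import platform
import time
import uuid


class TracedJob:
    """Hypothetical job wrapper that stamps every trace entry with a job id."""

    def __init__(self, job_type: str):
        self.job_id = str(uuid.uuid4())  # unique per job instance
        self.job_type = job_type
        self.trace_entries = []

    def trace(self, **entry) -> dict:
        # The job id is added automatically to each trace entry.
        entry = {"job_id": self.job_id, **entry}
        self.trace_entries.append(entry)
        return entry

    def start(self, git_commit: str = "unknown") -> dict:
        # Environment information goes into the trace, not the configuration.
        return self.trace(
            event="job_started",
            type=self.job_type,
            start_time=time.time(),
            git_commit=git_commit,
            user=os.environ.get("USER", "unknown"),
            machine=platform.node(),
        )


job = TracedJob("train")
job.start()
job.trace(event="epoch_completed", epoch=1)
```

A resumed job would call `start` again and thereby record the environment it is resumed in.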

Improve error message for config typos

For a config like

job.type: train
dataset.name: toy
model: complex

complex:
  relation_embedders: # <--- typo relation_embedder*s* 
    regularize_args:
      p: 1

the error message is very technical

KeyError: 'complex.relation_embedders.regularize_args cannot be set because creation of complex.relation_embedders is not permitted'

which could be improved to say

KeyError: 'Key "complex.relation_embedders" does not exist. Parent key "complex" does not allow creation of new keys. If the creation of "complex.relation_embedders" was intended, then "complex" should have the "+++" attribute.'
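
The improved message could be produced along these lines. This is a sketch over a plain nested dictionary, not the actual `Config.set` code; `set_key` is a hypothetical name, and the only behavior assumed is that a parent permits new keys exactly when it contains `+++`.

```python
def set_key(options: dict, key: str, value) -> None:
    """Set a dotted key; new keys may only be created below '+++' parents."""
    splits = key.split(".")
    data = options
    for i, name in enumerate(splits):
        if name not in data and "+++" not in data:
            parent = ".".join(splits[:i]) or "<root>"
            missing = ".".join(splits[: i + 1])
            raise KeyError(
                f'Key "{missing}" does not exist. Parent key "{parent}" does '
                f"not allow creation of new keys. If the creation of "
                f'"{missing}" was intended, then "{parent}" should have the '
                f'"+++" attribute.'
            )
        if i == len(splits) - 1:
            data[name] = value
        else:
            data = data.setdefault(name, {})


options = {"complex": {"relation_embedder": {"regularize_args": {"+++": "+++"}}}}

# Correct key: "p" is new, but its parent carries "+++".
set_key(options, "complex.relation_embedder.regularize_args.p", 1)

# Typo: reports the first missing key and its parent.
try:
    set_key(options, "complex.relation_embedders.regularize_args.p", 1)
except KeyError as e:
    message = e.args[0]
```

Reporting the first missing key together with its parent points directly at the typo.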

Output partial results in grid job

Grid search validation outputs are currently copied to the trace file of the grid search job only once grid search completes. This makes it unnecessarily hard to track the state of the search; all validation results should be forwarded immediately as they are computed.

Batch size and penalty terms

7aa82ce introduces a patch that divides penalty terms by the number of batches to keep the penalty terms consistent. This needs discussion.

In particular: we average example losses/gradients over the batch. Thus, before 7aa82ce:

E[gradient] = E[gradient of a random example] + gradient of penalty term

That's independent of the number of batches. After 7aa82ce:

E[gradient] = E[gradient of a random example] + gradient of penalty term / num_batches

That's dependent on the number of batches. This patch thus seems to introduce what it tries to avoid.
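
The argument can be checked numerically on a toy problem. The sketch below is not taken from the library: it assumes a one-dimensional parameter `theta`, example loss `0.5 * (theta - x)**2`, penalty `0.5 * lam * theta**2`, and equally sized batches, and it averages the per-step gradient over one epoch.

```python
from fractions import Fraction


def expected_step_gradient(xs, theta, lam, batch_size, divide_by_num_batches):
    """Average per-step gradient over all batches of one epoch (toy model)."""
    batches = [xs[i : i + batch_size] for i in range(0, len(xs), batch_size)]
    penalty_grad = lam * theta
    if divide_by_num_batches:
        # Behavior attributed to the patch: scale the penalty by 1 / num_batches.
        penalty_grad = penalty_grad / len(batches)
    step_grads = [
        sum(theta - x for x in batch) / len(batch) + penalty_grad
        for batch in batches
    ]
    return sum(step_grads) / len(step_grads)


# Exact arithmetic avoids floating-point noise in the comparison.
xs = [Fraction(v) for v in range(8)]
theta, lam = Fraction(1), Fraction(1, 10)

before = [expected_step_gradient(xs, theta, lam, b, False) for b in (2, 4)]
after = [expected_step_gradient(xs, theta, lam, b, True) for b in (2, 4)]

print(before[0] == before[1])  # True: independent of the number of batches
print(after[0] == after[1])    # False: depends on the number of batches
```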

Code review: reimplementation of auto_initialization

Please have a look at 6e6f004. auto_initialization is now a type of initializer. In practice this essentially means the auto_initialize flag which existed in some models has been replaced with lookup_embedder.initialize="lookup_embedder". I think nothing has been broken, except that old config files with the auto_initialize flag will no longer work.

Also, Samuel suggested exporting all auto_initialization logic to a separate function in, say, util.py. But this would require considerations for the differences across models.

Add devicepool to parallel search

We should use a device pool to distribute parallel search jobs to different devices, e.g.

device_pool = ['cpu', 'gpu#1', 'gpu#1', 'gpu#1', 'gpu#2', 'gpu#2', 'gpu#2']
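
One way such a pool could work is sketched below, under the assumption that listing a device n times means it may run n jobs at once. `run_with_device_pool` is a hypothetical helper; it uses threads for brevity, whereas real search jobs would more likely run in separate processes.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def run_with_device_pool(jobs, device_pool):
    """Run each job on a device slot borrowed from the pool (hypothetical helper)."""
    free_devices = queue.Queue()
    for device in device_pool:
        free_devices.put(device)

    def run(job):
        device = free_devices.get()  # blocks until a device slot is free
        try:
            return job(device)
        finally:
            free_devices.put(device)  # hand the slot back to the pool

    # One worker per pool entry, so every slot can be busy at the same time.
    with ThreadPoolExecutor(max_workers=len(device_pool)) as executor:
        return list(executor.map(run, jobs))


device_pool = ["cpu", "gpu#1", "gpu#1", "gpu#1", "gpu#2", "gpu#2", "gpu#2"]
# Placeholder jobs that only report which device they were given.
jobs = [lambda device, i=i: (i, device) for i in range(10)]
results = run_with_device_pool(jobs, device_pool)
```

Each job receives its device as an argument, so the search code itself does not need to know how many devices exist.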

API changes Config

  1. Get rid of get_option so that we use one API for accessing the config everywhere and don't have two names "config/option" for the same thing.
  2. Make config_key an optional kwarg for config.get_default and do the check from get_option there.
  3. Merge get and get_default
  4. Rename default to resolve_type and make a boolean argument get(..., resolve_type=True, ...) as it is done for set(..., create=True, ...). Then we have one get and one set with either an option to have a resolve_type/create behaviour or not.

Support "half-ranks" during evaluation

During evaluation: If a true answer has a half-rank (such as 1.5), it is currently counted as rank 2.

Example: scores are (3,+10,*10,5). If the true answer is + or *, returned rank should be 1.5 in both cases (right now, it's 2 in both cases).

Internally, fixing this problem requires a change of the histogram layout.
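
For reference, the intended rank can be computed as below. This is a plain-Python sketch of the tie handling only, with `tie_aware_rank` as a hypothetical name; it does not touch the histogram.

```python
def tie_aware_rank(scores, true_index):
    """Mean rank of the true answer when tied scores share their positions."""
    true_score = scores[true_index]
    num_greater = sum(1 for s in scores if s > true_score)
    # Other answers with the same score; the true answer itself is excluded.
    num_ties = sum(1 for s in scores if s == true_score) - 1
    return 1 + num_greater + num_ties / 2


scores = [3, 10, 10, 5]  # the example above: "+" is index 1, "*" is index 2
print(tie_aware_rank(scores, 1))  # 1.5
print(tie_aware_rank(scores, 2))  # 1.5
```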

Allow to select checkpoint for resuming

This is necessary, for example, to run test evaluation on the best checkpoint instead of the last.

Suggested implementation:

  • add an optional --checkpoint option to kge resume
  • when present, use this checkpoint id (such as "100" or "best")
  • when absent, use "best" if present and it's an eval job, else use latest checkpoint by default
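
The selection rules above can be sketched as follows. `select_checkpoint` is a hypothetical helper, and representing checkpoint ids as strings such as "100" or "best" is an assumption taken from the examples in the list.

```python
def select_checkpoint(available, requested=None, is_eval_job=False):
    """Pick a checkpoint id following the suggested rules (hypothetical helper).

    `available` holds checkpoint ids as strings, e.g. ["50", "100", "best"].
    """
    if requested is not None:
        # --checkpoint given: use exactly this checkpoint.
        if requested not in available:
            raise ValueError(f"checkpoint {requested!r} not found")
        return requested
    if is_eval_job and "best" in available:
        return "best"
    # Default: the latest numbered checkpoint.
    numbered = [c for c in available if c.isdigit()]
    if not numbered:
        raise ValueError("no numbered checkpoint available")
    return max(numbered, key=int)


available = ["50", "100", "best"]
print(select_checkpoint(available, requested="50"))   # 50
print(select_checkpoint(available, is_eval_job=True))  # best
print(select_checkpoint(available))                    # 100
```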

Problem with ConvE and "KgeModel.load_from_checkpoint(checkpoint)"

I'm having some problems that I cannot fully get to the bottom of, and I'm not sure whether I am doing something wrong, so I'd rather ask for help.

Training toy-conve-train.yaml and then loading it from any checkpoint with KgeModel.load_from_checkpoint(checkpoint) results in a long torch error about a dimension mismatch. (Resuming the ConvE job with the normal resume functionality works without problems; it does not use this function.)

The error is raised here https://github.com/uma-pi1/kge/blob/master/kge/model/kge_model.py#L368

I found out that this config
https://github.com/uma-pi1/kge/blob/master/kge/model/kge_model.py#L364
seems to differ in the key "entity_embedder.dim" from the original config with which the job was run. For example, when I replace it by loading the original config by hand, it seems to work.
The problem might be connected with the lines here
https://github.com/uma-pi1/kge/blob/master/kge/model/conve.py#L116
because when I commented them out, it seemed to work.

Provide "packaged" models

When sharing models, we currently need to share both dataset and checkpoint. For prediction, however, the dataset is solely used to obtain a mapping between entity and relation indexes and their ids or mentions.

A better approach may be to support "packaged models", where a package contains the checkpoint and just the relevant part of the dataset (which is much smaller than the entire dataset). With this, models can be deployed right away without having to have the dataset around.
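
As a rough illustration of what such a package might contain, the sketch below bundles a checkpoint with only the index-to-id lists. The function names, the package layout, and the use of `pickle` are all assumptions made for this example; they are not an existing format.

```python
import pickle


def package_model(checkpoint: dict, entity_ids: list, relation_ids: list) -> bytes:
    """Bundle a checkpoint with the index-to-id mappings needed for prediction."""
    package = {
        "checkpoint": checkpoint,
        # Position i holds the id (or mention) of the entity/relation with index i.
        "entity_ids": entity_ids,
        "relation_ids": relation_ids,
    }
    return pickle.dumps(package)


def load_package(data: bytes) -> dict:
    """Restore a package created by package_model."""
    return pickle.loads(data)


# Placeholder content; a real checkpoint would hold the model state.
data = package_model({"epoch": 100}, ["e0", "e1", "e2"], ["r0", "r1"])
package = load_package(data)
print(package["entity_ids"][2])  # e2
```

The training, validation, and test triples are left out entirely, which is what keeps the package small.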
