semagrow / kobe Goto Github PK

View Code? Open in Web Editor NEW

10.0 6.0 1.0 58.57 MB

Framework for benchmarking SPARQL query federators

Home Page: https://semagrow.github.io/kobe/

License: Apache License 2.0

Java 51.62% Shell 4.34% Dockerfile 0.30% Groovy 0.09% Go 43.01% Mustache 0.64%

semagrow benchmarking database-benchmarking federated distributed sparql big-data

kobe's Introduction

Semagrow

Semagrow is a federated SPARQL query processor that allows combining, cross-indexing and, in general, making the best out of all public data, regardless of their size, update rate, and schema.

Semagrow offers a single SPARQL endpoint that serves data from remote data sources and that hides from client applications heterogeneity in both form (federating non-SPARQL endpoints) and meaning (transparently mapping queries and query results between vocabularies).

The main difference between Semagrow and most existing distributed querying solutions is that Semagrow targets the federation of heterogeneous and independently provided data sources.

In other words, Semagrow aims to offer the most efficient distributed querying solution that can be achieved without controlling the way data is distributed between sources and, in general, without having the responsibility to centrally manage the data sources of the federation.

Getting Started

Building

Building Semagrow from sources requires to have a system with JDK8 and Maven 3.1 or higher.
Optionally, you may need a PostgreSQL as a requirement for the query transformation functionality.

To build Semagrow you should type:

$ mvn clean install

in the top-level project directory. This will result in jar file in the target directory of the respective module and in a war file in the target directory of the webgui module that can be deployed to the Servlet server of your choice.

Bundled with Apache Tomcat

Moreover, Semagrow can be build pre-bundled with the Apache Tomcat servlet server. To achieve that you could issue

$ mvn clean package -P tomcat-bundle

from the top-level directory of the project. This will result in a compressed file in the target directory of the assembly module containing a fully equipped Apache Tomcat with Semagrow pre-installed. However, please note that external dependencies such as the PostgresSQL database needs to be installed and run separately.

Building a Docker image from sources

You can also test your build deployed in a docker image (Docker 18.09 or newer required for building). To do so run at the project root directory:

$ DOCKER_BUILDKIT=1 docker build -t semagrow .

The produced image will be tagged as semagrow:latest and will contain Tomcat with Semagrow deployed.

Configuration

By default, Semagrow look for its configuration files in /etc/default/semagrow and expects to find at least a repository.ttl and a metadata.ttl file in order to establish a federation of endpoints. The repository.ttl describes the configuration of the Semagrow endpoint, while the metadata.ttl describes the endpoints to be federated. The repository.ttl configuration file also defines the location of the metadata.ttl that can be changed to the desired path.

Samples of these configuration files can be found as resources of the http module

Running Semagrow

Running Semagrow from the Apache Tomcat bundle

In order to run the bundle of Apache Tomcat with SemaGrow you should

uncompress the generated zip,
copy the files from the resources folder to /etc/default/semagrow and
run the startup.sh script located in the bin folder.

SemaGrow can be accessed at http://localhost:8080/SemaGrow/.

Running Semagrow using Docker

Semagrow has an official docker repository and official docker images are available in Docker Hub.

To run semagrow using the latest official docker image you should execute

$ docker run -d semagrow/semagrow

Howeover, you can also build your own docker image using the steps described in Section [Building](#### Building a Docker image from sources) The produced image will be tagged as semagrow and will contain Tomcat with Semagrow deployed.

To run the newly produced image you should execute

$ docker run -d semagrow

or if you want to test Semagrow with your configuration files (repository.ttl and metadata.ttl) issue

$ docker run -d -v /path/to/configuration:/etc/default/semagrow semagrow

In either case you can access Semagrow at http://<CONTAINER_IP>:8080/SemaGrow/ where <CONTAINER_IP> is the address assigned to the semagrow container and can be retrieved using docker inspect

Known issues

SemaGrow uses UNION instead of VALUES to implement the BindJoin operator. This fails in 4store 1.1.5 and previous versions in the presence of FILTER clauses due to an unsafe optimization by 4store.
When deploying in Glassfish 4 by coping the SemaGrow.war file in the autodeploy directory, Semagrow is accessible at http://DOMAIN/SemaGrow/index.jsp instead of http://DOMAIN/SemaGrow/

kobe's People

Contributors

Stargazers

Watchers

Forkers

kostbabis

kobe's Issues

Specify Istio version in kobectl

kobectl always downloads latest Istio version, however Istio 1.6 is used in kobectl commands. Istio download script supports specifying the version using ENV ISTIO_VERSION. It is better that we use it to specify a known working version and also provide documentation on how to change it.

related to #41

Resultset size in Virtuoso template

The result size of a request seems to be limited to 10,000 when utilizing the Virtuoso server.
If that's the case, is it possible to remove the default limit or increase the limit?

Issue identified by @chengsijin0817

Replace NFS for backing up databases

As an optimization, Kobe keeps copies of populated Virtuoso and other databases so that they do not need to be populated from scratch for each experiment. This is currently done on disk space served by NFS, although the files are not directly accessed on NFS but downloaded to the containers' local space and accessed there. It would be more efficient to use an FTP server or similar for this.

"kobectl version" not working

kobectl version just prints the command usage guide. It should print all kobe configurable component versions (i.e. Kubernetes and istio versions)

Maintainance of custom status for Benchmark and Experiment

There are provisional status fields in the custom resource of Benchmark and Experiment that are not updated appropriately.

Status fields are useful and Kubernetes can display them next to the resources in the command line and in the web interface.

Enhance documentation with extensibility of query evaluator

The default query evaluator requests queries in a sequential manner and collects the results. However, there are other patterns to issue a set of queries. Fortunately, KOBE is extensible in terms of the query evaluator but we need to enhance the documentation in order to include instructions of how someone can plug a different query evaluator.

Make Istio optional

When it is available, Istio is always used even for benchmarks that do not specify delays. This is because the Operator tags all containers with the "istio-injection:enabled" label, which is used by Istio to decide when to create sidecars. The Operator must check if delays are specified in the benchmark definition, and only assign the "istio-injection:enabled" label when delays are declared and they are non-zero.

Abstract paths for dataset server template

Currently, the implementor of a dataset server template should read the dump file from the directory /kobe/dataset/$DATASET_NAME/dump and should create a copy of the database in some custom directory inside /kobe/dataset/$DATASET_NAME.

As an improvement, the api should be extended such as these directories should be chosen by the author of the template (as already done in the federator templates).

Support for Kubernetes 1.18

When deploying the custom resource definitions to a Kubernetes cluster >=1.18 I receive the some errors (see below).

this property is in x-kubernetes-list-map-keys, so it must have a default or be a required property

The CustomResourceDefinition "benchmarks.kobe.semagrow.org" is invalid: 
* spec.validation.openAPIV3Schema.properties[spec].properties[datasets].items.properties[template].properties[spec].properties[containers].items.properties[ports].items.properties[protocol].default: Required value: this property is in x-kubernetes-list-map-keys, so it must have a default or be a required property
* spec.validation.openAPIV3Schema.properties[spec].properties[datasets].items.properties[template].properties[spec].properties[importContainers].items.properties[ports].items.properties[protocol].default: Required value: this property is in x-kubernetes-list-map-keys, so it must have a default or be a required property
* spec.validation.openAPIV3Schema.properties[spec].properties[datasets].items.properties[template].properties[spec].properties[initContainers].items.properties[ports].items.properties[protocol].default: Required value: this property is in x-kubernetes-list-map-keys, so it must have a default or be a required property

This is due to an incompatibility between the definition of the Containerand the CustomResourceDefinition update. I believe this is due to an API change in v1.18.

CustomResourceDefinition schemas that use x-kubernetes-list-map-keys to specify properties that uniquely identify list items must make those properties required or have a default value, to > ensure those properties are present for all list items. See https://kubernetes.io/docs/reference/using-api/api-concepts/#merge-strategy for details. (#88076, @eloyekunle) [SIG API Machinery and Testing]

Experiment-Fedx-Fedbench-ls: query execution error

Hi, when I perform an experiment for FedX using the life-science dataset of FedBench, the evaluation completed successfully but query evaluation time and results are missing:
Query;run1;run2;run3;avg;numResults;minRes;maxRes; ls1; -2; -2; -2; -1; -1; -1; -1; ls2;-2;-2;-2;-1;-1;-1;-1; ls3;-2;-2;-2;-1;-1;-1;-1; ls4;-2;-2;-2;-1;-1;-1;-1; ls5;-2;-2;-2;-1;-1;-1;-1; ls6;-2;-2;-2;-1;-1;-1;-1; ls7;-2;-2;-2;-1;-1;-1;-1;
You can find the log and description for the Pod "fedx-ls-exp-evaluationjob" below:
log.txt
fedx_fedbench_ls_pod_describe.txt

Please feel free to let me know if I'm missing something.

Use init containers in the dataset server template examples

Currently, both strabon and virtuoso templates in examples make use of only one container, which is used both for loading data and serving data. Therefore, the experimenter must start the experiment after the data loading is complete.

We should use init containers for loading the data.

Istio 1.6 not working, update to 1.11.3 needed.

Move from Istio 1.6 to last known working version, Istio 1.11.3. Update the documentation accordingly.

Compatibility with k8s version 1.15 and below

Switching API version from apiextensions.k8s.io/v1beta1 to apiextensions.k8s.io/v1 (#21) broke backwards compatibility with version of Kubernetes below 1.16.

"kobectl install operator" instruction not working

Error message is

usage: kobectl install [component] [kobe-directory]

so we need to update the documentation to include the kobe-directory parameter.

Datasets and Federators should mount from NFS as read only

Currently they mount with read/write permissions which could produce some unwanted overhead. It should be mounted as read only since they never alter the files there.

Datasets for the 'largerdfbench' benchmark are missing

Related to PR #30, the datasets for the benchmark 'largerdfbench' are missing:

https://users.iit.demokritos.gr/~antru/largerdfbench/LinkedTCGA-A.tar.gz
https://users.iit.demokritos.gr/~antru/largerdfbench/LinkedTCGA-E.tar.gz
https://users.iit.demokritos.gr/~antru/largerdfbench/LinkedTCGA-M.tar.gz
https://users.iit.demokritos.gr/~antru/largerdfbench/Affymetrix.tar.gz

Deploy kobectl as kubectl plugin

Nowadays, kubectl allows for plugins. Kobectl can also be deployed as such.

Provide the ability to deploy Kobe in a namespace other than "default"

By following the documentation or using kobectl, Kobe can only be deployed in namespace "default". We should provide instructions on how to deploy Kobe in another namespace and modify kobectl to support this.

Provide support for LDF servers

Apart from Virtuoso and Strabon, a federation could also include LDF servers.

kobectl does not work with helm v3

error message

$ kobectl install efk .
"elastic" has been added to your repositories
"kiwigrid" has been added to your repositories
Error: unknown flag: --name
Error: unknown flag: --name
Error: unknown flag: --name

The --name flag was removed in helm 3. New helm install usage is
helm install [NAME] [CHART] [flags]

Update example datasettemplates

Update dataset server templates to work with and without delays (make use of the $USE_ISTIO flag).

Provide documentation on how to verify installation.

Users should be able to verify that Kobe was properly deployed. For example, after the kobe install operator step, the user should issue kubectl get pods -l name=kobe-operator and see pod's status "Running".

Installation details - ReadMe

It is not clear in the readme how to install the prerequisites. I could not use kobe because I did not manage to install the list of the mentioned prerequisites. Detailed explanation would be a great enhancement.

Prerequisites
Kubernetes >= 1.8.0
kubectl configured for the Kubernetes cluster
Helm version 3 (for the Evaluation Metrics Extraction subsystem)
nfs-commons installed in the nodes of the cluster. If in Debian or Ubuntu you can install it using apt-get install nfs-common

kubectl missing from prerequisites

kubectl needs to be installed and configured with the kubernetes cluster credentials before deploying kobe. We need to add this in Prerequisites section.

Flag to use Istio not set

I try to run an example experiment with delays (toybench-delays) and the Kobe-operator emit this error: "Flag to use Istio not set"

kobectl depends on helm, missing documentation

When using kobectl to deploy the efk stack, the following error is returned in helm is not installed.

kobectl install efk .
/home/gmouchakis/tmp/kobe/kobe/bin/kobectl: 236: helm: not found
/home/gmouchakis/tmp/kobe/kobe/bin/kobectl: 237: helm: not found
/home/gmouchakis/tmp/kobe/kobe/bin/kobectl: 238: helm: not found
/home/gmouchakis/tmp/kobe/kobe/bin/kobectl: 241: helm: not found
/home/gmouchakis/tmp/kobe/kobe/bin/kobectl: 243: helm: not found
configmap/kobe-kibana-config created
job.batch/kobe-kibana-configuration created

helm should be added in the Prerequisites section.

Problem in Kibana configuration

While installing the metrics subsystem, the status of pod "Kobe-kibana-configuration" is always "NotReady".

The machine used for setting up KOBE is a server that runs a 64-bit Debian GNU/Linux 10 server operating system. On this machine, I have installed Kubernetes (v1.22.4), nfs-commons (v1:1.3.4-6), minikube (v1.24.0), istio (v1.12.0), and helm (v2.17.0)