Serene Data Integration Platform

Serene is a data integration platform designed to provide semantic matching across heterogeneous relational data stores.

Prerequisites

You will need sbt to build and run the platform. On macOS:

brew install sbt

On Debian Linux:

sudo apt-get install sbt

Installation

You can build the library with

bin/build

This should build the Serene server and place the final jar into the jars directory.

Alternatively, to use sbt directly:

sbt assembly

Usage

To start the web server, use:

bin/server-start

The following command-line options are available:

--storage-path <value>  Storage Path determines the directory in which to store all files and objects
--host <value>          Server host address (default 127.0.0.1)
--port <value>          Server port number (default 8080)
--help                  Prints this usage text
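
For example, to start the server on a different port with a custom storage directory (the path is illustrative):

bin/server-start --port 8888 --storage-path /tmp/serene-storage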

Alternatively, to use sbt, you can run

sbt run

with the arguments in quotes, e.g.

sbt "run --port 8888"

Additional configuration is available in application.conf, specifically for the initialization of Spark.
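
As a rough sketch, the Spark settings in application.conf might look like the following; the key names here are hypothetical, so consult the shipped file for the actual ones:

# hypothetical application.conf fragment (HOCON); key names are illustrative only
spark {
  master = "local[*]"
  driver-memory = "2g"
}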

The API can be used with the following commands...

General

By default the server will run on localhost, port 8080. This can be changed in src/main/resources/application.conf. To check that the server is running, ensure that the following endpoints return valid JSON:

# check version
curl localhost:8080

# simple test
curl localhost:8080/v1.0

WARNING: the server will not work properly if the logging level is set to DEBUG (see the "Infinite training loop" issue below).

Datasets

Datasets need to be uploaded to the server. Currently only CSV files are supported. A description can also be attached to the dataset upload. If a dataset has no header row, a synthetic header line must be added to the CSV (otherwise Serene will not read the dataset properly): the header line should consist of the numbers 0 up to the number of columns minus 1.
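
For example, a headerless three-column CSV would be given the synthetic header 0,1,2 (the data rows below are illustrative):

0,1,2
John Smith,Sydney,0404123456
Jane Doe,Melbourne,0404654321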

# Get a list of datasets...
curl localhost:8080/v1.0/dataset

# Post a new dataset...
# Note that the max upload size is 2GB...
curl -X POST -F '[email protected]' -F 'description=This is a file' -F 'typeMap={"a":"int", "c":"string", "e":"int"}' localhost:8080/v1.0/dataset

# Show a single dataset
curl localhost:8080/v1.0/dataset/12341234

# Show a single dataset with custom sample size
curl localhost:8080/v1.0/dataset/12341234?samples=50

# Update a single dataset
curl -X POST -F 'description=This is a file' -F 'typeMap={"a":"int", "c":"string", "e":"float"}' localhost:8080/v1.0/dataset/12341234

# Delete a dataset
curl -X DELETE  localhost:8080/v1.0/dataset/12341234

Schema Matcher Models

The model endpoint controls the parameters used for the Schema Matcher classifier. The Schema Matcher takes a list of classes and attempts to assign them to the columns of a dataset. If a column's class is already known, use labelData to map the column's ColumnID to that class. The features, modelType and resamplingStrategy can also be modified.

# List models
curl localhost:8080/v1.0/model

# Post model
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "description": "This is the description",
    "modelType": "randomForest",
    "classes": ["name", "address", "phone", "unknown"],
    "features": { "activeFeatures" : [
          "num-unique-vals",
          "prop-unique-vals",
          "prop-missing-vals",
          "ratio-alpha-chars",
          "prop-numerical-chars",
          "prop-whitespace-chars",
          "prop-entries-with-at-sign",
          "prop-entries-with-hyphen",
          "prop-range-format",
          "is-discrete",
          "entropy-for-discrete-values"
        ],
        "activeFeatureGroups" : [
          "inferred-data-type",
          "stats-of-text-length",
          "stats-of-numeric-type",
          "prop-instances-per-class-in-knearestneighbours",
          "mean-character-cosine-similarity-from-class-examples",
          "min-editdistance-from-class-examples",
          "min-wordnet-jcn-distance-from-class-examples",
          "min-wordnet-lin-distance-from-class-examples"
        ],
        "featureExtractorParams" : [
             {
              "name" : "prop-instances-per-class-in-knearestneighbours",
              "num-neighbours" : 3
             }, {
              "name" : "min-editdistance-from-class-examples",
              "max-comparisons-per-class" : 3
             }, {
              "name" : "min-wordnet-jcn-distance-from-class-examples",
              "max-comparisons-per-class" : 3
             }, {
              "name" : "min-wordnet-lin-distance-from-class-examples",
              "max-comparisons-per-class" : 3
             }
           ]
        },
    "costMatrix": [[1,0,0], [0,1,0], [0,0,1]],
    "labelData" : {"1696954974" : "name", "66413956": "address"},
    "resamplingStrategy": "ResampleToMean"
    }' \
  localhost:8080/v1.0/model


# Show a single model
curl localhost:8080/v1.0/model/12341234

# Update model (all fields optional)
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "description": "This is the description",
    "modelType": "randomForest",
    "classes": ["name", "address", "phone", "unknown"],
    "features": { "activeFeatures" : [
          "num-unique-vals",
          "prop-unique-vals",
          "prop-missing-vals",
          "ratio-alpha-chars",
          "prop-numerical-chars",
          "prop-whitespace-chars",
          "prop-entries-with-at-sign",
          "prop-entries-with-hyphen",
          "prop-range-format",
          "is-discrete",
          "entropy-for-discrete-values"
        ],
        "activeFeatureGroups" : [
          "inferred-data-type",
          "stats-of-text-length",
          "stats-of-numeric-type",
          "prop-instances-per-class-in-knearestneighbours",
          "mean-character-cosine-similarity-from-class-examples",
          "min-editdistance-from-class-examples",
          "min-wordnet-jcn-distance-from-class-examples",
          "min-wordnet-lin-distance-from-class-examples"
        ],
        "featureExtractorParams" : [
             {
              "name" : "prop-instances-per-class-in-knearestneighbours",
              "num-neighbours" : 3
             }, {
              "name" : "min-editdistance-from-class-examples",
              "max-comparisons-per-class" : 3
             }, {
              "name" : "min-wordnet-jcn-distance-from-class-examples",
              "max-comparisons-per-class" : 3
             }, {
              "name" : "min-wordnet-lin-distance-from-class-examples",
              "max-comparisons-per-class" : 3
             }
           ]
        },
    "costMatrix": [[1,0,0], [0,1,0], [0,0,1]],
    "labelData" : {"1696954974" : "name", "66413956": "address"},
    "resamplingStrategy": "ResampleToMean"
    }' \
  localhost:8080/v1.0/model/98793874


# Train model (async, use GET on model 98793874 to query state)
curl -X POST localhost:8080/v1.0/model/98793874/train
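
Since training is asynchronous, one way to wait for completion is to poll the model resource (a sketch; inspect the returned JSON for the training state):

while true; do
  curl -s localhost:8080/v1.0/model/98793874
  sleep 5
done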

# Delete a model
curl -X DELETE  localhost:8080/v1.0/model/12341234

# Predict classes for a specific dataset 12341234 using the model. Returns a prediction JSON object
curl -X POST localhost:8080/v1.0/model/98793874/predict/12341234

To use the newly added bagging resampling strategies ("Bagging", "BaggingToMax", "BaggingToMean"), two additional parameters can be supplied in the model POST request: numBags and bagSize. Both parameters are integers; if not specified, both default to 100. An example model POST request using bagging:

curl -X POST \
 -H "Content-Type: application/json" \
 -d '{
   "description": "This is the description",
   "modelType": "randomForest",
   "classes": ["name", "address", "phone", "unknown"],
   "features": { "activeFeatures" : [ "num-unique-vals", "prop-unique-vals", "prop-missing-vals" ],
       "activeFeatureGroups" : [ "stats-of-text-length", "prop-instances-per-class-in-knearestneighbours"],
       "featureExtractorParams" : [{"name" : "prop-instances-per-class-in-knearestneighbours","num-neighbours" : 5}]
       },
   "costMatrix": [[1,0,0], [0,1,0], [0,0,1]],
   "labelData" : {"1" : "name", "1817136897" : "unknown", "1498946589" : "name", "134383522" : "phone", "463734360" : "address"},
   "resamplingStrategy": "Bagging",
   "numBags": 10,
   "bagSize": 1000
   }' \
 localhost:8080/v1.0/model

An explanation of the features and the list of available features can be found here. Resampling strategies are enumerated here. Currently only randomForest is supported as a modelType through the Serene API.

Semantic Modelling

Attribute IDs in the source descriptions must be unique across different data sources, since we rely on Karma code to perform the semantic modelling.

The labels (semantic types) are assumed to come in the format className---propertyName, e.g. City---name.

The configuration for the semantic modeler is specified in modeling.properties.

Semantic Source Descriptions

Semantic source descriptions (SSDs) specify exactly how a particular dataset maps onto a specified ontology. They contain the semantic types (i.e., classes/labels) for the columns as well as the relationships between these semantic types; all of this information is encoded in the semantic model. Before a semantic source description can be uploaded to the server, the associated datasets must already be uploaded.

# Get a list of semantic source descriptions...
curl localhost:8080/v1.0/ssd

# Post a new SSD...
# Note that the max upload size is 2GB...
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
      "name": "serene-user-example-ssd",
      "ontology": [1],
      "semanticModel": {
        "nodes": [
            {
               "id": 0,
               "label": "State",
               "type": "ClassNode"
           },
           {
               "id": 1,
               "label": "City",
               "type": "ClassNode"
           }],
       "links": [
           {
               "id":     0,
               "source": 1,
               "target": 0,
               "label": "isPartOf",
               "type": "ObjectPropertyLink"
           }]
      },
      "mappings": [
       {
            "attribute": 1997319549,
            "node": 0
       },
       {
           "attribute": 1160349990,
           "node": 1
       }]
    }' \
         localhost:8080/v1.0/ssd

# Show a single ssd
curl localhost:8080/v1.0/ssd/12341234

# Update a single ssd

# Delete a ssd
curl -X DELETE  localhost:8080/v1.0/ssd/12341234

Ontologies

Serene can handle only OWL ontologies.

# Get a list of ontologies...
curl localhost:8080/v1.0/owl

# Post a new ontology...
# Note that the max upload size is 2GB...
curl -X POST -F '[email protected]' localhost:8080/v1.0/owl

# Show a single owl
curl localhost:8080/v1.0/owl/12341234

# Update a single owl

# Delete an owl
curl -X DELETE  localhost:8080/v1.0/owl/12341234

Octopus

The octopus endpoint controls the parameters used for the Semantic Modeller of the Serene API. The octopus is the final model, which performs both relational and ontological schema matching.

# List octopi
curl localhost:8080/v1.0/octopus

# Post octopus
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
      "name": "hello",
      "description": "Testing octopus used for identifying phone numbers only.",
      "ssds": [1, 2, 3],
      "ontologies": [1, 2, 3],
      "modelingProps": "see below for explanations",
      "modelType": "randomForest",
      "features": ["isAlpha", "alphaRatio", "atSigns", ...],
      "resamplingStrategy": "ResampleToMean",
      "numBags": 10,
      "bagSize": 10
    }' \
         localhost:8080/v1.0/octopus
  
# Train octopus (async; includes training of the schema matcher model; use GET on octopus 98793874 to query state)
curl -X POST localhost:8080/v1.0/octopus/98793874/train

# Delete a single octopus
curl -X DELETE  localhost:8080/v1.0/octopus/12341234

# Suggest a list of semantic models for a specific dataset 12341234 using the octopus. Returns a prediction JSON object
curl -X POST localhost:8080/v1.0/octopus/98793874/predict/12341234

Modeling properties:

Ontology inference properties govern the construction of the alignment graph and regulate how many nodes and links will be additionally inferred from the ontology:

  compatibleProperties (Boolean, default true): Governs construction of the ontology cache (extends the alignment graph with inferred nodes and links from the ontology)
  ontologyAlignment (Boolean, default false): Governs construction of the ontology cache (extends the alignment graph with inferred nodes and links from the ontology)
  addOntologyPaths (Boolean, default false): Extends the alignment graph with inferred paths from the ontology
  multipleSameProperty (Boolean, default false): Allow multiple identical data properties per class node
  thingNode (Boolean, default false): Add a Thing node as the superclass of all other classes
  nodeClosure (Boolean, default true): Additional inference on nodes (the closure of a node contains all nodes connected to it by ObjectProperty or SubClass links)
  propertiesDirect (Boolean, default true): Extend with direct properties
  propertiesIndirect (Boolean, default true): Extend with indirect properties
  propertiesSubclass (Boolean, default true): Extend with subclass properties
  propertiesWithOnlyDomain (Boolean, default true): Allow properties in the ontology which have only a domain indicated, but no range
  propertiesWithOnlyRange (Boolean, default true): Allow properties in the ontology which have only a range indicated, but no domain
  propertiesWithoutDomainRange (Boolean, default false): Allow properties in the ontology which have neither domain nor range

Search optimization (to better understand the search algorithms please refer to the report):

  numSemanticTypes (Int, default 4): Filters the possible matches per column (only the top numSemanticTypes are considered during the mapping stage)
  mappingBranchingFactor (Int, default 50): Reduces the search space of possible mappings (mappings are built as combinations of matches)
  numCandidateMappings (Int, default 10): Reduces the search space of the heuristic STP (Steiner Tree Problem) algorithm (only the top numCandidateMappings are considered for STP)
  topkSteinerTrees (Int, default 10): Number of Steiner trees to be constructed by the algorithm (ranked according to the overall score)

Score is a weighted sum of confidence score, coherence score and size score:

  confidenceWeight (Double, default 1.0): Weight of the confidence score (the confidence returned by the schema matcher)
  coherenceWeight (Double, default 1.0): Weight of the coherence score (calculated from combinations of links and nodes)
  sizeWeight (Double, default 0.5): Weight of the size score (the size of the semantic model)
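
Assuming a simple linear combination (the exact normalisation is not documented here), the overall score would be:

score = confidenceWeight * confidence + coherenceWeight * coherence + sizeWeight * size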

All weights have to be in range (0,1]. Changing weights will affect the search and the results returned by the semantic modeler.

Unknown:

  unknownThreshold (Double, default 0.05): If the confidence score for the unknown class is above this threshold and unknown is the most likely class, the column is discarded

The threshold must be in the range [0,1]. For example, with the default threshold of 0.05, a column whose most likely class is unknown with confidence 0.10 will be discarded.
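
Putting the properties together, a modelingProps object in the octopus POST request might look like the sketch below; the exact accepted format is an assumption, so check modeling.properties and the API before relying on it:

{
  "compatibleProperties": true,
  "ontologyAlignment": false,
  "addOntologyPaths": false,
  "numSemanticTypes": 4,
  "mappingBranchingFactor": 50,
  "numCandidateMappings": 10,
  "topkSteinerTrees": 10,
  "confidenceWeight": 1.0,
  "coherenceWeight": 1.0,
  "sizeWeight": 0.5,
  "unknownThreshold": 0.05
}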

Evaluation

The evaluation endpoint computes three metrics that compare a predicted SSD against the correct one:

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
       "predictedSsd": {
         "name": "businessInfo.csv",
         "ontologies": [1],
         "semanticModel": {
           "nodes": [***],
           "links": [***]
         },
         "mappings": [***]
       },
       "correctSsd": {
         "name": "businessInfo.csv",
         "ontologies": [1],
         "semanticModel": {
           "nodes": [***],
           "links": [***]
         },
         "mappings": [***]
       },
       "ignoreSemanticTypes": true,
       "ignoreColumnNodes": true
      }' \
  localhost:8080/v1.0/evaluate

Tests

To run all tests:

sbt test

To run the tests of an individual module, prefix the test task with the module name, e.g.

sbt serene-core/test
sbt serene-matcher/test
sbt serene-modeler/test

To run an individual test spec, reference the Spec class, e.g.

sbt "serene-core/test-only au.csiro.data61.core.SSDStorageSpec"

To generate the code coverage report:

sbt serene-core/test serene-core/coverageReport

This will generate an HTML report at core/target/scala-2.11/scoverage-report/index.html.

Notes

For the semantic modelling component, three Karma Java libraries need to be available:

  • karma-common;
  • karma-typer;
  • karma-util.

Certain changes have been made to the original Karma code:

  1. Make the following methods public: SortableSemanticModel.steinerNodes.getSizeReduction.

  2. Add the method ModelLearningGraph.setLastUpdateTime:

     public void setLastUpdateTime(long newTime) {
         this.lastUpdateTime = newTime;
     }

  3. Add DINT to Karma's Origin enum of semantic types:

     public enum Origin {
         AutoModel, User, CRFModel, TfIdfModel, RFModel, DINT
     }

  4. Add two more parameters to the updateLinkCountMap method in GraphBuilder.java, giving the signature:

     private void updateLinkCountMap(DefaultLink link, Node source, Node target)

Known Issues

Infinite training loop

This issue appears only when the logging level is set to DEBUG. In that case a Spark/json4s incompatibility raises
java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.render(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/JsonAST$JValue

However, this error is not caught by the schema-matcher API or the DINT code. The model gets stuck in an eternal "busy" state if we try to train it.

Write/read model.rf

There is some additional configuration which influences how SerializableMLibClassifier is written/read with ObjectOutputStream/ObjectInputStream. This causes inconsistent behaviour and raises errors. For example, the package name ("au.com.csiro.data61" / "au.com.nicta") is part of it.

Schema-matcher-api does not have this issue.

Adding fork := true to the general settings fixes this problem, but raises other issues
https://gist.github.com/ramn/5566596#file-serialization-scala
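
For reference, that workaround is a one-line sbt setting; a sketch of where it would go:

// in build.sbt, or in the shared settings of the multi-project build
fork := true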

Fat jar is not properly assembled with sbt

"sbt assembly" with the current build properties does not actually create a fat jar with all required dependencies (at least spark dependencies seem to be messed up).

When running Serene via "bin/serene-start", model training fails with the error:
Error while instantiating 'org.apache.spark.sql.internal.SessionState'

Running Serene via "sbt run" does not have these issues.

Training fails if we have too many features (400+)

When using char-dist-features + header features for the domain "dbpedia", we get many features (400+). Training the RandomForestClassifier with Spark then fails with the error:
Cause: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

Apparently, there's a bug in Spark, but it's not clear if there is an easy fix for this problem:
https://issues.apache.org/jira/browse/SPARK-16845
http://stackoverflow.com/questions/40044779/find-mean-and-corr-of-10-000-columns-in-pyspark-dataframe
https://issues.apache.org/jira/browse/SPARK-17092

SparkTestSpec currently reproduces this error.
