scior-dataset's Introduction

Scior-Dataset

Dataset with results of Scior tests using the Scior-Tester automation tool performed on the OntoUML/UFO Catalog.

Description

The Scior-Dataset is composed of files with results of Scior tests performed via the Scior-Tester on the OntoUML/UFO Catalog.

The FAIR Model Catalog for Ontology-Driven Conceptual Modeling Research, short-named OntoUML/UFO Catalog, is a structured and open-source catalog that contains OntoUML and UFO ontology models. The catalog was conceived to allow collaborative work and to be easily accessible to all its users. Its goal is to support empirical research in OntoUML and UFO, as well as for the general conceptual modeling area, by providing high-quality curated, structured, and machine-processable data on why, where, and how different modeling approaches are used. The catalog offers a diverse collection of conceptual models, created by modelers with varying modeling skills, for a range of domains, and for different purposes.

The tests were performed using the automation tool named Scior-Tester, which runs over Scior. Scior is the abbreviated name for Identification of Ontological Categories for OWL Ontologies, a software tool that aims to support the semi-automatic semantic improvement of lightweight web ontologies. We aim to reach this semantic improvement by associating concepts from gUFO (a lightweight implementation of the Unified Foundational Ontology, UFO) with the OWL entities. The aim of gUFO is "to provide a lightweight implementation of the Unified Foundational Ontology (UFO) suitable for Semantic Web OWL 2 DL applications".

This document presents the structure of the files generated during the Scior-Tester execution. For a complete comprehension of the tests (regarding scope, objectives, implementation, etc.), please refer to the Scior-Tester description file.

The resulting datasets are published to share with the community data that can be analyzed in different ways; moreover, all executed tests are fully reproducible.

Contents

Nomenclature of Files and Folders

To avoid long names for files and directories, all content available in the datasets in this repository follows the nomenclature presented here:

  1. Numbers with up to three digits are always presented with three digits (e.g., 001). Numbers with more than three digits are presented without additional padding
  2. All numbers must be attached directly to their corresponding item (e.g., test, execution, etc.)
  3. The following words must be replaced by the corresponding abbreviations:
    • test: tt
    • taxonomy: tx
    • execution: ex
    • percentage: pc
  4. The Scior parameters must be represented using the following simplifications:
    • automatic: a
    • interactive: i
    • complete: c
    • incomplete: n
  5. The automation parameter (a or i) must come first, and the completion parameter must follow it (c or n)
  6. The parameters must be displayed integrated (e.g., ac, in, etc.)
  7. File names must not contain spaces; any spaces must be replaced by hyphens
  8. Different items in the file name must be separated by underscores
  9. The following item order must be used whenever possible: file name, dataset name, test name/number, test parameters, taxonomy number, execution number, percentage number
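As an illustration, the rules above can be sketched as a small name-building helper (the function and its arguments are hypothetical, not part of the Scior-Tester):

```python
# Hypothetical helper illustrating the nomenclature rules above.
ABBREVIATIONS = {"test": "tt", "taxonomy": "tx", "execution": "ex", "percentage": "pc"}

def fmt_number(n: int) -> str:
    """Rule 1: pad numbers to three digits; longer numbers stay as-is."""
    return f"{n:03d}"

def build_name(file_kind: str, dataset: str, test: int, params: str,
               taxonomy: int, execution: int, percentage: int) -> str:
    """Rules 2-9: abbreviate items, attach numbers, join with underscores."""
    parts = [
        file_kind,
        dataset.replace(" ", "-"),               # rule 7: spaces become hyphens
        ABBREVIATIONS["test"] + fmt_number(test),
        params,                                  # rules 4-6: e.g. "ac", "in"
        ABBREVIATIONS["taxonomy"] + fmt_number(taxonomy),
        ABBREVIATIONS["execution"] + fmt_number(execution),
        ABBREVIATIONS["percentage"] + fmt_number(percentage),
    ]
    return "_".join(parts) + ".csv"              # rule 8: underscores between items

print(build_name("matrix", "aguiar2018rdbs-o", 2, "ac", 1, 3, 10))
# matrix_aguiar2018rdbs-o_tt002_ac_tx001_ex003_pc010.csv
```

The printed name follows the item order of rule 9 and matches the pattern of the matrix files described later in this document.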

Build Generated Files

The Scior-Tester creates a directory for each of the catalog's datasets that is tested. Each directory contains folders with the results of the performed tests, as well as two files generated by the Scior-Tester to be used as input for the tests. To generate these files, the Tester decomposes the original taxonomy from a dataset into its (possibly multiple) independent taxonomies: isolated groups of classes related to each other via specialization/generalization relations. Both files are presented in this document, as well as a hashes register file.

Taxonomical Graph ttl File

Each XXX_txYYY.ttl file (with XXX being the dataset name and YYY ranging from 001 to the number of independent taxonomies available in the dataset's OntoUML model) contains an isolated taxonomical graph in OWL (in Turtle syntax) extracted from the OWL taxonomy provided in the catalog's dataset to be tested. An example of a generated taxonomy file is: aguiar2018rdbs-o_tx001.ttl.

For instance, a single model that has two unconnected hierarchical structures of concepts will generate two files, each containing only the following properties: rdfs:subClassOf, owl:Class, and rdf:type.
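The decomposition into independent taxonomies amounts to finding the connected components of the specialization graph. A minimal sketch (not the Tester's actual code; classes with no specialization relations at all are not represented here):

```python
# Sketch: splitting a subclass hierarchy into its independent taxonomies
# via connected components of the specialization graph.
from collections import defaultdict

def independent_taxonomies(subclass_pairs):
    """Group classes into connected components; edges are (sub, super) pairs."""
    graph = defaultdict(set)
    for sub, sup in subclass_pairs:
        graph[sub].add(sup)
        graph[sup].add(sub)
    seen, components = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:                      # iterative DFS over one component
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(graph[cur] - comp)
        seen |= comp
        components.append(comp)
    return components

pairs = [("Dog", "Animal"), ("Cat", "Animal"), ("Table", "Furniture")]
print(len(independent_taxonomies(pairs)))  # 2
```

Here the two unconnected hierarchies ({Dog, Cat, Animal} and {Table, Furniture}) yield two independent taxonomies, hence two .ttl files.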

To generate the concepts' URIs, the Scior-Tester uses the following namespace for all taxonomies generated for all datasets: http://taxonomy.model/
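For illustration only, a taxonomy file restricted to these three properties and using the namespace above could look like the output of the following string-based sketch (the helper name is hypothetical; the Tester itself presumably uses an RDF library):

```python
# String-based sketch of a taxonomy .ttl file: only owl:Class declarations
# (rdf:type) and rdfs:subClassOf triples, under http://taxonomy.model/.
NS = "http://taxonomy.model/"

def to_turtle(subclass_pairs):
    lines = [f"@prefix : <{NS}> .",
             "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
             "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> ."]
    classes = sorted({c for pair in subclass_pairs for c in pair})
    for c in classes:
        lines.append(f":{c} a owl:Class .")          # 'a' is rdf:type in Turtle
    for sub, sup in subclass_pairs:
        lines.append(f":{sub} rdfs:subClassOf :{sup} .")
    return "\n".join(lines)

print(to_turtle([("Dog", "Animal")]))
```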

Taxonomical Graph Information csv File

Each data_XXX_txYYY.csv file (with XXX being the dataset name and YYY ranging from 001 to the number of independent taxonomies available in the dataset's OntoUML model) contains information about all the classes that are part of the taxonomical graph with the corresponding number (i.e., the file data_aguiar2018rdbs-o_tx001.csv refers to the taxonomy saved in the file aguiar2018rdbs-o_tx001.ttl). Comparisons between a test's results and the input data should use this file, as it contains the source data.

The generated csv file contains the following columns:

  • class_name: name of the OntoUML class as it is in the original model (i.e., without namespace)
  • ontouml_stereotype: the class's OntoUML stereotype as was attributed by its modeler
  • gufo_classification: the class's OntoUML stereotype mapped to a gUFO endurant type (click here for more information)
  • is_root: Boolean value that shows if the class is a root node in the taxonomical graph (i.e., if it has no superclasses)
  • is_leaf: Boolean value that shows if the class is a leaf node in the taxonomical graph (i.e., if it has no subclasses)
  • is_intermediate: Boolean value that shows if the class is an intermediate node in the taxonomical graph (i.e., if it has subclasses and superclasses)
  • number_superclasses: the total number of direct and indirect superclasses that the class has
  • number_subclasses: the total number of direct and indirect subclasses that the class has

As every class must be a root, a leaf, or an intermediate node, note that this file would be inconsistent if:

  • (is_root OR is_leaf OR is_intermediate) != True, or if
  • (is_root AND is_leaf AND is_intermediate) != False
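These two conditions can be expressed directly as a row-level check (a sketch; the function name is hypothetical):

```python
# Sketch of the consistency condition above: every class must fall into
# at least one category, but never all three at once.
def row_is_consistent(is_root: bool, is_leaf: bool, is_intermediate: bool) -> bool:
    return (is_root or is_leaf or is_intermediate) and not (
        is_root and is_leaf and is_intermediate)

print(row_is_consistent(True, False, False))   # True  (a root class)
print(row_is_consistent(False, False, False))  # False (no category at all)
print(row_is_consistent(True, True, True))     # False (all three categories)
```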

Taxonomies Summary csv File

This file, named taxonomies.csv, contains information about all taxonomies created in all datasets during the build function. The aim of this file is to present the information to users in a simple way, so that they can analyze it when creating tests or manipulating test results.

The generated csv file contains the following columns:

  • taxonomy_name: a string with the name of the dataset file (e.g., abrahao2018agriculture-operations_tx001.ttl)
  • dataset_name: a string with the dataset that contains this taxonomy (e.g., abrahao2018agriculture-operations)
  • num_mapped_classes: an integer representing the number of classes in the taxonomy whose classification is different from the string "other"
  • num_other_classes: an integer representing the number of classes in the taxonomy classified with the string "other"
  • num_classes: an integer representing the number of classes that the taxonomy has

Note that the sum of num_mapped_classes and num_other_classes must equal num_classes. These classifications are related to the mapping process (described here).

A single taxonomies.csv file, located in the /catalog folder, is created after the build function is completed.
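A minimal sketch of checking this invariant over a taxonomies.csv-style table (the sample row below is fabricated for illustration; the column names come from the list above):

```python
# Sketch: checking that num_mapped_classes + num_other_classes == num_classes
# for every row of a taxonomies.csv-like table.
import csv
import io

SAMPLE = """taxonomy_name,dataset_name,num_mapped_classes,num_other_classes,num_classes
abrahao2018agriculture-operations_tx001.ttl,abrahao2018agriculture-operations,10,2,12
"""

for row in csv.DictReader(io.StringIO(SAMPLE)):
    total = int(row["num_mapped_classes"]) + int(row["num_other_classes"])
    assert total == int(row["num_classes"]), f"inconsistent row: {row}"
print("all rows consistent")
```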

Hashes Register CSV File

For traceability, the Scior-Tester provides a function for generating a SHA256 hash of its generated files and of the files that originated them. The whole dataset contains a single csv register file named hash_sha256_register.csv, with four columns of data; new rows are appended every time the Tester creates new files. The columns are:

  • file_name: complete path of the file being hashed
  • file_hash: SHA256 hash of the file
  • source_file_name: file used as a source for the generation of the file being hashed
  • source_file_hash: SHA256 hash of the source file

As an example of use, a user who wants to verify that they are working with the same source data used to generate the published results can compute the SHA256 hash of the files they are using and check whether it exists in the hashes register file.
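That verification workflow can be sketched as follows (the helper names are hypothetical; the register column names come from the list above):

```python
# Sketch: hash a local file with SHA-256 and look the digest up in
# hash_sha256_register.csv (both file_hash and source_file_hash columns).
import csv
import hashlib

def sha256_of(path: str) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def is_registered(digest: str, register_path: str = "hash_sha256_register.csv") -> bool:
    """Return True if the digest appears anywhere in the hashes register."""
    with open(register_path, newline="") as f:
        return any(digest in (row["file_hash"], row["source_file_hash"])
                   for row in csv.DictReader(f))
```

A match in the register confirms the local file is byte-identical to the one the Tester recorded.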

Tests – Generated Files and their Descriptions

Currently, datasets generated from the execution of two tests are available. Please use the following links to access the tests' descriptions and results.

Related Repositories

  • Scior: software for identification of ontological categories for OWL ontologies.
  • Scior-Tester: used for automating tests on Scior.
  • Scior-Dataset: contains data resulting from the Scior-Tester.
  • OntoUML/UFO Catalog: source of models used for the performed tests.

Contributors

Acknowledgements

This work is a collaboration between the Free University of Bozen-Bolzano, the University of Twente, and Accenture Israel Cybersecurity Labs.


scior-dataset's Issues

Usefulness Analysis for the Tool's Usage: Performance

Considering only taxonomies that do not have inconsistencies and divergences, plot 2 line graphs (for ac and for an execution modes) with 3 lines each and with error bars (standard error?).

Plotted Lines:

  • Line A: average execution time for all rules (total_time of the Times csv file) over all executions of each of these 3 sets of taxonomies (NUMBER_OF_EXECUTIONS_PER_DATASET_PER_PERCENTAGE).
    • A1: using the set of the 20% smallest taxonomies
    • A2: using the set of the 20% intermediate-sized taxonomies
    • A3: using the set of the 20% largest taxonomies

Axis:

  • Y axis: execution time in ms
  • X axis: percentage of classes used as input (from PERCENTAGE_INITIAL to PERCENTAGE_FINAL, increasing by PERCENTAGE_RATE)

Final evaluation matrices

  1. Generate one resulting matrix (issue #15) for all percentages tested (for AC and for AN)
  2. Create two new matrices, one for AC and one for AN. These matrices must have:
  • row 1 - first line of the matrix for test with 10% input
  • row 2 - first line of the matrix for test with 20% input
  • row 3 - first line of the matrix for test with 30% input
  • (...)
  • row 9 - first line of the matrix for test with 90% input

The number of columns will be 15, because this is the number of columns of the resulting matrix generated on issue #15.

That is, the number of rows in the resulting matrix equals the number of percentages tested, and each row is the first row of the matrix resulting from the evaluation performed in issue #15.
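Assuming numpy and using random stand-ins for the issue #15 evaluation matrices, the row-stacking described above can be sketched as:

```python
# Sketch: build the per-mode final matrix by stacking the first row of
# each percentage's evaluation matrix. The matrices here are random
# stand-ins; the real ones come from the issue #15 evaluation.
import numpy as np

percentages = range(10, 100, 10)                 # 10%, 20%, ..., 90% input
rng = np.random.default_rng(0)
per_percentage = {p: rng.random((15, 15)) for p in percentages}

# One row per tested percentage, each taken from that percentage's matrix:
final = np.vstack([per_percentage[p][0] for p in percentages])
print(final.shape)  # (9, 15)
```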

Project name changed to Scior

The project's name was changed from OntCatOWL to Scior.

We need to replace every occurrence of the term OntCatOWL with Scior in code, resources, and documentation.

In this case: Scior-Dataset

Usefulness Analysis for the Tool's Objective: CLASSES Inferred knowledge

Considering only valid taxonomies, which are the ones that:

  1. do not have inconsistencies
  2. do not have divergences
  3. have more than MINIMUM_ALLOWED_NUMBER_CLASSES classes that are not classified as "other"

please plot 2 line graphs (for the "ac" and "an" execution modes) with 3 lines each. Verify the necessity of plotting error bars.

Use the average of the values of all executions of all these 3 sets of taxonomies (NUMBER_OF_EXECUTIONS_PER_DATASET_PER_PERCENTAGE).

Plotted Lines:

  • Line A: percentage of totally unknown classes after the execution (result). This will probably be a curve.
    • Use the column: classes_a_tu_classes_types_p.
  • Line B: percentage of partially known classes after the execution (result). This will probably be a curve.
    • Use the column: classes_a_pk_classes_types_p.
  • Line C: percentage of totally known classes after the execution (result). This will probably be a curve.
    • Use the column: classes_a_tk_classes_types_p.

Note: A + B + C = 100%

All variables are found in the statistics_XXX_tt002_MM_txYYY.csv file.

Axis:

  • Y axis: percentage of classes (0 to 100%).
  • X axis: percentage of classes used as input (from PERCENTAGE_INITIAL to PERCENTAGE_FINAL, increasing by PERCENTAGE_RATE)

Calculate:

  • On average, how much input is needed before there are more totally known than totally unknown classes?
  • On average, how much input is needed before there are more totally known than partially known classes?
  • On average, how much input is needed to reach 100% totally known classes (if that is possible)?

Usefulness Analysis for the Tool's Objective: best percentage analysis

Considering the knowledge matrices of TEST 2 (file names matrix_XXX_tt002_MM_txYYY_exZZZ_pcKKK.csv) for test 2 "ac" and "an":

Selection of files:

i) Select the percentage with the highest knowledge gain identified in Issue #9
ii) Use all knowledge matrices for that percentage for all valid taxonomies

Activities:

  1. For all selected knowledge matrices, transform their values into percentages (simply divide the matrix values by the number of classes in the taxonomy)
  2. Sum all the matrices through a simple addition of matrices of the same size (entrywise sum)
  3. Convert the resulting matrix's values to percentages

With the final matrix, we will generate a heat map.
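Assuming numpy, the three activities can be sketched as follows (step 3 is interpreted here as rescaling the summed matrix so that its entries total 100%; the input matrices below are fabricated for illustration):

```python
# Sketch of the aggregation pipeline above, using numpy.
import numpy as np

def aggregate(matrices, class_counts):
    """Normalize, sum entrywise, and rescale a set of knowledge matrices."""
    normalized = [m / n for m, n in zip(matrices, class_counts)]  # step 1
    total = np.sum(normalized, axis=0)                            # step 2
    return 100 * total / total.sum()                              # step 3

a = np.ones((3, 3))        # stand-in matrix for a 9-class taxonomy
b = 2 * np.ones((3, 3))    # stand-in matrix for an 18-class taxonomy
result = aggregate([a, b], [9, 18])
print(np.isclose(result.sum(), 100))  # True
```

The resulting matrix can then be passed to any heat-map plotting routine.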

Create Links

Some documentation may contain links that still need to be created. Verify whether there are any.

Usefulness Analysis for the Tool's Objective: CLASSIFICATIONS Inferred knowledge

Considering only valid taxonomies, which are the ones that:

  1. do not have inconsistencies
  2. do not have divergences
  3. have more than MINIMUM_ALLOWED_NUMBER_CLASSES classes that are not classified as "other"

please plot 2 line graphs (for the ac and an execution modes) with 2 lines each. Verify the necessity of plotting error bars.

Plotted Lines:

  • Line A: percentage of known classifications used as input (range from PERCENTAGE_INITIAL to PERCENTAGE_FINAL). This may be a straight line.
    • Use the column classif_b_known_classif_types_p.
  • Line B: percentage of known classifications after the execution (result). This will probably be a curve.
    • Use the column classif_a_known_classif_types_p.

Both variables are found in the statistics_XXX_tt002_MM_txYYY.csv file.

Use the average of the values of all executions of all these 3 sets of taxonomies (NUMBER_OF_EXECUTIONS_PER_DATASET_PER_PERCENTAGE).

Axis:

  • Y axis: percentage of known classifications (i.e., percentage of known knowledge). For types, 14 * number_of_classes corresponds to 100%.
  • X axis: percentage of classes used as input (from PERCENTAGE_INITIAL to PERCENTAGE_FINAL, increasing by PERCENTAGE_RATE)

Calculate:

  • On average, how much input is needed to reach 100% of known classifications?
  • Which are the most efficient and the most inefficient input percentages concerning inferred classifications?
    • Most efficient: corresponds to the highest difference between plotted lines A and B.
    • Most inefficient: corresponds to the lowest difference between plotted lines A and B.
    • Please generate a table with the values of the differences between input and output percentages (lines A and B) for all tested percentages.

Query (Selection): Taxonomy numbers

Considering the results of issues #4, #5, #6, and #7:

a) How many taxonomies does the build function generate (i.e., all generated taxonomies, even the ones with inconsistencies and divergences)?
b) How many taxonomies do not have inconsistencies?
c) How many taxonomies do not have divergences?
d) How many taxonomies have neither inconsistencies nor divergences?
e) How many taxonomies have neither inconsistencies nor divergences and have more than 20 classes?
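With hypothetical per-taxonomy flags, the five counts can be expressed as simple filters (the data below is fabricated for illustration):

```python
# Sketch: queries (a)-(e) as boolean filters over per-taxonomy records.
taxonomies = [
    {"name": "tx001", "inconsistent": False, "divergent": False, "classes": 25},
    {"name": "tx002", "inconsistent": True,  "divergent": False, "classes": 12},
    {"name": "tx003", "inconsistent": False, "divergent": True,  "classes": 40},
    {"name": "tx004", "inconsistent": False, "divergent": False, "classes": 8},
]

a = len(taxonomies)                                                # query a
b = sum(not t["inconsistent"] for t in taxonomies)                 # query b
c = sum(not t["divergent"] for t in taxonomies)                    # query c
d = sum(not t["inconsistent"] and not t["divergent"]
        for t in taxonomies)                                       # query d
e = sum(not t["inconsistent"] and not t["divergent"] and t["classes"] > 20
        for t in taxonomies)                                       # query e
print(a, b, c, d, e)  # 4 3 3 2 1
```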

Usefulness Analysis for the Tool's Usage: Scalability (Individual)

Considering only taxonomies that do not have inconsistencies and divergences, plot 2 line graphs (for ac and for an execution modes) with 3 lines each and with error bars (standard error?).

Plotted Lines:

  • Line A: average execution time for all rules (total_time of the Times csv file) over all executions of each taxonomy (NUMBER_OF_EXECUTIONS_PER_DATASET_PER_PERCENTAGE).
    • A1: using 10% of known classes as input
    • A2: using 50% of known classes as input
    • A3: using 90% of known classes as input

Axis:

  • Y axis: execution time in ms
  • X axis: number of classes the taxonomy has (use all taxonomies according to the consideration to be adopted)

Calculate:

  • Fit a line/curve function to provide evidence that the software's runtime is polynomial
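Assuming numpy, fitting such a curve can be sketched with a least-squares polynomial fit over synthetic measurements (the data below is fabricated; real values would come from the Times csv files):

```python
# Sketch: fit a polynomial to execution time versus taxonomy size to
# support a polynomial-runtime claim. Synthetic quadratic "measurements".
import numpy as np

sizes = np.array([10, 20, 40, 80, 160], dtype=float)   # number of classes
times_ms = 0.5 * sizes**2 + 3 * sizes                  # pretend timings

coeffs = np.polyfit(sizes, times_ms, deg=2)            # least-squares fit
print(np.allclose(coeffs, [0.5, 3.0, 0.0], atol=1e-3))  # True
```

On real data, comparing the residuals of fits of increasing degree would indicate which polynomial order best describes the runtime.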

Usefulness Analysis for the Tool's Usage: Scalability (Complete)

Considering only taxonomies that do not have inconsistencies and divergences, plot 2 line graphs (for ac and for an execution modes) with a single line each and with error bars (standard error?).

Plotted Lines:

  • Line A: average execution time for all rules (total_time of the Times csv file) over all executions of each taxonomy (NUMBER_OF_EXECUTIONS_PER_DATASET_PER_PERCENTAGE), averaging over all percentages of known classes used as input.

Axis:

  • Y axis: execution time in ms
  • X axis: number of classes the taxonomy has (use all taxonomies according to the consideration to be adopted)

Calculate:

  • Fit a line/curve function to provide evidence that the software's runtime is polynomial
