Data-Driven Domain Discovery (D4)

About

D4 implements a data-driven domain discovery approach for collections of related tabular (structured) datasets. Given a collection of datasets, D4 outputs a set of domains discovered from the collection in a holistic fashion, taking all of the data into account.

Similar to word embedding methods such as Word2Vec, D4 gathers contextual information for terms. But unlike these methods, which build context for terms in unstructured text, we aim to capture the context of terms within columns in a set of tables. The intuition is that terms from the same domain frequently occur together in columns, or at least with similar sets of terms.

For more information about D4, please have a look at our paper Data-Driven Domain Discovery for Structured Datasets at VLDB 2020:

Masayo Ota, Heiko Müller, Juliana Freire, and Divesh Srivastava. Data-driven domain discovery for structured datasets. Proc. VLDB Endow. 13, 7 (March 2020), 953–967. DOI: https://doi.org/10.14778/3384345.3384346

Note: This repository merges the relevant parts of the previously separate repositories urban-data-core and urban-data-db.

Installation

The D4 jar file can be built using Apache Maven. After cloning the repository, run mvn clean install to build the D4 jar file.

git clone git@github.com:VIDA-NYU/domain-discovery-d4.git
cd domain-discovery-d4
mvn clean install
cp target/D4-jar-with-dependencies.jar /home/user/lib/D4.jar

Domain Discovery Pipeline

The D4 algorithm operates on a set of CSV files. All input files are currently expected to be in a single directory. D4 considers all files in the directory with suffix .csv, .csv.gz, .tsv, or .tsv.gz. Files with suffix .csv or .csv.gz are expected to be comma-separated; files with suffix .tsv or .tsv.gz are expected to be tab-delimited. The first line in each file is expected to contain the column header. The D4.jar file supports ten different commands:

$> java -jar /home/user/lib/D4.jar --help

Usage:
  <command> [

      Data preparation
      ----------------
      columns
      term-index
      eqs

      D4 pipeline
      -----------
      signatures
      expand-columns
      local-domains
      strong-domains

      Alternatives
      ------------
      no-expand
      columns-as-domains

      Explore Results
      ---------------
      export

  ] <args>

The first three commands (columns, term-index, and eqs) transform the input datasets into the internal format that is used by D4.
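
The sequence below sketches a complete run using the default file names of the individual commands; it assumes the input CSV files are in ./tsv and the jar is at /home/user/lib/D4.jar. The only non-default argument is the term-index output name, which is set explicitly so that it matches the default input of the eqs command.

$> java -jar /home/user/lib/D4.jar columns --input=tsv --output=columns
$> java -jar /home/user/lib/D4.jar term-index --input=columns --output=term-index.txt.gz
$> java -jar /home/user/lib/D4.jar eqs --input=term-index.txt.gz --output=compressed-term-index.txt.gz
$> java -jar /home/user/lib/D4.jar signatures
$> java -jar /home/user/lib/D4.jar expand-columns
$> java -jar /home/user/lib/D4.jar local-domains
$> java -jar /home/user/lib/D4.jar strong-domains
$> java -jar /home/user/lib/D4.jar export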

Data Preparation

Generate Column Files: The first command converts a set of CSV files into a set of column files, one file for each column in the dataset collection. The resulting column files are tab-delimited and contain the list of distinct terms for the respective column together with the frequency of each term. Each column has a unique identifier. Column metadata (i.e., column name and dataset file) are written to a separate metadata file. During this step, all column values are converted to upper case (to make the domain-discovery process case-insensitive).

The cacheSize parameter specifies the size of the memory cache for each column that is used to generate the list of distinct terms and their counts.

$> java -jar /home/user/lib/D4.jar columns --help
D4 - Data-Driven Domain Discovery - Version (0.28.0)

columns
  --input=<directory> [default: 'tsv']
  --metadata=<file> [default: 'columns.tsv']
  --cacheSize=<int> [default: 1000]
  --verbose=<boolean> [default: true]
  --threads=<int> [default: 6]
  --output=<directory> [default: 'columns']

Create Index of Unique Terms in the Data Collection: The term index contains a list of unique terms across all columns in the collection. In this step, the user has the option to consider only a subset of the columns in the collection (e.g., only those columns that were classified as text columns). The --textThreshold parameter specifies the fraction of distinct terms in a column that have to be classified as text for the column to be included in the term index.

$> java -jar /home/user/lib/D4.jar term-index --help
D4 - Data-Driven Domain Discovery - Version (0.28.0)

term-index
  --input=<directory | file> [default: 'columns']
  --textThreshold=<constraint> [default: 'GT0.5']
  --membuffer=<int> [default: 10000000]
  --validate=<boolean> [default: false]
  --threads=<int> [default: 6]
  --verbose=<boolean> [default: true]
  --output=<file> [default: 'text-columns.txt']
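
For example, to restrict the index to columns where more than 80% of the distinct terms are classified as text (an illustrative, stricter setting; the constraint uses the same GT prefix as the default), and to write the output under the name expected by the eqs command:

$> java -jar /home/user/lib/D4.jar term-index \
     --input=columns \
     --textThreshold=GT0.8 \
     --output=term-index.txt.gz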

Generate Equivalence Classes: D4 operates on sets of equivalence classes. Equivalence classes are sets of terms that always occur in the same set of columns; e.g., if two terms each occur in exactly the columns {c1, c2} and nowhere else, they belong to the same equivalence class.

$> java -jar /home/user/lib/D4.jar eqs --help
D4 - Data-Driven Domain Discovery - Version (0.28.0)

eqs
  --input=<file> [default: 'term-index.txt.gz']
  --verbose=<boolean> [default: true]
  --output=<file> [default: 'compressed-term-index.txt.gz']

Domain Discovery

D4 has three main steps: signature generation, column expansion, and domain discovery.

Robust Signatures: The first step creates signatures that capture the context of terms taking into account term co-occurrence information over all columns in the collection. These signatures are made robust to noise, heterogeneity, and ambiguity. Robust signatures capture the context of related terms while blending out noise. They are essential to our approach in addressing the challenges of incomplete columns (through column expansion) and term ambiguity in heterogeneous and noisy data.

$> java -jar /home/user/lib/D4.jar signatures --help
D4 - Data-Driven Domain Discovery - Version (0.28.0)

signatures
  --eqs=<file> [default: 'compressed-term-index.txt.gz']
  --sim=<str> [default: 'JI'] (alternatives: LOGJI, TF-ICF)
  --robustifier=<str> [default: 'LIBERAL'] (alternatives: COMMON-COLUMN, IGNORE-LAST)
  --fullSignatureConstraint=<boolean> [default: true]
  --ignoreLastDrop=<boolean> [default: false]
  --ignoreMinorDrop=<boolean> [default: true] 
  --threads=<int> [default: 6]
  --verbose=<boolean> [default: true]
  --signatures=<file> [default: 'signatures.txt.gz']

The robust signature step starts by computing a context signature for each term. A context signature is a vector containing the similarities between a term and all other terms in the dataset. Elements in the context signature are sorted in decreasing order of similarity. The --sim parameter specifies the similarity function that is used to create the context signature. The following similarity functions are currently supported:

  • JI: Jaccard-Index similarity between the sets of columns in which two equivalence classes occur. For example, if one equivalence class occurs in columns {c1, c2, c3} and another in {c2, c3}, their JI is 2/3.
  • LOGJI: Logarithm of the Jaccard-Index similarity.
  • TF-ICF: Weighted Jaccard-Index similarity, where the weight for each term is computed using a tf-idf-like measure.

For signature robustification, the context signature is first divided into blocks of elements based on the idea of the steepest drop, i.e., the maximum difference between consecutive elements in the sorted context signature. For example, in a sorted signature (0.9, 0.85, 0.4, 0.35, ...), the steepest drop occurs between 0.85 and 0.4, so the first block ends after the second element. The --ignoreMinorDrop parameter can be used to avoid splitting the context signature into too many blocks based on irrelevant steepest drops in regions of low variability. A minor drop is detected if the next steepest drop is smaller than the difference between the elements in the block that precedes the drop. If --ignoreMinorDrop is true and a minor drop occurs, all remaining elements are placed in a single final block.

D4 then prunes all blocks starting from the noisy block, retaining only the blocks that occur before it. There are three strategies for identifying the noisy block (controlled via the --robustifier parameter):

  • COMMON-COLUMN: The noisy block is the first block in which NOT all terms occur together in at least one column. The motivation is that blocks are supposed to represent subsets of the domains that a term belongs to. A block that contains terms that never occur together in at least one column is likely to contain terms that do not belong to the same domain.
  • IGNORE-LAST: Keeps all blocks except the last block. The only exception is a context signature that contains only a single block. This pruning strategy is particularly intended to be used in combination with --ignoreMinorDrop=true: if a minor drop is detected, all remaining elements in the context signature are placed in a single (final) block (see the example after this list).
  • LIBERAL: The default setting considers the number of terms in each block. The largest block is considered the noisy block for pruning.
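
As an illustration, the following invocation combines the IGNORE-LAST robustifier with minor-drop detection, as suggested above; all file names are left at their defaults:

$> java -jar /home/user/lib/D4.jar signatures \
     --robustifier=IGNORE-LAST \
     --ignoreMinorDrop=true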

Expanded Columns: The column expansion step addresses the challenge of incomplete columns by adding terms to a column that are likely to belong to the same domain as the majority of terms in the column. D4 makes use of robust signatures to expand columns: it adds a term only if the term has sufficient support (controlled by --expandThreshold) from the robust signatures of the terms in the column. This leads to high accuracy in domain discovery. Column expansion is iterative: adding a term to a column may provide additional support for other terms to be added as well (controlled by --iterations and --decrease).

$> java -jar /home/user/lib/D4.jar expand-columns --help
D4 - Data-Driven Domain Discovery - Version (0.28.0)

expand-columns
  --eqs=<file> [default: 'compressed-term-index.txt.gz']
  --signatures=<file> [default: 'signatures.txt.gz']
  --trimmer=<string> [default: CENTRIST] (alternatives: CONSERVATIVE, LIBERAL)
  --expandThreshold=<constraint> [default: 'GT0.25']
  --decrease=<double> [default: 0.05]
  --iterations=<int> [default: 5]
  --threads=<int> [default: 6]
  --verbose=<boolean> [default: true]
  --columns=<file> [default: 'expanded-columns.txt.gz']

Valid values for the --trimmer parameter are:

  • CONSERVATIVE
  • CENTRIST
  • LIBERAL
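
For a more cautious expansion one might, for example, combine the CONSERVATIVE trimmer with a higher expansion threshold and fewer iterations (illustrative values, not recommendations):

$> java -jar /home/user/lib/D4.jar expand-columns \
     --trimmer=CONSERVATIVE \
     --expandThreshold=GT0.5 \
     --iterations=3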

Local Domains: This step derives from each column a set of domain candidates, called local domains. Local domains are clusters of terms in an (expanded) column that are likely to belong to the same type.

$> java -jar /home/user/lib/D4.jar local-domains --help
D4 - Data-Driven Domain Discovery - Version (0.28.0)

local-domains
  --eqs=<file> [default: 'compressed-term-index.txt.gz']
  --columns=<file> [default: 'expanded-columns.txt.gz']
  --signatures=<file> [default: 'signatures.txt.gz']
  --trimmer=<string> [default: CENTRIST]
  --threads=<int> [default: 6]
  --verbose=<boolean> [default: true]
  --localdomains=<file> [default: 'local-domains.txt.gz']

Strong Domains: In this step, D4 applies a data-driven approach to narrow down the set of local domains and create a smaller set of strong domains to be presented to the user. Strong domains are sets of local domains that provide support for each other based on overlap.

$> java -jar /home/user/lib/D4.jar strong-domains --help
D4 - Data-Driven Domain Discovery - Version (0.28.0)

strong-domains
  --eqs=<file> [default: 'compressed-term-index.txt.gz']
  --localdomains=<file> [default: 'local-domains.txt.gz']
  --domainOverlap=<constraint> [default: 'GT0.5']
  --supportFraction=<double> [default: 0.25]
  --threads=<int> [default: 6]
  --verbose=<boolean> [default: true]
  --strongdomains=<file> [default: 'strong-domains.txt.gz']
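
To be more selective about which local domains support each other, both constraints can be tightened; the values below are illustrative only:

$> java -jar /home/user/lib/D4.jar strong-domains \
     --domainOverlap=GT0.75 \
     --supportFraction=0.5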

Alternatives

No expansion: Use the no-expand command to discover local domains on the original dataset columns (without expansion). This command outputs a columns file in the same format as the expand-columns step, which can be used as input for the local-domains step.

$> java -jar /home/user/lib/D4.jar no-expand --help
D4 - Data-Driven Domain Discovery - Version (0.30.1)

no-expand
  --eqs=<file> [default: 'compressed-term-index.txt.gz']
  --verbose=<boolean> [default: true]
  --columns=<file> [default: 'expanded-columns.txt.gz']
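
With default file names, the unexpanded pipeline simply swaps the expand-columns step for no-expand; the downstream commands run unchanged:

$> java -jar /home/user/lib/D4.jar no-expand
$> java -jar /home/user/lib/D4.jar local-domains
$> java -jar /home/user/lib/D4.jar strong-domains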

Whole column as domain: Instead of discovering local domains within (expanded) columns, there is now an option to treat each unique (expanded) column as a local domain.

$> java -jar /home/user/lib/D4.jar columns-as-domains --help
D4 - Data-Driven Domain Discovery - Version (0.30.1)

columns-as-domains
  --eqs=<file> [default: 'compressed-term-index.txt.gz']
  --columns=<file> [default: 'expanded-columns.txt.gz']
  --verbose=<boolean> [default: true]
  --localdomains=<file> [default: 'local-domains.txt.gz']

Explore Results

Export Domains: The discovered strong domains can be exported as JSON files for exploration. For each domain a separate file will be created in the output directory. The --sampleSize parameter controls the maximum number of terms that are included in the result for each equivalence class.

$> java -jar /home/user/lib/D4.jar export --help
D4 - Data-Driven Domain Discovery - Version (0.28.0)

export-domains
  --eqs=<file> [default: 'compressed-term-index.txt.gz']
  --terms=<file> [default: 'term-index.txt.gz']
  --columns=<file> [default: 'columns.tsv']
  --domains=<file> [default: 'strong-domains.txt.gz']
  --sampleSize=<int> [default: 100]
  --writePrimary=<boolean> [default: true]
  --output=<directory> [default: 'domains']

Each output file contains (i) a domain name (derived from frequent tokens in the domain columns), (ii) the list of columns to which the strong domain is assigned, and (iii) the list of terms in the local domains that form the strong domain. Each term in the strong domain is assigned a weight that is computed based on the number of local domains that contain the term. Terms in the strong domain are grouped into blocks based on their weights. An example output file is shown below:

{
  "name": "source",
  "columns": [
    {
      "id": 2852,
      "name": "source",
      "dataset": "sftu-nd43"
    },
    {
      "id": 3985,
      "name": "source",
      "dataset": "9dsr-3f97"
    }
  ],
  "terms": [
    [
      {
        "id": 41923498,
        "name": "WEB PAGE",
        "weight": "1.00000000"
      },
      {
        "id": 33736409,
        "name": "DBI",
        "weight": "1.00000000"
      },
      {
        "id": 33385994,
        "name": "BSMPERMITS",
        "weight": "1.00000000"
      },
      {
        "id": 39231762,
        "name": "STREETSPACEREQUEST",
        "weight": "1.00000000"
      }
    ],
    [
      {
        "id": 41923373,
        "name": "WEB",
        "weight": "0.50000000"
      }
    ]
  ]
}

If --writePrimary is set to true, a second text file is created for each strong domain, containing the terms in the first block of that strong domain.
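
Because the exported domains are plain JSON, standard command-line tools can be used to skim the results. A sketch using jq (not part of D4; the file name in the second command is a placeholder) that lists all domain names and then the terms in the first, highest-weight block of one domain:

$> jq -r '.name' domains/*.json
$> jq -r '.terms[0][].name' domains/<domain-file>.json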

Evaluation Datasets


We evaluated D4 using three collections of data from New York City and one collection of data from the State of Utah. All of the data was downloaded via the Socrata API. We make the data available as five separate repositories, each archived under its own DOI.

domain-discovery-d4's Issues

Improve Algorithm for Merging Similar Equivalence Classes

The current implementation of SimilarTermIndexGenerator is rather naive. It merges all equivalence classes in a connected component based on the similarity between pairs of equivalence classes. This approach has the strong disadvantage of potentially merging dissimilar equivalence classes, because similarity is not transitive.

One improvement could be to pick equivalence classes as strong seeds and then merge each seed with all other equivalence classes that are similar to it. While this could still merge dissimilar equivalence classes, it guarantees that all of them at least satisfy the similarity threshold with respect to the seed equivalence class.

Frequencies for Equivalence Classes

Add option to compute frequency of an equivalence class for each column C as either

  • min. column frequency of all EQ terms in C
  • max. column frequency of all EQ terms in C
  • sum of frequencies for all EQ terms in C (default)
  • average (rounded) of frequencies for all EQ terms in C

add ref to paper

Heiko,

Can you please add a reference to our VLDB paper in the readme? This may be useful for people that try to use the tool.

Thanks,
Juliana

Modify Strong Domain Discovery

The strong domain discovery step can be modified in the following way:

  • Cluster local domains that support each other. Each cluster forms a strong domain
  • For each strong domain rank terms based on the number of columns (in the strong domain) that they occur in
  • Use steepest drop to group terms based on their weights

Keep track of term frequencies

We should keep track of term frequencies in the column files (and in the term index and compressed term index). This would allow us to use similarity measures for terms/equivalence classes that are based on some notion of tf-idf.
