
Web Data INTEgRation Framework (WInte.r)

The WInte.r framework [5] provides methods for end-to-end data integration. The framework implements well-known methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation. The methods are designed to be easily customizable by exchanging pre-defined building blocks, such as blockers, matching rules, similarity functions, and conflict resolution functions. In addition, these pre-defined building blocks can be used as a foundation for implementing advanced integration methods.

Contents

Quick Start: The section below provides an overview of the functionality of the WInte.r framework. To acquaint yourself with the framework, you can also read the WInte.r Tutorial or have a look at the code examples in our Wiki!

Using WInte.r

You can include the WInte.r framework via the following Maven dependency:

<dependency>
    <groupId>de.uni-mannheim.informatik.dws.winter</groupId>
    <artifactId>winter-framework</artifactId>
    <version>1.4.1</version>
</dependency>

Functionality

The WInte.r framework covers all central steps of the data integration process, including data loading, pre-processing, schema matching, identity resolution, as well as data fusion. This section gives an overview of the functionality and the alternative algorithms that are provided for each of these steps.

Data Integration Process Example

Data Loading: WInte.r provides readers for standard data formats such as CSV, XML and JSON. In addition, WInte.r offers a specialized JSON format for representing tabular data from the Web together with meta-information about the origin and context of the data, as used by the Web Data Commons (WDC) Web Tables Corpora.

Pre-processing: During pre-processing you prepare your data for the methods that you are going to apply later on in the integration process. WInte.r WebTables provides you with specialized pre-processing methods for tabular data, such as:

  • Data type detection
  • Unit of measurement normalization
  • Header detection
  • Subject column detection (also known as entity name column detection)
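As a toy illustration of what data type detection involves, a detector can classify cell values with simple pattern heuristics. This is a minimal sketch under assumed names (SimpleTypeDetector is not part of WInte.r, whose detectors are considerably more sophisticated):

```java
import java.util.regex.Pattern;

public class SimpleTypeDetector {

    private static final Pattern NUMERIC = Pattern.compile("-?\\d+(\\.\\d+)?");
    private static final Pattern ISO_DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    // Guess the data type of a single cell value using simple heuristics
    public static String detectType(String value) {
        if (value == null || value.trim().isEmpty()) return "unknown";
        String v = value.trim();
        if (ISO_DATE.matcher(v).matches()) return "date";
        if (NUMERIC.matcher(v).matches()) return "numeric";
        if (v.equalsIgnoreCase("true") || v.equalsIgnoreCase("false")) return "boolean";
        return "string";
    }
}
```

A real detector would additionally aggregate the per-cell guesses into a majority decision per column and extract units of measurement.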

Schema Matching: Schema matching methods find attributes in two schemata that have the same meaning. WInte.r provides three pre-implemented schema matching algorithms which either rely on attribute labels or data values, or exploit an existing mapping of records (duplicate-based schema matching) in order to find attribute correspondences.

  • Label-based schema matching
  • Instance-based schema matching
  • Duplicate-based schema matching
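To give a flavour of the label-based variant, the following sketch scores two attribute labels by the Jaccard overlap of their tokens (an illustrative stand-in, not WInte.r's configurable similarity functions):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LabelBasedMatching {

    // Jaccard similarity between tokenised, lower-cased attribute labels
    public static double labelSimilarity(String labelA, String labelB) {
        Set<String> a = new HashSet<>(Arrays.asList(labelA.toLowerCase().split("[\\s_]+")));
        Set<String> b = new HashSet<>(Arrays.asList(labelB.toLowerCase().split("[\\s_]+")));
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }
}
```

Attribute pairs whose similarity exceeds a threshold would be emitted as correspondences; instance-based matching applies the same idea to overlaps of data values instead of labels.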

Identity Resolution: Identity resolution methods (also known as data matching or record linkage methods) identify records that describe the same real-world entity. The pre-implemented identity resolution methods can be applied to a single dataset for duplicate detection or to multiple datasets in order to find record-level correspondences. Besides manually defining identity resolution methods, WInte.r also allows you to learn matching rules from known correspondences. Identity resolution methods rely on blocking (also called indexing) in order to reduce the number of record comparisons. WInte.r provides the following pre-implemented blocking and identity resolution methods:

  • Blocking by single/multiple blocking key(s)
  • Sorted-Neighbourhood Method
  • Token-based identity resolution
  • Rule-based identity resolution
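The effect of blocking can be sketched in a few lines: records are grouped by a blocking key (here, naively, the first character of a title), and only records within the same block are compared. This is a toy sketch, not WInte.r's blocker classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockingSketch {

    // Group record titles by a naive blocking key: the first character
    public static Map<String, List<String>> block(List<String> titles) {
        Map<String, List<String>> blocks = new HashMap<>();
        for (String title : titles) {
            String key = title.isEmpty() ? "" : title.substring(0, 1).toLowerCase();
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(title);
        }
        return blocks;
    }

    // Candidate pairs remaining after blocking (instead of n * (n - 1) / 2 overall)
    public static long candidatePairs(Map<String, List<String>> blocks) {
        long pairs = 0;
        for (List<String> block : blocks.values()) {
            long n = block.size();
            pairs += n * (n - 1) / 2;
        }
        return pairs;
    }
}
```

For five titles starting with three distinct letters, blocking reduces the 10 possible comparisons to just the within-block pairs.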

Data Fusion: Data fusion methods combine data from multiple sources into a single, consolidated dataset. For this, they rely on the schema- and record-level correspondences that were discovered in the previous steps of the integration process. However, different sources may provide conflicting data values. WInte.r allows you to resolve such data conflicts (decide which value to include in the final dataset) by applying different conflict resolution functions.

  • 11 pre-defined conflict resolution functions for strings, numbers and lists of values as well as data type independent functions.
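Two common strategies can be sketched as follows (illustrative only; WInte.r's actual functions live in the datafusion.conflictresolution packages, e.g. the Median class for numeric values):

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConflictResolutionSketch {

    // "Longest string": prefer the most detailed value
    public static String longestString(List<String> values) {
        return Collections.max(values, Comparator.comparingInt(String::length));
    }

    // "Voting": prefer the value reported by the most sources
    public static String voting(List<String> values) {
        Map<String, Long> counts = new HashMap<>();
        for (String value : values) counts.merge(value, 1L, Long::sum);
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```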

Use cases

WInte.r can be used out of the box to integrate data from multiple data sources. The framework can also be used as a foundation for implementing more advanced, use-case-specific integration methods. In the following, we provide an example use case from each category.

Integration of Multiple Data Sources: Building a Movie Dataset

The WInte.r framework is used to integrate data from multiple sources within the Web Data Integration course offered by Professor Bizer at the University of Mannheim. The basic case study in this course is the integration of product data from multiple Web data sources. In addition, student teams use the WInte.r framework to integrate data about different topics as part of the projects that they conduct during the course.

Integration of Large Numbers of Data Sources: Augmenting the DBpedia Knowledge base with Web Table Data

Many web sites provide data in the form of HTML tables. Millions of such data tables have been extracted from the CommonCrawl web corpus by the Web Data Commons project [3]. Data from these tables can be used to fill missing values in large cross-domain knowledge bases such as DBpedia [2]. An example of how pre-defined building blocks from the WInte.r framework are combined into an advanced, use-case-specific integration method is the T2K Match algorithm [1]. The algorithm is optimized to match millions of Web tables against a central knowledge base describing millions of instances belonging to hundreds of different classes (such as people or locations) [2]. The full source code of the algorithm, which includes advanced matching methods that combine schema matching and identity resolution, is available in the WInte.r T2K Match project.

Pre-processing for large-scale Matching: Stitching Web Tables for Improving Matching Quality

Tables on web pages ("web tables") cover a diversity of topics and can be a source of information for different tasks such as knowledge base augmentation or the ad-hoc extension of datasets. However, to use this information, the tables must first be integrated, either with each other or into existing data sources. The challenges that matching methods for this purpose have to overcome are the high heterogeneity and the small size of the tables. To counter these problems, web tables from the same web site can be stitched before running any of the existing matching systems. This means that web tables are combined based on a schema mapping, which results in fewer and larger stitched tables [4]. The source code of the stitching method is available in the Web Tables Stitching project.

Data Search for Data Mining (DS4DM)

Analysts increasingly have the problem that they know that some data which they need for a project is available somewhere on the Web or in the corporate intranet, but they are unable to find the data. The goal of the 'Data Search for Data Mining' (DS4DM) project is to extend the data mining platform RapidMiner with data search and data integration functionalities which enable analysts to find relevant data in potentially very large data corpora, and to semi-automatically integrate the discovered data with existing local data.

Contact

If you have any questions, please refer to the WInte.r Tutorial, the Wiki, and the JavaDoc first. For further information, contact alex [dot] brinkmann [at] informatik [dot] uni-mannheim [dot] de

License

The WInte.r framework can be used under the Apache 2.0 License.

If you use the WInte.r framework in any publication, please cite [5].

Acknowledgements

WInte.r is developed at the Data and Web Science Group at the University of Mannheim.

References

[1] Ritze, D., Lehmberg, O., & Bizer, C. (2015, July). Matching HTML Tables to DBpedia. In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics (p. 10). ACM.

[2] Ritze, D., Lehmberg, O., Oulabi, Y., & Bizer, C. (2016, April). Profiling the potential of web tables for augmenting cross-domain knowledge bases. In Proceedings of the 25th International Conference on World Wide Web (pp. 251-261). International World Wide Web Conferences Steering Committee.

[3] Lehmberg, O., Ritze, D., Meusel, R., & Bizer, C. (2016, April). A large public corpus of web tables containing time and context metadata. In Proceedings of the 25th International Conference Companion on World Wide Web (pp. 75-76). International World Wide Web Conferences Steering Committee.

[4] Lehmberg, O., & Bizer, C. (2017). Stitching web tables for improving matching quality. Proceedings of the VLDB Endowment, 10(11), 1502-1513.

[5] Lehmberg, O., Brinkmann, A., & Bizer, C. (2017). WInte.r - A Web Data Integration Framework. ISWC 2017.


Issues

Use SLF4J instead of the dependency on log4j

Feature request: Instead of depending on log4j explicitly, using SLF4J would let this project integrate better into other Java projects, which may or may not use a different logging framework (SLF4J can, I think, still use log4j as its backend by default).

Build failure due to tests

I just cloned the repository and wanted to follow the tutorial steps, but running mvn install in the winter-framework folder yields 3 failed tests.

testTypeValue(de.uni_mannheim.informatik.dws.winter.webtables.detectors.PatternbaseTypeDetectorTest)  Time elapsed: 0.003 sec  <<< FAILURE!
junit.framework.AssertionFailedError: expected:<null> but was:<de.uni_mannheim.informatik.dws.winter.preprocessing.units.Unit@72758afa>
	at de.uni_mannheim.informatik.dws.winter.webtables.detectors.PatternbaseTypeDetectorTest.testTypeValue(PatternbaseTypeDetectorTest.java:36)
	... (identical JUnit and Maven Surefire frames omitted)

testCheckUnit(de.uni_mannheim.informatik.dws.winter.preprocessing.units.UnitCategoryParserTest)  Time elapsed: 0.001 sec  <<< FAILURE!
junit.framework.AssertionFailedError
	at de.uni_mannheim.informatik.dws.winter.preprocessing.units.UnitCategoryParserTest.testCheckUnit(UnitCategoryParserTest.java:27)
	... (identical JUnit and Maven Surefire frames omitted)

testTypeValue(de.uni_mannheim.informatik.dws.winter.preprocessing.datatypes.ValueNormalizerTest)  Time elapsed: 0.029 sec  <<< FAILURE!
junit.framework.AssertionFailedError: expected:<1500000.0> but was:<1.5>
	at de.uni_mannheim.informatik.dws.winter.preprocessing.datatypes.ValueNormalizerTest.testTypeValue(ValueNormalizerTest.java:32)
	... (identical JUnit and Maven Surefire frames omitted)

Intersection throws NULL pointer

Running the intersection method causes a NullPointerException if the two sets do not contain any values.

Union does not have this problem, as its result list is initialized, which does not happen in Intersection.java.
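The report suggests a missing initialization; a null-safe variant might look like the following (a hypothetical sketch, not the actual Intersection.java code):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SafeIntersection {

    // Always return an initialized (possibly empty) list, as Union does
    public static <T> List<T> intersect(Collection<T> first, Collection<T> second) {
        List<T> result = new ArrayList<>();
        if (first == null || second == null) return result;
        Set<T> lookup = new HashSet<>(second);
        for (T element : first) {
            if (lookup.contains(element)) result.add(element);
        }
        return result;
    }
}
```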

approximate FD by TANE

Hello,
Have you run your program to test the results of approximate FD discovery? I obtain the same result as in the original paper if the error threshold is 0, but with error thresholds of 0.01, 0.05, etc., I cannot reproduce the results of the published experiment. Have you been able to reproduce the published results for approximate FDs?
Thank you!

Inconsistent Use of Verbose

MatchingEvaluator and DataFusionEvaluator use verbose differently. Is it possible to make this consistent? Either in the constructor or via setVerbose().

Architecture Overview Figure

Hi Oli,

Great work. Can you add an overview figure of the classes for fusion and identity resolution? In the former release, we had one of those, right?

Cheers,
Robert

potential bug/error in RDFMatchableReader

Main issue: From the example in 'RDFRecordReaderTest', it seems that the schema/attributes are incorrectly assigned.

Possible causes: As can be seen from the definition of the abstract method readLine() in RDFMatchableReader, the actual attribute is not passed to the dataset population method. In the implementation (the RDFRecordReader class), the attribute identifier looks fine, but the attribute value is assigned to the attribute name and the values are assigned to the record again. This does not seem correct to me.

Median Conflict Resolution Utility throws IndexOutOfBoundsException

What happened:
I use the Median conflict resolution in one of my AttributeValueFusers and get an IndexOutOfBoundsException on some of my RecordGroups.

What I expect to happen:
There should be no exception thrown in the library.

Root cause:
The linked list that is used internally covers the cases of size 0 and size > 1, but if the list contains exactly one element, the exception is thrown.
https://github.com/olehmberg/winter/blob/master/winter-framework/src/main/java/de/uni_mannheim/informatik/dws/winter/datafusion/conflictresolution/numeric/Median.java#L48

A possible fix would include an update to the if statement like this:

boolean isEven = list.size() % 2 == 0;
if (list.size() == 0) {
    return new FusedValue<>((Double) null);
} else if (list.size() == 1) { // a single element is its own median
    return new FusedValue<>(list.get(0));
} else if (isEven) {
    // even length: average the two middle elements
    int middle = list.size() / 2;
    double median1 = list.get(middle - 1);
    double median2 = list.get(middle);

    return new FusedValue<>((median1 + median2) / 2.0);
} else {
    // odd length: the middle element is at 0-based index size / 2
    int middle = list.size() / 2;

    return new FusedValue<>(list.get(middle));
}

Another possibility would be to round list.size() / 2 correctly; in the current implementation, the decimal places are simply truncated.
See https://stackoverflow.com/a/2654897/6059889 for how to obtain a correctly rounded integer value.

Change naming of short debug file

The method writeDebugMatchingResultsToFile of the MatchingRule writes a debug log to two separate files:

  • "/path/to/debugResults.csv"
  • "/path/to/debugResults.csv_short"

A user only supplies the path "/path/to/debugResults.csv".
The suffix "_short" is simply appended to the path to determine the path to the short debug log.
This results in a file of type .csv_short, which is not a common file type.

Requirement:
The file path of the short debug log should be "/path/to/debugResults_short.csv".
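The requested behaviour amounts to inserting the suffix before the file extension rather than appending it. A minimal sketch (hypothetical helper, not the actual MatchingRule code):

```java
public class DebugFileNaming {

    // Insert "_short" before the file extension instead of appending it
    public static String shortDebugPath(String path) {
        int dot = path.lastIndexOf('.');
        if (dot < 0) return path + "_short"; // no extension: fall back to appending
        return path.substring(0, dot) + "_short" + path.substring(dot);
    }
}
```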

Run

Hello,
thank you for publishing this project.
Where are the inputs?
Thanks in advance.

Logging of lists of objects

When the data model contains an attribute which is a list of objects, e.g. a main entity album with an attribute tracks where each track is a separate object, the log files output the Java object id rather than the actual values.

No Schema when loading a XML-File in Default

When I try loading an XML file without creating an extra model, as described in the documentation, no schema is assigned to the dataset.
Loading itself works: you can access the data loaded into the dataset. I tried to find out what is going wrong.
When loading a CSV file, a schema is present. When constructing an extra model for the XML file, the schema is there, too.
Looking into the XMLMatchableReader, a schema exists during default loading, but it disappears when returning to the original starting class.

That makes it impossible to use any of the matching tools.
