vida-nyu / domain-discovery-d4
Data-Driven Domain Discovery for Structured Datasets
License: Apache License 2.0
Add option to compute frequency of an equivalence class for each column C as either
Heiko,
Can you please add a reference to our VLDB paper in the readme? This may be useful for people that try to use the tool.
Thanks,
Juliana
Allow user to switch on/off output printed to the console by different D4 components.
We should keep track of term frequencies in the column files (and the terms index and compressed term index). This would allow us to use similarity measures for terms/equivalence classes that are based on some notion of tf-idf.
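One possible reading of "some notion of tf-idf" is sketched below. It is an illustration only, not D4 code: the input layout (a mapping from column id to per-term counts), the function names `tfidf_vectors` and `cosine`, and the smoothed idf formula are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(columns):
    """Build one tf-idf vector per term.

    columns: dict mapping a column id to a Counter of term -> frequency
    (a hypothetical in-memory stand-in for column files with frequencies).
    Returns a dict mapping each term to {column_id: tf-idf weight}.
    """
    n_columns = len(columns)
    # Document frequency: number of columns in which a term occurs.
    df = Counter()
    for counts in columns.values():
        df.update(counts.keys())
    vectors = {}
    for col_id, counts in columns.items():
        total = sum(counts.values())
        for term, freq in counts.items():
            tf = freq / total
            idf = math.log(n_columns / df[term]) + 1.0  # assumed smoothing
            vectors.setdefault(term, {})[col_id] = tf * idf
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


columns = {
    0: Counter({"a": 2, "b": 2}),
    1: Counter({"a": 1, "b": 1}),
    2: Counter({"c": 3}),
}
vectors = tfidf_vectors(columns)
```

With frequencies available, two terms that occur in the same columns with similar relative weight score close to 1, while terms that share no column score 0.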
Add step for replacing local domains as equivalence classes to the main D4 workflow.
The default input file for the term-index step of D4 is listed as text-columns.txt. This should be columns/.
The current implementation of SimilarTermIndexGenerator is rather naive. It merges all equivalence classes in a connected component, where components are formed from pairwise similarity between equivalence classes. Because similarity is not transitive, this approach can merge dissimilar equivalence classes.
One improvement could be to pick equivalence classes as strong seeds and then merge each seed with all other equivalence classes that are similar to it. While this could still merge dissimilar equivalence classes, it guarantees that every merged class at least satisfies the similarity threshold with the seed equivalence class.
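The difference between the two strategies can be sketched as follows. This is a minimal illustration, not the SimilarTermIndexGenerator code: equivalence classes are modelled as sets of column ids, Jaccard similarity stands in for whatever measure D4 uses, and picking the class with the most columns as the next seed is a placeholder heuristic, since the issue does not say how strong seeds are chosen.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of column ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_connected_components(eqs, threshold):
    """Naive strategy: merge everything in a connected component."""
    parent = list(range(len(eqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(eqs)), 2):
        if jaccard(eqs[i], eqs[j]) >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(eqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def merge_around_seeds(eqs, threshold):
    """Seed strategy: every merged class must be similar to its seed."""
    # Hypothetical seed order: largest column set first, ties by index.
    order = sorted(range(len(eqs)), key=lambda i: (-len(eqs[i]), i))
    assigned = set()
    groups = []
    for seed in order:
        if seed in assigned:
            continue
        group = [seed]
        assigned.add(seed)
        for other in order:
            if other not in assigned and jaccard(eqs[seed], eqs[other]) >= threshold:
                group.append(other)
                assigned.add(other)
        groups.append(sorted(group))
    return sorted(groups)


# Classes 0 and 2 share no columns but are both similar to class 1.
eqs = [{1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}]
naive = merge_connected_components(eqs, 0.3)
seeded = merge_around_seeds(eqs, 0.3)
```

In this example the naive strategy puts all three classes into one group even though classes 0 and 2 have similarity 0, while the seed strategy keeps class 2 separate because it does not meet the threshold with seed 0.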
The strong domain discovery step can be modified in the following way: