
Comments (9)

tfmorris commented on May 16, 2024

From tfmorris on May 10, 2010 19:26:52:
That sounds more like an enhancement request than a defect report.

To generalize things a bit, support for a standard data access API would allow people
to plug in multiple DB backends.
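As a rough illustration of what "a standard data access API" buys you, here is a minimal sketch using Python's DB-API 2.0 rather than anything in Gridworks itself. The `load_rows` helper and the `people` table are hypothetical; `sqlite3` only stands in for whichever backend driver is plugged in.

```python
import sqlite3


def load_rows(connection, table):
    """Read a whole table through any DB-API 2.0 connection (hypothetical helper)."""
    # Table names cannot be bound as query parameters, so this sketch
    # only accepts plain identifiers.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = connection.cursor()
    cursor.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    cursor.close()
    return columns, rows


# sqlite3 stands in for the backend; another DB-API driver would supply
# its own connect() call, while load_rows() stays the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO people (name) VALUES (?)", [("Ada",), ("Grace",)])
columns, rows = load_rows(conn, "people")
```

Only the `connect()` call is backend-specific here; the import logic depends on nothing but the shared cursor interface.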

from openrefine.

tfmorris commented on May 16, 2024

From [email protected] on May 10, 2010 19:32:58:
One of the major challenges here is how to support undo/redo when changes can get into
the back-end database without going through Gridworks. Another major challenge is where
to store metadata (such as reconciliation records) that is specific to Gridworks and
not native to any existing back-end database.


tfmorris commented on May 16, 2024

From mjlissner on May 10, 2010 19:36:53:
Hmm...I presume a new project would have to be created when doing this, and that
could hold the meta information.

As for undo/redo, maybe adding a commit button would make that easier. So changes can
be made to a snapshot of the data, but then no changes are made to the DB itself
until commit is pressed?
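One way the snapshot-plus-commit idea could look, as a minimal sketch and not how Gridworks works: edits, undo and redo touch only an in-memory copy, and the database is written once on commit. The `SnapshotSession` class and the `cities` table are hypothetical, `sqlite3` stands in for the real backend, and table and column names are assumed to be trusted identifiers.

```python
import sqlite3


class SnapshotSession:
    """Hypothetical disconnected session: edits stay local until commit()."""

    def __init__(self, conn, table, key):
        self.conn, self.table, self.key = conn, table, key
        cursor = conn.execute(f"SELECT * FROM {table}")
        columns = [description[0] for description in cursor.description]
        # Local snapshot, keyed by the remote primary key.
        self.rows = {}
        for row in cursor:
            record = dict(zip(columns, row))
            self.rows[record[key]] = record
        self.history = []     # applied edits: (row_key, column, old, new)
        self.redo_stack = []  # undone edits that can be re-applied

    def edit(self, row_key, column, value):
        old = self.rows[row_key][column]
        self.rows[row_key][column] = value
        self.history.append((row_key, column, old, value))
        self.redo_stack.clear()

    def undo(self):
        row_key, column, old, new = self.history.pop()
        self.rows[row_key][column] = old
        self.redo_stack.append((row_key, column, old, new))

    def redo(self):
        row_key, column, old, new = self.redo_stack.pop()
        self.rows[row_key][column] = new
        self.history.append((row_key, column, old, new))

    def commit(self):
        # Replay the surviving edits in order inside one transaction.
        with self.conn:
            for row_key, column, _old, new in self.history:
                self.conn.execute(
                    f"UPDATE {self.table} SET {column} = ? WHERE {self.key} = ?",
                    (new, row_key),
                )
        self.history.clear()
        self.redo_stack.clear()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO cities VALUES (1, 'new york')")
conn.commit()

session = SnapshotSession(conn, "cities", "id")
session.edit(1, "name", "New York")
session.edit(1, "name", "NYC")
session.undo()  # back to "New York", still only in the snapshot
before_commit = conn.execute("SELECT name FROM cities WHERE id = 1").fetchone()[0]
session.commit()
after_commit = conn.execute("SELECT name FROM cities WHERE id = 1").fetchone()[0]
```

The sketch does not cover changes that reach the database from elsewhere while the snapshot is open.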


tfmorris commented on May 16, 2024

From iainsproat on May 10, 2010 19:46:01:
You can either keep it synchronous with the database (effectively using the database
as the backend), but lose undo/redo support and reconciliation with Freebase (unless
you add suitable tables to the database). You're then using Gridworks for just the
facets really.

The other way, as you suggest, is to make it similar to a disconnected session with a
commit transaction. The issue is then ensuring consistency between the remote
database and the snapshot held in Gridworks. Merging the two back in would be an
issue. You'd also need to hold keys from the remote database in Gridworks for
updating records.
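To make the consistency problem concrete, here is a minimal sketch of one possible approach (a compare-and-set update, chosen for illustration, not something proposed in this thread): the snapshot keeps the remote keys and original values, and a row is written back only if the remote value is still what the snapshot saw. The `records` table and both helper functions are hypothetical, `sqlite3` stands in for the remote database, and values are assumed to be non-NULL.

```python
import sqlite3


def take_snapshot(conn):
    """Copy rows into memory, keyed by the remote primary key."""
    return {key: name for key, name in conn.execute("SELECT id, name FROM records")}


def commit_changes(conn, original, edited):
    """Write back edited rows; return keys whose remote value changed meanwhile."""
    conflicts = []
    with conn:  # one transaction for the whole commit
        for key, new_value in edited.items():
            if new_value == original[key]:
                continue  # row was not edited locally
            # Compare-and-set: the update applies only if the remote row
            # still holds the value seen when the snapshot was taken.
            cursor = conn.execute(
                "UPDATE records SET name = ? WHERE id = ? AND name = ?",
                (new_value, key, original[key]),
            )
            if cursor.rowcount == 0:
                conflicts.append(key)
    return conflicts


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO records VALUES (?, ?)", [(1, "acme inc"), (2, "globex")])
conn.commit()

original = take_snapshot(conn)
edited = dict(original)
edited[1] = "Acme Inc."
edited[2] = "Globex Corp."

# Simulate a change that reaches the database without going through the snapshot.
conn.execute("UPDATE records SET name = 'Globex LLC' WHERE id = 2")
conn.commit()

conflicts = commit_changes(conn, original, edited)
final_rows = take_snapshot(conn)
```

Rows reported in `conflicts` are left untouched in the database; how to merge them is the open question raised above.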

Frameworks such as Hibernate or Spring would be worth considering for their database
abstraction layers.


tfmorris commented on May 16, 2024

From thadguidry on May 10, 2010 20:05:13:
I have read that good ORMs such as Hibernate now support ordered lists and other
features now incorporated into JPA 2.0 as of Dec 2009.


tfmorris commented on May 16, 2024

From [email protected] on November 11, 2010 07:08:54:
Initially, what I believe is most important and useful is simply the ability to connect directly to MySQL, PostgreSQL, and other data sources from the get-go (when creating a new project). It would also help, initially, to be able to set up and save multiple data sources, and multiple databases within those sources. When the user creates a new project, this would in effect create a disconnected session (as mentioned above), in which the imported data is treated just as project data is treated now. Adding 'commit' features can come next, followed by more advanced connectivity options, until synchronous functionality is in place to one degree or another. But for now, it would certainly be nice to add data sources and pull data from those sources!

Eric Jarvies


tfmorris commented on May 16, 2024

From thadguidry on November 11, 2010 14:36:42:
Eric, a cleanup tool, following industry best practices, is best used offline within a process. There are existing ETL (Extract, Transform, Load) tools that easily consume from MySQL, PostgreSQL, etc. and offer excellent flow control, exporting, and connectivity to produce delimited files with relative ease. Talend is one such product that I use along with Google Refine. Talend (Open Source Community edition) does my scheduled daily gathering from 3 databases (MySQL and Oracle) and then dumps a customized TSV file that I open with Google Refine for further analysis and sometimes cleanup. There are other ETL tools like Talend.

I'm not 100% sure the team really feels the need to copy that and flesh out a full ETL platform, since Talend and other tools fill that need very nicely. Incidentally, using Google Refine and a bit of clustering, I was able to find a few loopholes in our data storage processing that we fixed with a few stored procedures within Oracle. Google Refine was instrumental as a discovery tool for that. Talend does have an MDM component, but it does not have the interactivity of a discovery tool like Google Refine.

If you do NOT need a daily process, but only a one-time cleanup, just dumping with MySQL or PostgreSQL would offer about the same and, depending on the size of the database, takes only seconds to minutes. Dumping can also avoid potential live database locks, which Refine, if it supported direct connections, might have to tiptoe around, depending on the team's chosen implementation of database connectivity. If you have large database size needs, give Talend or another ETL tool a try with Google Refine, and you'll soon see the powerful left-right combination.

I'm not sure how far the team will ultimately decide to absorb direct connectivity support within Google Refine. I'd like to hear other opinions as well on this Issue-12.
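For the one-time case, the "dump to a delimited file, then open it in Refine" step can be very small. The following is a minimal sketch and not a Talend or Refine feature: the `dump_query_as_tsv` helper and the `people` table are hypothetical, and `sqlite3` stands in for MySQL or PostgreSQL.

```python
import csv
import io
import sqlite3


def dump_query_as_tsv(conn, query, out):
    """Write a header row plus all result rows as tab-separated values."""
    cursor = conn.execute(query)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([description[0] for description in cursor.description])
    writer.writerows(cursor)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO people (name) VALUES (?)", [("Ada",), ("Grace",)])

# An in-memory buffer keeps the sketch self-contained; a real dump would
# pass a file opened with open(path, "w", newline="") instead.
buffer = io.StringIO()
dump_query_as_tsv(conn, "SELECT id, name FROM people ORDER BY id", buffer)
tsv_text = buffer.getvalue()
```

The resulting file can then be imported as an ordinary delimited file, with no live connection to the database.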


tfmorris commented on May 16, 2024

From [email protected] on April 14, 2011 14:23:04:
I agree with thadguidry. Let the Refine team focus on bringing data quality issues to light. Let Talend focus on data quality (they do have a data profiling tool that can identify some of this stuff). Talend is what we use for basic ETL. You could write some SQL to get the data out of MySQL anyway. If you analyze data quality against an entire large table while directly connected to the DB server, your DBA might become angry too.

C


thadguidry commented on May 16, 2024

#1277 is being worked on to support this issue!

