
Comments (8)

mroiter-larus commented on June 6, 2024

@soneymathew I had a test case, but it didn't reproduce the error. Anyway, I'll try your suggestion.

from neo4j-etl.

soneymathew commented on June 6, 2024

I tested the following strategy on a table with ~1 million rows; the load completed in ~46 seconds.

1. Set the DB_URL so that it can be referred to later:

call apoc.static.set('DB_URL', '<connection url>')

2. Not sure if this matters, but I set the constraints required for the node first:

CREATE CONSTRAINT ON (t:<TableName>) ASSERT t.id IS UNIQUE;

3. Use APOC procedures to load the data in parallel with a batch size of 10k:

CALL apoc.periodic.iterate('
CALL apoc.static.get("DB_URL") YIELD value WITH apoc.convert.toString(value) AS DB_URL
CALL apoc.load.jdbc(DB_URL, "<TableName>") YIELD row RETURN row',
'CREATE (c:<TableName>) SET c += row', {batchSize:10000, parallel:true})
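For intuition, here is a minimal plain-Python sketch (no Neo4j or APOC required; all names and numbers are illustrative) of the pattern apoc.periodic.iterate applies above: stream rows from a source, cut them into fixed-size batches, and hand each batch to a pool of parallel workers.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the rows streamed by apoc.load.jdbc.
rows = ({"id": i} for i in range(1_000_000))

def batches(it, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for row in it:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def apply_batch(batch):
    # Stand-in for running 'CREATE (c:<TableName>) SET c += row' per row;
    # here we just count the rows handled.
    return len(batch)

# Parallel workers consume batches of 10k, as in {batchSize:10000, parallel:true}.
with ThreadPoolExecutor(max_workers=8) as pool:
    processed = sum(pool.map(apply_batch, batches(rows, 10_000)))

print(processed)  # → 1000000
```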


jexp commented on June 6, 2024

It's odd, the CSV export/import should handle JSON correctly. I'll have a deeper look at the issue.

What is the data type in your relational database?

Yep apoc.periodic.iterate rocks it :)


jexp commented on June 6, 2024

It's actually an interesting proposal. Let's discuss this.

So instead of doing the transformation + batching (optionally parallel) in Java, we would do it in APOC.


soneymathew commented on June 6, 2024

My initial exploration was to manually do what the ETL tool currently facilitates, but as I went through the stages I learned that the APOC approach can be amazingly performant and skips the overhead of CSV creation and loading.

My sample dataset is a Postgres schema containing a mix of the following:

  • normal tables/columns whose values are database-supported base data types
  • some columns of datatype text with multiline JSON content (this is the data that broke the ETL tool)
  • some columns of datatype text with XML content

On a side note, I noticed that the ETL tool silently ignores unsupported datatypes (examples are oid and tsvector). It would be cool to log these as errors, or as choices that can be pondered upon, perhaps?


mroiter-larus commented on June 6, 2024

@soneymathew Can you help me reproduce this issue? For example, could you tell me which sample database you used?

Thank you


soneymathew commented on June 6, 2024

@mroiter-larus sorry, I won't be able to share my database.
You could probably reproduce this by storing the content of a big pretty-printed JSON file in a text column.
Example: the generated mapping JSON is a good candidate for test data.


soneymathew commented on June 6, 2024

@mroiter-larus is it important for you to reproduce it?

Users could have bad data in their database, with non-printable characters or multiline content that can break your CSV. There is a risk of data loss along the translation steps: DB -> CSV -> LOAD CSV.

I raised this issue to explore whether neo4j-etl can be updated to bypass the CSV generation step altogether.
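To illustrate the risk, here is a small self-contained Python sketch (illustrative data only) showing why multiline values embedded in CSV cells trip up naive line-based processing, while a quoting-aware parser recovers the rows intact:

```python
import csv
import io
import json

# Hypothetical row: a text column holding pretty-printed (multiline) JSON,
# as described above. The embedded newlines mean one logical cell spans
# several physical lines of the CSV file.
pretty_json = json.dumps({"name": "example", "tags": ["a", "b"]}, indent=2)
rows = [["1", pretty_json], ["2", "plain value"]]

# Python's csv writer quotes the multiline field so it survives round-tripping.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# A naive line-based reader over-counts the records...
naive_count = len(text.strip().splitlines())

# ...while a quoting-aware parser recovers the original two rows intact.
parsed = list(csv.reader(io.StringIO(text)))

# The naive count exceeds the true row count of 2.
print(naive_count, len(parsed))
```

A tool that splits the file on newlines (or writes multiline values unquoted) would corrupt exactly this kind of data, which matches the failure mode described above.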

I believe it would help your users from the following perspectives:

  • Space savings (CSVs generated from a large DB can be pretty large, as they are uncompressed)
  • Avoid the risk of data being lost in translation by cutting out the CSV step
  • Possibly better performance, as you avoid the intermediate process and import into Neo4j straight from the RDBMS
  • Leverage/experiment with better strategies like concurrent access, batching, etc. to increase the overall import speed

