The batchperformance from jexp


Parallel Batch inserter with Neo4j

Uses the LMAX Disruptor to parallelize operations during batch-insertion.

The 6 operations are:

(1) property encoding
(2) property-record creation
(3) relationship-id creation and forward handling of reverse relationship chains
(4) writing node-records
(5) writing relationship-records
(6) writing property-records

Dependencies:

(1)<--(2)<--(6)
(2)<--(5)-->(3)   
(2)<--(4)-->(3)
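
As a rough illustration of the diagram, the sketch below reads each arrow as "depends on" (an interpretation, not something the diagram states) and runs the six stages once for a single hypothetical item. It uses the JDK's CompletableFuture rather than the Disruptor, so it only shows the ordering constraints, not the ring-buffer mechanics. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: stage numbers match the list of operations above.
class HandlerDependencySketch {

    /** Runs all six stages once and returns the order in which they finished. */
    static List<Integer> runOnce() {
        List<Integer> finished = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // (1) and (3) have no prerequisites and may start immediately.
            CompletableFuture<Void> s1 = CompletableFuture.runAsync(() -> finished.add(1), pool);
            CompletableFuture<Void> s3 = CompletableFuture.runAsync(() -> finished.add(3), pool);
            // (1)<--(2)<--(6): stage 2 waits for 1, stage 6 waits for 2.
            CompletableFuture<Void> s2 = s1.thenRunAsync(() -> finished.add(2), pool);
            CompletableFuture<Void> s6 = s2.thenRunAsync(() -> finished.add(6), pool);
            // (2)<--(4)-->(3) and (2)<--(5)-->(3): stages 4 and 5 wait for both 2 and 3.
            CompletableFuture<Void> s4 =
                    CompletableFuture.allOf(s2, s3).thenRunAsync(() -> finished.add(4), pool);
            CompletableFuture<Void> s5 =
                    CompletableFuture.allOf(s2, s3).thenRunAsync(() -> finished.add(5), pool);
            // Wait for the three terminal stages (the writers).
            CompletableFuture.allOf(s4, s5, s6).join();
            return new ArrayList<>(finished);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling HandlerDependencySketch.runOnce() returns one valid completion order; the exact interleaving of independent stages can differ between runs.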

The Disruptor handlers are wired according to the dependencies above, so the different concerns execute in parallel. The setup uses a ring buffer of about 2^18 elements and a heap size of 5-20 GB, with the memory-mapped I/O (MMIO) configuration kept within the heap limits.
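
To make the ring-buffer size concrete: 2^18 is 262,144 slots, and the Disruptor requires a power-of-two size so that a sequence number can be mapped to a slot with a bit mask instead of a modulo. The following is a minimal sketch of that arithmetic only; the class and method names are hypothetical and not part of this project or of the Disruptor API.

```java
// Hypothetical sketch of power-of-two ring-buffer indexing.
class RingBufferIndexSketch {

    // 2^18 slots, the approximate size mentioned above.
    static final int SIZE = 1 << 18;

    // For a power-of-two size, (sequence & MASK) equals (sequence % SIZE)
    // for non-negative sequences.
    static final int MASK = SIZE - 1;

    /** Maps an ever-increasing sequence number to a slot in the buffer. */
    static int slotFor(long sequence) {
        return (int) (sequence & MASK);
    }
}
```

Once the sequence passes SIZE - 1 it wraps around to slot 0, which is why a producer must not overtake the slowest handler by more than one full buffer.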

Execution:

MAVEN_OPTS="-Xmx5G -Xms5G -server -d64 -XX:NewRatio=5" mvn exec:java -Dexec.mainClass=org.neo4j.batchimport.DisruptorTest

Current limitations and constraints:

only up to 2bn relationships (due to an int-based multi-map)
the maximum number of relationships per node, and of properties per node and per relationship, has to be known in advance
relationships have to be pre-sorted by min(start,end)
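
Two of these constraints can be checked or prepared before an import. The sketch below is one possible way to do that, assuming relationships are held in memory as {start, end} pairs of node ids (an assumption for illustration; the project may represent them differently). It sorts by min(start, end) and tests whether a relationship count still fits into a Java int, whose maximum of 2,147,483,647 is where the "2bn" figure comes from. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper for preparing relationship input.
class RelationshipInputSketch {

    /** Returns a copy of the {start, end} pairs sorted by min(start, end). */
    static long[][] sortByMinNode(long[][] rels) {
        long[][] sorted = rels.clone(); // shallow copy: the pairs themselves are shared
        // Arrays.sort on object arrays is stable, so pairs with the same key keep their input order.
        Arrays.sort(sorted, Comparator.comparingLong((long[] r) -> Math.min(r[0], r[1])));
        return sorted;
    }

    /** True if the count can still be indexed by an int-based structure. */
    static boolean fitsIntIndex(long relationshipCount) {
        return relationshipCount >= 0 && relationshipCount <= Integer.MAX_VALUE;
    }
}
```

For inputs too large to hold in memory, the same ordering would have to come from an external sort, which this sketch does not cover.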

Future improvements:

implement batch-importer CSV "API" on top of this
stripe writes across store-files (e.g. stripe the relationship-record file over 10 handlers, depending on the number of CPUs)
parallelize writing to the dynamic string and array stores too
move the relationship-record updates for backward pointers into a separate handler that is RandomAccessFile-based (or NIO.2) and just writes the 2 int values directly at the file position
add a csv analyser / sorter that
add support for index addition and parallelize it
good support for index-based lookup for relationship construction (kv-store, better in-memory structure, e.g. a collection of long[])
use id-compression internally to save memory in structs (write a CompressedLongArray)
reuse PropertyBlock, PropertyRecords, RelationshipRecords, NodeRecords; probably subclass them and override getId() etc., or copy the code from the Stores to work with interfaces
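
The RandomAccessFile idea in the list above could look roughly like the following. This is only a sketch of positional writes: the record size and pointer offset are placeholder values chosen for the example, not the actual store layout, and the class is hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

// Hypothetical sketch: patch two int values inside a fixed-size record file.
class BackPointerWriterSketch {

    static final int RECORD_SIZE = 33;    // placeholder record size in bytes
    static final int POINTER_OFFSET = 13; // placeholder offset of the two ints within a record

    /** Byte position of the two int values for the given relationship id. */
    static long positionOf(long relId) {
        return relId * RECORD_SIZE + POINTER_OFFSET;
    }

    /** Seeks to the record and overwrites the two ints in place. */
    static void writePointers(RandomAccessFile file, long relId, int first, int second)
            throws IOException {
        file.seek(positionOf(relId));
        file.writeInt(first);
        file.writeInt(second);
    }

    /** Reads the two ints back from the same position. */
    static int[] readPointers(RandomAccessFile file, long relId) throws IOException {
        file.seek(positionOf(relId));
        return new int[] { file.readInt(), file.readInt() };
    }

    /** Writes and re-reads the values in a temporary file, for demonstration. */
    static int[] roundTrip(long relId, int first, int second) {
        try {
            File tmp = File.createTempFile("relstore-sketch", ".db");
            tmp.deleteOnExit();
            try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
                file.setLength((relId + 1) * RECORD_SIZE); // make room up to this record
                writePointers(file, relId, first, second);
                return readPointers(file, relId);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because each update touches only eight bytes at a computed position, such a handler would not need to load or rewrite the whole record; whether that is faster than the existing update path would need measuring.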
