DBeam

A connector tool to extract data from SQL databases and import into GCS using Apache Beam.

This tool is runnable locally, or on any other backend supported by Apache Beam, e.g. Cloud Dataflow.

DEVELOPMENT STATUS: Alpha. Usable in production already.

Overview

DBeam is a Scio based single threaded pipeline that reads all the data from single SQL database table, and converts the data into Avro and stores it into appointed location, usually in GCS. Scio runs on Apache Beam.

DBeam requires the database credentials, the database table name to read, and the output location to store the extracted data into. DBeam first makes a single select into the target table with limit one to infer the table schema. After the schema is created the job will be launched which simply streams the table contents via JDBC into target location as Avro.

`dbeam` Java/Scala package features

Support both PostgreSQL and MySQL JDBC connectors
Supports CloudSQL managed databases
Currently output only in Avro format
Read password from a mounted password file (--passwordFile)
Can filter only records of the current day with the --partitionColumn parameter
Check and fail on too old partition dates. Snapshot dumps are not filtered by a given date/partition, when running for a too old partition, the job fails to avoid new data in old partitions. (can be disabled with --skipPartitionCheck)
It has dependency on Apache Beam SDK and scio-core. No internal dependencies to Spotify. Easy to open source.

`dbeam` arguments

--connectionUrl: the JDBC connection url to perform the dump
--table: the database table to query and perform the dump
--output: the path to store the output
--username: the database user name
--password: the database password
--passwordFile: a path to a local file containing the database password
--limit: limit the output number of rows, indefinite by default
--avroSchemaNamespace: the namespace of the generated avro schema, "dbeam_generated" by default
--partitionColumn: the name of a date/timestamp column to filter data based on current partition
--partition: the date of the current partition, parsed using ISODateTimeFormat.localDateOptionalTimeParser
--partitionPeriod: the period in which dbeam runs, used to filter based on current partition and also to check if executions are being run for a too old partition
--skipPartitionCheck: when partition column is not specified, by default fail when the partition parameter is not too old; use this avoid this behavior
--minPartitionPeriod: the minimum partition required for the job not to fail (when partition column is not specified), by default now() - 2*partitionPeriod

Building

Build with SBT package to get a jar that you can run with java -cp. Notice that this won't create a fat jar, which means that you need to include dependencies on the class path.

sbt package

You can also build the project with SBT pack, which will create a dbeam-pack/target/pack directory with all the dependencies, and also a shell script to run DBeam.

sbt pack

Now you can run the script directly from created dbeam-pack directory:

./dbeam-pack/target/pack/bin/jdbc-avro-job

TODO: We will be improving the packaging and releasing process shortly.

Examples

java -cp CLASS_PATH dbeam-core_2.12.jar com.spotify.dbeam.JdbcAvroJob \
  --output=gs://my-testing-bucket-name/ \
  --username=my_database_username \
  --password=secret \
  --connectionUrl=jdbc:postgresql://some.database.uri.example.org:5432/my_database \
  --table=my_table

For CloudSQL:

java -cp CLASS_PATH dbeam-core_2.12.jar com.spotify.dbeam.JdbcAvroJob \
  --output=gs://my-testing-bucket-name/ \
  --username=my_database_username \
  --password=secret \
  --connectionUrl=jdbc:postgresql://google/database?socketFactory=com.google.cloud.sql.postgres.SocketFactory&socketFactoryArg=project:region:cloudsql-instance \
  --table=my_table

Replace postgres with mysql if you are using MySQL.
More details can be found at CloudSQL JDBC SocketFactory

To validate a data extraction one can run:

java -cp CLASS_PATH dbeam-core_2.12.jar com.spotify.dbeam.JdbcAvroJob \
  --output=gs://my-testing-bucket-name/ \
  --username=my_database_username \
  --password=secret \
  --connectionUrl=jdbc:postgresql://some.database.uri.example.org:5432/my_database \
  --table=my_table \
  --limit=10 \
  --skipPartitionCheck

Requirements

DBeam is built on top of Scio and supports both Scala 2.12 and 2.11.

To include DBeam library in a SBT project add the following in build.sbt:

  libraryDependencies ++= Seq(
   "com.spotify" %% "dbeam-core" % dbeamVersion
  )

Development

Make sure you have sbt installed. For editor, IntelliJ IDEA with scala plugin is recommended.

To test and verify during development, run:

sbt clean scalastyle test:scalastyle coverage test coverageReport coverageAggregate

This project adheres to the Open Code of Conduct. By participating, you are expected to honor this code.

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

mattfinkel / dbeam Goto Github PK

dbeam's Introduction

DBeam

Overview

`dbeam` Java/Scala package features

`dbeam` arguments

Building

Examples

Requirements

Development

License

dbeam's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

mattfinkel / dbeam Goto Github PK

dbeam's Introduction

DBeam

Overview

dbeam Java/Scala package features

dbeam arguments

Building

Examples

Requirements

Development

License

dbeam's People

Contributors

Watchers

Recommend Projects

Recommend Topics

Recommend Org

`dbeam` Java/Scala package features

`dbeam` arguments