
artie-labs / transfer

Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift) in real-time.

Home Page: https://artie.so

License: Other

Languages: Go 99.82%, Makefile 0.12%, Dockerfile 0.05%
Topics: snowflake, cdc, change-data-capture, golang, bigquery, kafka, apache-kafka, data-integration, data-pipelines, database


Artie Transfer

⚡️ Blazing fast data replication between OLTP and OLAP databases ⚡️


Learn more »

Artie Transfer is a real-time data replication solution for databases and data warehouses/data lakes.

Typical ETL solutions rely on batched processes or schedulers (e.g., Airflow DAGs), which means the data in the downstream data warehouse is often hours to days old. This problem is exacerbated as data volumes grow, since batched processes take increasingly longer to run.

Artie leverages change data capture (CDC) and stream processing to perform data syncs in a more efficient way, which enables sub-minute latency.

Benefits of Artie Transfer:

  • Sub-minute data latency: always have access to live production data.
  • Ease of use: just set up a simple configuration file, and you're good to go!
  • Automatic table creation and schema detection: Artie infers schemas and automatically merges changes to downstream destinations.
  • Reliability: Artie has automatic retries and processing is idempotent.
  • Scalability: handle anywhere from 1GB to 100+ TB of data.
  • Monitoring: built-in error reporting along with rich telemetry statistics.

Take a look at this guide to get started!

Architecture

Prerequisites

As you can see from the architecture diagram above, Artie Transfer is a Kafka consumer and expects CDC messages to be in a particular format.

The optimal setup looks something like this:

  • Debezium or Artie Reader depending on the source
  • Kafka
    • One Kafka topic per table, so that you can tune the number of partitions based on throughput.
    • The partition key should be the table's primary key to avoid out-of-order writes at the row level (see the keying sketch below).
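
In practice Debezium (or Artie Reader) produces these messages for you; the sketch below, which uses the third-party segmentio/kafka-go library with hypothetical topic and key values, only illustrates why keying by primary key preserves row-level ordering: a hash balancer sends every update for a given key to the same partition.

package main

import (
	"context"

	"github.com/segmentio/kafka-go"
)

func main() {
	// One topic per table; messages are keyed by the row's primary key so
	// that all updates to a row land on the same partition, in order.
	w := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"),
		Topic:    "dbserver.public.customers",
		Balancer: &kafka.Hash{}, // route by message key
	}
	defer w.Close()

	err := w.WriteMessages(context.Background(), kafka.Message{
		Key:   []byte("42"), // the row's primary key
		Value: []byte(`{"id": 42, "email": "hi@example.com"}`),
	})
	if err != nil {
		panic(err)
	}
}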

Please see the supported section below for the full list of supported sources and destinations.

Examples

To run Artie Transfer's stack locally, please refer to the examples folder.

Getting started

Getting started guide

What is currently supported?

Transfer aims to provide coverage across all OLTP and OLAP databases. Currently, Transfer supports:

  • Message Queues:

    • Kafka (default)
    • Google Pub/Sub
  • Destinations:

    • Snowflake
    • BigQuery
    • Redshift
    • Microsoft SQL Server
    • S3
  • Sources:

    • MongoDB
    • DocumentDB
    • PostgreSQL
    • MySQL
    • DynamoDB

If the database you are using is not on the list, feel free to file a feature request.

Configuration File

Telemetry

Artie Transfer's telemetry guide

Tests

Transfer is written in Go and uses counterfeiter for mocks. To run the tests, run the following commands:

make generate
make test
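
As a hedged illustration of the counterfeiter workflow (the Store interface and package names here are hypothetical, not taken from Transfer's codebase), a go:generate directive on an interface produces a fake that tests can program:

package store

// `make generate` presumably wraps `go generate ./...`, which would produce
// a fake in store/storefakes for any interface annotated like this.

//go:generate go run github.com/maxbrunsfeld/counterfeiter/v6 . Store

// Store is a hypothetical interface, used only for illustration.
type Store interface {
	Get(key string) (string, error)
}

A test would then construct &storefakes.FakeStore{}, stub behavior with fake.GetReturns("value", nil), and assert on fake.GetCallCount().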

Release

Artie Transfer is released through GoReleaser, which we use to cross-compile binaries for each release and publish images to Docker Hub. If your operating system or architecture is not supported, please file a feature request!

License

Artie Transfer is licensed under ELv2. Please see the LICENSE file for additional information. If you have any licensing questions, please email [email protected].

transfer's People

Contributors

dependabot[bot], nathan-artie, tamasno1, tang8330


transfer's Issues

Large table with 400+ columns: while syncing from Postgres to Snowflake, logrus_error="can not add field \"topic\"" error="cannot unmarshall key, key: , err: key is nil"

2023-06-24 22:13:24 time="2023-06-24T14:13:24Z" level=warning msg="skipping message..." logrus_error="can not add field "topic"" error="cannot unmarshall key, key: , err: key is nil" key= offset=17 value="{"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int32","optional":false,"default":0,"field":"id"},{"type":"string","optional":false,"field":"first_name"},{"type":"string","optional":false,"field":"last_name"},{"type":"string","optional":false,"field":"email"},{"type":"string","optional":true,"field":"column_added_0001"},{"type":"string","optional":true,"field":"column_added_0002"},{"type":"string","optional":true,"field":"column_added_0003"},{"type":"string","optional":true,"field":"column_added_0004\

Supporting S3

  • Ability to specify append-only
  • Ability to specify parquet

BigQuery temp tables are not dropped after merge

v. 2.0.58

Transfer generates BigQuery temp tables with suffixes that include capital letters, e.g. my_table___artie_dM9YWyeakM.

However, those suffixes are downcased in the drop statements, e.g. my_table___artie_dm9ywyeakm. As a result, the temporary table is never dropped, since the lowercased name does not exist. Lots of temp tables accrete in the dataset until they expire several hours later.
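
A minimal sketch of the mismatch (the names come from the report above; the code is illustrative, not Transfer's actual implementation):

package main

import (
	"fmt"
	"strings"
)

func main() {
	tempTable := "my_table___artie_dM9YWyeakM" // created with a mixed-case suffix

	// Buggy: downcasing yields a name that was never created, so the
	// DROP silently targets a non-existent table.
	fmt.Printf("DROP TABLE IF EXISTS %s\n", strings.ToLower(tempTable))

	// Expected: drop the exact name that was created.
	fmt.Printf("DROP TABLE IF EXISTS %s\n", tempTable)
}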

How to configure Debezium application.properties

Hi, I have a large number of tables to sync from Postgres to Snowflake. I want to configure a rule that maps n source tables to 1 topic to n destination tables, following this guideline: https://docs.artie.so/tutorials/debezium-topic-reroute-smt

  1. In a Docker Compose setup, which config file should hold the parameters that re-route them all to the same topic?
  2. How do I configure the reroute mapping rules, such as:

n source tables to 1 topic to n destination tables

n source tables to 1 topic to 1 destination table

1 source table to 1 topic to n destination tables

BR

Kafka Topics autodiscovery

So users can set it and forget it.

The fully qualified name for a table within a DB looks like this: schema.table.

Allow the following 2 options to be added:

  1. RegEx (schema.*)
  2. Explicit "ignore" of tables.

Autodiscovery will pick up a table if it matches (1) and is not listed in (2). This maintains parity with Debezium. Artie and Debezium only apply this search to non-system tables.
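
A minimal sketch of that rule (the function and option names are hypothetical):

package main

import (
	"fmt"
	"regexp"
)

// shouldAutodiscover reports whether a fully qualified table name
// (schema.table) matches the include regex (option 1) and is not
// explicitly ignored (option 2).
func shouldAutodiscover(fqName string, include *regexp.Regexp, ignore map[string]bool) bool {
	return include.MatchString(fqName) && !ignore[fqName]
}

func main() {
	include := regexp.MustCompile(`^public\..*`)
	ignore := map[string]bool{"public.migrations": true}

	fmt.Println(shouldAutodiscover("public.orders", include, ignore))     // true
	fmt.Println(shouldAutodiscover("public.migrations", include, ignore)) // false
}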

Refactor and simplify MERGE

Right now, MERGE code is all independent.

There's quite a bit of bloat from copying similar lines of code for each destination.
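
One possible shape for the shared abstraction (all names here are hypothetical, sketched only to illustrate the refactor, not Transfer's actual code):

package merge

import (
	"fmt"
	"strings"
)

// Dialect captures the destination-specific pieces of a MERGE statement,
// so the statement itself can be assembled once, generically.
type Dialect interface {
	QuoteIdentifier(name string) string
}

// BuildMerge assembles a MERGE from dialect-specific parts; the same
// builder would then be shared by Snowflake, BigQuery, Redshift, etc.
func BuildMerge(d Dialect, target, staging, pk string, cols []string) string {
	sets := make([]string, 0, len(cols))
	for _, c := range cols {
		q := d.QuoteIdentifier(c)
		sets = append(sets, fmt.Sprintf("t.%s = s.%s", q, q))
	}
	return fmt.Sprintf(
		"MERGE INTO %s t USING %s s ON t.%s = s.%s WHEN MATCHED THEN UPDATE SET %s",
		d.QuoteIdentifier(target), d.QuoteIdentifier(staging),
		d.QuoteIdentifier(pk), d.QuoteIdentifier(pk),
		strings.Join(sets, ", "),
	)
}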

APM support

As per the title: support spans, traces, and custom tags carried through context.
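
The issue doesn't name a library; as one hedged possibility, OpenTelemetry's Go API carries spans and custom tags through context:

package telemetry

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func processMessage(ctx context.Context, table string) {
	// Start a span; it rides along in ctx so downstream calls can nest
	// child spans under it.
	ctx, span := otel.Tracer("transfer").Start(ctx, "process-message")
	defer span.End()

	// Custom tags (attributes) attached to the span.
	span.SetAttributes(attribute.String("table", table))

	_ = ctx // pass ctx into downstream work to propagate the trace
}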
