
artie-labs / transfer

Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift) in real-time.

Home Page: https://artie.so

License: Other

Languages: Go 99.82%, Makefile 0.12%, Dockerfile 0.05%
Topics: snowflake, cdc, change-data-capture, golang, bigquery, kafka, apache-kafka, data-integration, data-pipelines, database


Artie Transfer

⚡️ Blazing fast data replication between OLTP and OLAP databases ⚡️


Learn more »

Artie Transfer is a real-time data replication solution for databases and data warehouses/data lakes.

Typical ETL solutions rely on batched processes or schedulers (e.g., Airflow DAGs), which means the data in the downstream data warehouse is often hours to days old. This problem is exacerbated as data volumes grow, since batched processes take increasingly longer to run.

Artie leverages change data capture (CDC) and stream processing to perform data syncs in a more efficient way, which enables sub-minute latency.

Benefits of Artie Transfer:

  • Sub-minute data latency: always have access to live production data.
  • Ease of use: just set up a simple configuration file, and you're good to go!
  • Automatic table creation and schema detection: Artie infers schemas and automatically merges changes to downstream destinations.
  • Reliability: Artie has automatic retries and processing is idempotent.
  • Scalability: handle anywhere from 1GB to 100+ TB of data.
  • Monitoring: built-in error reporting along with rich telemetry statistics.

Take a look at this guide to get started!

Architecture

Prerequisites

As you can see from the architecture diagram above, Artie Transfer is a Kafka consumer and expects CDC messages to be in a particular format.

The optimal setup looks something like this:

  • Debezium or Artie Reader depending on the source
  • Kafka
    • One Kafka topic per table, so that you can tune the number of partitions based on throughput.
    • The partition key should be the table's primary key to avoid out-of-order writes at the row level (see the keying sketch below).
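
In practice Debezium (or Artie Reader) produces these messages for you; the sketch below, which uses the third-party segmentio/kafka-go library with hypothetical topic and key values, only illustrates why keying by primary key preserves row-level ordering: a hash balancer sends every update for a given key to the same partition.

package main

import (
	"context"

	"github.com/segmentio/kafka-go"
)

func main() {
	// One topic per table; messages are keyed by the row's primary key so
	// that all updates to a row land on the same partition, in order.
	w := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"),
		Topic:    "dbserver.public.customers",
		Balancer: &kafka.Hash{}, // route by message key
	}
	defer w.Close()

	err := w.WriteMessages(context.Background(), kafka.Message{
		Key:   []byte("42"), // the row's primary key
		Value: []byte(`{"id": 42, "email": "hi@example.com"}`),
	})
	if err != nil {
		panic(err)
	}
}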

Please see the supported section below for the full list of supported sources and destinations.

Examples

To run Artie Transfer's stack locally, please refer to the examples folder.

Getting started

Getting started guide

What is currently supported?

Transfer aims to provide coverage across all OLTP and OLAP databases. Currently, Transfer supports:

  • Message Queues:

    • Kafka (default)
    • Google Pub/Sub
  • Destinations:

    • Snowflake
    • BigQuery
    • Redshift
    • Microsoft SQL Server
    • S3
  • Sources:

    • MongoDB
    • DocumentDB
    • PostgreSQL
    • MySQL
    • DynamoDB

If the database you are using is not on the list, feel free to file a feature request.

Configuration File

Telemetry

Artie Transfer's telemetry guide

Tests

Transfer is written in Go and uses counterfeiter for mocks. To run the tests, run the following commands:

make generate
make test
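
As a hedged illustration of the counterfeiter workflow (the Store interface and package names here are hypothetical, not taken from Transfer's codebase), a go:generate directive on an interface produces a fake that tests can program:

package store

// `make generate` presumably wraps `go generate ./...`, which would produce
// a fake in store/storefakes for any interface annotated like this.

//go:generate go run github.com/maxbrunsfeld/counterfeiter/v6 . Store

// Store is a hypothetical interface, used only for illustration.
type Store interface {
	Get(key string) (string, error)
}

A test would then construct &storefakes.FakeStore{}, stub behavior with fake.GetReturns("value", nil), and assert on fake.GetCallCount().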

Release

Artie Transfer is released through GoReleaser, which we use to cross-compile binaries for each release and publish images to Docker Hub. If your operating system or architecture is not supported, please file a feature request!

License

Artie Transfer is licensed under ELv2. Please see the LICENSE file for additional information. If you have any licensing questions, please email [email protected].

transfer's People

Contributors

dependabot[bot], nathan-artie, tamasno1, tang8330


transfer's Issues

Large table with 400+ columns: while syncing from Postgres to Snowflake, logrus_error="can not add field \"topic\"" error="cannot unmarshall key, key: , err: key is nil"

2023-06-24 22:13:24 time="2023-06-24T14:13:24Z" level=warning msg="skipping message..." logrus_error="can not add field "topic"" error="cannot unmarshall key, key: , err: key is nil" key= offset=17 value="{"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int32","optional":false,"default":0,"field":"id"},{"type":"string","optional":false,"field":"first_name"},{"type":"string","optional":false,"field":"last_name"},{"type":"string","optional":false,"field":"email"},{"type":"string","optional":true,"field":"column_added_0001"},{"type":"string","optional":true,"field":"column_added_0002"},{"type":"string","optional":true,"field":"column_added_0003"},{"type":"string","optional":true,"field":"column_added_0004\

Supporting S3

  • Ability to specify append-only
  • Ability to specify parquet

BigQuery temp tables are not dropped after merge

v. 2.0.58

Transfer generates BigQuery temp tables with suffixes that include capital letters, e.g. my_table___artie_dM9YWyeakM.

However, those suffixes are downcased in the drop statements, e.g. my_table___artie_dm9ywyeakm. As a result, the temporary table is never dropped, since the lowercased name does not exist. Lots of temp tables accrete in the dataset until they expire several hours later.
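
A minimal sketch of the mismatch (the names come from the report above; the code is illustrative, not Transfer's actual implementation):

package main

import (
	"fmt"
	"strings"
)

func main() {
	tempTable := "my_table___artie_dM9YWyeakM" // created with a mixed-case suffix

	// Buggy: downcasing yields a name that was never created, so the
	// DROP silently targets a non-existent table.
	fmt.Printf("DROP TABLE IF EXISTS %s\n", strings.ToLower(tempTable))

	// Expected: drop the exact name that was created.
	fmt.Printf("DROP TABLE IF EXISTS %s\n", tempTable)
}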

How to configure Debezium application.properties

Hi, I have a large number of tables to sync from Postgres to Snowflake. I want to configure a rule that maps n source tables to 1 topic to n destination tables, following this guideline: https://docs.artie.so/tutorials/debezium-topic-reroute-smt

  1. In a Docker Compose setup, which config file should hold the parameters that re-route them all to the same topic?
  2. How do I configure the reroute mapping rules, such as:

n source tables to 1 topic to n destination tables

n source tables to 1 topic to 1 destination table

1 source table to 1 topic to n destination tables

BR

Kafka Topics autodiscovery

So users can set it and forget it.

The fully qualified name for a table within a DB looks like this: schema.table.

Allow the following 2 options to be added:

  1. RegEx (schema.*)
  2. Explicit "ignore" of tables.

Autodiscovery will pick up a table if it matches (1) and is not listed in (2). This maintains parity with Debezium. Artie and Debezium only apply this search to non-system tables.
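
A minimal sketch of that rule (the function and option names are hypothetical):

package main

import (
	"fmt"
	"regexp"
)

// shouldAutodiscover reports whether a fully qualified table name
// (schema.table) matches the include regex (option 1) and is not
// explicitly ignored (option 2).
func shouldAutodiscover(fqName string, include *regexp.Regexp, ignore map[string]bool) bool {
	return include.MatchString(fqName) && !ignore[fqName]
}

func main() {
	include := regexp.MustCompile(`^public\..*`)
	ignore := map[string]bool{"public.migrations": true}

	fmt.Println(shouldAutodiscover("public.orders", include, ignore))     // true
	fmt.Println(shouldAutodiscover("public.migrations", include, ignore)) // false
}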

Refactor and simplify MERGE

Right now, MERGE code is all independent.

There's quite a bit of bloat from copying similar lines of code for each destination.
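
One possible shape for the shared abstraction (all names here are hypothetical, sketched only to illustrate the refactor, not Transfer's actual code):

package merge

import (
	"fmt"
	"strings"
)

// Dialect captures the destination-specific pieces of a MERGE statement,
// so the statement itself can be assembled once, generically.
type Dialect interface {
	QuoteIdentifier(name string) string
}

// BuildMerge assembles a MERGE from dialect-specific parts; the same
// builder would then be shared by Snowflake, BigQuery, Redshift, etc.
func BuildMerge(d Dialect, target, staging, pk string, cols []string) string {
	sets := make([]string, 0, len(cols))
	for _, c := range cols {
		q := d.QuoteIdentifier(c)
		sets = append(sets, fmt.Sprintf("t.%s = s.%s", q, q))
	}
	return fmt.Sprintf(
		"MERGE INTO %s t USING %s s ON t.%s = s.%s WHEN MATCHED THEN UPDATE SET %s",
		d.QuoteIdentifier(target), d.QuoteIdentifier(staging),
		d.QuoteIdentifier(pk), d.QuoteIdentifier(pk),
		strings.Join(sets, ", "),
	)
}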

APM support

As per the title: support spans, traces, and custom tags carried through context.
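
The issue doesn't name a library; as one hedged possibility, OpenTelemetry's Go API carries spans and custom tags through context:

package telemetry

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func processMessage(ctx context.Context, table string) {
	// Start a span; it rides along in ctx so downstream calls can nest
	// child spans under it.
	ctx, span := otel.Tracer("transfer").Start(ctx, "process-message")
	defer span.End()

	// Custom tags (attributes) attached to the span.
	span.SetAttributes(attribute.String("table", table))

	_ = ctx // pass ctx into downstream work to propagate the trace
}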
