airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Home Page: https://airbyte.com

License: Other

Shell 0.34% Dockerfile 0.90% Java 16.78% HTML 0.01% CSS 0.04% JavaScript 0.27% Python 68.41% Handlebars 0.13% Makefile 0.01% PLpgSQL 0.11% TSQL 0.02% Jinja 0.10% PLSQL 0.04% Kotlin 12.83%
Topics: data-pipeline, data-analysis, data-engineering, java, python, etl, change-data-capture, data-collection, data-integration

airbyte's Introduction

Airbyte

Data integration platform for ELT pipelines from APIs, databases & files to databases, warehouses & lakes


We believe that only an open-source solution to data movement can cover the long tail of data sources while empowering data engineers to customize existing connectors. Our ultimate vision is to help you move data from any source to any destination. Airbyte already provides the largest catalog of 300+ connectors for APIs, databases, data warehouses, and data lakes.

[Screenshot: Airbyte OSS Connections UI, taken from Airbyte Cloud]

Getting Started

Try it out yourself with our demo app, visit our full documentation and learn more about recent announcements. See our registry for a full list of connectors already available in Airbyte or Airbyte Cloud.

Join the Airbyte Community

The Airbyte community can be found in the Airbyte Community Slack, where you can ask questions and voice ideas. You can also ask for help in our Airbyte Forum, or join our Office Hours. Airbyte's roadmap is publicly viewable on GitHub.

For videos and blogs on data engineering and building your data stack, check out Airbyte's Content Hub and YouTube channel, and sign up for our newsletter.

Dedicated support with direct access to our team is also available for Open Source users. If you are interested, please fill out this form.

Contributing

If you've found a problem with Airbyte, please open a GitHub issue. To contribute to Airbyte and see our Code of Conduct, please see the contributing guide. We have a list of good first issues that contain bugs that have a relatively limited scope. This is a great place to get started, gain experience, and get familiar with our contribution process.

Security

Airbyte takes security issues very seriously. Please do not file GitHub issues or post on our public forum for security vulnerabilities. Email [email protected] if you believe you have uncovered a vulnerability. In the message, try to provide a description of the issue and ideally a way of reproducing it. The security team will get back to you as soon as possible.

Airbyte Enterprise also offers additional security features, among other capabilities, on top of Airbyte Open Source.

License

See the LICENSE file for licensing information, and our FAQ for any questions you may have on that topic.

Thank You

Airbyte would not be possible without the support and assistance of other open-source tools and companies! Visit our thank you page to learn more about how we build Airbyte.

airbyte's People

Contributors

alafanechere, arsenlosenko, artem1205, avaidyanatha, bazarnov, benmoriceau, bnchrch, cgardens, christopheduong, darynaishchenko, davinchia, davydov-d, edgao, erohmensing, evantahler, girarda, grubberr, jamakase, johnlafleur, jrhizor, lazebnyi, lmossman, marcosmarxm, maxi297, michel-tricot, octavia-squidington-iii, sherifnada, subodh1810, timroes, tuliren


airbyte's Issues

Jetty to Java JSON parsing

I need to figure out whether I should actually be calling toString here. I have to double-check how Jetty parses the `any` type in an HTTP request. I don't know whether it gets parsed as a string or a JSON object, and what that actually gets cast to in Java.
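The worker code in question is Java, but the underlying question is language-agnostic: a field typed `any` may deserialize as either a plain string or a structured object, and naively stringifying the object form can produce non-JSON output. A minimal Python sketch of the distinction (function name hypothetical):

```python
import json

def normalize_any(value):
    """Normalize a JSON 'any'-typed field to a string.

    Calling str() on a parsed object yields Python repr (single quotes),
    not valid JSON, so non-string values are re-serialized with json.dumps.
    """
    if isinstance(value, str):
        return value
    return json.dumps(value)

# A field typed 'any' may arrive as a plain string or as a nested object.
print(normalize_any("hello"))       # hello
print(normalize_any({"a": 1}))      # {"a": 1}
```

The same choice exists in Java: `JsonNode.asText()` versus re-serializing the node, depending on which form arrived.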

Feedback when adding a source or destination

When adding a source or destination, from the Onboarding or New source, we should test the connections before moving on.
Here are the screens showcasing the feedback:

@cgardens

https://airbytehq.slack.com/archives/C02RQFX0X89/p1640022762001700?thread_ts=1640022762.001700&cid=C02RQFX0X89

Improve database & config migration lifecycle

Tell us about the problem you're trying to solve

Today we don't have a good process for applying db migrations or config migrations. As we iterate on the project, that will cause issues for users who want to upgrade to the next version.

Describe the solution you’d like

Not yet defined, but I am thinking of having a version-specific container that runs before any others and applies migrations to the db if necessary.
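As a rough illustration of that idea, here is a minimal Python/sqlite3 sketch of a versioned migration runner; the table names and DDL are hypothetical, not Airbyte's actual schema:

```python
import sqlite3

# Hypothetical migrations: version -> DDL that brings the db to that version.
MIGRATIONS = {
    1: "CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT)",
    2: "ALTER TABLE config ADD COLUMN updated_at TEXT",
}

def migrate(conn):
    """Apply any migrations newer than the recorded schema version."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    applied = []
    for version in sorted(MIGRATIONS):
        if version > current:
            conn.execute(MIGRATIONS[version])
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
            applied.append(version)
    conn.commit()
    return applied
```

A container running this before the other services start would make upgrades idempotent: a second run applies nothing.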

Track events not well logged on docs.dataline.io

Expected Behavior

When visiting a documentation page, in Segment's debugger (https://app.segment.com/dataline/sources/dataline-website/debugger), I should see 2 events:

  • TRACK: "Viewed [Page Name] Page"
  • PAGE: /deployment/deploying-dataline/with-docker (for instance)

Current Behavior

We currently only see the PAGE event. The issue is that we won't see these page views properly in the analytics tools (Amplitude, etc.).

Steps to Reproduce

  1. Go to any docs page
  2. Check debugger on Segment, and you'll only see the PAGE event, and not the TRACK one.

Severity of the bug for you

Very low / Low / Medium / High / Critical
Medium

Additional context

We had a similar issue in the past. The solution is to pass the name of the page in the Segment event so that "Track Named Pages" works on Segment. As an example, this would be a page call with a name of Pricing passed in:
analytics.page('Pricing');

Always produce the most up to date images to help for local development

Tell us about the problem you're trying to solve

I would like to have all the images available on Docker Hub after I push to master. That will help me set up an up-to-date environment when developing locally.

Describe the solution you’d like

GitHub pushes to Docker Hub as part of the master build.

make integrations self-contained

  • integrations need to be responsible for defining the config spec
  • integration images need an interface to produce the config spec
  • integrations should fail if the provided config is invalid
  • integration tests should fail if validation is not being performed properly or if it can't access the spec
  • integrations should be able to optionally convert their config to a different format (currently only required for BQ, but could be used in the future for postgres, others)
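Taken together, the bullets above describe an interface each integration image would expose. A minimal Python sketch of that contract; the class, method names, and spec shape are all hypothetical, not the actual Airbyte protocol:

```python
import json

class Integration:
    """Sketch of a self-contained integration (hypothetical interface)."""

    SPEC = {"required": ["api_key"], "properties": {"api_key": {"type": "string"}}}

    def spec(self):
        # The image exposes its own config spec instead of the platform owning it.
        return self.SPEC

    def check_config(self, config):
        # Fail fast if a required field is missing, so invalid configs never run.
        missing = [k for k in self.SPEC["required"] if k not in config]
        if missing:
            raise ValueError(f"invalid config, missing: {missing}")
        return True

    def convert_config(self, config):
        # Optional hook to rewrite config into a destination-specific format
        # (e.g. BigQuery wants a credentials file rather than inline fields).
        return json.dumps(config)
```

The test harness then only needs the image, not platform internals: ask for the spec, validate a config against it, and assert that invalid configs fail.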

Onboarding - set up a source

Once users have set up their preferences, they should see the following screen:
https://www.figma.com/file/AODY4oi15iI5l0fbxHvobt/Daxtarity?node-id=5782%3A42748

Setting up the name for the source and selecting a connector for the source should be compulsory.
So the "Set up source" button should be disabled until all fields are filled.

Once the user has selected a connector, 2 things should happen:

  • new fields should be displayed to set up the source (defined by the API)
  • a link towards our documentation for this connector should be displayed. Clicking on the link should display the docs page in a new tab.

@cgardens

Source detail page - Status tab with list of logs

When clicking on a cell in the list of sources in "Sources", the status detail page should be displayed:
https://www.figma.com/file/AODY4oi15iI5l0fbxHvobt/Daxtarity?node-id=5782%3A36940

A few features:

The v1.0 doesn't include a full refresh.

@cgardens

API doesn't fail when incorrect params are passed.

Expected Behavior

  • the API should throw an exception when it receives a request body for a route that does not match what is defined in swagger.

Current Behavior

  • the API just tries to process the incomplete body.

Steps to Reproduce

  1. Run the following request; note that the initialSetupComplete field is missing:

curl -H "Content-Type: application/json" -X POST localhost:8001/api/v1/workspaces/update \
  -d '{ "anonymousDataCollection": false, "email": "[email protected]", "news": false, "securityUpdates": true, "workspaceId": "5ae6b09b-fdec-41af-aaf7-7d94cfc33ef6" }'

Fix CPD configuration in Gradle

Expected Behavior

CPD shouldn't throw warnings.

Current Behavior

CPD throws warnings:

WARNING: Due to the absence of 'LifecycleBasePlugin' on project ':dataline-analytics' the task ':cpdCheck' could not be added to task graph. Therefore CPD will not be executed. To prevent this, manually add a task dependency of ':cpdCheck' to a 'check' task of a subproject.
1) Directly to project ':dataline-integrations:singer:bigquery:destination':
    check.dependsOn(':cpdCheck')
2) Indirectly, e.g. via project ':dataline-analytics':
    project(':dataline-integrations:singer:bigquery:destination') {
        plugins.withType(LifecycleBasePlugin) { // <- just required if 'java' plugin is applied within subproject
            check.dependsOn(cpdCheck)
        }
    }

Steps to Reproduce

  1. Just run build within a subproject: ./gradlew :dataline-server:build

Severity of the bug for you

Very low

Testing please ignore

Expected Behavior

Tell us what should happen.

Current Behavior

Tell us what happens instead of the expected behavior.

Steps to Reproduce

Severity of the bug for you

Very low / Low / Medium / High / Critical

Additional context

Environment, version, integration...

Inject job id into the logs of workers

All logs output by a worker should include job id for debugging purposes. This should be controlled outside of the integration implementations that an OSS contributor makes.
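The workers are Java, but the idea translates to any logging framework that supports contextual fields (Log4j's MDC, for instance). A Python sketch using a `logging.Filter` to stamp every record with a job id, outside the integration code:

```python
import logging

class JobIdFilter(logging.Filter):
    """Attach the current job id to every record, outside connector code."""

    def __init__(self, job_id):
        super().__init__()
        self.job_id = job_id

    def filter(self, record):
        record.job_id = self.job_id
        return True  # never drop records, just annotate them

logger = logging.getLogger("worker")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("job=%(job_id)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(JobIdFilter("42"))
logger.setLevel(logging.INFO)
logger.info("sync started")  # stderr: job=42 INFO sync started
```

Because the filter lives on the worker's logger, an OSS contributor's integration code never has to know the job id exists.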

Onboarding - set up your destination

Once user set up their source in Onboarding, they should be invited to set up their destination:
https://www.figma.com/file/AODY4oi15iI5l0fbxHvobt/Daxtarity?node-id=5782%3A44298

In v1.0, we only support one destination. This might change in the future.

It should be the same behavior as for sources. But there should be 2 differences:

  • user should be able to click on (1) create a source in the tabs of the header to go back to previous step (note that clicking on "(3) set up connection" shouldn't do anything until you reached that point)
  • user should be able to click on "Source name" in the summary to go back to previous step as well

Summary of expected behavior (similar to source):
Setting up the name for the destination and selecting a connector for the destination should be compulsory. So the "Set up destination" button should be disabled until all fields are filled.

Once the user has selected a connector, 2 things should happen:

  • new fields should be displayed to set up the destination (defined by the API)
  • a link towards our documentation for this connector should be displayed. Clicking on the link should display the docs page in a new tab.

@cgardens

Redshift as a destination

Tell us about the problem you're trying to solve

I'd like to be able to push data to Redshift. 

Describe the solution you’d like

The ability to choose Redshift as a destination in the UI. 

manual sync failing in API

Expected Behavior

  • setting a null schedule should make a connection manual sync only.

Current Behavior

  • sending null returns a json error saying schedule is required

Other

  • Glanced at the config and it looks right.

move JsonSchemaValidator into its own module

Tell us about the problem you're trying to solve

This class depends on no commons-like libraries. Right now it is in commons so that it can be shared with the worker, but that is not ideal because it injects these dependencies into all modules that use commons. Ideally we'd like this class split out into its own module.

Allow workers to split their logs

Tell us about the problem you're trying to solve

One big log file makes it hard to understand what went wrong with a worker.

Describe the solution you’d like

I would like all the logs to be scoped by job id
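One way to scope logs by job id is to give each job its own log file. A minimal Python sketch (the file-naming scheme is hypothetical):

```python
import logging
import os
import tempfile

def job_logger(job_id, log_dir):
    """Return a logger that writes to its own file, scoped by job id."""
    logger = logging.getLogger(f"job-{job_id}")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        path = os.path.join(log_dir, f"job-{job_id}.log")
        logger.addHandler(logging.FileHandler(path))
    return logger

# Each job writes to its own file instead of one shared log.
log_dir = tempfile.mkdtemp()
job_logger("7", log_dir).info("sync started")
```

Debugging a failed sync then means reading `job-7.log` alone rather than grepping one interleaved file.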

Onboarding - set up the connection with frequency

In onboarding, after setting up the destination, the user should be asked to select a frequency to the syncing:
https://www.figma.com/file/AODY4oi15iI5l0fbxHvobt/Daxtarity?node-id=5782%3A45365

Here are the values of frequency we offer in v1.0:
manual, 5 min, 15 min, 30 min, 1 hour, 2 hours, 3 hours, 6 hours, 8 hours, 12 hours, 24 hours

User should also be able to go back to previous steps by clicking in tabs of header or source/destination in summary cell

@cgardens

Speed up image building

Tell us about the problem you're trying to solve

Building all application images takes ~10 minutes and is not incremental in the dev environment.
It needs to be faster.

Simplify Schedule persistence

Tell us about the problem you're trying to solve

StandardSchedule's lifecycle is the same as StandardSync's, but the code treats them as different entities.

Describe the solution you’d like

StandardSchedule should belong to the StandardSync

Sources section - List of sources

"Sources" should be the default page on which users land after the onboarding or in a new session.
Here is how the list should look: https://www.figma.com/file/AODY4oi15iI5l0fbxHvobt/Daxtarity?node-id=5856%3A0

In v1.0, user shouldn't be able to sort the list yet (this might change in the future).
The list should display some indicators:

  • status of the last sync with icons
  • source name
  • source connector
  • last sync date
  • frequency of the syncs
  • "Enabled" status

User should have the ability to disable / enable a source from this list.
When a source's last sync has failed, the status should display as failed, and the background should be red so it stands out.

In the case of a source being synced manually, the ENABLED toggle should be replaced by a button. When clicking this button, feedback should be displayed: https://www.figma.com/file/AODY4oi15iI5l0fbxHvobt/Daxtarity?node-id=5906%3A810

@cgardens

Support incremental syncs

Tell us about the problem you're trying to solve

Dataline currently only supports dumping the entire data source into the destination. I want to only sync the data that has changed since the last sync, not dump the entire dataset every time.

Describe the solution you’d like

When configuring a connection from the UI, I'd like to be able to configure it to be incremental instead of full refresh.
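A minimal sketch of what incremental sync means at the record level, assuming a sortable cursor field such as `updated_at` (all field names here are hypothetical):

```python
def incremental_sync(records, state):
    """Emit only records newer than the saved cursor, then advance the cursor.

    'state' persists between syncs; an empty state behaves like a full refresh.
    """
    cursor = state.get("updated_at", "")
    new_records = [r for r in records if r["updated_at"] > cursor]
    if new_records:
        state["updated_at"] = max(r["updated_at"] for r in new_records)
    return new_records, state
```

The first run emits everything; subsequent runs emit only what changed since the stored cursor, instead of dumping the entire dataset every time.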

1st screen - Set up your preferences

The first time the user launches the app, they should see this screen:
https://www.figma.com/file/AODY4oi15iI5l0fbxHvobt/Daxtarity?node-id=5894%3A3685

In this screen, we ask for their email address and some other options (data anonymization, newsletter, security updates) that are inactive until the user enters an email address.
This step should be optional for the user, as they can click on "Continue" without entering their email address.
If they enter their email address, the default settings should be:

  • Anonymize my data: off
  • Subscribe to newsletter: off
  • Get security updates: on

Once the user clicks on "Continue", they should be redirected towards the onboarding.

@cgardens

Validate Inputs for connection/create

Expected Behavior

  • The API should throw an exception if either the sourceImplementationId or the destinationImplementationId provided to connection/create do not exist. Currently, it will save the configuration without complaint.
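A sketch of the missing check, in Python for brevity (the real endpoint is Java, and every name here is hypothetical): the create handler should verify both ids exist before persisting anything.

```python
def create_connection(source_id, destination_id, sources, destinations):
    """Reject a connection whose endpoints don't exist, instead of saving it."""
    if source_id not in sources:
        raise KeyError(f"unknown sourceImplementationId: {source_id}")
    if destination_id not in destinations:
        raise KeyError(f"unknown destinationImplementationId: {destination_id}")
    # Only reached when both ids resolve; safe to persist.
    return {"source": source_id, "destination": destination_id}
```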


Remove ConfigFetchers

Tell us about the problem you're trying to solve

There is duplicated code between server and scheduler

Describe the solution you’d like

Find a way to remove these files

Pass only needed information in `StandardTapConfig`

Tell us about the problem you're trying to solve

StandardTapConfig currently contains a lot of "wide" data types which contain a lot more data than is needed to pull data from a Singer tap. This makes writing tests more tedious than is necessary and makes it harder to understand which inputs impact behavior.

Describe the solution you’d like

Create new types to capture exactly the information needed for taps, and remove any extra cruft.
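The idea, sketched in Python (StandardTapConfig itself is a Java type; the fields below are hypothetical): define a narrow type holding only the tap's inputs and project the wide config down to it.

```python
from dataclasses import dataclass

# Wide config: everything the platform knows about a sync.
wide = {
    "sourceId": "s1", "sourceConfig": {"host": "db"},
    "destinationId": "d1", "schedule": "5m", "workspaceId": "w1",
}

@dataclass(frozen=True)
class TapConfig:
    """Only what a tap actually needs (hypothetical narrowed type)."""
    source_config: dict

def to_tap_config(wide_config):
    # Project the wide config down; tests now construct one small field.
    return TapConfig(source_config=wide_config["sourceConfig"])
```

Tests for tap behavior then only have to build a `TapConfig`, and it is obvious which inputs can affect the tap at all.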

Clean up API exceptions handling

Tell us about the problem you're trying to solve

The exception handling in the API needs to be improved to ensure we send the proper HTTP responses.
eg:

  • validation can throw configNotFound

Decrease Logging

Expected Behavior

  • We are okay with logging a lot, especially at the beginning, but logs should be useful and human-readable. Right now we are logging a lot of useless stuff, which makes the useful stuff hard to consume.

Current Behavior

  • JsonReferenceProcessor logs a ton of useless stuff; we should filter it.
dataline-scheduler | 2020-09-10 00:49:15 TRACE JsonReferenceProcessor:191 - key=type
dataline-scheduler | 2020-09-10 00:49:15 TRACE JsonReferenceProcessor:145 - processed: []
  • this log line happens every second.
dataline-scheduler | 2020-09-10 00:51:46 INFO  JobSubmitter:54 - Running job-submitter...
  • Type Validator prints all the time (even when nothing is invalid)
dataline-scheduler | 2020-09-10 00:52:23 DEBUG TypeValidator:136 - validate( {"connectionId":"60413b71-0fdb-4b7a-959f-7f316e2adfb0","manual":true}, {"connectionId":"60413b71-0fdb-4b7a-959f-7f316e2adfb0","manual":true}, $)
dataline-scheduler | 2020-09-10 00:52:23 DEBUG RequiredValidator:136 - validate( {"connectionId":"60413b71-0fdb-4b7a-959f-7f316e2adfb0","manual":true}, {"connectionId":"60413b71-0fdb-4b7a-959f-7f316e2adfb0","manual":true}, $)
dataline-scheduler | 2020-09-10 00:52:23 DEBUG AdditionalPropertiesValidator:136 - validate( {"connectionId":"60413b71-0fdb-4b7a-959f-7f316e2adfb0","manual":true}, {"connectionId":"60413b71-0fdb-4b7a-959f-7f316e2adfb0","manual":true}, $)
dataline-scheduler | 2020-09-10 00:52:23 DEBUG PropertiesValidator:136 - validate( {"connectionId":"60413b71-0fdb-4b7a-959f-7f316e2adfb0","manual":true}, {"connectionId":"60413b71-0fdb-4b7a-959f-7f316e2adfb0","manual":true}, $)
dataline-scheduler | 2020-09-10 00:52:23 DEBUG RequiredValidator:136 - validate( {"connectionId":"60413b71-0fdb-4b7a-959f-7f316e2adfb0","manual":true}, {"connectionId":"60413b71-0fdb-4b7a-959f-7f316e2adfb0","manual":true}, $)
dataline-scheduler | 2020-09-10 00:52:23 DEBUG TypeValidator:136 - validate( "60413b71-0fdb-4b7a-959f-7f316e2adfb0", {"connectionId":"60413b71-0fdb-4b7a-959f-7f316e2adfb0","manual":true}, $.connectionId)
dataline-scheduler | 2020-09-10 00:52:23 DEBUG UUIDValidator:136 - validate( "60413b71-0fdb-4b7a-959f-7f316e2adfb0", {"connectionId":"60413b71-0fdb-4b7a-959f-7f316e2adfb0","manual":true}, $.connectionId)
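The scheduler is Java/Log4j, but the fix is the same in any logging framework: raise the level of the noisy validator loggers without touching application logs. In Python terms:

```python
import logging

# Noisy validator loggers get raised to WARNING so their DEBUG/TRACE chatter
# disappears, while application loggers stay at INFO. (Logger names here
# mirror the classes in the log excerpt; the Python API is illustrative.)
for name in ("JsonReferenceProcessor", "TypeValidator", "RequiredValidator"):
    logging.getLogger(name).setLevel(logging.WARNING)

logging.getLogger("app").setLevel(logging.INFO)
```

The Log4j equivalent is a per-logger level override in the logging config, which silences the per-request `validate(...)` lines shown above without losing warnings.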
