
Comments (4)

thomasborgen commented on September 20, 2024

Hi again @ChameleonTartu
I did a test where I changed an Avro schema into a JSON Schema with kaiba using the Kaiba App, and it worked.
However, it made it clear that we have a limitation in kaiba: we are unable to transform values into keys. For example, in Avro a field is defined like this:

{"name": "field_name", "type": "string"}

But in JSON Schema a field's name is its key, as in:

{
  "properties": {
    "field_name": {
      "type": "string"
    }
  }
}

I think this is something we should look into supporting, since it could be very powerful. It would involve getting a key's name instead of its value, and maybe also extending the kaiba object to make it possible to have a dynamic name.
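
For comparison, here is that values-to-keys step written as plain Python outside of kaiba. It is only a sketch of the transformation being discussed, not an existing kaiba feature.

# Plain-Python sketch of the values-to-keys transformation discussed above.
# This is not a kaiba feature; it only shows what "a field's name becomes
# its key" means when going from an Avro record to a JSON Schema.
avro_schema = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Age", "type": "int"},
    ],
}

json_schema = {
    "title": avro_schema["name"],
    "type": "object",
    "properties": {
        field["name"]: {"type": field["type"]}  # the value "Name" becomes a key
        for field in avro_schema["fields"]
    },
}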

For Protobuf, since it's not JSON data, we can't map directly to it. We can only map to the correct structure and let a post-processor handle the dump from JSON to Protobuf.
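
A minimal sketch of such a post-processor, assuming a compiled message class Employee generated by protoc from a hypothetical employee.proto with Name and Age fields; only the google.protobuf.json_format call is a known API, the rest is illustrative.

# Post-processing sketch: kaiba maps to a JSON-compatible dict, then a
# separate step dumps that dict to Protobuf. The employee_pb2 module and
# its Employee message are hypothetical, generated by protoc.
from google.protobuf import json_format

from employee_pb2 import Employee  # hypothetical generated module

mapped = {"Name": "Ada", "Age": 36}  # output of a kaiba mapping

message = json_format.ParseDict(mapped, Employee())
payload = message.SerializeToString()  # binary Protobuf, ready to send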

Here is how I changed the Avro schema into a JSON Schema:

Given this Avro schema

{
  "type": "record",
  "namespace": "Tutorialspoint",
  "name": "Employee",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Age", "type": "int"}
  ]
}

And this kaiba config

{
  "name": "root",
  "array": false,
  "iterators": [],
  "attributes": [
    {
      "name": "title",
      "default": "Employee"
    },
    {
      "name": "type",
      "default": "object"
    }
  ],
  "objects": [
    {
      "name": "properties",
      "array": false,
      "objects": [
        {
          "name": "Name",
          "array": false,
          "iterators": [],
          "attributes": [
            {
              "name": "type",
              "data_fetchers": [
                {
                  "path": ["fields", 0, "type"]
                }
              ]
            }
          ]
        },
        {
          "name": "Age",
          "array": false,
          "iterators": [],
          "attributes": [
            {
              "name": "type",
              "data_fetchers": [
                {
                  "path": ["fields", 1, "type"]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

You can produce this:

{
  "title": "Employee",
  "type": "object",
  "properties": {
    "Name": {
      "type": "string"
    },
    "Age": {
      "type": "int"
    }
  }
}

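A minimal sketch of running that mapping programmatically; the import path and the process(input_data, configuration) signature are assumptions about kaiba's Python API, so check the kaiba documentation for the exact call.

# Assumed kaiba usage; the import path and signature are not confirmed here.
import json

from kaiba.process import process  # assumption: kaiba's Python entry point

with open("employee.avsc") as schema_file:      # the Avro schema above
    avro_schema = json.load(schema_file)
with open("kaiba_config.json") as config_file:  # the kaiba config above
    config = json.load(config_file)

json_schema = process(avro_schema, config)  # return type may differ by version
print(json.dumps(json_schema, indent=2))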

thomasborgen commented on September 20, 2024

Hi @ChameleonTartu, thanks for the question :)

After a quick look into Avro, it seems to me that it's a format used for transferring data quickly. It contains a schema, which the Avro writer uses to validate the incoming data it is going to write. If the data is valid, the Avro writer encodes the schema plus the data into a binary blob. Any Avro reader can then read and decode the data properly, because the schema is included in the blob.
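
Here is a small sketch of that round trip in Python, using the fastavro package (my choice for illustration; nothing in this thread prescribes a specific library).

# Writer side: fastavro validates the records against the schema and
# encodes schema + data into one binary blob.
import io

import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Age", "type": "int"},
    ],
})

buffer = io.BytesIO()
fastavro.writer(buffer, schema, [{"Name": "Ada", "Age": 36}])

# Reader side: the schema travels with the blob, so decoding needs no
# extra input.
buffer.seek(0)
for record in fastavro.reader(buffer):
    print(record)  # {'Name': 'Ada', 'Age': 36}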

From what I understand, Kaiba could be used both in front, before data is injected into Avro, to make arbitrary data conform with what Avro expects, or behind, after the Avro reader has read the data and output some JSON, to turn it into a more desired format.
My initial thinking is that Kaiba should not need to handle the schema part of Avro, but I'll give this more thought.

I've been contemplating adding some pre/post-processors directly into kaiba core, but I'm not sure if that's the right place.

I've also had a quick look at Protobuf just now, and I was wondering if you could explain a bit more about the use case. Is your use case to change the .proto schema into a JSON Schema declaration? Or is it again to change the data before injection and after reading?


ChameleonTartu commented on September 20, 2024

@thomasborgen Let me bring in a bit of context, as my use cases come from Data Engineering and working with Apache Kafka, not from the integration space where we have used Kaiba.

CONTEXT:

In a broad sense, Apache Kafka is a data bus where you dump your data (produce) and read it later (consume).
Kafka doesn't care what the data is; it only knows and stores bytes. It has topics, which are practically different channels/queues where you put your messages to keep them separated.

A Kafka reader or writer needs an extra component, the Schema Registry. The Schema Registry stores schemas in AVRO, JSON (JSON Schema), or PROTOBUF format. The algorithm that writers and readers follow is:

Writer:

  1. Create schema
  2. Store schema in Schema registry
  3. Generate data based on the schema
  4. Send to Kafka

Reader:

  1. Retrieve schema
  2. Validate incoming data against the schema
  3. Process further

The common part, independent of the schema format, is:

schema = read_schema()
validate(message, schema)
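
As a concrete example, here is that common part for the JSON Schema case, using the jsonschema Python package; the AVRO and PROTOBUF cases have their own validators.

# JSON Schema flavour of read_schema() + validate(message, schema).
from jsonschema import ValidationError, validate

schema = {
    "title": "Employee",
    "type": "object",
    "properties": {
        "Name": {"type": "string"},
        "Age": {"type": "integer"},
    },
    "required": ["Name", "Age"],
}

message = {"Name": "Ada", "Age": 36}

try:
    validate(instance=message, schema=schema)  # raises on invalid data
except ValidationError as error:
    print(f"Rejected message: {error.message}")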

PROBLEM:

There is no simple way to convert a schema from one format to another, so I cannot take AVRO and convert it to JSON Schema, or JSON Schema to PROTOBUF.
The expected behavior would be to convert schema to schema with no pain:

AVRO <-> JSON
PROTOBUF <-> JSON

The reason for this use case to exist, and its relation to Kaiba:

1. Read messages from topic ABC. The messages are in JSON format.
2. Enrich or trim the messages and post them to topic XYZ. The message should be in AVRO format. (Kaiba-related)
3. Make a decision and post to ATP; the message should be in PROTOBUF format.

Enrichment and data manipulation are truly Kaiba's reason for existing, but the integration between formats must be solved.
As I wrote before, JSON to AVRO and JSON to PROTOBUF are partially solved by the industry, while conversion between JSON Schema and the other schema formats is a wide-open question that, to my knowledge, still needs to be solved.

QUESTIONS:
Is it kaiba-related? Yes, partially, because Kaiba is great at manipulating data based on a schema.
Do you think this particular request should go to Kaiba core? Not necessarily; it could go into the Kaiba ecosystem and help promote it in the Data Engineering niche.

I am open to discussion and ready to contribute to this branch of the project as I see a great need for it myself.


ChameleonTartu commented on September 20, 2024

@thomasborgen This works; the only issue is that it doesn't do any magic. It is very manual and requires understanding both formats very well, as well as kaiba itself.

Even so, I think this is a great solution, so I will include it in the manual for transforming AVRO to JSON.

Do I understand correctly that there is no "reverse" transformation available yet?

