
jitsucom / jitsu

3.8K stars · 40 watchers · 266 forks · 37.78 MB

Jitsu is an open-source Segment alternative. Fully-scriptable data ingestion engine for modern data teams. Set-up a real-time data pipeline in minutes, not days

Home Page: https://jitsu.com

License: MIT License

Dockerfile 0.20% JavaScript 1.06% HTML 0.22% TypeScript 97.42% Shell 0.55% CSS 0.54%
data-integration clickhouse golang bigquery data-collection data-connectors redshift snowflake postgres

jitsu's People

Contributors

absorbb · echozio · jspizziri · omimakhare · scotteadams · vklimontovich · zjalicflw


jitsu's Issues

SQL queries logs

Problem
As EventNative connects to different data sources, modifies storage schemas, and inserts data, it would be useful to log the queries executed against each storage. This would help troubleshoot errors in DDL queries or data modifications.

Solution
The EventNative configuration should have a property that determines whether queries should be logged, plus a property that sets the output file name. If no file name is given, the global system logger should be used.
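
A minimal config sketch, assuming a hypothetical sql_debug_log section (the key names are illustrative, not final):

sql_debug_log:
  enabled: true                          # illustrative flag: log executed queries
  path: /home/eventnative/logs/sql.log   # omit to fall back to the global system logger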

Retrospective user recognition

We have different ways to identify a user. The one that always works relies on cookies (anonymousId). For some users, other ids become known later (through an id() call). Example: event1 and event2 arrive with anonymousId=1 only; event3 arrives with anonymousId=1 and an email.

Right after event3 we can amend event1 and event2 and add email=[email protected]

Proposed architecture (high-level):

  • Do not support all storages; support only storages where modifications are available (like ReplacingMergeTree in ClickHouse).
  • Once event3 happens, go to the state manager (see below) and check whether this particular (anonymousId, email) pair has already been processed.
  • If not, the pair should be put into a queue.
  • Once a queue worker gets to the pair, it should fetch all events with anonymousId=1 and UPDATE the records (a worker sketch follows the ideas list below).

Possible ideas for state manager and queue:

  • A standalone Postgres (assuming we won't have many "pair" events).
  • Redis (it has pub/sub).
  • Reusing a destination. Probably a bad idea, since most destinations are not well suited for KV operations.
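
A high-level sketch of the queue worker in Go, with hypothetical StateManager and Destination interfaces (all names are illustrative, not actual EventNative types):

// IdentityPair is the (anonymous id, known id) pair described above.
type IdentityPair struct {
    AnonymousID string
    Email       string
}

type StateManager interface {
    IsProcessed(p IdentityPair) (bool, error)
    MarkProcessed(p IdentityPair) error
}

type Destination interface {
    // UpdateEvents amends all stored events with the given anonymousId.
    UpdateEvents(anonymousID string, patch map[string]interface{}) error
}

func worker(queue <-chan IdentityPair, state StateManager, dest Destination) {
    for pair := range queue {
        done, err := state.IsProcessed(pair)
        if err != nil || done {
            continue
        }
        patch := map[string]interface{}{"email": pair.Email}
        if err := dest.UpdateEvents(pair.AnonymousID, patch); err != nil {
            continue // a real implementation would re-enqueue the pair
        }
        _ = state.MarkProcessed(pair)
    }
}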

Error during non-docker app start

Description

If the app is built from sources (per https://docs.eventnative.org/deployment/build-from-sources) and started without a config, an error message appears in the log.

Steps to reproduce / actual behavior

Build the binary from sources and run it without a config. The following error appears in the logs:

[ERROR]: open /home/eventnative/app/res/: no such file or directory

Expected behavior

The error shouldn't happen. All default config paths should be relative to the binary, not values convenient only for the docker environment. The app should create those folders and print a clear message with their location. Example:

  • Log directory ~/go/src/github.com/jitsucom/eventnative/build/dist/logs is missing, creating
  • Resources directory ~/go/src/github.com/jitsucom/eventnative/build/dist/res is missing, creating
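
A minimal sketch of the directory bootstrap in Go (the directory list would come from the resolved defaults):

import (
    "log"
    "os"
)

// ensureDir creates dir if it's missing and prints a clear message.
func ensureDir(dir string) error {
    if _, err := os.Stat(dir); os.IsNotExist(err) {
        log.Printf("Directory %s is missing, creating", dir)
        return os.MkdirAll(dir, 0o755)
    }
    return nil
}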

Notes

To keep docker build functioning we need to:

  • Supply a default ./eventnative.docker.yaml with /home/eventnative/* as the default paths
  • Support config merging

Schema typing

At the moment, EventNative writes all JSON fields as strings. We need to introduce typing. Types we're going to support: string, integer (64-bit), float (64-bit), timestamp (UTC).

For more details: https://docs.eventnative.dev/typecast
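
A sketch of how decoded JSON values could be mapped to these four types in Go (a simplification of what the typecast docs describe; note that encoding/json decodes every number as float64):

import "time"

type DataType int

const (
    String DataType = iota
    Int64
    Float64
    Timestamp
)

// detectType guesses the column type for a decoded JSON value.
func detectType(v interface{}) DataType {
    switch val := v.(type) {
    case float64:
        if val == float64(int64(val)) {
            return Int64 // whole number: store as 64-bit integer
        }
        return Float64
    case string:
        if _, err := time.Parse(time.RFC3339, val); err == nil {
            return Timestamp
        }
        return String
    default:
        return String
    }
}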

track.js should be built by babel

Just copying the file will not work; we need to support older browsers. Besides, we'll need it anyway if we want to use TypeScript in the future.

Add JavaScript tracker validation

We need to validate our JS files. Main validation points:

  • Syntax
  • Lack of alert() calls
  • Test that tracker initialization works (presence of window.eventN object)

Get server.name from a file

When server.name isn't specified in the EventNative config, we should read it from a file (in the app/res folder), and if the file doesn't exist, create a new one with a generated server name. Let's name this file "server.name".
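
A minimal sketch in Go (the name generator is illustrative):

import (
    "fmt"
    "math/rand"
    "os"
    "path/filepath"
    "strings"
)

// loadServerName reads app/res/server.name, generating and persisting
// a new name when the file doesn't exist yet.
func loadServerName(resDir string) (string, error) {
    path := filepath.Join(resDir, "server.name")
    data, err := os.ReadFile(path)
    if err == nil {
        return strings.TrimSpace(string(data)), nil
    }
    if !os.IsNotExist(err) {
        return "", err
    }
    name := fmt.Sprintf("server-%06d", rand.Intn(1000000)) // illustrative generator
    return name, os.WriteFile(path, []byte(name), 0o644)
}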

Cache last,error,resulting events in Redis

Background

For better transparency and diagnostics, EventNative should be able to answer the following questions:

  • What are the last N events that have been ingested?
  • How was each of those events transformed into a DB record?
  • What are the last N events that didn't make it to the DB due to an error, and what was the error?

Implementation

Right now we have an API call for question #1 ("What are the last N events that have been ingested?"). However, the events are kept in memory. That means a) no persistence across restarts, and b) to get the full picture a client has to query every node.

We're going to switch from memory to Redis (which should remain optional: no Redis means no diagnostics, but the system stays functional).

Here is the "table":

  • Name: last_events /Key: timestamp/event_id /Value: event_json, db_record, error
{
  "destination_id": "<id>",
  "table": "<id>",
  "record": [
    {
      "field": "<name>",
      "type": "sql_type",
      "value": "value"
    },
    ...
  ]
}
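
A sketch of the write path with go-redis, following the key layout above (the exact key naming is illustrative):

import (
    "context"

    "github.com/go-redis/redis/v8"
)

// cacheEvent indexes the event by timestamp in a sorted set and stores
// the payload, resulting DB record and error in a hash.
func cacheEvent(ctx context.Context, rdb *redis.Client, ts float64, eventID, eventJSON, dbRecord, errMsg string) error {
    if err := rdb.ZAdd(ctx, "last_events", &redis.Z{Score: ts, Member: eventID}).Err(); err != nil {
        return err
    }
    return rdb.HSet(ctx, "last_events:"+eventID,
        "event_json", eventJSON,
        "db_record", dbRecord,
        "error", errMsg,
    ).Err()
}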

API for direct event collection

Users should be able to post events directly from an app or backend. The endpoint should be similar to /api/v1/event, but:

  • No IP address should be resolved from headers
  • Geo lookup should be done only if IP address is present

Enrichment rule

Concept

An enrichment rule is a piece of business logic that transforms the original JSON event. The rule properties are:

  • Input: JSON node path as /json/path
  • Output: JSON node path as /json/out
  • Rule: string

A rule should be representable as a function F(json_node) → json_node.

Supported rules

So far we need to support two rules:

  • ip_lookup
  • user_agent_parse

We already have the code; we just need to wrap it into new structures.
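
A sketch of what those structures could look like in Go (names are illustrative):

// Rule is the F(json_node) → json_node function described above.
type Rule interface {
    Name() string
    Transform(node map[string]interface{}) (map[string]interface{}, error)
}

// EnrichmentStep binds a rule to its input and output JSON paths.
type EnrichmentStep struct {
    Rule Rule
    From string // e.g. "/ip_address"
    To   string // e.g. "/geo/"
}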

Rules configuration

Rules should be configured on a per-destination basis:

destinations:
  destination_name:
    enrichment:
       - 
          name: ip_lookup
          from: /ip_address
          to: /geo/

Implicit rules

Some rules should always exist. They are mainly needed for events coming from web browsers:

    enrichment:
       - 
          name: ip_lookup
          from: /source_ip
          to: /eventn_ctx/location
       - 
          name:  user_agent_parse
          from: /eventn_ctx/user_agent
          to: /eventn_ctx/parsed_ua     

The same configuration is also supported for the server-to-server integration:

    enrichment:
       - 
          name: ip_lookup
          from: /device_ctx/location/ip
          to: /eventn_ctx/location
       - 
          name:  user_agent_parse
          from: /device_ctx/user_agent
          to: /eventn_ctx/parsed_ua     

Autocapture all events

It seems like a good idea to implement an "autocapture" feature, so that users don't have to track each action manually. A good example of how it should work can be found in the PostHog documentation.
We must also take care that no sensitive data (passwords, credit cards, and so on) is captured by the tracker.

Represent string columns as Text

Problem

At present EventNative creates string columns as character varying(8192) and truncates longer strings.
Also, EventNative represents arrays as strings, and in this case the array value might become invalid.

Solution

EventNative should create string columns as text instead of character varying(8192).
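
For illustration, the current vs. proposed Postgres DDL for a string column:

-- current: values longer than 8192 characters get truncated
CREATE TABLE events (user_agent character varying(8192));

-- proposed: no length limit, no truncation
CREATE TABLE events (user_agent text);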

Hot reload of configuration

We need to support hot reload of the configuration (the destinations section): users should be able to change destination settings and mappings without service interruption.

Reload should be initiated by the SIGHUP signal.
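
A minimal sketch of the signal handling in Go (reloadDestinations is an illustrative placeholder for the actual reload logic):

import (
    "log"
    "os"
    "os/signal"
    "syscall"
)

func watchSIGHUP(reloadDestinations func() error) {
    ch := make(chan os.Signal, 1)
    signal.Notify(ch, syscall.SIGHUP)
    go func() {
        for range ch {
            log.Println("SIGHUP received, reloading destinations")
            if err := reloadDestinations(); err != nil {
                log.Printf("reload failed, keeping the old config: %v", err)
            }
        }
    }()
}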

JS Tracker improvements & dependencies fix

I've noticed several small flaws in the current setup of the npm module (./web/):

  • Dependencies should be moved to devDependencies, as they aren't required in the production code.
  • It would be nice to have a file watcher with automatic recompilation.
  • Using scripts like buildjs.sh is not a very good practice; popular projects more often use npm scripts (e.g. https://github.com/mobxjs/mobx/blob/mobx6/package.json).
  • I would suggest using the port-choosing utility from react-dev-utils/WebpackDevServerUtils (as port 80 is often busy on devs' machines).
  • It would also be nice to have some health checks after building a new version of the JS Tracker.

NPM package improvements

Problem

Our npm package can be improved. This is a 'blanket' issue, since each improvement is too small to deserve a separate issue:

  • The package does not ship a d.ts file, which makes autocompletion unavailable in most IDEs.
  • JS should capture other cookies (GA, Segment, etc.). This is related to #115: we need to direct only non-captured events to GA.
  • We need to switch from import {eventN} from '...' to import eventN from '...'. However, we need to keep backward compatibility.
  • Every parameter of the init() call should be properly documented.
  • Export of the NPM version doesn't work.


Delete and bulk insert operation

Problem

For syncing from third-party sources, EventNative should be able to run 'bulk replace' operations. A bulk replace is the equivalent of

DELETE from destination where field=value;
INSERT INTO destination VALUES (...);
...
INSERT INTO destination VALUES (...);

executed in a single transaction

Solution

EventNative should have an insert and replace API with the following payload:

{
  "replace": {
    "where": [
      {
        "field": "value"
      }
    ]
  },
  "__destiation_table": "<table>",
  "__primary_key": "<field>",
  "events": [
    {}, {}
  ]
}

The payload should not go through the streaming pipeline; it should be written in a synchronous call.

Scope

The scope of this issue includes:

  • Adding (optional) delete and bulk insert operations to the Destination interface
  • Changing the sync executor to the new API: the where section should contain __chunk=''
  • Saving __chunk with every event

URL randomization

/api.{random 5 lowercase characters}?p_{random 5 lowercase characters}={id}

should be handled the same way as

/api/v1/event?token={id}
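
A sketch of how the randomized form could be matched in Go (assuming "lowercase characters" means a-z; the canonical route stays as-is):

import (
    "net/http"
    "regexp"
)

var (
    randomizedPath  = regexp.MustCompile(`^/api\.[a-z]{5}$`)
    randomizedParam = regexp.MustCompile(`^p_[a-z]{5}$`)
)

// extractToken returns the token for both canonical and randomized URLs.
func extractToken(r *http.Request) string {
    if r.URL.Path == "/api/v1/event" {
        return r.URL.Query().Get("token")
    }
    if randomizedPath.MatchString(r.URL.Path) {
        for name, values := range r.URL.Query() {
            if randomizedParam.MatchString(name) && len(values) > 0 {
                return values[0]
            }
        }
    }
    return ""
}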

Allow defining a chain of events through inline.js

  • If an event=XXX parameter is present in the GET parameters (inline.js?event=XXX), add the following line to the end of the JS output:

eventN.track('XXX');

  • Support multiple parameters (inline.js?e=XXX&e=YYY):

eventN.track('XXX'); eventN.track('YYY')

Authorization tokens reload

Token auth (server.auth) can accept multiple types of values:

  • A string starting with http://: download the token JSON from that HTTP path
  • A string starting with file://: get the token JSON from a file
  • A JSON object (or array): parse the embedded JSON
  • Any other string: treat it as a single token

For options (1) and (2) the server should support hot reload.

Token JSON syntax:

  • JSON array
  • Each element can be either string, or JSON object
  • String "" is equivalent of {"token": "str"}
  • Token object syntax:
{
  "token": "<public token>", //used in javascript tracking
  "origins": ["abc.com", "*.abc.com"] //check Origin header in XHR calls. "val" is treated as singleton array
}
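
A sketch of the value-type dispatch in Go (helper names are illustrative):

import "strings"

// TokenSource tells the loader where the token JSON comes from.
type TokenSource int

const (
    SourceHTTP         TokenSource = iota // option 1, hot-reloadable
    SourceFile                            // option 2, hot-reloadable
    SourceEmbeddedJSON                    // option 3
    SourceSingleToken                     // option 4
)

func classify(authValue string) TokenSource {
    switch {
    case strings.HasPrefix(authValue, "http://"), strings.HasPrefix(authValue, "https://"):
        return SourceHTTP
    case strings.HasPrefix(authValue, "file://"):
        return SourceFile
    case strings.HasPrefix(authValue, "{"), strings.HasPrefix(authValue, "["):
        return SourceEmbeddedJSON
    default:
        return SourceSingleToken
    }
}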

Configure default mapping behavior

At present we have mapping rules, and fields which aren't covered by any rule are written to the table as-is. We should have a configuration flag to override this behavior and drop all fields not covered by mapping rules.
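
A possible shape for that flag, assuming a keep_unmapped key inside the mapping config (the key name is illustrative):

destinations:
  destination_name:
    mappings:
      keep_unmapped: false   # illustrative: drop fields not covered by rules
      fields:
        - ...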

Get rid of ClickHouse Nullable types

Because Nullable data types hurt ClickHouse performance, we should make all fields non-nullable by default unless they are listed in a nullable-fields config (see ClickHouse Data types). Instead of NULL we should set default values (preferably on the ClickHouse side).
All fields will be non-nullable by default, and we should change the ClickHouse engine config parameter from non-null fields to nullable fields.
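
For illustration, a column definition before and after in ClickHouse:

-- before: Nullable hurts performance
CREATE TABLE events (email Nullable(String)) ENGINE = MergeTree ORDER BY tuple();

-- after: non-nullable, with a server-side default instead of NULL
CREATE TABLE events (email String DEFAULT '') ENGINE = MergeTree ORDER BY tuple();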

Privacy: Run self-hosted tracker without cookie

First: This looks like a very promising project! Thank you very much for your work!
Once the Clickhouse integration is finished, this could replace our own hacky solution for tracking events.

But there is one caveat: GDPR and the cookie regulations we have here in Europe. Using cookies for tracking purposes requires us to inform the user and collect an opt-in (which they most likely won't give anyway...).

Could you imagine adding a cookie-less mode for the tracking script? Some projects, like Fathom, Ackee or GoatCounter, implement ways to track without a cookie (you can find a very interesting insight here: https://github.com/zgoat/goatcounter/blob/master/docs/sessions.markdown).

pSQL support

We want to support pSQL despite the fact that it's not usually used as a DWH or data lake. However, certain low-volume installations can benefit from the simplicity of pSQL.

Unlike BQ and Redshift, we need to write to pSQL directly, bypassing log files (log files should still be written unless the user explicitly opts out). Once an event is recorded, it should be stored in an in-memory queue. A separate worker should process the queue and send events to pSQL (see the sketch after the discussion points).

Notes:

  • We need to think about parallelism. What happens if two instances decide to update the schema? We need a database or table lock for each such operation.
  • What happens if some events remain in the queue during shutdown? We should store a snapshot in a file.
  • If SQL is down, we still need to keep the system up. If an event wasn't processed, we need to put it back into the queue.

Discussion points:

  • What kind of queue should we use? Any good libs in Go?
  • Are third-party queues (e.g. RabbitMQ) an option?
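
A minimal sketch of the queue and worker in Go, with a hypothetical insertEvent helper (snapshotting and schema locking omitted):

import (
    "database/sql"
    "log"
)

type Event map[string]interface{}

// startWorker drains the in-memory queue and writes events to pSQL.
// On failure the event is put back, as described in the notes above.
func startWorker(queue chan Event, db *sql.DB, insertEvent func(*sql.DB, Event) error) {
    go func() {
        for ev := range queue {
            if err := insertEvent(db, ev); err != nil {
                log.Printf("insert failed, re-queueing: %v", err)
                queue <- ev // a real implementation needs backoff and a bounded buffer
            }
        }
    }()
}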

Go app should serve static files relative to binary, not to PWD

Description

The ./eventnative binary serves static files relative to the PWD, not to the binary. This makes app availability depend on where the binary is started from.

Steps to reproduce

Run the binary from a directory other than the one it lives in.

Actual behavior

The binary starts but can't serve its resources and static files:

2020/11/22 17:07:24 [WARN]: Custom eventnative.yaml wasn't provided
2020-11-22 17:07:24 [INFO]: *** Creating new AppConfig ***
2020-11-22 17:07:24 [INFO]: Server Name: unnamed-server
2020-11-22 17:07:24 [WARN]: Server public url: will be taken from Host header
2020-11-22 17:07:24 [ERROR]: open /home/eventnative/app/res/: no such file or directory
....
2020-11-22 17:07:24 [ERROR]: Error reading ./web/welcome.html file: open ./web/welcome.html: no such file or directory
2020-11-22 17:07:24 [ERROR]: Error reading static file dir ./web/ open ./web/: no such file or directory
2020-11-22 17:07:24 [INFO]: Started server: 0.0.0.0:8001

Expected behavior

The code should resolve static file names relative to the binary location, not to the PWD:
cd ~ ; ~/go/src/github.com/jitsucom/eventnative/build/dist/eventnative should be equivalent to cd ~/go/src/github.com/jitsucom/eventnative/build/dist/; ./eventnative
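
A sketch of binary-relative resolution in Go:

import (
    "os"
    "path/filepath"
)

// relativeToBinary resolves rel against the directory containing the
// running executable instead of the current working directory.
func relativeToBinary(rel string) (string, error) {
    exe, err := os.Executable()
    if err != nil {
        return "", err
    }
    return filepath.Join(filepath.Dir(exe), rel), nil
}

With this, relativeToBinary("web/welcome.html") finds the file no matter where the binary was started from.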


EventNative caching last events

EventNative should keep the last N (100 by default) accepted events per API key in memory.

Those events should be available via a secured (by admin_token via the X-Admin-Token header) GET /cache/events endpoint with the following response structure:

{"events": [{payload_event1}, {payload_eventN}]}

Please consider a limit query parameter for limiting the number of response events.
Also, consider an apikeys filter query parameter which can contain a comma-separated list (or a single value) of API keys.
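
An example request under that design (parameter names as proposed above):

GET /cache/events?limit=50&apikeys=key1,key2
X-Admin-Token: <admin_token>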

Segment UserID loss

The userId field exists in Segment /p (page) requests, but EN doesn't capture it. This might happen because analytics.js enriches requests with additional parameters (like userId) after the middleware runs.

Steps to reproduce: go to the track demo page, send a page event, and look through the Segment and EN requests in dev tools.
Expected result: the EN request contains userId in src_payload.obj.userId
Actual result: the EN request doesn't contain userId

Destination: Postgres | SSL

Hi! 👋

I received this error when attempting to spin up EventNative using a Postgres DB on my localhost.
Does postgres require SSL for the application to work? I'd like to avoid configuring SSL on my local instance.

api_1 | 2020/08/17 03:59:02 Error initializing postgres_ksense destination of type postgres: pq: SSL is not enabled on the server
api_1 | 2020/08/17 03:59:02 There is no configured event batch destinations

Thank you for the information.
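
For reference, the pq error usually disappears when the connection disables SSL. A hedged config sketch, assuming the datasource section forwards driver parameters (the exact key names may differ in the actual EventNative config):

destinations:
  postgres_ksense:
    type: postgres
    datasource:
      host: localhost
      db: mydb
      username: user
      password: pass
      parameters:
        sslmode: disable   # assumption: parameters are passed through to the pq driver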

S3 as destination: flat file support

Is there a way to get EventNative to buffer/ship tracked line items to flat files somewhere like S3? I would imagine you already have to do this anyway for loading into Redshift, so the goal would be to keep the original buffered staging files, but as the final destination.

EventNative service discovery

If the coordination layer (etcd) is set up, every EventNative instance should report its existence to that layer (as a heartbeat).

Each EventNative instance should have a secured (by admin key) endpoint that reports all instances (with the last heartbeat date). Let's report all instances active within the last N minutes, where N is an optional parameter (2 by default).
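
A sketch of the heartbeat with the etcd v3 client (the key layout is illustrative):

import (
    "context"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// heartbeat registers this instance under a TTL lease; etcd drops the key
// automatically if the instance stops renewing it.
func heartbeat(ctx context.Context, cli *clientv3.Client, instanceID string) error {
    lease, err := cli.Grant(ctx, 120) // TTL in seconds
    if err != nil {
        return err
    }
    key := "/eventnative/instances/" + instanceID
    value := time.Now().UTC().Format(time.RFC3339) // last heartbeat date
    if _, err := cli.Put(ctx, key, value, clientv3.WithLease(lease.ID)); err != nil {
        return err
    }
    _, err = cli.KeepAlive(ctx, lease.ID) // renews the lease in the background
    return err
}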

NPM package rename

Since the company was renamed from ksense to jitsu, we need to rename our npm package @ksense/eventnative → @jitsu/eventnative. Since direct renaming is not possible, here's what we need to do:

  • Make a new package @jitsu/eventnative
  • Deprecate @ksense/eventnative, remove all its content and add a dependency on @jitsu/eventnative. Also proxy the eventN exports
  • Update all documentation
  • Test that the system remains backward compatible

tracking_hostname filter for destinations

Let's say you want to track events on several different websites, and for business or legal reasons, event logs need to be stored separately.

You most likely want to use one global eventnative infra with several instances and a load balancer for this.

It would then be great if you could set a tracking_hostname filter for each destination, so that only the right events are streamed to it (a possible shape is sketched below).
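
A hypothetical configuration shape for such a filter (the tracking_hostname key doesn't exist yet; it's what this issue proposes):

destinations:
  site_a_warehouse:
    type: postgres
    tracking_hostname: site-a.example.com   # only events tracked on this host
  site_b_warehouse:
    type: postgres
    tracking_hostname: site-b.example.com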

Two destination modes: streaming and batch

Two modes should be available for a destination: 'streaming' and 'batch'. For low-volume types of events (example: conversions) it doesn't make sense to send them in batches; they could be written directly.

Streaming

Once an event is received (and enriched), it should be put into an internal queue. A separate thread should process events and send them to the destination (should the schema be checked on each request?). We will still write events to logs, but only for informational purposes.

Batch

Just keep current logic

Configuration

For each destination, the following property should be supported: mode: stream | batch (batch by default).
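
A minimal config example (the destination name is illustrative):

destinations:
  conversions_db:
    type: postgres
    mode: stream   # omit for the default batch mode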

Destinations support

  • pSQL - both
  • RedShift - both
  • BQ - both
  • ClickHouse - both
  • S3 - batch only

Support intercepting events even if JS hasn't been added the correct way

We require our users to add the eventN tracker after the GA / Segment code. However, we want our code to work even if the tracker was added before it. In that case, we need to check for the presence of the GA/analytics object. If it's not present at all, we need to set a timeout and wait for a while. Let's make this configurable: up to N cycles, with X ms of wait time for each of them.

Snowflake support

We need to support Snowflake. So far it's unclear which API should be used to load data.

Lock while schema patching using etcd

When the database schema is patched, we should be able to use distributed locks if multiple instances of the server are running. The lock server should be configurable. The default implementation is a dummy one which doesn't lock anything.
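
A sketch of an etcd-backed implementation using the official concurrency package (the lock key is illustrative):

import (
    "context"

    clientv3 "go.etcd.io/etcd/client/v3"
    "go.etcd.io/etcd/client/v3/concurrency"
)

// withSchemaLock runs patch while holding a distributed lock on the table.
func withSchemaLock(ctx context.Context, cli *clientv3.Client, table string, patch func() error) error {
    session, err := concurrency.NewSession(cli)
    if err != nil {
        return err
    }
    defer session.Close()

    mu := concurrency.NewMutex(session, "/eventnative/locks/schema/"+table)
    if err := mu.Lock(ctx); err != nil {
        return err
    }
    defer mu.Unlock(ctx)

    return patch()
}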
