
jitsucom / jitsu

3.8K stars · 40 watchers · 266 forks · 37.78 MB

Jitsu is an open-source Segment alternative. Fully-scriptable data ingestion engine for modern data teams. Set-up a real-time data pipeline in minutes, not days

Home Page: https://jitsu.com

License: MIT License

Dockerfile 0.20% JavaScript 1.06% HTML 0.22% TypeScript 97.42% Shell 0.55% CSS 0.54%
data-integration clickhouse golang bigquery data-collection data-connectors redshift snowflake postgres

jitsu's People

Contributors

absorbb · echozio · jspizziri · omimakhare · scotteadams · vklimontovich · zjalicflw


jitsu's Issues

SQL queries logs

Problem
As EventNative connects to different data sources, modifies storage schemas, and inserts data, it would be useful to log the queries executed against each storage. This would help troubleshoot errors in DDL queries or data modifications.

Solution
The EventNative configuration should have a property that determines whether queries should be logged, plus a property that sets the output file name. If no file name is given, the global system logger should be used.
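
A minimal config sketch, assuming a hypothetical sql_debug_log section (the key names are illustrative, not final):

sql_debug_log:
  enabled: true                          # illustrative flag: log executed queries
  path: /home/eventnative/logs/sql.log   # omit to fall back to the global system logger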

Retrospective user recognition

We have different ways to identify a user. The one that always works relies on cookies (anonymousId). For some users, other ids become known later (through an id() call). Example: event1 and event2 arrive with anonymousId=1 only; event3 arrives with anonymousId=1 and an email.

Right after event3 we can amend event1 and event2 and add email=[email protected]

Proposed architecture (high-level):

  • Do not support all storages; support only storages where modifications are available (like ReplacingMergeTree in ClickHouse).
  • Once event3 happens, go to the state manager (see below) and check whether this particular (anonymousId, email) pair has already been processed.
  • If not, the pair should be put into a queue.
  • Once a queue worker gets to the pair, it should fetch all events with anonymousId=1 and UPDATE the records (a worker sketch follows the ideas list below).

Possible ideas for state manager and queue:

  • A standalone Postgres (assuming we won't have many "pair" events).
  • Redis (it has pub/sub).
  • Reusing a destination. Probably a bad idea, since most destinations are not well suited for KV operations.
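
A high-level sketch of the queue worker in Go, with hypothetical StateManager and Destination interfaces (all names are illustrative, not actual EventNative types):

// IdentityPair is the (anonymous id, known id) pair described above.
type IdentityPair struct {
    AnonymousID string
    Email       string
}

type StateManager interface {
    IsProcessed(p IdentityPair) (bool, error)
    MarkProcessed(p IdentityPair) error
}

type Destination interface {
    // UpdateEvents amends all stored events with the given anonymousId.
    UpdateEvents(anonymousID string, patch map[string]interface{}) error
}

func worker(queue <-chan IdentityPair, state StateManager, dest Destination) {
    for pair := range queue {
        done, err := state.IsProcessed(pair)
        if err != nil || done {
            continue
        }
        patch := map[string]interface{}{"email": pair.Email}
        if err := dest.UpdateEvents(pair.AnonymousID, patch); err != nil {
            continue // a real implementation would re-enqueue the pair
        }
        _ = state.MarkProcessed(pair)
    }
}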

Error during non-docker app start

Description

If the app is built from sources (per https://docs.eventnative.org/deployment/build-from-sources) and started without a config, an error message appears in the log.

Steps to reproduce / actual behavior

Build the binary from sources and run it without a config. The following error appears in the logs:

[ERROR]: open /home/eventnative/app/res/: no such file or directory

Expected behavior

The error shouldn't happen. All default config paths should be relative to the binary, not values convenient only for the docker environment. The app should create those folders and print a clear message with their location. Example:

  • Log directory ~/go/src/github.com/jitsucom/eventnative/build/dist/logs is missing, creating
  • Resources directory ~/go/src/github.com/jitsucom/eventnative/build/dist/res is missing, creating
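
A minimal sketch of the directory bootstrap in Go (the directory list would come from the resolved defaults):

import (
    "log"
    "os"
)

// ensureDir creates dir if it's missing and prints a clear message.
func ensureDir(dir string) error {
    if _, err := os.Stat(dir); os.IsNotExist(err) {
        log.Printf("Directory %s is missing, creating", dir)
        return os.MkdirAll(dir, 0o755)
    }
    return nil
}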

Notes

To keep docker build functioning we need to:

  • Supply a default ./eventnative.docker.yaml with /home/eventnative/* as the default paths
  • Support config merging

Schema typing

At the moment, EventNative writes all JSON fields as strings. We need to introduce typing. Types we're going to support: string, integer (64-bit), float (64-bit), timestamp (UTC).

For more details: https://docs.eventnative.dev/typecast
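
A sketch of how decoded JSON values could be mapped to these four types in Go (a simplification of what the typecast docs describe; note that encoding/json decodes every number as float64):

import "time"

type DataType int

const (
    String DataType = iota
    Int64
    Float64
    Timestamp
)

// detectType guesses the column type for a decoded JSON value.
func detectType(v interface{}) DataType {
    switch val := v.(type) {
    case float64:
        if val == float64(int64(val)) {
            return Int64 // whole number: store as 64-bit integer
        }
        return Float64
    case string:
        if _, err := time.Parse(time.RFC3339, val); err == nil {
            return Timestamp
        }
        return String
    default:
        return String
    }
}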

track.js should be built by babel

Just copying the file will not work; we need to support older browsers. Besides, we'll need it anyway if we want to use TypeScript in the future.

Add JavaScript tracker validation

We need to validate our JS files. Main validation points:

  • Syntax
  • Lack of alert() calls
  • Test that tracker initialization works (presence of window.eventN object)

Get server.name from a file

When server.name isn't specified in the EventNative config, we should read it from a file (in the app/res folder), and if the file doesn't exist, create a new one with a generated server name. Let's name this file "server.name".
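
A minimal sketch in Go (the name generator is illustrative):

import (
    "fmt"
    "math/rand"
    "os"
    "path/filepath"
    "strings"
)

// loadServerName reads app/res/server.name, generating and persisting
// a new name when the file doesn't exist yet.
func loadServerName(resDir string) (string, error) {
    path := filepath.Join(resDir, "server.name")
    data, err := os.ReadFile(path)
    if err == nil {
        return strings.TrimSpace(string(data)), nil
    }
    if !os.IsNotExist(err) {
        return "", err
    }
    name := fmt.Sprintf("server-%06d", rand.Intn(1000000)) // illustrative generator
    return name, os.WriteFile(path, []byte(name), 0o644)
}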

Cache last,error,resulting events in Redis

Background

For better transparency and diagnostics, EventNative should be able to answer the following questions:

  • What are the last N events that have been ingested?
  • How was each of those events transformed into a DB record?
  • What are the last N events that didn't make it to the DB due to an error, and what was the error?

Implementation

Right now we have an API call for question #1 ("What are the last N events that have been ingested?"). However, the events are kept in memory. That means a) no persistence across restarts, and b) to get the full picture a client has to query every node.

We're going to switch from memory to Redis (which should remain optional: no Redis means no diagnostics, but the system stays functional).

Here is the "table":

  • Name: last_events /Key: timestamp/event_id /Value: event_json, db_record, error
{
  "destination_id": "<id>",
  "table": "<id>",
  "record": [
    {
      "field": "<name>",
      "type": "sql_type",
      "value": "value"
    },
    ...
  ]
}
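
A sketch of the write path with go-redis, following the key layout above (the exact key naming is illustrative):

import (
    "context"

    "github.com/go-redis/redis/v8"
)

// cacheEvent indexes the event by timestamp in a sorted set and stores
// the payload, resulting DB record and error in a hash.
func cacheEvent(ctx context.Context, rdb *redis.Client, ts float64, eventID, eventJSON, dbRecord, errMsg string) error {
    if err := rdb.ZAdd(ctx, "last_events", &redis.Z{Score: ts, Member: eventID}).Err(); err != nil {
        return err
    }
    return rdb.HSet(ctx, "last_events:"+eventID,
        "event_json", eventJSON,
        "db_record", dbRecord,
        "error", errMsg,
    ).Err()
}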

API for direct event collection

Users should be able to post events directly from an app or backend. The endpoint should be similar to /api/v1/event, but:

  • No IP address should be resolved from headers
  • Geo lookup should be done only if IP address is present

Enrichment rule

Concept

An enrichment rule is a piece of business logic that transforms the original JSON event. The rule properties are:

  • Input: JSON node path as /json/path
  • Output: JSON node path as /json/out
  • Rule: string

A rule should be representable as a function F(json_node) → json_node.

Supported rules

So far we need to support two rules:

  • ip_lookup
  • user_agent_parse

We already have the code; we just need to wrap it into new structures.
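
A sketch of what those structures could look like in Go (names are illustrative):

// Rule is the F(json_node) → json_node function described above.
type Rule interface {
    Name() string
    Transform(node map[string]interface{}) (map[string]interface{}, error)
}

// EnrichmentStep binds a rule to its input and output JSON paths.
type EnrichmentStep struct {
    Rule Rule
    From string // e.g. "/ip_address"
    To   string // e.g. "/geo/"
}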

Rules configuration

Rules should be configured on a per-destination basis:

destinations:
  destination_name:
    enrichment:
       - 
          name: ip_lookup
          from: /ip_address
          to: /geo/

Implicit rules

Some rules should always exist. They are mainly needed for events coming from web browsers:

    enrichment:
       - 
          name: ip_lookup
          from: /source_ip
          to: /eventn_ctx/location
       - 
          name:  user_agent_parse
          from: /eventn_ctx/user_agent
          to: /eventn_ctx/parsed_ua     

The same configuration is also supported for the server-to-server integration:

    enrichment:
       - 
          name: ip_lookup
          from: /device_ctx/location/ip
          to: /eventn_ctx/location
       - 
          name:  user_agent_parse
          from: /device_ctx/user_agent
          to: /eventn_ctx/parsed_ua     

Autocapture all events

It seems like a good idea to implement an "autocapture" feature, so that users don't have to track each action manually. A good example of how it should work can be found in the PostHog documentation.
We must also take care that no sensitive data (passwords, credit cards, and so on) is captured by the tracker.

Represent string columns as Text

Problem

At present EventNative creates string columns as character varying(8192) and truncates longer strings.
Also, EventNative represents arrays as strings, and in this case the array value might become invalid.

Solution

EventNative should create string columns as text instead of character varying(8192).
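
For illustration, the current vs. proposed Postgres DDL for a string column:

-- current: values longer than 8192 characters get truncated
CREATE TABLE events (user_agent character varying(8192));

-- proposed: no length limit, no truncation
CREATE TABLE events (user_agent text);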

Hot reload of configuration

We need to support hot reload of the configuration (the destinations section): users should be able to change destination settings and mappings without service interruption.

Reload should be initiated by the SIGHUP signal.
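
A minimal sketch of the signal handling in Go (reloadDestinations is an illustrative placeholder for the actual reload logic):

import (
    "log"
    "os"
    "os/signal"
    "syscall"
)

func watchSIGHUP(reloadDestinations func() error) {
    ch := make(chan os.Signal, 1)
    signal.Notify(ch, syscall.SIGHUP)
    go func() {
        for range ch {
            log.Println("SIGHUP received, reloading destinations")
            if err := reloadDestinations(); err != nil {
                log.Printf("reload failed, keeping the old config: %v", err)
            }
        }
    }()
}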

JS Tracker improvements & dependencies fix

I've noticed several small flaws in the current setup of the npm module (./web/):

  • Dependencies should be moved to devDependencies, as they aren't required in the production code.
  • It would be nice to have a file watcher with automatic recompilation.
  • Using scripts like buildjs.sh is not a very good practice; popular projects more often use npm scripts (e.g. https://github.com/mobxjs/mobx/blob/mobx6/package.json).
  • I would suggest using the port-choosing utility from react-dev-utils/WebpackDevServerUtils (as port 80 is often busy on devs' machines).
  • It would also be nice to have some health checks after building a new version of the JS Tracker.

NPM package improvements

Problem

Our npm package can be improved. This is a 'blanket' issue, since each improvement is too small to deserve a separate issue:

  • The package does not ship a d.ts file, which makes autocompletion unavailable in most IDEs.
  • JS should capture other cookies (GA, Segment, etc.). This is related to #115: we need to direct only non-captured events to GA.
  • We need to switch from import {eventN} from '...' to import eventN from '...'. However, we need to keep backward compatibility.
  • Every parameter of the init() call should be properly documented.
  • Export of the NPM version doesn't work.


Delete and bulk insert operation

Problem

For syncing from third-party sources, EventNative should be able to run 'bulk replace' operations. A bulk replace is the equivalent of

DELETE from destination where field=value;
INSERT INTO destination VALUES (...);
...
INSERT INTO destination VALUES (...);

executed in a single transaction

Solution

EventNative should have an insert and replace API with the following payload:

{
  "replace": {
    "where": [
      {
        "field": "value"
      }
    ]
  },
  "__destiation_table": "<table>",
  "__primary_key": "<field>",
  "events": [
    {}, {}
  ]
}

The payload should not go through the streaming pipeline; it should be written in a synchronous call.

Scope

The scope of this issue includes:

  • Adding (optional) delete and bulk insert operations to the Destination interface
  • Changing the sync executor to the new API: the where section should contain __chunk=''
  • Saving __chunk with every event

URL randomization

/api.{random 5 lowercase characters}?p_{random 5 lowercase characters}={id}

should be handled the same way as

/api/v1/event?token={id}
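
A sketch of how the randomized form could be matched in Go (assuming "lowercase characters" means a-z; the canonical route stays as-is):

import (
    "net/http"
    "regexp"
)

var (
    randomizedPath  = regexp.MustCompile(`^/api\.[a-z]{5}$`)
    randomizedParam = regexp.MustCompile(`^p_[a-z]{5}$`)
)

// extractToken returns the token for both canonical and randomized URLs.
func extractToken(r *http.Request) string {
    if r.URL.Path == "/api/v1/event" {
        return r.URL.Query().Get("token")
    }
    if randomizedPath.MatchString(r.URL.Path) {
        for name, values := range r.URL.Query() {
            if randomizedParam.MatchString(name) && len(values) > 0 {
                return values[0]
            }
        }
    }
    return ""
}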

Allow defining a chain of events through inline.js

  • If an event=XXX parameter is present in the GET parameters (inline.js?event=XXX), add the following line to the end of the JS output:

eventN.track('XXX');

  • Support multiple parameters (inline.js?e=XXX&e=YYY):

eventN.track('XXX'); eventN.track('YYY')

Authorization tokens reload

Token auth (server.auth) can accept multiple types of values:

  • A string starting with http://: download the token JSON from that HTTP path
  • A string starting with file://: get the token JSON from a file
  • A JSON object (or array): parse the embedded JSON
  • Any other string: treat it as a single token

For options (1) and (2) the server should support hot reload.

Token JSON syntax:

  • JSON array
  • Each element can be either string, or JSON object
  • String "" is equivalent of {"token": "str"}
  • Token object syntax:
{
  "token": "<public token>", //used in javascript tracking
  "origins": ["abc.com", "*.abc.com"] //check Origin header in XHR calls. "val" is treated as singleton array
}
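
A sketch of the value-type dispatch in Go (helper names are illustrative):

import "strings"

// TokenSource tells the loader where the token JSON comes from.
type TokenSource int

const (
    SourceHTTP         TokenSource = iota // option 1, hot-reloadable
    SourceFile                            // option 2, hot-reloadable
    SourceEmbeddedJSON                    // option 3
    SourceSingleToken                     // option 4
)

func classify(authValue string) TokenSource {
    switch {
    case strings.HasPrefix(authValue, "http://"), strings.HasPrefix(authValue, "https://"):
        return SourceHTTP
    case strings.HasPrefix(authValue, "file://"):
        return SourceFile
    case strings.HasPrefix(authValue, "{"), strings.HasPrefix(authValue, "["):
        return SourceEmbeddedJSON
    default:
        return SourceSingleToken
    }
}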

Configure default mapping behavior

At present we have mapping rules, and fields which aren't covered by any rule are written to the table as-is. We should have a configuration flag to override this behavior and drop all fields not covered by mapping rules.
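
A possible shape for that flag, assuming a keep_unmapped key inside the mapping config (the key name is illustrative):

destinations:
  destination_name:
    mappings:
      keep_unmapped: false   # illustrative: drop fields not covered by rules
      fields:
        - ...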

Get rid of ClickHouse Nullable types

Because Nullable data types hurt ClickHouse performance, we should make all fields non-nullable by default unless they are listed in a nullable-fields config (see ClickHouse Data types). Instead of NULL we should set default values (preferably on the ClickHouse side).
All fields will be non-nullable by default, and we should change the ClickHouse engine config parameter from non-null fields to nullable fields.
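
For illustration, a column definition before and after in ClickHouse:

-- before: Nullable hurts performance
CREATE TABLE events (email Nullable(String)) ENGINE = MergeTree ORDER BY tuple();

-- after: non-nullable, with a server-side default instead of NULL
CREATE TABLE events (email String DEFAULT '') ENGINE = MergeTree ORDER BY tuple();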

Privacy: Run self-hosted tracker without cookie

First: This looks like a very promising project! Thank you very much for your work!
Once the Clickhouse integration is finished, this could replace our own hacky solution for tracking events.

But there is one caveat: GDPR and the cookie regulations we have here in Europe. Using cookies for tracking purposes requires us to inform the user and collect an opt-in (which they most likely won't give anyway...).

Could you imagine adding a cookie-less mode for the tracking script? Some projects, like Fathom, Ackee or GoatCounter, implement ways to track without a cookie (you can find a very interesting insight here: https://github.com/zgoat/goatcounter/blob/master/docs/sessions.markdown).

pSQL support

We want to support pSQL despite the fact that it's not usually used as a DWH or data lake. However, certain low-volume installations can benefit from the simplicity of pSQL.

Unlike BQ and Redshift, we need to write to pSQL directly, bypassing log files (log files should still be written unless the user explicitly opts out). Once an event is recorded, it should be stored in an in-memory queue. A separate worker should process the queue and send events to pSQL (see the sketch after the discussion points).

Notes:

  • We need to think about parallelism. What happens if two instances decide to update the schema? We need a database or table lock for each such operation.
  • What happens if some events remain in the queue during shutdown? We should store a snapshot in a file.
  • If SQL is down, we still need to keep the system up. If an event wasn't processed, we need to put it back into the queue.

Discussion points:

  • What kind of queue should we use? Any good libs in Go?
  • Are third-party queues (e.g. RabbitMQ) an option?
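
A minimal sketch of the queue and worker in Go, with a hypothetical insertEvent helper (snapshotting and schema locking omitted):

import (
    "database/sql"
    "log"
)

type Event map[string]interface{}

// startWorker drains the in-memory queue and writes events to pSQL.
// On failure the event is put back, as described in the notes above.
func startWorker(queue chan Event, db *sql.DB, insertEvent func(*sql.DB, Event) error) {
    go func() {
        for ev := range queue {
            if err := insertEvent(db, ev); err != nil {
                log.Printf("insert failed, re-queueing: %v", err)
                queue <- ev // a real implementation needs backoff and a bounded buffer
            }
        }
    }()
}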

Go app should serve static files relative to binary, not to PWD

Description

The ./eventnative binary serves static files relative to the PWD, not to the binary. This makes app availability depend on where the binary is started from.

Steps to reproduce

Run the binary from a directory other than the one it lives in.

Actual behavior

The binary starts but can't serve its resources and static files:

2020/11/22 17:07:24 [WARN]: Custom eventnative.yaml wasn't provided
2020-11-22 17:07:24 [INFO]: *** Creating new AppConfig ***
2020-11-22 17:07:24 [INFO]: Server Name: unnamed-server
2020-11-22 17:07:24 [WARN]: Server public url: will be taken from Host header
2020-11-22 17:07:24 [ERROR]: open /home/eventnative/app/res/: no such file or directory
....
2020-11-22 17:07:24 [ERROR]: Error reading ./web/welcome.html file: open ./web/welcome.html: no such file or directory
2020-11-22 17:07:24 [ERROR]: Error reading static file dir ./web/ open ./web/: no such file or directory
2020-11-22 17:07:24 [INFO]: Started server: 0.0.0.0:8001

Expected behavior

The code should resolve static file names relative to the binary location, not to the PWD:
cd ~ ; ~/go/src/github.com/jitsucom/eventnative/build/dist/eventnative should be equivalent to cd ~/go/src/github.com/jitsucom/eventnative/build/dist/; ./eventnative
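
A sketch of binary-relative resolution in Go:

import (
    "os"
    "path/filepath"
)

// relativeToBinary resolves rel against the directory containing the
// running executable instead of the current working directory.
func relativeToBinary(rel string) (string, error) {
    exe, err := os.Executable()
    if err != nil {
        return "", err
    }
    return filepath.Join(filepath.Dir(exe), rel), nil
}

With this, relativeToBinary("web/welcome.html") finds the file no matter where the binary was started from.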


EventNative caching last events

EventNative should keep the last N (100 by default) accepted events per API key in memory.

Those events should be available via a secured (by admin_token via the X-Admin-Token header) GET /cache/events endpoint with the following response structure:

{"events": [{payload_event1}, {payload_eventN}]}

Please consider a limit query parameter for limiting the number of response events.
Also, consider an apikeys filter query parameter which can contain a comma-separated list (or a single value) of API keys.
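
An example request under that design (parameter names as proposed above):

GET /cache/events?limit=50&apikeys=key1,key2
X-Admin-Token: <admin_token>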

Segment UserID loss

The userId field exists in Segment /p (page) requests, but EN doesn't capture it. This might happen because analytics.js enriches requests with additional parameters (like userId) after the middleware runs.

Steps to reproduce: go to the track demo page, send a page event, and look through the Segment and EN requests in dev tools.
Expected result: the EN request contains userId in src_payload.obj.userId
Actual result: the EN request doesn't contain userId

Destination: Postgres | SSL

Hi! 👋

I received this error when attempting to spin up EventNative using a Postgres DB on my localhost.
Does postgres require SSL for the application to work? I'd like to avoid configuring SSL on my local instance.

api_1 | 2020/08/17 03:59:02 Error initializing postgres_ksense destination of type postgres: pq: SSL is not enabled on the server
api_1 | 2020/08/17 03:59:02 There is no configured event batch destinations

Thank you for the information.
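
For reference, the pq error usually disappears when the connection disables SSL. A hedged config sketch, assuming the datasource section forwards driver parameters (the exact key names may differ in the actual EventNative config):

destinations:
  postgres_ksense:
    type: postgres
    datasource:
      host: localhost
      db: mydb
      username: user
      password: pass
      parameters:
        sslmode: disable   # assumption: parameters are passed through to the pq driver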

S3 as destination: flat file support

Is there a way to get EventNative to buffer/ship tracked line items to flat files somewhere like S3? I would imagine you already have to do this anyway for loading into Redshift, so the goal would be to keep the original buffered staging files, but as the final destination.

EventNative service discovery

If the coordination layer (etcd) is set up, every EventNative instance should report its existence to that layer (as a heartbeat).

Each EventNative instance should have a secured (by admin key) endpoint that reports all instances (with the last heartbeat date). Let's report all instances active within the last N minutes, where N is an optional parameter (2 by default).
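
A sketch of the heartbeat with the etcd v3 client (the key layout is illustrative):

import (
    "context"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// heartbeat registers this instance under a TTL lease; etcd drops the key
// automatically if the instance stops renewing it.
func heartbeat(ctx context.Context, cli *clientv3.Client, instanceID string) error {
    lease, err := cli.Grant(ctx, 120) // TTL in seconds
    if err != nil {
        return err
    }
    key := "/eventnative/instances/" + instanceID
    value := time.Now().UTC().Format(time.RFC3339) // last heartbeat date
    if _, err := cli.Put(ctx, key, value, clientv3.WithLease(lease.ID)); err != nil {
        return err
    }
    _, err = cli.KeepAlive(ctx, lease.ID) // renews the lease in the background
    return err
}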

NPM package rename

Since the company was renamed from ksense to jitsu, we need to rename our npm package @ksense/eventnative → @jitsu/eventnative. Since direct renaming is not possible, here's what we need to do:

  • Make a new package @jitsu/eventnative
  • Deprecate @ksense/eventnative, remove all its content and add a dependency on @jitsu/eventnative. Also proxy the eventN exports
  • Update all documentation
  • Test that the system remains backward compatible

tracking_hostname filter for destinations

Let's say you want to track events on several different websites, and for business or legal reasons, event logs need to be stored separately.

You most likely want to use one global eventnative infra with several instances and a load balancer for this.

It would then be great if you could set a tracking_hostname filter for each destination, so that only the right events are streamed to it (a possible shape is sketched below).
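
A hypothetical configuration shape for such a filter (the tracking_hostname key doesn't exist yet; it's what this issue proposes):

destinations:
  site_a_warehouse:
    type: postgres
    tracking_hostname: site-a.example.com   # only events tracked on this host
  site_b_warehouse:
    type: postgres
    tracking_hostname: site-b.example.com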

Two destination modes: streaming and batch

Two modes should be available for a destination: 'streaming' and 'batch'. For low-volume types of events (example: conversions) it doesn't make sense to send them in batches; they could be written directly.

Streaming

Once an event is received (and enriched), it should be put into an internal queue. A separate thread should process events and send them to the destination (should the schema be checked on each request?). We will still write events to logs, but only for informational purposes.

Batch

Just keep current logic

Configuration

For each destination, the following property should be supported: mode: stream | batch (batch by default).
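
A minimal config example (the destination name is illustrative):

destinations:
  conversions_db:
    type: postgres
    mode: stream   # omit for the default batch mode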

Destinations support

  • pSQL - both
  • RedShift - both
  • BQ - both
  • ClickHouse - both
  • S3 - batch only

Support intercepting events even if JS hasn't been added the correct way

We require our users to add the eventN tracker after the GA / Segment code. However, we want our code to work even if the tracker was added before it. In that case, we need to check for the presence of the GA/analytics object. If it's not present at all, we need to set a timeout and wait for a while. Let's make this configurable: up to N cycles, with X ms of wait time for each of them.

Snowflake support

We need to support Snowflake. So far it's unclear which API should be used to load data.

Lock while schema patching using etcd

When the database schema is patched, we should be able to use distributed locks if multiple instances of the server are running. The lock server should be configurable. The default implementation is a dummy one which doesn't lock anything.
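
A sketch of an etcd-backed implementation using the official concurrency package (the lock key is illustrative):

import (
    "context"

    clientv3 "go.etcd.io/etcd/client/v3"
    "go.etcd.io/etcd/client/v3/concurrency"
)

// withSchemaLock runs patch while holding a distributed lock on the table.
func withSchemaLock(ctx context.Context, cli *clientv3.Client, table string, patch func() error) error {
    session, err := concurrency.NewSession(cli)
    if err != nil {
        return err
    }
    defer session.Close()

    mu := concurrency.NewMutex(session, "/eventnative/locks/schema/"+table)
    if err := mu.Lock(ctx); err != nil {
        return err
    }
    defer mu.Unlock(ctx)

    return patch()
}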
