mlcraft-io / mlcraft Goto Github PK

View Code? Open in Web Editor NEW

469.0 6.0 19.0 3.89 MB

Synmetrix – open source semantic layer / Boost your LLM precision

Home Page: https://synmetrix.org

License: Other

Shell 3.83% PLpgSQL 14.15% JavaScript 68.48% Dockerfile 3.48% Batchfile 0.06% TypeScript 10.00%

business-intelligence clickhouse redshift big-data cube cubejs databricks prestodb druid firebolt

mlcraft's Introduction

Website • Docs • Cube.js Models docs • Docker Hub • Slack community

Readme in English • Readme in Chinese • Readme in Russian

Synmetrix

Synmetrix (prev. MLCraft) is an open source data engineering platform and semantic layer for centralized metrics management. It provides a complete framework for modeling, integrating, transforming, aggregating, and distributing metrics data at scale.

Key Features

Data modeling and transformations: Flexibly define metrics and dimensions using SQL and Cube data models. Apply transformations and aggregations.
Semantic layer: Consolidate metrics from across sources into a unified, governed data model. Eliminate metric definition differences.
Scheduled reports and alerts: Monitor metrics and get notified of changes via configurable reports and alerts.
Versioning: Track schema changes over time for transparency and auditability.
Role-based access control: Manage permissions for data models and metrics access.
Data exploration: Analyze metrics through the UI, or integrate with any BI tool via the SQL API.
Caching: Optimize performance using pre-aggregations and caching from Cube.
Teams: Collaborate on metrics modeling across your organization.

Overview

Synmetrix leverages Cube (Cube.js) to implement flexible data models that can consolidate metrics from across warehouses, databases, APIs and more. This unified semantic layer eliminates differences in definitions and calculations, providing a single source of truth.

The metrics data model can then be distributed downstream to any consumer via a SQL API, allowing integration into BI tools, reporting, dashboards, data science, and more.

By combining best practices from data engineering, like caching, orchestration, and transformation, with self-service analytics capabilities, Synmetrix speeds up data-driven workflows from metrics definition to consumption.

Use cases

Data Democratization: Synmetrix makes data accessible to non-experts, enabling everyone in an organization to make data-driven decisions easily.
Business Intelligence (BI) and Reporting: Integrate Synmetrix with any BI tool for advanced reporting and analytics, enhancing data visualization and insights.

Integrating Synmetrix with Apache Superset (Video)

Embedded Analytics: Use the Synmetrix API to embed analytics directly into applications, providing users with real-time data insights within their workflows.

Semantic Layer for LLM: Enhance LLM's accuracy in data handling and queries with Synmetrix's semantic layer, improving data interaction and precision.

Synmetrix with Large Language Model (LLM) example (Video)

Getting Started

Prerequisite Software

Ensure the following software is installed before proceeding:

Step 1: Download the docker-compose file

The repository mlcraft-io/mlcraft/install-manifests houses all the necessary installation manifests for deploying Synmetrix anywhere. You can download the docker compose file from this repository:

# Execute this in a new directory
wget https://raw.githubusercontent.com/mlcraft-io/mlcraft/main/install-manifests/docker-compose/docker-compose.yml
# Alternatively, you can use curl
curl https://raw.githubusercontent.com/mlcraft-io/mlcraft/main/install-manifests/docker-compose/docker-compose.yml -o docker-compose.yml

NOTE: Ensure to review the environment variables in the docker-compose.yml file. Modify them as necessary.

Step 2: Launch Synmetrix

Execute the following command to start Synmetrix along with a Postgres database for data storage.

$ docker-compose pull stack && docker-compose up -d

Verify if the containers are operational:

$ docker ps

CONTAINER ID IMAGE                 ... CREATED STATUS PORTS          ...
c8f342d086f3 synmetrix/stack       ... 1m ago  Up 1m  80->8888/tcp ...
30ea14ddaa5e postgres:12           ... 1m ago  Up 1m  5432/tcp

The installation of all dependencies will take approximately 5-7 minutes. Wait until you see the Synmetrix Stack is ready message. You can view the logs using docker-compose logs -f to confirm if the process has completed.

Step 3: Explore Synmetrix

You can access Synmetrix at http://localhost/
The GraphQL endpoint is located at http://localhost/v1/graphql
The Admin Console (Hasura Console) can be found at http://localhost/console
The Cube Swagger API can be found at http://localhost:4000/docs

Important Notes

Admin Console Access: Ensure to check HASURA_GRAPHQL_ADMIN_SECRET in the docker-compose file. This is mandatory for accessing the Admin Console. The default value is adminsecret. Remember to modify this in a production environment.
Environment Variables: Set up all necessary environment variables. Synmetrix will function with the default values, but certain features might not perform as anticipated.
Preloaded Seed Data: The project is equipped with preloaded seed data. Use the credentials below to sign in:
- Email: [email protected]
- Password: demodemo
This account is pre-configured with two demo datasources and their respective SQL API access. For SQL operations, you can use the following credentials with any PostgreSQL client tool such as DBeaver or TablePlus:

Host Port Database User Password

localhost 15432 db demo_pg_user demo_pg_pass

localhost 15432 db demo_clickhouse_user demo_clickhouse_pass

Host	Port	Database	User	Password
localhost	15432	db	demo_pg_user	demo_pg_pass
localhost	15432	db	demo_clickhouse_user	demo_clickhouse_pass

Documentation

Demo online

Demo: app.synmetrix.org

Login: [email protected]
Password: demodemo

Database demo credentials

Database type	Host	Port	Database	User	Password	SSL
ClickHouse	gh-api.clickhouse.tech	443	default	play	no password	true
PostgreSQL	demo-db-examples.cube.dev	5432	ecom	cube	12345	false

Data Modeling

Synmetrix leverages Cube for flexible data modeling and transformations.

Cube implements a multi-stage SQL data modeling architecture:

Raw data sits in a source database such as Postgres, MySQL, etc.
The raw data is modeled into reusable data marts using Cube Data Models files. These models files allow defining metrics, dimensions, granularities and relationships.
The models act as an abstraction layer between the raw data and application code.
Cube then generates optimized analytical SQL queries against the raw data based on the model.
The Cube Store distributed cache optimizes query performance by caching query results.

This modeling architecture makes it simple to create fast and complex analytical queries with Cube that are optimized to run against large datasets.

The unified data model can consolidate metrics from across different databases and systems, providing a consistent semantic layer for end users.

Cube Store

For production workloads, Synmetrix uses Cube Store as the caching and query execution layer.

Cube Store is a purpose-built database for operational analytics, optimized for fast aggregations and time series data. It provides:

Distributed querying for scalability
Advanced caching for fast queries
columnar storage for analytics performance
Integration with Cube for modeling

By leveraging Cube Store and Cube together, Synmetrix benefits from excellent analytics performance and flexibility in modeling metrics.

Benchmarks

Synmetrix with Cube: Caching and Highload

Ecosystem

Repository	Description
mlcraft-io/mlcraft	Synmetrix Monorepo
mlcraft-io/client-v2	Synmetrix Client
mlcraft-io/docs	Synmetrix Docs
mlcraft-io/examples	Synmetrix Examples

Community support

For general help using Synmetrix, please refer to the official Synmetrix documentation. For additional help, you can use one of these channels to ask a question:

Slack / For live discussion with the Community and Synmetrix team
GitHub / Bug reports, Contributions
Twitter / Updates and news
Youtube / Video tutorials and demos

Roadmap

Check out our roadmap to get informed on what we are currently working on, and what we have in mind for the next weeks, months and years.

License

The core Synmetrix is available under the Apache License 2.0 (Apache-2.0).

All other contents are available under the MIT License.

Hardware requirements

Component	Requirement
Processor (CPU)	3.2 GHz or higher, modern processor with multi-threading and virtualization support.
RAM	8 GB or more to handle computational tasks and data processing.
Disk Space	At least 30 GB of free space for software installation and storing working data.
Network	Internet connectivity is required for cloud services and software updates.

Authors

@ifokeev, @Libertonius, @ilyozzz

mlcraft's People

Contributors

Stargazers

Watchers

Forkers

bloodyburger shobhitsinghal624 em3ndez mikeyhodl keevol admariner fshabashev hirajanwin datalearns syllogy fknaopen alexeynovichenko won21kr osd1syc jackchongs alinakhay ravindra-kumar25 mvandermeulen pulumi-ce

mlcraft's Issues

Logs table slow loading

Kuberenetes implementation

Is there any plan to provide a kubernetes deployment for MLCraft? It doesn't seem too big a lift vs. the docker solution and might actually make this project feasible to run in place of Looker. In the current form I could see my way to playing around with this locaclly a bit, but not ever suggesting that my team try to use it.

Scheduled reports

Acceptance criteria:

User should be able to set up a scheduled report with a schedule in cron format
User selects metric name, granularity (hour, day, month, year...) and a date range sliding window (1...n days, 1...n months, 1...n years)
User can select how to deliver the message: webhook, slack (webhook with slack url), email

How to:

On submit create a specific cron job in Hasura
Create table scheduled_reports with columns: id, created_at, updated_at, user_id, team_id, name, schedule, exploration_id
While submitting you have to create an exploration too

Docker Compose up

Hello 👋

I'm having issues with the quickstart. Seems like hasura exits with code 0.

You can recreate it by running docker-compose up.

I'm using a Mac M1 btw, don't know if that would be the issue. I'll try again with a ~~X86~~x64 later today. 👍

stepci github action

wrong format for json table cells

Bug: generate schema overrides all data models

schema-overrides.mov

StepCI integration

Categories in tooltips

Categories should be visible in tooltip by value: to make this work we have to pivot categories.

Bug: SelectBranch component UX

Screen.Recording.2023-05-16.at.20.14.28.mov

Data models versioning

Acceptance criteria:

User should be able to create a branch from main data models version
User should be able to return to the previous saved versions or switch between branches
User should be able requesting merge to the main branch

How to:

Add column version to the dataschemas table
Allow user to create specific branches

clickhouse on demo, socket is hang up

Dear mlcraft.io team, thanks you for your effort guys! Demo looks great
I tried to add clickhouse data source on https://app.mlcraft.org/

host: github.demo.altinity.cloud
port 8443
use SSL 
user: demo
password: demo

when press "test datasource" failed with error
request

{"query":"mutation TestDatasourceMutation($input: TestConnectionInput!) {\n  testConnection(input: $input) {\n    message\n    __typename\n  }\n}\n","operationName":"TestDatasourceMutation","variables":{"input":{"dataSourceId":2}}}

response

{"errors":[{"message":"Unexpected error value: { error: \"socket hang up\" }","locations":[{"line":2,"column":3}],"path":["testConnection"]}],"data":null}

but datasource is live

curl -vvv "https://demo:[email protected]:8443/?query=SHOW+TABLES"

looks ok

could you help me resolve issue?

sqlRunner error is hidden

no way to add empty string filter

docker image for mlcraft stack

Data models import

Acceptance criteria:

User should be able to import all data models from exported zip-archive

How to:

Upload zip-archive to s3 bucket
Validate if it's a correct archive
Create data models from files

disable owner to leave the team

change nginx to traefik on staging

arguments in sql view

we need to replace it with arguments

Docs generation from data models

Acceptance criteria:

Users should have generated documentation based on metrics, dimensions, segments from cubes
We generate docs per cube, cube name in the sidebar
Every metric, dimension and segment should have info about sql, data type and meta: description, comments, author, collaborators

How to:

On every new version saving we have to generate a documentation for cubes
It should be an additional section in the sidebar called docs
It's also necessary to have a page which could be shared by link and an option to download markdown file

Acceptance criteria:

User should be able to set up an alert based on confidence values
User selects metric name, granularity (hour, day, month, year...) and a date range sliding window (1...n days, 1...n months, 1...n years)
User setting up a confidence interval: lower bound and/or upper bound
User can select how to deliver the message: webhook, slack (webhook with slack url), email

How to:

On submit create a specific cron job in Hasura
Create table alerts with columns: id, created_at, updated_at, user_id, team_id, name, schedule, exploration_id, lower_bound, upper_bound
lower_bound, upper_bound could be NULL, if selected only one value – we do alerting based on the one value
While submitting you have to create an exploration too

Preaggregations logs

Acceptance criteria

We need to implement views for endpoints: https://github.com/cube-js/cube/blob/377e123c1f4e6db89b369b4baff9664918dbf5b2/packages/cubejs-api-gateway/src/gateway.ts#L403-L455

How to

Create UI

Acceptance criteria

User could select time dimension with grouping: https://github.com/cube-js/cube/blob/master/packages/cubejs-client-core/src/utils.js#L5-L15

How to

Implement time dimension selector in Explore

Acceptance criteria:

User should be able to export all data models in one zip-archive

How to:

Save all data models in one zip-archive and upload to public s3
Redirect user to this link to download the archive
Archive should have a description yaml file where all data models represented. We need it to validate the archive on the import

no public docker images

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.