minitrino's People

Contributors

dyndl, jefflester, joechu1, maura16

minitrino's Issues

Add --worker-count option to Minipresto provision command

Minipresto should be able to provision a multi-node cluster if specified. Example:

minipresto provision -c snowflake-jdbc elasticsearch --worker-count 2

The resulting workers should be built using the Docker SDK, as forcing this through Compose would get messy quickly (Compose ought to be used only where it is absolutely needed). The high-level flow should look like:

  • Provision everything before the workers
  • Spin up n worker containers from the standard Starburst Presto image (with Presto server only and no CLI)
    • Copy everything from coordinator's etc/presto/ directory and make necessary modifications to those files (mainly node.properties and config.properties)
      • This should ideally be done by copying coordinator volumes to the worker containers, but this may not be possible given the lack of labels on non-persistent volumes
    • Apply label sets to worker containers
    • Run a health check to verify worker connectivity to the coordinator

This will be a semi-substantial lift, but will improve the flexibility of the tool in a way that doesn't bind us more to Compose shell commands.
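
A minimal sketch of the worker spin-up step using the Docker SDK (docker-py); the image name is an assumption, and coordinator networking and the health check are omitted:

    import docker

    def provision_workers(count, starburst_version):
        # Spin up n worker containers from the standard Starburst image and
        # apply the same label set used elsewhere in the library. The image
        # name is an assumption; adjust to whatever the Dockerfile produces.
        client = docker.from_env()
        workers = []
        for i in range(1, count + 1):
            worker = client.containers.run(
                f"starburstdata/presto:{starburst_version}",
                name=f"presto-worker-{i}",
                detach=True,
                labels={
                    "com.starburst.tests": "minipresto",
                    "com.starburst.tests.module": "presto",
                },
            )
            workers.append(worker)
        return workers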

Update

After some messing around, it's looking like volume copying might be impossible with the way things are currently set up. The current idea is to:

  • Copy all volumes from coordinator to worker
  • Modify config.properties and node.properties in the worker node(s)

This would be nice and simple, but there is no accessible link between non-persistent volumes mounted to a container and the container itself. For example, if we mount an elasticsearch.properties file to the coordinator, we cannot detect that relationship through the SDK and copy the file to the new worker.

I am thinking of how best to tackle this. The very ugly and inefficient brute force technique would be to copy the entire etc and plugin directories from the coordinator to the host filesystem, then copy those files to the worker. This would be less than ideal...
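
For reference, a sketch of the brute-force copy done through the Docker SDK (streamed in memory rather than staged on the host filesystem), assuming the coordinator's config lives under /usr/lib/presto/etc:

    def copy_etc_to_worker(coordinator, worker):
        # Stream the coordinator's etc directory as a tar archive and unpack
        # it into the worker container. Paths are assumptions based on the
        # standard Starburst image layout.
        archive_stream, _ = coordinator.get_archive("/usr/lib/presto/etc")
        archive_bytes = b"".join(archive_stream)
        worker.put_archive("/usr/lib/presto", archive_bytes)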

Automate config tests

Tests for the config command should be automated and incorporated into the runner.py test runner file.

The challenge with this is that the command opens a text editor. I am trying to figure out a solid way to capture the correct exit code and also hop out of the text editor automatically. So far, my efforts aren't going anywhere.
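
One possible approach, assuming the CLI is invoked as minitrino and resolves its text editor from a TEXT_EDITOR value that can be supplied via the environment (a hypothetical test, not part of the current runner):

    import os
    import subprocess

    def test_config_command():
        # Point the editor at the no-op `true` command so the config command
        # returns immediately instead of blocking on an interactive editor.
        env = dict(os.environ, TEXT_EDITOR="true")
        result = subprocess.run(
            ["minitrino", "config"],
            env=env,
            capture_output=True,
            text=True,
            timeout=30,
        )
        assert result.returncode == 0, result.stderr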

For the time being, human QA will be run on the config command prior to any releases. This should be considered a high-priority testing item to ensure all commands are getting equal testing coverage in the testing library.

Automate release cycle via GitHub Actions

  • Create pull request (master <- #.#.#-update) with updated version numbers
    • Version numbers in: version file, readme, and setup.py (see the sketch after this list)
    • Verify version numbers all equal the version number of the release branch name
  • Delete existing 0.0.0 release/tag
  • Create 0.0.0 release/tag based on PR branch (allows for pulling Minitrino library in automated tests for the latest build)
  • Wait for CLI tests to pass (run through tests in the test directory)
  • Create release/tag - mark as pre-release
  • Test PyPI upload, then test package download, library pull, and a basic set of CLI commands
  • Upload to production PyPI, then test package download, library pull, and a basic set of CLI commands
  • Merge PR
  • Unmark release as pre-release, and mark as “most recent release”
  • Delete update branch
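
A rough sketch of the version-bump and verification steps (the file paths are hypothetical placeholders for wherever the version string actually lives):

    from pathlib import Path

    VERSION_FILES = ["cli/minitrino/version", "readme.md", "cli/setup.py"]  # hypothetical paths

    def bump_version(old_version, new_version):
        # Replace the old version string with the new one in every file that embeds it.
        for path in VERSION_FILES:
            p = Path(path)
            p.write_text(p.read_text().replace(old_version, new_version))

    def verify_version(expected):
        # Fail if any tracked file does not carry the release branch's version number.
        for path in VERSION_FILES:
            assert expected in Path(path).read_text(), f"{path} missing version {expected}"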

Improve license propagation

Currently, you are required to uncomment the volume mount in ~/.minitrino/lib/docker-compose.yml and set ${STARBURST_LIC_PATH}. We should not expect the user to need to uncomment a volume mount, so let's figure out a cleaner approach to get this file propagated. In my mind, if you have the variable set, that is all that should be required from a user-facing perspective.
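
One cleaner approach, sketched below: have provisioning fall back to a placeholder license file whenever the variable is unset, so the Compose volume mount always resolves and no manual uncommenting is needed (the placeholder path is a hypothetical example):

    import os

    def resolve_license_path(env):
        # If STARBURST_LIC_PATH is unset or invalid, point it at an empty
        # placeholder file so the docker-compose.yml volume mount always resolves.
        lic_path = env.get("STARBURST_LIC_PATH", "")
        if not lic_path or not os.path.isfile(lic_path):
            placeholder = os.path.expanduser("~/.minitrino/placeholder.license")
            open(placeholder, "a").close()  # create the file if it does not exist
            env["STARBURST_LIC_PATH"] = placeholder
        return env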

Miscellaneous Functionality Cleanup

A few things could be touched up in Minitrino:

  1. Instead of using bootstrap scripts for config.properties management, we could use a more user-friendly approach involving environment variables defined in the Compose YAML. Something like:
trino:
  environment:
    CONFIG_PROPERTIES: |-
      insights.jdbc.url=jdbc:postgresql://postgres-query-logger:5432/querylogger
      insights.jdbc.user=admin
      insights.jdbc.password=trinoRocks15
      insights.persistence-enabled=true
      insights.metrics-persistence-enabled=true

We'd merge these together based on the presence of configured properties in whatever modules the user wants to provision, then either run an automated script inside the Trino container or mount the collection of properties as a cohesive config.properties file to the container.
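
A sketch of what that merge could look like, assuming each module's Compose YAML has already been parsed into a dict (e.g., with yaml.safe_load) and the environment key is named CONFIG_PROPERTIES as in the example above; later modules win on duplicate keys:

    def merge_config_properties(module_yamls):
        # Collect CONFIG_PROPERTIES blocks from each provisioned module and
        # merge them into a single config.properties payload.
        merged = {}
        for yaml_doc in module_yamls:
            block = (
                yaml_doc.get("services", {})
                .get("trino", {})
                .get("environment", {})
                .get("CONFIG_PROPERTIES", "")
            )
            for line in block.splitlines():
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                key, _, value = line.partition("=")
                merged[key.strip()] = value.strip()
        return "\n".join(f"{k}={v}" for k, v in merged.items())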

We'd want to keep the existing bootstrap functionality around while simultaneously decoupling it from managing this config file.

  2. Remove CLI support for the CONFIG and JVM_CONFIG environment variables (see: readme). This is messy and buggy as it stands, and by implementing the first item above, there would be little to no need to support these variables.

CommandExecutor enhancements

The CommandExecutor class could use some updates.

  • The class should have a suppress_output property. If given to the constructor as True, the executed command will direct all output to DEVNULL. Handled in commit e167625d2fc246d3423749a4fafb4f4ae3f0b7d3.
  • The executor should not default to shell=True for everything (this is frowned upon and viewed as a potential security hazard). Look into using the shlex module to parse complex command strings and execute them with Popen (see the sketch below).
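
A minimal sketch of the shell-free execution path (not the actual CommandExecutor implementation):

    import shlex
    import subprocess

    def execute(command, suppress_output=False):
        # Tokenize the command string with shlex and run it without shell=True.
        # Output goes to DEVNULL when suppress_output is True.
        stdout = subprocess.DEVNULL if suppress_output else None
        proc = subprocess.Popen(shlex.split(command), stdout=stdout, stderr=stdout)
        proc.wait()
        return proc.returncode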

--env-override should apply to all config

The --env-override flag currently applies only to the .env file in the root of the library. A user should be able to override any variable in this .env file, as well as any variable set in their minipresto.cfg file. Override checks need to be updated in core.py (a sketch follows the list below):

  • Make a dictionary of all variables from the environment file and Minipresto config file
  • Run each of those variables through the override check if an override is present
  • Return the updated dictionary
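
A sketch of that override flow (variable names are illustrative, not the actual core.py internals):

    def merged_config(env_file_vars, cfg_vars, overrides):
        # Combine the library .env variables with the minipresto.cfg variables,
        # then apply any --env-override values on top of the combined dictionary.
        combined = {**env_file_vars, **cfg_vars}
        for key, value in overrides.items():
            combined[key] = value
        return combined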

At a higher level, this raises the question of how best to manage config. Does it make sense to have it all live in one file? Probably. Perhaps everything should live in the minipresto.cfg file, and the library's root .env file should be deleted.

Unable to access MinIO UI due to port redirection when using Docker

Accessing localhost:9000 redirects to something like localhost:34030 with a "request failed" message in the browser, given the following config:

 minio:
    image: minio/minio
    container_name: minio
    ports:
      - '9000:9000'
    volumes:
      - './resources/minio/data/:/data'
    environment:
      MINIO_ACCESS_KEY: ${S3_ACCESS_KEY}
      MINIO_SECRET_KEY: ${S3_SECRET_KEY}
    command: server /data

I was able to work around this issue by adding 9001:9001 to the list of exposed ports and specifying --address ":9000" and --console-address ":9001" in the command parameter:

 minio:
    image: minio/minio
    container_name: minio
    ports:
      - '9000:9000'
      - '9001:9001'
    volumes:
      - './resources/minio/data/:/data'
    environment:
      MINIO_ACCESS_KEY: ${S3_ACCESS_KEY}
      MINIO_SECRET_KEY: ${S3_SECRET_KEY}
    command: server --address ":9000" --console-address ":9001" /data

Ongoing module list

This will serve as an ongoing list of modules we would like to add to the Minipresto library. The modules will be a mixture of Starburst Enterprise Presto features and open source PrestoSQL features.

Catalogs - Priority

  • Kafka
  • MongoDB
  • Oracle - #36
  • SQL Server - #28
  • Starburst Delta Lake - #32
  • Hive-Glue

Security - Priority

  • User impersonation
  • System-level Ranger RBAC w/ Usersync (create policies for TPCH)

install.sh pip: command not found

There's a somewhat confusing message displayed when running the install.sh script:

./install.sh: line 5: pip: command not found

This doesn't look like an error, but more of a side effect of testing for the required Python version. However, users may be confused about whether or not the installation succeeded if they see this message.

You should be able to modify the code to suppress this message without losing any functionality:

    if pip --version 2>&1 | grep -q "python3.[6-9]"; then
        PIP=pip
    elif pip3 --version 2>&1 | grep -q "python3.[6-9]"; then
        PIP=pip3
    fi

Otherwise if you'd rather have a pull request, let me know!

Minitrino does not work with SEP 363-e+

Starburst switched the underlying OS in the latest SEP images (363-e and later). The Dockerfile will need to be updated to accommodate all Trino-based releases (354-e and later), with added logic to handle the OS changes in 363-e and later images.

Event listener factory 'event-logger' is not registered

Around version 407 (perhaps earlier), users will receive the below error thrown by the event-listener when attempting to spin up the event-logger module:

2023-04-05T17:45:46.345Z ERROR main io.trino.server.Server Event listener factory 'event-logger' is not registered. Available factories: [immuta-event-listener, starburst-atlas, starburst-open-lineage, http, mysql, audit-log]
java.lang.IllegalArgumentException: Event listener factory 'event-logger' is not registered. Available factories: [immuta-event-listener, starburst-atlas, starburst-open-lineage, http, mysql, audit-log]
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:435)
    at io.trino.eventlistener.EventListenerManager.createEventListener(EventListenerManager.java:114)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
    at java.base/java.util.Collections$2.tryAdvance(Collections.java:4853)
    at java.base/java.util.Collections$2.forEachRemaining(Collections.java:4861)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
    at io.trino.eventlistener.EventListenerManager.configuredEventListeners(EventListenerManager.java:101)
    at io.trino.eventlistener.EventListenerManager.loadEventListeners(EventListenerManager.java:86)
    at io.trino.server.Server.doStart(Server.java:162)
    at io.trino.server.Server.lambda$start$0(Server.java:89)
    at io.trino.$gen.Trino_407_e_1____20230405_174525_1.run(Unknown Source)
    at io.trino.server.Server.start(Server.java:89)
    at com.starburstdata.presto.StarburstTrinoServer.main(StarburstTrinoServer.java:46)

A temporary workaround is to comment out the event-listener.properties volume from event-logger.yaml altogether.

Related Slack: https://starburstdata.slack.com/archives/C01394KHSAU/p1671822000765579

ARM64 support

Apple's latest machines ship with M1 CPUs, built on the ARM64 architecture. Certain images (e.g., MySQL) are incompatible with these CPUs and require adjustments to the deployment in order to work on M1-based machines.

A one-off fix for the MySQL module involved changing the Docker Compose YAML to the following:

mysql:
  platform: "linux/amd64"
  image: "mysql:5.7"
  container_name: "mysql"

The following MySQL tags paired with platform: "linux/amd64" seem to work:

mysql:8.0.31
mysql:8
mysql:5.7

This issue can serve as a channel for planning/discovery as we figure out how to best tackle this issue.

Images confirmed to cause problems on M1:

  • SQL Server mcr.microsoft.com/mssql/server:${SQLSERVER_VER}
  • Db2 ibmcom/db2:${DB2_VER}
  • Our own HMS image, starburstdata/hive-metastore-unsecured:0cc95b, needs to run under emulation on M1

DB2 module error

When provisioning a Minitrino cluster with the DB2 module, the cluster comes up successfully. However, if you try to query the db2 catalog, you receive the following error:

trino> show schemas from db2;

Query 20220719_163217_00001_k9r9w, FAILED, 1 node
Splits: 11 total, 0 done (0.00%)
0.51 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20220719_163217_00001_k9r9w failed: DB2 SQL Error: SQLCODE=-1001, SQLSTATE=2E000, SQLERRMC=MINITRINO, DRIVER=4.25.13

I stopped Minitrino, removed the Docker volumes using minitrino -v remove --volumes, and deleted everything in ~/.minitrino/lib/modules/catalog/db2/resources/db2/db2-storage/

After spinning up the DB2 module again and attempting to query it, I received a new exception as shown below:

trino> show schemas from db2;

Query 20220719_170146_00017_kfa26, FAILED, 1 node
Splits: 11 total, 0 done (0.00%)
0.67 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20220719_170146_00017_kfa26 failed: [jcc][t4][2043][11550][4.25.13] Exception java.net.ConnectException: Error opening socket to server db2/172.19.0.3 on port 50,000 with message: Connection refused (Connection refused). ERRORCODE=-4499, SQLSTATE=08001

SQLServer error: "The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption."

When using the sqlserver module, users might see an error similar to the below when attempting to query the sqlserver catalog:

The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption. Error: "PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target"

The same issue was reported in this internal Slack thread: https://starburstdata.slack.com/archives/C01FQCMJQM6/p1648827385983409

The root cause is apparently a breaking change in the MS SQL JDBC driver version update between 371 and 374; more details are in the linked thread.

The workaround is to update the connection-url to resemble either of the options below:
connection-url=jdbc:sqlserver://sqlserver:1433;database=master;encrypt=false
or
connection-url=jdbc:sqlserver://sqlserver:1433;database=master;trustServerCertificate=true

Populate module readmes

The readme documents for each module should be accompanied by a high-level description of what the module does and any important instructions for interacting with it. Will close this issue once all modules have this and it becomes a standard process for new module PRs.

Build module testing framework

Library modules need to have testing coverage. I envision this coverage looking like this:

  • Create a ModuleTest class with:
    • module_name: the name of the module to test
    • args: commands to execute against the module. The args should be dicts with the following keys:
      • command: the command to execute
      • container: true/false, whether the command is to be executed inside a container or on the host shell
      • timeout: how long before the test command times out. This accounts for services needing time to fully provision before commands are run against them

The class should accept any module as an input. From there, a single file (probably) can be used as a runner to test all modules.
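
A rough sketch of the class as described above (the executor helper and the example module are hypothetical):

    from dataclasses import dataclass, field

    @dataclass
    class ModuleTest:
        # module_name: the name of the module to test
        # args: dicts with "command", "container", and "timeout" keys as described above
        module_name: str
        args: list = field(default_factory=list)

        def run(self, executor):
            for arg in self.args:
                # `executor` is a hypothetical helper that runs the command either
                # inside the module's container or on the host shell.
                executor.run(
                    arg["command"],
                    container=arg.get("container", False),
                    timeout=arg.get("timeout", 60),
                )

    # Example usage from a single runner file:
    elasticsearch_test = ModuleTest(
        module_name="elasticsearch",
        args=[{"command": "curl -s localhost:9200", "container": True, "timeout": 120}],
    )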

Lastly, module tests should only run for PRs if the library changed. The tests already take long enough, and running tests for n modules will take even longer to complete.

Swap internal and external ports in docker-compose.yml files

Currently, certain modules lead to port conflicts when provisioned together (the delta-lake and hive-minio modules will cause this, for example). This is because the external and internal port mappings are backwards.

Current port mappings look like:

    ports:
      - "9083:9084"

This makes the external port 9083 and the internal port 9084. Plenty of other metastore services exist in the modules, so any module with a metastore provisioned at the same time will result in host-level port conflicts. This can be resolved by simply flipping the mapping:

    ports:
      - "9084:9083"

This way, the service will be exposed on the host on port 9084, and the internal service will still be accessible on port 9083.

Improve test quality and coverage

Tests should be developed and automated for both the CLI and library modules. Tests should also run with each PR to the master branch, and merging should be blocked if the tests fail.

CLI tests should:

  • Verify the functionality, outputs, and error-handling of each command

Module tests should:

  • Verify basic connectivity to the module and the module's connection to Presto
  • Verify basic and correct functionality of each module

STARBURST_LIC_PATH property for Minipresto configuration is not used

It doesn't look like the STARBURST_LIC_PATH property in the Minipresto configuration is used when provisioning a cluster. More importantly, once the STARBURST_LIC_PATH property is set and the volume is uncommented in the docker-compose.yml, Minipresto will return an error when provisioning:

[i]  Creating presto       ...
[i]  Creating presto       ... error
[i]  ERROR: for presto  Cannot create container for service presto: create .: volume name is too short, names should be at least two alphanumeric characters
[i]  Starting ranger-admin ... done
[i]  ERROR: for presto  Cannot create container for service presto: create .: volume name is too short, names should be at least two alphanumeric characters
[i]  Encountered errors while bringing up the project.
[w]  Rolling back provisioned resources due to errors encountered while provisioning the environment.
[e]  Failed to execute shell command:
[e]  COMPOSE_PROJECT_NAME="minipresto" STARBURST_VER="338-e.0" ELASTICSEARCH_VER="7.6.2" MYSQL_VER="5" OPEN_LDAP_VER="1.3.0" POSTGRES_VER="11"
     POSTGRES_HIVE_HMS_VER="11" POSTGRES_EVENT_LOGGER_VER="11" POSTGRES_RANGER_VER="12.2" SEP_RANGER_VER="2.0.24" KIBANA_VER="ver" MEMSQL_VER="ver"
     MONGODB_VER="ver" SQL_SERVER_VER="ver" S3_ENDPOINT="s3.us-west-2.amazonaws.com" S3_ACCESS_KEY="*****"
     S3_SECRET_KEY="*****" AWS_REGION="us-west-2" SNOWFLAKE_DIST_CONNECT_URL="" SNOWFLAKE_DIST_CONNECT_USER=""
     SNOWFLAKE_DIST_CONNECT_PASSWORD="" SNOWFLAKE_DIST_WAREHOUSE="" SNOWFLAKE_DIST_DB="" SNOWFLAKE_DIST_STAGE_SCHEMA="" SNOWFLAKE_JDBC_CONNECT_URL=""
     SNOWFLAKE_JDBC_CONNECT_USER="" SNOWFLAKE_JDBC_CONNECT_PASSWORD="" SNOWFLAKE_JDBC_WAREHOUSE="" SNOWFLAKE_JDBC_DB="" SNOWFLAKE_JDBC_STAGE_SCHEMA="" \
[e]  docker-compose -f /Users/joechu/git/minipresto/lib/docker-compose.yml \
[e]  -f /Users/joechu/git/minipresto/lib/modules/security/system-ranger/system-ranger.yml \
[e]  -f /Users/joechu/git/minipresto/lib/modules/security/ldap/ldap.yml \
[e]  up -d
[e]  Exit code: 1

So for some reason, STARBURST_LIC_PATH is not set as an environment variable. Manually setting the environment variable (export STARBURST_LIC_PATH=) or using the flag (minipresto provision --env STARBURST_LIC_PATH=) works as a workaround.

Minipresto version - commit 419edb4

minipresto.cfg

[CLI]
LIB_PATH=/Users/joechu/git/minipresto/lib
TEXT_EDITOR=/usr/bin/vim

[DOCKER]
DOCKER_HOST=unix:///var/run/docker.sock

[PRESTO]
STARBURST_LIC_PATH=/Users/joechu/Documents/Licenses/signed_trial.license
CONFIG=
    query.max-memory-per-node=8GB
    query.max-total-memory-per-node=8GB
JVM_CONFIG=
    -Xmx16G

[MODULES]
S3_ENDPOINT=s3.us-west-2.amazonaws.com
S3_ACCESS_KEY=*****
S3_SECRET_KEY=*****
AWS_REGION=us-west-2

SNOWFLAKE_DIST_CONNECT_URL=
SNOWFLAKE_DIST_CONNECT_USER=
SNOWFLAKE_DIST_CONNECT_PASSWORD=
SNOWFLAKE_DIST_WAREHOUSE=
SNOWFLAKE_DIST_DB=
SNOWFLAKE_DIST_STAGE_SCHEMA=

SNOWFLAKE_JDBC_CONNECT_URL=
SNOWFLAKE_JDBC_CONNECT_USER=
SNOWFLAKE_JDBC_CONNECT_PASSWORD=
SNOWFLAKE_JDBC_WAREHOUSE=
SNOWFLAKE_JDBC_DB=
SNOWFLAKE_JDBC_STAGE_SCHEMA=

docker-compose.yml

version: "3.7"
services:

  presto:
    build:
      context: .
      args:
        STARBURST_VER: "${STARBURST_VER}"
      labels:
        - "com.starburst.tests=minipresto"
        - "com.starburst.tests.module=presto"
    image: "minipresto/presto:${STARBURST_VER}"
    container_name: "presto"
    hostname: "presto.minipresto.starburstdata.com"
    labels:
      - "com.starburst.tests=minipresto"
      - "com.starburst.tests.module=presto"
    # Uncomment this to enable the volume mount. The variable should point to a
    # valid SEP license.
    volumes:
      - "${STARBURST_LIC_PATH}:/usr/lib/presto/etc/starburstdata.license:ro"
    ports:
      - "8080:8080"

Improve in-code documentation

Any substantive or regularly-accessed class or function in Minipresto should be accompanied by detailed docstrings explaining what the code does, what it returns, and the meaning, type, and default of any parameters/properties. Will close this issue once functions have been properly documented.

Bootstrap script management

Currently, Minipresto executes a module's bootstrap script(s) every single time the module is provisioned, which creates obvious problems. To handle this issue, I propose the following:

To-Do

  • Add --skip-bootstrap to the provision command. If passed in with no string, it will skip all bootstraps. If passed in with specific module names, it will skip those modules' bootstraps.

Done

  • Add an initialize_container() function that adds an /opt/minipresto/ directory to the container as well as a bootstrap_status.txt file. Handled in PR #16
    • Before executing a bootstrap script in any container, check the bootstrap_status.txt file for the checksum value of the file you're about to execute. If the checksum is not in the file, execute the script, then add the checksum to the file. If the checksum is in the file, skip the bootstrap script and add nothing to the bootstrap_status.txt file.
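
A sketch of that checksum check using the Docker SDK's exec_run (the paths match the /opt/minipresto/ layout described above; the helper names are illustrative):

    import hashlib

    def should_run_bootstrap(container, script_text):
        # Skip the script if its checksum is already recorded in bootstrap_status.txt.
        checksum = hashlib.md5(script_text.encode()).hexdigest()
        exit_code, output = container.exec_run("cat /opt/minipresto/bootstrap_status.txt")
        return not (exit_code == 0 and checksum in output.decode())

    def record_bootstrap(container, script_text):
        # Append the checksum after the script has executed successfully.
        checksum = hashlib.md5(script_text.encode()).hexdigest()
        container.exec_run(
            f"sh -c 'echo {checksum} >> /opt/minipresto/bootstrap_status.txt'"
        )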
