jefflester / minitrino

A tool that makes it easy to run modular Trino environments locally.

License: Apache License 2.0
Minipresto should be able to provision a multi-node cluster if specified. Example:
minipresto provision -c snowflake-jdbc elasticsearch --worker-count 2
The resulting workers should be built using the Docker SDK, as forcing this through Compose would get messy quickly (Compose ought to be used only where it is absolutely needed). The high-level flow should look like:

- Spin up the worker container(s) via the Docker SDK.
- Copy the coordinator's etc/presto/ directory and make the necessary modifications to those files (mainly node.properties and config.properties).

This will be a semi-substantial lift, but it will improve the flexibility of the tool in a way that doesn't bind us further to Compose shell commands. A rough sketch of the flow follows.
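As a rough illustration of that flow (not the actual implementation), a worker could be created with the Docker SDK and its config.properties adjusted via an exec call. The "presto" container name and the /usr/lib/presto/etc path are assumptions taken from the compose file further down:

# Rough sketch using the Docker SDK (docker-py); names and paths are assumptions.
import docker

def provision_workers(coordinator_name="presto", worker_count=1):
    client = docker.from_env()
    coordinator = client.containers.get(coordinator_name)
    for i in range(1, worker_count + 1):
        worker = client.containers.run(
            coordinator.image.id,
            name=f"{coordinator_name}-worker-{i}",
            detach=True,
        )
        # Assumed path; flip the coordinator flag so this node runs as a worker
        worker.exec_run(
            "sed -i 's/coordinator=true/coordinator=false/' "
            "/usr/lib/presto/etc/config.properties"
        )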
After some messing around, it's looking like volume copying might be impossible with the way things are currently set up. The current idea is to:

- Copy the coordinator's mounted volumes to the worker(s)
- Modify config.properties and node.properties in the worker node(s)

This would be nice and simple, but for simple volumes there is no accessible link between non-persistent volumes mounted to a container and the container itself. For example, if we mount an elasticsearch.properties file to the coordinator, we cannot detect that relationship through the SDK and copy it to the new worker.

I am thinking about how best to tackle this. The very ugly and inefficient brute-force technique would be to copy the entire etc and plugin directories from the coordinator to the host filesystem, then copy those files to the worker. This would be less than ideal... A sketch of that approach is below.
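For reference, the brute-force copy could look roughly like this with the Docker SDK's get_archive/put_archive calls; the /usr/lib/presto/etc path is an assumption:

# Sketch of the brute-force approach: pull the coordinator's etc directory as a
# tar stream and push it into the worker. Paths are assumptions, not verified.
import docker

def copy_etc(client, src_name="presto", dest_name="presto-worker-1"):
    src = client.containers.get(src_name)
    dest = client.containers.get(dest_name)
    stream, _stat = src.get_archive("/usr/lib/presto/etc")
    tar_bytes = b"".join(stream)
    # put_archive extracts the tar into the given directory on the destination
    dest.put_archive("/usr/lib/presto", tar_bytes)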
Tests for the config
command should be automated and incorporated into the runner.py
test runner file.
The challenge with this is that the command opens a text editor. I am trying to figure out a solid way to capture the correct exit code and also hop out of the text editor automatically. So far, my efforts aren't going anywhere. One possible approach is sketched below.
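One untested idea: point the config command at a no-op editor so it exits immediately, then assert on the CLI's exit code. The TEXT_EDITOR override below assumes the editor can be swapped via the environment or config:

# Untested sketch: use a no-op "editor" so the command never blocks on input.
import os
import subprocess

def test_config_command():
    env = dict(os.environ, TEXT_EDITOR="/usr/bin/true")  # assumed override mechanism
    result = subprocess.run(
        ["minitrino", "config"], env=env, capture_output=True, timeout=30
    )
    assert result.returncode == 0, result.stderr.decode()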
For the time being, human QA will be run on the config
command prior to any releases. This should be considered a high-priority testing item to ensure all commands are getting equal testing coverage in the testing library.
When running the provision.sh
script from a snapshot, an error is produced claiming the module does not exist. This needs to be fixed and a test case for it should be added.
- master <- #.#.#-update (with updated version numbers in the version file, readme, and setup.py)
- 0.0.0 release/tag
- 0.0.0 release/tag based on the PR branch (allows for pulling the Minitrino library in automated tests for the latest build)
- (test directory)
Currently, you are required to uncomment the volume mount in ~/.minitrino/lib/docker-compose.yml and set ${STARBURST_LIC_PATH}. We should not expect the user to need to uncomment a volume mount, so let's figure out a cleaner approach to get this file propagated. In my mind, if you have the variable set, that is all that should be required from a user-facing perspective. One possible approach is sketched below.
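One hypothetical way to do that: keep the mount permanently enabled and have the CLI substitute a placeholder file whenever the variable is unset. The helper below is illustrative only:

# Hypothetical helper: fall back to a harmless placeholder license file so the
# compose volume mount always has a valid source, even with no license set.
import os

def resolve_license_path(user_value="", placeholder="/tmp/dummy.license"):
    if user_value and os.path.isfile(user_value):
        return user_value
    # Create an empty placeholder so Compose still has a valid mount source
    open(placeholder, "a").close()
    return placeholder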
A few things could be touched up in Minitrino:
For config.properties management, we could use a more user-friendly approach involving environment variables defined in the Compose YAML. Something like:

trino:
  environment:
    CONFIG_PROPERTIES: |-
      insights.jdbc.url=jdbc:postgresql://postgres-query-logger:5432/querylogger
      insights.jdbc.user=admin
      insights.jdbc.password=trinoRocks15
      insights.persistence-enabled=true
      insights.metrics-persistence-enabled=true
We'd merge these together based on the presence of configured properties in whatever modules the user wants to provision, then either run an automated script inside the Trino container or mount the collection of properties as a cohesive config.properties
file to the container.
We'd want to keep the existing bootstrap functionality around while decoupling it from managing this config file. A rough sketch of the merge step follows.
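A rough sketch of the merge step (a hypothetical helper, not existing Minitrino code):

# Merge the CONFIG_PROPERTIES blocks defined by each provisioned module into a
# single config.properties payload.
def merge_config_properties(module_blocks):
    merged = {}
    for block in module_blocks:
        for line in block.splitlines():
            line = line.strip()
            if not line or "=" not in line:
                continue
            key, value = line.split("=", 1)
            merged[key] = value  # later modules win on duplicate keys
    return "\n".join(f"{key}={value}" for key, value in merged.items())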
The CONFIG and JVM_CONFIG environment variables (see: readme) are messy and buggy in the first place, and by implementing the first request above, there would be little to no need to support them.

The CommandExecutor class could use some updates:

- Add a suppress_output property. If given to the constructor as True, the executed command will direct all output to DEVNULL. Handled in commit e167625d2fc246d3423749a4fafb4f4ae3f0b7d3.
- Stop using shell=True for everything (this is frowned upon and viewed as a potential security hazard). Look into using the shlex module to parse complex command strings and execute the statements with Popen; a sketch follows.
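A minimal sketch of the shlex/Popen approach, illustrative only and not the current CommandExecutor implementation:

# Parse the command string with shlex and hand the argument list to Popen,
# avoiding shell=True entirely.
import shlex
import subprocess

def execute(command, suppress_output=False):
    stdout = subprocess.DEVNULL if suppress_output else None
    process = subprocess.Popen(shlex.split(command), stdout=stdout, stderr=stdout)
    process.wait()
    return process.returncode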
The --env-override flag currently applies to the .env file in the root of the library. A user should be able to override any variable in this .env file, and they should also be able to override any variable set in their minipresto.cfg file. Override checks need to be updated in core.py accordingly.
At a higher level, this raises the question of how to best manage config. Does it make sense to have it all live in one file? Probably. Perhaps everything should live in the minipresto.cfg file, and the library's root .env file should be deleted. A possible precedence scheme is sketched below.
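For example, the override checks could resolve variables in a fixed precedence order (a hypothetical sketch, not current behavior):

# Values from --env/--env-override beat minipresto.cfg, which beats the
# library's .env defaults.
def resolve_env(env_overrides, cfg_vars, lib_env_vars):
    resolved = dict(lib_env_vars)   # lowest precedence: library .env defaults
    resolved.update(cfg_vars)       # then user config (minipresto.cfg)
    resolved.update(env_overrides)  # highest precedence: CLI overrides
    return resolved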
Accessing localhost:9000
redirects to something like localhost:34030
with a "request failed" message in the browser in the following config:
minio:
  image: minio/minio
  container_name: minio
  ports:
    - '9000:9000'
  volumes:
    - './resources/minio/data/:/data'
  environment:
    MINIO_ACCESS_KEY: ${S3_ACCESS_KEY}
    MINIO_SECRET_KEY: ${S3_SECRET_KEY}
  command: server /data
I was able to work around this issue by adding 9001:9001 to the list of exposed ports and specifying --address ":9000" and --console-address ":9001" in the command parameter:
minio:
  image: minio/minio
  container_name: minio
  ports:
    - '9000:9000'
    - '9001:9001'
  volumes:
    - './resources/minio/data/:/data'
  environment:
    MINIO_ACCESS_KEY: ${S3_ACCESS_KEY}
    MINIO_SECRET_KEY: ${S3_SECRET_KEY}
  command: server --address ":9000" --console-address ":9001" /data
This will serve as an ongoing list of modules we would like to add to the Minipresto library. The modules will be a mixture of Starburst Enterprise Presto features and open source PrestoSQL features.
When setting up Minitrino for the first time, new users are unable to run the ./install.sh script if they are on Python 3.10, which is the latest available Python version.
From 378 onwards, internal-communication.shared-secret will be required if authentication is enabled on your SEP cluster.
There's a somewhat confusing message displayed when running the install.sh script:
./install.sh: line 5: pip: command not found
This doesn't appear to be an actual error, but rather a side effect of testing for the required Python version. However, I think users may be confused about whether or not the installation succeeded if they see this message.
You should be able to modify the code to suppress this message without losing any functionality:
if pip --version 2>&1 | grep -q "python3.[6-9]"; then
  PIP=pip
elif pip3 --version 2>&1 | grep -q "python3.[6-9]"; then
  PIP=pip3
fi
Otherwise if you'd rather have a pull request, let me know!
Starburst switched the underlying OS in the latest SEP images (363-e and later). The Dockerfile will need to be updated to accommodate all Trino-based releases (354-e and on), with added logic to handle OS changes in 363-e and later images.
Since the artifact names are consecutive, update the Presto Dockerfile to pull from Maven Central. This is currently broken in 338-e.1 and later.
Around version 407 (perhaps earlier), users will receive the below error thrown by the event-listener when attempting to spin up the event-logger module:
2023-04-05T17:45:46.345Z ERROR main io.trino.server.Server Event listener factory 'event-logger' is not registered. Available factories: [immuta-event-listener, starburst-atlas, starburst-open-lineage, http, mysql, audit-log]
java.lang.IllegalArgumentException: Event listener factory 'event-logger' is not registered. Available factories: [immuta-event-listener, starburst-atlas, starburst-open-lineage, http, mysql, audit-log]
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:435)
    at io.trino.eventlistener.EventListenerManager.createEventListener(EventListenerManager.java:114)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
    at java.base/java.util.Collections$2.tryAdvance(Collections.java:4853)
    at java.base/java.util.Collections$2.forEachRemaining(Collections.java:4861)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
    at io.trino.eventlistener.EventListenerManager.configuredEventListeners(EventListenerManager.java:101)
    at io.trino.eventlistener.EventListenerManager.loadEventListeners(EventListenerManager.java:86)
    at io.trino.server.Server.doStart(Server.java:162)
    at io.trino.server.Server.lambda$start$0(Server.java:89)
    at io.trino.$gen.Trino_407_e_1____20230405_174525_1.run(Unknown Source)
    at io.trino.server.Server.start(Server.java:89)
    at com.starburstdata.presto.StarburstTrinoServer.main(StarburstTrinoServer.java:46)
A temporary workaround is to comment out the event-listener.properties volume in event-logger.yaml altogether.
Related Slack: https://starburstdata.slack.com/archives/C01394KHSAU/p1671822000765579
Apple's latest machines ship with M1 CPUs, built on an ARM64 architecture. Certain images (e.g., MySQL) are incompatible with these CPUs and require adjustments to the deployment in order to work on M1-based machines.
A one-off fix for the MySQL module involved changing the Docker Compose YAML to the following:
mysql:
  platform: "linux/amd64"
  image: "mysql:5.7"
  container_name: "mysql"
The following MySQL tags paired with platform: "linux/amd64"
seem to work:
mysql:8.0.31
mysql:8
mysql:5.7
This issue can serve as a channel for planning/discovery as we figure out how best to tackle M1 support. A rough detection sketch follows the image list below.
Images detected to cause problems on M1:
mcr.microsoft.com/mssql/server:${SQLSERVER_VER}
ibmcom/db2:${DB2_VER}
starburstdata/hive-metastore-unsecured:0cc95b
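As a sketch of how the CLI could handle this automatically (hypothetical; nothing like this exists in the CLI today), the host architecture could be detected and a platform override injected for the problem images:

# Detect an ARM64 host and request amd64 emulation for known-problem images.
# The image list mirrors the ones called out above.
import platform

PROBLEM_IMAGES = {
    "mcr.microsoft.com/mssql/server",
    "ibmcom/db2",
    "starburstdata/hive-metastore-unsecured",
}

def platform_override(image):
    on_arm = platform.machine().lower() in ("arm64", "aarch64")
    if on_arm and image.split(":")[0] in PROBLEM_IMAGES:
        return {"platform": "linux/amd64"}
    return {}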
When provisioning a minitrino cluster with the DB2 module, the cluster is able to come up successfully. However, if you try to query the module, you receive the following error:
trino> show schemas from db2;
Query 20220719_163217_00001_k9r9w, FAILED, 1 node
Splits: 11 total, 0 done (0.00%)
0.51 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20220719_163217_00001_k9r9w failed: DB2 SQL Error: SQLCODE=-1001, SQLSTATE=2E000, SQLERRMC=MINITRINO, DRIVER=4.25.13
I stopped Minitrino, removed Docker volumes using minitrino -v remove --volumes, and deleted everything in ~/.minitrino/lib/modules/catalog/db2/resources/db2/db2-storage/.
After spinning up the DB2 module again and attempting to query it, I received a new exception as shown below:
trino> show schemas from db2;
Query 20220719_170146_00017_kfa26, FAILED, 1 node
Splits: 11 total, 0 done (0.00%)
0.67 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20220719_170146_00017_kfa26 failed: [jcc][t4][2043][11550][4.25.13] Exception java.net.ConnectException: Error opening socket to server db2/172.19.0.3 on port 50,000 with message: Connection refused (Connection refused). ERRORCODE=-4499, SQLSTATE=08001
When using the sqlserver module, users might see an error similar to the below when attempting to query the sqlserver catalog:
The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption. Error: "PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target"
Same issue reported in this internal slack thread: https://starburstdata.slack.com/archives/C01FQCMJQM6/p1648827385983409
The root cause is apparently a breaking change in the MS SQL JDBC driver version update between 371 and 374; more details are mentioned in the linked thread.
Workaround is to update the connection-url similar to either of the below options:
connection-url=jdbc:sqlserver://sqlserver:1433;database=master;encrypt=false
or
connection-url=jdbc:sqlserver://sqlserver:1433;database=master;trustServerCertificate=true
I believe new updates to the SQL Server connector require ';encrypt=false' to be appended to the end of the connection URL. The new URL should be:
connection-url=jdbc:sqlserver://sqlserver:1433;database=master;encrypt=false
Docs for reference:
https://docs.starburst.io/latest/connector/starburst-sqlserver.html#configuration
The readme documents for each module should be accompanied by a high-level description of what the module does and any important instructions for interacting with the module. Will close this issue once all modules have this and it becomes a standard process for new module PRs.
Library modules need to have testing coverage. I envision this coverage looking like this:
A ModuleTest class with:

- module_name: the name of the module to test
- args: commands to execute against the module. The args should be dicts with the following keys:
  - command: the command to execute
  - container: true/false, whether or not the command is executed inside of a container or on the host shell
  - timeout: how long before the test command times out. This is to account for giving services time to fully provision before running commands against them.

The class should accept any module as an input. From there, a single file (probably) can be used as a runner to test all modules. A rough sketch of the class is below.
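A rough sketch of the proposed class (the "trino" container name and the use of docker exec are assumptions for illustration):

# Minimal sketch of the proposed ModuleTest class described above.
import shlex
import subprocess

class ModuleTest:
    def __init__(self, module_name, args):
        self.module_name = module_name  # the name of the module to test
        self.args = args                # dicts with "command", "container", "timeout" keys

    def run(self):
        for arg in self.args:
            cmd = shlex.split(arg["command"])
            if arg.get("container"):
                # Assumed coordinator container name for in-container commands
                cmd = ["docker", "exec", "trino"] + cmd
            subprocess.run(cmd, check=True, timeout=arg.get("timeout", 60))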
Lastly, module tests should only run for PRs if the library changed, since the tests already take long enough, and running tests for n modules will take even longer to complete.
Currently, certain modules lead to port conflicts when provisioned together (the delta-lake
and hive-minio
modules will cause this, for example). This is because the external and internal port mappings are backwards.
Current port mappings look like:
ports:
  - "9083:9084"
This makes the external port 9083
and the internal port 9084
. Plenty of other metastore services exist in the modules, so any module with a metastore provisioned at the same time will result in host-level port conflicts. This can be resolved by simply flipping the mapping:
ports:
  - "9084:9083"
This way, the service will be exposed on the host on port 9084
, and the internal service will still be accessible on port 9083
.
Tests should be developed and automated for both the CLI and library modules. Tests should also run with each PR to the master branch, and merging should be blocked if the tests fail.
CLI tests should:
Module tests should:
It doesn't look like the STARBURST_LIC_PATH property in the Minipresto configuration is used when provisioning a cluster. More importantly, once the STARBURST_LIC_PATH property is set and the volume is uncommented in the docker-compose.yml, Minipresto will return an error when provisioning:
[i] Creating presto ...
[i] Creating presto ... error
[i] ERROR: for presto Cannot create container for service presto: create .: volume name is too short, names should be at least two alphanumeric characters
[i] Starting ranger-admin ... done
[i] ERROR: for presto Cannot create container for service presto: create .: volume name is too short, names should be at least two alphanumeric characters
[i] Encountered errors while bringing up the project.
[w] Rolling back provisioned resources due to errors encountered while provisioning the environment.
[e] Failed to execute shell command:
[e] COMPOSE_PROJECT_NAME="minipresto" STARBURST_VER="338-e.0" ELASTICSEARCH_VER="7.6.2" MYSQL_VER="5" OPEN_LDAP_VER="1.3.0" POSTGRES_VER="11"
POSTGRES_HIVE_HMS_VER="11" POSTGRES_EVENT_LOGGER_VER="11" POSTGRES_RANGER_VER="12.2" SEP_RANGER_VER="2.0.24" KIBANA_VER="ver" MEMSQL_VER="ver"
MONGODB_VER="ver" SQL_SERVER_VER="ver" S3_ENDPOINT="s3.us-west-2.amazonaws.com" S3_ACCESS_KEY="*****"
S3_SECRET_KEY="*****" AWS_REGION="us-west-2" SNOWFLAKE_DIST_CONNECT_URL="" SNOWFLAKE_DIST_CONNECT_USER=""
SNOWFLAKE_DIST_CONNECT_PASSWORD="" SNOWFLAKE_DIST_WAREHOUSE="" SNOWFLAKE_DIST_DB="" SNOWFLAKE_DIST_STAGE_SCHEMA="" SNOWFLAKE_JDBC_CONNECT_URL=""
SNOWFLAKE_JDBC_CONNECT_USER="" SNOWFLAKE_JDBC_CONNECT_PASSWORD="" SNOWFLAKE_JDBC_WAREHOUSE="" SNOWFLAKE_JDBC_DB="" SNOWFLAKE_JDBC_STAGE_SCHEMA="" \
[e] docker-compose -f /Users/joechu/git/minipresto/lib/docker-compose.yml \
[e] -f /Users/joechu/git/minipresto/lib/modules/security/system-ranger/system-ranger.yml \
[e] -f /Users/joechu/git/minipresto/lib/modules/security/ldap/ldap.yml \
[e] up -d
[e] Exit code: 1
So for some reason, STARBURST_LIC_PATH is not set as an environment variable. Manually setting the environment variable (export STARBURST_LIC_PATH=) or using the flag (minipresto provision --env STARBURST_LIC_PATH=) works as a workaround.
Minipresto version - commit 419edb4
minipresto.cfg
[CLI]
LIB_PATH=/Users/joechu/git/minipresto/lib
TEXT_EDITOR=/usr/bin/vim
[DOCKER]
DOCKER_HOST=unix:///var/run/docker.sock
[PRESTO]
STARBURST_LIC_PATH=/Users/joechu/Documents/Licenses/signed_trial.license
CONFIG=
query.max-memory-per-node=8GB
query.max-total-memory-per-node=8GB
JVM_CONFIG=
-Xmx16G
[MODULES]
S3_ENDPOINT=s3.us-west-2.amazonaws.com
S3_ACCESS_KEY=*****
S3_SECRET_KEY=*****
AWS_REGION=us-west-2
SNOWFLAKE_DIST_CONNECT_URL=
SNOWFLAKE_DIST_CONNECT_USER=
SNOWFLAKE_DIST_CONNECT_PASSWORD=
SNOWFLAKE_DIST_WAREHOUSE=
SNOWFLAKE_DIST_DB=
SNOWFLAKE_DIST_STAGE_SCHEMA=
SNOWFLAKE_JDBC_CONNECT_URL=
SNOWFLAKE_JDBC_CONNECT_USER=
SNOWFLAKE_JDBC_CONNECT_PASSWORD=
SNOWFLAKE_JDBC_WAREHOUSE=
SNOWFLAKE_JDBC_DB=
SNOWFLAKE_JDBC_STAGE_SCHEMA=
docker-compose.yml
version: "3.7"
services:
  presto:
    build:
      context: .
      args:
        STARBURST_VER: "${STARBURST_VER}"
      labels:
        - "com.starburst.tests=minipresto"
        - "com.starburst.tests.module=presto"
    image: "minipresto/presto:${STARBURST_VER}"
    container_name: "presto"
    hostname: "presto.minipresto.starburstdata.com"
    labels:
      - "com.starburst.tests=minipresto"
      - "com.starburst.tests.module=presto"
    # Uncomment this to enable the volume mount. The variable should point to a
    # valid SEP license.
    volumes:
      - "${STARBURST_LIC_PATH}:/usr/lib/presto/etc/starburstdata.license:ro"
    ports:
      - "8080:8080"
Any substantive or regularly-accessed class or function in Minipresto should be accompanied by detailed docstrings explaining what the code does, what it returns, and the meaning, type, and default of any parameters/properties. Will close this issue once functions have been properly documented.
Currently, Minipresto will execute a module's bootstrap script(s) every single time a Minipresto module is provisioned––this creates obvious problems. To handle this issue, I propose the following:
--skip-bootstrap
to the provision
command. If passed in with no string, it will skip all bootstraps. If passed in with specific module names, it will skip those modules' bootstraps.initialze_container()
function that adds an /opt/minipresto/
directory to the container as well as a bootstrap_status.txt
file. Handled in PR #16
bootstrap_status.txt
file for the checksum value of the file you're about to execute. If the checksum is not in the file, execute the script, then add the checksum to the file. If the checksum is in the file, skip the bootstrap script and add nothing to the bootstrap_status.txt
file.A declarative, efficient, and flexible JavaScript library for building user interfaces.
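A sketch of the checksum check using the Docker SDK, against the /opt/minipresto/bootstrap_status.txt file described above (the helpers themselves are hypothetical):

# Skip bootstrap scripts whose checksum has already been recorded in the
# container's bootstrap_status.txt file.
import hashlib

STATUS_FILE = "/opt/minipresto/bootstrap_status.txt"

def bootstrap_already_ran(container, script_path):
    script = container.exec_run(f"cat {script_path}").output
    checksum = hashlib.sha256(script).hexdigest()
    status = container.exec_run(f"cat {STATUS_FILE}").output.decode()
    return checksum in status

def record_bootstrap(container, checksum):
    # Append the checksum only after the script has executed successfully
    container.exec_run(f"sh -c 'echo {checksum} >> {STATUS_FILE}'")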