hpe-container-platform-community / ezdemo
HPE Ezmeral Deployment tool for demos
GitHub Pages landing page: https://hpe-container-platform-community.github.io/
It would be good to configure ad_admin1 and ad_user1 on Picasso.
As an Ansible newbie, I'm not sure of the best way to structure the scripts.
Here's my attempt ...
root@1e2f223abb02:/app/server# cat ansible/routines/configure_picasso.yml
### Configure Picasso
- hosts: "{{ (groups['controllers'] | first) | default([]) }}"
  tasks:
    - name: read cluster id
      shell: "hpecp k8scluster list -o text | cut -d' ' -f1"
      register: cluster_id

    - name: get cluster
      shell: "hpecp k8scluster get {{ cluster_id.stdout }} -o json"
      register: cluster_json
      ignore_errors: True

    - set_fact:
        cluster: "{{ cluster_json.stdout | from_json }}"

    - set_fact:
        firstmaster_id: "{{ (cluster | json_query(jmesquery)) | first }}"
      vars:
        jmesquery: "k8shosts_config[?role=='master'].node"

    - shell: "hpecp k8sworker get {{ firstmaster_id }} -o json"
      register: firstmaster_json

    - set_fact:
        firstmasterip: "{{ (firstmaster_json.stdout | from_json) | json_query('ipaddr') }}"

    - name: prepare tenants
      shell: |-
        function retry {
          local n=1
          local max=20
          local delay=30
          while true; do
            "$@" && break || {
              if [[ $n -lt $max ]]; then
                ((n++))
                echo "Command failed. Attempt $n/$max:"
                sleep $delay
              else
                echo "The command has failed after $n attempts." >&2
                exit 1
              fi
            }
          done
        }
        export SCRIPTPATH="/opt/bluedata/bundles/hpe-cp*"
        export MASTER_NODE_IP={{ firstmasterip }}
        export LOG_FILE_PATH=/tmp/register_k8s_prepare.log
        retry ${SCRIPTPATH}/startscript.sh --action prepare_dftenants
        export LOG_FILE_PATH=/tmp/register_k8s_configure.log
        [[ $(tail -1 ${LOG_FILE_PATH} 2> /dev/null) == "The action configure_dftenants completed successfully." ]] || echo yes | ${SCRIPTPATH}/startscript.sh --action configure_dftenants
        export LOG_FILE_PATH=/tmp/register_k8s_register.log
        [[ $(tail -1 ${LOG_FILE_PATH} 2> /dev/null) == "The action register_dftenants completed successfully." ]] || expect <<EOF
        set timeout 1800
        spawn $(realpath ${SCRIPTPATH})/startscript.sh --action register_dftenants
        expect ".*Enter Site Admin username: " { send "admin\r" }
        expect "admin\r\nEnter Site Admin password: " { send "{{ admin_password }}\r" }
        expect eof
        EOF
      register: result
      # retries: 15
      # delay: 60
      # until: result is not failed

- name: configure picasso DF users
  hosts: localhost
  tasks:
    - name: mapr password
      shell: "kubectl --kubeconfig {{ ansible_env.HOME }}/.kube/config -n dfdemo get secret system -o yaml | grep MAPR_PASSWORD | head -1 | awk '{print $2}' | base64 --decode"
      register: mapr_password

    - name: maprlogin
      shell: "kubectl --kubeconfig {{ ansible_env.HOME }}/.kube/config -n dfdemo exec admincli-0 -- bash -c 'echo {{ mapr_password.stdout }} | maprlogin password'"

    - name: add ad_admin1
      shell: "kubectl --kubeconfig {{ ansible_env.HOME }}/.kube/config -n dfdemo exec admincli-0 -- maprcli acl edit -type cluster -user ad_admin1:fc"

    - name: add ad_user1
      shell: "kubectl --kubeconfig {{ ansible_env.HOME }}/.kube/config -n dfdemo exec admincli-0 -- maprcli acl edit -type cluster -user ad_user1:login"
allow users to select regions via config.json
E.g. using the --pull always flag so the latest image is always fetched:
docker run --pull always -d -p 4000:4000 -p 8443:8443 ${joined} erdincka/ezdemo:latest
Could we update to the most recent stable Nvidia driver?
https://uk.download.nvidia.com/tesla/470.82.01/NVIDIA-Linux-x86_64-470.82.01.run
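If we do update, a minimal sketch of a non-interactive install (assuming the runfile's standard nvidia-installer options; must run as root with kernel headers installed):

curl -fsSLO https://uk.download.nvidia.com/tesla/470.82.01/NVIDIA-Linux-x86_64-470.82.01.run
# --silent runs the installer unattended
sh NVIDIA-Linux-x86_64-470.82.01.run --silent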
I'm not sure if these issues are because I corrupted erdincka/ezdemo:latest
with the github action...
I had to create the group_vars folder:
[[ -d ./ansible/group_vars/ ]] || mkdir ./ansible/group_vars
echo "ansible_ssh_common_args: ${SSH_OPTS}" > ./ansible/group_vars/all.yml
gateway_pub_dns was throwing an error, and the relative path for the key was failing unless run from the right directory.
### TODO: Move to ansible task
SSH_CONFIG="
Host *
StrictHostKeyChecking no
Host hpecp_gateway
# Hostname ${gateway_pub_dns}
Hostname ${GATW_PUB_DNS[0]}
# IdentityFile generated/controller.prv_key
IdentityFile /app/server/generated/controller.prv_key
...
I had to create the ~/.ssh folder:
[[ -d ~/.ssh ]] || mkdir ~/.ssh && chmod 700 ~/.ssh
echo "${SSH_CONFIG}" > ~/.ssh/ssh_config ## TODO: move to ansible, delete on destroy
The ssh client wasn't reading ~/.ssh/ssh_config; the per-user file that OpenSSH reads by default is ~/.ssh/config, so I used that instead:
[[ -d ~/.ssh ]] || mkdir ~/.ssh && chmod 700 ~/.ssh
echo "${SSH_CONFIG}" > ~/.ssh/config ## TODO: move to ansible, delete on destroy
Do these changes make sense? Shall I add them?
Instead of showing just the raw output of the run(s), we should have a nicer/simpler/friendlier component to report progress. This could be a progress bar.
root_block_device {
  ...
  delete_on_termination = true
}
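For context, a sketch of where this sits in the Terraform instance definition (the resource name and other arguments are illustrative):

resource "aws_instance" "worker" {
  # ... ami, instance_type, etc. ...
  root_block_device {
    # remove the root EBS volume when the instance is terminated
    delete_on_termination = true
  }
}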
To update the MLOPS SCC configuration:
POST /api/v2/k8scluster/{cluster_id}/kubectl
With payload data:
data = {
  "op": {kubectl_op},   // "create", "apply", "delete"
  "data": {
    "apiVersion": "",
    "kind": "",
    "metadata": {
      "namespace": "",
      "name": "",
      "labels": {
        "kubedirector.hpe.com/cmType": "source-control",
        "createdByUser": "",
        "createdByRole": "",
        "parentConfiguration": ""
      }
    },
    "data": {
      "sourceControlName": "",
      "type": "github | bitbucket",
      "repoURL": "",
      "authType": "token | password",
      "branch": "",
      "workingDirectory": "",
      "proxyProtocol": "",
      "proxyHostname": "",
      "proxyPort": "",
      "username": "",
      "email": "",
      "token": "",
      "description": ""
    }
  }
}
More information to follow on:
- data.apiVersion
- data.kind
- data.metadata.labels.parentConfiguration (how do we retrieve this? see the sketch after the examples below)

Example parent:
{
  "method": "post",
  "apiurl": "https://127.0.0.1:8080",
  "timeout": 239,
  "data": {
    "kubectl_op": "create",
    "cluster_href": "/api/v2/k8scluster/1",
    "payload": {
      "apiVersion": "v1",
      "kind": "ConfigMap",
      "metadata": {
        "namespace": "k8s-tenant-1",
        "name": "abc",
        "labels": {
          "kubedirector.hpe.com/cmType": "source-control",
          "createdByUser": "6",
          "createdByRole": "Admin"
        }
      },
      "data": {
        "type": "github",
        "repoURL": "git@github.com:hpe-container-platform-community/example_active_directory_server.git",
        "authType": "token",
        "branch": "main",
        "workingDirectory": "",
        "proxyProtocol": "",
        "proxyHostname": "",
        "proxyPort": "",
        "description": ""
      }
    }
  },
  "op": "source_control_action"
}
Example child:
{
  "method": "post",
  "apiurl": "https://127.0.0.1:8080",
  "timeout": 239,
  "data": {
    "kubectl_op": "create",
    "cluster_href": "/api/v2/k8scluster/1",
    "payload": {
      "apiVersion": "v1",
      "kind": "ConfigMap",
      "metadata": {
        "namespace": "k8s-tenant-1",
        "name": "mysccchild",
        "labels": {
          "kubedirector.hpe.com/cmType": "source-control",
          "createdByUser": "22",
          "createdByRole": "Member",
          "parentConfiguration": "myscc"
        }
      },
      "data": {
        "type": "github",
        "repoURL": "git@github.com:hpe-container-platform-community/example_active_directory_server.git",
        "authType": "token",
        "branch": "main",
        "workingDirectory": "",
        "proxyProtocol": "",
        "proxyHostname": "",
        "proxyPort": "",
        "username": "mygitusername",
        "email": "[email protected]",
        "token": "mygittoken",
        "description": ""
      }
    }
  },
  "op": "source_control_action"
}
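On retrieving parentConfiguration: one possible approach (a sketch, assuming parent configurations are simply the source-control ConfigMaps in the tenant namespace) is to list them by the label used above; a child then references the parent's metadata.name (e.g. "myscc"):

# list source-control configurations in the tenant namespace
kubectl -n k8s-tenant-1 get configmap -l kubedirector.hpe.com/cmType=source-control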
Ansible output ...
TASK [drain gpu nodes] *********************************************************
failed: [localhost] (item={'changed': True, 'stdout': 'ip-10-1-0-176.eu-west-1.compute.internal', 'stderr': '', 'rc': 0, 'cmd': 'kubectl get nodes -o json | jq -r \'.items[] | select( .status.addresses[].address == "10.1.0.176") | .metadata.name\'', 'start': '2022-02-09 10:03:24.206134', 'end': '2022-02-09 10:03:25.481530', 'delta': '0:00:01.275396', 'msg': '', 'invocation': {'module_args': {'_raw_params': 'kubectl get nodes -o json | jq -r \'.items[] | select( .status.addresses[].address == "10.1.0.176") | .metadata.name\'', '_uses_shell': True, 'warn': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': ['ip-10-1-0-176.eu-west-1.compute.internal'], 'stderr_lines': [], 'failed': False, 'item': '10.1.0.176', 'ansible_loop_var': 'item'}) => {"ansible_loop_var": "item", "changed": true, "cmd": "kubectl drain --ignore-daemonsets \"ip-10-1-0-176.eu-west-1.compute.internal\"", "delta": "0:00:01.718773", "end": "2022-02-09 10:03:27.463665", "item": {"ansible_loop_var": "item", "changed": true, "cmd": "kubectl get nodes -o json | jq -r '.items[] | select( .status.addresses[].address == \"10.1.0.176\") | .metadata.name'", "delta": "0:00:01.275396", "end": "2022-02-09 10:03:25.481530", "failed": false, "invocation": {"module_args": {"_raw_params": "kubectl get nodes -o json | jq -r '.items[] | select( .status.addresses[].address == \"10.1.0.176\") | .metadata.name'", "_uses_shell": true, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "stdin_add_newline": true, "strip_empty_ends": true, "warn": false}}, "item": "10.1.0.176", "msg": "", "rc": 0, "start": "2022-02-09 10:03:24.206134", "stderr": "", "stderr_lines": [], "stdout": "ip-10-1-0-176.eu-west-1.compute.internal", "stdout_lines": ["ip-10-1-0-176.eu-west-1.compute.internal"]}, "msg": "non-zero return code", "rc": 1, "start": "2022-02-09 10:03:25.744892", "stderr": "error: unable to drain node \"ip-10-1-0-176.eu-west-1.compute.internal\", aborting command...\n\nThere are pending nodes to be drained:\n ip-10-1-0-176.eu-west-1.compute.internal\nerror: cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/grafana-784c89f4cf-rk6g4", "stderr_lines": ["error: unable to drain node \"ip-10-1-0-176.eu-west-1.compute.internal\", aborting command...", "", "There are pending nodes to be drained:", " ip-10-1-0-176.eu-west-1.compute.internal", "error: cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/grafana-784c89f4cf-rk6g4"], "stdout": "node/ip-10-1-0-176.eu-west-1.compute.internal already cordoned", "stdout_lines": ["node/ip-10-1-0-176.eu-west-1.compute.internal already cordoned"]}
I'm wondering if it is possible to handle this issue?
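A possible fix (a sketch, assuming the Grafana pod's emptyDir data is disposable) is to add the override flag that the error message itself suggests to the drain command:

# --delete-emptydir-data permits evicting pods that use emptyDir volumes (their local data is lost)
kubectl drain --ignore-daemonsets --delete-emptydir-data "ip-10-1-0-176.eu-west-1.compute.internal"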
Should I open a new branch for oVirt and VMware, or just check into the existing repo?
We can still adapt the UI for it later; what do you think?
Users should not have an SSL error when accessing MCS
Selecting the High Availability option creates 2 gateways and 3 controllers, but the installation only configures the 2 gateways. We need to update the Ansible process to configure and add the 2 additional controllers into the cluster (as EPIC workers) and then enable HA.
No cluster IP or floating VIP is required in this setup.
I'm wondering whether a simple fix for this requirement could be to mount two volumes and use two rsync processes.
One process would copy config files, allowing users to provide only the files they need to change. E.g. if on the host (rsync source) you had ./app/server/aws/config.json, rsync's default behavior would be to copy only the files present on the source without deleting the other files in the destination.
Could rsync also be used to create a backup of the entire docker /app folder on the local machine?
Thoughts?
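A minimal sketch of the two processes (mount points are illustrative):

# overlay user-provided config files onto /app; without --delete, files
# absent from the source are left untouched in the destination
rsync -av /host-config/ /app/
# mirror the entire /app folder back to a host-mounted volume as a backup
rsync -av /app/ /host-backup/app/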
TASK [setup gitea] *************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "./setup_gitea.sh 'kubectl --kubeconfig /root/.kube/config -n k8s-tenant-1'", "delta": "0:00:00.183659", "end": "2022-01-23 18:24:44.196309", "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2022-01-23 18:24:44.012650", "stderr": "error: stat /root/.kube/config: no such file or directory", "stderr_lines": ["error: stat /root/.kube/config: no such file or directory"], "stdout": "", "stdout_lines": []}
I think the /root/.kube directory needs to be created, e.g.
ansible/refresh.yml
...
...
- name: update kubeadmin config
  shell: |-
    [[ -d ~/.kube ]] || mkdir ~/.kube
    while : ; do
      hpecp k8scluster admin_kube_config {{ item }} > ~/.kube/config
      [ $(wc -l ~/.kube/config | cut -d' ' -f1) -lt 5 ] || break
      sleep 10
    done
  with_items: "{{ cluster_ids }}"
Could we provide an option for users to set the admin password? E.g.
{
  "aws_access_key": "",
  "aws_secret_key": "",
  "is_mlops": false,
  "is_df": false,
  "user": "",
  "project_id": "",
  "admin_password": "ChangeMe!!"
}
With Terraform it is possible to look up instance types. This would allow deployment to continue (possibly at a higher cost) if the preferred instance type is not supported in the selected region and availability zone. E.g.
data "aws_ec2_instance_type_offering" "example" {
filter {
name = "instance-type"
values = ["t2.micro", "t3.micro"]
}
preferred_instance_types = ["t3.micro", "t2.micro"]
}
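The selected type can then be consumed in the instance definition, e.g. (a sketch; the resource and variable names are illustrative):

resource "aws_instance" "gateway" {
  ami = var.ami_id
  # resolves to t3.micro where offered, otherwise t2.micro
  instance_type = data.aws_ec2_instance_type_offering.example.instance_type
}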
See also the aws_ec2_instance_types data source, e.g.
data "aws_ec2_instance_types" "test" {
filter {
name = "auto-recovery-supported"
values = ["true"]
}
filter {
name = "network-info.encryption-in-transit-supported"
values = ["true"]
}
filter {
name = "instance-storage-supported"
values = ["true"]
}
filter {
name = "instance-type"
values = ["g5.2xlarge", "g5.4xlarge"]
}
}
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/ec2_instance_types
allow selection of region via config.json
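E.g. a new field in config.json (a sketch; the key name is an assumption):

{
  ...
  "aws_region": "eu-west-1"
}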
It would be good to print usage when the user provides wrong parameters, e.g. in 00-run_all.sh:
#!/usr/bin/env bash
set -euo pipefail
# check $# first so "set -u" doesn't abort before we can print usage
if [[ $# -lt 1 ]] || ! echo "aws azure kvm vmware" | grep -w -q "${1}"; then
  echo "Usage: ${0} aws|azure|kvm|vmware"
  exit 1
fi
...
Add support for single node KVM
see EZESC-1160
Some tasks can run in parallel, e.g. "add workers" and "configure DF node", or "create k8scluster" and "create DF".
Currently, the Samba docker container within ad_server is started as an async task and is never checked for completion/errors. It would be nice to have a method to submit tasks in the background and check/wait just before the job that depends on them (create tenant should check create cluster, install mapr should check AD integration, etc.).
Ansible's async support is probably the best option for implementing this (we need to ensure we submit and check these jobs on the same hosts, etc.); a sketch follows.
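A minimal sketch of the fire-and-forget pattern (the task names, command, and timeouts are illustrative):

- name: start AD/Samba setup in the background
  shell: ./start_ad_server.sh    # placeholder for the real command
  async: 1800                    # allow up to 30 minutes of runtime
  poll: 0                        # fire and forget; do not wait here
  register: ad_job

# ... unrelated tasks run in the meantime ...

- name: wait for AD/Samba setup just before the task that depends on it
  async_status:
    jid: "{{ ad_job.ansible_job_id }}"
  register: ad_result
  until: ad_result.finished
  retries: 30                    # poll every 60s, up to 30 minutes
  delay: 60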
Feedback from eng ...
Chris Snow, you did not install the environment correctly if you want to have the root CA guarantee TLS transactions...
See the install options:
--ssl-cert : Absolute path to the SSL certificate.
--ssl-priv-key : Absolute path to the SSL certificate's private key.
--ssl-ca-data : Absolute path to the SSL CA certificate data file path (optional).
We should be using ...
--ssl-cert=/etc/pki/tls/certs/cert.pem
--ssl-priv-key=/etc/pki/tls/private/key.pem
--ssl-ca-data=/etc/pki/tls/certs/minica.pem
# If you install ECP in this fashion, you will not see the insecure TLS connection.
See EZESC-1160 (internal Jira)
Adding VMware as a provider