
opi-poc's People

Contributors

amarv-marvell, artek-koltun, bbhamidipati, chrispsommers, dependabot[bot], ganesh-juniper, glimchb, llabordehpe, mestery, mgheorghe, mkalderon, pdp2shirts, pwalessi-dell, renovate[bot], seroyer, stefanchulski, thejacobwalters, timworsleyf5

opi-poc's Issues

CI is failing

See for example https://github.com/opiproject/opi-poc/runs/7118823824. It seems the psabpf CLI has changed, and our ipdk-plugin uses it:

Network poc-net1  Creating
E0629 19:13:00.083794    3436 plugin.go:458] ERROR: Unable to add host port port: [exit status 255]
exit status 255: add_member: unknown keyword
Network poc-net1  Error
failed to create network poc-net1: Error response from daemon: remote: Error: exit status 255
Usage: psabpf-ctl action-selector add-member pipe ID ACTION_SELECTOR_NAME action ACTION [data ACTION_PARAMS]
       psabpf-ctl action-selector delete-member pipe ID ACTION_SELECTOR_NAME MEMBER_REF
       psabpf-ctl action-selector update-member pipe ID ACTION_SELECTOR_NAME MEMBER_REF action ACTION [data ACTION_PARAMS]
       psabpf-ctl action-selector create-group pipe ID ACTION_SELECTOR_NAME
       psabpf-ctl action-selector delete-group pipe ID ACTION_SELECTOR_NAME GROUP_REF
       psabpf-ctl action-selector add-to-group pipe ID ACTION_SELECTOR_NAME MEMBER_REF to GROUP_REF
       psabpf-ctl action-selector delete-from-group pipe ID ACTION_SELECTOR_NAME MEMBER_REF from GROUP_REF
       psabpf-ctl action-selector empty-group-action pipe ID ACTION_SELECTOR_NAME action ACTION [data ACTION_PARAMS]
       psabpf-ctl action-selector get pipe ID ACTION_SELECTOR_NAME [member MEMBER_REF | group GROUP_REF | empty-group-action]

       ACTION := { id ACTION_ID | name ACTION_NAME }
       ACTION_PARAMS := { DATA }

That is called from here. Note the "add_member: unknown keyword" error: the plugin apparently still passes add_member, while the new psabpf-ctl usage above spells it add-member.

Replace "sleep 5" with check for service

In many files there are instances of
echo wait 5s... && sleep 5s

For example in:
integration/docker-compose.xpu.yml
integration/docker-compose.spdk.yml
integration/scripts/integration.sh

That is a fragile approach at best. Instead, either add proper readiness checks for the various services, or wrap the curl calls in functions that retry and check for expected HTTP status codes.
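For example, a minimal retry helper could replace the sleeps (a sketch; the URL and retry count below are placeholders, not values from the repo):

wait_for_service() {
    # poll a URL until it answers successfully or we run out of retries
    local url=$1 retries=${2:-30}
    for _ in $(seq 1 "$retries"); do
        curl --fail --silent --output /dev/null "$url" && return 0
        sleep 1
    done
    echo "timed out waiting for $url" >&2
    return 1
}

# usage: wait_for_service http://localhost:9009 60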

`./scripts/deploy.sh` fails on remote mount bind

run

bash -x ./scripts/deploy.sh -m xpu -x 10.246.67.206

or

DOCKER_HOST=ssh://<user>@10.246.67.206 COMPOSE_FILE=docker-compose.xpu.yml:docker-compose.networks.yml docker-compose up

see

ERROR: for integration_xpu-telegraf_1  Cannot start service xpu-telegraf: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "/home/boris/opi-poc/integration/xpu-cpu/telegraf-redfish.conf" to rootfs at "/etc/telegraf/telegraf.conf" caused: mount through procfd: not a directory: unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type
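Likely cause (an educated guess, not verified): with DOCKER_HOST pointing at a remote daemon, the bind-mount source path is resolved on the remote host. If /home/boris/opi-poc/integration/xpu-cpu/telegraf-redfish.conf does not exist there, Docker creates the missing path as a directory and then fails to mount that directory onto the file /etc/telegraf/telegraf.conf.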

bring `Marvell SDK` container

Since we are creating a SHIM layer on top of vendor SDKs, compile and run hello world on top of the Marvell SDK on an xPU in our POC.

Build and push container images to GHCR on PRs and merges to master

We should use GitHub Actions to build and push container images for our Docker images on both PRs and pushes to master. IPDK has already done this; we could use it as an example. We should also make sure to do multi-arch builds (e.g. x86_64 and arm64).
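A rough sketch of the push step with docker buildx (the image name exists in this repo; the storage/ build context is an assumption):

docker login ghcr.io   # token needs write:packages
docker buildx create --use
docker buildx build \
    --platform linux/amd64,linux/arm64 \
    --tag ghcr.io/opiproject/opi-spdk:main \
    --push \
    storage/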

sometimes DHCP request fails

wrong

docker-compose -f docker-compose.pxe.yml exec -T pxe nmap --script broadcast-dhcp-discover
Starting Nmap 7.92 ( https://nmap.org/ ) at 2022-07-07 20:23 UTC
WARNING: No targets were specified, so 0 hosts scanned.
Nmap done: 0 IP addresses (0 hosts up) scanned in 10.34 seconds

correct

+ docker-compose -f docker-compose.pxe.yml exec -T pxe nmap --script broadcast-dhcp-discover
Starting Nmap 7.80 ( https://nmap.org ) at 2022-07-07 21:12 UTC
Pre-scan script results:
| broadcast-dhcp-discover:
|   Response 1 of 1:
|     IP Offered: 10.127.127.11
|     DHCP Message Type: DHCPOFFER
|     Server Identifier: 10.127.127.3
|     IP Address Lease Time: 5m00s
|_    Subnet Mask: 255.255.255.0
Nmap done: 0 IP addresses (0 hosts up) scanned in 0.20 seconds
WARNING: No targets were specified, so 0 hosts scanned.
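Note the failing run uses Nmap 7.92 and the working one 7.80, so this may be a tooling change rather than a DHCP server problem. Until the root cause is found, a retry loop in CI could contain the flakiness (a sketch):

for attempt in 1 2 3 4 5; do
    out=$(docker-compose -f docker-compose.pxe.yml exec -T pxe \
        nmap --script broadcast-dhcp-discover)
    echo "$out" | grep -q DHCPOFFER && break
    echo "no DHCPOFFER on attempt $attempt, retrying..." >&2
    sleep 2
done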

Make integration tests faster

We currently build containers for each push, but perhaps we should separate out those steps and push containers to GHCR (specifically the SPDK container) to make things faster. Worth discussing on this issue how we want to do this.

They currently take about 20 minutes.

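A sketch of the pull-instead-of-build approach (the storage/ build context is an assumption):

# reuse the prebuilt image when available; build only as a fallback
docker pull ghcr.io/opiproject/opi-spdk:main || \
    docker build -t ghcr.io/opiproject/opi-spdk:main storage/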

`spdk/build/examples/perf` fails to connect

this works

 $  docker-compose -f docker-compose.xpu.yml exec xpu-spdk /root/spdk/build/examples/identify -r 'traddr:10.127.127.4 trtype:TCP adrfam:IPv4 trsvcid:4420'
TELEMETRY: No legacy callbacks, legacy socket not created
[2022-07-08 00:17:29.364643] nvme_fabric.c: 180:nvme_fabric_prop_get_cmd_sync: *ERROR*: Property Get failed
[2022-07-08 00:17:29.383634] nvme_fabric.c: 180:nvme_fabric_prop_get_cmd_sync: *ERROR*: Property Get failed
=====================================================
NVMe over Fabrics controller at 10.127.127.4:4420: nqn.2014-08.org.nvmexpress.discovery
=====================================================
Controller Capabilities/Features
================================
Vendor ID:                             0000
Subsystem Vendor ID:                   0000
Serial Number:                         ....................
Model Number:                          ........................................
Firmware Version:                      22.05
Recommended Arb Burst:                 0
IEEE OUI Identifier:                   00 00 00
Multi-path I/O
  May have multiple subsystem ports:   No
  May have multiple controllers:       No
  Associated with SR-IOV VF:           No
Max Data Transfer Size:                131072
Max Number of Namespaces:              0
NVMe Specification Version (VS):       1.3
NVMe Specification Version (Identify): 1.3
Maximum Queue Entries:                 128
Contiguous Queues Required:            Yes
Arbitration Mechanisms Supported
  Weighted Round Robin:                Not Supported
  Vendor Specific:                     Not Supported
Reset Timeout:                         15000 ms
Doorbell Stride:                       4 bytes
NVM Subsystem Reset:                   Not Supported
Command Sets Supported
  NVM Command Set:                     Supported
Boot Partition:                        Not Supported
Memory Page Size Minimum:              4096 bytes
Memory Page Size Maximum:              4096 bytes
Persistent Memory Region:              Not Supported
Optional Asynchronous Events Supported
  Namespace Attribute Notices:         Not Supported
  Firmware Activation Notices:         Not Supported
128-bit Host Identifier:               Not Supported

Controller Memory Buffer Support
================================
Supported:                             No

Persistent Memory Region Support
================================
Supported:                             No

Admin Command Set Attributes
============================
Security Send/Receive:                 Not Supported
Format NVM:                            Not Supported
Firmware Activate/Download:            Not Supported
Namespace Management:                  Not Supported
Device Self-Test:                      Not Supported
Directives:                            Not Supported
NVMe-MI:                               Not Supported
Virtualization Management:             Not Supported
Doorbell Buffer Config:                Not Supported
Abort Command Limit:                   1
Async Event Request Limit:             4
Number of Firmware Slots:              N/A
Firmware Slot 1 Read-Only:             N/A
Firmware Update Granularity:           No Information Provided
Per-Namespace SMART Log:               No
Asymmetric Namespace Access Log Page:  Not Supported
Command Effects Log Page:              Not Supported
Get Log Page Extended Data:            Supported
Telemetry Log Pages:                   Not Supported
Error Log Page Entries Supported:      128
Keep Alive:                            Not Supported

NVM Command Set Attributes
==========================
Submission Queue Entry Size
  Max:                       1
  Min:                       1
Completion Queue Entry Size
  Max:                       1
  Min:                       1
Number of Namespaces:        0
Compare Command:             Not Supported
Write Uncorrectable Command: Not Supported
Dataset Management Command:  Not Supported
Write Zeroes Command:        Not Supported
Set Features Save Field:     Not Supported
Reservations:                Not Supported
Timestamp:                   Not Supported
Copy:                        Not Supported
Volatile Write Cache:        Not Present
Atomic Write Unit (Normal):  1
Atomic Write Unit (PFail):   1
Atomic Compare & Write Unit: 1
Fused Compare & Write:       Supported
Scatter-Gather List
  SGL Command Set:           Supported
  SGL Keyed:                 Supported
  SGL Bit Bucket Descriptor: Not Supported
  SGL Metadata Pointer:      Not Supported
  Oversized SGL:             Not Supported
  SGL Metadata Address:      Not Supported
  SGL Offset:                Supported
  Transport SGL Data Block:  Not Supported
Replay Protected Memory Block:  Not Supported

Firmware Slot Information
=========================
Active slot:                 0


Error Log
=========

Active Namespaces
=================
Discovery Log Page
==================
Generation Counter:                    1
Number of Records:                     1
Record Format:                         0

Discovery Log Entry 0
----------------------
Transport Type:                        3 (TCP)
Address Family:                        1 (IPv4)
Subsystem Type:                        2 (NVM Subsystem)
Transport Requirements:
  Secure Channel:                      Not Required
Port ID:                               0 (0x0000)
Controller ID:                         65535 (0xffff)
Admin Max SQ Size:                     128
Transport Service Identifier:          4420
NVM Subsystem Qualified Name:          nqn.2016-06.io.spdk:cnode1
Transport Address:                     0.0.0.0                                                                                                                                               

this fails

$  docker-compose -f docker-compose.xpu.yml exec xpu-spdk /root/spdk/build/examples/perf -r 'traddr:10.127.127.4 trtype:TCP adrfam:IPv4 trsvcid:4420' -c 0x1 -q 1 -o 4096 -w randread -t 10
TELEMETRY: No legacy callbacks, legacy socket not created
Initializing NVMe Controllers
[2022-07-08 00:11:12.091713] posix.c: 526:posix_sock_create: *ERROR*: connect() failed, errno = 111
[2022-07-08 00:11:12.091777] nvme_tcp.c:1941:nvme_tcp_qpair_connect_sock: *ERROR*: sock connection error of tqpair=0x178a550 with addr=0.0.0.0, port=4420
[2022-07-08 00:11:12.091796] nvme_tcp.c:2148:nvme_tcp_ctrlr_construct: *ERROR*: failed to create admin qpair
[2022-07-08 00:11:12.091811] nvme.c: 705:nvme_ctrlr_probe: *ERROR*: Failed to construct NVMe controller for SSD: 0.0.0.0
No valid NVMe controllers or AIO or URING devices found
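Note that the discovery log entry above advertises Transport Address 0.0.0.0, and the perf errors show the follow-up connection going to addr=0.0.0.0, port=4420. identify succeeds because it only talks to the discovery controller. An untested workaround sketch (assumes perf accepts subnqn in the -r transport ID; the NQN is taken from the discovery output above): skip discovery and connect to the subsystem directly:

docker-compose -f docker-compose.xpu.yml exec xpu-spdk \
    /root/spdk/build/examples/perf \
    -r 'trtype:TCP adrfam:IPv4 traddr:10.127.127.4 trsvcid:4420 subnqn:nqn.2016-06.io.spdk:cnode1' \
    -c 0x1 -q 1 -o 4096 -w randread -t 10

The proper fix is probably to configure the NVMe-oF listener with the externally reachable address instead of 0.0.0.0.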

spdk target exits when running on system without AVX 512

$ docker run -it --rm --privileged -v /dev/hugepages:/dev/hugepages -v /dev/shm:/dev/shm -v /proc:/proc ghcr.io/opiproject/opi-spdk:main bash
[root@3c9db4b50106 ~]# sync; echo 1 > /proc/sys/vm/drop_caches  && \
            echo 1024 > /proc/sys/vm/nr_hugepages && \
            grep "" /sys/kernel/mm/hugepages/hugepages-*/nr_hugepages
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages:0
/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages:1024
[root@3c9db4b50106 ~]# /usr/local/bin/spdk_tgt -m 0x1 -s 512 --no-pci
Illegal instruction (core dumped)

The VMs we use to build SPDK and push to GHCR have AVX-512, so the binaries are compiled with AVX-512 instructions. If you then run the container on a system without AVX-512, spdk_tgt dies with an illegal instruction.
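To check whether a host supports AVX-512:

grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u

A likely fix (an assumption about SPDK's build options): build the published image for a conservative instruction set, e.g. via SPDK configure's --target-arch flag, so the binaries run on CPUs without AVX-512.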

I think `ghcr.io/opiproject/opi-spdk:main` is still not used

In the logs I saw SPDK being re-built... need to deep-dive on this.
I made the package public here https://github.com/orgs/opiproject/packages/container/opi-spdk/settings
and can download it manually without authentication:

$ docker pull ghcr.io/opiproject/opi-spdk:main
main: Pulling from opiproject/opi-spdk
e1deda52ffad: Already exists
4f4fb700ef54: Pull complete
de940d00a4ca: Pull complete
5f7f428d4e44: Pull complete
2bcaf9bbbfbc: Pull complete
Digest: sha256:3223207ec66095caac36a31527f65f0a58e448ef9c91465a4c3b11902e53af26
Status: Downloaded newer image for ghcr.io/opiproject/opi-spdk:main
ghcr.io/opiproject/opi-spdk:main

but the integration check still seems to rebuild SPDK... why?

create `deploy.sh` script

It takes one argument:

  • emulated (like today: everything runs on our laptop/VM)
  • a real xPU IP address (the script will SSH to the xPU, run the relevant docker-compose there, and run the other relevant docker-compose on our laptop/VM)

sztp requires license agreement and mode selection on startup

see status

integration_sztp_1                sztpd sqlite:///:memory:         Exit 1                                                                                                                                                                                                                                                                                               

and logs

sztp_1               |
sztp_1               | First time initialization.  Please accept the license terms.
sztp_1               |
sztp_1               | By entering "Yes" below, you agree to be bound to the terms and conditions contained on this screen with Watsen Networks.
sztp_1               |
sztp_1               |     sys.exit(main())
sztp_1               |   File "/usr/local/lib/python3.9/site-packages/sztpd/__main__.py", line 5, in main
sztp_1               |     def main(argv=None):A=argparse.ArgumentParser(prog='sztpd',formatter_class=argparse.RawDescriptionHelpFormatter,description='SZTPD implements the "bootstrap server" defined in RFC 8572.',epilog='\nExit status code: 0 on success, non-0 on error.  Error output goes to stderr.\n\nThe "cacert" argument is a filepath to a PEM file that contains one or more X.509\nCA certificates used to authenticate the RDBMS\'s TLS certificate.\n\nThe "key" and "cert" arguments are each a filepath to a PEM file that contains\nthe key and certificate that SZTPD should use to authenticate itself to the\nRDBMS.  These parameters must be specified together, and must be specified\nin conjunction with the "cacert" parameter.\n\nThe "database-url" argument has the form "<dialect>:<dialect-specific-path>".\nThree dialects are supported: "sqlite", "postgresql", and "mysql+pymysql".\nThe dialect-specific-path for each of these is described below.\n\nFor the "sqlite" dialect, <dialect-specific-path> follows the format\n"///<sqlite-path>", where <sqlite-path> can be one of:\n\n  :memory:    - an in-memory database (only useful for testing)\n  <filepath>  - an OS-specific filepath to a persisted database file\n\n  Examples:\n\n    $ sztpd sqlite:///:memory:                      (memory)\n    $ sztpd sqlite:///relative/path/to/sztpd.db     (unix)\n    $ sztpd sqlite:////absolute/path/to/sztpd.db    (unix)\n    $ sztpd sqlite:///C:\\path\\to\\sztpd.db           (windows)\n\nFor both the "postgresql" and "mysql+pymysql" dialects, <dialect-specific-path>\nfollows the format "//<user>[:<passwd>]@<host>:<port>/<database-name>".\n\n  Examples:\n\n    The following two examples assume the database is called "sztpd" and\n    that the database server listens on the loopback address with no TLS.\n\n      $ sztpd ***localhost:3306/sztpd\n      $ sztpd ***localhost:5432/sztpd\n\n\nPlease see the documentation for more information.\n');A.add_argument('-v','--version',help='show version number and exit.',action='version',version=__version__.__version__);A.add_argument('-C','--cacert',help='path to certificates used to authenticate the database (see below for details).');A.add_argument('-c','--cert',help='path to cert used to authenticate SZTPD to the database (see below for details).');A.add_argument('-k','--key',help='path to key used to authenticate SZTPD to the database (see below for details).');A.add_argument('database_url',help='see below for details.',metavar='database-url');B=A.parse_args();return sztpd.run(B.database_url,B.cacert,B.cert,B.key)
sztp_1               |   File "/usr/local/lib/python3.9/site-packages/sztpd/sztpd.py", line 38, in run
sztp_1               |     print('');c=pkg_resources.resource_filename('sztpd','LICENSE.txt');U=open(c,'r');print(U.read());U.close();print('First time initialization.  Please accept the license terms.');print('');print('By entering "Yes" below, you agree to be bound to the terms and conditions contained on this screen with Watsen Networks.');print('');d=input('Please enter "Yes" or "No": ')
sztp_1               | EOFError: EOF when reading a line
sztp_1               | Please enter "Yes" or "No": xpu-cpu-ssh_1        | [s6-init] making user provided files available at /var/run/s6/etc...exited 0.
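A possible workaround sketch (untested; assumes sztpd reads the answer from stdin, and there may be further prompts given the mode selection mentioned above): pipe the acceptance into the command, e.g.

echo "Yes" | sztpd sqlite:///:memory:

or keep stdin attached to the service (stdin_open: true and tty: true in compose) and answer interactively once.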

add `postgresql` to `sztp`

like

volumes:
  sztp_data: {}

services:
  db:
    image: postgres:9.5
    volumes:
      - sztp_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: sztpd-db
      POSTGRES_USER: sztpd-admin
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "sztpd-admin"]
      interval: 30s
      timeout: 30s
      retries: 3

and

   sztpd postgresql://sztpd-admin:secret@db:5432/sztpd-db

move huge pages allocation

SPDK requires hugepages, example:

      sync; echo 1 > /proc/sys/vm/drop_caches
      echo 1024 > /proc/sys/vm/nr_hugepages
      grep "" /sys/kernel/mm/hugepages/hugepages-*/nr_hugepages

but sometimes there is not enough memory, see:

spdk-target_1         | + echo 1024
spdk-target_1         | + grep '' /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
spdk-target_1         | /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages:0
spdk-target_1         | /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages:185
spdk-target_1         | + /usr/local/bin/spdk_tgt -m 0x1 -s 512 --no-pci

Consider moving hugepages setup and verification to a dedicated step in the main workflow:
https://github.com/opiproject/opi-poc/blob/main/.github/workflows/poc-integration.yml#L39
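A sketch of such a step, run on the CI host before docker-compose up (the 1024-page count comes from the existing scripts; whether it is enough is a separate question):

sudo sh -c 'sync; echo 1 > /proc/sys/vm/drop_caches'
sudo sh -c 'echo 1024 > /proc/sys/vm/nr_hugepages'
# fail fast if the kernel could not actually reserve the pages
actual=$(cat /proc/sys/vm/nr_hugepages)
[ "$actual" -ge 1024 ] || { echo "only $actual hugepages allocated" >&2; exit 1; }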

Move CI code into script

Currently, the CI code runs things such as this and this. These require changes to the files in .github/workflows, which makes it harder to track those changes when you're modifying the docker compose files in the integration directory, for example.

This issue tracks the movement of these commands into a script which will live in integration/scripts.

try https://watsen.net/docs/sztpd/0.0.11/admin-guide/#simulator

code

diff --git a/integration/pxe/Dockerfile.sztp b/integration/pxe/Dockerfile.sztp
index 9327663..5843a87 100644
--- a/integration/pxe/Dockerfile.sztp
+++ b/integration/pxe/Dockerfile.sztp
@@ -1,2 +1,10 @@
 FROM python:3.9
 RUN pip install sztpd==0.0.11
+
+# for simulator:
+RUN apt update && apt install -y libyang-tools libxml2-utils
+# FIXME: why ADD is not working here?
+RUN curl -kL https://watsen.net/support/sztpd-simulator-0.0.11.tgz | tar -zxvf - -C /tmp/
+WORKDIR /tmp/sztpd-simulator
+RUN cd pki; make pki; cd -
+RUN ./run-sztpd-test.sh

example

Step 7/7 : RUN ./run-sztpd-test.sh
 ---> Running in 914068a29faa
Temporary directory for output files: /tmp/tmp.S6oypN1bDF

Creating instances...
   ^-- Creating SZTPD instance 1...okay.  (SZTPD instance 1 running with PID 17)
   ^-- Creating SZTPD instance 2...okay.  (SZTPD instance 2 running with PID 18)
   ^-- Creating SZTPD instance 3...okay.  (SZTPD instance 3 running with PID 19)

Giving servers time to startup...

Configuring instances...
   ^-- Configuring SZTPD instance 1...
   ^-- Configuring SZTPD instance 2...
   ^-- Configuring SZTPD instance 3...

Giving servers time to open their ports...

Testing instances...
   ^-- Testing SZTPD instance 1
   ^-- Testing SZTPD instance 2
   ^-- Testing SZTPD instance 3

Running simulator...
  ^-- Getting bootstrapping data...
  ^-- Processing bootstrapping data...
  ^-- Processing redirect information...
  ^-- Getting bootstrapping data from next server...
  ^-- Processing bootstrapping data...
  ^-- Processing redirect information...
  ^-- Getting bootstrapping data from next server...
  ^-- Processing bootstrapping data...
  ^-- Processing onboarding information...
  ^-- Bootstrap complete.
  ^-- rfc8572-agent exited with status code 0

Killing `sztpd` instances...
   ^-- Sending SIGTERM to instance 1 (PID 17)
   ^-- Sending SIGTERM to instance 2 (PID 18)
   ^-- Sending SIGTERM to instance 3 (PID 19)

Giving servers time to shutdown...

Verifying instances are killed...
   ^-- Verifying SZTPD instance 1 killed...okay.
   ^-- Verifying SZTPD instance 2 killed...okay.
   ^-- Verifying SZTPD instance 3 killed...okay.

All done!
Removing intermediate container 914068a29faa
 ---> 368cfe830953
Successfully built 368cfe830953

Deduplicate run-shellcheck.sh

There are currently three identical copies of this file in the repo:
integration/scripts/run-shellcheck.sh
networking/scripts/run-shellcheck.sh
storage/scripts/run-shellcheck.sh
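A consolidation sketch (hypothetical; assumes the copies just run shellcheck over their own directory): keep a single script at the repo root and let it cover everything:

#!/usr/bin/env bash
set -euo pipefail
# lint every tracked shell script in the repo
git ls-files '*.sh' | xargs --no-run-if-empty shellcheck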

`./scripts/deploy.sh` fails on xPU

 $ ./scripts/deploy.sh -m xpu -x 10.246.67.206
Selected mode xpu
Deploying xPU environment
BMC IP address:
Host IP address: 127.0.0.1
xPU IP address: 10.246.67.206
ERROR: Network opi-external declared as external, but could not be found. Please create the network manually using `docker network create opi-external` and try again.
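The fix suggested by the error message itself: create the network before bringing the stack up, e.g. as an idempotent step in deploy.sh:

docker network create opi-external || true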

telegraf inputs.http plugin for SPDK fails

xpu-telegraf_1       | 2022-07-07T20:08:50Z E! [inputs.http] Error in plugin: [url=http://xpu-cpu-ssh:9009]: Post "http://xpu-cpu-ssh:9009": dial tcp 10.127.127.7:9009: connect: connection refused
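Most likely the SPDK JSON-RPC endpoint behind xpu-cpu-ssh:9009 was simply not up yet when telegraf polled it (an assumption; compare the "sleep 5" and healthcheck issues above), i.e. an ordering/readiness problem rather than a telegraf bug.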

bring `Nvidia DOCA SDK` container

Since we are creating a SHIM layer on top of vendor SDKs, compile and run hello world on top of the Nvidia DOCA SDK on an xPU in our POC.

See https://docs.nvidia.com/doca/sdk/container-deployment/index.html

sudo docker pull nvcr.io/nvidia/doca/doca:1.3.0-devel

QEMU-based

https://docs.nvidia.com/doca/sdk/developer-guide/index.html#developing-without-bluefield-dpu

If the development process needs to be done without access to a BlueField DPU, the recommendation is to use a QEMU-based deployment of a container on top of a regular x86 server. The development container for the host will be the same doca:devel image we mentioned previously.

but

While the compilation can be performed on top of the container, testing the compiled software must be done on top of a BlueField DPU. This is because the QEMU environment emulates an aarch64 architecture, but it does not emulate the hardware devices present on the BlueField DPU. Therefore, the tested program will not be able to access the devices needed for its successful execution, thus mandating that the testing is done on top of a physical DPU.

can run it like this

 $  docker run -v /usr/bin/qemu-aarch64-static:/usr/bin/qemu-aarch64-static --rm -it nvcr.io/nvidia/doca/doca:1.3.0-devel ls -l /opt/mellanox/doca
total 4
drwxr-xr-x 1 root root  275 May  8 05:21 applications
drwxr-xr-x 2 root root 4096 May  8 05:21 include
drwxr-xr-x 1 root root   23 May  8 05:17 infrastructure
drwxr-xr-x 1 root root   31 May  8 05:15 lib
drwxr-xr-x 9 root root  137 May  8 05:21 samples
drwxr-xr-x 2 root root   88 May  8 05:15 tools

add healthcheck to spdk dockerfile

in https://github.com/opiproject/opi-poc/blob/main/storage/Dockerfile

HEALTHCHECK CMD curl --fail --insecure --user spdkuser:spdkpass -X POST -H 'Content-Type: application/json' -d '{"id": 1, "method": "bdev_get_bdevs"}' http://localhost:9009 || exit 1

(note: in the compose snippet below the \" escapes are needed because the test is a JSON string; in the Dockerfile shell form they are not)

in addition to what we have in compose

    healthcheck:
      test: ["CMD-SHELL", "curl --fail --insecure --user spdkuser:spdkpass -X POST -H 'Content-Type: application/json' -d '{\"id\": 1, \"method\": \"bdev_get_bdevs\"}' http://localhost:9009 || exit 1"]
      interval: 6s
      retries: 5
      start_period: 20s
      timeout: 10s

check via

 $ docker inspect --format='{{json .State.Health.Status}}' opi-spdk | jq
"healthy"

typo `hig-speed-external`

docker-compose.networks.yml:18:    name: hig-speed-external

vs

docker-compose.networks.yml:17:  high-speed-external:
docker-compose.otel.yml:18:      high-speed-external:
docker-compose.otel.yml:33:      high-speed-external:
docker-compose.otel.yml:37:  high-speed-external:
docker-compose.otel.yml:38:    name: high-speed-external
docker-compose.spdk.yml:17:      high-speed-external:
docker-compose.spdk.yml:42:  high-speed-external:
docker-compose.xpu.yml:39:      high-speed-external:
docker-compose.xpu.yml:223:  high-speed-external:
docker-compose.xpu.yml:224:    name: high-speed-external
docker-compose.yml:18:      high-speed-external:
docker-compose.yml:85:  high-speed-external:
docker-compose.yml:86:    name: high-speed-external
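One-liner fix:

sed -i 's/hig-speed-external/high-speed-external/' docker-compose.networks.yml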

use `linuxserver/openssh-server` instead of installing `openssh-server` in fedora

see https://github.com/opiproject/opi-poc/blob/main/integration/xpu-cpu/Dockerfile

for example from https://hub.docker.com/r/linuxserver/openssh-server

---
version: "2.1"
services:
  openssh-server:
    image: lscr.io/linuxserver/openssh-server:latest
    container_name: openssh-server
    hostname: openssh-server #optional
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Europe/London
      - PUBLIC_KEY=yourpublickey #optional
      - PUBLIC_KEY_FILE=/path/to/file #optional
      - PUBLIC_KEY_DIR=/path/to/directory/containing/_only_/pubkeys #optional
      - PUBLIC_KEY_URL=https://github.com/username.keys #optional
      - SUDO_ACCESS=false #optional
      - PASSWORD_ACCESS=false #optional
      - USER_PASSWORD=password #optional
      - USER_PASSWORD_FILE=/path/to/file #optional
      - USER_NAME=linuxserver.io #optional
    volumes:
      - /path/to/appdata/config:/config
    ports:
      - 2222:2222
    restart: unless-stopped

bring `Intel IPU SDK` container

Since we are creating a SHIM layer on top of vendor SDKs, compile and run hello world on top of the Intel IPU SDK on an xPU in our POC.

what to use when .md reaches its limits?

Trying to create a lab_requirements.md file so we can note down lab requirements as they arise and not lose track. Written in *.md, lab_requirements.md looks like the table below, which is hard to maintain. And if not tables, I imagine there will be other cases where *.md falls short. What is the plan for when that "markdown just won't do" threshold is reached, so we can share and collaborate on more complex documents? For example, should we use a purpose-built requirements web app to track requirements? A Google Doc with its spreadsheet feature? Looking for suggestions here.

| ID | REQUIREMENTS DESCRIPTION | REQUESTED BY | DATE | SLACK CHANNEL | PRIORITY (P1="must have") | ACCEPTANCE CRITERIA | NOTES |
|----|--------------------------|--------------|------|---------------|---------------------------|---------------------|-------|
| 1 | support dynamic provisioning of a broadcast domain | Glimcher, Boris | 2022-06-30 | #testing | P1 | CI able to dynamically 1) spin up layer 3 broadcast domain (subnet) 2) deploy a DHCP server 3) deploy a set of DHCP clients 4) destroy domain after pipeline done | From provisioning perspective - how can we run ZTP agent in containers and test it? with DHCP servers and such... also can be done in containers |

Split out networking in docker compose setup

We currently have a single network for all components. We should expand this to create multiple docker networks and isolate things where it makes sense. Ideally, we would want at least two networks:

  • xPU network for xPU CPU and xPU BMC
  • External network for external components

`otel-gw-collector` is `unhealthy`

$  docker-compose -f docker-compose.yml -f docker-compose.xpu.yml -f docker-compose.otel.yml -f docker-compose.spdk.yml -f docker-compose.pxe.yml ps -a
             Name                            Command                   State                                                          Ports
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
integration_host-bmc-redfish_1    python /usr/src/app/redfis ...   Up (healthy)
integration_host-bmc-ssh_1        /init                            Up (healthy)     0.0.0.0:2208->2222/tcp, 0.0.0.0:8001->8000/tcp
integration_host-bmc_1            /bin/sh -c sleep infinity        Up (healthy)
integration_host-cpu-ssh_1        /init                            Up (healthy)     0.0.0.0:2210->2222/tcp
integration_host-cpu_1            /bin/sh -c sleep infinity        Up (healthy)
integration_otel-gw-collector_1   /otelcol --config=/etc/ote ...   Up (unhealthy)   0.0.0.0:13133->13133/tcp, 0.0.0.0:1888->1888/tcp, 0.0.0.0:4317->4317/tcp, 55678/tcp,
                                                                                    0.0.0.0:55679->55679/tcp, 0.0.0.0:8888->8888/tcp, 0.0.0.0:8889->8889/tcp
integration_pxe_1                 sh -e -u -x -c envsubst <  ...   Up (healthy)     0.0.0.0:8082->8082/tcp
integration_spdk-target_1         sh -c echo 512 > /proc/sys ...   Up (unhealthy)   0.0.0.0:9004->9009/tcp
integration_sztp_1                sztpd --help                     Exit 1
integration_xpu-bmc-redfish_1     python /usr/src/app/redfis ...   Up (healthy)
integration_xpu-bmc-ssh_1         /init                            Up (healthy)     0.0.0.0:2209->2222/tcp, 0.0.0.0:8002->8000/tcp
integration_xpu-bmc_1             /bin/sh -c sleep infinity        Up (healthy)
integration_xpu-cpu-ssh_1         /init                            Up (healthy)     0.0.0.0:2207->2222/tcp, 0.0.0.0:9009->9009/tcp
integration_xpu-cpu_1             /bin/sh -c sleep infinity        Up (healthy)
integration_xpu-spdk_1            sh -c echo 512 > /proc/sys ...   Exit 1
integration_xpu-telegraf_1        /entrypoint.sh telegraf          Up
prometheus                        /bin/prometheus --config.f ...   Up               0.0.0.0:9090->9090/tcp

pkg_resources.DistributionNotFound: The 'pysqlite3' distribution was not found and is required by sztpd

$  docker-compose -f docker-compose.yml -f docker-compose.xpu.yml -f docker-compose.otel.yml -f docker-compose.spdk.yml -f docker-compose.pxe.yml logs sztp
Attaching to integration_sztp_1
sztp_1               | Traceback (most recent call last):
sztp_1               |   File "/usr/local/bin/sztpd", line 33, in <module>
sztp_1               |     sys.exit(load_entry_point('sztpd==0.0.11', 'console_scripts', 'sztpd')())
sztp_1               |   File "/usr/local/bin/sztpd", line 25, in importlib_load_entry_point
sztp_1               |     return next(matches).load()
sztp_1               |   File "/usr/lib64/python3.9/importlib/metadata.py", line 77, in load
sztp_1               |     module = import_module(match.group('module'))
sztp_1               |   File "/usr/lib64/python3.9/importlib/__init__.py", line 127, in import_module
sztp_1               |     return _bootstrap._gcd_import(name[level:], package, level)
sztp_1               |   File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
sztp_1               |   File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
sztp_1               |   File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
sztp_1               |   File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
sztp_1               |   File "<frozen importlib._bootstrap_external>", line 850, in exec_module
sztp_1               |   File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
sztp_1               |   File "/usr/local/lib/python3.9/site-packages/sztpd/__main__.py", line 4, in <module>
sztp_1               |     from .  import __version__,sztpd
sztp_1               |   File "/usr/local/lib/python3.9/site-packages/sztpd/sztpd.py", line 9, in <module>
sztp_1               |     import gc,tracemalloc,os,re,json,base64,signal,asyncio,datetime,functools,pkg_resources
sztp_1               |   File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 3257, in <module>
sztp_1               |     def _initialize_master_working_set():
sztp_1               |   File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 3240, in _call_aside
sztp_1               |     f(*args, **kwargs)
sztp_1               |   File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 3269, in _initialize_master_working_set
sztp_1               |     working_set = WorkingSet._build_master()
sztp_1               |   File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 582, in _build_master
sztp_1               |     ws.require(__requires__)
sztp_1               |   File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 899, in require
sztp_1               |     needed = self.resolve(parse_requirements(requirements))
sztp_1               |   File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 785, in resolve
sztp_1               |     raise DistributionNotFound(req, requirers)
sztp_1               | pkg_resources.DistributionNotFound: The 'pysqlite3' distribution was not found and is required by sztpd

and

$  docker-compose -f docker-compose.yml -f docker-compose.xpu.yml -f docker-compose.otel.yml -f docker-compose.spdk.yml -f docker-compose.pxe.yml ps -a
             Name                            Command                   State                                                          Ports
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
integration_host-bmc-redfish_1    python /usr/src/app/redfis ...   Up (healthy)
integration_host-bmc-ssh_1        /init                            Up               0.0.0.0:2208->2222/tcp, 0.0.0.0:8001->8000/tcp
integration_host-bmc_1            /bin/sh -c sleep infinity        Up
integration_host-cpu-ssh_1        /init                            Up               0.0.0.0:2210->2222/tcp
integration_host-cpu_1            /bin/sh -c sleep infinity        Up
integration_otel-gw-collector_1   /otelcol --config=/etc/ote ...   Up (unhealthy)   0.0.0.0:13133->13133/tcp, 0.0.0.0:1888->1888/tcp, 0.0.0.0:4317->4317/tcp, 55678/tcp,
                                                                                    0.0.0.0:55679->55679/tcp, 0.0.0.0:8888->8888/tcp, 0.0.0.0:8889->8889/tcp
integration_pxe_1                 sh -e -u -x -c envsubst <  ...   Up (healthy)     0.0.0.0:8082->8082/tcp
integration_spdk-target_1         sh -c echo 512 > /proc/sys ...   Up (unhealthy)   0.0.0.0:9004->9009/tcp
integration_sztp_1                sztpd --help                     Exit 1
integration_xpu-bmc-redfish_1     python /usr/src/app/redfis ...   Up (healthy)
integration_xpu-bmc-ssh_1         /init                            Up               0.0.0.0:2209->2222/tcp, 0.0.0.0:8002->8000/tcp
integration_xpu-bmc_1             /bin/sh -c sleep infinity        Up
integration_xpu-cpu-ssh_1         /init                            Up               0.0.0.0:2207->2222/tcp, 0.0.0.0:9009->9009/tcp
integration_xpu-cpu_1             /bin/sh -c sleep infinity        Up
integration_xpu-spdk_1            sh -c echo 512 > /proc/sys ...   Exit 1
integration_xpu-telegraf_1        /entrypoint.sh telegraf          Up
prometheus                        /bin/prometheus --config.f ...   Up               0.0.0.0:9090->9090/tcp

pathfinding `cyborg`
