
cnf-certification-test's Introduction

Red Hat Best Practices Test Suite for Kubernetes


Objective

To provide a set of test cases for Containerized Network Functions/Cloud Native Functions (CNFs) to verify whether best practices for deployment on Red Hat OpenShift clusters are followed.

  • The test suite can be run standalone (after compiling the Go code) or as a container.

  • The full documentation is published here. Please contact us if the documentation is broken.

  • The catalog of all the available test cases can be found here.

Demo

Target Audience

  • Partner
  • Developer

Technical Pre-requisites for Running the Test Suite

  • OCP or Kubernetes Cluster
  • Docker or Podman (if running the container-based version)

Pre-requisites for Topics Covered

  • Knowledge of Kubernetes
  • OpenShift Container Platform
  • Kubernetes Operator

Linters for the Codebase

License

Red Hat Best Practices Test Suite for Kubernetes is copyright Red Hat, Inc. and available under an Apache 2 license.

cnf-certification-test's People

Contributors

aabughosh, bnshr, dependabot[bot], edcdavid, fredericlepied, github-actions[bot], greyerof, hamadise, javierpena, jc-rh, jmontesi, ramperher, rdavid, sebrandon1, shaior, shimritproj, shirmoran, step-security-bot, theemj


cnf-certification-test's Issues

observability-pod-disruption-budget is listed as "Mandatory" in CATALOG.md but is skipped when run.

Relevant portion of claim.json:

            "-classname": "CNF Certification Test Suite",
            "-name": "[It] observability observability-pod-disruption-budget [common, telco, observability-pod-disruption-budget, observability]",                                        
            "-status": "skipped",
            "-time": "0.000136341",
            "skipped": {
              "-message": "skipped - Test skipped because there are no []v1.PodDisruptionBudget to test, please check under test labels"                                                  
            },
            "system-err": "\u003e Enter [BeforeEach] observability - /usr/tnf/tnf-src/cnf-certification-test/observability/suite.go:42 @ 05/19/23 15:33:45.613\n\u003c Exit [BeforeEach] observability - /usr/tnf/tnf-src/cnf-certification-test/observability/suite.go:42 @ 05/19/23 15:33:45.613 (0s)\n\u003e Enter [It] observability-pod-disruption-budget - /usr/tnf/tnf-src/cnf-certification-test/observability/suite.go:66 @ 05/19/23 15:33:45.613\n[SKIPPED] Test skipped because there are no []v1.PodDisruptionBudget to test, please check under test labels\nIn [It] at: /usr/tnf/tnf-src/pkg/testhelper/testhelper.go:130 @ 05/19/23 15:33:45.613\n\u003c Exit [It] observability-pod-disruption-budget - /usr/tnf/tnf-src/cnf-certification-test/observability/suite.go:66 @ 05/19/23 15:33:45.613 (0s)\n\u003e Enter [ReportAfterEach] observability - /usr/tnf/tnf-src/cnf-certification-test/observability/suite.go:45 @ 05/19/23 15:33:45.613\n\u003c Exit [ReportAfterEach] observability - /usr/tnf/tnf-src/cnf-certification-test/observability/suite.go:45 @ 05/19/23 15:33:45.613 (0s)"                                                                      

CATALOG.md lists this test as "Mandatory", so we end up in a situation where a "Mandatory" test is being skipped.

If the check only validates the minAvailable and maxUnavailable values on any existing PodDisruptionBudget objects (and having no objects is acceptable), is there a way to write the test so that it passes in that case? If a PodDisruptionBudget is required to exist, can this test fail rather than be skipped?

If my understanding or expectations aren't correct, let me know.

Autodiscover fails to detect Operator pods

Hi,

I'm working on testing a CNF that is deployed with an Operator. I wanted the Operator tests to be run against that Operator, but the test suite reports Discovered Operators: 0.
I have added a label in the tnf config under operatorsUnderTestLabels.

operatorsUnderTestLabels:
  - "app:operator-to-be-tested"
  - "test-network-function.com/operator:target"

[autodiscover.go: 140] parsed operators under test labels: [{LabelKey:test-network-function.com/operator LabelValue:target} {LabelKey:test-network-function.com/operator LabelValue:}]
[provider.go: 282] Operators found: 0
I clearly pointed at an Operator and verified that the CSV is installed and that the pod labels identify the right pod, but the test suite does not pick up the Operator.
TNF version: v4.2.4

networking-iptables test failing in OCP 4.12

Hi team.

Coming here to report an issue with the networking-iptables test in OCP 4.12. My impression is that OCP 4.12 may be creating some iptables rules in the pods deployed in the cluster (perhaps because of the security policies that were updated starting in this OCP version), and that is triggering the failure in this test.

Here you have a DCI job example on OCP 4.12:

https://www.distributed-ci.io/jobs/476e4771-b682-4c36-8da7-58d15630a08e/tests/f6e71ac0-d4e8-4b2a-a083-9baf46ec4a3c

Non-compliant iptables config on: container: test pod: test-648dc964f9-blg7d ns: test-cnf log: # Generated by iptables-save v1.8.4 on Sat Nov 12 06:22:59 2022
*filter
:INPUT ACCEPT [122:9784]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [119:9713]
-A FORWARD -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -d 169.254.169.254/32 -p tcp -m tcp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -d 169.254.169.254/32 -p udp -m udp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -d 169.254.169.254/32 -p tcp -m tcp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -d 169.254.169.254/32 -p udp -m udp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
COMMIT
# Completed on Sat Nov 12 06:22:59 2022

Non-compliant iptables config on: container: test pod: test-648dc964f9-svpdf ns: test-cnf log: # Generated by iptables-save v1.8.4 on Sat Nov 12 06:22:59 2022
*filter
:INPUT ACCEPT [241:19318]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [240:19455]
-A FORWARD -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -d 169.254.169.254/32 -p tcp -m tcp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -d 169.254.169.254/32 -p udp -m udp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -d 169.254.169.254/32 -p tcp -m tcp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -d 169.254.169.254/32 -p udp -m udp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
COMMIT
# Completed on Sat Nov 12 06:22:59 2022

Non-compliant iptables config on: container: test pod: test-0 ns: production-cnf log: # Generated by iptables-save v1.8.4 on Sat Nov 12 06:23:00 2022
*filter
:INPUT ACCEPT [180:14454]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [181:14647]
-A FORWARD -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -d 169.254.169.254/32 -p tcp -m tcp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -d 169.254.169.254/32 -p udp -m udp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -d 169.254.169.254/32 -p tcp -m tcp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -d 169.254.169.254/32 -p udp -m udp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
COMMIT
# Completed on Sat Nov 12 06:23:00 2022

Non-compliant iptables config on: container: test pod: test-1 ns: production-cnf log: # Generated by iptables-save v1.8.4 on Sat Nov 12 06:23:00 2022
*filter
:INPUT ACCEPT [90:7236]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [90:7305]
-A FORWARD -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -d 169.254.169.254/32 -p tcp -m tcp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -d 169.254.169.254/32 -p udp -m udp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -d 169.254.169.254/32 -p tcp -m tcp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -d 169.254.169.254/32 -p udp -m udp ! --dport 53 -j REJECT --reject-with icmp-port-unreachable
COMMIT
# Completed on Sat Nov 12 06:23:00 2022

Non-compliant []*provider.Container: [container: test pod: test-648dc964f9-blg7d ns: test-cnf container: test pod: test-648dc964f9-svpdf ns: test-cnf container: test pod: test-0 ns: production-cnf container: test pod: test-1 ns: production-cnf]

Note this is not happening in previous OCP versions.

Also, remember that we're using exactly the quay.io/testnetworkfunction/cnf-test-partner:latest image for the pods deployed.

Nodes using latest OCP 4.13 (and 4.14) are based on CentOS Stream CoreOS, so platform-alteration-ocp-node-os-lifecycle fails

As of today, the latest OCP 4.13 (and OCP 4.14) versions use CentOS Stream CoreOS on the OCP nodes, so the platform-alteration-ocp-node-os-lifecycle test fails because it does not consider this a valid OS version.

Example of DCI job where this is happening:

https://www.distributed-ci.io/jobs/a229f9d1-0438-47d5-acc9-e77e4751c521/tests/37d1534f-3332-40d5-b7f3-8b8616126f8e

> Enter [BeforeEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:49 @ 03/15/23 01:26:36.856
< Exit [BeforeEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:49 @ 03/15/23 01:26:36.856 (0s)
> Enter [It] platform-alteration-ocp-node-os-lifecycle - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:144 @ 03/15/23 01:26:36.856
STEP: Testing the control-plane and workers in the cluster for Operating System compatibility - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:490 @ 03/15/23 01:26:36.856
Master Node worker-1 has been found to be running an incompatible operating system: CentOS Stream CoreOS 413.92.202303061740-0 (Plow)
Master Node worker-2 has been found to be running an incompatible operating system: CentOS Stream CoreOS 413.92.202303061740-0 (Plow)
Master Node worker-3 has been found to be running an incompatible operating system: CentOS Stream CoreOS 413.92.202303061740-0 (Plow)
Master Node master-0 has been found to be running an incompatible operating system: CentOS Stream CoreOS 413.92.202303061740-0 (Plow)
Master Node master-1 has been found to be running an incompatible operating system: CentOS Stream CoreOS 413.92.202303061740-0 (Plow)
Master Node master-2 has been found to be running an incompatible operating system: CentOS Stream CoreOS 413.92.202303061740-0 (Plow)
Master Node worker-0 has been found to be running an incompatible operating system: CentOS Stream CoreOS 413.92.202303061740-0 (Plow)
Number of control plane nodes running non-RHCOS based operating systems: 7
[FAILED] Number of control plane nodes running non-RHCOS based operating systems: 7
In [It] at: /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:571 @ 03/15/23 01:26:36.857
< Exit [It] platform-alteration-ocp-node-os-lifecycle - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:144 @ 03/15/23 01:26:36.857 (0s)
> Enter [ReportAfterEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:52 @ 03/15/23 01:26:36.857
< Exit [ReportAfterEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:52 @ 03/15/23 01:26:36.857 (0s)

Starts happening with:

I believe the platform-alteration suite has to be updated so that the testNodeOperatingSystemStatus method, when using the IsRHCOS method, also covers the CentOS Stream case; a minimal sketch of that kind of check is shown below.
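
For illustration only, here is a Go sketch of the kind of OS-image check involved, assuming a simple prefix match on the node's osImage string; the suite's real IsRHCOS helper lives in its provider package and may work differently:

package main

import (
	"fmt"
	"strings"
)

// isNodeOSCompatible is a hypothetical helper: it classifies a node's osImage
// string and illustrates the extra prefix that CentOS Stream CoreOS nodes
// would need in order to be accepted.
func isNodeOSCompatible(osImage string) bool {
	return strings.HasPrefix(osImage, "Red Hat Enterprise Linux CoreOS") ||
		strings.HasPrefix(osImage, "CentOS Stream CoreOS")
}

func main() {
	fmt.Println(isNodeOSCompatible("CentOS Stream CoreOS 413.92.202303061740-0 (Plow)")) // true
}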

Unable to test CRDs

Trying to manually run the CNF tests fails because the suite is unable to scale the CRDs.

[root@dci-server cnf-certification-test]# ./run-tnf-container.sh -t  /tmp/tnf_config.d -o /tmp/tnf.out.d -i quay.io/testnetworkfunction/cnf-certification-test:v4.2.4 -f -s -l common,extended
Performing KUBECONFIG autodiscovery                                                                                                                                                           
Performing DOCKERCFG autodiscovery                                                                                                                                                             
-t /tmp/tnf_config.d                                                                                                                                                                          
-o /tmp/tnf.out.d                                                                                                                                                                             
-i quay.io/testnetworkfunction/cnf-certification-test:v4.2.4                                                                                                                                  
-f                                                                                                                                                                                             
-s                                                                                                                                                                                             
-l common,extended                                                                                                                                                                            
Kubeconfig Autodiscovery: configuration loaded from $KUBECONFIG            

[....snip....]

 INFO   [May 18 18:49:16.345][autodiscover_operators.go: 123]  CSV name: xxxxxx-operator.v2.23.1 (ns: openshift)                                                                                
INFO   [May 18 18:49:16.345][autodiscover_operators.go: 123]  CSV name: xxxxxx-operator.v2.23.1 (ns: xxxxxx)                                                                                    
INFO   [May 18 18:49:16.345][autodiscover_operators.go: 123]  CSV name: xxxxxx-operator.v2.23.1 (ns: test-s3int)                                                                               
FATAL  [May 18 18:49:17.539][autodiscover_scaleObject.go: 66] error while getting the scaling fields directives.xxxxxx.xxxx.com "payload-coredns" not found                               
+ RESULT=1
+ set +o pipefail
+ exit 1
+ html_output
+ '[' -f /usr/tnf/claim/claim.json ']'
+ echo -n 'var initialjson='
+ cat /usr/tnf/claim/claim.json
+ cp /usr/tnf/script/results.html /usr/tnf/claim

The relevant part of the config:

targetCrdFilters:
  - nameSuffix: "xxxx.xxxxx.com"
    scalable: false

The object ultimately under test is a DaemonSet and is therefore not scalable by normal means. The fact that the suite seems to be locating the directives CRD implies the targetCrdFilters is configured properly. I've redacted some partner names, but I can communicate more explicitly through a side channel.

DeploymentConfig must be considered in deployment-scaling and pod-owner-type tests.

Using the latest release, v4.1.0, replicas inside the DeploymentConfig resource are not being considered.

lifecycle-deployment-scaling was skipped with the log:
Test skipped because there are no []*provider.Deployment to test
even though the application has multiple replicas defined.

lifecycle-pod-owner-type failed with the log:
Tests that CNF Pod(s) are deployed as part of a ReplicaSet(s)/StatefulSet(s)
even though the application has ReplicaSets defined from the DeploymentConfig resource.

Stuck in loop checking CPU scheduling classification when trying to run performance suite.

When running either 4.2.3 or 4.2.4, I get stuck in the following loop whenever I enable the performance suite:

INFO   [May 17 23:50:36.219][scheduling.go: 120] Checking the scheduling policy/priority in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx for pid=368 
INFO   [May 17 23:50:36.438][scheduling.go: 133] pid 368 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx has the cpu scheduling policy SCHED_FIFO, scheduling priority 99 
  pid=368 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx with cpu scheduling policy=SCHED_FIFO, priority=%!s(int=99) did not satisfy cpu scheduling requirements
INFO   [May 17 23:50:36.438][scheduling.go: 120] Checking the scheduling policy/priority in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx for pid=369 
INFO   [May 17 23:50:36.649][scheduling.go: 133] pid 369 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx has the cpu scheduling policy SCHED_FIFO, scheduling priority 99 
  pid=369 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx with cpu scheduling policy=SCHED_FIFO, priority=%!s(int=99) did not satisfy cpu scheduling requirements
INFO   [May 17 23:50:36.649][scheduling.go: 120] Checking the scheduling policy/priority in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx for pid=370 
INFO   [May 17 23:50:36.848][scheduling.go: 133] pid 370 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx has the cpu scheduling policy SCHED_OTHER, scheduling priority 0 
  pid=370 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx with cpu scheduling policy=SCHED_OTHER, priority=%!s(int=0) satisfies cpu scheduling requirements
INFO   [May 17 23:50:36.848][scheduling.go: 120] Checking the scheduling policy/priority in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx for pid=372 
INFO   [May 17 23:50:37.056][scheduling.go: 133] pid 372 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx has the cpu scheduling policy SCHED_OTHER, scheduling priority 0 
  pid=372 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx with cpu scheduling policy=SCHED_OTHER, priority=%!s(int=0) satisfies cpu scheduling requirements
INFO   [May 17 23:50:37.056][scheduling.go: 120] Checking the scheduling policy/priority in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx for pid=373 
INFO   [May 17 23:50:37.261][scheduling.go: 133] pid 373 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx has the cpu scheduling policy SCHED_OTHER, scheduling priority 0 
  pid=373 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx with cpu scheduling policy=SCHED_OTHER, priority=%!s(int=0) satisfies cpu scheduling requirements
INFO   [May 17 23:50:37.261][scheduling.go: 120] Checking the scheduling policy/priority in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx for pid=374 
INFO   [May 17 23:50:37.475][scheduling.go: 133] pid 374 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx has the cpu scheduling policy SCHED_FIFO, scheduling priority 99 
  pid=374 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx with cpu scheduling policy=SCHED_FIFO, priority=%!s(int=99) did not satisfy cpu scheduling requirements
INFO   [May 17 23:50:37.475][scheduling.go: 120] Checking the scheduling policy/priority in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx for pid=375 
INFO   [May 17 23:50:37.685][scheduling.go: 133] pid 375 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx has the cpu scheduling policy SCHED_FIFO, scheduling priority 99 
  pid=375 in container: xxx-xxxxxx pod: xxx-xxxxxx-xxxxxxx-xxxxx-2qwv5 ns: xxxxxx with cpu scheduling policy=SCHED_FIFO, priority=%!s(int=99) did not satisfy cpu scheduling requirements

And then it just goes on like that (it will run until interrupted; I've let it run for close to half an hour before) until I issue an oc delete ds -n cnf-suite to delete the test pods, which causes the execution to kick forward (likely with additional errors, I'd presume). I redacted the namespace and pod names for the sake of confidentiality.

Tainted kernel node test fails if Intel VT-d for directed I/O is enabled

On the New Mexico lab master nodes (master-1 and master-2), the OS reported tainted kernels when Intel VT-d for directed I/O was enabled.
Disabling the option made the taint value return to 0 in /proc/sys/kernel/tainted.
I can reproduce the issue by enabling VT-d for directed I/O; however, there are two settings for Intel VT-d, one under the I/O configuration and one under the CPU configuration, and enabling the option under the I/O configuration creates a kernel taint.

Potentially uncordon nodes prior to starting pod-recreation tests

In the case of a "Ready,SchedulingDisabled" node, we might be encountering a leftover from a previous run. Should we check for these nodes prior to the run so we can uncordon them properly?

Or maybe add an "After" section to the Ginkgo suite that uncordons all nodes upon an unexpected exit or panic; a minimal sketch of such a cleanup step is shown below.
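
A minimal sketch of such a pre-run (or post-run) cleanup, assuming direct use of client-go and the default kubeconfig location; the suite's own client plumbing may differ:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]
		if !node.Spec.Unschedulable {
			continue
		}
		// Clear the cordon left over from a previous, interrupted run.
		node.Spec.Unschedulable = false
		if _, err := client.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
		fmt.Printf("uncordoned node %s\n", node.Name)
	}
}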

Some tests are linked to a different test suite

Just raising this issue in case it is a mistake.

While checking a DCI job testing the cnf-certification-test main branch, I noticed some tests that do not appear to be defined within the corresponding test suite, in theory:

  • [It] lifecycle manageability-containers-image-tag [extanded, manageability-containers-image-tag] -> should be manageability test suite?
  • [It] lifecycle access-control-requests-and-limits [common, access-control-requests-and-limits] -> should be access-control test suite?
  • [It] networking access-control-network-policy-deny-all [common, access-control-network-policy-deny-all] -> should be access-control test suite?

lifecycle-cpu-isolation and lifecycle-affinity-required-pods don't appear to be matching pods

Relevant portion of the TNF config (CNF name redacted, can provide whatever is needed side channel):

podsUnderTestLabels:
  - "app.kubernetes.io/name:cnfname-topology"
  - "app.kubernetes.io/name:cnfname-directive"
  - "cnfname-directive:cnfname-directive-payload-coredns"
  - "cnfname-directive:cnfname-directive-topology"
  - "app:nat"
  - "app:cnfname-operator"

And there are pods with requests/limits configured in a way that seems like it would cause the test to fail (mismatched, not using whole CPUs, etc.):

[root@dci-server cnf-certification-test]# oc get pods -l cnfname-directive=cnfname-directive-payload-coredns
NAME                                    READY   STATUS    RESTARTS      AGE
cnfname-directive-payload-coredns-2fsgz   1/1     Running   0             21h
cnfname-directive-payload-coredns-5g6xg   1/1     Running   0             21h
cnfname-directive-payload-coredns-5p2tf   1/1     Running   0             21h
cnfname-directive-payload-coredns-5tj9d   1/1     Running   1 (17h ago)   21h
cnfname-directive-payload-coredns-m4sbd   1/1     Running   0             21h
cnfname-directive-payload-coredns-vz489   1/1     Running   0             21h
cnfname-directive-payload-coredns-znkhh   1/1     Running   0             21h

[root@dci-server cnf-certification-test]# oc get pods cnfname-directive-payload-coredns-znkhh -o yaml | grep -A6 resources:
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: 250m
        memory: 250Mi

but that aren't found when the test suite runs:

          {
            "-classname": "CNF Certification Test Suite",
            "-name": "[It] lifecycle lifecycle-cpu-isolation [common, telco, lifecycle-cpu-isolation, lifecycle]",
            "-status": "skipped",
            "-time": "0.000523242",
            "skipped": {
              "-message": "skipped - Test skipped because there are no []*provider.Pod to test, please check under test labels"
            },
            "system-err": "\u003e Enter [BeforeEach] lifecycle - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:55 @ 05/19/23 17:19:27.076\n\u003c Exit [BeforeEach] lifecycle - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:55 @ 05/19/23 17:19:27.076 (0s)\n\u003e Enter [It] lifecycle-cpu-isolation - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:167 @ 05/19/23 17:19:27.076\n[SKIPPED] Test skipped because there are no []*provider.Pod to test, please check under test labels\nIn [It] at: /usr/tnf/tnf-src/pkg/testhelper/testhelper.go:130 @ 05/19/23 17:19:27.076\n\u003c Exit [It] lifecycle-cpu-isolation - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:167 @ 05/19/23 17:19:27.076 (1ms)\n\u003e Enter [ReportAfterEach] lifecycle - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:58 @ 05/19/23 17:19:27.076\n\u003c Exit [ReportAfterEach] lifecycle - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:58 @ 05/19/23 17:19:27.076 (0s)"
          },

The above is for lifecycle-cpu-isolation, but it is also relevant for lifecycle-affinity-required-pods: these same pods have affinity rules but not the affinity-related labels or annotations mentioned in the CATALOG.md description, which seems like it should cause a failure rather than a skip.

Issues launching debug daemonset in OCP 4.12 related to security constraints

Today, we launched the first OCP 4.12 daily job and tested some workloads on top of it, including the workloads we use to test the CNF Cert Suite. You can see that job here.

Digging into the tnf results, we observed a lot of unexpected results (some tests failed or were skipped when they should have passed), all of them caused by the debug daemonset misbehaving.

Taking a look at the events that happened in the cluster (they can be seen here, in the events.txt file), we can see these issues related to the debug daemonset:

default                                            32m         Warning   FailedCreate                                      daemonset/debug                                                       Error creating: pods "debug-rtcj7" is forbidden: violates PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
openshift-kube-controller-manager                  32m         Normal    CreatedSCCRanges                                  pod/kube-controller-manager-master-2                                  created SCC ranges for tnf namespace
default                                            32m         Warning   FailedCreate                                      daemonset/debug                                                       Error creating: pods "debug-8l6pg" is forbidden: violates PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
default                                            32m         Warning   FailedCreate                                      daemonset/debug                                                       Error creating: pods "debug-dpkrr" is forbidden: violates PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
default                                            32m         Warning   FailedCreate                                      daemonset/debug                                                       Error creating: pods "debug-2rqvh" is forbidden: violates PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
default                                            32m         Warning   FailedCreate                                      daemonset/debug                                                       Error creating: pods "debug-ghjw9" is forbidden: violates PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
default                                            32m         Warning   FailedCreate                                      daemonset/debug                                                       Error creating: pods "debug-rf2l8" is forbidden: violates PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
default                                            32m         Warning   FailedCreate                                      daemonset/debug                                                       Error creating: pods "debug-cg58w" is forbidden: violates PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
default                                            32m         Warning   FailedCreate                                      daemonset/debug                                                       Error creating: pods "debug-kk22c" is forbidden: violates PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
default                                            32m         Warning   FailedCreate                                      daemonset/debug                                                       Error creating: pods "debug-nnhnd" is forbidden: violates PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
default                                            5m2s        Warning   FailedCreate                                      daemonset/debug                                                       (combined from similar events): Error creating: pods "debug-r5rsl" is forbidden: violates PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

My impression is that it is hitting some security restrictions introduced in OCP 4.12. There are several changes in terms of security policies in OCP 4.12 (e.g. https://connect.redhat.com/en/blog/important-openshift-changes-pod-security-standards), and I suppose the debug daemonset specification (or the namespace it runs in) should be reviewed to deal with them; one possible mitigation is sketched below.
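
As a hedged illustration of one possible mitigation (assuming the debug daemonset keeps its privileged settings and runs in its own namespace, here called "tnf" only as a placeholder), the namespace could be labeled so Pod Security Admission enforces the privileged profile instead of restricted. A client-go sketch:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// debugNamespace is a hypothetical namespace name for the debug daemonset.
const debugNamespace = "tnf"

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Label the namespace so Pod Security Admission allows the privileged debug pods.
	patch := []byte(`{"metadata":{"labels":{` +
		`"pod-security.kubernetes.io/enforce":"privileged",` +
		`"pod-security.kubernetes.io/audit":"privileged",` +
		`"pod-security.kubernetes.io/warn":"privileged"}}}`)
	if _, err := client.CoreV1().Namespaces().Patch(context.TODO(), debugNamespace,
		types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("namespace labeled for privileged pod security")
}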

platform-alteration-ocp-node-os-lifecycle test fails if OCP RC versions are detected

Hi team!
If you check this DCI job: https://www.distributed-ci.io/jobs/00c7beb3-1c61-43e4-90bd-76307b8bee9d/files (see dci-tnf-execution.log for the tnf execution), the platform-alteration-ocp-node-os-lifecycle test is failing because, from what I can see, it is not able to correctly handle an OCP RC version:

------------------------------
platform-alteration platform-alteration-ocp-node-os-lifecycle [common, platform-alteration-ocp-node-os-lifecycle, platform-alteration]
/usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:144
  STEP: Testing the control-plane and workers in the cluster for Operating System compatibility @ 04/01/23 05:34:04.03
DEBUG  [Apr  1 05:34:04.030][suite.go: 499] There are 7 nodes to process for Operating System compatibility.
DEBUG  [Apr  1 05:34:04.030][suite.go: 505] Node master-0 is running operating system: Red Hat Enterprise Linux CoreOS 413.92.202303281804-0 (Plow)
DEBUG  [Apr  1 05:34:04.030][suite.go: 505] Node master-1 is running operating system: Red Hat Enterprise Linux CoreOS 413.92.202303281804-0 (Plow)
DEBUG  [Apr  1 05:34:04.030][suite.go: 505] Node master-2 is running operating system: Red Hat Enterprise Linux CoreOS 413.92.202303281804-0 (Plow)
DEBUG  [Apr  1 05:34:04.031][suite.go: 505] Node worker-0 is running operating system: Red Hat Enterprise Linux CoreOS 413.92.202303281804-0 (Plow)
DEBUG  [Apr  1 05:34:04.031][suite.go: 534] Comparing RHCOS shortVersion: 4.13.0-rc.2 to openshiftVersion: 4.13.0-rc.2
  Node worker-0 has been found to be running an incompatible version of RHCOS: 4.13.0-rc.2
DEBUG  [Apr  1 05:34:04.031][suite.go: 505] Node worker-1 is running operating system: Red Hat Enterprise Linux CoreOS 413.92.202303281804-0 (Plow)
DEBUG  [Apr  1 05:34:04.032][suite.go: 534] Comparing RHCOS shortVersion: 4.13.0-rc.2 to openshiftVersion: 4.13.0-rc.2
  Node worker-1 has been found to be running an incompatible version of RHCOS: 4.13.0-rc.2
DEBUG  [Apr  1 05:34:04.032][suite.go: 505] Node worker-2 is running operating system: Red Hat Enterprise Linux CoreOS 413.92.202303281804-0 (Plow)
DEBUG  [Apr  1 05:34:04.032][suite.go: 534] Comparing RHCOS shortVersion: 4.13.0-rc.2 to openshiftVersion: 4.13.0-rc.2
  Node worker-2 has been found to be running an incompatible version of RHCOS: 4.13.0-rc.2
DEBUG  [Apr  1 05:34:04.032][suite.go: 505] Node worker-3 is running operating system: Red Hat Enterprise Linux CoreOS 413.92.202303281804-0 (Plow)
DEBUG  [Apr  1 05:34:04.032][suite.go: 534] Comparing RHCOS shortVersion: 4.13.0-rc.2 to openshiftVersion: 4.13.0-rc.2
  Node worker-3 has been found to be running an incompatible version of RHCOS: 4.13.0-rc.2
  Number of worker nodes running non-RHCOS or non-RHEL based operating systems: 4
  [FAILED] in [It] - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:595 @ 04/01/23 05:34:04.033
• [FAILED] [0.002 seconds]
platform-alteration [It] platform-alteration-ocp-node-os-lifecycle [common, platform-alteration-ocp-node-os-lifecycle, platform-alteration]
/usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:144

OCP version used for this deployment is OCP 4.13.0-rc.2

I tried to check the code but was not able to detect the root cause of the problem. Can you please help with this? A purely illustrative sketch of the kind of version normalization that might be involved follows.
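
Purely as an assumption-laden illustration (not the confirmed root cause): if the supported-version lookup expects release versions without a pre-release suffix, a value such as 4.13.0-rc.2 would never match even when the RHCOS and OpenShift versions are identical, as the logs above show. A hypothetical normalization sketch in Go:

package main

import (
	"fmt"
	"strings"
)

// stripPreRelease is a hypothetical helper that drops a pre-release suffix
// (e.g. "-rc.2") before any lookup against a table of supported versions.
func stripPreRelease(version string) string {
	if i := strings.IndexByte(version, '-'); i != -1 {
		return version[:i]
	}
	return version
}

func main() {
	fmt.Println(stripPreRelease("4.13.0-rc.2")) // 4.13.0
	fmt.Println(stripPreRelease("4.13.0"))      // 4.13.0
}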

When using service-mesh in the pods under test, several tests failed because of istio-proxy container

Hi all,

With this change in dci-openshift-app-agent, I am testing the inclusion of a service mesh (using Aspenmesh) in the pods under test, in order to execute and validate the platform-alteration-service-mesh-usage test.

I can confirm the test is working fine and passing. However, several tests are now failing because of the istio-proxy container being deployed in the pods under test. These are the following:

  • access-control-one-process-per-container
  • observability-termination-policy
  • networking-undeclared-container-ports-usage
  • lifecycle-container-shutdown
  • lifecycle-liveness
  • lifecycle-pod-scheduling
  • lifecycle-statefulset-scaling
  • platform-alteration-isredhat-release

DCI job where you can see an example of execution.

Having said that, do you think it is feasible to omit the istio-proxy container from the tests where containers are checked (because, for example, we can't do anything about lifecycle-statefulset-scaling, I suppose)? I mean, while executing these tests, if the container name is istio-proxy, just do not run the checks on that container; a minimal sketch of that kind of filter is shown below. What's your opinion on this?
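
A minimal sketch of that kind of filter, using the upstream corev1.Container type purely for illustration (the suite wraps containers in its own provider.Container type, so a real change would live there):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// istioProxyName is the conventional name of the sidecar container injected by
// Istio-based meshes such as Aspenmesh.
const istioProxyName = "istio-proxy"

// skipIstioSidecars returns the containers that should still be evaluated,
// dropping the injected service-mesh sidecar by name.
func skipIstioSidecars(containers []corev1.Container) []corev1.Container {
	var out []corev1.Container
	for _, c := range containers {
		if c.Name == istioProxyName {
			continue
		}
		out = append(out, c)
	}
	return out
}

func main() {
	pod := []corev1.Container{{Name: "app"}, {Name: istioProxyName}}
	fmt.Println(len(skipIstioSidecars(pod))) // 1
}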

[OCP 4.12] platform-alteration-tainted-node-kernel test is always failing

Hi all.

Just reporting this here to make you aware of this recurrent issue we have observed in OCP 4.12 daily deployments.

When running the workload that is being tested by tnf in these OCP 4.12 deployments, the test called platform-alteration-tainted-node-kernel is currently failing. Here you can see examples:

This is the error message offered by tnf:

Please note that taints other than 'module was loaded' were found on node master-X.
Taint causing failure: auxiliary taint, defined for and used by distros on node: master-X

This is the command executed by tnf to check the kernel taints present on a node. I could confirm this by running the same command directly on the affected node:

[core@master-2 ~]$ cat /proc/sys/kernel/tainted
65536

The taint only appears in OCP 4.12 deployments, regardless of the cluster where the OCP 4.12 job was executed. It does not affect other OCP versions. For example, this job was executed on cluster6 with OCP 4.10 on the same day as the failure on OCP 4.12/cluster6, and you can see that no taint is registered.
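
For reference, the reported value decodes to a single taint bit; a small Go sketch that expands the bitmask read from /proc/sys/kernel/tainted (bit names abbreviated from the kernel's tainted-kernels documentation, only a few entries listed):

package main

import "fmt"

// Subset of kernel taint bits, from Documentation/admin-guide/tainted-kernels.rst.
var taintBitNames = map[uint]string{
	0:  "proprietary module was loaded",
	12: "externally-built ('out-of-tree') module was loaded",
	16: "auxiliary taint, defined for and used by distros",
}

// decodeTaints expands the bitmask read from /proc/sys/kernel/tainted.
func decodeTaints(mask uint64) []string {
	var reasons []string
	for bit := uint(0); bit < 64; bit++ {
		if mask&(1<<bit) == 0 {
			continue
		}
		name, ok := taintBitNames[bit]
		if !ok {
			name = fmt.Sprintf("unknown taint bit %d", bit)
		}
		reasons = append(reasons, name)
	}
	return reasons
}

func main() {
	// 65536 == 1<<16, i.e. the distro-defined auxiliary taint reported above.
	fmt.Println(decodeTaints(65536))
}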

I'm opening this here to discuss with you how we should handle this issue. I believe it's not tnf-related at all, but maybe the kernel behavior in OCP 4.12 is changing. Then, I would like to confirm with you how to escalate this.

Even when skipping the test with the -s option, the platform-alteration-service-mesh-usage test is executed if Istio is installed

Hi all!

I have detected a potential issue with the platform-alteration-service-mesh-usage test in our Dallas lab while testing tnf v4.0.2 in our daily jobs. The cluster deployed is OCP 4.7.55 with the Verizon WebScale 1.3.5 stack installed, so Aspenmesh (Istio) is running.

In our case, as you know, we are testing your partner pods in our debug jobs, and as we are not configuring service mesh on them, we are always skipping this platform-alteration test.

However, in this DCI job, you can see the test was executed and, of course, failed, because the pods were not using the service mesh (which is expected).

If you look at the execution.log file here, which is the output of the tnf execution, you can see that the test was marked to be skipped with the -s argument:

-k /var/lib/dci-pipeline/tnf-test-cnf/6a725d48-9977-45e5-85ae-7022cde13f94/inputs/kubeconfig
-i cnf-certification-test:v4.0.2
-t /tmp/ansible.poashszg/cnf-certification-test/cnf-certification-test
-o /tmp/ansible.poashszg/cnf-certification-test/cnf-certification-test
-f operator-install-source operator-install-status-no-privileges operator-install-status-succeeded platform-alteration observability lifecycle networking access-control 
-s platform-alteration-service-mesh-usage 
-s networking-dual-stack-service 

In the same link, you can also find the claim.json file generated from this execution.

So my impression is that there may be something wrong in the check for whether Istio is installed, such that the test is executed in that case even though it was marked to be skipped. Could that be it?

Testing of PodDisruptionBudgets makes tnf unusable with OCP <= 4.7.x

I'm opening this issue just to warn you that tnf runs on OCP versions lower than or equal to 4.7 are not working: when trying to detect the PodDisruptionBudget resources, the suite fails because this resource is not available in these OCP versions. See an example here: https://www.distributed-ci.io/jobs/fdc5b4f2-3c98-45b4-9607-5af707b67c73/files, in the execution.log file:

FATAL  [Dec 17 12:42:58.508][autodiscover.go: 140] Cannot get pod disruption budgets, error: the server could not find the requested resource

I mention this because, in the docs (https://test-network-function.github.io/cnf-certification-test/test-spec/#cnf-specific-tests), the minimum OCP version is supposed to be 4.6.0. It should probably be increased to 4.8.0. Also, take into account that 4.7 is EOL.

Bug in v4.1.4 tnf version and tnfGitCommit in claim.json

Hi,

The Test Network Function result parser and claim.json show incorrect information in two fields in v4.1.4 when reading the tnf version and tnfGitCommit.

Values in Versions tab:
tnf: Unreleased build post
tnfGitCommit: (empty)

This was found executing the test suite in the prebuilt container, executing for example:

./run-tnf-container.sh -i quay.io/testnetworkfunction/cnf-certification-test:v4.1.4 -o results -t config -l "manageability"

The same happens if the image is pulled by digest or pulled locally, so the problem seems internal to v4.1.4.

Possible conflicts between lifecycle-cpu-isolation and lifecycle-pod-scheduling conditions to pass

While testing the new test called lifecycle-cpu-isolation, included in #434, I've found some incompatibilities between this test and the lifecycle-pod-scheduling one.

I've been testing lifecycle-cpu-isolation with DCI, thanks to this change. I managed to fulfill the conditions needed to pass the test, as you can see in this DCI job. However, as a consequence of these modifications in the pods under test, lifecycle-pod-scheduling is now failing.

I think the problem is the following:

  • In order to pass lifecycle-cpu-isolation test, the pods under test need to include a runtimeClassName, among other requirements.
  • In our OCP clusters, we already define a default RuntimeClass like this, which is created as a consequence of a PerformanceProfile definition:
$ oc get runtimeclass -o json
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "node.k8s.io/v1",
            "handler": "high-performance",
            "kind": "RuntimeClass",
            "metadata": {
                "creationTimestamp": "2022-09-04T06:35:43Z",
                "name": "performance-cnf-basic-profile",
                "ownerReferences": [
                    {
                        "apiVersion": "performance.openshift.io/v2",
                        "blockOwnerDeletion": true,
                        "controller": true,
                        "kind": "PerformanceProfile",
                        "name": "cnf-basic-profile",
                        "uid": "6015b5e4-3f2e-4355-b8f7-7ed1b2129f09"
                    }
                ],
                "resourceVersion": "39865",
                "uid": "3cf9e8b8-7d10-4284-8c4a-95bd5615a914"
            },
            "scheduling": {
                "nodeSelector": {
                    "node-role.kubernetes.io/worker": ""
                }
            }
        }
    ],
    "kind": "List",
    "metadata": {
        "resourceVersion": ""
    }
}
  • The problem is that the RuntimeClass usually defines a nodeSelector for scheduling the pods, like the example above, and that is exactly the cause of the failure in the lifecycle-pod-scheduling test.

So, my question is, how can we deal with this kind of issue?

platform-alteration-base-image test not working in latest OCP 4.12 (previously, also in OCP 4.13, but now it's working)

Hi team.

Since last week, we've been experiencing issues with the platform-alteration-base-image test in the latest OCP 4.13, starting from the OpenShift 4.13 nightly 2023-01-11 12:35 version, and it also fails in newer OCP 4.13 nightly releases.

If you check this DCI job as example, you can see that the error output is the following:

> Enter [BeforeEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:50 @ 01/12/23 06:14:58.893
< Exit [BeforeEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:50 @ 01/12/23 06:14:58.893 (0s)
> Enter [It] platform-alteration-base-image - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:56 @ 01/12/23 06:14:58.893
container: test pod: test-648dc964f9-94lfr ns: test-cnf - error while running fs-diff: can't execute command on container: command terminated with exit code 125: 
container: test pod: test-648dc964f9-hcf8x ns: test-cnf - error while running fs-diff: can't execute command on container: command terminated with exit code 125: 
container: test pod: test-0 ns: production-cnf - error while running fs-diff: can't execute command on container: command terminated with exit code 125: 
container: test pod: test-1 ns: production-cnf - error while running fs-diff: can't execute command on container: command terminated with exit code 125: 
Containers were unable to run fs-diff: [test test test test]
[FAILED] Containers were unable to run fs-diff.
In [It] at: /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:204 @ 01/12/23 06:14:59.38
< Exit [It] platform-alteration-base-image - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:56 @ 01/12/23 06:14:59.38 (488ms)
> Enter [ReportAfterEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:53 @ 01/12/23 06:14:59.38
< Exit [ReportAfterEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:53 @ 01/12/23 06:14:59.381 (0s)

A correct execution should look like this one:

> Enter [BeforeEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:50 @ 01/11/23 11:34:57.268
< Exit [BeforeEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:50 @ 01/11/23 11:34:57.268 (0s)
> Enter [It] platform-alteration-base-image - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:56 @ 01/11/23 11:34:57.268
< Exit [It] platform-alteration-base-image - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:56 @ 01/11/23 11:34:58.145 (877ms)
> Enter [ReportAfterEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:53 @ 01/11/23 11:34:58.145
< Exit [ReportAfterEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:53 @ 01/11/23 11:34:58.145 (0s)

In particular, the code that prints the error is this one. It looks like the command you execute cannot be run in the pods.

Do you think it is related to the new OCP version and that tnf code must be updated to handle this?

--

Update and follow up (7th March 2023)

It looks like the test is not failing in OCP 4.13 anymore, but it is failing in OCP 4.12. Here's an example of latest OCP 4.12 (OpenShift 4.12 nightly 2023-03-03 04:31):

> Enter [BeforeEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:49 @ 03/06/23 02:28:23.201
< Exit [BeforeEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:49 @ 03/06/23 02:28:23.201 (0s)
> Enter [It] platform-alteration-base-image - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:55 @ 03/06/23 02:28:23.201
container: test pod: test-648dc964f9-69pxw ns: test-cnf - error while running fs-diff: can not execute command on container: command terminated with exit code 125: 
container: test pod: test-648dc964f9-r9fdr ns: test-cnf - error while running fs-diff: can not execute command on container: command terminated with exit code 125: 
container: test pod: test-0 ns: production-cnf - error while running fs-diff: can not execute command on container: command terminated with exit code 125: 
container: test pod: test-1 ns: production-cnf - error while running fs-diff: can not execute command on container: command terminated with exit code 125: 
Containers were unable to run fs-diff: [test test test test]
[FAILED] Containers were unable to run fs-diff.
In [It] at: /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:228 @ 03/06/23 02:28:23.669
< Exit [It] platform-alteration-base-image - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:55 @ 03/06/23 02:28:23.669 (468ms)
> Enter [ReportAfterEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:52 @ 03/06/23 02:28:23.669
< Exit [ReportAfterEach] platform-alteration - /usr/tnf/tnf-src/cnf-certification-test/platform/suite.go:52 @ 03/06/23 02:28:23.669 (0s)

Node Taints such as "NoSchedule" causes the debug daemonset to not be deployed

For OpenShift clusters where node taints are required by some other applications running on the cluster where the CNF testing is planned, test execution stops because of the taints.

We need a workaround for cases where node taints cannot be removed, so that test execution does not stop; one possible approach is sketched below.
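
One possible approach, assuming the taints must stay in place, is to give the debug daemonset a catch-all toleration so it can still be scheduled onto tainted (e.g. NoSchedule) nodes; a minimal sketch using the upstream corev1 types:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// A toleration with Operator "Exists" and no key matches every taint,
	// so a pod spec carrying it can be scheduled onto tainted nodes without
	// the taints being removed.
	tolerateAll := corev1.Toleration{Operator: corev1.TolerationOpExists}
	podSpec := corev1.PodSpec{
		Tolerations: []corev1.Toleration{tolerateAll},
	}
	fmt.Printf("%+v\n", podSpec.Tolerations)
}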

Test suite execution for v4.2.3 and v4.2.4 fails due to not parsing the config properly

Hi,

I encountered an issue where the test suite is unable to load the config:

INFO   [May 18 17:55:51.344][utils.go: 50] Loading config from file: /usr/tnf/config/tnf_config.yml
FATAL  [May 18 17:55:51.344][provider.go: 202] Cannot load configuration file: yaml: unmarshal errors:
  line 6: cannot unmarshal !!seq into map[string]string

I followed the syntax as per the latest release and here's a DCI job for the test run:
https://www.distributed-ci.io/jobs/14ccf215-b9ba-4fec-b02c-bf3a88db20ca

Also, I was able to get a successful run when I used the deprecated format for tnf_config.yaml. Here's a successful run:
https://www.distributed-ci.io/jobs/346a1ce7-1ede-499c-a935-8c2d400665f5/jobStates?sort=date
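
For context, that error is the generic message gopkg.in/yaml.v3 produces when a YAML sequence is fed into a Go field declared as map[string]string. A minimal reproduction with a hypothetical struct (the field name is only illustrative; the real config struct lives in the suite's provider package):

package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// config is a hypothetical stand-in for the suite's configuration struct,
// with a map-typed field that the YAML below feeds a sequence instead.
type config struct {
	PodsUnderTestLabels map[string]string `yaml:"podsUnderTestLabels"`
}

func main() {
	data := []byte(`
podsUnderTestLabels:
  - "app:operator-to-be-tested"
`)
	var c config
	if err := yaml.Unmarshal(data, &c); err != nil {
		// Expected to print something like:
		// yaml: unmarshal errors:
		//   line 3: cannot unmarshal !!seq into map[string]string
		fmt.Println(err)
	}
}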

Catalog.md should indicate which test falls under "extended test cases"

The extended test set is a new suite of test cases introduced in the v4.1 release; whether to include these extended test cases in the execution of the CNF Certification Test Suite can be configured in tnf_config.yml.

Correspondingly, the catalog.md file should reflect which test cases fall under "extended test set", the execution of which is configurable via tnf_config.yml.
