xcat2 / confluent
xCAT confluent - replacement of conserver and eventually xcatd
License: Apache License 2.0
Is it possible for RPMs for the latest release of confluent (1.5.0) to be added to the confluent repo: http://xcat.org/files/confluent/
On a flat network you may not have a default gateway.
If you attempt to configure a device like a Lenovo D2 chassis SMM, confluent will set the correct IP address and subnet mask but leave the existing default gateway unchanged.
Example:
smm001: bmc.ipv4_address: 172.30.23.98/20
smm001: bmc.ipv4_method: Static
smm001: bmc.ipv4_gateway: 192.168.0.1
It would be useful to set the gateway to nothing if we do not have one.
If you set a blank gateway:
nodegroupattrib everything net.mgt.ipv4_gateway=
restarting confluent will cause all discoveries to fail:
Jan 24 16:24:12 {"error": "Error encountered trying to set up smm001, [Errno -2] Name or service not known"}
I expect this is because confluent expects a valid gateway IP address and does not handle a blank value.
Maybe we are doing something stupid, but for consistency it is useful not to have mixed IP network ranges on the SMM.
As a work-around we can set the gateway to something on the network (e.g. the management server).
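One way the blank-gateway case could be handled is to normalise an empty attribute value to "no gateway" before any address parsing or name resolution is attempted. The following is only a minimal sketch; `parse_gateway` is a hypothetical helper and not part of confluent.

```python
import ipaddress


def parse_gateway(value):
    """Return an IP address object for a configured gateway, or None if unset.

    Hypothetical helper, not confluent's actual code: a blank or missing
    attribute value is treated as "no default gateway" instead of being
    passed on to name resolution.
    """
    if value is None or not value.strip():
        return None
    # Raises ValueError for anything that is not a literal IP address
    return ipaddress.ip_address(value.strip())
```

A caller could then skip the gateway setting entirely (or clear it on the device) when `parse_gateway` returns `None`.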
It would be useful to add an option on the nodediscover list command to report devices that have not been discovered yet.
This change should take into account that another confluent server may have already discovered the device (e.g. when using multiple xCAT service nodes as confluent servers on the same network segment).
For large-scale cluster discovery this would be useful.
confluent version:
lenovo-confluent-0.7.1-1.noarch
confluent_server-1.7.2-1.noarch
confluent_client-1.7.2-1.noarch
If a user wants to install or upgrade confluent independently of xCAT, the confluent RPM packages do not declare all of the prerequisite RPM packages that are needed.
For example, in confluent 1.7.1 the following prerequisite RPM is required; otherwise things do not work:
python-pyghmi
Other rpms that may be required from the Lenovo confluent/xCAT bundle:
python-eventlet
python-greenlet
python2-pyte
Confluent version tested on:
lenovo-confluent-0.7.1-1.noarch
confluent_server-1.7.2-1.noarch
confluent_client-1.7.2-1.noarch
the cluster: 821 compute nodes
cat /noderange/everything/sensors/hardware/energy/all
02c02n04: sensors=[
{
"name": "DC Energy",
"state_ids": [],
"value": 30.58287145277778,
"states": [],
"health": "ok",
"units": "kWh",
"type": "Energy"
}
]
r14c01n11: sensors=[
{
"name": "DC Energy",
"state_ids": [],
"value": 55.889863706944446,
"states": [],
"health": "ok",
"units": "kWh",
"type": "Energy"
}
]
...
The command blocks here for a very long time.
The following commands do not have man pages:
nodedefine
nodegroupdefine
Not sure if these are needed.
Where found:
lenovo-confluent-0.7.1-1.noarch
confluent_server-1.7.2-1.noarch
confluent_client-1.7.2-1.noarch
The consoles.html example is not working due to a cookie issue. When a new console is opened by clicking the button, it does not display any text, only an empty console; httpapi.py returns an internal server error (500) and logs the following error:
httpsessions[authorized['sessionid']]['inflight'].add(mythreadid)
KeyError: 'sessionid'
The reason is that the confluentsessionid cookie sent by the server after the first POST request is not included in subsequent POST requests when initiating the console session.
This can be fixed by removing line 293 in httpapi.py:
cookie['confluentsessionid']['path'] = '/'
As the XMLHttpRequest does not use / as its path, the browser won't send the cookie. Removing the line mentioned above resolves the issue, and consoles work as expected.
regards
Pascal
Hi again, I've found another non-existent endpoint that does not return any error (as in issue #112):
/nodes/nodename/configuration/management_controller/users/2/ThisEndpointDoesNotExist
I don't know whether there are more endpoints like this. I was wondering if there is a way to generalize the 404 handling. For example, the application could keep a list of every possible endpoint and return a NotFound error for any request that is not in this list. I don't know whether that is appropriate for your case.
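The generalized 404 handling suggested above could be sketched roughly as a walk over a tree of known path components. Everything here is hypothetical: `KNOWN_TREE`, the `'*'` wildcard convention, and `resolve` are illustrative names, and the real confluent resource tree is larger and partly dynamic.

```python
# Hypothetical resource tree; '*' stands for "any node name".
KNOWN_TREE = {
    'nodes': {
        '*': {
            'power': {'state': {}},
            'health': {'hardware': {}},
        },
    },
}


def resolve(path, tree=KNOWN_TREE):
    """Return 200 if every path component is known, otherwise 404."""
    current = tree
    for part in path.strip('/').split('/'):
        if part in current:
            current = current[part]
        elif '*' in current:
            # Wildcard level, e.g. an arbitrary node name
            current = current['*']
        else:
            return 404
    return 200
```

With this shape, an unknown trailing component such as `ThisEndpointDoesNotExist` falls through to the 404 branch instead of being echoed back as a valid resource.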
Hi @jjohnson42,
I am very interested in confluent, but I cannot find any documentation for it.
How can I get started with confluent?
The confluent discovery policy can currently be set to: manual, open or permissive.
(see https://hpc.lenovo.com/users/documentation/confluentdisco.html).
Confluent has no error checking to ensure that the user sets this value correctly.
Example:
nodeattrib node001 discovery.policy
node001: discovery.policy: permissive
nodeattrib node001 discovery.policy=apple
node001: apple
nodeattrib node001 discovery.policy
node001: discovery.policy: apple
We should improve the code to contain error checking for confluent settings to prevent the user entering incorrect values in those database fields where fixed values are expected.
This specific example is for the discovery.policy setting but a more extensive review should be done for all database fields to improve user experience.
Code level tested on:
lenovo-confluent-0.7.1-1.noarch
confluent_server-1.7.2-1.noarch
confluent_client-1.7.2-1.noarch
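The kind of check requested for discovery.policy could look roughly like the sketch below. It is not confluent's implementation: `ALLOWED_POLICIES` and `validate_policy` are hypothetical names, the allowed set is assumed from the three values named in the issue plus `pxe` (which appears in nodeattrib output elsewhere on this page), and accepting a comma-separated list is likewise an assumption.

```python
# Assumed set of valid values: the three named in the issue plus 'pxe',
# which appears in nodeattrib output elsewhere on this page.
ALLOWED_POLICIES = frozenset({'manual', 'open', 'permissive', 'pxe'})


def validate_policy(value, allowed=ALLOWED_POLICIES):
    """Return the normalised policy string or raise ValueError.

    Hypothetical helper: accepts a comma-separated list and rejects any
    element that is not in the allowed set.
    """
    parts = [p.strip() for p in value.split(',') if p.strip()]
    bad = [p for p in parts if p not in allowed]
    if bad:
        raise ValueError(
            'invalid discovery.policy value(s): ' + ', '.join(bad))
    return ','.join(parts)
```

Calling `validate_policy('apple')` raises `ValueError`, so the bad value would be rejected before it reaches the database.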
During parallel system firmware updates of 36 nodes, one step failed: an IOError was reported for most nodes, and a few were stuck in initialising.
The operation had to be interrupted with Ctrl-C as it was not making progress.
It seemed that confluent had just rotated the stderr log file, and we saw an error trace like this:
/var/log/confluent/stderr:
May 15 10:44:13 {"previouslogfile": "/var/log/confluent/stderr.2018-05-13"}
May 15 10:44:13 File "/usr/lib64/python2.7/threading.py", line 814, in __bootstrap_inner
(self.name, _format_exc())): Exception in thread Thread-288:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/oem/lenovo/imm.py", line 57, in run
self.rsp = self.wc.upload(self.url, self.filename, self.data)
File "/usr/lib/python2.7/site-packages/pyghmi/util/webclient.py", line 143, in upload
data = open(filename, 'rb')
IOError: [Errno 2] No such file or directory: u'/software/fw-thinksystem-sr630-sr650-lxpm/lnvgy_fw_lnvgy_fw_drvln_pdl212s-1.20_anyos_noarch.uxz'
May 15 10:44:13 File "/usr/lib64/python2.7/threading.py", line 814, in __bootstrap_inner
(self.name, _format_exc())): Exception in thread Thread-289:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/oem/lenovo/imm.py", line 57, in run
self.rsp = self.wc.upload(self.url, self.filename, self.data)
File "/usr/lib/python2.7/site-packages/pyghmi/util/webclient.py", line 143, in upload
data = open(filename, 'rb')
IOError: [Errno 2] No such file or directory: u'/software/fw-thinksystem-sr630-sr650-lxpm/lnvgy_fw_lnvgy_fw_drvln_pdl212s-1.20_anyos_noarch.uxz'
The nodefirmware command would consistently fail in the same manner if you repeated the attempt to update the firmware.
A restart of the confluent service (systemctl restart confluent) clears the issue.
Confluent version:
lenovo-confluent-0.8.1-1.noarch
confluent_client-1.8.2-1.noarch
confluent_server-1.8.2-1.noarch
On a single node the BMC password was changed.
Trying to query the node should fail but the error message is not user-friendly:
nodehealth nodeb
Unexpected error
Would it be possible to make the error more meaningful, for example by indicating that a connection to the BMC device could not be established?
confluent version:
confluent_client-1.8.2-1.noarch
lenovo-confluent-0.8.1-1.noarch
confluent_server-1.8.2-1.noarch
/var/log/confluent/trace:
Jun 01 14:23:36 Traceback (most recent call last):
File "/opt/confluent/lib/python/confluent/sockapi.py", line 120, in sessionhdl
connection, request, cfm, authdata, authname, skipauth)
File "/opt/confluent/lib/python/confluent/sockapi.py", line 187, in process_request
send_response(hdlr, connection)
File "/opt/confluent/lib/python/confluent/sockapi.py", line 144, in send_response
for rsp in responses:
File "/opt/confluent/lib/python/confluent/plugins/hardwaremanagement/ipmi.py", line 308, in perform_requests
configdata = cfg.get_node_attributes(nodes, _configattributes)
File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 926, in get_node_attributes
decrypt=decrypt)
File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 498, in _decode_attribute
retdict['value'] = decrypt_value(nodeobj[attribute]['cryptvalue'])
File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 222, in decrypt_value
raise Exception("bad HMAC value on crypted value")
Exception: bad HMAC value on crypted value
Hi @jjohnson42
When I use netboot=xnba, on some OSes and x86_64 servers it fails to load the kernel and initrd correctly. So my question is: what is the difference between the xnba elilo-x64.efi and bootx64.efi? I think xnba is a good bootloader; if I want to use xnba with bootx64.efi, do you have any suggestions?
When we discover an SMM we do not populate the id.model and id.serial attributes in the confluent database. Example:
[root@cresco6mg1 ~]# nodeattrib smm01
smm01: console.logging: full
smm01: console.method: ipmi
smm01: discovery.policy: permissive,pxe
smm01: groups: smm,rack01-smm,everything
smm01: hardwaremanagement.manager: smm01
smm01: pubkeys.tls_hardwaremanager: string
smm01: secret.hardwaremanagementpassword: ********
smm01: secret.hardwaremanagementuser: ********
However, when we discover a node this information is populated:
[root@cresco6mg1 ~]# nodeattrib nodex001
nodex001: console.logging: full
nodex001: console.method: ipmi
nodex001: discovery.policy: permissive,pxe
nodex001: enclosure.bay: 1
nodex001: enclosure.manager: smm01
nodex001: groups: ipmi,compute,rack01,everything
nodex001: hardwaremanagement.manager: nodex001-xcc
nodex001: id.model: 7X21CTO1WW
nodex001: id.serial: ABC1234
nodex001: id.uuid: string
nodex001: net.switch: switch01
nodex001: net.switchport: 1
nodex001: pubkeys.tls_hardwaremanager: string
nodex001: secret.hardwaremanagementpassword: ********
nodex001: secret.hardwaremanagementuser: ********
It would be nice to have this feature added.
Work-around is to get the information using the nodeattrib command.
If you run "nodefirmware node-range update firmware-image" you will see a status like this when the firmware update is complete:
lnode01:pending: 100% lnode02:pending: 100% xcat1:pending: 100%
xcat2:pending: 100% xcat3:pending: 100% xcat4:pending: 100%
The state should not be "pending" when the update is successful.
Even though the update is "100%" the "pending" status is slightly confusing to the user.
Consider changing this status to report "success", "completed" or another appropriate label to make it clear to the user that the update is finished.
Confluent version:
lenovo-confluent-0.8.1-1.noarch
confluent_client-1.8.2-1.noarch
confluent_server-1.8.2-1.noarch
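The relabelling suggested for the nodefirmware status could be sketched as below. This assumes that "pending" at 100% really does mean the update has finished, which the issue implies but does not confirm; `display_phase` and `format_status` are hypothetical names and not part of the confluent client.

```python
def display_phase(phase, percent):
    """Return the label to show for one node's update progress.

    Hypothetical helper: a 'pending' phase that has reached 100% is
    shown as 'complete'; every other combination is shown unchanged.
    """
    if phase == 'pending' and percent >= 100:
        return 'complete'
    return phase


def format_status(node, phase, percent):
    """Format one status cell in the style of the output shown above."""
    return '{0}:{1}: {2}%'.format(node, display_phase(phase, percent), percent)
```

Keeping the mapping in one small function would make it easy to pick a different final label, such as "success", without touching the progress display.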
If we attempt to reseat a node using nodereseat and the SMM is not contactable it would be useful for nodereseat to report the SMM name in the error message.
[root ~]# nodereseat node001
Error: Unreachable Target - timeout
This would be better, reporting the value of enclosure.manager:
[root ~]# nodereseat node001
Error: chassis001 Unreachable Target - timeout
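The proposed message format could be produced by something like the following sketch. `format_unreachable` is a hypothetical helper; it assumes the caller has already looked up the node's enclosure.manager value and passes it in.

```python
def format_unreachable(manager=None, detail='timeout'):
    """Build an 'Unreachable Target' message, naming the manager if known.

    Hypothetical helper: `manager` would be the value of the node's
    enclosure.manager attribute when one is set.
    """
    if manager:
        return 'Error: {0} Unreachable Target - {1}'.format(manager, detail)
    # Fall back to the current wording when no manager is known
    return 'Error: Unreachable Target - {0}'.format(detail)
```

For a node whose enclosure.manager is chassis001, `format_unreachable('chassis001')` gives the second form shown above.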
gpu01 configuration:
Manufacturer: HPE
Product Name: ProLiant XL270d Gen10
confetty
-> ls nodes
c01n01/
c01n02/
gpu01/
io01/
io02/
mgt01/
-> cd /noderange/everything/power
-> cat state
cat state
c01n01: state="on"
c01n02: state="on"
io01: state="on"
io02: state="on"
mgt01: state="on"
....
The output stalls here and is very slow.
If you add nodes to confluent:
makeconfluentcfg ipmi (where ipmi is an xCAT nodegroup containing nodea, nodeb and nodec)
This command automatically defines the nodegroups and nodes to confluent.
If you delete the node later:
noderemove nodec
The node is still in the "ipmi" confluent group.
This results in unexpected errors.
Example:
Unexpected error
The /var/log/confluent/trace file complains about a KeyError for the missing node:
Jun 01 13:54:24 Traceback (most recent call last):
File "/opt/confluent/lib/python/confluent/sockapi.py", line 120, in sessionhdl
connection, request, cfm, authdata, authname, skipauth)
File "/opt/confluent/lib/python/confluent/sockapi.py", line 178, in process_request
hdlr = pluginapi.handle_path(path, operation, cfm, params)
File "/opt/confluent/lib/python/confluent/core.py", line 722, in handle_path
pathcomponents, autostrip)
File "/opt/confluent/lib/python/confluent/core.py", line 665, in handle_node_request
if attrname in nodeattr[node]:
KeyError: 'nodec'
The problem is possibly two issues. Some points to consider:
Confluent version:
confluent_client-1.8.2-1.noarch
lenovo-confluent-0.8.1-1.noarch
confluent_server-1.8.2-1.noarch
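The KeyError in the traceback above suggests two places where the stale group membership could be handled: when the node is removed, and when group members are looked up. The sketch below illustrates both under stated assumptions; `remove_node` and `nodes_with_attribute` are hypothetical, and plain dicts and sets stand in for the confluent database.

```python
def remove_node(nodes, groups, name):
    """Delete a node and drop it from every group's member set.

    `nodes` maps node name -> attribute dict and `groups` maps group
    name -> set of member names; both are hypothetical in-memory
    stand-ins for the confluent database.
    """
    nodes.pop(name, None)
    for members in groups.values():
        members.discard(name)


def nodes_with_attribute(nodeattr, candidates, attrname):
    """Yield the candidates that define attrname, skipping unknown nodes."""
    for node in candidates:
        # .get() avoids the KeyError raised by nodeattr[node] when a
        # group still lists a node that no longer exists
        if attrname in nodeattr.get(node, {}):
            yield node
```

Either measure alone would avoid the "Unexpected error"; doing both keeps the database consistent and makes lookups tolerant of stale entries.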
We recently discovered that our xCAT 2.15 deployment using goconserver was susceptible to the SWEET32 attack due to hard-coded TLS cipher suites. While it doesn't look like confluent allows DES-based ciphers, it made sense to me to request that confluent support configurable TLS options so that vulnerabilities discovered in the future can be mitigated quickly.
Specifically, I think it will be prudent to allow runtime configuration of the allowed TLS versions and the allowed cipher suites.
More information on the mentioned patch can be found here and information on the vulnerabilities can be found at https://www.openssl.org/blog/blog/2016/08/24/sweet32/ and https://access.redhat.com/articles/2548661
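As a rough illustration of runtime-configurable TLS settings, the sketch below builds a server-side context with Python's standard `ssl` module. It assumes Python 3.7 or later; `MIN_TLS_VERSION`, `CIPHER_STRING`, and `build_server_context` are hypothetical names, and this is not how confluent currently sets up TLS.

```python
import ssl

# Hypothetical settings; in practice these would be read from a
# configuration file or attribute rather than hard-coded constants.
MIN_TLS_VERSION = 'TLSv1_2'
CIPHER_STRING = 'ECDHE+AESGCM:!3DES'


def build_server_context(min_version=MIN_TLS_VERSION, ciphers=CIPHER_STRING):
    """Build a server-side SSLContext from configurable TLS settings."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # ssl.TLSVersion members are looked up by name, e.g. 'TLSv1_2'
    ctx.minimum_version = ssl.TLSVersion[min_version]
    # OpenSSL cipher-list syntax; raises ssl.SSLError if nothing matches
    ctx.set_ciphers(ciphers)
    return ctx
```

A certificate and key would still need to be loaded with `ctx.load_cert_chain(...)` before the context can be used to wrap a listening socket.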
Hi,
Is there really no support for node discovery with a tagged VLAN for IMM/XCC?
Or am I just missing something?
I can't find anything in the documentation or code, and confluent does not seem to use xCAT's bmcvlantag attribute at all.
I'm using confluent through its web API. When I send a request (via curl, for example) to a non-existent endpoint, I do not get an error (such as a 404). Here is what I did:
curl -g -k -u confluentuser:confluentpassword http://address:4005/nodes/nodename/configuration/management_controller/net_interfaces/does_not_exist -H "Accept: application/json"
{
"_links": {
"collection": {
"href": "./"
},
"self": {
"href": "./does_not_exist"
}
}
}
I found this behavior with multiple endpoints, and I was wondering whether it is intended or whether returning a 404 error would be better.
Dear All,
I have installed confluent-server following the docs, but the service doesn't start:
systemctl status confluent.service
confluent.service - Confluent hardware manager
Loaded: loaded (/usr/lib/systemd/system/confluent.service; disabled; vendor preset: disabled)
Active: failed (Result: start-limit) since mar 2022-03-15 12:31:08 CET; 2s ago
Process: 2197 ExecStart=/opt/confluent/bin/confluent (code=exited, status=1/FAILURE)
mar 15 12:31:07 pxe.oact.inaf.it systemd[1]: confluent.service: control process exited, code=exited...s=1
mar 15 12:31:07 pxe.oact.inaf.it systemd[1]: Failed to start Confluent hardware manager.
mar 15 12:31:07 pxe.oact.inaf.it systemd[1]: Unit confluent.service entered failed state.
mar 15 12:31:07 pxe.oact.inaf.it systemd[1]: confluent.service failed.
mar 15 12:31:08 pxe.oact.inaf.it systemd[1]: confluent.service holdoff time over, scheduling restart.
mar 15 12:31:08 pxe.oact.inaf.it systemd[1]: Stopped Confluent hardware manager.
mar 15 12:31:08 pxe.oact.inaf.it systemd[1]: start request repeated too quickly for confluent.service
mar 15 12:31:08 pxe.oact.inaf.it systemd[1]: Failed to start Confluent hardware manager.
mar 15 12:31:08 pxe.oact.inaf.it systemd[1]: Unit confluent.service entered failed state.
mar 15 12:31:08 pxe.oact.inaf.it systemd[1]: confluent.service failed.
Can you help me to solve the issue?
At present when you discover nodes they remain powered on.
It would be useful to have a node attribute to indicate what to do next (e.g. power off the node or do some other tasks).