confluent's People

Contributors

aduffy19, andywray, arif-ali, brianfinley, chenglch, erderial, hengli-kuang, jjohnson42, jufm, mslacken, penghuicui, sjtstg, tkucherera, tkucherera-lenovo, vmaneagit, weragrzeda, whowutwut, zhougj4


confluent's Issues

Setting net.mgt.ipv4_gateway to an empty value stops network discovery configuration on a flat network

On a flat network you may not have a default gateway.
If you attempt to configure a device like a Lenovo D2 chassis SMM, confluent sets the correct IP address and subnet mask but leaves the existing default gateway unchanged.
Example:

nodeconfig smm001

smm001: bmc.ipv4_address: 172.30.23.98/20
smm001: bmc.ipv4_method: Static
smm001: bmc.ipv4_gateway: 192.168.0.1

It would be useful to set the gateway to nothing if we do not have one.

If you set a blank gateway:
nodegroupattrib everything net.mgt.ipv4_gateway=

then restarting confluent will cause all discoveries to fail:
Jan 24 16:24:12 {"error": "Error encountered trying to set up smm001, [Errno -2] Name or service not known"}

I expect this is because confluent expects a proper gateway IP address and cannot handle a blank value.
Maybe we are doing something stupid, but for consistency it is useful not to have mixed IP network ranges on the SMM.

As a work-around we can set the gateway to something on the network (e.g. the management server).
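
One way the blank-gateway case could be handled is sketched below. This is not confluent's code: `normalize_gateway` and `build_bmc_netconfig` are hypothetical helpers, and the dictionary layout is an assumption made for illustration. The idea is to treat an empty attribute as "no gateway" before anything tries to resolve it as a host name.

```python
import ipaddress
from typing import Optional


def normalize_gateway(raw: Optional[str]) -> Optional[str]:
    """Return a validated IPv4 gateway, or None when the attribute is blank.

    Hypothetical helper: an empty or whitespace-only value means
    "no gateway" and is never handed to name resolution.
    """
    if raw is None or not raw.strip():
        return None
    # Raises ValueError for anything that is not a literal IPv4 address.
    return str(ipaddress.IPv4Address(raw.strip()))


def build_bmc_netconfig(address_cidr: str, gateway_attr: Optional[str]) -> dict:
    """Assemble the settings to apply (illustrative structure only)."""
    iface = ipaddress.IPv4Interface(address_cidr)
    return {
        "ipv4_address": str(iface.ip),
        "prefix": iface.network.prefixlen,
        "ipv4_method": "Static",
        # None stands for "no default gateway on this flat network".
        "ipv4_gateway": normalize_gateway(gateway_attr),
    }
```

With the values from the report, `build_bmc_netconfig("172.30.23.98/20", "")` yields a configuration whose gateway is `None` instead of raising a resolution error. Whether a given SMM accepts an empty gateway is a separate question that this sketch does not answer.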

Add option to nodediscover command to only report undiscovered devices

It would be useful to add an option on the nodediscover list command to report devices that have not been discovered yet.
This change should take into account that another confluent server may have already discovered the device. (e.g. when using multiple xCAT service nodes as confluent servers on the same network segment).

For large-scale cluster discovery this would be useful.

confluent version:
lenovo-confluent-0.7.1-1.noarch
confluent_server-1.7.2-1.noarch
confluent_client-1.7.2-1.noarch

Not all rpm requisites are included in confluent rpm package build

If a user wants to install or upgrade confluent independently of xCAT, the confluent rpm packages do not include all the requisite rpm packages that are needed.

For example, in confluent 1.7.1 the following requisite rpm is required, otherwise things do not work:
python-pyghmi

Other rpms that may be required from the Lenovo confluent/xCAT bundle:
python-eventlet
python-greenlet
python2-pyte

Confluent version tested on:
lenovo-confluent-0.7.1-1.noarch
confluent_server-1.7.2-1.noarch
confluent_client-1.7.2-1.noarch

It takes a long time to get energy and temperature data

Cluster: 821 compute nodes
cat /noderange/everything/sensors/hardware/energy/all
02c02n04: sensors=[
    {
        "name": "DC Energy",
        "state_ids": [],
        "value": 30.58287145277778,
        "states": [],
        "health": "ok",
        "units": "kWh",
        "type": "Energy"
    }
]
r14c01n11: sensors=[
    {
        "name": "DC Energy",
        "state_ids": [],
        "value": 55.889863706944446,
        "states": [],
        "health": "ok",
        "units": "kWh",
        "type": "Energy"
    }
]
...

The command blocks here for a very long time.

confluent missing man pages

The following commands do not have man pages:

nodedefine
nodegroupdefine

Not sure if these are needed.

Where found:

rpm -qa | grep confluent

lenovo-confluent-0.7.1-1.noarch
confluent_server-1.7.2-1.noarch
confluent_client-1.7.2-1.noarch

Console won't open

The consoles.html example is not working due to a cookie issue: when a new console is opened by clicking the button, it does not display any text, only an empty console. httpapi.py returns an Internal Server Error (500) and logs the following error:

httpsessions[authorized['sessionid']]['inflight'].add(mythreadid)
KeyError: 'sessionid'

The reason for this is that the confluentsessionid cookie, which the server sends after the first POST request, is not included in subsequent POST requests when initiating the console session.

This can be fixed by removing line 293 in httpapi.py:
cookie['confluentsessionid']['path'] = '/'
As the XMLHttpRequest does not use / as its path, the browser won't send the cookie. Removing the mentioned line resolves the issue and consoles work as expected.

regards
Pascal

Non-existent endpoint does not return a 404

Hi again, I've found another non-existent endpoint that does not return any error (like in issue #112):
/nodes/nodename/configuration/management_controller/users/2/ThisEndpointDoesNotExist.

I don't know if there are more endpoints like this. I was wondering if there is a way to generalize the 404 handling? For example, keep a list of every possible endpoint of the application and return a NotFound error if a request is not contained in this list. I don't know if that is appropriate for your case.
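
The "list of every possible endpoint" idea could be sketched roughly as follows. This is not how confluent routes requests: the `RESOURCE_TREE` contents, the leaf names and the `NotFound` class are all hypothetical, and `"*"` is an assumed wildcard for variable elements such as a node name or user id.

```python
from typing import Dict

# Hypothetical resource tree: a nested dict is a collection, None is a leaf.
# "*" matches one variable path element (node name, user id, ...).
RESOURCE_TREE: Dict[str, object] = {
    "nodes": {
        "*": {
            "configuration": {
                "management_controller": {
                    "users": {"*": {"username": None, "privilege_level": None}},
                    "net_interfaces": {"management": None},
                },
            },
            "power": {"state": None},
        },
    },
}


class NotFound(Exception):
    """Raised when a URL path does not map to any known resource."""


def resolve(path: str, tree: Dict[str, object] = RESOURCE_TREE) -> object:
    """Walk the tree element by element, failing on the first unknown one."""
    current: object = tree
    for element in (p for p in path.split("/") if p):
        if not isinstance(current, dict):
            raise NotFound(path)  # tried to descend below a leaf
        if element in current:
            current = current[element]
        elif "*" in current:
            current = current["*"]
        else:
            raise NotFound(path)
    return current


def status_for(path: str) -> int:
    """Map a request path to the HTTP status a handler might return."""
    try:
        resolve(path)
    except NotFound:
        return 404
    return 200
```

With this table, `status_for("/nodes/nodename/configuration/management_controller/users/2/ThisEndpointDoesNotExist")` gives 404. The wildcard only checks the shape of the path, so a real implementation would still need to confirm that the named node or user exists.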

nodeattrib does not restrict setting discovery.policy to policy values

The confluent discovery policy can currently be set to: manual, open or permissive.
(see https://hpc.lenovo.com/users/documentation/confluentdisco.html).

Confluent has no error checking to ensure this value is set correctly by the user.
Example:

nodeattrib node001 discovery.policy
node001: discovery.policy: permissive

nodeattrib node001 discovery.policy=apple
node001: apple

nodeattrib r164c05s01 discovery.policy
node001: discovery.policy: apple

We should improve the code to contain error checking for confluent settings to prevent the user entering incorrect values in those database fields where fixed values are expected.
This specific example is for the discovery.policy setting but a more extensive review should be done for all database fields to improve user experience.
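
A minimal sketch of the kind of check being asked for, assuming a simple lookup table of allowed values. The three policy names come from this report; the table layout, the `validate_attribute` name and the comma-separated handling are assumptions, not confluent's implementation.

```python
from typing import Dict, FrozenSet

# Allowed values per attribute. Only the three names from the report are
# listed; a real table would have to cover every token confluent accepts.
ALLOWED_VALUES: Dict[str, FrozenSet[str]] = {
    "discovery.policy": frozenset({"manual", "open", "permissive"}),
}


def validate_attribute(name: str, value: str) -> str:
    """Return the normalized value, or raise ValueError if it is not allowed.

    Attributes without an entry in ALLOWED_VALUES are treated as free-form.
    Values are split on commas so that each token is checked separately.
    """
    allowed = ALLOWED_VALUES.get(name)
    if allowed is None:
        return value
    tokens = [token.strip() for token in value.split(",")]
    bad = [token for token in tokens if token not in allowed]
    if bad:
        raise ValueError(
            "{0}: invalid value(s) {1}; expected one of {2}".format(
                name, ", ".join(bad), ", ".join(sorted(allowed))
            )
        )
    return ",".join(tokens)
```

Here `validate_attribute("discovery.policy", "apple")` raises `ValueError` instead of storing the value, while attributes without a fixed value list pass through unchanged.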

Code level tested on:
lenovo-confluent-0.7.1-1.noarch
confluent_server-1.7.2-1.noarch
confluent_client-1.7.2-1.noarch

nodefirmware update fails with ioerror

During updates of system firmware on 36 nodes in parallel, we found that one step failed, with ioerror reported for most nodes and a few nodes stuck in initialising.
We had to Ctrl-C the operation as it was not working.

It seemed that confluent had just rotated the stderr log file, and we saw an error trace like this:

/var/log/confluent/stderr:
May 15 10:44:13 {"previouslogfile": "/var/log/confluent/stderr.2018-05-13"}
May 15 10:44:13 File "/usr/lib64/python2.7/threading.py", line 814, in __bootstrap_inner
(self.name, _format_exc())): Exception in thread Thread-288:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/oem/lenovo/imm.py", line 57, in run
self.rsp = self.wc.upload(self.url, self.filename, self.data)
File "/usr/lib/python2.7/site-packages/pyghmi/util/webclient.py", line 143, in upload
data = open(filename, 'rb')
IOError: [Errno 2] No such file or directory: u'/software/fw-thinksystem-sr630-sr650-lxpm/lnvgy_fw_lnvgy_fw_drvln_pdl212s-1.20_anyos_noarch.uxz'
May 15 10:44:13 File "/usr/lib64/python2.7/threading.py", line 814, in __bootstrap_inner
(self.name, _format_exc())): Exception in thread Thread-289:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/oem/lenovo/imm.py", line 57, in run
self.rsp = self.wc.upload(self.url, self.filename, self.data)
File "/usr/lib/python2.7/site-packages/pyghmi/util/webclient.py", line 143, in upload
data = open(filename, 'rb')
IOError: [Errno 2] No such file or directory: u'/software/fw-thinksystem-sr630-sr650-lxpm/lnvgy_fw_lnvgy_fw_drvln_pdl212s-1.20_anyos_noarch.uxz'

The nodefirmware command would consistently fail in the same manner if you repeated the attempt to update the firmware.
A restart of the confluent service (systemctl restart confluent) cleared the issue.

Confluent version:
lenovo-confluent-0.8.1-1.noarch
confluent_client-1.8.2-1.noarch
confluent_server-1.8.2-1.noarch

If BMC password has changed confluent should provide a more meaningful error message

On a single node the BMC password was changed.
Trying to query the node should fail but the error message is not user-friendly:

nodehealth nodeb
Unexpected error

Would it be possible to make the error more meaningful, for example by stating that confluent was unable to establish a connection to the BMC device?

confluent version:
confluent_client-1.8.2-1.noarch
lenovo-confluent-0.8.1-1.noarch
confluent_server-1.8.2-1.noarch

/var/log/confluent/trace:
Jun 01 14:23:36 Traceback (most recent call last):
File "/opt/confluent/lib/python/confluent/sockapi.py", line 120, in sessionhdl
connection, request, cfm, authdata, authname, skipauth)
File "/opt/confluent/lib/python/confluent/sockapi.py", line 187, in process_request
send_response(hdlr, connection)
File "/opt/confluent/lib/python/confluent/sockapi.py", line 144, in send_response
for rsp in responses:
File "/opt/confluent/lib/python/confluent/plugins/hardwaremanagement/ipmi.py", line 308, in perform_requests
configdata = cfg.get_node_attributes(nodes, _configattributes)
File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 926, in get_node_attributes
decrypt=decrypt)
File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 498, in _decode_attribute
retdict['value'] = decrypt_value(nodeobj[attribute]['cryptvalue'])
File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 222, in decrypt_value
raise Exception("bad HMAC value on crypted value")
Exception: bad HMAC value on crypted value

What is the difference between xnba elilo-x64.efi and bootx64.efi

Hi @jjohnson42

When I use netboot=xnba, on some OSes and x86_64 servers it fails to load the kernel and initrd correctly, so I have a question: what is the difference between xnba elilo-x64.efi and bootx64.efi? I think xnba is a good bootloader. If I want to use xnba with bootx64.efi, do you have any suggestions?

When an SMM is discovered it would be useful to populate id.model and id.serial attributes

When we discover an SMM we do not populate the id.model and id.serial attributes in the confluent database. Example:

[root@cresco6mg1 ~]# nodeattrib smm01
smm01: console.logging: full
smm01: console.method: ipmi
smm01: discovery.policy: permissive,pxe
smm01: groups: smm,rack01-smm,everything
smm01: hardwaremanagement.manager: smm01
smm01: pubkeys.tls_hardwaremanager: string
smm01: secret.hardwaremanagementpassword: ********
smm01: secret.hardwaremanagementuser: ********

However, when we discover a node this information is populated:
[root@cresco6mg1 ~]# nodeattrib nodex001
nodex001: console.logging: full
nodex001: console.method: ipmi
nodex001: discovery.policy: permissive,pxe
nodex001: enclosure.bay: 1
nodex001: enclosure.manager: smm01
nodex001: groups: ipmi,compute,rack01,everything
nodex001: hardwaremanagement.manager: nodex001-xcc
nodex001: id.model: 7X21CTO1WW
nodex001: id.serial: ABC1234
nodex001: id.uuid: string
nodex001: net.switch: switch01
nodex001: net.switchport: 1
node001: pubkeys.tls_hardwaremanager: string
nodex001: secret.hardwaremanagementpassword: ********
nodex001: secret.hardwaremanagementuser: ********

It would be nice to have this feature added.
Work-around is to get the information using the nodeattrib command.

nodefirmware update should not leave status in pending state once firmware update is complete

If you run "nodefirmware node-range update firmware-image" you will see a status like this when
the firmware update is complete:

lnode01:pending: 100% lnode02:pending: 100% xcat1:pending: 100%
xcat2:pending: 100% xcat3:pending: 100% xcat4:pending: 100%

The state should not be "pending" when the update is successful.
Even though the update is "100%" the "pending" status is slightly confusing to the user.

Consider changing this status to report "success", "completed" or another appropriate label to make it clear to the user that the update is finished.

Confluent version:
lenovo-confluent-0.8.1-1.noarch
confluent_client-1.8.2-1.noarch
confluent_server-1.8.2-1.noarch

It takes a long time to get the state of nodes after adding an HPE server

gpu01 configuration:
Manufacturer: HPE
Product Name: ProLiant XL270d Gen10

confetty
-> ls nodes
c01n01/
c01n02/
gpu01/
io01/
io02/
mgt01/
-> cd /noderange/everything/power
-> cat state
cat state
c01n01: state="on"
c01n02: state="on"
io01: state="on"
io02: state="on"
mgt01: state="on"
....
The output stalls here and is very slow.

Node groups are not cleaned up when noderemove used to delete a node

If you add nodes to confluent:

makeconfluentcfg ipmi (where ipmi is an xCAT nodegroup containing nodea, nodeb and nodec)
This command automatically defines the nodegroups and nodes in confluent.

If you delete the node later:

noderemove nodec

The node is still in the "ipmi" confluent group.
This results in unexpected errors.
Example:

nodehealth ipmi | collate -a

Unexpected error

The /var/log/confluent/trace file complains about a KeyError on the missing node:
Jun 01 13:54:24 Traceback (most recent call last):
File "/opt/confluent/lib/python/confluent/sockapi.py", line 120, in sessionhdl
connection, request, cfm, authdata, authname, skipauth)
File "/opt/confluent/lib/python/confluent/sockapi.py", line 178, in process_request
hdlr = pluginapi.handle_path(path, operation, cfm, params)
File "/opt/confluent/lib/python/confluent/core.py", line 722, in handle_path
pathcomponents, autostrip)
File "/opt/confluent/lib/python/confluent/core.py", line 665, in handle_node_request
if attrname in nodeattr[node]:
KeyError: 'nodec'

The problem is possibly two issues. Some points to consider:

  1. If we delete a node from confluent, should the node also be removed from any nodegroups?
  2. If a non-existent node is defined in a node group, why does this generate trace file errors?
  3. Should we have a method to add/remove individual attributes from confluent node groups?
    At present I think you have to define the nodegroup again (minus the objects you want to remove from the group).
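
Points 1 and 2 could be sketched as below, using a toy in-memory store. The `Inventory` class and its method names are hypothetical and say nothing about how confluent's configmanager is actually structured. Removing a node also drops it from every group, and group expansion skips members that no longer exist instead of raising `KeyError`.

```python
from typing import Dict, Iterable, List, Set


class Inventory:
    """Toy node/group store used only to illustrate the cleanup behaviour."""

    def __init__(self) -> None:
        self.nodes: Dict[str, dict] = {}
        self.groups: Dict[str, Set[str]] = {}

    def add_node(self, name: str, groups: Iterable[str] = ()) -> None:
        self.nodes[name] = {}
        for group in groups:
            self.groups.setdefault(group, set()).add(name)

    def remove_node(self, name: str) -> None:
        """Delete the node and remove it from every group it belonged to."""
        self.nodes.pop(name, None)
        for members in self.groups.values():
            members.discard(name)

    def expand_group(self, group: str) -> List[str]:
        """Return existing members only, so stale entries cannot break callers."""
        members = self.groups.get(group, set())
        return sorted(node for node in members if node in self.nodes)


inventory = Inventory()
for node in ("nodea", "nodeb", "nodec"):
    inventory.add_node(node, groups=["ipmi", "everything"])
inventory.remove_node("nodec")
print(inventory.expand_group("ipmi"))  # ['nodea', 'nodeb']
```

Filtering in `expand_group` is a defensive choice: even if a stale member reaches a group through some other path, commands run against the group still operate on the nodes that exist.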

Confluent version:
confluent_client-1.8.2-1.noarch
lenovo-confluent-0.8.1-1.noarch
confluent_server-1.8.2-1.noarch

Add TLS Configuration Options

We recently discovered that our xCAT 2.15 deployment using goconserver was susceptible to the SWEET32 attack due to hard-coded TLS cipher suites. While it doesn't look like confluent allows DES-based ciphers, it made sense to me to request that confluent support configurable TLS options so that vulnerabilities discovered in the future can be mitigated quickly.

Specifically, I think it will be prudent to allow runtime configuration of the allowed TLS versions and the allowed cipher suites.

More information on the mentioned patch can be found here and information on the vulnerabilities can be found at https://www.openssl.org/blog/blog/2016/08/24/sweet32/ and https://access.redhat.com/articles/2548661
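
As a rough illustration of what runtime-configurable TLS options could look like in Python, the sketch below builds a server context from two settings using the standard `ssl` module. The `build_server_context` function and its parameter names are assumptions for this example and not an existing confluent option.

```python
import ssl
from typing import Optional

# Version strings an operator might put in a config file (assumed naming).
_VERSIONS = {
    "1.2": ssl.TLSVersion.TLSv1_2,
    "1.3": ssl.TLSVersion.TLSv1_3,
}


def build_server_context(min_version: str = "1.2",
                         ciphers: Optional[str] = None) -> ssl.SSLContext:
    """Create a server-side TLS context from operator-supplied settings.

    min_version: lowest protocol version to accept ("1.2" or "1.3").
    ciphers: OpenSSL cipher string for TLS 1.2 and below; None keeps the
             library defaults. TLS 1.3 suites are not affected by it.
    Certificate loading is omitted to keep the sketch short.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = _VERSIONS[min_version]
    if ciphers:
        context.set_ciphers(ciphers)
    return context


context = build_server_context("1.2", "ECDHE+AESGCM")
print(sorted(cipher["name"] for cipher in context.get_ciphers()))
```

A real deployment would also call `load_cert_chain` on the returned context and read both settings from configuration, so that a cipher or protocol version can be disabled without a code change.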

No support for tagged VLAN nodediscover?

Hi,

Is there really no support for node discovery with a tagged VLAN for IMM/XCC?
Or am I missing something?

I can't find anything in the documentation or code, and confluent does not seem to use xCAT's bmcvlantag attribute at all.

Non-existent endpoint does not return a 404 error

I'm using Confluent through its web API. When I perform a request (via curl, for example) on a non-existent endpoint, I do not get an error (a 404, for example). Here is what I've done:

curl  -g -k -u confluentuser:confluentpassword http://address:4005/nodes/nodename/configuration/management_controller/net_interfaces/does_not_exist -H "Accept: application/json" 
 
{
    "_links": {
        "collection": {
            "href": "./"
        }, 
        "self": {
            "href": "./does_not_exist"
        }
    }
}

I found this behavior with multiple endpoints and was wondering whether it is expected, or whether returning a 404 error would be better.

Failed to start Confluent hardware manager

Dear All,

I have installed confluent-server following the docs, but the service doesn't start:

systemctl status confluent.service
 confluent.service - Confluent hardware manager
   Loaded: loaded (/usr/lib/systemd/system/confluent.service; disabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since mar 2022-03-15 12:31:08 CET; 2s ago
  Process: 2197 ExecStart=/opt/confluent/bin/confluent (code=exited, status=1/FAILURE)

mar 15 12:31:07 pxe.oact.inaf.it systemd[1]: confluent.service: control process exited, code=exited...s=1
mar 15 12:31:07 pxe.oact.inaf.it systemd[1]: Failed to start Confluent hardware manager.
mar 15 12:31:07 pxe.oact.inaf.it systemd[1]: Unit confluent.service entered failed state.
mar 15 12:31:07 pxe.oact.inaf.it systemd[1]: confluent.service failed.
mar 15 12:31:08 pxe.oact.inaf.it systemd[1]: confluent.service holdoff time over, scheduling restart.
mar 15 12:31:08 pxe.oact.inaf.it systemd[1]: Stopped Confluent hardware manager.
mar 15 12:31:08 pxe.oact.inaf.it systemd[1]: start request repeated too quickly for confluent.service
mar 15 12:31:08 pxe.oact.inaf.it systemd[1]: Failed to start Confluent hardware manager.
mar 15 12:31:08 pxe.oact.inaf.it systemd[1]: Unit confluent.service entered failed state.
mar 15 12:31:08 pxe.oact.inaf.it systemd[1]: confluent.service failed.

Can you help me to solve the issue?
