aaroc / chpctier2

Issues with the CHPC Tier2 Facility

chpctier2's Issues

storage nodes time

the times on the storage node controllers are wrong; fix them

infiniband templates for zabbix

perfquery is your friend ....

take those stats, understand what they are and which ones we want to keep, then pump them into zabbix and trigger on/display them
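A minimal sketch of that pipeline, assuming perfquery's usual `Name:....value` line layout. The sample text, the shortlist of counters and the `ib.` item key prefix are all hypothetical; the output lines follow the `<host> <key> <value>` layout that `zabbix_sender` accepts from an input file.

```python
import re

# Illustrative sample only; real perfquery output varies by version and flags.
SAMPLE = """\
# Port counters: Lid 12 port 1
PortSelect:......................1
SymbolErrorCounter:..............0
PortXmitData:....................123456
PortRcvData:.....................654321
PortXmitWait:....................42
"""

# Hypothetical shortlist of counters worth keeping.
WANTED = {"SymbolErrorCounter", "PortXmitData", "PortRcvData", "PortXmitWait"}

# Counter name, a run of dots, then a decimal value (hex fields are skipped).
LINE_RE = re.compile(r"^(\w+):\.+(\d+)$")


def parse_perfquery(text):
    """Return {counter_name: int_value} for every decimal counter line."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def to_sender_lines(host, counters, wanted=WANTED):
    """Format the wanted counters as '<host> <key> <value>' lines."""
    return [
        f"{host} ib.{name} {counters[name]}"
        for name in sorted(wanted)
        if name in counters
    ]


if __name__ == "__main__":
    for line in to_sender_lines("gnode-1-40", parse_perfquery(SAMPLE)):
        print(line)
```

Which counters to trigger on is still the open question in this ticket; the shortlist above is only a placeholder.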

Node network congestion

The 50 compute nodes are spread over two 48-port switches: 48 ports are used on one and 2 ports on the other. Redistribute.
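One way to plan the redistribution is a plain round-robin split, which gives 25 nodes per switch. This is only a sketch: the node and switch names are placeholders, and the real cabling constraints are not modelled.

```python
def split_nodes(nodes, switches):
    """Assign nodes to switches round-robin and return {switch: [nodes]}."""
    plan = {switch: [] for switch in switches}
    for index, node in enumerate(nodes):
        plan[switches[index % len(switches)]].append(node)
    return plan


# Placeholder names; substitute the real host and switch names.
nodes = [f"node{i:02d}" for i in range(1, 51)]
plan = split_nodes(nodes, ["switch-a", "switch-b"])

if __name__ == "__main__":
    for switch, members in plan.items():
        print(switch, len(members))
```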

upgrade monitoring server

reinstall with centos 7 deployed from puppet.

migrate db from postgresql to mysql. Do we care about history ?

cvmfs via puppet

cvmfs needs an update in cluster.
redo all cvmfs via puppet.

Reconfigure network

Network is overly complex.

change and/or fix the network to collapse the 172.20.196 network into 172.20.100
put all management networks in vlan 10; some already are
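A rough sketch of how the renumbering could be planned with Python's `ipaddress` module. The /24 prefix length is an assumption (the ticket gives only three octets), and keeping the host part unchanged will collide with any address already in use in 172.20.100, so the result needs checking against the existing allocation.

```python
import ipaddress

# Assumption: both networks are /24; the ticket does not state the prefix length.
OLD_NET = ipaddress.ip_network("172.20.196.0/24")
NEW_NET = ipaddress.ip_network("172.20.100.0/24")


def remap(addr):
    """Move an address from OLD_NET into NEW_NET, keeping its host part."""
    ip = ipaddress.ip_address(addr)
    if ip not in OLD_NET:
        return str(ip)  # addresses outside the old network are left alone
    offset = int(ip) - int(OLD_NET.network_address)
    return str(NEW_NET.network_address + offset)


if __name__ == "__main__":
    print(remap("172.20.196.17"))
```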

alice eos is down as of 4pm this afternoon

tests from alice central services are failing on alice::za_chpc::eos

ATLAS SW Dir wrong

fix ATLAS SW Directory and nfs cluster wide share, according to https://twiki.cern.ch/twiki/bin/view/AtlasComputing/CernVMFS#Setup_Instructions_for_LCG_Grid

rebuild xcat server

reinstall xcat01 as centos7.x
throw away xcat.
back up for posterity; nothing in particular needs keeping

request A/R recalc

A/R is wrong for 20-25 June and 20-26 May

I have no idea why they are wrong, they both span a power failure, one intentional and one not.
The sides are way too long compared with when the service actually came up.

15000+ ALICE jobs in error states

MonaLisa reporting 15k+ jobs in error states.

10k in error_v
3k error_e
2.2k error_ib

grafana/zabbix say it happened mostly at 14h30

install ralph

install ralph to test it as an alternative to racktables.

rotate ethsw09 and ethsw10

ethsw09 and ethsw10 are blowing the air in the wrong direction; they need to be switched around. This does involve rerouting all the fibres.

atlas eos on wrong controller

the dell storage arrays are reporting that the rebuilding disks are on the wrong controller: controller 1, but the preferred controller is controller 0.

I don't want to touch this while the rebuilding is going on. All the storage currently being rebuilt is on the "wrong" controller according to management software. They were all originally configured to be on controller 0, go figure.

This means that either something is wrong or broken.
Stress test once all data is live should answer this issue #13

Move arrays back to controller 0 when ready.

This is to show you the board

Enable LSCG

Ok, this is not for the Tier 2 per se, but for the "grid" site at the CHPC.

We have had a request to enable the Life Sciences Compute Grid at the site. This should be simply enabling the VO on the CE and WN.

@bazinski needs to make the changes, unless you can delegate that to me.

no zabbix data from monalisa

The monalisa data in zabbix is not being pulled in, since about 11am this morning

Add biomed VO

information at https://cclcgvomsli01.in2p3.fr:8443/voms/biomed/configuration/configuration.action

replace squid with frontier squid

replace the current squid on vobox to be frontier squid, via puppet

network congestion ?

failed jobs are in the 1500 range now on error_sv attempting to save.

Notice from tenet again to say congestion on cpt-jhb-durban.

One must assume that the outgoing connection on seacom is causing us to be congested.

second network connections for grid-xrootd0[12]

For redundancy, grid-xrootd0[12] each need a second 10G interface in addition to the single one they have now. Plug the second interface into the alternate 10G switch, then bond both interfaces.

IP allocation and access

The zabbix monitoring server cannot access certain network ranges; collapse the overly complex network topology to a simpler structure of 3 blocks:

ipmi/snmp/monitoring/management
data
public.

storage trays wrongly configured in rack tables.

fix the storage array locations in racktables in rack4

monalisa zabbix link is down

the monalisa client on the vobox is down.

ALICE SAM3 tests in unknown

something wrong with sam tests since late 25th june

Centos7 upgrade of storage

The eos storage can use centos 7, upgrade them

grid-se2, grid-xrootd0[12], grid-xrootd

Network at chpc is flaky

Intermittent network errors.

Relevant parties have been informed.

install racktables

install racktables on new monitoring server and integrate with zabbix as per :
https://github.com/kaz260/RackTables-ZABBIX-bridge

enable AfricaGrid VO

On sagrid queue
config is here : https://voms.ct.infn.it:8443/voms/vo.africa-grid.org/configuration/configuration.action

Get more storage

For a tier2 site we need additional storage.

ALICE needs 600TB usable to comply with our obligations; we have c. 350TB.
ATLAS as of "very soon" will require 600TB as well, 60TB for "local users" and the rest for general usage.
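The figures above work out as follows. This is back-of-envelope arithmetic only; it takes "c. 350" as exactly 350TB and makes no assumption about how much ATLAS storage already exists.

```python
# ALICE: required versus currently available (TB)
alice_required_tb = 600
alice_available_tb = 350  # "c. 350" in the ticket, treated as exact here
alice_shortfall_tb = alice_required_tb - alice_available_tb

# ATLAS: total requirement split into local-user and general-usage space (TB)
atlas_required_tb = 600
atlas_local_tb = 60
atlas_general_tb = atlas_required_tb - atlas_local_tb

if __name__ == "__main__":
    print(f"ALICE shortfall: {alice_shortfall_tb} TB")
    print(f"ATLAS general usage: {atlas_general_tb} TB")
```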

10G onto grid-se

grid-se, the redirector for ATLAS, needs a 10G interface; the one available is sitting in grid-ui.
grid-ui has been taken for some cloud project.

Either grid-ui and grid-se must be reinstalled and physically swapped, or the 10G interface in grid-ui must go into grid-se.

They appear to have the same processor and memory.

firewalls in puppet

rather self-explanatory: the nodes all have various configurations for firewalls. Put it all in puppet. The module is installed, so it's a config and testing exercise.

replace xcat with foreman

get rid of xcat; it's currently dead and nobody wants to fix it.

ALICE EOS is down

due to a power failure this morning, ALICE EOS is down with corrupt files. Still working on it, and reading.

eos template for zabbix

eos template for zabbix to monitor .... eos.

huge drop in alice jobs

alice jobs plummeted from 1800 to 700 in 1 hour

zabbix slack bot wrong destination

zabbix is sending to the bot directly, not to a specific channel #chcp-zabbix

Node 1-40 has no serial number

Yip, you read the subject correctly: gnode-1-40 has no serial number/service tag.

dmidecode -t system :

root@gnode-1-40:~ $ dmidecode -t system
# dmidecode 2.11
SMBIOS 2.7 present.

Handle 0x0001, DMI type 1, 27 bytes
System Information
    Manufacturer: Dell Inc.
    Product Name: PowerEdge C8220
    Version: Not Specified
    Serial Number: N/A
    UUID: 4C4C4544-002F-4110-8020-CEC04F202020
    Wake-up Type: Power Switch
    SKU Number: N/A
    Family: Server
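To find any other nodes with the same problem, the output above could be checked with something like the following sketch. The list of "empty" placeholder strings is an assumption based on this one output; other firmware may report different placeholders.

```python
# Fields copied from the gnode-1-40 output in this ticket.
SAMPLE = """\
Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge C8220
Version: Not Specified
Serial Number: N/A
UUID: 4C4C4544-002F-4110-8020-CEC04F202020
"""

# Assumed placeholder values that mean "no serial number set".
EMPTY_VALUES = ("", "N/A", "Not Specified")


def parse_system_info(text):
    """Return the 'Key: value' pairs of a dmidecode system block as a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:
            info[key] = value.strip()
    return info


def has_serial(info):
    """True if the Serial Number field holds a real value."""
    return info.get("Serial Number", "") not in EMPTY_VALUES


if __name__ == "__main__":
    print(has_serial(parse_system_info(SAMPLE)))
```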

job failures 22/23 july

monalisa reporting large job failures.

update grid-xrootd0[12]

kernel version needs update to prevent
https://bugzilla.redhat.com/show_bug.cgi?id=713546

workaround is :
vm.min_free_kbytes = 512000
vm.zone_reclaim_mode = 1

This will obviously require a reboot and hence downtime
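A small sketch for checking whether the workaround values are live on a node. It assumes the standard Linux mapping from a sysctl key such as `vm.min_free_kbytes` to `/proc/sys/vm/min_free_kbytes`; the function names are illustrative, and it only reads values, it does not set them.

```python
from pathlib import Path

# The two workaround settings from this ticket.
WORKAROUND = """\
vm.min_free_kbytes = 512000
vm.zone_reclaim_mode = 1
"""


def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def proc_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path(root, *key.split("."))


def mismatches(settings, root="/proc/sys"):
    """Return {key: (wanted, current)} for values that differ or are missing."""
    bad = {}
    for key, want in settings.items():
        path = proc_path(key, root)
        have = path.read_text().strip() if path.exists() else None
        if have != want:
            bad[key] = (want, have)
    return bad


if __name__ == "__main__":
    for key, (want, have) in mismatches(parse_sysctl(WORKAROUND)).items():
        print(f"{key}: want {want}, have {have}")
```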
