aaroc / chpctier2
Issues with the CHPC Tier2 Facility
the times on the storage node controllers are wrong, fix
perfquery is your friend ....
take those stats, understand what they are and which we want to keep, and then pump them into zabbix and trigger/display them
The 50 compute nodes are spread over two 48-port switches: 48 ports are used on one and 2 ports on the other. Redistribute.
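As a rough sketch of what an even redistribution would look like, the split can be worked out as below. The function name `balance_ports` is hypothetical; only the counts (50 nodes, two 48-port switches) come from the issue.

```python
def balance_ports(nodes: int, switches: int, ports_per_switch: int) -> list[int]:
    """Spread nodes as evenly as possible across identical switches."""
    if nodes > switches * ports_per_switch:
        raise ValueError("not enough switch ports for all nodes")
    base, extra = divmod(nodes, switches)
    # The first `extra` switches take one additional node each.
    return [base + (1 if i < extra else 0) for i in range(switches)]


if __name__ == "__main__":
    # 50 compute nodes over two 48-port switches
    print(balance_ports(50, 2, 48))
```

With the numbers from the issue this gives 25 nodes per switch, which leaves spare ports on both switches.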
reinstall with centos 7 deployed from puppet.
migrate db from postgresql to mysql. Do we care about history?
cvmfs needs an update in cluster.
redo all cvmfs via puppet.
Network is overly complex.
change and/or fix the network to collapse the 172.20.196 network into 172.20.100
put all management networks in vlan 10; some already are
tests from alice central services are failing on alice::za_chpc::eos
fix ATLAS SW Directory and nfs cluster wide share, according to https://twiki.cern.ch/twiki/bin/view/AtlasComputing/CernVMFS#Setup_Instructions_for_LCG_Grid
throw away xcat.
reinstall xcat01 as centos7.x
throw away xcat.
back up for posterity, nothing particularly needs keeping
Add the yaim config files for biomed VO
setup eos for atlas
A/R is wrong for 20-25 June and 20-26 May
I have no idea why they are wrong, they both span a power failure, one intentional and one not.
The sides are way too long for when the service actually came up.
MonaLisa reporting 15k+ jobs in error states.
10k in error_v
3k error_e
2.2k error_ib
grafana/zabbix say it happened mostly at 14h30
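A quick sanity check that the per-state counts account for the "15k+" total. This is only a sketch: the figures are the rounded ones quoted above, and the dictionary is typed in by hand rather than pulled from MonaLisa.

```python
# Rounded counts as quoted in the issue (10k, 3k, 2.2k)
errors = {"error_v": 10_000, "error_e": 3_000, "error_ib": 2_200}

total = sum(errors.values())

if __name__ == "__main__":
    print(f"total failed jobs: {total}")
    for state, count in errors.items():
        # Share of each error state in the overall failure count
        print(f"{state}: {count} ({count / total:.1%})")
```

The three states add up to 15 200, so they cover the reported total, with error_v making up roughly two thirds of it.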
install ralph to test in contrast to racktables.
ethsw09 and ethsw10 are blowing the air in the wrong direction; they need to be switched around. This does involve rerouting all the fibres.
the dell storage arrays are reporting that the rebuilding disks are on the wrong controller, controller 1, but the preferred controller is controller 0.
I don't want to touch this while the rebuilding is going on. All the storage currently being rebuilt is on the "wrong" controller according to management software. They were all originally configured to be on controller 0, go figure.
This means that either something is wrong or broken.
Stress test once all data is live should answer this issue #13
Move arrays back to controller 0 when ready.
Ok, this is not for the Tier 2 per se, but for the "grid" site at the CHPC.
We have had a request to enable the Life Sciences Compute Grid at the site. This should simply be a matter of enabling the VO on the CE and WN.
@bazinski needs to make the changes, unless you can delegate that to me.
The monalisa data in zabbix is not being pulled in, since about 11am this morning
replace the current squid on vobox to be frontier squid, via puppet
failed jobs are in the 1500 range now on error_sv attempting to save.
Notice from tenet again to say congestion on cpt-jhb-durban.
One must assume that the outgoing connection on seacom is causing us to be congested.
For redundancy reasons the singular 10G interfaces on grid-xrootd0[12] must be made dual and then plugged into the alternate 10G switch and then both interfaces bonded.
The zabbix monitoring server cannot access certain network ranges; collapse the overly complex network topology into a simpler structure of 3 blocks.
fix the storage array locations in racktables in rack4
the monalisa client on the vobox is down.
something wrong with sam tests since late 25th june
The eos storage can use centos 7, upgrade them
grid-se2, grid-xrootd0[12], grid-xrootd
Intermittent network errors.
Relevant parties have been informed.
install racktables on new monitoring server and integrate with zabbix as per :
https://github.com/kaz260/RackTables-ZABBIX-bridge
On sagrid queue
config is here : https://voms.ct.infn.it:8443/voms/vo.africa-grid.org/configuration/configuration.action
For a tier2 site we need additional storage.
ALICE needs 600TB usable to comply with our obligations; we have c. 350TB.
ATLAS as of "very soon" will require 600TB as well, 60TB for "local users" and the rest for general usage.
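The figures above work out as follows. This is a sketch using only the numbers quoted in the issue; the constant names are made up for illustration, and the ALICE figure is approximate ("c. 350").

```python
ALICE_REQUIRED_TB = 600
ALICE_AVAILABLE_TB = 350      # "c. 350" in the issue, so approximate
ATLAS_REQUIRED_TB = 600
ATLAS_LOCAL_USERS_TB = 60

# Extra usable capacity ALICE still needs to meet its obligation
alice_shortfall_tb = ALICE_REQUIRED_TB - ALICE_AVAILABLE_TB

# What remains of the ATLAS requirement for general usage
atlas_general_tb = ATLAS_REQUIRED_TB - ATLAS_LOCAL_USERS_TB

if __name__ == "__main__":
    print(f"ALICE shortfall: ~{alice_shortfall_tb} TB")
    print(f"ATLAS general usage: {atlas_general_tb} TB")
```

That is roughly 250TB still missing for ALICE, and 540TB of the ATLAS requirement going to general usage. The issue does not say how much ATLAS storage exists today, so no ATLAS shortfall is computed.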
grid-se the redirector for ATLAS needs a 10G interface that is sitting in grid-ui
grid-ui is stolen for some cloud project.
Either grid-ui and grid-se must be reinstalled and physically swapped or the 10G interface in grid-ui must go into grid-se.
they appear to have the same proc and same mem.
rather self explanatory: the nodes all have various configurations for firewalls. Put it all in puppet; the module is installed, it's a config and testing exercise
get rid of xcat, it's currently dead, nobody wants to fix it.
due to power failure this morning ALICE EOS is down with corrupt files, still working on it, and reading
eos template for zabbix to monitor .... eos.
alice jobs plummeted from 1800 to 700 in 1 hour
zabbix is sending to the bot directly not to a specific channel #chcp-zabbix
Yip, you read the subject correctly: gnode-1-40 has no serial number/service tag.
dmidecode -t system :
root@gnode-1-40:~ $ dmidecode -t system
dmidecode 2.11
SMBIOS 2.7 present.
Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge C8220
Version: Not Specified
Serial Number: N/A
UUID: 4C4C4544-002F-4110-8020-CEC04F202020
Wake-up Type: Power Switch
SKU Number: N/A
Family: Server
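To find other nodes with the same problem, the dmidecode output could be checked with something like the sketch below. The function names and the placeholder list are assumptions for illustration; the sample text is the output pasted above.

```python
SAMPLE = """\
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge C8220
Version: Not Specified
Serial Number: N/A
UUID: 4C4C4544-002F-4110-8020-CEC04F202020
Wake-up Type: Power Switch
SKU Number: N/A
Family: Server
"""

# Values treated as "no real value set" (an assumption, extend as needed)
PLACEHOLDERS = {"", "n/a", "not specified"}


def parse_system_info(text: str) -> dict[str, str]:
    """Turn 'Key: value' lines of `dmidecode -t system` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, such as section headers
            fields[key.strip()] = value.strip()
    return fields


def has_serial(fields: dict[str, str]) -> bool:
    """True if the Serial Number field holds something other than a placeholder."""
    return fields.get("Serial Number", "").lower() not in PLACEHOLDERS


if __name__ == "__main__":
    info = parse_system_info(SAMPLE)
    print(info["Product Name"], "has serial:", has_serial(info))
```

In practice the text would come from running `dmidecode -t system` on each node rather than from a pasted sample.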
monalisa reporting large job failures.
kernel version needs update to prevent
https://bugzilla.redhat.com/show_bug.cgi?id=713546
workaround is :
vm.min_free_kbytes = 512000
vm.zone_reclaim_mode = 1
This will obviously require a reboot and hence downtime
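Before and after the change it may help to check which nodes already carry the workaround values. A minimal sketch, assuming the standard Linux mapping of sysctl keys to files under /proc/sys; the function names are hypothetical, and the script only reports differences, it does not apply anything.

```python
from pathlib import Path

# Workaround values quoted in the issue
WORKAROUND = {"vm.min_free_kbytes": "512000", "vm.zone_reclaim_mode": "1"}


def sysctl_path(key: str) -> Path:
    """Map a sysctl key such as vm.min_free_kbytes to its /proc/sys file."""
    return Path("/proc/sys") / key.replace(".", "/")


def read_sysctl(key: str) -> str:
    return sysctl_path(key).read_text().strip()


def pending_changes(wanted, read=read_sysctl):
    """Return {key: (current, wanted)} for every key that differs."""
    pending = {}
    for key, value in wanted.items():
        try:
            current = read(key)
        except OSError:
            current = None  # key missing or unreadable on this host
        if current != value:
            pending[key] = (current, value)
    return pending


if __name__ == "__main__":
    for key, (current, wanted) in pending_changes(WORKAROUND).items():
        print(f"{key}: current={current} wanted={wanted}")
```

The `read` parameter is there so the comparison can be tried against canned values without touching a real node.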
swap the physical host gridui, which is now called cloud-head or something similar, with the physical host grid-se is sitting on. This then leaves grid-ce, grid-se and grid-se2 all with dual 10G connections for resilience.
most jobs don't get executed
put foreman and puppet on old xcat server
ethsw07 is not mounted properly in the racks (sitting loose) and has no identifiable management ip
We lost all the disks on the "new" redirector for atlas storage.
We (I) will update the firmware for a lot of things on the servers, starting with the grid-se.chpc.ac.za machine.