
chpctier2's People

Contributors

bazinski

chpctier2's Issues

infiniband templates for zabbix

perfquery is your friend ....

Take those stats, understand what they are and which ones we want to keep, then pump them into zabbix and trigger on/display them.
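
A minimal sketch of the "take the stats and pump them into zabbix" step, assuming perfquery's usual text output of `Name:....value` lines. The item key `ib.perf[...]`, the function names and the sample values are made up for illustration; the resulting lines follow the `<host> <key> <value>` layout that `zabbix_sender -i` reads.

```python
import re

# Matches counter lines such as "PortXmitData:....................1234"
COUNTER_RE = re.compile(r"^(\w+):\.*(\S+)$")

# Made-up sample in the shape perfquery prints; values are not real measurements.
SAMPLE = """# Port counters: Lid 1 port 1
PortSelect:......................1
SymbolErrorCounter:..............0
PortXmitData:....................1234
PortRcvData:.....................5678
"""


def parse_perfquery(text):
    """Return {counter name: int} for every counter line in the text."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and the "# Port counters" header
        match = COUNTER_RE.match(line)
        if not match:
            continue
        name, raw = match.groups()
        try:
            counters[name] = int(raw, 0)  # base 0 also accepts hex like 0x1b01
        except ValueError:
            continue  # not a plain number, ignore
    return counters


def to_sender_lines(host, counters, keep):
    """Build "<host> <key> <value>" lines for the counters we chose to keep.

    The key name ib.perf[...] is hypothetical; use whatever the template defines.
    """
    return [
        f"{host} ib.perf[{name}] {counters[name]}"
        for name in keep
        if name in counters
    ]


if __name__ == "__main__":
    stats = parse_perfquery(SAMPLE)
    for line in to_sender_lines("gnode-1-40", stats, ["SymbolErrorCounter", "PortXmitData"]):
        print(line)
```

The `keep` list is where the "which ones we want to keep" decision would live; triggers and graphs are then defined on the zabbix side.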

15000+ ALICE jobs in error states

MonALISA is reporting 15k+ jobs in error states.

10k in error_v
3k error_e
2.2k error_ib

grafana/zabbix say it happened mostly at 14h30

atlas eos on wrong controller

The Dell storage arrays are reporting that the rebuilding disks are on the wrong controller: controller 1, while the preferred controller is controller 0.

I don't want to touch this while the rebuild is going on. All the storage currently being rebuilt is on the "wrong" controller according to the management software. They were all originally configured to be on controller 0, go figure.

This means that something is either wrong or broken.
A stress test once all data is live should answer this, see issue #13.

Move arrays back to controller 0 when ready.

Reconfigure network

Network is overly complex.

Change and/or fix the network to collapse the 172.20.196 network into 172.20.100.
Put all management networks in VLAN 10; some already are.

ALICE EOS is down

Due to a power failure this morning, ALICE EOS is down with corrupt files. Still working on it, and reading.

Get more storage

For a tier2 site we need additional storage.

ALICE needs 600TB usable to comply with our obligations; we have c. 350TB.
ATLAS will require 600TB as well as of "very soon": 60TB for "local users" and the rest for general usage.

request A/R recalc

A/R is wrong for 20-25 June and 20-26 May

I have no idea why they are wrong. They both span a power failure, one intentional and one not.
The sides are way too long for when the service actually came up.

rebuild xcat server

Throw away xcat.

Reinstall xcat01 as CentOS 7.x.
Back up for posterity; nothing particularly needs keeping.

Enable LSCG

OK, this is not for the Tier 2 per se, but for the "grid" site at the CHPC.

We have had a request to enable the Life Sciences Compute Grid at the site. This should simply be a matter of enabling the VO on the CE and WN.

@bazinski needs to make the changes, unless you can delegate that to me.

firmware upgrades for servers

We lost all the disks on the "new" redirector for atlas storage.

We (I) will update the firmware for a lot of things on the servers, starting with the grid-se.chpc.ac.za machine.

firewalls in puppet

Rather self-explanatory: the nodes all have various firewall configurations. Put it all in puppet. The module is installed, so it's a config and testing exercise.

Node network congestion

The 50 compute nodes are spread over two 48-port switches: 48 ports are used on one and 2 ports on the other. Redistribute.
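
A rough sketch of an even redistribution, alternating nodes between the two switches so each carries 25. The node and switch names below are placeholders, not the real hostnames, and the real cabling plan may prefer grouping by rack rather than strict alternation.

```python
def distribute(nodes, switches):
    """Assign nodes to switches round-robin and return {switch: [nodes]}."""
    plan = {switch: [] for switch in switches}
    for index, node in enumerate(nodes):
        plan[switches[index % len(switches)]].append(node)
    return plan


if __name__ == "__main__":
    # Placeholder names: 50 compute nodes and the two 48-port switches.
    nodes = [f"node-{n:02d}" for n in range(1, 51)]
    switches = ["switch-a", "switch-b"]
    for switch, members in distribute(nodes, switches).items():
        print(switch, len(members), "ports used")
```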

rotate ethsw09 and ethsw10

ethsw09 and ethsw10 are blowing air in the wrong direction; they need to be turned around. This does involve rerouting all the fibres.

Node 1-40 has no serial number

Yip, you read the subject correctly: gnode-1-40 has no serial number/service tag.

Output of dmidecode -t system:

root@gnode-1-40:~ $ dmidecode -t system
# dmidecode 2.11
SMBIOS 2.7 present.

Handle 0x0001, DMI type 1, 27 bytes
System Information
    Manufacturer: Dell Inc.
    Product Name: PowerEdge C8220
    Version: Not Specified
    Serial Number: N/A
    UUID: 4C4C4544-002F-4110-8020-CEC04F202020
    Wake-up Type: Power Switch
    SKU Number: N/A
    Family: Server
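
To find out whether other nodes have the same problem, the output above can be checked mechanically. This is only a sketch: it parses the `Key: Value` lines of `dmidecode -t system` text and treats `N/A`, `Not Specified` and empty values as "no serial". The function names and the set of placeholder values are my own choices.

```python
# Values treated as "no real serial number"; an assumption, extend as needed.
MISSING = {"", "N/A", "Not Specified"}

# The System Information block from this issue.
SAMPLE = """Handle 0x0001, DMI type 1, 27 bytes
System Information
    Manufacturer: Dell Inc.
    Product Name: PowerEdge C8220
    Version: Not Specified
    Serial Number: N/A
    UUID: 4C4C4544-002F-4110-8020-CEC04F202020
    Wake-up Type: Power Switch
    SKU Number: N/A
    Family: Server
"""


def system_info(text):
    """Return {field: value} for every "Key: Value" line in the text."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


def has_serial(text):
    """True if the Serial Number field holds something other than a placeholder."""
    return system_info(text).get("Serial Number", "") not in MISSING


if __name__ == "__main__":
    print("serial present:", has_serial(SAMPLE))
```

Fed with the captured output from each node, this gives a quick list of machines with a missing service tag.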

IP allocation and access

The zabbix monitoring server cannot access certain network ranges. Collapse the overly complex network topology into a simpler structure of 3 blocks:

  1. ipmi/snmp/monitoring/management
  2. data
  3. public.
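
Once the three blocks exist, checking which block an address belongs to is a simple lookup. The sketch below uses Python's `ipaddress` module; the CIDR ranges are invented for illustration only, since the actual ranges are not decided in this issue.

```python
import ipaddress

# Hypothetical ranges, for illustration only; replace with the real allocation.
BLOCKS = {
    "management": ipaddress.ip_network("10.0.10.0/24"),
    "data": ipaddress.ip_network("10.0.20.0/24"),
    "public": ipaddress.ip_network("192.0.2.0/24"),
}


def block_for(address):
    """Return the name of the block containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in BLOCKS.items():
        if ip in network:
            return name
    return None


if __name__ == "__main__":
    for addr in ("10.0.10.5", "10.0.20.40", "192.0.2.17", "172.20.196.4"):
        print(addr, "->", block_for(addr))
```

Anything that comes back as `None` is an address still living outside the three blocks, which is also what the monitoring server would need extra routes or rules to reach.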

upgrade monitoring server

Reinstall with CentOS 7, deployed from puppet.

Migrate the db from postgresql to mysql. Do we care about history?

fix ethsw07

ethsw07 is not mounted properly in the rack (sitting loose) and has no identifiable management IP.

10G onto grid-se

grid-se, the redirector for ATLAS, needs a 10G interface, which is currently sitting in grid-ui.
grid-ui has been stolen for some cloud project.

Either grid-ui and grid-se must be reinstalled and physically swapped, or the 10G interface in grid-ui must go into grid-se.

They appear to have the same processor and memory.

network congestion ?

Failed jobs are now in the 1500 range, in error_sv while attempting to save.

Notice from tenet again, reporting congestion on cpt-jhb-durban.

One must assume that the outgoing connection on seacom is causing us to be congested.

move grid-ui/cloud-head to grid-se

Swap the physical host grid-ui, which is now called cloud-head or something similar, with the physical host grid-se is sitting on. This then leaves grid-ce, grid-se and grid-se2 all with dual 10G connections for resilience.
