compute-5-0 : errors on DIMMs B2/B6, swapped with A2/A6.
compute-3-3 : memory errors on DIMM A3, swapped with B3.
nanox-zen3 partition added, 3 nodes.
compute-2-1 : unavailable, back in production after a reboot.
Some news after the holidays :
compute-14-5 : memory upgraded to 512 GB.
New partition : zen2, 6 new nodes, plus molqed and actipnmr.
Retirement of Infiniband.
The following partitions are removed :
6 nodes, will be replaced by zen2 nodes between the end of April and the beginning of May.
compute-8-0 : sda replaced by a new SSD, back online.
compute-2-1 : sdb replaced by a spare disk, back online.
compute-8-0 : node out of service, sda broken.
actipnmr and molqed partitions are now online for production.
compute-5-2 : problem with DIMM B3, swapped with A3.
compute-9-4 : CPU 2 dead, replaced.
compute-9-3 : motherboard & power supply problem, both have been replaced, node back to work.
Memory problem on compute-3-3, DIMM A3, looking for a fix.
New website is up, made with Hugo, available in English and French, sources on irsamc git.
Incident on the BeeGFS storage : the MDS server reached 100%, and some calculations got blocked. The metadata volume has been increased.
Cluster maintenance : everything is fine, the Slurm configuration has been corrected to avoid crashes.
BeeOND is now installed and available for multi-node jobs, check the documentation.
RMA compute-8-1 : memory bank B1 has been changed.
compute-10-0 back to production after maintenance.
compute-8-1 : DIMMs A1 and B1 swapped to check the memory fault.
xeonv6 partition added (1 node).
The BeeGFS mount crashed, so when jobs finished they could not execute the prolog script that deletes /mnt/beegfs/tmpdir/$SLURM_JOB_ID.
The nodes were therefore considered unavailable, which caused this morning's breakdown.
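The failure mode described above can be softened with a defensive cleanup function. The following is only a minimal sketch, not the script actually used on the cluster: the function name `cleanup_job_tmpdir` and its messages are hypothetical, and it assumes the cleanup runs as a job-end script (in Slurm terms, typically an Epilog, whose non-zero exit status drains the node). It also assumes a missing mount shows up as a missing directory; a hung mount could still block the directory test and would need an additional timeout.

```shell
#!/bin/bash
# Hypothetical sketch: remove a per-job scratch directory without ever
# returning a non-zero status, so a storage problem does not take the
# node out of service.

# Usage: cleanup_job_tmpdir BASE JOB_ID
cleanup_job_tmpdir() {
    local base="$1" job_id="$2"

    # An empty job ID would make the target BASE itself: refuse.
    if [ -z "$job_id" ]; then
        echo "skip: empty job id"
        return 0
    fi

    # If the base directory is not reachable (for example because the
    # mount is gone), there is nothing to clean up.
    if [ ! -d "$base" ]; then
        echo "skip: $base not available"
        return 0
    fi

    # Report a failed removal, but do not propagate the error.
    if rm -rf -- "$base/$job_id"; then
        echo "removed: $base/$job_id"
    else
        echo "warn: could not remove $base/$job_id"
    fi
    return 0
}
```

In a job-end script it would be called as `cleanup_job_tmpdir /mnt/beegfs/tmpdir "$SLURM_JOB_ID"`. Always returning 0 trades strictness for availability: leftover directories then have to be purged by some other means.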
Tue Jan 08 2019 14:17:07 Drive 0 in disk drive bay 1 is operating normally.
Tue Dec 25 2018 06:59:55 Fault detected on drive 0 in disk drive bay 1.
Power supply out, replacing it with Dell support.
compute-0-1 out of service, CPU 4 dead.
compute-1-[0,7,13] are back online.
xeonv1_mono, xeonv2_mono and sv6 go into production, all of them with an Infiniband network.
lpqsv26 is back in production under a new name : lcpq-curie.
The lpqsv26 name is still available.
Retirement of : compute-6-1, compute-6-2, compute-6-3, compute-0-0.
compute-0-0 is now an “epycv1” (AMD Naples) node.
We added compute-9-[1-4] (xeonv5) and compute-0-0 (epycv1).
We put compute-9-0 (xeonv5) and compute-10-0 (xeonv5_mono) into production; they use Intel Xeon Gold CPUs based on the Skylake architecture.
Upgrade to CentOS 6.9 done, storage upgraded to BeeGFS 6.
compute-0-1 : one disk broke.
Memory problem on compute-3-3 (A1 and B1 swapped).
A few nodes have NFS problems. To fix them, on the master :
service rpcbind restart
service nfs restart
On the nodes :
service autofs restart
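The restart sequence above can be wrapped in a small helper so that the order of services is not mistyped. This is only a sketch: the function name `nfs_recover` and the `DRY_RUN` variable are hypothetical, and by default it only prints the commands instead of running them.

```shell
#!/bin/bash
# Hypothetical helper: print (default) or run the NFS recovery commands.
# Usage: nfs_recover master|node
# Set DRY_RUN=0 to actually restart the services (requires root).
nfs_recover() {
    local role="$1"
    local -a services
    local svc

    case "$role" in
        master) services=(rpcbind nfs) ;;   # order matters: rpcbind first
        node)   services=(autofs) ;;
        *)
            echo "usage: nfs_recover master|node" >&2
            return 2
            ;;
    esac

    for svc in "${services[@]}"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "service $svc restart"
        else
            service "$svc" restart || return 1
        fi
    done
}
```

Running `nfs_recover master` on the master and `nfs_recover node` on each affected node prints the same commands as listed above; with `DRY_RUN=0` they are executed and the function stops at the first failing restart.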
compute-3-2 : power supply down, replaced.
compute-5-0 : memory problem, A1 and B1 swapped.
New node : compute-7-2 - xeonv4.
New node : compute-7-1 - xeonv4.
compute-0-3 : crashed last night. Permanently stopped.
compute-0-4 : node shut down during the night of 31 August. No messages. We powered it back on and reinstalled it.
compute-6-3 : HDD 1 out of service, will be retired on August 30.
Cluster is stopped for electrical maintenance.
System upgraded to CentOS 6.8 and Slurm 16.05.2.
Memory problem on compute-5-1 : swap between B4 and A4.
xeonv4 and xeonv4_mono partitions are now up (compute-7 and compute-8).
compute-0-4 crashed : electrical problem, we are monitoring it.
Memory problem on compute-5-1, DIMM B4 : to be checked in September.
Problem with memory speed on compute-7-0 : checking it with Dell.
compute-6-0 : HDD 1 dead. Node retired.
Memory problem on compute-2-5 : swap A2-B2.
SSD KO on compute-50-30 : issue with HP.
Memory swap : A3/B3 on compute-5-0 (09/05/2016).
Move and reinstall of compute-40-3 and compute-41-0 (10/05/2016).
compute-6-3 is back.
compute-3-0 repaired (11/05/2016).
Memory swap A2/B2 on compute-3-4.
Move and reinstall of compute-40-[0-2].
HDD KO on compute-6-3.
New module : openmpi/openmpi-1.10.2-ifort16-int64 for Timo’s DIRAC
Maintenance of compute-40-[0-3] and compute-41-0 in order to move them.
CPU problem on compute-3-0, it goes into maintenance mode.
Unstable network on compute-2-4 and compute-3-5, corrected.
Adding a new node for the ANR esbodyr (napab).
Added nodes to the ex-sv6 cluster (sv6 and sv6_ssd partitions).
Added Intel compiler cluster 2016.
New module : Intel compiler 14.
Intel 2014 is no longer loaded by default.
compute-0-0 & compute-0-2 are back in production.
compute-0-1 is now in maintenance to do some tests.
compute-3-3 will be in maintenance to replace memory. This will be done when the running jobs finish.
compute-3-0 and compute-3-1 will be in maintenance to add RAM. Status : done (64 GB to 128 GB).
compute-0-0 & compute-0-2 have hardware problems, they are down for the moment.