News
2025/07/10
- zen3-7 and zen3-8 : upgrade RAM to 2TB
- zen3-5 and zen3-6 : upgrade RAM to 1TB
2025/07/06
- mds1 (BeeGFS) : crash causing interruption of the scratch space
2025/04/04
Adding cncliff-1 (trixs)
2025/02/18
Swap DIMM A1 on cnxv3-5.
Swap DIMM B4 on xv3-3.
Adding cnzen3-8 (trex).
2024/07/09
Adding record partition.
2024/06/25
cnnapab-1 : problem on DIMM B4, swap with B8 (same as 2016, compute-5-1).
2024/05/27
Website finally adapted to Ampere
2023/02/14
compute-5-0 : some errors with DIMM B2/B6, swap with A2/A6.
2022/06/02
compute-3-3 : memory errors on DIMM A3, swap with B3.
2022/05/25
- compute-5-1 : memory errors on DIMM B1/B5, swap with A1/A5
- oss-0-0 : PSU1 dead, replaced by spare
2022/02/01
nanox-zen3 partition added, 3 nodes.
2021/11/02
compute-2-1 : unavailable, back to production after reboot.
2021/08/24-25
Some news after the holidays :
- InfiniBand card removed from master
- iDRAC card added to master
- lcpq-curie software upgrade
- compute-1-12 : crash after OOM
- compute-1-13 : second disk not available, fixed
- compute-3-5 : alert on DIMM B3, swap with A3
- IPMI IPs reconfigured for compute-3-4 & compute-3-5
- compute-1-1 : sda dead, replaced and system reinstalled
- compute-3-2 : alert on DIMM B4, swap with A4
- compute-1-6 : iDRAC unresponsive after firmware upgrades, power drain to reset the system
- firmware & software updates on empty nodes
2021/06/22
compute-14-5 memory upgrade to 512GB.
2021/04/21
New partition : zen2, 6 new nodes, plus molqed and actipnmr.
Retirement of InfiniBand.
2021/03/30
The following partitions are removed :
- sv6
- xeonv1_mono
- xeonv2_mono
6 nodes in total, to be replaced by zen2 nodes between the end of April and the beginning of May.
2021/01/26
compute-8-0 : sda replaced by a new SSD, back online.
compute-2-1 : sdb replaced by a spare disk, back online.
2021/01/05
compute-8-0 : node out of service, sda broken
2020/10/08
actipnmr and molqed partitions are now online for production.
2020/09/03
compute-5-2 : problem with B3 DIMM, swap with A3.
2020/06/24
compute-9-4 : cpu 2 dead, replaced.
2020/06/16
compute-9-3 : motherboard & power supply problem, both have been replaced, node back to work.
2020/05/20
Memory problem on compute-3-3, DIMM A3, looking for a solution.
2020/04/28
New website is up, made with hugo, available in English and French, sources on irsamc git.
2020/01/29
Incident on the BeeGFS storage : the metadata volume on the MDS server reached 100% usage and some calculations were blocked. The metadata volume has been enlarged.
2020/01/09
Cluster maintenance : everything is fine, slurm configuration has been corrected to avoid crashes.
BeeOND is now installed and available for multi-node jobs, check the documentation.
2019/12/17
RMA compute-8-1 : memory bank B1 has been changed.
compute-10-0 back to production after maintenance.
2019/11/19
compute-8-1 : swap DIMM A1 - B1 to check for a memory fault.
2019/09/17
xeonv6 partition added (1 node).
2019/06/06
The BeeGFS mount crashed, so finished jobs could not run the prolog script that deletes /mnt/beegfs/tmpdir/$SLURM_JOB_ID.
The affected nodes were therefore considered unavailable, which caused this morning’s breakdown.
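The failure mode above can be softened with a defensive cleanup step. This is a minimal sketch, assuming a POSIX shell. The function name cleanup_job_tmpdir and its two arguments are hypothetical and not taken from the cluster's actual scripts. It skips the removal when the base directory is unreachable and always returns 0, on the assumption that a leftover temporary directory is preferable to a node taken out of service. Whether to mask the failure this way is a site policy decision.

```shell
#!/bin/sh
# Hypothetical helper: remove <base>/<job_id> without ever reporting failure.
# Arguments: $1 = base directory of per-job temp dirs, $2 = job ID.
cleanup_job_tmpdir() {
    base=$1
    job_id=$2

    # Refuse to act on an empty job ID, so "$base/" itself is never removed.
    if [ -z "$job_id" ]; then
        echo "no job ID given, nothing to clean" >&2
        return 0
    fi

    # If the base directory is unreachable (e.g. the mount is gone), skip.
    if [ ! -d "$base" ]; then
        echo "$base not available, skipping cleanup of job $job_id" >&2
        return 0
    fi

    # Log a removal error instead of propagating it.
    rm -rf -- "$base/$job_id" || echo "could not remove $base/$job_id" >&2
    return 0
}
```

In a job-end script this could be called as cleanup_job_tmpdir /mnt/beegfs/tmpdir "$SLURM_JOB_ID".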
2019/01/08
compute-1-4 :
Tue Jan 08 2019 14:17:07 Drive 0 in disk drive bay 1 is operating normally.
Tue Dec 25 2018 06:59:55 Fault detected on drive 0 in disk drive bay 1.
compute-9-4 :
Power supply out, replacing it with Dell support.
2018/10/01
compute-0-1 out of service, CPU 4 dead.
compute-1-[0,7,13] are back online.
xeonv1_mono, xeonv2_mono and sv6 are going into production, all of them with an InfiniBand network.
2018/09/13
lpqsv26 is back in production under a new name : lcpq-curie.
The lpqsv26 name is still available.
Retirement of : compute-6-1, compute-6-2, compute-6-3, compute-0-0.
compute-0-0 is now an “epycv1” (AMD Naples) node.
We add compute-9-[1-4] (xeonv5) and compute-0-0 (epycv1).
2017/10/26
We put compute-9-0 (xeonv5) and compute-10-0 (xeonv5_mono) into production; they use Intel Xeon Gold CPUs on the Skylake architecture.
2017/08/29
Upgrade to CentOS 6.9 done, storage upgraded to BeeGFS 6.
compute-0-1 : one disk broke.
Memory problem on compute-3-3 (swap between A1 - B1)
2017/07/12
A few nodes have NFS problems. To fix them, run on the master :
service rpcbind restart
service nfs restart
On the nodes :
service autofs restart
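The restart order above can be wrapped in a small helper that stops at the first failure. This is only a sketch: the function name restart_in_order and the SERVICE_CMD override are hypothetical, added here so the sequence can be dry-run without touching real services.

```shell
#!/bin/sh
# SERVICE_CMD is a hypothetical override; it defaults to the real "service".
SERVICE_CMD=${SERVICE_CMD:-service}

# Restart the given services one by one, in the order given.
# Stops and returns 1 as soon as one restart fails.
restart_in_order() {
    for svc in "$@"; do
        echo "restarting $svc"
        "$SERVICE_CMD" "$svc" restart || {
            echo "restart of $svc failed, stopping" >&2
            return 1
        }
    done
}
```

On the master this would be called as restart_in_order rpcbind nfs, and on the nodes as restart_in_order autofs. Setting SERVICE_CMD=echo prints the commands instead of running them.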
2017/03/03
compute-3-2 : power supply down, replaced.
2017/01/18
compute-5-0 : memory problem, swap A1 <> B1
2016/11/07
New node : compute-7-2 - xeonv4.
2016/09/26
New node : compute-7-1 - xeonv4.
2016/09/16
compute-0-3 : crashed last night. Permanently shut down.
2016/09/01
compute-0-4 : node shut down during the night of August 31. No messages. We powered it back on and reinstalled it.
2016/07/25
compute-6-3 : HDD 1 out of service, will be retired on August 30.
Cluster is stopped for electrical maintenance.
System upgraded to CentOS 6.8 and Slurm 16.05.2.
Memory problem on compute-5-1 : swap between B4 and A4.
2016/06/28
xeonv4 and xeonv4_mono partitions are now up (compute-7 and compute-8).
2016/06/28
Crash of compute-0-4 : electrical problem, we are monitoring it.
2016/06/27
Memory problem on compute-5-1, DIMM B4 : to be checked in September.
Problem with memory speed on compute-7-0 : checking it with Dell.
2016/06/15
compute-6-0 : HDD 1 dead. Node retired.
2016/06/13
Memory problem on compute-2-5 : swap A2-B2.
SSD failed on compute-50-30 : issue with HP.
2016/05/09-10-11
Memory swap : A3/B3 on compute-5-0 (09/05/2016).
Move and reinstall of compute-40-3 and compute-41-0 (10/05/2016).
compute-6-3 is back.
compute-3-0 repaired (11/05/2016).
2016/05/03
Memory swap A2/B2 on compute-3-4.
Move and reinstall of compute-40-[0-2].
HDD failed on compute-6-3.
New module : openmpi/openmpi-1.10.2-ifort16-int64 for Timo’s DIRAC
2016/04/22
Maintenance of compute-40-[0-3] and compute-41-0 in order to move them.
2016/04/19
CPU problem on compute-3-0, it goes into maintenance mode.
Unstable network on compute-2-4 and compute-3-5, fixed.
2016/04/15
Adding a new node for the ANR esbodyr (napab).
Add nodes on the ex-sv6 cluster (sv6 and sv6_ssd partitions)
2016/04/04
Add intel compiler cluster 2016
New module intel compiler 14
Intel 2014 is no longer loaded by default.
2015/10/06
compute-0-0 & compute-0-2 are back in production.
compute-0-1 is now in maintenance to do some tests.
compute-3-3 will be in maintenance to replace memory. This will be done when the running jobs finish.
compute-3-0 and compute-3-1 will be in maintenance to add RAM. Status : done (64 > 128 GB).
2015/09/23
compute-0-0 & compute-0-2 have hardware problems, they are down for the moment.