News

2023/02/14

compute-5-0 : some errors with DIMM B2/B6, swap with A2/A6.

2022/06/02

compute-3-3 : memory errors on DIMM A3, swap with B3.

2022/05/25

compute-5-1 : memory errors on DIMM B1/B5, swap with A1/A5
oss-0-0 : PSU1 dead, replaced by spare

2022/02/01

nanox-zen3 partition added, 3 nodes.

2021/11/02

compute-2-1 : unavailable, back to production after reboot.

2021/08/24-25

Some news after the holidays :

Infiniband card removed from master
iDRAC card added to master
lcpq-curie software upgrade
compute-1-12 : crash after OOM
compute-1-13 : second disk not available, fixed
compute-3-5 : alert on DIMM B3, swap with A3
IPMI IPs reconfigured for compute-3-4 & compute-3-5
compute-1-1 : sda dead, replaced and system reinstalled
compute-3-2 : alert on DIMM B4, swap with A4
compute-1-6 : after firmware upgrades, iDRAC unresponsive, power drain to reset the system
firmware & software updates on empty nodes

2021/06/22

compute-14-5 memory upgrade to 512GB.

2021/04/21

New partition : zen2, 6 new nodes, plus molqed and actipnmr.

Infiniband retired.

2021/03/30

The following partitions are removed :

sv6
xeonv1_mono
xeonv2_mono

These 6 nodes will be replaced by zen2 nodes between the end of April and the beginning of May.

2021/01/26

compute-8-0 : sda replaced by a new SSD, back online.

compute-2-1 : sdb replaced by a spare disk, back online.

2021/01/05

compute-8-0 : node out of service, sda broken

2020/10/08

actipnmr and molqed partitions are now online for production.

2020/09/03

compute-5-2 : problem with B3 DIMM, swap with A3.

2020/06/24

compute-9-4 : cpu 2 dead, replaced.

2020/06/16

compute-9-3 : motherboard & power supply problem, both have been replaced, node back to work.

2020/05/20

Memory problem on compute-3-3, DIMM A3, looking for a fix.

2020/04/28

The new website is up, made with Hugo, available in English and French, sources on irsamc git.

2020/01/29

Incident on the BeeGFS storage : the MDS server reached 100%, some calculations got blocked. The metadata volume has been increased.

2020/01/09

Cluster maintenance : everything is fine, slurm configuration has been corrected to avoid crashes.

BeeOND is now installed and available for multi-node jobs, check the documentation.

2019/12/17

RMA compute-8-1 : memory bank B1 has been changed.

compute-10-0 back to production after maintenance.

2019/11/19

compute-8-1 : swap DIMM A1 - B1 to check for a memory fault.

2019/09/17

xeonv6 partition added (1 node).

2019/06/06

The BeeGFS mount crashed, so when jobs finished, they could not execute the prolog script which removes /mnt/beegfs/tmpdir/$SLURM_JOB_ID

The nodes were therefore considered unavailable, which caused this morning's breakdown.
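
The failure mode above can be softened with a defensive cleanup step. The following is only a sketch, not the script actually used on the cluster: the function name cleanup_job_tmpdir and the TMP_ROOT override are hypothetical, and the default path is the one quoted in the note. It relies on the fact that Slurm drains a node whose prolog or epilog exits with a non-zero code, so the helper logs problems and still returns 0.

```shell
#!/bin/sh
# Hypothetical cleanup helper for a Slurm prolog/epilog script.
# TMP_ROOT is an assumption added for testing; by default it is the
# per-job scratch root mentioned in the note.
cleanup_job_tmpdir() {
    root="${TMP_ROOT:-/mnt/beegfs/tmpdir}"
    job_id="$1"

    # Without a job ID we would target the whole root: refuse.
    if [ -z "$job_id" ]; then
        echo "cleanup_job_tmpdir: missing job ID, nothing done" >&2
        return 0
    fi

    # If the storage root is not there (mount lost), skip the cleanup
    # instead of failing and taking the node out of service.
    if [ ! -d "$root" ]; then
        echo "cleanup_job_tmpdir: $root not available, skipping" >&2
        return 0
    fi

    # Remove the per-job directory; report a failure but do not propagate it.
    rm -rf -- "$root/$job_id" ||
        echo "cleanup_job_tmpdir: could not remove $root/$job_id" >&2
    return 0
}
```

In a real script the function would be called as cleanup_job_tmpdir "$SLURM_JOB_ID". A mount that hangs instead of disappearing can still block the directory test, so this sketch does not cover every case.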

2019/01/08

compute-1-4 :

Tue Jan 08 2019 14:17:07 Drive 0 in disk drive bay 1 is operating normally.

Tue Dec 25 2018 06:59:55 Fault detected on drive 0 in disk drive bay 1.

compute-9-4 :

Power supply out, replacing it with Dell support.

2018/10/01

compute-0-1 out of service, CPU 4 dead.

compute-1-[0,7,13] are back online.

xeonv1_mono, xeonv2_mono and sv6 go into production, all of them on the Infiniband network.

2018/09/13

lpqsv26 is back in production under a new name : lcpq-curie.

The lpqsv26 name is still available.

Retirement of : compute-6-1, compute-6-2, compute-6-3, compute-0-0.

compute-0-0 is now an “epycv1” (AMD Naples) node.

We added compute-9-[1-4] (xeonv5) and compute-0-0 (epycv1).

2017/10/26

We put compute-9-0 (xeonv5) and compute-10-0 (xeonv5_mono) into production; they use Intel Xeon Gold processors based on the Skylake architecture.

2017/08/29

Upgrade to CentOS 6.9 done, storage upgraded to BeeGFS 6.

compute-0-1 : one disk broke.

Memory problem on compute-3-3 (swap between A1 - B1)

2017/07/12

A few nodes have NFS problems. To fix them, on the master :

service rpcbind restart
service nfs restart

On the nodes :

service autofs restart
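
The same sequence can be wrapped in small functions so that it can be previewed before being run. This is a sketch only: the names run, restart_nfs_master, restart_autofs_node and the DRY_RUN variable are hypothetical; the service commands and their order are the ones listed above.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN is set.
# DRY_RUN is an assumption added for illustration.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# On the master: restart rpcbind, then nfs, stopping at the first failure.
restart_nfs_master() {
    run service rpcbind restart && run service nfs restart
}

# On each affected node: restart the automounter.
restart_autofs_node() {
    run service autofs restart
}
```

Calling restart_nfs_master with DRY_RUN=1 prints the two commands without executing them; without DRY_RUN they are executed as root in the order given in the note.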

2017/03/03

compute-3-2 : power supply down, replaced.

2017/01/18

compute-5-0 : memory problem, swap A1 <> B1

2016/11/07

New node : compute-7-2 - xeonv4.

2016/09/26

New node : compute-7-1 - xeonv4.

2016/09/16

compute-0-3 : crashed last night. Permanently stopped.

2016/09/01

compute-0-4 : node shut down during the night of August 31. No messages. We turned it back on and reinstalled it.

2016/07/25

compute-6-3 : HDD 1 out of service, will be retired on August 30.

Cluster is stopped for electric maintenance.

System upgraded to CentOS 6.8, Slurm 16.05.2.

Memory problem on compute-5-1 : swap between B4 and A4.

2016/06/28

xeonv4 and xeonv4_mono partitions are now up (compute-7 and compute-8).

2016/06/28

compute-0-4 crashed : electrical problem, we are monitoring it.

2016/06/27

Memory problem on compute-5-1, DIMM B4 : to be checked in September.

Problem with memory speed on compute-7-0 : checking it with Dell.

2016/06/15

compute-6-0 : HDD 1 dead. Node retired.

2016/06/13

Memory problem on compute-2-5 : swap A2-B2.

SSD failed on compute-50-30 : issue with HP.

2016/05/09-10-11

Memory swap : A3/B3 on compute-5-0 (09/05/2016).

Move and reinstall of compute-40-3 and compute-41-0 (10/05/2016).

compute-6-3 is back.

compute-3-0 repaired (11/05/2016).

2016/05/03

Memory swap A2/B2 on compute-3-4.

Move and reinstall of compute-40-[0-2].

HDD failed on compute-6-3.

New module : openmpi/openmpi-1.10.2-ifort16-int64 for Timo’s DIRAC

2016/04/22

Maintenance of compute-40-[0-3] and compute-41-0 in order to move them.

2016/04/19

CPU problem on compute-3-0, it goes into maintenance mode.

Unstable network on compute-2-4 and compute-3-5, fixed.

2016/04/15

Adding a new node for the ANR esbodyr (napab).

Added nodes to the ex-sv6 cluster (sv6 and sv6_ssd partitions).

2016/04/04

Added Intel compiler cluster 2016.

New module : Intel compiler 14.

Intel 2014 is no longer loaded by default.

2015/10/06

compute-0-0 & compute-0-2 are back in production.

compute-0-1 is now in maintenance to do some tests.

compute-3-3 will be in maintenance to replace memory. This will be done when the jobs finish.

compute-3-0 and compute-3-1 will be in maintenance to add RAM. Status : done (64 > 128 GB).

2015/09/23

compute-0-0 & compute-0-2 have hardware problems, they are down for the moment.