DoE watchdog warns of poor maintenance at home of Frontier exascale system

Report says new QA plan currently being worked up


The US Department of Energy's watchdog claims that operations and maintenance are being poorly managed at Oak Ridge National Laboratory’s datacenter, home to advanced computers such as the world’s first exascale system, Frontier.

The DoE’s Office of Inspector General (OIG) received an allegation in September 2022 regarding maintenance and calibration in datacenters at the Oak Ridge site in Tennessee, which undertakes science projects relating to nuclear power and national security.

According to the report [PDF], filed yesterday, the allegation claimed that the calibration program at the site was inadequate, and there was poor or no maintenance at all on pressure relief valves (PRVs) within the datacenters. The OIG said it conducted an inspection from January 2023 through September 2023, and was able to "substantiate" the allegations.

Specifically, the watchdog said it found the calibration program was inadequate to meet quality assurance requirements, and that standards-based management system procedures were not always followed when maintaining PRVs.

Failure to test or inspect PRVs properly could cause the system to exceed allowable pressure limits, potentially resulting in "events that may harm personnel and equipment," the OIG stated. It added that poorly maintained infrastructure could affect the availability of the computational resources, and thus the site’s mission goals.

Oak Ridge National Laboratory is managed and operated by UT-Battelle, LLC. The not-for-profit organization was established in 2000 for the sole purpose of managing the Oak Ridge site for the DoE. It is a limited liability partnership between the University of Tennessee and Battelle Memorial Institute, itself a non-profit science and technology outfit.

We asked UT-Battelle for a response to this report, but the organization was not immediately available to give an answer.

The report refers to datacenters relating to buildings 5300, 5600, and 5800 at the Oak Ridge site. These are home to the Multiprogram Research Facility, the Computational Sciences Building and the Engineering Technology Facility.

The Computational Sciences Building houses the Oak Ridge Leadership Computing Facility (OLCF), which operates the Frontier supercomputer.

The OIG report said it found UT-Battelle's calibration program to be inadequate because the organization was "unable to provide sufficient documentation that demonstrated calibration had been performed in accordance with applicable criteria."

A UT-Battelle manager informed the watchdog that routine calibration is not necessary, the report added. This is because each piece of equipment is calibrated at installation, and the datacenter systems are then continuously monitored by a subcontractor using a software system that notifies them of any subsequent issues.

However, the OIG said that while this is allowed, all software, regardless of safety significance, must be controlled by a quality assurance program, and the quality assurance program must describe how the requirements are met.

As ORNL was unable to provide documentation describing how these requirements are met, the OIG report said that UT-Battelle therefore does not know whether the software is providing accurate information.

In the case of the PRVs, the report stated that UT-Battelle did not always maintain and/or test the three types of datacenter PRVs in accordance with applicable guidance.

The OIG found that all three listed air-type PRVs had not always been tested within the required timeframes, while 22 of the 54 refrigerant-type PRVs had not been tested and 12 of 27 water-type PRVs on the list had not been tested and/or inspected, as required.

UT-Battelle said in the report that in some instances PRV testing did not occur because it was overlooked, while for refrigerant PRVs, testing was performed based on the manufacturer's recommendations rather than ORNL's procedure.

UT-Battelle said that it is currently revising its procedure to reflect the manufacturer's recommendation for refrigerant PRVs, and that it has begun taking action to ensure full compliance with its procedure.

However, the OIG report noted that UT-Battelle carried out an assessment in 2020 that identified similar issues. The recommendations resulting from that assessment show signs of progress, it said, but illustrate the need for further improvement in this area.

The report said that UT-Battelle management fully concurred with the recommendations, and it has agreed to develop a quality assurance plan for the monitoring software and ensure that datacenter PRVs are properly identified and comply with current procedures and requirements. ®
