
100,000 machines to test nuclear weapons

PURDUE (US) — Researchers are perfecting simulators that show a nuclear weapon’s performance in precise molecular detail.

The simulations are a key tool for national defense because international treaties forbid the detonation of nuclear test weapons. The simulations must be run on supercomputers containing thousands of processors, which has posed reliability and accuracy problems, researchers say.

Now researchers at Purdue University and high-performance computing experts at the National Nuclear Security Administration’s (NNSA) Lawrence Livermore National Laboratory have solved several problems hindering the use of the ultra-precise simulations. NNSA is the quasi-independent agency within the U.S. Department of Energy that oversees the nation’s nuclear security activities.

The simulations, which are needed to more efficiently certify nuclear weapons, may require 100,000 machines, a level of complexity that is essential to accurately show molecular-scale reactions taking place over milliseconds, or thousandths of a second.

The same types of simulations also could be used in areas such as climate modeling and studying the dynamic changes in a protein’s shape. Such highly complex jobs must be split into many processes that execute in parallel on separate machines in large computer clusters.

“Due to natural faults in the execution environment there is a high likelihood that some processing element will have an error during the application’s execution, resulting in corrupted memory or failed communication between machines,” says Saurabh Bagchi, an associate professor in Purdue University’s School of Electrical and Computer Engineering. “There are bottlenecks in terms of communication and computation.”

These errors compound for as long as the simulation continues to run before the glitch is detected, and they may cause simulations to stall or crash altogether.

“We are particularly concerned with errors that corrupt data silently, possibly generating incorrect results with no indication that the error has occurred,” says Bronis R. de Supinski, co-leader of the ASC Application Development Environment Performance Team at Lawrence Livermore. “Errors that significantly reduce system performance are also a major concern since the systems on which the simulations run are very expensive.”

Advanced Simulation and Computing is the computational arm of NNSA’s Stockpile Stewardship Program, which ensures the safety, security, and reliability of the nation’s nuclear deterrent without underground testing.

The new findings will be detailed in a paper to be presented during the Annual IEEE/IFIP International Conference on Dependable Systems and Networks from June 25-28 in Boston. Recent research findings were detailed in two papers last year, one presented during the IEEE Supercomputing Conference and the other during the International Symposium on High-Performance Parallel and Distributed Computing.

The researchers have developed automated methods to detect a glitch soon after it occurs.

“You want the system to automatically pinpoint when and in what machine the error took place and also the part of the code that was involved,” Bagchi says. “Then, a developer can come in, look at it and fix the problem.” One bottleneck arises from the fact that data are streaming to a central server.

“Streaming data to a central server works fine for a hundred machines, but it can’t keep up when you are streaming data from a thousand machines,” says Purdue doctoral student Ignacio Laguna, who worked with Lawrence Livermore computer scientists. “We’ve eliminated this central brain, so we no longer have that bottleneck.”
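Eliminating the "central brain" typically means aggregating diagnostic data up a tree-shaped overlay, where each node combines reports from only a bounded number of children. The article does not spell out the team's actual design, so the sketch below is purely illustrative: `tree_aggregate`, `merge`, and the fanout value are all hypothetical.

```python
def merge(chunk):
    """Combine a handful of partial summaries into one.

    A summary here is just a dictionary of counters, e.g. how many
    processes in a group look suspicious.
    """
    out = {}
    for summary in chunk:
        for key, n in summary.items():
            out[key] = out.get(key, 0) + n
    return out


def tree_aggregate(reports, fanout=32):
    """Aggregate per-machine reports level by level up a tree.

    Each step merges at most `fanout` summaries, so no single node
    ever absorbs traffic from thousands of peers at once.
    Returns the root summary and the number of tree levels used.
    """
    level = reports
    rounds = 0
    while len(level) > 1:
        level = [merge(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        rounds += 1
    return level[0], rounds
```

With a fanout of 32, a thousand machines are funneled through just two aggregation levels, which is why this pattern scales where a single collector does not.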

Each machine in the supercomputer cluster contains several cores, or processors, and each core might run one “process” during simulations. The researchers created an automated method for “clustering,” or grouping the large number of processes into a smaller number of “equivalence classes” with similar traits. Grouping the processes into equivalence classes makes it possible to quickly detect and pinpoint problems.
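The intuition behind equivalence classes is that in a healthy parallel run, most processes behave alike; a process whose behavior matches almost nobody else's is a suspect. The article does not describe the actual clustering features or algorithm, so this is a minimal sketch under assumed inputs: `process_traits` and its signature tuples are hypothetical stand-ins for whatever behavioral traits the real system records.

```python
from collections import defaultdict


def find_outliers(process_traits):
    """Group processes by a behavioral signature; flag singleton groups.

    `process_traits` maps a process rank to a hashable signature (for
    illustration, the function it is executing and its loop depth).
    A rank whose signature is shared by no other process stands out
    against the consensus and is returned as a suspect.
    """
    classes = defaultdict(list)
    for rank, signature in process_traits.items():
        classes[signature].append(rank)
    # Singleton classes are suspects: every other process agrees on
    # its behavior with many peers.
    return [ranks[0] for ranks in classes.values() if len(ranks) == 1]
```

Because the comparison is between a small number of classes rather than all pairs of processes, this kind of grouping stays cheap as the process count grows.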

“The recent breakthrough was to be able to scale up the clustering so that it works with a large supercomputer,” Bagchi says.

Lawrence Livermore computer scientist Todd Gamblin came up with the scalable clustering approach. A lingering bottleneck in using the simulations is related to a procedure called checkpointing, or periodically storing data to prevent its loss in case a machine or application crashes. The information is saved in a file called a checkpoint and stored in a parallel system distant from the machines on which the application runs.

“The problem is that when you scale up to 10,000 machines, this parallel file system bogs down,” Bagchi says. “It’s about 10 times too much activity for the system to handle, and this mismatch will just become worse because we are continuing to create faster and faster computers.”

Doctoral student Tanzima Zerin and Rudolf Eigenmann, a professor of electrical and computer engineering, along with Bagchi, led work to develop a method for compressing the checkpoints, similar to the compression of data for images. “We’re beginning to solve the checkpointing problem,” Bagchi said. “It’s not completely solved, but we are getting there.”
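The idea of shrinking checkpoints before they hit the parallel file system can be illustrated with generic lossless compression. The team's actual method is not detailed in the article (it is only compared to image compression), so the function names, the use of `zlib`, and the file layout below are all assumptions for the sake of a sketch.

```python
import pickle
import zlib


def write_checkpoint(state, path):
    """Serialize and compress application state before writing it out.

    Smaller checkpoint files mean less pressure on the shared parallel
    file system each time thousands of processes save at once.
    Returns the number of compressed bytes written.
    """
    blob = zlib.compress(pickle.dumps(state), level=6)
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)


def read_checkpoint(path):
    """Restore application state from a compressed checkpoint file."""
    with open(path, "rb") as f:
        return pickle.loads(zlib.decompress(f.read()))
```

Simulation state is often highly regular (large arrays with repeated or smooth values), which is exactly the kind of data that compresses well, so the I/O savings can be substantial.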

The checkpointing bottleneck must be solved in order for researchers to create supercomputers capable of “exascale computing,” or 1,000 quadrillion operations per second, Bagchi says.

“It’s the Holy Grail of supercomputing.”

The research has been funded by Lawrence Livermore and the National Science Foundation. The work also involves Lawrence Livermore scientists Greg Bronevetsky, Dong H. Ahn, Martin Schulz, and IBM Austin researcher Mootaz Elnozahy.

Purdue researchers did not work with the actual classified nuclear weapons software code, but instead used generic benchmarks, a set of programs designed to help evaluate the performance of parallel supercomputers.

More news from Purdue University: http://www.purdue.edu/newsroom/


You are free to share this article under the Creative Commons Attribution-NoDerivs 3.0 Unported license.

2 Comments

  1. Dean

    I have to say, I find this really quite crass. I’m a computer expert and not anti-nuclear weapons by any means, but why are we investing so much time in nuclear weapons research? Aren’t there better uses for such a high-performance computer?

    A nuke does one thing: it destroys. Who cares if you need a 100-megaton nuke or a 120-megaton one? Just about any sized nuke is going to cause an exceptional amount of damage, so why do we need to go into so much detail and accuracy in a simulation?

    The major countries of the world have more than enough nukes to cause more damage than the earth can ever stand, and there’s a very slim chance of a nuke being used anyway.

    So I find the whole idea of simulating nukes, because we can’t physically detonate one, a wasteful use of time, money and computing resource. There are far better things that such massive computing power can be put towards.

  2. R.Will

    Dean,

    While your point is largely valid, the article does indicate that this work may represent a solution to computation challenges that will be robust across other domains: protein folding, weather simulations, and whatnot. In addition, though I’m not anxious to see this application, I’d much rather see these experiments done in silicon than in the biosphere where, historically, they left a lot of undesirable nuclear byproducts. One could imagine a world where this might lead to the sort of weak/strong AI that will further lead to giant steps in medicine and fuel research (a lot of this work seems to be slanted toward more energy-efficient combustion technologies). So, while your specific point is right, I think you’re seeing a half-empty glass.

    R.Will
