The Listserve: a STORM is Coming Follow-up

Last week I received an e-mail titled You've won The Listserve. In case you don't know: The Listserve doesn't refer to just any mailing list, but to an experimental lottery: each day one subscriber is given the chance to broadcast their ideas to all recipients. So now it was my turn.

Initially The Listserve had set out to reach millions. The project started when about 20,000 people had registered, but growth came to a grinding halt shortly after. My suspicion is that the initial swath of life-advice mails is to blame.

Anyway. Of course I was excited. I had been thinking for a long time about what I would write. Initially I had planned to write on how programming and algorithm design had influenced my thinking, but after all this life advice I quickly dropped that idea. Instead I wanted to share my excitement about working with supercomputers and the impact they have on our lives. For the latter I picked the STORM project, but more on that later. The full text of my message is here.

With this post I'd like to expand upon my original subject, based on the feedback I've received. So far I've got about two dozen mails -- roughly 0.1% of all subscribers, about what I had expected. The Listserve's guidelines limit every entry to 600 words (no images, no links), so I had to be terse -- despite my subject being of a highly technical nature. Also, I'm not a native speaker and, as they say, English is the easiest language to speak badly, so I sometimes lack a feeling for whether I'm expressing myself idiomatically. It's like threading a needle while wearing boxing gloves. Most feedback has been praise and none was negative, so I assume the writing was kind of OK, though some of the praise was apparently just because I didn't give any life advice. He he.

The STORM Project

The STORM project is an NSF-funded research initiative which will develop a next-generation storm surge forecast model. I am an associated investigator on this project, a role for which I have to thank Hartmut Kaiser, who in turn is a principal investigator on STORM. The project's stakeholders come from different areas; Zach Byerly has nicely summarized them as the computer science community (where I come from), the coastal community, the modelling community, and the emergency management community.
[Image: CERA screenshot]

The project's centerpiece is ADCIRC, which is to me both awe-inspiring and frightening. Awe-inspiring, because it contains about a quarter of a century of distilled domain knowledge on coastal modelling. Frightening, because this is the first time someone will do serious performance engineering on this code. And I'm part of this effort. In a nutshell, our plan is to interface ADCIRC with LibGeoDecomp, a computer simulation library (I'm the project lead), and HPX, a parallel/scalable/adaptable runtime. LibGeoDecomp's purpose here is to help with optimizing data structures and interfaces. HPX allows us to program at a higher level of abstraction: it lets us think in terms of tasks and dependencies. This makes it easier to balance out the computational load, which moves along with the water inside an ADCIRC simulation.
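To give a flavor of what "thinking in tasks and dependencies" looks like, here is a minimal sketch. It uses plain std::async/std::future so it compiles with any C++11 compiler; HPX offers the same idiom (plus distributed execution and lightweight threads) with its own future type. The "subdomain" decomposition and the function names are made up for illustration -- this is not ADCIRC or LibGeoDecomp code.

    #include <future>
    #include <iostream>
    #include <vector>

    // Stand-in for the real work done on one piece of the simulation domain.
    double update_subdomain(double water_level)
    {
        return 0.5 * water_level + 1.0;
    }

    int main()
    {
        // Launch one task per subdomain; each call immediately returns a future.
        std::vector<std::future<double>> tasks;
        for (int i = 0; i < 4; ++i) {
            tasks.push_back(std::async(std::launch::async, update_subdomain, 1.0 * i));
        }

        // The dependency is explicit: whoever needs a result waits on its future,
        // everything else keeps running. A runtime like HPX exploits exactly this
        // freedom to shuffle work around when the load becomes unbalanced.
        double sum = 0.0;
        for (auto& task : tasks) {
            sum += task.get();
        }
        std::cout << "checksum: " << sum << '\n';
    }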

Last week Zach gave a slightly more technical introduction to the project at the ADCIRC Users' Group Meeting; you can watch the recording here. I gave a tutorial on LibGeoDecomp at GTC 2015. And here is an introduction to HPX by my colleague Thomas Heller.

But STORM is also exciting for another reason. My usual work is very abstract; its contribution is far removed from daily life. ADCIRC, in contrast, is an application used throughout the world to simulate phenomena ranging from tidal movements to coastal inundation. Its impact is tangible. This becomes apparent when viewing the CERA website: during hurricane season it tracks the paths of current storms, but its database also contains many storms of the recent past, including Katrina.

Some have asked how these computer simulations affect our lives. I'm not involved in emergency management, so please correct me if I'm mistaken in the following paragraph. As far as I know the procedure is roughly as follows: if a hurricane threatens to make landfall, computer simulations are run to estimate the expected inundation in certain areas. Right now the accuracy of these simulations is limited by

  1. the available/usable compute power,
  2. the accuracy of the underlying model and maps (grids), and
  3. last, but not least, time.

There is only a very short window of time in which a forecast has to be made. A forecast is not a computer simulation, but an interpretation by a human expert, based on simulations. The less time the simulations require, the more time there is to react. The mandatory lead time for evacuations is about 24 hours. The kicker: simulations become more accurate the closer the storm is to the coast, but then the lead time shrinks. Evacuation is estimated to cost about 1M USD per square mile, and a coastal region easily spans hundreds of square miles. Play it safe and potentially waste hundreds of millions of dollars? Decide too late and people will be caught by the storm on the road, exposed? I have the deepest respect for the people making these decisions. They deserve all the tools we can deliver to lessen their burden.

Use Cases

So, weather aside, what are these supercomputers being used for? Over the course of history, military uses have always been a driver of this technology. ENIAC, one of the first computers, was used to calculate artillery tables and parameters for the hydrogen bomb. Today nuclear weapons stewardship (purpose: what happens to the decaying nukes sitting on the shelves?) remains an important task. Astrophysics is another domain (question: hey, what happens if we drop that star into a black hole?), as is the simulation of engineering designs (think: crash tests of new cars/planes/ships before building actual prototypes).
Among the most exciting examples is computational drug design. Let me tell you a quick story. David E. Shaw is a computer scientist who was among the first to work on massively parallel computers. He was also among the first to take this kind of expertise to Wall Street. He made billions. Later he decided to pursue even higher goals by funding DESRES, a research lab that develops hardware and software tools to further research on computational drug design. Well-known results of this lab are the software Desmond and the line of Anton supercomputers. I briefly met David during a job interview. He is an extraordinarily charismatic and likable man and I have the utmost admiration for him. I ended up not getting an offer, but I still feel like a winner for all the interesting conversations I had with members of the lab during the interview process.

Quantum Computers

I received a couple of inquiries about how I imagine quantum computers will affect supercomputing. This field is still very immature, so I don't expect much of a stir anytime soon. The hope for quantum computers is that we might be able to use them like non-deterministic Turing machines, which would essentially allow us to solve all problems in NP in polynomial time. That doesn't make any sense to you? No worries. P and NP are terms from theoretical computer science. Some theoretical computer scientists (or mathematicians, the lines are blurred) strive to classify which computational problems can be solved in how much time -- and with how much storage. For problems which are, colloquially put, representative of the class NP, we currently only know algorithms which require exponential time.
[Image: Travelling Salesman movie poster]

This means they're currently intractable, as the required compute time typically equates to billions and billions of years. One famous example would be breaking cryptographic methods -- at least as long as no government agency has weakened them beforehand. Speaking of which: there is a highly acclaimed movie on this very subject that received way too little attention: Travelling Salesman.
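To make "billions and billions of years" a bit more tangible, here is a back-of-the-envelope sketch with toy numbers of my own choosing: exhaustively checking 2^100 candidate solutions on a hypothetical machine that tests 10^18 candidates per second.

    #include <cmath>
    #include <cstdio>

    int main()
    {
        const double candidates       = std::pow(2.0, 100);  // ~1.3e30 solutions to try
        const double checks_per_sec   = 1e18;                 // a very generous machine
        const double seconds_per_year = 3600.0 * 24 * 365;

        std::printf("years needed: %.2e\n",
                    candidates / checks_per_sec / seconds_per_year);
        // Prints roughly 4e+04 years -- and every additional bit doubles it,
        // so realistic problem and key sizes quickly land in the billions of years.
    }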

But back to supercomputers vs. quantum computers. Computer simulations are very different from breaking crypto keys. Essentially, a quantum computer would be very good at guessing solutions to a system of equations. Many simulation codes spend considerable amounts of time solving systems of equations, but the tricky part is that these systems are both huge and numerous. Huge means that a quantum computer would need petabytes of memory -- current experimental machines operate on a couple of bits. Numerous means that even a quantum computer would need a lot of repetitions, and repetitions equate to time. In summary, even with quantum computers on the horizon, I expect supercomputing to remain an expensive domain -- in terms of both money and time.

(Programming) Supercomputers

[Image: Emacs screenshot, editing the HPX backend of LibGeoDecomp]

So, how does programming these machines work? Well, in its simplest form, just like any other programming: open your favorite text editor/IDE and type away. But there is a catch.

Typical programming languages are C, C++ and Fortran. ADCIRC is written chiefly in Fortran. Fortran has a reputation for being highly efficient since the language standard enables the compiler to perform certain optimizations (wink, aliasing, wink).

Not all of these optimizations are possible out of the box in other languages, but for most of them there are pragmas you can insert into your code to tell the compiler that a certain optimization is safe (e.g. pragma noalias, pragma vector always). It's a mess.

C++ has/had a reputation for being slow. I'm not a big fan of programming-language turf wars; to me languages are merely tools. Yes, one can write slow code in C++. But C++ has the advantage of being able to encapsulate complexity (objects...) while still lending itself to writing efficient code. The preprocessor and, more importantly, templates are powerful tools for compile-time code generation. My experience (and research result) is that this is an essential building block of high performance code. Fortran has little support for compile-time code generation, so much more magic is required on the part of the compiler. Sometimes this works out, sometimes not. Few projects use Java. Python has gained some traction, but mostly at the workflow level, not inside computational kernels.
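To make the aliasing remark above more concrete, here is a minimal sketch (not code from ADCIRC, LibGeoDecomp, or HPX). Without further information a C/C++ compiler must assume that a and b might overlap in memory, which inhibits vectorization of the loop, while a Fortran compiler may assume that array arguments don't alias. The __restrict__ keyword (GCC/Clang spelling) and the OpenMP simd pragma are two ways to hand the compiler a similar guarantee/hint; pragma noalias and pragma vector always are the Intel-compiler flavor of the same idea.

    // saxpy-like kernel: a[i] += factor * b[i]
    // __restrict__ promises the compiler that a and b do not overlap,
    // which is the guarantee Fortran gets "for free" from its standard.
    void scale_add(double* __restrict__ a,
                   const double* __restrict__ b,
                   double factor,
                   long n)
    {
        // tells the compiler that vectorizing this loop is safe and desired
        #pragma omp simd
        for (long i = 0; i < n; ++i) {
            a[i] += factor * b[i];
        }
    }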

There is no clear winner regarding the IDE of choice. Many in the industry seem to use vim and Emacs, with the latter being my choice. Eclipse and other IDEs have a certain following, but lack a distinctive killer feature in this setting.

Debugging and profiling on these machines is a problem. In theory it is possible to attach gdb to any running process, but in practice you often won't know which one of the tens of thousands of nodes you'd need to connect to. Plus, you're probably not around when the fault occurs, as your job will have been sitting in the queue for a couple of days (resource arbitration on these machines is handled by a batch system), and you can't just block the job for hours, as the cost of a system hour may easily be in the five-digit range. So if your code fails, you'll typically hope that the crash output is descriptive enough to let you fix the bug (my advice: fail early). If it's not, the next best hope is to somehow reproduce the behavior on a development machine (e.g. with a smaller dataset). If that fails as well, then you'd better hope the bug simply doesn't occur again. Many projects are hard-pressed for compute time, and extensive debugging sessions at scale will quickly burn through even the beefiest compute time budget.
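In that spirit, "fail early" for me mostly means: check invariants aggressively and make the crash output self-describing, because attaching a debugger after the fact is rarely an option. A hypothetical, minimal sketch (the macro and function below are made up for illustration):

    #include <cstdio>
    #include <cstdlib>

    // Abort immediately with file/line and a hint -- cheap insurance compared
    // to rerunning a multi-day batch job just to reproduce a crash.
    #define CHECK(cond, msg)                                         \
        do {                                                         \
            if (!(cond)) {                                           \
                std::fprintf(stderr, "FATAL %s:%d: %s\n",            \
                             __FILE__, __LINE__, msg);               \
                std::abort();                                        \
            }                                                        \
        } while (0)

    double load_water_level(long cell, long num_cells)
    {
        CHECK(cell >= 0 && cell < num_cells,
              "cell index out of range -- corrupt mesh partition?");
        // ... actual lookup would happen here ...
        return 0.0;
    }

    int main()
    {
        load_water_level(42, 100);   // fine
        load_water_level(-1, 100);   // aborts with a descriptive message
        return 0;
    }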