2015-04-06 | Andreas Schäfer
supercomputing, storm
Last week I received an e-mail titled You've won The
Listserve. In case you don't know: The Listserve doesn't refer to
just any mailing list, but to an experimental lottery: each day one
subscriber is given the chance to broadcast his ideas to all
recipients. Now it was my turn.
Initially The Listserve
had set out to reach millions. The project started when 20k people had
registered, but growth came to a grinding halt soon after. My suspicion
is that the initial swath of life-advice mails is to blame.
Anyway. Of course I was excited. I had been thinking for a long time about
what I would write. Initially I had planned to write about how
programming and algorithm design had influenced my thinking, but after
all this life advice I quickly dropped that idea. Instead I wanted
to share my excitement about working with supercomputers and the
impact they have on our lives. For the latter I picked the STORM project, but more
on that later. The full text of my message is here.
With this post I'd like to expand upon my original subject, based on
the feedback I've received. So far I've received about two dozen
mails -- roughly 0.1% of all subscribers, and about what I had
expected. The Listserve's guidelines limit every entry to 600 words
(no images, no links), so I had to be terse, despite my subject
being of a highly technical nature. Also, I'm not a native speaker
and, as they say, English is the easiest language to speak
badly, so I sometimes lack a feeling for whether I'm expressing myself
idiomatically. It's like threading a needle while wearing boxing
gloves. Most feedback has been praise and none was negative, so I
assume the writing was kind of OK, although some of the praise was
apparently just because I didn't give any life advice. He he.
The STORM Project
The STORM project is an NSF-funded research initiative which will
develop a next-generation storm surge forecast model. I am an
associated investigator on this project, a role for which I have to
thank
Hartmut Kaiser,
who in turn is a principal investigator on STORM. Stakeholders of the
project come from different areas. Zach Byerly has nicely summarized
these as the computer science community (where I come from), the
coastal community, the modelling community, and the emergency
management community.
The project's centerpiece is ADCIRC, which is to me both
awe-inspiring and frightening. Awe, because it contains about a
quarter of a century of distilled domain knowledge on coastal
modelling. Frightening, because this is the first time someone will do
serious performance engineering on this code. And I'm part of this
effort. In a nutshell, our plan is to interface ADCIRC with LibGeoDecomp, a computer
simulation library (I'm the project lead), and HPX, a
parallel/scalable/adaptable runtime. LibGeoDecomp's purpose here is to
help with optimizing data structures and interfaces. HPX allows us to
program at a higher level of abstraction: it lets us think in terms of
tasks and dependencies. This makes it easier to balance out the
computational load, which moves along with the water inside
an ADCIRC simulation.
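
To give a feel for what "thinking in tasks and dependencies" means, here is a minimal sketch using plain C++ futures. HPX offers the same idiom (hpx::async, hpx::future) and extends it so that tasks can be scheduled across all cores and nodes of a machine. The function below is a hypothetical stand-in, not ADCIRC, LibGeoDecomp or HPX code.

    // Each subdomain update becomes a task; the runtime decides where and
    // when it runs. update_subdomain() is a made-up placeholder.
    #include <future>
    #include <iostream>
    #include <vector>

    // Pretend this advances one chunk of the water grid by one time step:
    double update_subdomain(int id, double waterLevel)
    {
        return waterLevel + 0.1 * id;
    }

    int main()
    {
        std::vector<std::future<double>> tasks;

        // Express the work as independent tasks...
        for (int id = 0; id < 4; ++id) {
            tasks.push_back(std::async(std::launch::async, update_subdomain, id, 1.0));
        }

        // ...and express dependencies by waiting on their results.
        for (auto& task : tasks) {
            std::cout << task.get() << "\n";
        }
    }

The appeal is the mental model: you describe the work and its dependencies, and the runtime decides where each task runs -- handy when the computational load keeps shifting with the water.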
Last week Zach gave a slightly more technical introduction to the
project at the ADCIRC Users' Group Meeting; you can watch the
recording here.
I gave a tutorial
on LibGeoDecomp at GTC 2015. And here is an
introduction to HPX by my colleague Thomas Heller.
But STORM is also exciting for another reason. My usual work is very
abstract; its contribution is far removed from daily life. But ADCIRC is an
application which is used throughout the world to simulate phenomena
ranging from tidal movements to coastal inundation. Its impact is
tangible. This becomes apparent when viewing the CERA website. During
hurricane season it tracks the paths of current storms, and its
database also contains many storms of the recent past, including Katrina.
Some have asked how these computer simulations affect our lives. I'm
not involved in emergency management, so please correct me if I'm
mistaken in the following paragraph. As far as I know, the procedure is
roughly as follows: if a hurricane threatens to make landfall, computer
simulations are run to estimate the expected inundation in certain
areas. Right now the accuracy of these simulations is limited by
- the available/usable compute power,
- the accuracy of the underlying model and maps (grids), and
- last, but not least, time.
There is only a very short window of time in which a forecast has
to be made. A forecast is not a computer simulation, but an
interpretation by an (expert) human, based on simulations. The shorter
the required time for these simulations, the more time there is to
react. Mandatory lead time for evacuations is about 24 hours. The
kicker: simulations will be more exact the closer the storm is to the
coast, but by then the lead time will be greatly reduced. Evacuation is
estimated to cost about 1M USD per square mile. Play it safe and
potentially waste hundreds of millions of dollars? Decide too late and
people will be caught by the storm on the road, exposed? I have the
deepest respect for the people making these decisions. They deserve
all the tools we can deliver to lessen their burden.
Use Cases
So, weather aside, what are these supercomputers being used for? Over
the course of history, military uses have always been a driver of this
technology. ENIAC, one of the first computers, was used to calculate
artillery tables and parameters for the hydrogen bomb. Today nuclear
weapons stewardship (purpose: what happens to the decaying nukes
sitting on the shelves?) remains an important task. Astrophysics is
another domain (question: hey, what happens if we drop that star into
a black hole?), as is the simulation of engineering designs (think:
crash tests of new cars/planes/ships before building actual prototypes).
Among the most exciting examples is computational drug design. Let me
tell you a quick story.
David E. Shaw
is a computer scientist who was among the first to work on massively
parallel computers. He was also among the first to take this kind of
expertise to Wall Street. He made billions. Later he decided to
pursue even higher goals by funding
DESRES, a research lab
that seeks to develop hardware and software tools to further research
on computational drug development. Well-known results of this lab are
the software
Desmond
and the line of
Anton
supercomputers. I briefly met David during a job interview. He is an
extraordinarily charismatic and likable man and I have the utmost
admiration for him. I ended up not getting an offer, but still feel
like a winner for all the interesting conversations I had with members
of the lab during the interview process.
Quantum Computers
I received a couple of inquiries about how I imagine quantum
computers will affect supercomputers. This field is still very
immature, so I don't expect much of a stir anytime soon. The potential
of quantum computers is that we might be able to use them as
non-deterministic Turing machines, which would essentially allow us
to solve all problems in NP in polynomial time. That doesn't make any sense?
No worries. P and NP are terms from theoretical computer science. Some
theoretical computer scientists (or mathematicians, the lines are
blurred) strive to classify which computational problems can be solved
in how much time -- and with how much storage. For problems which are,
colloquially put, representative of the class NP (the NP-complete
problems), we currently only know algorithms which require exponential time.
This means they're
currently intractable, as the required compute time typically equates
to billions and billions of years. One famous example would be
breaking cryptographic methods -- at least as long as no government
agency has weakened them beforehand. Speaking of which: there is a highly
acclaimed movie on this very subject which received way too little
attention: Travelling
Salesman.
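
To make the asymmetry behind P and NP a bit more concrete, here is a toy sketch using subset sum, a classic NP-complete problem: verifying a proposed solution is cheap, but the only general approach we know for finding one is to try an exponential number of candidates. (This is purely illustrative; it is not how real cryptanalysis works.)

    // Subset sum: is there a subset of nums that adds up to target?
    // Verifying a given subset is cheap; the brute-force search below
    // has to look at up to 2^n subsets -- exponential time.
    #include <iostream>
    #include <vector>

    bool verify(const std::vector<int>& nums, const std::vector<int>& subset, int target)
    {
        int sum = 0;
        for (int index : subset) {
            sum += nums[index];
        }
        return sum == target;    // linear in the size of the "certificate"
    }

    bool bruteForceSearch(const std::vector<int>& nums, int target)
    {
        const int n = static_cast<int>(nums.size());
        for (long mask = 0; mask < (1L << n); ++mask) {    // 2^n candidates
            int sum = 0;
            for (int i = 0; i < n; ++i) {
                if (mask & (1L << i)) {
                    sum += nums[i];
                }
            }
            if (sum == target) {
                return true;
            }
        }
        return false;
    }

    int main()
    {
        std::vector<int> nums = {3, 34, 4, 12, 5, 2};
        std::cout << bruteForceSearch(nums, 9) << "\n";    // 1: e.g. {4, 5}
        std::cout << verify(nums, {2, 4}, 9) << "\n";      // 1: indices of 4 and 5
    }

Checking a guess is fast; coming up with the right guess is what costs exponential time. A non-deterministic machine would, loosely speaking, skip straight to the right guess.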
But back to supercomputers vs. quantum computers. Computer simulations
are very different from breaking crypto keys. Essentially, a quantum
computer would be very good at guessing solutions to a
system of equations. Many simulation codes spend considerable amounts
of time on solving equations, but the tricky part here is that these
are huge and many. Huge means that a quantum computer
would need petabytes of memory -- current experimental machines
operate on a couple of bits. Many means that even a
quantum computer would need a lot of repetitions, with repetitions
equating to time. In summary, even with quantum computers on the
horizon, I expect supercomputing to remain an expensive domain --
in terms of both money and time.
(Programming) Supercomputers
So, how does programming of these machines work? Well, in its simplest
form just like any other programming: open your favorite text
editor/IDE and type away. But there is a catch.
Typical programming languages are C, C++ and Fortran. ADCIRC is written chiefly in Fortran. Fortran has a
reputation for being highly efficient since the language standard
enables the compiler to perform certain optimizations (wink, aliasing,
wink).
Not all of these are possible out of the box with other
languages, but for most there are pragmas that you can insert into your
code to let the compiler know that certain optimizations are safe
(e.g. pragma noalias, pragma vector always). It's a mess.
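
To illustrate what the aliasing fuss is about, here is a small sketch. The spellings are compiler-specific: __restrict__ is a widespread but non-standard C/C++ extension, and pragmas like the ones just mentioned are only understood by particular compilers, so treat this as an example rather than a recipe.

    // In Fortran the compiler may assume that array arguments don't overlap.
    // In C/C++ it has to assume they might, which can prevent vectorization.
    // The (non-standard) __restrict__ qualifier promises that x and y don't
    // alias, so the compiler is free to vectorize the loop.
    void axpy(int n, double alpha,
              const double* __restrict__ x,
              double*       __restrict__ y)
    {
        for (int i = 0; i < n; ++i) {
            y[i] += alpha * x[i];
        }
    }

Fortran gets this guarantee for free from the language standard, which is a big part of its reputation for speed.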
C++ has (or had) a reputation for being slow. I'm not a big fan of programming
language turf wars; to me, languages are merely tools. Yes, one can write slow code in C++. But C++ has the
advantage of being able to encapsulate complexity (objects...) while
still lending itself to writing efficient code. The preprocessor and,
more importantly, templates are powerful tools for compile-time code
generation. My experience (and
research
result) is that this is an essential building block of high
performance code. Fortran has little support for compile-time code
generation, thus much more magic is required on behalf of the
compiler. Sometimes this works out, sometimes not.
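
As a small taste of compile-time code generation, here is a toy template sketch (not LibGeoDecomp code): the stencil radius is a template parameter, so the compiler knows the inner loop's trip count at compile time and can unroll and specialize the code for each radius.

    // Toy example: the radius is a compile-time constant, so the compiler
    // can fully unroll the inner loop and emit one specialized version of
    // smooth() per radius.
    #include <vector>

    template<int RADIUS>
    void smooth(const std::vector<double>& in, std::vector<double>& out)
    {
        const int size = static_cast<int>(in.size());
        for (int i = RADIUS; i < size - RADIUS; ++i) {
            double sum = 0.0;
            // Trip count (2 * RADIUS + 1) is known at compile time:
            for (int offset = -RADIUS; offset <= RADIUS; ++offset) {
                sum += in[i + offset];
            }
            out[i] = sum / (2 * RADIUS + 1);
        }
    }

    // The compiler generates specialized, fully optimized code for each of these:
    template void smooth<1>(const std::vector<double>&, std::vector<double>&);
    template void smooth<2>(const std::vector<double>&, std::vector<double>&);

A Fortran compiler has to divine the same information from a runtime argument, which is the kind of magic I mentioned above.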
Few projects are using Java. Python has gained some traction, but mostly on the
workflow level, not inside of computational kernels.
There is no clear winner regarding the IDE of choice. Many in the
industry seem to use vim and Emacs, with the latter being my choice.
Eclipse and other IDEs have a certain following, but lack a
distinctive killer feature in this setting.
Debugging and profiling on these machines is a problem. In theory it
is possible to attach gdb to any running process, but in practice you
often won't know which one of the tens of thousands of nodes you'd
need to connect to. Plus, you're probably not around when the fault
occurs, as your job will be sitting in the queue for a couple of days
(resource arbitration on these machines is handled by a batch system),
and you can't just block the job for hours, as the cost of a system hour may
easily be in the five-digit range. So if your code fails, you'll
typically hope that the crash output will be descriptive enough to
allow you to fix the bug (my advice: fail early). If it's not, then
the next best hope is to somehow reproduce the behavior on a
development machine (e.g. with a smaller dataset). If this fails as
well, then you should hope that the bug simply doesn't occur again.
Many projects are hard-pressed for compute time, and extensive
debugging sessions at scale will quickly burn through even the beefiest
compute time budget.
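
"Fail early" in practice can be as simple as checking your invariants aggressively and aborting with a message that names the rank and the reason, so that the crash output in the job log is enough to reconstruct what happened. A minimal sketch using plain MPI (not ADCIRC or LibGeoDecomp code; the check itself is a made-up example):

    // Check invariants aggressively and abort with a descriptive message,
    // so the job's crash output is useful without attaching a debugger.
    #include <cmath>
    #include <iostream>
    #include <mpi.h>

    void checkFinite(double value, const char* what)
    {
        if (!std::isfinite(value)) {
            int rank = -1;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            std::cerr << "rank " << rank << ": " << what
                      << " is not finite, aborting\n";
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
    }

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        double waterLevel = 1.0;
        // ... time stepping would happen here ...
        checkFinite(waterLevel, "water level");

        MPI_Finalize();
    }

A few lines like these cost next to nothing at runtime and can save you a second trip through the batch queue.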