Targeting GPUs with OpenMP Directives on Summit: A Simple and Effective Fortran Experience. (arXiv:1812.07977v1 [physics.comp-ph])
<a href="http://arxiv.org/find/physics/1/au:+Budiardja_R/0/1/0/all/0/1">Reuben D. Budiardja</a>, <a href="http://arxiv.org/find/physics/1/au:+Cardall_C/0/1/0/all/0/1">Christian Y. Cardall</a>

We use OpenMP directives to target hardware accelerators (GPUs) on Summit, a
newly deployed supercomputer at the Oak Ridge Leadership Computing Facility
(OLCF), demonstrating simplified access to GPU devices for users of our
astrophysics code GenASiS and useful speedup on a sample fluid dynamics
problem. At a lower level, we use the capabilities of Fortran 2003 for C
interoperability to provide wrappers to the OpenMP device memory runtime
library routines (currently available only in C). At a higher level, we use C
interoperability and Fortran 2003 type-bound procedures to modify our workhorse
class for data storage to include members and methods that significantly
streamline the persistent allocation of and on-demand association to GPU
memory. Where the rubber meets the road, users offload computational kernels
with OpenMP target directives that are rather similar to constructs already
familiar from multi-core parallelization. In this initial example we
demonstrate total wall time speedups of ~4X in ‘proportional resource tests’
that compare runs with a given percentage of nodes’ GPUs with runs utilizing
instead the same percentage of nodes’ CPU cores, and reasonable weak scaling up
to 8000 GPUs vs. 56,000 CPU cores (1333 1/3 Summit nodes). These speedups
increase to over 12X when pinned memory is used strategically. We make
available the source code from this work at
https://github.com/GenASiS/GenASiS_Basics.
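The two layers described above can be sketched as follows. This is a minimal illustrative example, not the actual GenASiS source: the module and kernel names are hypothetical, though `omp_target_alloc` and `omp_target_free` are real OpenMP device memory runtime routines (specified with C bindings only), accessed here via Fortran 2003 C interoperability, and the `target teams distribute parallel do` construct is the standard OpenMP offload directive the abstract refers to.

```fortran
module DeviceMemory_Sketch
  ! Hedged sketch: wrappers to the OpenMP device memory runtime library
  ! routines (C-only API) using Fortran 2003 ISO_C_BINDING.
  use :: iso_c_binding
  implicit none

  interface
    ! C prototype: void *omp_target_alloc(size_t size, int device_num);
    type(c_ptr) function omp_target_alloc(nBytes, DeviceNum) bind(c)
      import :: c_ptr, c_size_t, c_int
      integer(c_size_t), value :: nBytes
      integer(c_int), value :: DeviceNum
    end function omp_target_alloc

    ! C prototype: void omp_target_free(void *device_ptr, int device_num);
    subroutine omp_target_free(DevicePtr, DeviceNum) bind(c)
      import :: c_ptr, c_int
      type(c_ptr), value :: DevicePtr
      integer(c_int), value :: DeviceNum
    end subroutine omp_target_free
  end interface

contains

  ! A kernel offloaded with an OpenMP target directive, similar in form
  ! to familiar multi-core worksharing constructs.
  subroutine AddKernel(A, B, C)
    real(c_double), dimension(:), intent(in)  :: A, B
    real(c_double), dimension(:), intent(out) :: C
    integer :: i

    !$OMP target teams distribute parallel do map(to: A, B) map(from: C)
    do i = 1, size(A)
      C(i) = A(i) + B(i)
    end do
    !$OMP end target teams distribute parallel do

  end subroutine AddKernel

end module DeviceMemory_Sketch
```

In the approach the abstract describes, wrappers like these would be hidden behind type-bound procedures of the data-storage class, so that device allocation persists across time steps and users only attach (associate) host arrays to already-allocated device memory on demand, rather than mapping data at every kernel launch.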
