2DECOMP&FFT - Non_blocking Communication API

Non-blocking API for Overlap of Communication and Computation

Transpose-based parallelisation is inherently communication intensive. For large-scale applications, it is not unusual that communication accounts for more than half of the total cost. Application performance may be significantly improved if algorithms can be redesigned to allow overlap of communication and computation. From version 1.4.x, 2DECOMP&FFT provides a low-level communication API to facilitate such effort.

The API is based on ideas of non-blocking MPI collectives (such as MPI_IALLTOALL and MPI_IALLTOALLV), as reported by Kandalla et al. in a recent study^[1]. These are not part of the present MPI specification, but are widely expected to be introduced in the MPI 3 standard. Currently, one can rely on libNBC^[2], a library that implements the non-blocking MPI collectives using existing MPI 1 functions.

The API

Each of the four transposition routines in the base decomposition library contains three key elements: algorithm to pack the MPI send buffers, MPI_ALLTOALL(V) communication, and algorithms to unpack the MPI receive buffers. When the non-blocking version of the MPI_ALLTOALL(V) is used, these routines are broken into smaller routines. For example, when transposing from X pencils to Y pencils, the blocking version of the communication routine is:

        call transpose_x_to_y(in, out, decomp)

The corresponding non-blocking routines are:

        call transpose_x_to_y_start(handle, in, out, sbuf, rbuf, decomp)
        call transpose_x_to_y_wait(handle, in, out, sbuf, rbuf, decomp)

The start routine packs the MPI send buffer, starts the MPI_ALLTOALL(V) communication, and returns immediately. Later, a call to the corresponding wait routine ensures the communication is completed and then unpacks the MPI receive buffer. The first parameter handle is used to uniquely identify each communication. Because several non-blocking communications may be ongoing at the same time, each has to define its own send buffer sbuf and receive buffer rbuf ¹. It is up to the applications to supply (and if possible, reuse) these buffers, the size and shape of which should match the corresponding input array in and output array out. Between the start call and the wait call, the content of sbuf should not be modified and the content of out should not be referenced, to avoid unpredictable results. Other unrelated computations may be carried out while the communication is ongoing.

There are similar start/wait routines defined to all other transpositions.

Finally, on systems without separate software threads or dedicated hardware to process the communication stack, one has to call MPI_TEST explicitly to progress the non-blocking communication. A utility subroutine is provided for this purpose:

        call transpose_test(handle)

This needs to be called from time to time from the computational part of application, in order to progress the communication identified by handle. Of course, the practical difficulty is whence and how frequently this should be called, a matter that is entirely application dependent. Hopefully, once the MPI-3 standard is finalised and high-quality implementations become available, it would be possible to progress communication asynchronously (possibly via separate software threads spawned by the MPI library).

To Use the API

For advanced users who wish to use this API, first of all libNBC has to be built. Download this library from here. Once downloaded and unpacked on your system, the library can be built using the standard ./configure; make; make install procedure. The following command line is for a Cray XE system:

      ./configure --prefix=/path/to/libNBC MPICXX=CC
      make
      make install

Compile 2DECOMP&FFT as normal, but with the -DOCC flag at compile time to switch on the API. Additional flags at link stage are also necessary to build applications. For example on Cray XE:

      /path/to/libNBC/lib/libnbc.a -lstdc++ -lmpi_cxx

Sample Application

To demonstrate the use of this API, sample applications are provided to compute multiple independent FFTs, using both the blocking and non-blocking versions of the communication library. The idea of overlapping the communication of one 3D FFT and the computation of another, as described by Kandalla et al.,^[1] is implemented in the sample application. The algorithm's pseudo-code looks like:

      1D FFT in X for V_1
      call transpose_x_to_y for V_1 (blocking)
      1D FFT in Y for V_1
      call transpose_y_z_start for V_1
      do k=2,N
        1D FFT in X for V_k
        call transpose_x_to_y for V_k (blocking)
        1D FFT in Y for V_k
        call transpose_y_to_z_start for V_k
        call transpose_y_to_z_wait for V_(k-1)
        1D FFT in Z for V_(k-1)
      end do
      call transpose_y_to_z_wait for V_N to complete
      1D FFT in Z for V_N

As can be seen, the Y=>Z transpose for variable k and the computation of 1D FFT in Z for variable k-1 are overlapped. It is up to the application users to identify opportunities in their algorithms that may benefit from this API.

One may notice that in the sample application the computations are done using loops of 1D FFTs, rather than with FFTW's advanced interface that allows multiple 1D FFTs to be done in one go. This design is to allow MPI_TEST calls to be inserted to progress the communication. Once the MPI 3 standard is finalised and proper library implementations are available, this practice is no longer necessary.

References

K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur and D.K. Panda, "High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT", Computer Science - Research and Development, vol. 26(3-4):237-246, 2011.
T. Hoefler, A. Lumsdaine and W. Rehm, "Implementation and Performance Analysis of Non-Blocking Collective Operations for MPI", International Conference for High Performance Computing, Networking, Storage and Analysis (SC07), Reno, USA, 2007.

Footnotes

1. The blocking version also needs to define send/recv buffers. But because there is only one communication at any time, the buffers can be temporarily allocated as required, or for performance reason defined globally and shared by multiple communication calls.