2DECOMP Logo Library for 2D pencil decomposition and distributed Fast Fourier Transform

Case Study -- Vortex generation using FFT in a CFD code

As already covered in other parts of the documentation, one major strength of the 2DECOMP&FFT library is its user-friendliness. This case study demonstrates the parallelisation practice of a real-world CFD application.

Background information

I work for the CSE team supporting HECToR, the UK's national supercomputing facility. I was one day contacted by Dr. Shankar Balakrishnan from Southampton University regarding the use of FFT in his CFD code. He later provided the following summary:

"For direct numerical simulations of fluid flow phenomenon involving vortex rings, the distribution of vorticity within the core of the vortex ring is initialised by an analytical expression usually a Gaussian distribution. From this known vorticity field the corresponding velocity field needs to be obtained to initialise the flow. A three-dimensional Fourier-transform is performed on the vorticity field and the Fourier coefficients are obtained in 'wave-space'. The Fourier components of the velocity field are then obtained and an inverse Fourier transform is performed to obtain the velocity field in real-space.

The computational time for this initialisation process is typically small compared to the overall runtime for the simulation of the evolution of the fluid flow with time. Hence the gain in computational time does not justify the parallelisation of the code used for the initialisation process. However, the memory required for the process becomes too large to be performed on a single node as the domain size increases. For this reason the initialisation process needs to be parallelised to distribute the work over multiple nodes. Using a code for performing fast-Fourier transforms in parallel, Vorticity field-Velocity field transformations can be performed for computational domains of indefinitely large sizes."

Without knowing much scientific details of his work, it was possible to parallelise his code using the 2DECOMP&FFT library in a matter of days. Some essential parts of the serial and the the parallel code are compared side-by-side below to show how easy it is to use the library.

Serial Code Parallel Code
program	vort

implicit none

! use FFTW for the Fourier transforms
include "fftw3.f"


integer, parameter :: mytype = 8

! global size
integer, parameter :: n1=8, n2=8, n3=8






real(mytype), allocatable, dimension(:,:,:) :: &
ox,oy,oz, uu,vv,ww
complex(mytype), allocatable, dimension(:,:,:) :: &
Ax,Ay,Az,U,V,W

integer*8 :: plan



! allocate space for real-space vorticity field
allocate(ox(0:n1-1,0:n2-1,0:n3-1))
allocate(oy(0:n1-1,0:n2-1,0:n3-1))
allocate(oz(0:n1-1,0:n2-1,0:n3-1))







! initialise vorticity in real-space
do k=0,n3-1
do j=0,n2-1
do i=0,n1-1

! init ox, oy, oz
call vortexOmega( &
ox(i,j,k),oy(i,j,k),oz(i,j,k), &
xgs+dx*(i+0.5-1), ygs+dy*(j+0.5-1), &
zgs+dz*(k+0.5-1))
end do
end do
end do






! allocate space for wave-space variables
allocate(Ax(0:n1/2,0:n2-1,0:n3-1))
allocate(Ay(0:n1/2,0:n2-1,0:n3-1))
allocate(Az(0:n1/2,0:n2-1,0:n3-1))
allocate(U(0:n1/2,0:n2-1,0:n3-1))
allocate(V(0:n1/2,0:n2-1,0:n3-1))
allocate(W(0:n1/2,0:n2-1,0:n3-1))













! Using FFTW,
! forward FFT - ox,oy,oz -> Ax,Ay,Az
call dfftw_plan_dft_r2c_3d( &
plan,n1,n2,n3,ox,Ax,FFTW_ESTIMATE)
call dfftw_execute_dft_r2c(plan,ox,Ax)
call dfftw_plan_dft_r2c_3d( &
plan,n1,n2,n3,oy,Ay,FFTW_ESTIMATE)
call dfftw_execute_dft_r2c(plan,oy,Ay)
call dfftw_plan_dft_r2c_3d( &
plan,n1,n2,n3,oz,Az,FFTW_ESTIMATE)
call dfftw_execute_dft_r2c(plan,oz,Az)

deallocate(ox,oy,oz)

! Fourier domain stuff
do k=0,n3-1
do j=0,n2-1
do i=0,n1/2

! wave-space calculation omitted
U(i,j,k) = ......
V(i,j,k) = ......
W(i,j,k) = ......

end do
end do
end do

deallocate(Ax,Ay,Az)

! allocate real-space velocity variables
allocate(uu(0:n1-1,0:n2-1,0:n3-1))
allocate(vv(0:n1-1,0:n2-1,0:n3-1))
allocate(ww(0:n1-1,0:n2-1,0:n3-1))







! Using FFTW,
! inverse FFT U,V,W -> uu,vv,ww
call dfftw_plan_dft_c2r_3d( &
plan,n1,n2,n3,U,uu,FFTW_ESTIMATE)
call dfftw_execute_dft_c2r(plan,U,uu)
call dfftw_plan_dft_c2r_3d( &
plan,n1,n2,n3,V,vv,FFTW_ESTIMATE)
call dfftw_execute_dft_c2r(plan,V,vv)
call dfftw_plan_dft_c2r_3d( &
plan,n1,n2,n3,W,ww,FFTW_ESTIMATE)
call dfftw_execute_dft_c2r(plan,W,WW)

! rescale
uu = uu*scale
vv = vv*scale
ww = ww*scale

deallocate(U,V,W)

! write out using Fortran direct access
inquire(iolength=iol) uu(1,1,1)
open(10, FILE='data_serial', &
FORM='unformatted', &
ACCESS='DIRECT', RECL=iol)
n=1
do k=0,n3-1
do j=0,n2-1
do i=0,n1-1
write(10,rec=n) uu(i,j,k)
n=n+1
end do
end do
end do
do k=0,n3-1
do j=0,n2-1
do i=0,n1-1
write(10,rec=n) vv(i,j,k)
n=n+1
end do
end do
end do
do k=0,n3-1
do j=0,n2-1
do i=0,n1-1
write(10,rec=n) ww(i,j,k)
n=n+1
end do
end do
end do
close(10)

deallocate(uu,vv,ww)






end program vort
program	vort

! use the 2DECOMP&FFT instead of FFTW
use decomp_2d
use decomp_2d_fft
use decomp_2d_io
use MPI

implicit none

! global size
 integer, parameter :: n1=8, n2=8, n3=8
! 2D processor grid
 integer, parameter :: p_row=2, p_col=2

integer (kind=MPI_OFFSET_KIND) :: filesize, disp
integer, dimension(3) :: fft_start, fft_end, fft_size

real(mytype), allocatable, dimension(:,:,:) :: &
ox,oy,oz, uu,vv,ww
complex(mytype), allocatable, dimension(:,:,:) :: &
Ax,Ay,Az,U,V,W

! initialisation of 2DECOMP&FFT
call MPI_INIT(ierror)
call decomp_2d_init(n1,n2,n3,p_row,p_col)

! allocate space for real-space vorticity field
 allocate(ox(xstart(1)-1:xend(1)-1, &
xstart(2)-1:xend(2)-1, &
xstart(3)-1:xend(3)-1))
allocate(oy(xstart(1)-1:xend(1)-1, &
xstart(2)-1:xend(2)-1, &
xstart(3)-1:xend(3)-1))
allocate(oz(xstart(1)-1:xend(1)-1, &
xstart(2)-1:xend(2)-1, &
xstart(3)-1:xend(3)-1))

! initialise vorticity in real-space
do k=xstart(3)-1,xend(3)-1
do j=xstart(2)-1,xend(2)-1
do i=xstart(1)-1,xend(1)-1
! subroutine identical in serial
! and parallel version
call vortexOmega( &
ox(i,j,k),oy(i,j,k),oz(i,j,k), &
xgs+dx*(i+0.5-1), ygs+dy*(j+0.5-1), &
zgs+dz*(k+0.5-1))
end do
end do
end do

! initialise FFT library
call decomp_2d_fft_init
call decomp_2d_fft_get_size( &
fft_start,fft_end,fft_size)

! allocate space for wave-space variables
 allocate(Ax(fft_start(1)-1:fft_end(1)-1, &
fft_start(2)-1:fft_end(2)-1, &
fft_start(3)-1:fft_end(3)-1) )
allocate(Ay(fft_start(1)-1:fft_end(1)-1, &
fft_start(2)-1:fft_end(2)-1, &
fft_start(3)-1:fft_end(3)-1) )
allocate(Az(fft_start(1)-1:fft_end(1)-1, &
fft_start(2)-1:fft_end(2)-1, &
fft_start(3)-1:fft_end(3)-1) )
allocate(U(fft_start(1)-1:fft_end(1)-1, &
fft_start(2)-1:fft_end(2)-1, &
fft_start(3)-1:fft_end(3)-1) )
allocate(V(fft_start(1)-1:fft_end(1)-1, &
fft_start(2)-1:fft_end(2)-1, &
fft_start(3)-1:fft_end(3)-1) )
allocate(W(fft_start(1)-1:fft_end(1)-1, &
fft_start(2)-1:fft_end(2)-1, &
fft_start(3)-1:fft_end(3)-1) )

! Using 2DECOMP&FFT,
! forward FFT - ox,oy,oz -> Ax,Ay,Az
call decomp_2d_fft_3d(ox, Ax)


call decomp_2d_fft_3d(oy, Ay)


call decomp_2d_fft_3d(oz, Az)



deallocate(ox,oy,oz)

! Fourier domain stuff
do k=fft_start(3)-1,fft_end(3)-1
do j=fft_start(2)-1,fft_end(2)-1
do i=fft_start(1)-1,fft_end(1)-1

! wave-space calculation omitted
U(i,j,k) = ......
V(i,j,k) = ......
W(i,j,k) = ......

end do
end do
end do

deallocate(Ax,Ay,Az)

! allocate real-space velocity variables
allocate(uu(xstart(1)-1:xend(1)-1, &
xstart(2)-1:xend(2)-1, &
xstart(3)-1:xend(3)-1))
allocate(vv(xstart(1)-1:xend(1)-1, &
xstart(2)-1:xend(2)-1, &
xstart(3)-1:xend(3)-1))
allocate(ww(xstart(1)-1:xend(1)-1, &
xstart(2)-1:xend(2)-1, &
xstart(3)-1:xend(3)-1))

! Using 2DECOMP&FFT,
!inverse FFT U,V,W -> uu,vv,ww
call decomp_2d_fft_3d(U,uu)


call decomp_2d_fft_3d(V,vv)


call decomp_2d_fft_3d(W,ww)



! rescale
uu = uu*scale
vv = vv*scale
ww = ww*scale

deallocate(U,V,W)

! write out using 2DECOMP&FFT MPI-IO routines

call MPI_FILE_OPEN(MPI_COMM_WORLD, &
'data_parallel', &
MPI_MODE_CREATE+MPI_MODE_WRONLY, &
MPI_INFO_NULL, &
fh, ierror)

! guarantee overwriting
filesize = 0_MPI_OFFSET_KIND
call MPI_FILE_SET_SIZE(fh,filesize,ierror)
disp = 0_MPI_OFFSET_KIND
call decomp_2d_write_var(fh,disp,n1,n2,n3,1,uu)
call decomp_2d_write_var(fh,disp,n1,n2,n3,1,vv)
call decomp_2d_write_var(fh,disp,n1,n2,n3,1,ww)















call MPI_FILE_CLOSE(fh,ierror)

deallocate(uu,vv,ww)

! clean up
call decomp_2d_fft_finalize
call decomp_2d_finalize
call MPI_FINALIZE(ierror)

end program vort

As can be seen, the code structure remains unchanged in the parallel version. The main effort in the parallel code is to properly define the portion of data locally resided on each processor (in allocation statements and loop count variables) - all such decomposition information is available through public variables after the initialisation of the library. The code actually looks messier than it should be because it was translated from C with a 0-based index system. The actual FFT computations are much simpler because the use of FFTW (or whatever other 3rd party libraries) is handled internally by 2DECOMP&FFT. It is also noticeable that very few MPI routines are called in the parallel code - the library hides as much communication details as possible.

In the end, Dr Balakrishnan reported back that the data generated and written out using the MPI-IO routines in the parallel code is identical to that written out using Fortran direct-access IO from the serial code. He is now able to initialise much bigger data sets for his direct numerical simulations.