The upcoming release of Star-P features a significant upgrade in
the functionality and performance of Fast Fourier Transform functions
like fft, fft2 and fftn on distributed data. In this article, we
will take a quick look behind the scenes and briefly examine some
of the design and implementation issues in creating high-performance
distributed-memory Fourier transforms – operations that are
200%-300% faster in the new release of Star-P.
By a “high-performing” distributed FFT, we mean an implementation
that performs well on one processor, scales well with increasing
numbers of processors on a variety of interconnects and to the extent
possible, is not sensitive to the problem size and data distribution.
For example, a two-dimensional FFT implementation must perform equally
well regardless of whether the input matrix is distributed by rows
or by columns.
Extending Functionality through Star-P Connect
The revamped FFT functionality is implemented using the Star-P Connect
library API link. Star-P Connect enables anyone – our customers,
partners, and Interactive Supercomputing’s engineering team
– to extend the functionality of the Star-P compute engine
based on your particular application and algorithm requirements.
You can plug in existing serial and parallel libraries, access them
via the desktop tools such as MATLAB® and Python, and execute them
in a task- and data-parallel modes.
Star-P Connect provides optimized, high-level functions for global-array
operations such as transpose and redistribution. The high-level
constructs enabled the entire functionality (one-dimensional FFTs
of arbitrary strides as well as multidimensional FFTs) for all supported
Star-P data types and distributions to be implemented in a few-hundred
lines of C++ code.
The local FFT operations on each process currently use the FFTW
library, but other high-performance FFT libraries such as those
available using the Intel Math Kernel Library or the AMD Core Math
Library can be easily substituted.
Some of the optimizations in the distributed FFT implementation
include special transforms for special sizes (such as powers of
2) and the option of choosing different strategies for performing
multidimensional FFTs. For instance, the two-dimensional FFT of
a distributed MxN matrix can be performed in at least three different
ways: using two stride-1 FFTs of length M and N and two transposes,
using a stride 1 and stride M FFT with two redistributions and using
a stride 1 and stride N FFT with two redistribution steps. The choice
of the particular method to use depends on the problem size and
is chosen at runtime.
Performance
The following figure illustrates some of the performance gains that
have been incorporated in the upcoming release of Star-P for a two-dimensional
FFT of size 8192x8192 on a 32-processor SGI Altix machine with 1.5
GHz processors with 64 GB of total memory. The new implementation
shows a 200%-300% gain.


Performance gains are also achieved for other platforms, problem
sizes and data distributions.
Conclusion
Among commonly used linear algebra operations, FFTs, especially
multidimensional ones, typically have a very low computation to
memory access ratio. Thanks to libraries such as FFTW, it is possible
to achieve close-to-optimal performance on most serial machines.
However, it is evident that the last word on distributed-memory
architectures is yet to be written. At Interactive Supercomputing,
we take high-performance seriously and are constantly trying to
push the envelope on “block-buster” math and signal
processing operations such as FFTs. As better algorithms and implementations
become available, our customers will be pleased to know that these
improvements will automatically find their way into the product.
References
Matteo Frigo and Steven G. Johnson, "The Design and Implementation
of FFTW3," Proceedings of the IEEE 93 (2), 216–231 (2005)
http://www.fftw.org/fftw-paper-ieee.pdf
|