Task Parallel Computing with
     PPEVAL >

Vectorizing Your MATLAB®      Code for Best Performance >

Parallel File I/O >

 

Computational Ecology at University of California at Santa Barbara >

Genomic Analysis at the National Cancer Institute >

Integration with Numerical Algorithms Group >

Fast and Fourier: FFTW Integrated in Star-P >

“Fast and Fourier”: Under
the hood of high-performance FFTs in Star-P

Significant upgrade in the functionality and performance of Fast Fourier Transform functions like fft, fft2 and fftn on distributed data. More >

Python Users: Extend your codes to parallel servers and clusters. More >

Interactive Tour >

Python Users: Join our early adopter program!
Whitepaper Library: Going Parallel: An Implementation Guide.
Productivity Breakthrough for Python, and Others. Learn More.

“Fast and Fourier”: Under the hood of high-performance FFTs in Star-P


The upcoming release of Star-P features a significant upgrade in the functionality and performance of Fast Fourier Transform functions like fft, fft2 and fftn on distributed data. In this article, we will take a quick look behind the scenes and briefly examine some of the design and implementation issues in creating high-performance distributed-memory Fourier transforms – operations that are 200%-300% faster in the new release of Star-P.
By a “high-performing” distributed FFT, we mean an implementation that performs well on one processor, scales well with increasing numbers of processors on a variety of interconnects and to the extent possible, is not sensitive to the problem size and data distribution. For example, a two-dimensional FFT implementation must perform equally well regardless of whether the input matrix is distributed by rows or by columns.

Extending Functionality through Star-P Connect
The revamped FFT functionality is implemented using the Star-P Connect library API link. Star-P Connect enables anyone – our customers, partners, and Interactive Supercomputing’s engineering team – to extend the functionality of the Star-P compute engine based on your particular application and algorithm requirements. You can plug in existing serial and parallel libraries, access them via the desktop tools such as MATLAB® and Python, and execute them in a task- and data-parallel modes.

Star-P Connect provides optimized, high-level functions for global-array operations such as transpose and redistribution. The high-level constructs enabled the entire functionality (one-dimensional FFTs of arbitrary strides as well as multidimensional FFTs) for all supported Star-P data types and distributions to be implemented in a few-hundred lines of C++ code.

The local FFT operations on each process currently use the FFTW library, but other high-performance FFT libraries such as those available using the Intel Math Kernel Library or the AMD Core Math Library can be easily substituted.

Some of the optimizations in the distributed FFT implementation include special transforms for special sizes (such as powers of 2) and the option of choosing different strategies for performing multidimensional FFTs. For instance, the two-dimensional FFT of a distributed MxN matrix can be performed in at least three different ways: using two stride-1 FFTs of length M and N and two transposes, using a stride 1 and stride M FFT with two redistributions and using a stride 1 and stride N FFT with two redistribution steps. The choice of the particular method to use depends on the problem size and is chosen at runtime.

Performance
The following figure illustrates some of the performance gains that have been incorporated in the upcoming release of Star-P for a two-dimensional FFT of size 8192x8192 on a 32-processor SGI Altix machine with 1.5 GHz processors with 64 GB of total memory. The new implementation shows a 200%-300% gain.

Performance gains are also achieved for other platforms, problem sizes and data distributions.

Conclusion
Among commonly used linear algebra operations, FFTs, especially multidimensional ones, typically have a very low computation to memory access ratio. Thanks to libraries such as FFTW, it is possible to achieve close-to-optimal performance on most serial machines. However, it is evident that the last word on distributed-memory architectures is yet to be written. At Interactive Supercomputing, we take high-performance seriously and are constantly trying to push the envelope on “block-buster” math and signal processing operations such as FFTs. As better algorithms and implementations become available, our customers will be pleased to know that these improvements will automatically find their way into the product.

References
Matteo Frigo and Steven G. Johnson, "The Design and Implementation of FFTW3," Proceedings of the IEEE 93 (2), 216–231 (2005) http://www.fftw.org/fftw-paper-ieee.pdf

 

ISC Home | Forward to Friend | Subscribe

©Copyright 2007 Interactive Supercomputing, Inc. and its licensors. All rights reserved.
Interactive Supercomputing, Inc. | 135 Beaver St. | Waltham, MA 02452
Phone: +1.781.419.5050 | Fax: +1.781.419.6050 www.interactivesupercomputing.com
STAR-P™ and the "star" logo are trademarks of Interactive Supercomputing. MATLAB® is a registered trademark of The MathWorks, Inc.