GPUFFTW

Introduction

GPUFFTW is a fast FFT library designed to exploit the computational performance and memory bandwidth of GPUs. The library exploits the data parallelism available on current GPUs and pipelines the computation across the different stages of the graphics processor. It also uses an efficient tiling strategy to further improve the memory performance of the algorithm. GPUFFTW can efficiently handle large real and complex 1-D arrays at 32-bit floating-point precision on commodity GPUs. Using an NVIDIA 8800 GPU and the FFTW metric for measuring performance, our algorithm achieves over 29 GFLOPS on large 1-D FFTs. Furthermore, our FFT algorithm achieves precision comparable to IEEE 32-bit FFT algorithms on CPUs, even on large 1-D arrays. The library supports both Windows and Linux platforms.
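
For readers unfamiliar with the FFTW metric mentioned above, the following is a minimal sketch of how such a figure is usually derived from a timing. It assumes the FFTW benchmark convention, which credits a complex 1-D FFT of length N with 5 N log2(N) floating-point operations (half that for real input) regardless of the operations actually executed. The function name and the timing in the usage line are hypothetical and are not measurements from GPUFFTW.

```python
import math


def fftw_gflops(n: int, seconds: float, real_input: bool = False) -> float:
    """Nominal GFLOPS under the FFTW benchmark convention.

    A complex 1-D FFT of length n is credited with 5 * n * log2(n)
    floating-point operations; a real-input transform is credited
    with half of that. This is a nominal count, not the number of
    operations the hardware actually performed.
    """
    if n < 2 or seconds <= 0:
        raise ValueError("n must be >= 2 and seconds must be positive")
    flops = 5.0 * n * math.log2(n)
    if real_input:
        flops /= 2.0
    return flops / seconds / 1e9


# Hypothetical timing for illustration only: a 2**20-point complex FFT in 4 ms.
print(f"{fftw_gflops(2**20, 4e-3):.2f} GFLOPS")
```

Because the operation count is nominal, the metric is best read as a normalized speed that makes different FFT implementations comparable at the same N.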

Please refer to the documentation for details regarding the API and the contents of the distribution. Also, please read through the system requirements below before using the library.

Note: GPUFFTW runs correctly on Windows XP with an 8800 GTX using the latest NVIDIA drivers (158.19). It also runs on Windows Vista and on earlier NVIDIA GPUs and drivers.

System Requirements

OS: Microsoft Windows XP/2000 and Linux
RAM: At least as much system RAM as the graphics processor's video memory is required.
GPU: NVIDIA GeForce/Quadro family card with support for the following OpenGL extensions:
1. EXT_framebuffer_object
2. ARB_texture_rectangle
3. ARB_fragment_program
4. Multiple Render Targets
The above requirements are met by NV40-based GPUs and above (GeForce 6, GeForce 7 and GeForce 8 series)
The library has been tested on the following cards:
- GeForce 8 series (use latest drivers)
- GeForce 7 series
- GeForce 6 series
- Quadro FX 4000
- Laptop graphics cards: GeForce 7/6 series based laptop cards
For reasonably high performance, we recommend a PC with an AGP 8X or PCI-Express NVIDIA GeForce 6800 GT or faster GPU.
Video RAM: The amount of video RAM determines the maximum array length that can be transformed on the GPU. A rough guideline for performing an FFT on 32-bit floats is: maximum array length in millions = video RAM in MB / 32. Therefore, on a card with 256 MB of video RAM, the longest array that can be transformed is 256/32 = 8 million real values, or 4 million complex values.
Drivers: Latest drivers from NVIDIA (version 7772 or higher for Windows, and 7664 for Linux)
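
The Video RAM guideline above can be sketched as a small helper. This is only an illustration of the stated rule of thumb (video RAM in MB divided by 32); the function name is hypothetical, and whether "million" means 10^6 or 2^20 is not specified in the guideline, so the result is left in the same unit the text uses.

```python
from typing import Tuple


def max_fft_length_millions(vram_mb: float) -> Tuple[float, float]:
    """Apply the rough guideline: length in millions = VRAM in MB / 32.

    Returns (real_values, complex_values) in millions. A complex value
    occupies two 32-bit floats, so the complex capacity is half the
    real capacity.
    """
    if vram_mb <= 0:
        raise ValueError("vram_mb must be positive")
    real_millions = vram_mb / 32.0
    complex_millions = real_millions / 2.0
    return real_millions, complex_millions


# The 256 MB example from the guideline.
real_m, complex_m = max_fft_length_millions(256)
print(f"256 MB: about {real_m:g}M real or {complex_m:g}M complex values")
```

Treat the result as an upper bound for planning; the guideline is described as rough, so the usable length on a given card and driver may be lower.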

Note:

FFTW: We are not porting FFTW to GPUs, and our project is not related to FFTW. FFTW is a more general library designed mainly for CPUs. GPUFFTW is the fastest FFT library on GPUs, in the way that FFTW is on CPUs; there is no other similarity between the two projects.

ATI cards: ATI cards are not supported in the present release of GPUFFTW, mainly due to the lack of support for ARB_texture_rectangle in fragment programs on current ATI drivers. These cards will be supported in future releases.
Higher Dimensional FFTs: Our current code only handles 1D power-of-two single-precision FFTs. Future releases may include 2D and 3D FFTs.
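
Since only power-of-two lengths are handled, callers may want to validate an array length before handing data to the library. The following is a minimal sketch of such a check; the helper names are hypothetical and are not part of the GPUFFTW API.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two (1, 2, 4, 8, ...)."""
    # A power of two has exactly one bit set, so n & (n - 1) clears it to zero.
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is greater than or equal to n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


for length in (1000, 1024, 1 << 23):
    print(length, is_power_of_two(length), next_power_of_two(length))
```

Zero-padding an input up to the next power of two is a common workaround, but it changes the frequency sampling of the result, so it is not equivalent to a transform of the original length.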

Publications

Cache-Efficient Numerical Algorithms using Graphics Hardware,
Naga K. Govindaraju, Dinesh Manocha
UNC Tech. Report 2007.
A Memory Model for Scientific Algorithms on Graphics Processors,
Naga K. Govindaraju, Scott Larsen, Jim Gray, Dinesh Manocha
UNC Tech. Report 2006.

Acknowledgements

We would like to thank Jim Gray for useful discussions and encouragement. Many thanks to Scott Larsen, John Owens, Daniel Horn, David Tuft and members of the UNC GAMMA group for helping us compare against optimized libraries on GPUs and CPUs.

Research Sponsors

Army Modeling and Simulation Office
Army Research Office
Defense Advanced Research Projects Agency
RDECOM
NVIDIA Corporation

Department of Computer Science, UNC Chapel Hill

GPUFFTW: High Performance Power-of-Two FFT Library using Graphics Processors