
GPUSort: High Performance Sorting using Graphics Processors

GPUSort versus qsort() on the CPU

We have observed considerable speedups with GPUSort over qsort() running on the CPU, even on high-end machines. The following graphs show the timings for two test systems; in both cases, GPUSort is considerably faster. The qsort routine in the Intel C++ compiler is optimized using hyper-threading and SSE instructions.

The implementation can handle both 16-bit and 32-bit floats. Our results on the ATI X800 XT and NVIDIA GeForce 6800 GPUs indicate better performance than the qsort and STL sort (std::sort) routines on a 3.4 GHz Pentium IV PC. Our timings for GPU-based sorting include the time to upload the data to the GPU and read the results back. On 32-bit floating-point data, the performance on an NVIDIA 7800 GTX GPU is comparable to a hand-optimized implementation on a 3.2 GHz Intel dual-core processor.
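As a rough illustration of the comparison methodology, the sketch below times the two CPU baselines (qsort and std::sort) on 32-bit floats; the GPU figures reported here are measured the same way, with the timed region covering data upload, the sort itself, and readback. The gpuSortFloats() call mentioned in the trailing comment is a hypothetical placeholder, not GPUSort's actual interface.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <vector>

    // Comparator for qsort() on 32-bit floats.
    static int compareFloats(const void* a, const void* b) {
        float x = *static_cast<const float*>(a);
        float y = *static_cast<const float*>(b);
        return (x < y) ? -1 : (x > y ? 1 : 0);
    }

    int main() {
        const std::size_t n = 8 * 1024 * 1024;     // 8M keys, matching the tests above
        std::vector<float> data(n);
        for (std::size_t i = 0; i < n; ++i)
            data[i] = static_cast<float>(std::rand()) / RAND_MAX;

        // CPU baseline 1: qsort().
        std::vector<float> a = data;
        std::clock_t t0 = std::clock();
        std::qsort(&a[0], a.size(), sizeof(float), compareFloats);
        std::printf("qsort:     %.2f s\n", double(std::clock() - t0) / CLOCKS_PER_SEC);

        // CPU baseline 2: std::sort (the STL sort above).
        std::vector<float> b = data;
        std::clock_t t1 = std::clock();
        std::sort(b.begin(), b.end());
        std::printf("std::sort: %.2f s\n", double(std::clock() - t1) / CLOCKS_PER_SEC);

        // The GPU time reported above brackets upload + sort + readback, e.g.
        //   t2 = clock(); gpuSortFloats(data);  /* hypothetical call, not GPUSort's API */
        //   report clock() - t2 alongside the CPU times.
        return 0;
    }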

On pointer data (example 3 of the distribution), we have observed comparable performance against the same hand-optimized implementation on a 3.2 GHz Intel dual-core processor (CPU data courtesy of Cody Northrop, Intel Corporation), and an order-of-magnitude improvement over other sorting implementations on high-end single-core processors. For instance, our algorithm takes nearly 2 s to sort 8 million database records on a 7800 GTX GPU, whereas the hand-optimized algorithm takes 1.5 s on a 3.2 GHz dual-core processor.
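The pointer-data case amounts to sorting (key, record pointer) pairs by key so that the records themselves are never copied. As context for what that workload looks like, here is a minimal CPU-side sketch; the record layout and key type are illustrative assumptions, not GPUSort's.

    #include <algorithm>
    #include <vector>

    // Illustrative database record: only the key participates in the sort.
    struct Record {
        float key;
        char  payload[60];   // rest of the record; never moved by the sort
    };

    // Order records by key, comparing through pointers.
    static bool keyLess(const Record* a, const Record* b) { return a->key < b->key; }

    // Sort pointers to records by key, leaving the records themselves in place.
    void sortRecordPointers(std::vector<const Record*>& ptrs) {
        std::sort(ptrs.begin(), ptrs.end(), keyLess);
    }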

Comparison of GPUSort with previous approaches for sorting on the GPU

The following graph compares the performance of GPUSort with two prior GPU-based sorting algorithms (Purcell et al. 2003, Kipfer et al. 2005) on an NVIDIA 7800 GTX GPU. These algorithms also implement bitonic sort on the GPU for 16/32-bit floats using the programmable pipeline. However, as shown below, GPUSort is at least twice as fast for 32-bit floats because it effectively utilizes the special-purpose texture-mapping functionality of the GPU.

GPUSort vs. other GPU sorting implementations
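For reference, the structure these GPU sorters share is the bitonic sorting network: each stage performs a set of independent compare-exchanges at a fixed distance, so each stage maps naturally onto one rendering pass over a texture holding the keys. The following is a minimal CPU sketch of the network for illustration only (assuming the input size is a power of two); it is not the GPU code itself.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // CPU reference for the bitonic sorting network used by GPU-based sorters:
    // every (k, j) stage below is a set of independent compare-exchanges, which
    // on the GPU becomes one pass over a texture holding the keys.
    // Assumes a.size() is a power of two.
    void bitonicSort(std::vector<float>& a) {
        const std::size_t n = a.size();
        for (std::size_t k = 2; k <= n; k <<= 1) {          // bitonic sequence size
            for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // compare-exchange distance
                for (std::size_t i = 0; i < n; ++i) {
                    std::size_t partner = i ^ j;            // element compared with a[i]
                    if (partner > i) {
                        bool ascending = ((i & k) == 0);    // direction of this subsequence
                        if (ascending ? (a[i] > a[partner]) : (a[i] < a[partner]))
                            std::swap(a[i], a[partner]);
                    }
                }
            }
        }
    }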

GPU Performance Improvement: Super-Moore's Law

The following graph compares the performance of GPUSort using the programmable pipeline on 32-bit floats on two video cards: the GeForce 6800 Ultra, which until recently was NVIDIA's latest graphics card, and the GeForce 7800 GTX, which is its current latest card. Notice that sorting on the 7800 GTX is more than three times as fast as on a 6800 Ultra. This trend is expected to continue for the next several years, which makes sorting on the GPU an attractive option.

Super-Moore's Law

©2003 Department of Computer Science, UNC Chapel Hill