Department of Computer Science

Confirmed List of Invited Speakers

The Why, How and When of Multicore
Anant Agarwal, Professor
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology

As single processors reach their power and performance limits, chip designers are increasingly turning to multicore architectures to stay on the trajectory identified by Moore’s Law. Whether the chips are used as desktop CPUs, DSPs, or other embedded processors, the number of cores is expected to double every 18 months. Going multicore offers a game-changing opportunity for improvements in processing performance and power efficiency, but multicore designers also face many new challenges. Using MIT’s Raw multicore processor as an example, the presentation will address the benefits of multicore and discuss the challenges its designers face.

The presentation identifies the 3 P’s (Power efficiency, Performance and Programmability) as the three biggest challenges faced by multicore, and accordingly as the yardsticks by which various multicore designs will be judged. Interestingly, power efficiency and performance are not only the biggest opportunities afforded by multicore over single processors, but also significant challenges as we scale the number of cores beyond today's single-digit designs. A multicore programming model is a third and daunting challenge involved in efficiently using multiple cores on a chip. The presentation will also discuss some of the myths related to multicore, as well as the realities we face with multicore designs both today and in the future.

Stream Programming: Managing Explicit Parallelism and Locality
William J. Dally, Willard R. and Inez Kerr Bell Professor
Departments of Computer Science and Electrical Engineering
Stanford University

Computing on the edge requires exploiting parallelism to achieve performance and exploiting locality to efficiently use bandwidth. Parallelism can take advantage of the plentiful and inexpensive arithmetic units made possible by modern VLSI technology. However, without locality, bandwidth quickly becomes a bottleneck. Bandwidth, not arithmetic, is the critical resource in a modern computing system. Stream programming simplifies the exploitation of both parallelism and locality. A stream program naturally exposes parallelism across stream elements and kernels. Locality is also exposed, both within and between kernels. This talk will discuss exploitation of parallelism and locality with examples drawn from the Imagine and Merrimac projects and from three generations of stream programming systems.
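
To make the kernel-and-stream idea concrete, here is a minimal sketch in Python. It is not the programming system used in the Imagine or Merrimac projects; the helper names `kernel` and `pipeline` are hypothetical and exist only for this illustration. Each kernel is a pure per-element function, so the elements could be processed in parallel, and chaining the kernels as generators passes each intermediate value straight to the next kernel instead of storing a full intermediate stream.

```python
def kernel(fn):
    """Lift a pure per-element function into a stream kernel (hypothetical helper)."""
    def run(stream):
        for element in stream:
            yield fn(element)
    return run


def pipeline(*kernels):
    """Chain kernels so each element flows producer-to-consumer."""
    def run(stream):
        for k in kernels:
            stream = k(stream)
        return stream
    return run


# Three small kernels; none of them keeps state between elements.
scale = kernel(lambda x: 2.0 * x)
offset = kernel(lambda x: x + 1.0)
square = kernel(lambda x: x * x)

program = pipeline(scale, offset, square)
result = list(program(range(4)))
print(result)  # [1.0, 9.0, 25.0, 49.0]
```

Because the kernels are chained lazily, each element passes through all three kernels before the next one is read, which is the producer-consumer locality between kernels that the abstract refers to.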

Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy 
Jack Dongarra, University Distinguished Professor
Department of Computer Science
University of Tennessee, and
Distinguished Research Staff, CS and Mathematics Division
Oak Ridge National Laboratory

Recent versions of microprocessors deliver substantially higher performance for 32 bit floating point arithmetic (single precision) than for 64 bit floating point arithmetic (double precision). Examples include Intel's Pentium IV and M processors, AMD's Opteron architectures and IBM's Cell processor. When working in single precision, floating point operations can be performed up to two times faster on the Pentium and up to ten times faster on the Cell than in double precision. The performance enhancements in these architectures are derived by accessing extensions to the basic architecture, such as SSE2 in the case of the Pentium and the vector functions on the IBM Cell. The motivation for this paper is to exploit single precision operations whenever possible and resort to double precision at critical stages while attempting to provide the full double precision results. The results described here are fairly general and can be applied to various problems in linear algebra, such as solving large sparse systems using direct or iterative methods, and some eigenvalue problems. There are limitations to the success of this process, such as when the conditioning of the problem exceeds the reciprocal of the accuracy of the single precision computations. In that case the double precision algorithm should be used.
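
The abstract does not name the algorithm, so the following sketch assumes one common way to combine the two precisions: iterative refinement for a dense linear system. The factorization and the triangular solves are rounded to 32 bits, while the residual and the solution update are kept in 64 bits. The function names are hypothetical, and `struct` is used only to emulate single-precision rounding in pure Python; a real implementation would call single-precision library routines instead.

```python
import struct


def f32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_f32(a):
    """LU factorization with partial pivoting; every result is rounded to 32 bits."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, perm


def lu_solve_f32(lu, perm, b):
    """Forward and back substitution, rounded to 32 bits."""
    n = len(lu)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):                      # unit lower-triangular solve
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):            # upper-triangular solve
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y


def solve_mixed(a, b, tol=1e-14, max_iter=30):
    """Single-precision factorization plus double-precision refinement."""
    n = len(a)
    lu, perm = lu_factor_f32(a)             # the O(n^3) work, in single precision
    x = lu_solve_f32(lu, perm, b)
    for _ in range(max_iter):
        # Residual in double precision: this is the "critical stage".
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            break
        d = lu_solve_f32(lu, perm, r)       # cheap correction, single precision
        x = [x[i] + d[i] for i in range(n)]
    return x


a = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
b = [8.0, -6.0, 18.0]                       # exact solution is [1, -2, 3]
x = solve_mixed(a, b)
```

The sketch has no fallback for ill-conditioned systems. As the abstract notes, when the condition number exceeds the reciprocal of single-precision accuracy the refinement may not converge, and the double precision algorithm should be used instead.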

Scalable Benchmarks and Kernels for Data Mining and Analytics
Vipin Kumar, William Norris Professor and Head
Department of Computer Science and Engineering
University of Minnesota

Today's connect-anytime, connect-anywhere digital society is fueling tremendous data growth, transforming the way business, science, and society function. Data sets in the terabyte range are not uncommon today and are expected to reach petabytes in the near future for many application domains in science, engineering, business, bioinformatics, and medicine. The complexity of data is also increasing. For these reasons, there is an increasing need for automated data analysis and mining to extract the required information and knowledge from these data sets. However, the computational complexity of data mining algorithms combined with this deluge of data creates an important challenge. Hence, without a significant leap forward in computing capabilities and technological innovation, the opportunity to harvest this wealth of data will be lost. In this work, we aim to take important first steps towards such a revolution in computing capabilities and develop the underlying infrastructure that will allow other researchers to embark upon this important challenge.

This talk will present an overview of our collaborative project with Northwestern University (Alok Choudhary and Gokhan Memik) whose goal is to (a) develop a benchmarking suite that will be used to understand the bottlenecks in high performance data mining and guide the development of next-generation processors and (b) devise data mining kernels that can be efficiently executed on existing and future processors.

Program Analysis and Synthesis for Parallel Computing
David Padua, Donald Biggar Willett Professor
Department of Computer Science
University of Illinois at Urbana-Champaign

Despite the effort of the last few decades, compiler technology has failed to impact parallel programming practice. Popular parallel programming paradigms today rely on library routines (MPI, threads) or simple translation strategies (OpenMP). However, program analysis and transformation technology has the potential to make significant contributions to parallel programming in the near future. I will discuss three of the most promising possibilities: trace analysis, program synthesis and program transformations to facilitate parallelization. The first approach has already found its way into commercial products. The second approach has also led to useful implementations (e.g., ATLAS, FFTW) although so far the target codes are not parallel. The third approach is only a proposal, but a feasible one.

Communication Analysis of the Cell Broadband Engine Processor
Fabrizio Petrini, Laboratory Fellow
Computational Sciences and Mathematics Division
Computational and Information Sciences Directorate
Pacific NorthWest Labs

The existence of major obstacles to the traditional path to processor performance improvement has led chip manufacturers to consider multi-core designs. These architectural solutions promise a variety of power/performance and area/performance benefits. But additional care must be taken to ensure that these benefits are not lost due to inadequate design of the on-chip communication network.

This paper presents the design challenges of the on-chip network of the Cell Broadband Engine (Cell BE) processor, and describes in detail its architectural design and the network, communication and synchronization protocols. In the experimental evaluation, performed on an early prototype, we analyze the communication characteristics of the Cell BE processor, using a series of microbenchmarks involving various DMA traffic patterns and synchronization protocols. We find that the on-chip communication subsystem is well matched to the computational capacity of the processor. A Synergistic Processing Element (SPE) can issue an internal direct memory access (DMA) operation in less than 4 nanoseconds, and a DMA of a single cache line can be executed in less than 100 nanoseconds. SPEs can achieve the optimal bandwidth of 25.6 GB/second in point-to-point communication with surprisingly small messages, only a few KB, using batches of non-blocking DMAs. The aggregate network behavior under heavy load is also remarkably efficient, reaching almost 200 GB/second with collective patterns and optimal contention resolution under hot-spot traffic.
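
A back-of-the-envelope calculation suggests why a few KB of batched DMAs can be enough. The sketch below applies Little's law (data in flight = bandwidth × latency) to the two figures quoted above. It treats the 100 nanosecond figure as the latency of every transfer and assumes a 128-byte cache line, which the abstract does not state, so the results are rough estimates rather than measurements.

```python
PEAK_BYTES_PER_US = 25_600   # 25.6 GB/second from the abstract, with 1 GB = 10**9 bytes
DMA_LATENCY_NS = 100         # upper bound for a single cache-line DMA, from the abstract
CACHE_LINE_BYTES = 128       # assumption: cache-line size, not stated in the abstract


def bytes_in_flight(bandwidth_bytes_per_us, latency_ns):
    """Little's law: bytes that must be outstanding to keep the link busy."""
    return bandwidth_bytes_per_us * latency_ns // 1000


def blocking_fraction_of_peak(message_bytes, bandwidth_bytes_per_us, latency_ns):
    """Fraction of peak bandwidth when each DMA waits for the previous one."""
    achieved_bytes_per_us = message_bytes * 1000 / latency_ns
    return achieved_bytes_per_us / bandwidth_bytes_per_us


needed = bytes_in_flight(PEAK_BYTES_PER_US, DMA_LATENCY_NS)
lines = -(-needed // CACHE_LINE_BYTES)   # ceiling division
fraction = blocking_fraction_of_peak(CACHE_LINE_BYTES, PEAK_BYTES_PER_US, DMA_LATENCY_NS)

print(needed)    # 2560 bytes outstanding
print(lines)     # 20 cache-line DMAs in flight
print(fraction)  # about 0.05 of peak for one blocking cache-line DMA at a time
```

Under these assumptions, roughly 2.5 KB must be in flight to cover the latency, which is consistent with the "few KB" reported, while issuing one blocking cache-line DMA at a time would reach only about 5% of the peak.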

Who needs optimizing compilers when we have self-optimizing programs?
Keshav Pingali, India Chair of Computer Science
Department of Computer Science
Cornell University

Self-optimizing programs have been advocated recently as an alternative to optimizing compilers. A number of different approaches to self-optimization have been explored in the literature. For example, the ATLAS system for generating optimized BLAS codes performs empirical search for key parameter values like cache and register tile sizes. Other systems like FFTW use self-tuning, cache-oblivious algorithms and data structures to implement portable, high-level, machine-independent programs whose performance can be competitive with that of cache-aware programs customized to particular machines.
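
The empirical search mentioned above can be sketched as follows: run the same blocked matrix multiplication with several candidate tile sizes, time each one, and keep the fastest. This is only an illustration of the idea, not ATLAS's actual search, which generates and times compiled kernels and tunes many more parameters. The function names and the candidate tile sizes are hypothetical, and in pure Python the timing differences will not reflect cache behavior the way they would for compiled code.

```python
import time


def blocked_matmul(a, b, n, tile):
    """Multiply two n-by-n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += aik * row_b[j]
    return c


def empirical_search(n, candidates, repeats=3):
    """Time each candidate tile size and return the fastest one with all timings."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best            # keep the best of several runs
    return min(timings, key=timings.get), timings
```

A call such as `empirical_search(64, [4, 8, 16, 32])` returns the tile size that ran fastest on the current machine together with the measured times. The search needs no model of the cache, which is what distinguishes it from a compiler that derives the tile size analytically.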
 
In this talk, we describe the main results of an extensive experimental study of self-optimizing programs for dense linear algebra problems, and discuss the main lessons from this study for compiler writers. This is joint work with David Padua, Fred Gustavson, Kamen Yotov, Xiaoming Li, Maria Garzaran, John Gunnels, Paul Stodghill, and Tom Roeder.

Computing the Future: Release 2016
Daniel A. Reed, Chancellor's Eminent Professor
Renaissance Computing Institute, and
Department of Computer Science,
Vice Chancellor for Information Technology
University of North Carolina at Chapel Hill

Ten years - a geological epoch on the computing time scale. Looking back, a decade brought the web and consumer email, digital cameras and music, broadband networking, multifunction cell phones, WiFi, HDTV, telematics, multiplayer games, electronic commerce and computational science. It also brought spam, phishing, identity theft, software insecurity, outsourcing and globalization, information warfare and blurred work-life boundaries. What will a decade of technology advances bring in communications and collaboration, sensors and knowledge management, modeling and discovery, electronic commerce and digital entertainment, critical infrastructure management and security?

Prognostication is always fraught with challenges, especially when predicting the effects of exponential change. Aggressively inventing the future, based on perceived needs and opportunities, is far more valuable. In this presentation, we discuss the risks inherent in incrementalism and some of the challenges of changing research culture to accelerate innovation.

Database Systems on Modern Processors
Kenneth Ross, Professor
Department of Computer Science
Columbia University

Database systems are ubiquitous in the modern enterprise, being used for transaction processing, data warehousing, web-site support and many other purposes. Many of today's commercial database systems were initially developed more than 25 years ago. Processor and disk characteristics have changed dramatically over that time, creating major changes to the trade-offs inherent in database system design. For example, today's database applications are typically CPU/memory bound, and not I/O bound. I will give a brief overview of the development of database software systems, with an emphasis on how database system design has responded to the technology of the time. I will describe various ways that researchers today are rethinking database systems in an architecture-sensitive way. Finally, I will touch on opportunities for database systems on future architectures.

Bringing Co-processor Performance to Every Programmer
Turner Whitted, Senior Researcher
Hardware Devices and Graphics Groups
Microsoft Research

While parallel co-processors such as GPUs are commodities on today's computers, access to their performance is blocked by the need for specialized programming skills. Microsoft Research's Accelerator project seeks to remove this hurdle through additions to high level general purpose languages. Programmers can declare a data parallel array, place data in the array, and extract data from the array. Operations on the data parallel array take place within the co-processor. The operations are defined in a library which automatically generates an optimized set of calls into the co-processor.
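
The declare, place, operate and extract pattern described above can be illustrated with a small stand-in. The class below is hypothetical and is not Accelerator's API. It records operations on a data-parallel array as an expression graph and evaluates the graph only when the data is extracted. Here the evaluation runs on the CPU; the abstract indicates that the real library instead generates an optimized set of calls into the co-processor.

```python
import operator


class ParallelArray:
    """Hypothetical data-parallel array that records operations lazily."""

    def __init__(self, data=None, op=None, inputs=(), length=None):
        self._data = None if data is None else [float(v) for v in data]
        self._op = op
        self._inputs = inputs
        self.length = len(self._data) if data is not None else length

    def _combine(self, other, op):
        # Scalars are broadcast to an array of matching length.
        if not isinstance(other, ParallelArray):
            other = ParallelArray([other] * self.length)
        if other.length != self.length:
            raise ValueError("length mismatch")
        return ParallelArray(op=op, inputs=(self, other), length=self.length)

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __mul__(self, other):
        return self._combine(other, operator.mul)

    def to_list(self):
        """Extract the data; the recorded operations are evaluated here."""
        if self._data is not None:
            return list(self._data)
        left, right = (node.to_list() for node in self._inputs)
        return [self._op(x, y) for x, y in zip(left, right)]


x = ParallelArray([1, 2, 3])      # place data in the array
y = ParallelArray([10, 20, 30])
z = x * y + 1.0                   # nothing is computed yet
print(z.to_list())                # [11.0, 41.0, 91.0]
```

Deferring evaluation until extraction gives a library the whole expression at once, so it can decide how to map it onto the co-processor.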

While not all benchmark applications benefit from execution on a GPU, many show as much as an order of magnitude speed improvement. Moreover, for the benchmarks that we have measured, the performance obtained with Accelerator is typically about 50% of that produced by hand-coded GPU programs.

The list will continue to be updated.

Department of Computer Science
Campus Box 3175, Sitterson Hall
College of Arts & Sciences
The University of North Carolina at Chapel Hill
Chapel Hill, NC 27599-3175 USA
Phone: (919) 962-1700
Fax: (919) 962-1799
Content Manager: pubs@cs.unc.edu
Server Manager: webmaster@cs.unc.edu
Last Content Review: 4 May 2006