Confirmed List of Invited Speakers
|
The Why, How and When of Multicore
Anant Agarwal, Professor
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
As single processors reach their power and performance limits,
chip designers are increasingly turning to multicore architectures
to stay on the trajectory identified by Moore’s Law. Whether
the chips are used as desktop CPUs, DSPs, or other embedded processors,
the number of cores is expected to double every 18 months. Going
multicore offers a game-changing opportunity for improvements in
processing performance and power efficiency, but multicore designers
also face many new challenges. Using MIT’s Raw multicore processor
as an example, the presentation will address the benefits of multicore
as well as the challenges that multicore designs face.
The presentation identifies the 3 P’s: Power efficiency,
Performance and Programmability, as the three biggest challenges
faced by multicore, and accordingly the yardsticks by which various
multicore designs will be judged. Interestingly, power efficiency
and performance are not only the biggest opportunities afforded
by multicore over single processors, but also significant challenges
as we scale the number of cores beyond today's single digit designs.
A multicore programming model is a third and daunting challenge
involved in efficiently using multiple cores on a chip. The presentation
will also discuss some of the myths related to multicore, as well
as the realities we face with multicore designs both today and in
the future.
|
|
Stream Programming: Managing Explicit Parallelism and Locality
William J. Dally, Willard R. and Inez Kerr Bell Professor
Departments of Computer Science and Electrical Engineering
Stanford University
Computing on the edge requires exploiting parallelism to achieve
performance and exploiting locality to efficiently use bandwidth.
Parallelism can take advantage of the plentiful and inexpensive
arithmetic units made possible by modern VLSI technology. However,
without locality, bandwidth quickly becomes a bottleneck. Bandwidth,
not arithmetic, is the critical resource in a modern computing system.
Stream programming simplifies the exploitation of both parallelism
and locality. A stream program naturally exposes parallelism across
stream elements and kernels. Locality is also exposed, both within
and between kernels. This talk will discuss exploitation of parallelism
and locality with examples drawn from the Imagine and Merrimac projects
and from three generations of stream programming systems.
|
Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy
Jack Dongarra, University Distinguished Professor
Department of Computer Science
University of Tennessee, and
Distinguished Research Staff, CS and Mathematics Division
Oak Ridge National Laboratory
Recent versions of microprocessors exhibit performance for 32 bit
floating point arithmetic (single precision) that is substantially
higher than for 64 bit floating point arithmetic (double precision).
Examples include Intel's Pentium IV and M processors, AMD's
Opteron architectures and IBM's Cell processor. When working in
single precision, floating point operations can be performed up to two
times faster on the Pentium and up to ten times faster on the Cell
than in double precision. The performance enhancements in these
architectures are derived by accessing extensions to the basic
architecture, such as SSE2 in the case of the Pentium and the vector
functions on the IBM Cell. The motivation for this paper is to exploit
single precision operations whenever possible and resort to double
precision at critical stages while attempting to provide the full
double precision results. The results described here are fairly
general and can be applied to various problems in linear algebra, such
as solving large sparse systems with direct or iterative methods, and to
some eigenvalue problems. There are limitations to the success of this
process, such as when the condition number of the problem exceeds the
reciprocal of the accuracy of the single precision computations. In
that case the double precision algorithm should be used.
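One common way to realize the idea in this abstract is mixed precision iterative refinement. The sketch below assumes that technique and is not the speaker's implementation. It factors and solves in simulated 32 bit arithmetic, then computes residuals and updates in 64 bit. The helper names (f32, lu_factor_f32, lu_solve_f32, refine_solve) are hypothetical, and single precision is emulated by rounding through the standard struct module.

```python
import struct


def f32(x):
    """Round a 64 bit Python float to the nearest 32 bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_f32(a):
    """LU factorization with partial pivoting, every result rounded to 32 bit."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        # Pick the largest pivot in column k and swap rows.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, perm


def lu_solve_f32(lu, perm, b):
    """Forward and backward substitution in simulated 32 bit arithmetic."""
    n = len(lu)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):                      # L has a unit diagonal
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):            # solve with U
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y


def refine_solve(a, b, tol=1e-14, max_iter=20):
    """Solve a x = b: 32 bit factorization, 64 bit residuals and updates."""
    n = len(a)
    lu, perm = lu_factor_f32(a)             # expensive step, single precision
    x = lu_solve_f32(lu, perm, b)           # initial single precision solution
    for _ in range(max_iter):
        # Residual r = b - a x, computed in double precision.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            break
        d = lu_solve_f32(lu, perm, r)       # correction from the cheap factors
        x = [x[i] + d[i] for i in range(n)] # update accumulated in double
    return x
```

For a well conditioned matrix, a few refinement steps are typically enough to bring the error close to double precision. As the abstract notes, the iteration is not expected to converge once the condition number approaches the reciprocal of single precision accuracy.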
|
Scalable Benchmarks and Kernels for Data Mining and Analytics
Vipin Kumar, William Norris Professor and Head
Department of Computer Science and Engineering
University of Minnesota
Today's connect-anytime-and-anywhere digital society is fueling
tremendous data growth, transforming the way business, science,
and society function. Data sets in the terabyte range are not uncommon today
and are expected to reach petabytes in the near future for many
application domains in science, engineering, business, bioinformatics,
and medicine. In addition, the complexity of data is also increasing.
For these reasons, there is an increasing need for automated data
analysis and mining to extract the required information and knowledge
from these data sets. However, the computational complexity of data
mining algorithms combined with this deluge of data creates an important
challenge. Hence, without
a significant leap forward in computing capabilities and technological
innovation, the opportunity to harvest this wealth of data will
be lost. In this work, we aim to take important first steps towards
such a revolution in computing capabilities and develop the underlying
infrastructure that will allow other researchers to embark upon
this important challenge.
This talk will present an overview of our collaborative project
with Northwestern University (Alok Choudhary and Gokhan Memik) whose
goal is to (a) develop a benchmarking suite that will be used to
understand the bottlenecks in high performance data mining and guide
the development of next-generation processors and (b) devise
data mining kernels that can be efficiently executed on existing
and future processors.
|
Program Analysis and Synthesis for Parallel Computing
David Padua, Donald Biggar Willett Professor
Department of Computer Science
University of Illinois at Urbana-Champaign
Despite the effort
of the last few decades, compiler technology has failed to impact
parallel programming practice. Popular parallel programming paradigms
today rely on library routines (MPI, threads) or simple translation
strategies (OpenMP). However, program analysis and transformation
technology has the potential to make significant contributions to
parallel programming in the near future. I will discuss three of the
most promising possibilities: trace analysis, program synthesis and
program transformations to facilitate parallelization. The first
approach has already found its way into commercial products. The
second approach has also led to useful implementations (e.g., ATLAS,
FFTW), although so far the target codes are not parallel. The third
approach is only a proposal, but a feasible one.
|
Communication Analysis of the Cell Broadband Engine Processor
Fabrizio Petrini, Laboratory Fellow
Computational Sciences and Mathematics Division
Computational and Information Sciences Directorate
Pacific Northwest National Laboratory
The existence of major obstacles to the traditional path to processor
performance improvement has led chip manufacturers to consider multi-core
designs. These architectural solutions promise a variety of power/performance
and area/performance benefits. But additional care must be taken
to ensure that these benefits are not lost due to inadequate design
of the on-chip communication network.
This paper presents the design challenges of the on-chip network
of the Cell Broadband Engine (Cell BE) processor, and describes
in detail its architectural design and the network, communication
and synchronization protocols. In the experimental evaluation, performed
on an early prototype, we analyze the communication characteristics
of the Cell BE processor, using a series of microbenchmarks involving
various DMA traffic patterns and synchronization protocols. We find
that the on-chip communication subsystem is well matched to the
computational capacity of the processor. A Synergistic Processing
Element (SPE) can issue an internal direct memory access (DMA) operation
in less than 4 nanoseconds, and a DMA of a single cache line can
be executed in less than 100 nanoseconds. SPEs can achieve the
optimal bandwidth of 25.6 GB/second in point to point communication
with surprisingly small messages (only a few KB), using batches of
non-blocking DMAs. The aggregate network behavior under heavy load
is also remarkably efficient, reaching almost 200 GB/second with
collective patterns and optimal contention resolution under hot-spot
traffic.
|
Who needs optimizing compilers when we have self-optimizing programs?
Keshav Pingali, India Chair of Computer Science
Department of Computer Science
Cornell University
Self-optimizing programs have been advocated recently as an
alternative to optimizing compilers. A number of different approaches
to self-optimization have been explored in the literature. For
example, the ATLAS system for generating optimized BLAS codes performs
empirical search for key parameter values like cache and register tile
sizes. Other systems like FFTW use self-tuning, cache-oblivious
algorithms and data structures to implement portable, high-level,
machine-independent programs whose performance can be competitive with
that of cache-aware programs customized to particular machines.
In this talk, we describe the main results of an extensive
experimental study of self-optimizing programs for dense linear
algebra problems, and discuss the main lessons from this study for
compiler writers. This is joint work with David Padua, Fred Gustavson,
Kamen Yotov, Xiaoming Li, Maria Garzaran, John Gunnels, Paul Stodghill,
and Tom Roeder.
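The empirical search mentioned in this abstract can be illustrated with a toy example. The following is only a sketch under stated assumptions, not ATLAS code. It times a blocked matrix multiply for a few candidate tile sizes and keeps the fastest. The function names blocked_matmul and empirical_search and the test matrices are hypothetical. Pure Python hides most cache effects, so the timings show the search procedure and not realistic tuning results.

```python
import time


def blocked_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Work on one tile of a, b and c at a time.
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += aik * row_b[j]
    return c


def empirical_search(n, candidates, repeats=3):
    """Time each candidate tile size and return the fastest with all timings."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best                # keep the best of several runs
    return min(timings, key=timings.get), timings
```

A call such as empirical_search(64, [4, 8, 16, 32]) returns the chosen tile size together with the measured times. The selected value depends on the machine, which is the point of searching empirically instead of relying on a fixed model.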
|
Computing the Future: Release 2016
Daniel A. Reed, Chancellor's Eminent Professor
Renaissance Computing Institute, and
Department of Computer Science,
Vice Chancellor for Information Technology
University of North Carolina at Chapel Hill
Ten years - a geological epoch on the computing
time scale. Looking back, a decade brought the web and consumer email,
digital cameras and music, broadband networking, multifunction cell
phones, WiFi, HDTV, telematics, multiplayer games, electronic commerce
and computational science. It also brought spam, phishing, identity
theft, software insecurity, outsourcing and globalization, information
warfare and blurred work-life boundaries. What will a decade of
technology advances bring in communications and collaboration, sensors
and knowledge management, modeling and discovery, electronic commerce
and digital entertainment, critical infrastructure management and
security?
Prognostication is always fraught with challenges,
especially when predicting the effects of exponential change.
Aggressively inventing the future, based on perceived needs and
opportunities, is far more valuable. In this presentation, we discuss
the risks inherent in incrementalism and some of the challenges in
research culture change to accelerate innovation.
|
Database Systems on Modern Processors
Kenneth Ross, Professor
Department of Computer Science
Columbia University
Database systems are ubiquitous in the modern enterprise, being
used for transaction processing, data warehousing, web-site support
and many other purposes. Many of today's commercial database systems
were initially developed more than 25 years ago. Processor and disk
characteristics have changed dramatically over that time, creating
major changes to the trade-offs inherent in database system design.
For example, today's database applications are typically CPU/memory
bound, and not I/O bound. I will give a brief overview of the development
of database software systems, with an emphasis on how database system
design has responded to the technology of the time. I will describe
various ways that researchers today are rethinking database systems
in an architecture-sensitive way. Finally, I will touch on opportunities
for database systems on future architectures.
|
Bringing Co-processor Performance to Every Programmer
Turner Whitted, Senior Researcher
Hardware Devices and Graphics Groups
Microsoft Research
While parallel co-processors such as GPUs are commodities on today's
computers, access to their performance is blocked by the need for
specialized programming skills. Microsoft Research's Accelerator
project seeks to remove this hurdle through additions to high level
general purpose languages. Programmers can declare a data parallel
array, place data in the array, and extract data from the array.
Operations on the data parallel array take place within the co-processor.
The operations are defined in a library which automatically generates
an optimized set of calls into the co-processor.
While not all benchmark applications benefit from execution on
a GPU, many show as much as an order of magnitude speed improvement.
Moreover, for the benchmarks that we have measured, the performance
obtained with Accelerator is typically about 50% of that produced
by hand-coded GPU programs.
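The programming model in this abstract (declare a data parallel array, place data in it, operate on it, extract the result) can be sketched as follows. This is a hypothetical illustration and not the Accelerator API. The class name ParallelArray and its methods are invented for this example, and the recorded operations are evaluated on the CPU as a stand-in for generated co-processor calls.

```python
import operator


class ParallelArray:
    """Hypothetical data parallel array that records operations lazily."""

    def __init__(self, data=None, op=None, args=(), length=None):
        # A leaf node holds data; an interior node holds an operation.
        self._data = None if data is None else [float(v) for v in data]
        self._op = op
        self._args = args
        self.length = len(self._data) if data is not None else length

    def _combine(self, other, op):
        if not isinstance(other, ParallelArray):
            other = ParallelArray([other] * self.length)  # broadcast a scalar
        if other.length != self.length:
            raise ValueError("array lengths differ")
        # Nothing is computed here; only the expression is recorded.
        return ParallelArray(op=op, args=(self, other), length=self.length)

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __mul__(self, other):
        return self._combine(other, operator.mul)

    def to_list(self):
        """Extract the data, evaluating the recorded expression elementwise."""
        if self._data is not None:
            return list(self._data)
        left, right = (arg.to_list() for arg in self._args)
        return [self._op(x, y) for x, y in zip(left, right)]


x = ParallelArray([1, 2, 3])        # declare the array and place data in it
y = ParallelArray([10, 20, 30])
z = x * 2.0 + y                     # operations are recorded, not yet run
result = z.to_list()                # extraction triggers the evaluation
```

Deferring the work until extraction gives a library the whole expression at once, which is one plausible way it could generate a combined set of co-processor calls instead of one call per operator.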
|
The list will continue to be updated.