Hotel Cap Roig, Platja d'Aro, Catalunya, Spain, July 7-9, 2014
Keynote speaker: Prof. Murali Annavaram, University of Southern California
Title: GPU Reliability: Why it matters and what we can do about it?
Abstract: In the past 10 years, graphics processing units (GPUs) have moved from
gamers' darlings to the backbone of supercomputing infrastructure. While
computational inaccuracies can barely be tolerated even in multimedia domains,
in scientific and general purpose domains computational integrity becomes
sacrosanct. Unlike large out-of-order processors, GPUs use the last available
transistor for non-speculative computation where errors cannot be easily masked.
Unfortunately, technology progression is not on our side in this battle.
Smaller dimensions increase soft error vulnerability of SRAMs, while logic
circuits face fast wearout. Hence, new solutions must be explored for improving
the reliability of computing fabric in general, and GPU fabric in particular.
This talk first looks at accurately modeling the vulnerability of caches to
soft errors. We will discuss the challenges in accurately quantifying the FIT
rate (failures in time, i.e., failures per billion hours) of caches that are
protected by complex error correction schemes. We will answer questions such as:
is parity sufficient to meet a given FIT goal, or do I really need a SECDED
(single error correction, double error detection) code?
In the second half of this talk we will look at mechanisms for verifying
the computational integrity of hundreds of execution units in GPUs. We will
look at both error detection and error correction schemes that take advantage
of resource replication and resource underutilization in GPUs to provide strong
computational integrity guarantees. Along the way, I hope to fire our
collective imagination for new research directions to improve reliability.
Bio: Murali Annavaram has been a faculty member in the Ming Hsieh Department
of Electrical Engineering at the University of Southern California since 2007.
He currently holds the Robert G. and Mary G. Lane Early Career Chair. His
research focuses on energy efficiency and reliability of computing platforms.
His group also works on energy-efficient sensor management for body area sensor
networks for continuous and real-time health monitoring. Murali received an NSF
CAREER award in 2010 and an IBM Faculty Partnership award in 2009.
Prior to his appointment at USC, he was a senior research scientist at the Intel
Microprocessor Research Labs from 2001 to 2007, working on energy-efficient
server design and 3D stacking architectures. In 2007 he was a visiting
researcher at the Nokia Research Center, Palo Alto, working on virtual trip line
based traffic sensing. His work on Energy Per Instruction Throttling at Intel is
the foundation for Turbo Boost, which improves performance at a fixed power
budget. His work on Virtual-Trip-Lines at Nokia formed the foundation for the
Nokia Traffic Works product, which provides real-time traffic sensing using
mobile phones. He received his Ph.D. degree in Computer Engineering from the
University of Michigan, Ann Arbor, in 2001. He is a Senior Member of IEEE and
ACM.
More info at http://www.usc.edu/dept/ee/scip/.