
Architecting for Resilience at Scale: From Research to Practice
Dr. Sudhanva Gurumurthi - AMD Fellow
TBD - TBD
Abstract: Computing must be reliable. From a computer architecture perspective, achieving this goal begins with understanding the root causes of faults and applying systematic, quantitative methods to improve the resilience of hardware components. This talk will illustrate this approach through two case studies. The first describes research that led to a new resilience architecture for die-stacked DRAM that was adopted into the third generation of the JEDEC High-Bandwidth Memory standard (HBM3) and incorporated in GPUs and AI accelerators deployed at scale today in data centers. The second focuses on techniques for designing and testing high-performance CPUs to improve their resilience to faults arising from silicon defects. Together, these examples highlight how principled reliability research can translate into practical impact.
Bio: Sudhanva Gurumurthi is a Fellow at AMD, where he is responsible for research and advanced development in Reliability, Availability, and Serviceability (RAS). His work has impacted numerous AMD products, multiple industry standards, and external research in the field. Before joining industry, he was an Associate Professor with tenure in the Computer Science Department at the University of Virginia. Sudhanva is the recipient of an NSF CAREER Award, a Google Focused Research Award, and an IEEE Computer Society Distinguished Contributor recognition, and has been named to the ISCA Hall of Fame. He currently serves as the Editor-in-Chief of IEEE Computer Architecture Letters. Sudhanva received his PhD in Computer Science and Engineering from Penn State in 2005.
