Application-Driven Target Setting, Fault Impact and High-Performance Computing and Machine Learning Resiliency

Speaker: Michael Paulitsch (Intel)

September 12th, 2022 9:00 am – 10:00 am CET

Abstract: Chip area is expected to grow significantly in the coming years, pushing requirements to detect and tolerate faults to new levels. Setting resiliency targets during chip design is essential in such an environment. We present different metrics for machine learning (ML) networks and an accelerator-focused fault impact analysis. We apply this fault impact analysis framework to typical ML and High-Performance Computing applications and present the benefits of impact analysis. Furthermore, we present automated resiliency approaches at different levels (hardware, software, application); for example, we show the effectiveness of application-level monitors. We also speculate on needed research.

Short bio: Michael Paulitsch brings 20 years of work in theoretical and applied research and technology. He has worked at university and in different industries (aerospace, railway, automotive) on dependability in safety-critical and real-time systems, including security aspects of all types.
Michael has been a Dependability Systems Architect (Principal Engineer) at Intel Labs in Munich, Germany, since 2018. He works on evaluating the dependability (resiliency) of Artificial Intelligence and Machine Learning systems and on ensuring the safe and dependable use of neural network models in safety-critical systems. His focus is on platform faults and the impact of accelerator technology on ML/AI networks as well as high-performance applications. He also looks at novel safety monitoring approaches at different system levels (chip, platform, application) for autonomous systems.

Partial Replication for Reliability from Micro-architecture to Application Levels

Speaker: Osman Sabri Unsal (Barcelona Supercomputing Center)

September 14th, 2022 9:00 am – 10:00 am CET

Abstract: While coding-based reliability implementations such as ECC and parity can protect against bit flips in memory, they do not protect against bit flips in combinational logic or other catastrophic faults that can cause system crashes. To guard against these errors, duplex and triplex modular redundancy methods have been proposed for error detection and correction, respectively. Moreover, replication-based approaches are required in market segments such as supercomputing and hard real-time systems, and in higher fault-rate scenarios such as systems operating in noisy, vibrating, and dirty environments. However, monolithic replication is not resource-efficient, especially regarding overheads on performance and power dissipation. Therefore, there is a need for energy- and performance-efficient partial replication. In this talk, I will present the state of the art in partial replication, with examples ranging from the micro-architecture level (leveraging idle processor resources for opportunistic replication) to the application level (replicating only the most reliability-critical threads).
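As a rough illustration of the duplex/triplex distinction mentioned in the abstract (not taken from the talk itself), the Python sketch below compares two replicas to detect a fault and votes among three replicas to correct one. The function names dmr_check, tmr_vote, and flip_bit are hypothetical, and a single injected bit flip stands in for a real hardware fault.

```python
from collections import Counter


def dmr_check(a, b):
    """Duplex: two replicas agree or not; a mismatch is detected but not located."""
    return a == b


def tmr_vote(a, b, c):
    """Triplex: majority voting masks a single faulty replica."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one replica disagrees")
    return value


def flip_bit(x, bit):
    """Model a transient fault as a single bit flip (illustrative assumption)."""
    return x ^ (1 << bit)


golden = 42                      # result of a fault-free replica
faulty = flip_bit(golden, 3)     # result of a replica hit by a bit flip

detected = not dmr_check(golden, faulty)      # duplex: detection only
corrected = tmr_vote(golden, faulty, golden)  # triplex: detection and correction
```

Full replication of this kind doubles or triples the work for every computation, which is the overhead that partial replication aims to reduce.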

Short bio: Osman Sabri Unsal has managed the Computer Architecture for Parallel Paradigms research group at Barcelona Supercomputing Center (BSC) since 2006. He holds BS, MS, and Ph.D. degrees in electrical and computer engineering from Istanbul Technical University, Brown University, and the University of Massachusetts, Amherst, respectively. His current research interests include computer architecture, reliability and fault tolerance, vector processors, and ensuring programmer productivity. Before joining BSC, he worked at the Intel Microprocessor Research Lab; while at BSC, he co-managed the BSC-Microsoft Research Center. He was the technical leader for four European research projects and is currently involved in the design of the vector accelerator chip in the European Processor Initiative.