Philipp Samfass
Doubt and Redundancy Kill Soft Errors---Towards Detection and Correction of Silent Data Corruption in Task-based Numerical Software
Samfass, Philipp; Weinzierl, Tobias; Reinarz, Anne; Bader, Michael
Authors
Professor Tobias Weinzierl tobias.weinzierl@durham.ac.uk
Professor
Dr Anne Reinarz anne.k.reinarz@durham.ac.uk
Associate Professor
Michael Bader
Abstract
Resilient algorithms in high-performance computing are subject to rigorous non-functional constraints: resiliency must not significantly increase the runtime, memory footprint or I/O demands. We propose a task-based soft error detection scheme that relies on error criteria per task outcome. These criteria formalise how “dubious” an outcome is, i.e. how likely it is to contain an error. Our whole simulation is replicated once, forming two teams of MPI ranks that share their task results. Thus, ideally, each team handles only around half of the workload. If a task yields large error criteria values, i.e. is dubious, we compute the task redundantly and compare the outcomes. Whenever they disagree, the task result with the lower error likelihood is accepted. We obtain a self-healing, resilient algorithm which can compensate for silent floating-point errors without a significant performance, I/O or memory footprint penalty. Case studies, however, suggest that careful, domain-specific tailoring of the error criteria remains essential.
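The accept-or-recompute decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Outcome`, `doubt_of`, `run_task` and `resolve`, the jump-based error criterion, the threshold and the injected fault are all hypothetical and chosen only for illustration. The sharing of results between the two teams of MPI ranks is reduced here to a callback that recomputes the task.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Outcome:
    values: List[float]
    doubt: float  # hypothetical error-criterion value; larger means more dubious


def doubt_of(values: List[float]) -> float:
    """Hypothetical criterion: largest jump between neighbouring entries."""
    if not all(math.isfinite(v) for v in values):
        return math.inf
    return max((abs(b - a) for a, b in zip(values, values[1:])), default=0.0)


def run_task(inputs: List[float],
             fault: Optional[Tuple[int, float]] = None) -> Outcome:
    """Stand-in numerical task; `fault` overwrites one entry to mimic a soft error."""
    values = [0.5 * x for x in inputs]
    if fault is not None:
        index, corrupted = fault
        values[index] = corrupted
    return Outcome(values, doubt_of(values))


def resolve(shared: Outcome,
            recompute: Callable[[], Outcome],
            threshold: float) -> Outcome:
    """Accept a shared outcome unless it is dubious; then recompute and compare."""
    if shared.doubt <= threshold:
        return shared                # trusted: no redundant work is spent
    redundant = recompute()          # dubious: compute the task a second time
    if redundant.values == shared.values:
        return shared                # both outcomes agree
    # Outcomes disagree: keep the one with the lower error-criterion value.
    return shared if shared.doubt <= redundant.doubt else redundant


inputs = [0.0, 1.0, 2.0, 3.0]
corrupted = run_task(inputs, fault=(2, 1.0e6))
healed = resolve(corrupted, lambda: run_task(inputs), threshold=1.0)
print(healed.values)
```

In this sketch the redundant computation is triggered only for outcomes whose criterion exceeds the threshold, which mirrors the abstract's aim of keeping the replication overhead low. As the abstract notes, a realistic criterion would need domain-specific tailoring.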
Citation
Samfass, P., Weinzierl, T., Reinarz, A., & Bader, M. (2021, November). Doubt and Redundancy Kill Soft Errors---Towards Detection and Correction of Silent Data Corruption in Task-based Numerical Software. Presented at Supercomputing 21 - FTXS Workshop - 2021 IEEE/ACM 11th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), St Louis, MO
Presentation Conference Type | Conference Paper (published) |
---|---|
Conference Name | Supercomputing 21 - FTXS Workshop - 2021 IEEE/ACM 11th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) |
Start Date | Nov 14, 2021 |
End Date | Nov 19, 2021 |
Acceptance Date | Oct 4, 2021 |
Publication Date | 2021-12 |
Deposit Date | Oct 5, 2021 |
Publicly Available Date | Nov 3, 2022 |
Publisher | Institute of Electrical and Electronics Engineers |
Pages | 1-10 |
DOI | https://doi.org/10.1109/ftxs54580.2021.00005 |
Public URL | https://durham-repository.worktribe.com/output/1138975 |
Additional Information | 14-19 Nov. 2021 |
Files
Accepted Conference Proceeding
(806 Kb)
PDF
Copyright Statement
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.