Computer Science Department of Telecom SudParis

Pallas: HPC Trace Analysis at scale

Team work: Catherine Guelque presented "Pallas: HPC Trace Analysis at scale" (Compas'24) in 4A312 on 17 May 2024 at 10:30.


Traces are used in HPC for post-mortem performance analysis and are a useful tool for investigating application performance problems. However, identifying a performance bottleneck often requires collecting a large amount of information, which makes traces huge. The problem worsens for large-scale applications that run many threads for a long time. Beyond the cost of storing these large traces, analyzing them is also difficult: the analysis tool must process gigabytes, or even terabytes, of data, which is time-consuming. However, it has been shown that many HPC applications have recurring patterns, that timing data is the heaviest part of a trace, and that similar events have similar durations, meaning that traces can be compressed efficiently.
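To illustrate why "similar events have similar durations" helps compression, here is a minimal sketch in Python. It is not the Pallas format: the event names, the typical durations, the 1% jitter, and the use of zlib are all assumptions chosen for illustration. It compares compressing absolute timestamps in trace order with compressing durations grouped per event kind, assuming events run back to back so that the durations plus the event order are enough to rebuild the timestamps.

```python
import random
import struct
import zlib
from collections import defaultdict

# Hypothetical event kinds with typical durations in nanoseconds (illustrative values).
TYPICAL_DURATION = {"MPI_Send": 2_000, "MPI_Recv": 2_500, "compute": 50_000}


def make_trace(iterations, seed=0):
    """Build a synthetic trace as a list of (event_name, start_timestamp, duration)."""
    rng = random.Random(seed)
    trace, now = [], 0
    for _ in range(iterations):
        for name, typical in TYPICAL_DURATION.items():
            spread = typical // 100  # assume roughly 1% jitter around the typical duration
            duration = typical + rng.randint(-spread, spread)
            trace.append((name, now, duration))
            now += duration  # events are assumed to run back to back
    return trace


def pack(values):
    """Serialize non-negative integers as little-endian unsigned 64-bit values."""
    return struct.pack(f"<{len(values)}Q", *values)


def compressed_sizes(trace):
    """Return (size with absolute timestamps, size with per-kind durations) in bytes."""
    # Layout 1: absolute timestamps, stored in trace order.
    raw = zlib.compress(pack([start for _, start, _ in trace]))
    # Layout 2: durations only, grouped by event kind so similar values sit together.
    by_kind = defaultdict(list)
    for name, _, duration in trace:
        by_kind[name].append(duration)
    grouped = sum(len(zlib.compress(pack(d))) for d in by_kind.values())
    return len(raw), grouped


if __name__ == "__main__":
    raw_size, grouped_size = compressed_sizes(make_trace(1000))
    print(f"absolute timestamps: {raw_size} bytes, grouped durations: {grouped_size} bytes")
```

On this synthetic trace the grouped layout compresses to fewer bytes, because each group contains values drawn from a narrow range. Real traces will behave differently depending on how regular the application is.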
We propose a new trace format named Pallas, which exploits the regularity of HPC applications to provide both lightweight traces and fast, efficient post-mortem analysis. Pallas is a library that provides tracing tools with event storage functionality. When writing a trace, Pallas automatically detects patterns and stores statistical data for later analysis. The trace is then stored with the timestamps separated from the structure. This allows the structure to be loaded and analyzed independently of the timestamps, which makes analysis near-instantaneous when the timestamps are not needed.
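The idea of separating structure from timestamps can be sketched as follows. This is a simplified illustration in Python, not the pattern detection algorithm used by Pallas: the greedy folding strategy, the function names `fold_repeats` and `count_events`, and the event names are all hypothetical. The sketch folds immediately repeated blocks of events into (block, repetition count) pairs, then answers a question such as "how many times did each event occur?" from that compact structure alone, without reading any timestamp.

```python
from collections import Counter


def fold_repeats(events):
    """Greedily fold immediately repeated blocks into (block, count) pairs."""
    structure, i, n = [], 0, len(events)
    while i < n:
        best_len, best_count = 1, 1
        # Try every block length that could repeat at least twice from position i.
        for length in range(1, (n - i) // 2 + 1):
            block = events[i:i + length]
            count = 1
            while events[i + count * length:i + (count + 1) * length] == block:
                count += 1
            # Keep the candidate that covers the most events.
            if count > 1 and count * length > best_len * best_count:
                best_len, best_count = length, count
        structure.append((tuple(events[i:i + best_len]), best_count))
        i += best_len * best_count
    return structure


def count_events(structure):
    """Count occurrences per event kind using only the folded structure."""
    counts = Counter()
    for block, repetitions in structure:
        for name in block:
            counts[name] += repetitions
    return counts


if __name__ == "__main__":
    events = ["init"] + ["send", "recv", "compute"] * 4 + ["finalize"]
    structure = fold_repeats(events)
    print(structure)
    print(count_events(structure))
```

In this sketch the timestamps would live in a separate array, indexed by event occurrence, and would only be loaded by analyses that need them. The cost of `count_events` depends on the size of the folded structure, not on the number of recorded events.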
Our implementation provides an OTF2-like API, which allows tracing tools such as EZTrace to use it transparently. We evaluate Pallas by comparing it with OTF2 and Pilgrim on several applications, and by running several types of performance analysis on execution traces. Our experiments show that Pallas reduces trace size in most cases and allows several types of performance analysis to run in near-constant time.
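One way to see how storing statistical data at write time can make some analyses nearly independent of trace length is the sketch below. It is an assumption about the general technique, not a description of the Pallas or OTF2 API: the `DurationSummary` class and `record_event` helper are hypothetical names. Each event kind keeps a running summary that is updated as events are recorded, so a query such as the mean duration reads a few stored numbers instead of scanning every event.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DurationSummary:
    """Running statistics for one event kind, updated as events are recorded."""
    count: int = 0
    total: int = 0
    minimum: Optional[int] = None
    maximum: Optional[int] = None

    def record(self, duration):
        self.count += 1
        self.total += duration
        self.minimum = duration if self.minimum is None else min(self.minimum, duration)
        self.maximum = duration if self.maximum is None else max(self.maximum, duration)

    @property
    def mean(self):
        # Answered from stored values only; no per-event data is read.
        return self.total / self.count


def record_event(summaries, name, duration):
    """Update the summary of event kind `name` (hypothetical helper)."""
    summaries.setdefault(name, DurationSummary()).record(duration)


if __name__ == "__main__":
    summaries = {}
    for duration in (10, 20, 30):
        record_event(summaries, "compute", duration)
    print(summaries["compute"].mean, summaries["compute"].minimum, summaries["compute"].maximum)
```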