PhD offer: hierarchical trace format

General context

Running traditional performance analysis tools on a parallel application either collects too little information for detecting problems, or too much information, which significantly degrades the performance of the program being analyzed. Moreover, analyzing the generated traces that consist of GBs of data requires a lot of computing power.

As part of the NumPex project, we propose to design a new tracing tool for exascale applications that would be able to log data from the whole software stack without altering the performance of the application or generating terabytes of data. The tool suite will rely on a hierarchical trace format that groups sequences of events in order to limit the trace size and to allow fast processing of millions of events.

This tool suite consists of several pieces of software, including a tracing tool and a post-mortem analysis tool. Other works will extend these tools in order to make them collaborate with the other components of the ExaSoft software stack.

Objective: designing a scalable tracing tool

The tracing tool will adapt its intrusiveness by enabling or disabling collection points depending on the event frequency. Selecting the functions or software components to instrument is crucial for investigating performance problems. We plan to provide users with a convenient way to specify the points of interest. We also intend to design a mechanism that automatically adapts the intrusiveness depending on the event frequency. When outlier events are detected in a software component, the tracing tool could instrument this component in detail in order to pinpoint the source of the performance problem.
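To give an idea of what frequency-driven intrusiveness could look like, here is a minimal sketch of a collection point that turns itself off when it fires too often and back on when the rate drops. It is only an illustration: the class name `AdaptiveProbe`, its parameters and the fixed-window rate estimate are assumptions made for this example, not the design of the future tool.

```python
class AdaptiveProbe:
    """Hypothetical collection point that adapts to its own event rate."""

    def __init__(self, name, max_events_per_sec=1000.0, window=1.0):
        self.name = name
        self.max_events_per_sec = max_events_per_sec  # rate above which we stop recording
        self.window = window                          # length of the observation window (seconds)
        self.enabled = True
        self.recorded = []                            # timestamps actually stored in the trace
        self._window_start = None
        self._count = 0

    def hit(self, timestamp):
        """Called each time the instrumented code reaches this point."""
        if self._window_start is None:
            self._window_start = timestamp
        elapsed = timestamp - self._window_start
        if elapsed >= self.window:
            # The window is over: decide whether to record during the next one.
            rate = self._count / elapsed
            self.enabled = rate <= self.max_events_per_sec
            self._window_start = timestamp
            self._count = 0
        # Counting stays on even when recording is off, so that the probe
        # can notice when the rate drops and re-enable itself.
        self._count += 1
        if self.enabled:
            self.recorded.append(timestamp)


probe = AdaptiveProbe("MPI_Send", max_events_per_sec=10.0, window=1.0)
for i in range(100):          # burst: 100 events during the first second
    probe.hit(i * 0.01)
probe.hit(1.0)                # the burst is detected here, recording stops
probe.hit(1.5)
probe.hit(2.0)                # the rate is low again, recording resumes
```

In this sketch a disabled probe still pays for a counter increment; finding cheaper ways to keep monitoring a disabled collection point is part of the research question.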

Moreover, in order to limit the memory and CPU overhead of data collection, the tracing tool will exploit the repetitive nature of most parallel applications to perform on-the-fly data compression. We intend to detect at runtime sequences of events that repeat and replace them with meta-events. Recent work shows that detecting this kind of repetitive pattern is feasible at execution time without degrading the performance of a parallel application.
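The idea of replacing repeated sequences with meta-events can be illustrated with a small sketch. The code below greedily folds consecutive repetitions found in a buffer of events into `(pattern, count)` pairs. It is a simplified batch version written for this page: the function names `fold_repeats` and `unfold`, the `max_pattern_len` bound and the string events are all assumptions, and a real tracer would need an incremental algorithm that works as events arrive.

```python
from typing import List, Tuple, Union

Event = str
MetaEvent = Tuple[Tuple[Event, ...], int]   # (repeated pattern, number of repetitions)
Item = Union[Event, MetaEvent]


def fold_repeats(events: List[Event], max_pattern_len: int = 4) -> List[Item]:
    """Replace consecutive repetitions of short patterns with meta-events."""
    out: List[Item] = []
    i, n = 0, len(events)
    while i < n:
        best_len, best_reps = 0, 1
        for p in range(1, max_pattern_len + 1):
            if i + 2 * p > n:          # not enough events left for two repetitions
                break
            pattern = events[i:i + p]
            reps = 1
            while events[i + reps * p:i + (reps + 1) * p] == pattern:
                reps += 1
            # Keep the candidate that covers the most events; ties go to the shorter pattern.
            if reps >= 2 and reps * p > best_len * best_reps:
                best_len, best_reps = p, reps
        if best_len:
            out.append((tuple(events[i:i + best_len]), best_reps))
            i += best_len * best_reps
        else:
            out.append(events[i])
            i += 1
    return out


def unfold(items: List[Item]) -> List[Event]:
    """Expand meta-events back into the original flat sequence."""
    flat: List[Event] = []
    for item in items:
        if isinstance(item, tuple):
            pattern, reps = item
            flat.extend(list(pattern) * reps)
        else:
            flat.append(item)
    return flat


trace = ["init"] + ["send", "recv"] * 3 + ["barrier"]
folded = fold_repeats(trace)
# folded == ["init", (("send", "recv"), 3), "barrier"]
```

The folded form stores three items instead of eight events, and `unfold` restores the original sequence, so no information is lost. Nesting such meta-events (loops inside loops) is what leads to a hierarchical trace structure.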

Expected work

As a PhD student working on this subject, you will conduct experimental research on systems, which includes:

– reading and analyzing research papers
– implementing system software that scales on large machines
  - designing high-performance data structures
  - designing runtime systems for collecting and storing large amounts of data without degrading application performance
– running experiments on real distributed systems
  - designing experiments to evaluate the properties of your runtime system
  - making real applications run on your runtime system
  - understanding the performance and behavior of the applications

PhD advisors

– Prof. François Trahay — https://trahay.wp.imtbs-tsp.eu/
– Dr. Valentin Honoré — http://web4.ensiie.fr/~valentin.honore

This PhD will be conducted in collaboration with the [Polaris team at Inria Grenoble](https://polaris.imag.fr/).

Related work

This PhD is in line with several works done in the PDS group, including:
– A work where we perform a post-mortem detection of repetitive sequences of events in a trace in order to filter out redundant parts of a large trace [1]

– A recent PhD thesis that detected repetitive sequences at runtime in order to capture the overall behavior of a program [2]

Several external works are related:
– ScalaTrace records compressed traces [3]. It detects sequences of events that repeat, and reduces the amount of data to be recorded. This work focuses on compressing data, and does not take into account the post-mortem processing of the produced trace.

[1] Trahay, F., Brunet, E., Bouksiaa, M. M., & Liao, J. Selecting points of interest in traces using patterns of events. In PDP’15.
[2] Colin, A., Trahay, F., & Conan, D. PYTHIA: an oracle to guide runtime system decisions. In CLUSTER’22.
[3] Mueller, F., Wu, X., Schulz, M., de Supinski, B. R., & Gamblin, T. ScalaTrace: tracing, analysis and modeling of HPC codes at scale. In International Workshop on Applied Parallel Computing, 2010.
