Parallel and Distributed Systems Group

Computer Science Department of Telecom SudParis

New paper “KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training”, to be presented at NeurIPS’23

Available online: https://hal.archives-ouvertes.fr/hal-03750441/document

Code available at https://github.com/TruongThaoNguyen/kakurenbo

Authors: Thao Truong Nguyen, Balazs Gerofi, Edgar Josafat Martinez-Noriega, François Trahay, Mohamed Wahib.

Abstract: This paper proposes a method for hiding the least-important samples during the training of deep neural networks to increase efficiency, i.e., to reduce the cost of training. Using information about the loss and prediction confidence during training, we adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process, without significantly degrading accuracy. We explore the convergence properties when accounting for the reduction in the number of SGD updates. Empirical results on various large-scale datasets and models used directly in image classification and segmentation show that, while the with-replacement importance sampling algorithm performs poorly on large datasets, our method can reduce total training time by up to 22% while impacting accuracy by only 0.4% compared to the baseline.
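To make the idea of hiding samples concrete, here is a minimal sketch of one way such a selection could look. It is not the authors' implementation (see the repository linked above for that): the function name `select_hidden`, the `max_fraction` budget, and the `min_confidence` threshold are all assumptions made for illustration. The sketch treats the lowest-loss samples as candidates and hides only those the model already predicts with high confidence.

```python
def select_hidden(losses, confidences, max_fraction=0.3, min_confidence=0.7):
    """Return the sorted indices of samples to hide in the next epoch.

    Hypothetical selection rule, for illustration only:
    - at most `max_fraction` of the samples may be hidden;
    - candidates are the samples with the lowest loss;
    - a candidate is hidden only if its prediction confidence is high.
    """
    n = len(losses)
    budget = int(n * max_fraction)
    # Rank sample indices by ascending loss; low-loss samples come first.
    by_loss = sorted(range(n), key=lambda i: losses[i])
    candidates = by_loss[:budget]
    # Keep low-loss samples visible when the model is still unsure about them.
    hidden = [i for i in candidates if confidences[i] >= min_confidence]
    return sorted(hidden)


# Per-sample statistics, as they might be collected during the previous epoch.
losses = [0.05, 2.30, 0.10, 0.01, 1.20, 0.02]
confidences = [0.95, 0.20, 0.60, 0.99, 0.40, 0.97]

hidden = select_hidden(losses, confidences, max_fraction=0.5)
visible = [i for i in range(len(losses)) if i not in hidden]
print("hidden:", hidden)
print("visible:", visible)
```

Recomputing the selection every epoch is what makes such a scheme adaptive: a sample hidden in one epoch can become visible again if its loss or confidence changes.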

New paper “PYTHIA: an oracle to guide runtime system decisions” to be presented at Cluster’22

Available online: https://hal.archives-ouvertes.fr/hal-03750441/document

Abstract

Runtime systems are commonly used by parallel applications to exploit the underlying hardware resources efficiently. A runtime system hides the complexity of managing the hardware and exposes a high-level interface to application developers. To this end, it makes decisions by relying on heuristics that estimate the future behavior of the application. In this paper, we propose PYTHIA, a library that serves as an oracle capable of predicting the future behavior of an application, so that the runtime system can make more informed decisions. PYTHIA builds on the deterministic nature of many HPC applications: by recording an execution trace, PYTHIA captures the application's main behavior. The trace can be provided for future executions of the application, and a runtime system can ask for predictions of future program behavior. We evaluate PYTHIA on 13 MPI applications and show that PYTHIA can accurately predict the future of most of these applications, even when varying the problem size. We demonstrate how PYTHIA predictions can guide a runtime system optimization by implementing an adaptive thread parallelism strategy in the GNU OpenMP runtime system. The evaluation shows that, thanks to PYTHIA's predictions, the adaptive strategy reduces the execution time of an application by up to 38%.
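The record-then-predict idea can be illustrated with a small sketch. This is not PYTHIA's API or algorithm: the `TraceOracle` class, its `context` parameter, and the event names are assumptions chosen for illustration. The sketch indexes a recorded trace by short event sequences and predicts the event that most often followed the same sequence in the recorded run.

```python
from collections import Counter, defaultdict


class TraceOracle:
    """Toy oracle built from a recorded event trace (hypothetical design)."""

    def __init__(self, trace, context=2):
        self.context = context
        # Map each window of `context` events to a count of what came next.
        self.successors = defaultdict(Counter)
        for i in range(len(trace) - context):
            key = tuple(trace[i:i + context])
            self.successors[key][trace[i + context]] += 1

    def predict(self, recent):
        """Return the most likely next event, or None if the context is unknown."""
        key = tuple(recent[-self.context:])
        counts = self.successors.get(key)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


# Trace recorded during a previous run (event names are illustrative).
recorded = ["MPI_Isend", "MPI_Irecv", "MPI_Waitall", "MPI_Allreduce"] * 3
oracle = TraceOracle(recorded, context=2)

# During a later run, the runtime asks what is likely to come next.
print(oracle.predict(["MPI_Isend", "MPI_Irecv"]))
print(oracle.predict(["MPI_Waitall", "MPI_Allreduce"]))
```

Returning `None` for an unseen context lets the caller fall back to its usual heuristics when the current run diverges from the recorded one.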

New paper “Why Globally Re-shuffle? Revisiting Data Shuffling in Large Scale Deep Learning” to be presented at IPDPS’22.

Available online: https://hal.archives-ouvertes.fr/hal-03599740/document

Abstract

Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural Networks (DNN). SGD iterates over the input dataset in each training epoch, processing data samples in a random-access fashion. Because this puts enormous pressure on the I/O subsystem, the most common approach to distributed SGD in HPC environments is to replicate the entire dataset to node-local SSDs. However, due to rapidly growing dataset sizes, this approach has become increasingly infeasible. Surprisingly, the questions of why and to what extent random access is required have not received a lot of attention in the literature from an empirical standpoint.

In this paper, we revisit data shuffling in DL workloads to investigate the viability of partitioning the dataset among workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2,048 GPUs of ABCI and 4,096 compute nodes of Fugaku, we demonstrate that, in practice, the validation accuracy of global shuffling can be maintained when carefully tuning the partial distributed exchange. We provide a solution implemented in PyTorch that enables users to control the proposed data exchange scheme.
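The following single-process sketch shows what a partial exchange between partitions could look like. It is an illustration under assumptions, not the paper's PyTorch solution: the function `partial_exchange`, the `fraction` parameter, and the pooling strategy are hypothetical, and a real implementation would exchange samples between workers over the network.

```python
import random


def partial_exchange(partitions, fraction, rng):
    """Simulate one epoch of partial shuffling across worker partitions.

    Each worker contributes `fraction` of its local samples to a shared pool,
    receives the same number of samples back from the pool, and then shuffles
    its partition locally. Partition sizes are preserved.
    """
    k_per_worker = [int(len(part) * fraction) for part in partitions]
    pool, kept = [], []
    for part, k in zip(partitions, k_per_worker):
        local = part[:]          # copy, so the caller's lists are untouched
        rng.shuffle(local)
        pool.extend(local[:k])   # samples that leave this worker
        kept.append(local[k:])   # samples that stay local
    rng.shuffle(pool)

    new_partitions, start = [], 0
    for keep, k in zip(kept, k_per_worker):
        # A worker may receive some of its own samples back from the pool.
        merged = keep + pool[start:start + k]
        start += k
        rng.shuffle(merged)      # local shuffle before the epoch starts
        new_partitions.append(merged)
    return new_partitions


rng = random.Random(0)
partitions = [list(range(0, 8)), list(range(8, 16)), list(range(16, 24))]
after = partial_exchange(partitions, fraction=0.25, rng=rng)
for worker, part in enumerate(after):
    print(worker, part)
```

Setting `fraction` to 0 degenerates to purely local shuffling, while larger values move the scheme closer to a global shuffle at the cost of more data movement.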

New paper “J-NVM: Off-heap Persistent Objects in Java” to be presented at SOSP’21

New paper “J-NVM: Off-heap Persistent Objects in Java” to be presented at SOSP’21. Congrats to Anatole, Yohan, Kwabena, Pierre and Gaël!

New paper “Montsalvat: Intel SGX Shielding for GraalVM Native Images” to be presented at Middleware’21

New paper “Montsalvat: Intel SGX Shielding for GraalVM Native Images” to be presented at Middleware’21. Congrats to Gaël!

New paper “The Serverless Shell” to be presented at Middleware’21

New paper “The Serverless Shell” to be presented at Middleware’21. Congrats to Aurele and Pierre!

New paper “Highly-available and consistent group collaboration at the edge with Colony” to be presented at Middleware’21

New paper “Highly-available and consistent group collaboration at the edge with Colony” to be presented at Middleware’21. Congrats to Pierre!

New paper “Efficient Replication via Timestamp Stability” to be presented at Eurosys’21

New paper “Efficient Replication via Timestamp Stability” to be presented at Eurosys’21. Congrats to Pierre!

New paper “FaaSCache: an opportunistic free caching system for FaaS platforms” to be presented at Eurosys’21

New paper “FaaSCache: an opportunistic free caching system for FaaS platforms” to be presented at Eurosys’21. Congrats to Mathieu!

New paper “EZIOTracer: Unifying Kernel and User Space I/O Tracing for Data-Intensive Applications” to be presented at the CHEOPS workshop of Eurosys’21

New paper “EZIOTracer: Unifying Kernel and User Space I/O Tracing for Data-Intensive Applications” to be presented at the CHEOPS workshop of Eurosys’21. Congrats to Alexis C and François!

New paper “NVCache: A Plug-and-Play NVMM-based I/O booster for Legacy Systems” to be presented at DSN’21

New paper “NVCache: A Plug-and-Play NVMM-based I/O booster for Legacy Systems” to be presented at DSN’21. Congrats to Rémi and Gaël!

New paper “Transparent Overlapping of Blocking Communication in MPI Applications” to be presented at IEEE HPCC’20

New paper “Transparent Overlapping of Blocking Communication in MPI Applications” to be presented at IEEE HPCC’20. Congrats to Alexis, Elisabeth, François and Gaël!

New paper “Leaderless State-Machine Replication: Specification, Properties, Limits” to be presented at DISC’20

New paper “Leaderless State-Machine Replication: Specification, Properties, Limits” to be presented at DISC’20. Congrats to Pierre and Tuanir!

New paper “State-Machine Replication for Planet-Scale Systems” to be presented at Eurosys’20

New paper “State-Machine Replication for Planet-Scale Systems” to be presented at Eurosys’20. Congrats to Pierre and Tuanir!

New paper “Using differential execution analysis to identify thread interference”. To appear in IEEE Transactions on Parallel and Distributed Systems

Abstract: Understanding the performance of a multi-threaded application is difficult. The threads interfere when they access the same shared resource, which slows down their execution. Unfortunately, current profiling tools report the hardware components or the synchronization primitives that saturate, but they cannot tell if the saturation is the cause of a performance bottleneck. In this paper, we propose a holistic metric able to pinpoint the blocks of code that suffer interference the most, regardless of the interference cause. Our metric uses performance variation as a universal indicator of interference problems. With an evaluation of 27 applications, we show that our metric can identify interference problems caused by 6 different kinds of interference in 9 applications. We are able to easily remove 7 of the bottlenecks, which leads to a performance improvement of up to 9 times.

https://hal.archives-ouvertes.fr/hal-02179717v1
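As a rough illustration of using performance variation as an indicator, the sketch below ranks code blocks by the coefficient of variation of their measured execution times. This is a stand-in statistic chosen for simplicity, not the metric defined in the paper; the function name `rank_by_variation` and the block names are hypothetical.

```python
from statistics import mean, pstdev


def rank_by_variation(timings):
    """Rank code blocks by the coefficient of variation of their timings.

    `timings` maps a block name to a list of measured execution times.
    Blocks whose timings vary the most are listed first.
    """
    scores = {}
    for block, samples in timings.items():
        avg = mean(samples)
        # Normalize by the mean so short and long blocks are comparable.
        scores[block] = pstdev(samples) / avg if avg > 0 else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Execution times (ms) of two blocks across several executions.
timings = {
    "parse_input": [10.0, 10.0, 10.0],
    "update_shared_table": [10.0, 30.0, 20.0],
}
ranking = rank_by_variation(timings)
for block, score in ranking:
    print(f"{block}: {score:.2f}")
```

A block with stable timings scores near zero, so the blocks at the top of the ranking are the first places to look for interference.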

New paper “ScalOMP: analyzing the Scalability of OpenMP applications” to be presented at IWOMP’19

Anton Daumen will present his work “ScalOMP: analyzing the Scalability of OpenMP applications” at IWOMP’19.

His paper is available online: https://hal.archives-ouvertes.fr/hal-02179726

Abstract: Achieving good scalability from parallel codes is becoming increasingly difficult as hardware becomes more and more complex. Performance tools help developers, but their use is sometimes complicated and very iterative. In this paper, we propose a simple methodology for assessing the scalability and for detecting performance problems in an OpenMP application. This methodology is implemented in a performance analysis tool named ScalOMP that relies on the capabilities of OMPT for analyzing OpenMP applications. ScalOMP reports the code regions with scalability issues and suggests optimization strategies for those issues. The evaluation shows that ScalOMP incurs low overhead and that its suggestions lead to significant performance improvements in several OpenMP applications.
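One common way to spot regions with scalability issues is to compute their parallel efficiency, E = T1 / (p × Tp), from timings at different thread counts. The sketch below applies that standard formula; it does not reproduce ScalOMP's methodology, and the function names, the region names, and the 0.7 threshold are assumptions for illustration.

```python
def parallel_efficiency(t1, tp, threads):
    """Parallel efficiency: speedup (t1 / tp) divided by the thread count."""
    return t1 / (threads * tp)


def flag_regions(region_times, threads, threshold=0.7):
    """Return the regions whose efficiency at `threads` is below `threshold`.

    `region_times` maps a region name to {thread_count: time}, and must
    contain timings for 1 thread and for `threads` threads.
    """
    flagged = {}
    for region, times in region_times.items():
        eff = parallel_efficiency(times[1], times[threads], threads)
        if eff < threshold:
            flagged[region] = eff
    return flagged


# Time (s) spent in two parallel regions with 1 and 4 threads.
region_times = {
    "loop_a": {1: 8.0, 4: 2.0},  # scales perfectly
    "loop_b": {1: 8.0, 4: 4.0},  # only 2x faster on 4 threads
}
flagged = flag_regions(region_times, threads=4)
print(flagged)
```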