Roofline: An Insightful Visual Performance Model for Multicore Architectures

Reading group: Jules Risse presented "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (CACM'09) in 4A312 on 3/5/2024 at 10:00.

Abstract

Conventional wisdom in computer architecture produced similar designs. Nearly every desktop and server computer uses caches, pipelining, superscalar instruction issue, and out-of-order execution. Although the instruction sets varied, the microprocessors were all from the same school of design. The relatively recent switch to multicore means that microprocessors will become more diverse, since no conventional wisdom has yet emerged concerning their design. For example, some offer many simple processors vs. fewer complex processors, some depend on multithreading, and some even replace caches with explicitly addressed local stores. Manufacturers will likely offer multiple products with differing numbers of cores to cover multiple price-performance points, since Moore’s Law will permit the doubling of the number of cores per chip every two years [4]. While diversity may be understandable in this time of uncertainty, it exacerbates the already difficult jobs of programmers, compiler writers, and even architects. Hence, an easy-to-understand model that offers performance guidelines would be especially valuable. Such a model need not be perfect, just insightful. The 3Cs (compulsory, capacity, and conflict misses) model for caches is an analogy. It is not perfect, as it ignores potentially important factors like block size, block-allocation policy, and block-replacement policy. It also has quirks; for example, a miss might be labeled “capacity” in one design and “conflict” in another cache of the same size. Yet the 3Cs model has been popular for nearly 20 years precisely because it offers insight into the behavior of programs, helping programmers, compiler writers, and architects improve their respective designs. Here, we propose one such model we call Roofline, demonstrating it on four diverse multicore computers using four key floating-point kernels.

