Repeatability, Reproducibility and Rigor

Computer systems research spans sub-disciplines that include embedded systems, programming languages, networking, and operating systems. My contention in this talk is that a number of structural factors inhibit quality systems research. Symptoms of the problem include unrepeatable and unreproduced results, as well as results that are devoid of meaning or that measure the wrong thing. I will illustrate the impact of these issues on our research output with examples from my research on real-time garbage collection. I will argue that our field should foster repetition of results, independent reproduction, and rigorous evaluation. I will conclude with some progress made within SIGPLAN through its Artefact Evaluation Process.

[slides]

Back to #bench16

Why Aren't We Benchmarking Bioinformatics?

Bioinformatics (the computational manipulation, analysis and modelling of biological and biochemical data) is a substantial area of primary research activity as well as a growing market worth, by some estimates, up to US$13bn by 2020. In particular, exponential growth in the capacity and speed of DNA sequencers has produced a flood of data, often termed the "DNA deluge", and with it a critical shortfall in analytical tools. Despite this critical need and substantial and growing investment in bioinformatics software, many best practices in software design are neglected, ignored, or even unacknowledged. I will illustrate the problem with an overview of two case studies from current bioinformatics projects, and present data showing worrying reproducibility deficits in commonly used software.

[slides]


Introduction to Time Series Data and Analysis

When collecting data used for benchmarking, can we assume that each observation is unrelated to the others? This is the independence assumption often made in classical statistics. In practice, however, the collected data typically has some ordering, and we cannot make this assumption. For example, the execution times of consecutive iterations of a micro-benchmark constitute a time series.

The aim of this talk is to provide an introduction to the analysis of time series. Some of the questions discussed are: When is a data set a time series? Why shouldn't you use classical statistics to analyse a time series? How may a data set vary over time? What tools and models are available to identify and describe the dependence structure?
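To make the independence question concrete, here is a minimal sketch (my own illustration, not from the talk): a micro-benchmark whose timings drift slowly over iterations shows strong lag-1 autocorrelation, so treating the observations as independent samples would be wrong.

```python
# Synthetic "execution times" with a slow linear drift plus noise, as might be
# produced by gradual cache or heap effects across benchmark iterations.
import numpy as np

rng = np.random.default_rng(0)

n = 200
times = 1.0 + 0.002 * np.arange(n) + rng.normal(0, 0.01, n)

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1: correlation of x[t] with x[t+1]."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    return float(np.dot(xc[:-1], xc[1:]) / np.dot(xc, xc))

r1 = lag1_autocorr(times)
print(f"lag-1 autocorrelation: {r1:.2f}")  # far from 0 => not independent
```

A white-noise series of the same length would show autocorrelation near zero, which is what the independence assumption requires.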

[slides]


Evaluating Performance using Ratio of Execution Times

Execution time is the main ends-based metric used in the performance evaluation of programming language implementations and computer systems. Many studies compare the performance of two systems using a ratio of execution times (e.g. percentage performance improvement, parallel speedup, or relative overhead). It is well known that execution time on current computer systems is prone to various sources of non-determinism, and hence so is the ratio; yet execution time ratios are almost never presented with confidence intervals, and when they are, the method by which the intervals were obtained is undocumented. In my talk, I will report on current practice and offer my recommendations on how to evaluate execution time ratios and, specifically, how to compute confidence intervals.
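One plausible way to attach such an interval, shown here purely as an illustrative sketch and not necessarily the method the talk recommends, is a percentile bootstrap over the ratio of sample means; the timing data below is synthetic.

```python
# Percentile-bootstrap confidence interval for a ratio of mean execution times.
import numpy as np

rng = np.random.default_rng(1)

# Illustrative timing samples (seconds) for an old and a new system.
old = rng.normal(10.0, 0.5, 30)
new = rng.normal(9.0, 0.5, 30)

def bootstrap_ratio_ci(a, b, n_boot=5_000, alpha=0.05, rng=rng):
    """95% percentile-bootstrap CI for mean(a) / mean(b)."""
    ratios = np.empty(n_boot)
    for i in range(n_boot):
        ra = rng.choice(a, size=len(a), replace=True)
        rb = rng.choice(b, size=len(b), replace=True)
        ratios[i] = ra.mean() / rb.mean()
    lo, hi = np.quantile(ratios, [alpha / 2, 1 - alpha / 2])
    return lo, hi

lo, hi = bootstrap_ratio_ci(old, new)
print(f"speedup: {old.mean() / new.mean():.2f}x, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The appeal of the bootstrap here is that it makes no normality assumption about the ratio itself, which is generally not normally distributed even when the individual timings are.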

[slides]


The role of benchmarking in buying a High-Performance Computer

The speaker led the University of Bath's recent procurement of a £1.2M High Performance Computer system. Here he will talk about the role of benchmarking in that purchase: how internal requirements were defined, what was specified, what the vendor responses were, and what happened in reality.

[slides]


What Exactly do we Mean by VM Warmup?

Virtual Machines (VMs) with Just-In-Time (JIT) compilers are traditionally thought to execute programs in two phases: first the warmup phase determines which parts of a program would most benefit from dynamic compilation; after compilation has occurred the program is said to be at peak performance. When measuring the performance of JIT compiling VMs, data collected during the warmup phase is generally discarded, placing the focus on peak performance.
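The traditional methodology the talk questions can be sketched as follows (illustrative only; the benchmark body and the warmup cutoff are arbitrary assumptions of mine): time many in-process iterations and discard a fixed warmup prefix before reporting "peak" performance.

```python
# Traditional warmup-discarding benchmark harness (sketch).
import time

def bench():
    # Stand-in workload; in practice this would be the benchmark under study.
    return sum(i * i for i in range(10_000))

N, WARMUP = 50, 10
timings = []
for _ in range(N):
    t0 = time.perf_counter()
    bench()
    timings.append(time.perf_counter() - t0)

# Traditional model: assume iterations after the warmup cutoff are "peak".
peak = timings[WARMUP:]
print(f"{len(peak)} post-warmup samples, mean {sum(peak) / len(peak):.6f}s")
```

The talk's point is that this recipe silently assumes the two-phase model holds, i.e. that everything after the cutoff really is stable peak performance.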

In this talk we present a rigorous experiment which aims to faithfully measure the warmup behaviour of modern JITted language implementations. Although we initially set out to measure how long VMs take to achieve peak performance, our results surprised us, showing that the traditional model of warmup rarely holds. In turn, our results cast doubt upon long-standing benchmarking methodologies, since existing techniques assume the traditional model of warmup.

[slides]


Scalability in compiler development: how to get testing and optimization done in a reasonable time

Modern compilers such as GCC and LLVM have hundreds of optimizations, all of which interact. GCC, for example, has over 220 distinct passes. Just testing all combinations of turning passes on and off would require 2^220, roughly 1.7 x 10^66, executions of the compiler. And that is without considering that some passes can run multiple times, and almost all are parameterizable.
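The arithmetic behind that claim is easy to check (a quick sketch; the compilation rate below is an invented, deliberately optimistic assumption):

```python
# Exhaustively toggling ~220 independent passes needs 2**220 compiler runs.
passes = 220
runs = 2 ** passes
print(f"2^{passes} is about {runs:.2e} runs")

# Even at an absurdly optimistic billion compilations per second:
seconds_per_year = 60 * 60 * 24 * 365
years = runs / 1e9 / seconds_per_year
print(f"about {years:.1e} years")
```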

This isn't just a matter of testing the compiler; exploring the optimization space is central to iterative compilation, and its derivative, machine learning optimization.

In this talk we present the technique of fractional factorial design (FFD) as a way to choose a much smaller sample of optimizations to consider.
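As a minimal sketch of the idea (illustrative only, not MAGEEC's actual setup), a 2^(4-1) half fraction uses a defining relation such as D = ABC to cut a four-factor full factorial from 16 runs to 8 while keeping main effects estimable:

```python
# Build a 2^(4-1) fractional factorial design: factors A, B, C vary freely,
# and factor D is confounded with the three-way interaction ABC.
import itertools

runs = []
for a, b, c in itertools.product((-1, 1), repeat=3):
    d = a * b * c  # generator: D = ABC
    runs.append((a, b, c, d))

for run in runs:
    print(run)
print(f"{len(runs)} runs instead of 2**4 = 16")
```

The price of the reduction is aliasing: the main effect of D cannot be distinguished from the ABC interaction, which FFD accepts on the assumption that high-order interactions are negligible.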

While FFD can reduce the problem of testing from billions of years to months, even that is not really practical for most applications. We'll present the related technique of Plackett-Burman design as a way of identifying the subset of optimizations that are most significant. This allows us to use FFD to identify a representative set of tests that can be run in hours.
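A Plackett-Burman design can be built by cyclically shifting a standard generator row; the sketch below (again illustrative, not the project's code) constructs the classic 12-run design, which screens the main effects of up to 11 on/off flags in just 12 compiler runs.

```python
# Construct the 12-run Plackett-Burman screening design.
import numpy as np

# Standard PB generator row for N = 12 (+1 = flag on, -1 = flag off).
gen = np.array([+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1])

# Rows 1..11 are cyclic shifts of the generator; row 12 is all -1.
design = np.array([np.roll(gen, i) for i in range(11)] + [-np.ones(11, int)])

print(design.shape)  # (12, 11): 12 compiler runs, 11 flags each

# Columns of a PB design are mutually orthogonal (design.T @ design = 12*I),
# which is what lets each flag's main effect be estimated independently.
gram = design.T @ design
```

Each column is balanced (six runs with the flag on, six with it off), so every main-effect estimate averages over the same mix of settings of the other flags.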

These theoretical techniques will be presented in the context of the MAGEEC project, which uses machine learning to identify optimal compiler optimizations to minimize energy usage by generated code. We'll look at the practical challenges to turning theory into practice.

In the second part of the talk, we'll look at how superoptimization faces the same problem of combinatorial explosion. While this approach can generate truly optimal code, it has only recently proved possible to scale beyond a handful of instructions. We'll look at how this has been achieved, in the context of the TSERO project, which is considering optimization techniques that can reduce energy consumption in high performance computing.
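To illustrate what superoptimization means and why it explodes combinatorially, here is a toy brute-force superoptimizer over an invented three-instruction ISA (purely my own sketch, unrelated to TSERO): the search space grows as 3^L in the program length L, which is why scaling beyond a handful of instructions is hard.

```python
# Toy superoptimizer: exhaustively search for the shortest instruction
# sequence over a tiny single-accumulator ISA that matches a target function.
import itertools

ISA = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: x * 2,
    "neg": lambda x: -x,
}

def run(program, x):
    for op in program:
        x = ISA[op](x)
    return x

def superoptimize(target, tests, max_len=4):
    """Return the shortest program agreeing with `target` on all `tests`."""
    for length in range(1, max_len + 1):
        # 3**length candidate programs at each length.
        for program in itertools.product(ISA, repeat=length):
            if all(run(program, x) == target(x) for x in tests):
                return list(program)
    return None

# Find the shortest program computing 2x + 2.
prog = superoptimize(lambda x: 2 * x + 2, tests=range(-5, 6))
print(prog)  # ['inc', 'dbl'] : (x + 1) * 2 == 2x + 2
```

Because every candidate up to the winning length is tried, the result is truly optimal for this ISA, but the same exhaustiveness is what makes the approach so expensive at realistic instruction counts.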

[slides]
