In this appendix, we first show that our statistical method can be applied to well-known benchmarking suites (Appendix A) before showing additional results from the main part of our experiment (Appendix B). We then present a curated selection of interesting run-sequence plots (Appendix C). The complete series of plots is available in a separate document.
The statistical method presented in Section 4 is not limited to data produced by Krun. To demonstrate this, we have applied it to two standard benchmark suites: the DaCapo suite for Java [Blackburn et al. 2006] and the Octane suite for JavaScript [Google 2012]. Octane was run on all three of our benchmarking machines, whereas DaCapo (due to time constraints) was run only on Linux4790. For both suites we used 30 process executions and 2000 in-process iterations.
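As an illustration of this structure, the following sketch (written in TypeScript for Node; the VM binary path, runner script, and per-line output format are hypothetical placeholders, not the harnesses used in our experiment) launches a fresh VM process for each process execution and collects the per-iteration timings each child prints:

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical placeholders: the VM binary, runner script, and argument
// format are illustrative, not the paths or interfaces actually used.
const VM_BINARY = "/path/to/vm";
const RUNNER_SCRIPT = "/path/to/runner.js";
const PROCESS_EXECUTIONS = 30;
const IN_PROCESS_ITERATIONS = 2000;

const allTimings: number[][] = [];

for (let pexec = 0; pexec < PROCESS_EXECUTIONS; pexec++) {
  // Each process execution starts a brand-new VM process, so no JIT state,
  // heap layout, or caches carry over from one execution to the next.
  const result = spawnSync(
    VM_BINARY,
    [RUNNER_SCRIPT, String(IN_PROCESS_ITERATIONS)],
    { encoding: "utf8" },
  );
  if (result.status !== 0) {
    throw new Error(`process execution ${pexec} failed: ${result.stderr}`);
  }
  // The child is assumed to print one in-process iteration time per line.
  allTimings.push(result.stdout.trim().split("\n").map(Number));
}

console.log(JSON.stringify(allTimings));
```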
We ran DaCapo (with its default benchmark size) on Graal and HotSpot. As it
already has support for altering the number of in-process iterations, we used it
without modification. However, we were unable to run 3 of its 14 benchmarks.
We ran Octane on the same version of V8 used in the main experiment, and
(on Linux only, due to benchmark failures on OpenBSD) on SpiderMonkey
(#1196bf3032e1, a JIT-compiling VM for JavaScript). We replaced its complex
runner (which reported timings with a non-monotonic microsecond timer) with a
simpler alternative (using a monotonic millisecond timer). We also had to decide on
an acceptable notion of ‘iteration’. Many of Octane’s benchmarks consist of a
relatively quick ‘inner benchmark’; an ‘outer benchmark’ specifies how many times
the inner benchmark should be run in order to make an adequately long in-process
iteration. We recorded 2000 in-process iterations of the outer benchmark; our runner
fully resets the benchmark and the random number generator between each
iteration.
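A minimal sketch of such a runner is shown below (in TypeScript for concreteness), assuming a performance.now()-style monotonic millisecond timer and a hypothetical Benchmark interface with setup/run/tearDown hooks; Octane's real benchmarks are wired in through their own harness, which this stub only approximates:

```typescript
// Hypothetical interface standing in for an Octane 'outer benchmark'; the
// setup/run/tearDown names are illustrative, not Octane's actual API.
interface Benchmark {
  setup(): void;     // rebuild all benchmark state (including reseeding its RNG)
  run(): void;       // one outer-benchmark repetition (many quick inner runs)
  tearDown(): void;  // discard state so nothing leaks into the next iteration
}

// Trivial placeholder so the sketch runs; a real runner would load one of
// Octane's benchmark files here instead.
const benchmark: Benchmark = {
  setup() { /* allocate data structures, reset the pseudo-random number generator */ },
  run() { let x = 0; for (let i = 0; i < 1_000_000; i++) { x += i; } },
  tearDown() { /* drop references to benchmark state */ },
};

const IN_PROCESS_ITERATIONS = 2000;
const timings: number[] = [];

for (let i = 0; i < IN_PROCESS_ITERATIONS; i++) {
  // Fully reset the benchmark (and its RNG) so every in-process iteration
  // starts from the same deterministic state.
  benchmark.setup();

  // performance.now() is monotonic, unlike the non-monotonic microsecond
  // timer used by Octane's original runner.
  const start = performance.now();
  benchmark.run();
  const elapsed = performance.now() - start; // milliseconds

  benchmark.tearDown();
  timings.push(elapsed);
}

// One timing per line, ready for the statistical classifier.
console.log(timings.join("\n"));
```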
Table 4 shows the full DaCapo results. Because we had to run a subset of benchmarks on Graal, a comparison with HotSpot is difficult, though for the benchmarks both VMs could run there is a reasonable degree of similarity. However, even on this most carefully designed of benchmark suites, only 42% of
Tables 5–7 show the full Octane results. These show a greater spread of
classifications than the DaCapo results, with 33% of
As these results show, our automated statistical method produces satisfying results even on existing benchmark suites that have not been subject to the Krun treatment. Both DaCapo and (for the most part) Octane use much larger benchmarks than those in our main experiment. We have no realistic way of knowing to what extent this makes ‘good’ warmup more or less likely. For example, it is likely that there is CFG non-determinism in many of these benchmarks; however, their larger code-bases may give VMs the ability to ‘spread out’ VM costs, making smaller blips less noticeable.
The main experiment’s results for Linux4790 and OpenBSD4790 can be seen in Tables 8 and 9.
The remainder of this appendix shows curated plots: we have selected 4 interesting plots from each classification, to give readers a sense of the range of data obtained from our experiment. A separate document contains the complete series of plots.