In this appendix, we first show that our statistical method can be applied to well-known benchmarking suites (Appendix A) before showing additional results from the main part of our experiment (Appendix B). We then present a curated selection of interesting run-sequence plots (Appendix C). The complete series of plots is available in a separate document.
The statistical method presented in Section 4 is not limited to data produced by Krun. To demonstrate this, we have applied it to two standard benchmark suites: the DaCapo suite for Java [Blackburn et al. 2006] and the Octane suite for JavaScript [Google 2012]. Octane was run on all three of our benchmarking machines, whereas DaCapo (due to time constraints) was run only on Linux4790. For both suites we used 30 process executions and 2000 in-process iterations.
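As an illustration of this structure, the following sketch (written in TypeScript for Node; the VM binary path, runner script, and per-line output format are hypothetical placeholders, not the harnesses used in our experiment) launches a fresh VM process for each process execution and collects the per-iteration timings each child prints:

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical placeholders: the VM binary, runner script, and argument
// format are illustrative, not the paths or interfaces actually used.
const VM_BINARY = "/path/to/vm";
const RUNNER_SCRIPT = "/path/to/runner.js";
const PROCESS_EXECUTIONS = 30;
const IN_PROCESS_ITERATIONS = 2000;

const allTimings: number[][] = [];

for (let pexec = 0; pexec < PROCESS_EXECUTIONS; pexec++) {
  // Each process execution starts a brand-new VM process, so no JIT state,
  // heap layout, or caches carry over from one execution to the next.
  const result = spawnSync(
    VM_BINARY,
    [RUNNER_SCRIPT, String(IN_PROCESS_ITERATIONS)],
    { encoding: "utf8" },
  );
  if (result.status !== 0) {
    throw new Error(`process execution ${pexec} failed: ${result.stderr}`);
  }
  // The child is assumed to print one in-process iteration time per line.
  allTimings.push(result.stdout.trim().split("\n").map(Number));
}

console.log(JSON.stringify(allTimings));
```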
We ran DaCapo (with its default benchmark size) on Graal and HotSpot. As it
already has support for altering the number of in-process iterations, we used it
without modification. However, we were unable to run 3 of its 14 benchmarks.
We ran Octane on the same version of V8 used in the main experiment, and
(on Linux only, due to benchmark failures on OpenBSD) on SpiderMonkey
(#1196bf3032e1, a JIT-compiling VM for JavaScript). We replaced its complex
runner (which reported timings with a non-monotonic microsecond timer) with a
simpler alternative (using a monotonic millisecond timer). We also had to decide on
an acceptable notion of ‘iteration’. Many of Octane’s benchmarks consist of a
relatively quick ‘inner benchmark’; an ‘outer benchmark’ specifies how many times
the inner benchmark should be run in order to make an adequately long in-process
iteration. We recorded 2000 in-process iterations of the outer benchmark; our runner
fully resets the benchmark and the random number generator between each
iteration.
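A minimal sketch of such a runner is shown below (in TypeScript for concreteness), assuming a performance.now()-style monotonic millisecond timer and a hypothetical Benchmark interface with setup/run/tearDown hooks; Octane's real benchmarks are wired in through their own harness, which this stub only approximates:

```typescript
// Hypothetical interface standing in for an Octane 'outer benchmark'; the
// setup/run/tearDown names are illustrative, not Octane's actual API.
interface Benchmark {
  setup(): void;     // rebuild all benchmark state (including reseeding its RNG)
  run(): void;       // one outer-benchmark repetition (many quick inner runs)
  tearDown(): void;  // discard state so nothing leaks into the next iteration
}

// Trivial placeholder so the sketch runs; a real runner would load one of
// Octane's benchmark files here instead.
const benchmark: Benchmark = {
  setup() { /* allocate data structures, reset the pseudo-random number generator */ },
  run() { let x = 0; for (let i = 0; i < 1_000_000; i++) { x += i; } },
  tearDown() { /* drop references to benchmark state */ },
};

const IN_PROCESS_ITERATIONS = 2000;
const timings: number[] = [];

for (let i = 0; i < IN_PROCESS_ITERATIONS; i++) {
  // Fully reset the benchmark (and its RNG) so every in-process iteration
  // starts from the same deterministic state.
  benchmark.setup();

  // performance.now() is monotonic, unlike the non-monotonic microsecond
  // timer used by Octane's original runner.
  const start = performance.now();
  benchmark.run();
  const elapsed = performance.now() - start; // milliseconds

  benchmark.tearDown();
  timings.push(elapsed);
}

// One timing per line, ready for the statistical classifier.
console.log(timings.join("\n"));
```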
Table 4 shows the full DaCapo results. Because we had to run a subset of benchmarks on Graal, a comparison with HotSpot is difficult, though for the benchmarks both VMs could run there is a reasonable degree of similarity. However, even on this most carefully designed of benchmark suites, only 42% of
Tables 5–7 show the full Octane results. These show a greater spread of
classifications than the DaCapo results, with 33% of
As these results show, our automated statistical method produces satisfying results even on existing benchmark suites that have not been subject to the Krun treatment. Both DaCapo and (for the most part) Octane use much larger benchmarks than those in our main experiment. We have no realistic way of knowing to what extent this makes ‘good’ warmup more or less likely. For example, it is likely that there is CFG non-determinism in many of these benchmarks; however, their larger code-bases may give VMs the ability to ‘spread out’ VM costs, making smaller blips less noticeable.
The main experiment’s results for Linux4790 and OpenBSD4790 can be seen in Tables 8 and 9.
The remainder of this appendix shows curated plots: we have selected 4 interesting plots from each classification, to give readers a sense of the range of data obtained from our experiment. A separate document contains the complete series of plots.