Most of the time, when we talk about reproducibility in the computational sciences, we mean the reproducibility of numerical results. We expect computational experiments to produce identical results for the same input data across different execution environments.
But that is not always the case: quite often, the goal of a research endeavor is to design a faster algorithm. The result of the experiment is then performance data demonstrating a speedup over existing algorithms for solving the same problem. A speedup, just like a numerical result, should be reproducible in different execution environments.
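As a minimal sketch of what "the result is a speedup" means in practice, the snippet below times two implementations over repeated runs and reports the speedup as a ratio of median wall-clock times. The two algorithms here are hypothetical stand-ins, and the repeated-runs-plus-median approach is a common convention rather than the specific protocol from the paper discussed below.

```python
import time
import statistics

def measure(func, runs=30):
    """Time func over several runs; return the list of wall-clock times."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return times

def old_algorithm():
    # Naive summation of squares -- the baseline being improved upon.
    sum(i * i for i in range(200_000))

def new_algorithm():
    # Closed-form sum of squares, a stand-in for the "faster algorithm".
    n = 200_000 - 1
    n * (n + 1) * (2 * n + 1) // 6

old_times = measure(old_algorithm)
new_times = measure(new_algorithm)

# Report the speedup as a ratio of medians, which is less sensitive
# to outlier runs than a ratio of single measurements.
speedup = statistics.median(old_times) / statistics.median(new_times)
print(f"median speedup: {speedup:.1f}x")
```

A single pair of timings would also yield "a speedup", but as the quoted paper argues, one measurement tells you nothing about whether anyone else will observe it.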
Sid-Ahmed-Ali Touati, Julien Worms, and Sebastien Briais of INRIA published an excellent paper on the methodology of reproducible speedup tests.
Part of the introduction to their paper is worth quoting on this blog:
Known hints for making a research result non-reproducible
Hard natural sciences such as physics, chemistry and biology impose strict experimental methodologies and rigorous statistical measures in order to guarantee the reproducibility of the results with a measured confidence (probability of error/success). The reproducibility of the experimental results in our community of program optimisation is a weak point. Given a research article, it is in practice impossible or too difficult to reproduce the published performance. If the results are not reproducible, the benefit of publishing becomes limited. We note below some hints that make a research article non-reproducible:
- Not using a precise scientific language such as mathematics. Ideally, mathematics should always be preferred for describing ideas, if possible at an accessible level of difficulty.
- Not making software available, not releasing it, and not communicating precise data.
- Not providing formal algorithms or protocols, which makes it impossible to reproduce the ideas exactly.
- Hiding many experimental details.
- Using deprecated machines, deprecated operating systems, exotic environments, etc.
- Applying the wrong statistics to the collected data.
Part of the non-reproducibility (though not all) of the published experiments is explained by the fact that the observed speedups are sometimes rare events. That is, they are far from what we would observe if we redid the experiments multiple times. Even in the ideal situation where we use exactly the original experimental machines and software, it is sometimes difficult to reproduce exactly the same performance numbers again and again, experiment after experiment. Since some published performance numbers represent exceptional events, we believe that if a computer scientist succeeds in reproducing the performance numbers of his colleagues (within a reasonable error ratio), it would amount to what rigorous probabilists and statisticians call a surprise. We argue that it is better to have a lower speedup that can be reproduced in practice than a rare speedup that is only observed by accident.
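The "rare event" point can be illustrated with a small simulation. The timing numbers below are entirely hypothetical (Gaussian noise standing in for OS jitter, cache effects, and so on), and the comparison of a best-case ratio against a ratio of medians is my illustration of the idea, not the statistical procedure from the paper:

```python
import random
import statistics

random.seed(42)

# Simulated wall-clock times (seconds) for a baseline program and an
# "optimised" version; the noise models run-to-run variability.
# These numbers are hypothetical, purely for illustration.
baseline = [1.00 + random.gauss(0, 0.05) for _ in range(30)]
optimised = [0.80 + random.gauss(0, 0.05) for _ in range(30)]

# Best-case speedup: pairs the slowest baseline run with the fastest
# optimised run -- a rare event that a colleague is unlikely to reproduce.
best_case = max(baseline) / min(optimised)

# Robust speedup: ratio of medians over all runs.
robust = statistics.median(baseline) / statistics.median(optimised)

print(f"best-case speedup: {best_case:.2f}x")  # flattering, but rare
print(f"median speedup:    {robust:.2f}x")     # what others will likely see
```

Reporting only the most flattering run is exactly the kind of "rare speedup" the authors warn against: it is a true observation, but not one that repeated experiments will confirm.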
Read the full paper for a thorough explanation of how to avoid non-reproducible and erroneous speedup tests by using proper statistical techniques.