So far, there are two main approaches that we've been using:
Unit tests at the level of the language
Benchmarks
The former are invariant to whether optimizations are applied during compilation, since the optimizations are semantics-preserving and don't change the results of execution. And while benchmarks may show a performance change, they don't tell us which optimizations were applied, aren't deterministic, and may even show a performance improvement if the code happens to be wrong (in which case the unit tests should fail).
For instance, if we end up rewriting a complicated expression to a single value like #t, then the benchmarks would report an amazing speedup, and we'd need to run the tests to know that something has gone wrong. On the other hand, the tests would continue to pass even when an expected optimization is not applied, and only the benchmarks would give us a clue that maybe it's not working as we hoped. But in that case, did the optimization simply not produce the speedup we expected, or was it not applied at all?
One option that we've considered is to implement unit tests for individual compiler rewrite rules (which are transformations of syntax); a sketch of what such a test might look like follows the list below. This is deterministic and validates that individual compiler rules:
would match the expected pattern in the input syntax
would produce the expected optimized code
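To make this concrete, here is a minimal sketch of such a test. The rule and its name (rewrite-if-true) are purely illustrative and not our actual compiler internals; a real rule would operate on already-expanded core forms.

#lang racket
(require rackunit)

;; Illustrative rewrite rule: (if #t then-e else-e) => then-e,
;; written as a plain syntax-to-syntax function.
(define (rewrite-if-true stx)
  (syntax-case stx (if)
    [(if #t then-e else-e) #'then-e]
    [_ stx]))

;; The unit test checks that the rule matches the expected pattern
;; and produces the expected optimized code...
(check-equal? (syntax->datum (rewrite-if-true #'(if #t (f x) (g x))))
              '(f x))
;; ...and that non-matching syntax is left alone.
(check-equal? (syntax->datum (rewrite-if-true #'(if c (f x) (g x))))
              '(if c (f x) (g x)))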
But there are a few concerns with this approach:
The tests are very specific to the current implementation of these optimizations. If we change the internal implementations or even the names of functions used, the tests would need to be updated (there would be "churn").
The tests assume that the code has already been expanded in a very specific way. If the result of expansion were ever to change (for instance, upon upgrading a library used in the expander, such as syntax-spec), then the tests would continue to pass but the optimization would no longer actually be applied.
It seems as though we should be testing many permutations of the input syntax pattern, to validate that the pattern matches in each case and that the intended subexpression is rewritten. This seems like it could get out of hand quickly.
We're not sure. Is this OK? Can we do better?
If you have worked with compilers before, or work on an existing compiler project such as Racket/Chez or LLVM, how do you do it?
Racket relies mostly on the language-level unit-test approach (especially when considering building all of Racket as one of the tests). Additionally, Racket has a bunch of tests that check that two expressions optimize to the same thing (e.g., (+ 2 2) and 4). This works well for some kinds of optimizations but not for others (like register allocation).
(test-comp '(lambda (x y) (and x (car y) (number? y)))
           '(lambda (x y) (and x (car y) #f)))
test-comp expands both expressions and then compares the expansions. This is nice because you don't need to care about the details of the expansion of and, or about any interesting reductions: the compiler notices that (number? y) is always #f, because (car y) must raise an error if y is not a pair...
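For context, here is a rough approximation of the idea behind a test-comp-style helper; it is not the helper from Racket's test suite, just a sketch, and comparing marshaled bytecode byte-for-byte may be stricter than what the real helper does.

#lang racket
(require rackunit)

;; Sketch of a test-comp-like helper: compile both expressions under
;; identical conditions and compare the marshaled bytecode. Wrapping
;; the expression in a module lets the optimizer treat racket/base
;; bindings as known primitives.
(define (compile->bytes e)
  (parameterize ([current-namespace (make-base-namespace)])
    (define out (open-output-bytes))
    (write (compile `(module m racket/base ,e)) out)
    (get-output-bytes out)))

(define (test-comp* e1 e2)
  (check-equal? (compile->bytes e1) (compile->bytes e2)))

;; Example: the optimizer should fold (+ 2 2) to 4.
(test-comp* '(+ 2 2) 4)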
The Chez Scheme team uses a similar version, but it only expands the first expression, so it's easier to break a test with an unrelated change, and it's harder to write the tests because you must think about all the chained reductions instead of just the single one you care about in the test.
Perhaps you can copy the code of test-comp, but it's somewhat unstable. Future versions of Racket may have slightly different quirks in the expansions and break your tests. It's not as bad as it sounds: IIRC I fixed a problem with flvectors last year, and the previous change was made by @mflatt a few years ago. Also, when the primitive random was shadowed by an extended version, it broke a few tests, but again, that was many years ago. So copying test-comp is a good option if you can tolerate fixing it every few years.
About benchmarks: I think they are too slow. To get a reliable time you need something like 0.1 seconds per test, and they accumulate rapidly. In Racket some tests are written by hand, but others are a loop over a list of many functions with similar properties; for example, testing that all the nice predicates disappear in (lambda (x) (if (pred? x) 1 1)).
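Such a loop might look roughly like the following, reusing the test-comp*-style helper sketched above. Whether the check passes as written depends on the Racket version and backend, so treat it as an illustration of the pattern rather than a test that necessarily holds verbatim.

;; Check that applications of pure predicates are dropped when their
;; result doesn't matter. The expected right-hand side depends on the
;; optimizer; this mirrors the shape of the hand-written loops in
;; Racket's own suite.
(for ([pred (in-list '(null? pair? symbol? number? string? vector?))])
  (test-comp* `(lambda (x) (if (,pred x) 1 1))
              '(lambda (x) 1)))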
That is a very clever approach! I think we will adopt your suggestion of using test-comp or something like it. I will need to think more about the details you've shared here, and thank you for pointing out the tests in the Racket source as well.
The Typed Racket optimizer takes a different approach, which (IME) is a bit more robust to unrelated compiler changes.
Every time it applies an optimization, the optimizer emits a log message describing the optimization it just applied. The tests then compare the logs produced when compiling test programs to an expected log that includes all the optimizations that should have happened.
Details here:
This works best with optimizations that are distinct transformations, rather than more global things like register allocation, as Sam said.
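As a general illustration of the technique (not Typed Racket's actual harness, whose logger names and message formats differ), an optimizer can report each rewrite on a dedicated logger, and the tests can intercept those messages and compare them to the expected sequence:

#lang racket
(require racket/logging rackunit)

;; Hypothetical logger that an optimizer would use to report rewrites.
(define-logger opt)

;; Stand-in for an optimization pass: it logs each rewrite it applies.
(define (fake-optimize)
  (log-opt-debug "dead-code elimination: (if #t 1 2) -> 1")
  (log-opt-debug "constant folding: (+ 2 2) -> 4"))

;; Intercept debug-level messages on the 'opt topic while the pass runs.
(define captured '())
(with-intercepted-logging
  (lambda (v) (set! captured (cons (vector-ref v 1) captured)))
  fake-optimize
  'debug 'opt)

;; Compare against the expected log, stripping the topic prefix that
;; the logging system may add to each message.
(check-equal? (map (lambda (m) (regexp-replace #rx"^opt: " m "")) (reverse captured))
              '("dead-code elimination: (if #t 1 2) -> 1"
                "constant folding: (+ 2 2) -> 4"))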
Credit: Eli Barzilay initially suggested that approach.
Using logs to build a data structure representing the sequence of optimizations is an unusual and elegant idea that could be broadly applied. Does the logging cause any increase in compile time?