Thanks for those observations, Kristian!
How's this theory:
The embedded variant of the compat test doesn't wait for the forked process to shut down before it moves on to the next version. It does wait until it sees the process print "OK", but that doesn't mean every database thread has stopped. If a checkpoint happens to be running, the forked process could still be deleting the stub conglomerates when we invoke removeDirectory().
If such a stub is deleted after removeDirectory() calls File.list() but before it calls delete() on that file, the delete() call fails (since the file is already gone) and the file is added to the list of failed deletes. However, removeDirectory() will still try to delete the parent directory, and that succeeds since there aren't any files left in it.
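To make that interleaving concrete, here is a minimal sketch that forces it deterministically. It is not the real BaseTestCase.removeDirectory(); the class DeleteRaceSketch, the method removeDir() and the file name stub.dat are made up for illustration, and a Runnable stands in for the forked process acting between File.list() and delete().

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

public class DeleteRaceSketch {
    /**
     * Simplified, non-recursive stand-in for a "keep going" directory delete.
     * The Runnable is a hypothetical hook that simulates another process
     * acting between the directory listing and the deletes.
     */
    static List<String> removeDir(File dir, Runnable betweenListAndDelete) {
        List<String> failed = new ArrayList<>();
        String[] names = dir.list();        // snapshot of the directory contents
        betweenListAndDelete.run();         // the window in which the race happens
        if (names != null) {
            for (String name : names) {
                File f = new File(dir, name);
                if (!f.delete()) {          // false when the file is already gone
                    failed.add(f.getPath());
                }
            }
        }
        if (!dir.delete()) {                // succeeds once the directory is empty
            failed.add(dir.getPath());
        }
        return failed;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("race").toFile();
        File stub = new File(dir, "stub.dat");
        stub.createNewFile();

        // The "other process" deletes the stub after list() has seen it.
        List<String> failed = removeDir(dir, () -> stub.delete());

        System.out.println("failed deletes: " + failed.size());          // 1
        System.out.println("directory still exists: " + dir.exists());   // false
    }
}
```

The run ends with one recorded failure even though the directory is gone, which matches the symptom described above.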
The code that deleted the directory before we switched to using BaseTestCase.removeDirectory() would give up as soon as one of the files couldn't be deleted, so the next test would fail because it found a half-deleted database. BaseTestCase.removeDirectory() is at the other extreme: it goes on trying to delete files even after it has failed to delete one, and is then surprised to find that the directory was actually removed successfully despite the recorded failure.
I can see these possible solutions (not mutually exclusive):
1) Make the compatibility test wait for the forked process to complete.
2) Stop BaseTestCase.removeDirectory() from failing if the reason why it couldn't delete a file was that the file no longer existed.
We should probably at least do 1. I'm not so sure about 2, since a delete race usually means there is a real problem, and it would be good to get it reported.
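For reference, a rough sketch of what the two options could look like. The class CompatTestFixSketch and the helpers waitForExit() and deleteUnlessGone() are hypothetical names, not existing test-harness code, and the timed Process.waitFor(long, TimeUnit) assumes Java 8 or later (the untimed Process.waitFor() would do on older JVMs).

```java
import java.io.File;
import java.util.concurrent.TimeUnit;

public class CompatTestFixSketch {
    /**
     * Option 1: block until the forked process has really exited, instead of
     * only waiting for it to print "OK". Gives up after the timeout.
     */
    static int waitForExit(Process p, long timeoutSeconds)
            throws InterruptedException {
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroy();
            throw new IllegalStateException("forked process did not exit in time");
        }
        return p.exitValue();
    }

    /**
     * Option 2: count a delete as successful if the file no longer exists
     * afterwards, whoever removed it.
     */
    static boolean deleteUnlessGone(File f) {
        return f.delete() || !f.exists();
    }

    public static void main(String[] args) throws Exception {
        // Fork a short-lived JVM as a stand-in for the old-version process.
        String java = new File(new File(System.getProperty("java.home"), "bin"),
                               "java").getPath();
        Process p = new ProcessBuilder(java, "-version").inheritIO().start();
        System.out.println("exit code: " + waitForExit(p, 60));

        // A file that is already gone is not reported as a failed delete.
        File gone = new File(System.getProperty("java.io.tmpdir"),
                             "no-such-stub-" + System.nanoTime());
        System.out.println("treated as deleted: " + deleteUnlessGone(gone));
    }
}
```

Note that deleteUnlessGone() silently hides the race, which is exactly the reservation about option 2; logging the case where delete() returned false but the file was gone would keep it visible.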