Currently, we have to be a little conservative in how granularly we measure things to avoid heavy synchronization costs in the metrics.
It should be possible to refactor the thread-safe implementation to use volatile fields and the java.util.concurrent.atomic classes instead of locking, which could yield a substantial performance improvement.
However, before investing too much time in it, we should run some benchmarks to gauge how much improvement we can expect.
I propose running the benchmarks on trunk with debug turned on, then removing all synchronization and running them again; the difference gives an upper bound on the achievable improvement.
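To make the comparison concrete, here is a rough sketch of what such a before/after measurement might look like in isolation. Everything in it (Counter, SynchronizedCounter, UnsafeCounter, CounterBenchmark) is a hypothetical stand-in, not a class from the codebase, and a hand-rolled timing loop like this is sensitive to JIT warm-up, so a proper harness such as JMH would likely give more trustworthy numbers.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a metric update; not the project's actual class.
interface Counter {
    void increment();
    long value();
}

// Baseline: every update takes the object's monitor.
final class SynchronizedCounter implements Counter {
    private long count;
    public synchronized void increment() { count++; }
    public synchronized long value() { return count; }
}

// Upper bound only: no synchronization, so counts may be lost under contention.
final class UnsafeCounter implements Counter {
    private long count;
    public void increment() { count++; }
    public long value() { return count; }
}

final class CounterBenchmark {
    // Starts `threads` workers that each call increment() `perThread` times
    // and returns the elapsed wall-clock time in nanoseconds.
    static long run(Counter counter, int threads, int perThread) {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all workers at the same time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        long begin = System.nanoTime();
        start.countDown();
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        }
        return System.nanoTime() - begin;
    }
}
```

Calling CounterBenchmark.run(new SynchronizedCounter(), 8, 1_000_000) and then the same with new UnsafeCounter() would give the two timings to compare. The unsynchronized variant is not correct under contention; it only bounds how much there is to gain.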
If the results are promising, we can start prototyping a lock-free implementation.
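As a starting point for such a prototype, the following is a minimal sketch of a statistic updated without locks. LockFreeStat and its methods are hypothetical names chosen for illustration, and the choice of LongAdder for the hot counters is an assumption, not a decision made in this ticket.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical lock-free statistic; not the existing metrics class.
final class LockFreeStat {
    // LongAdder spreads contended updates across internal cells.
    private final LongAdder count = new LongAdder();
    private final LongAdder sum = new LongAdder();
    // The maximum is maintained with an atomic read-modify-write.
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);
    // volatile makes the most recent write visible to reader threads.
    private volatile long last;

    void record(long value) {
        count.increment();
        sum.add(value);
        max.accumulateAndGet(value, Math::max);
        last = value;
    }

    long count() { return count.sum(); }
    long sum() { return sum.sum(); }
    long max() { return max.get(); }
    long last() { return last; }
}
```

One trade-off to keep in mind: each field is updated atomically on its own, but a reader that calls count() and sum() while writers are active does not see a consistent snapshot across fields. Whether that is acceptable for the metrics would need to be decided as part of the prototype.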