I've been looking over the patch after seeing performance numbers lower than I had hoped for.
I'm wary of using the ThreadLocal as a way to synchronize updates. I don't think we're getting as much batching as we could, because the tabletserver runs multiple clients in their own threads. You'll only get batching when a single write session writes to multiple tablets on that one tserver. I think synchronizing access around the InternalBatchWriter class (or using a concurrent data structure as the queue) would be easier to understand and would let mutations from different threads share a batch. Switching to a concurrent data structure would also need some extra synchronization around commitBatch(): if you clear the data structure after a flush() on the underlying batchwriter, any mutation added between the flush() and the clear is dropped without ever being written.
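To make the clearing race concrete, here is a rough sketch of one way around it, under my own assumptions: SharedMutationQueue, its String "mutations", and the in-memory flushed list are all hypothetical stand-ins, not anything in the patch or in InternalBatchWriter. The idea is to never call clear() at all; commitBatch() poll()s entries out of the queue, so it removes exactly what it is about to flush.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical stand-in for a queue shared by all writer threads. */
class SharedMutationQueue {
  private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
  private final Object commitLock = new Object();
  // Stands in for the underlying batch writer; only touched under commitLock.
  private final List<String> flushed = new ArrayList<>();

  /** Non-blocking add, callable from any thread. */
  void add(String mutation) {
    pending.add(mutation);
  }

  /** Drains and "flushes" whatever is queued; returns the batch size. */
  int commitBatch() {
    synchronized (commitLock) {
      List<String> batch = new ArrayList<>();
      String m;
      // poll() removes only the entries we are about to flush, so a
      // mutation added concurrently simply waits for the next commit.
      while ((m = pending.poll()) != null) {
        batch.add(m);
      }
      flushed.addAll(batch);
      return batch.size();
    }
  }

  int flushedCount() {
    synchronized (commitLock) {
      return flushed.size();
    }
  }

  /** Several writers add and commit concurrently; returns the total flushed. */
  static int demo(int writers, int perWriter) {
    SharedMutationQueue q = new SharedMutationQueue();
    List<Thread> threads = new ArrayList<>();
    for (int w = 0; w < writers; w++) {
      final int id = w;
      Thread t = new Thread(() -> {
        for (int i = 0; i < perWriter; i++) {
          q.add("w" + id + "-m" + i);
          if (i % 100 == 0) {
            q.commitBatch();
          }
        }
      });
      threads.add(t);
      t.start();
    }
    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    q.commitBatch(); // pick up anything added after the last commit
    return q.flushedCount();
  }
}
```

With a flush-then-clear() implementation the total from demo() could come up short; draining with poll() keeps every mutation in exactly one batch.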
Actually, looking at ThriftClientHandler.applyUpdates(TInfo, long, TKeyExtent, List<TMutation>), I'm not sure you're going to get any batching at all. Each thrift call carries updates for only a single tablet and runs in its own thread, which means that writes to multiple tablets all land in different threads. I could be missing something, but that might drive home the point.
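A toy illustration of that concern, assuming one thread per call as described above (ThreadLocalBatchDemo and its buffer are hypothetical, not the patch's code): when every call buffers into a ThreadLocal, each thread only ever sees its own mutations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ThreadLocalBatchDemo {
  // Hypothetical per-thread buffer, similar in spirit to a ThreadLocal batch.
  private static final ThreadLocal<List<String>> BUFFER =
      ThreadLocal.withInitial(ArrayList::new);

  /**
   * Simulates one update call per tablet, each on its own fresh thread,
   * and returns the largest batch any single thread managed to build.
   */
  static int largestBatch(int tablets) {
    Map<Integer, Integer> sizes = new ConcurrentHashMap<>();
    List<Thread> threads = new ArrayList<>();
    for (int t = 0; t < tablets; t++) {
      final int tablet = t;
      Thread thread = new Thread(() -> {
        List<String> mine = BUFFER.get(); // private to this thread
        mine.add("mutation-for-tablet-" + tablet);
        sizes.put(tablet, mine.size());
      });
      threads.add(thread);
      thread.start();
    }
    for (Thread thread : threads) {
      try {
        thread.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    int max = 0;
    for (int size : sizes.values()) {
      max = Math.max(max, size);
    }
    return max;
  }
}
```

However many tablets are written, largestBatch() stays at 1 here. A real server would reuse pooled threads, so the picture is less clean than this, but nothing in the ThreadLocal approach lets two concurrent calls share a batch.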
Ideally, you'd want something that can very quickly (maybe even lock free somehow?) add Mutations that need to be flush()'ed, plus a single notification point that all of the threads could wait on to know that the sync happened (a CountDownLatch, perhaps).
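Something along these lines is what I have in mind. This is only a sketch under my own assumptions: GroupCommitSketch, Batch, and the counters are hypothetical names, the "flush" is just a counter bump, and it uses a short synchronized section for the add and the swap instead of attempting anything lock free. Each batch carries one CountDownLatch; writers add a mutation and wait on the latch, and the syncer swaps in a fresh batch, flushes the old one outside the lock, then releases every waiter with a single countDown().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class GroupCommitSketch {
  /** Pending mutations plus the latch their writers wait on. */
  private static final class Batch {
    final List<String> mutations = new ArrayList<>();
    final CountDownLatch synced = new CountDownLatch(1);
  }

  private Batch current = new Batch(); // guarded by "this"
  final AtomicInteger flushCalls = new AtomicInteger();
  final AtomicInteger flushedMutations = new AtomicInteger();

  /** Writer side: enqueue, then wait on the returned latch for the sync. */
  synchronized CountDownLatch add(String mutation) {
    current.mutations.add(mutation);
    return current.synced;
  }

  /** Syncer side: take the current batch, flush it, release its writers. */
  void sync() {
    Batch toFlush;
    synchronized (this) {
      toFlush = current;
      current = new Batch(); // later adds go to the next batch
    }
    // The expensive flush would happen here, outside the lock.
    if (!toFlush.mutations.isEmpty()) {
      flushCalls.incrementAndGet();
      flushedMutations.addAndGet(toFlush.mutations.size());
    }
    toFlush.synced.countDown(); // one notification for the whole batch
  }

  /** N writers each add one mutation and block until it has been synced. */
  static int demo(int writers) {
    GroupCommitSketch gc = new GroupCommitSketch();
    CountDownLatch done = new CountDownLatch(writers);
    for (int i = 0; i < writers; i++) {
      final int id = i;
      new Thread(() -> {
        try {
          gc.add("m" + id).await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          done.countDown();
        }
      }).start();
    }
    try {
      // Keep syncing until every writer has been released.
      do {
        gc.sync();
      } while (!done.await(1, TimeUnit.MILLISECONDS));
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return gc.flushedMutations.get();
  }
}
```

The number of flushes depends on timing, but it can never exceed the number of mutations, and any writers that land in the same batch pay for only one sync between them.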