It feels like we have a mismatch between the underlying data and our objects. The need for the VHS-rewind in getTxidBatchSize is one example, what we really want there is an iterator of EditEvents, with one EditEvents per txid (name is just a suggestion).
Basically, the code in getTxidBatchSize is converting between what we get back from the NameNode RPC (a bunch of Events, not grouped by txid) to what we want to return (a batch of events that all have the same txid).
I guess we could do the conversion earlier, in PBHelper#convert(GetEditsFromTxidResponseProto resp). This would mean pushing the batch concept into EventsList. I'm not sure there's much benefit in this, though. EventsList is used on the NameNode as well, and the NN doesn't care about grouping Events of the same txid into batches. But we'd have to implement all that logic anyway if we went down this route. (The NN has its own concept of batching... how many Events it sends back in a single RPC... and of course it will not split a txid across RPCs. But that's the only batching we get out of it.)
I admit that rewinding the iterator feels awkward. I was looking for a utility function in Iterators or something that would help with this, but I didn't find anything. Java doesn't have the concept of duplicating iterators that C++ does, so you can't just make a copy of the iterator and then push only that copy forward. I wanted to avoid allocating an array until I knew what size I wanted. I suppose I could create a LinkedList rather than an array, and avoid the rewinding that way. That's not very efficient, though. One slightly awkward function feels like an acceptable tradeoff for memory efficiency.
The txid could also be moved into EditEvents which would also save some bytes.
Most batches are small-- I think the median size will be a batch of size 1, with maybe some outliers at 2, 3, or 4. I don't think there's a lot of memory savings to be had by moving the txid into EditEvents.
A bigger issue is that we need (well maybe not need, but the code will be really, really awkward if we don't) to add txid to the protobuf EventProto structure in inotify.proto. This is because there can be "gaps" in the txids returned by EventsListProto. Currently, all EventsListProto has is firstTxid, lastTxid, and a list of Events. But you don't know what the txid of any of those Events actually is... if firstTxid is 100 and lastTxid is 200, and you have 50 events, where are they in that sequence? Who knows. And if you add this to the PB, it's extremely awkward not to add it to the Java Event class as well.