Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
None
Description
The streaming mode for VectorGroupBy allocates a large number of arrays due to VectorHashKeyWrapper::duplicateTo().
Since the vectors can't be mutated in place while a single batch is being processed, the number of key copies can be cut by roughly 1000x by copying the streaming key once at the end of the loop instead of re-copying it inside the loop.
for(int i = 0; i < batch.size; ++i) {
  if (!batchKeys[i].equals(streamingKey)) {
    // We've encountered a new key, must save current one
    // We can't forward yet, the aggregators have not been evaluated
    rowsToFlush[flushMark] = currentStreamingAggregators;
    if (keysToFlush[flushMark] == null) {
      keysToFlush[flushMark] = (VectorHashKeyWrapper) streamingKey.copyKey();
    } else {
      streamingKey.duplicateTo(keysToFlush[flushMark]);
    }
    currentStreamingAggregators = streamAggregationBufferRowPool.getFromPool();
    batchKeys[i].duplicateTo(streamingKey);
    ++flushMark;
  }
The duplicateTo() call can be pushed out of the loop, since the only key that truly needs a copy is the last unique key in the VRB (VectorizedRowBatch).
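To illustrate the idea, here is a minimal, self-contained sketch that counts deep copies in both variants. It is not the Hive code: `StreamingKeySketch`, `Key`, `copyInsideLoop`, `copyAfterLoop` and `countCopies` are hypothetical names, the key is reduced to a `long[]`, and the aggregation buffers are left out. It assumes the flushed keys are consumed before the next batch overwrites `batchKeys`, which is what makes holding references safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StreamingKeySketch {

    /** Hypothetical stand-in for a key wrapper; it only counts deep copies. */
    static final class Key {
        static int copies = 0;
        final long[] values;

        Key(long... values) { this.values = values; }

        /** Deep copy: allocates a fresh backing array. */
        Key copyKey() {
            copies++;
            return new Key(values.clone());
        }

        @Override public boolean equals(Object o) {
            return o instanceof Key && Arrays.equals(values, ((Key) o).values);
        }

        @Override public int hashCode() { return Arrays.hashCode(values); }
    }

    /** Copies on every key change, in the spirit of the loop quoted above. */
    static Key copyInsideLoop(Key[] batchKeys, Key streamingKey, List<Key> flushed) {
        for (Key k : batchKeys) {
            if (!k.equals(streamingKey)) {
                if (streamingKey != null) {
                    flushed.add(streamingKey.copyKey()); // copy of the finished key
                }
                streamingKey = k.copyKey();              // copy of the new key
            }
        }
        return streamingKey;
    }

    /** Holds references while the batch is live; copies only the last key. */
    static Key copyAfterLoop(Key[] batchKeys, Key streamingKey, List<Key> flushed) {
        for (Key k : batchKeys) {
            if (!k.equals(streamingKey)) {
                if (streamingKey != null) {
                    flushed.add(streamingKey); // reference, valid until the flush
                }
                streamingKey = k;              // reference into the current batch
            }
        }
        // Only this key must outlive the batch, so it is the only one copied.
        return streamingKey == null ? null : streamingKey.copyKey();
    }

    /** Runs one batch in which every row starts a new key (worst case). */
    static int countCopies(boolean hoisted, int batchSize) {
        Key[] batchKeys = new Key[batchSize];
        for (int i = 0; i < batchSize; i++) {
            batchKeys[i] = new Key((long) i);
        }
        Key.copies = 0;
        List<Key> flushed = new ArrayList<>();
        if (hoisted) {
            copyAfterLoop(batchKeys, null, flushed);
        } else {
            copyInsideLoop(batchKeys, null, flushed);
        }
        return Key.copies;
    }

    public static void main(String[] args) {
        System.out.println("copies inside loop: " + countCopies(false, 1024));
        System.out.println("copies after loop:  " + countCopies(true, 1024));
    }
}
```

With a 1024-row batch of all-distinct keys, this toy model performs 2047 copies in the first variant and a single copy in the second. The real saving depends on how many key changes a batch contains.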
The actual byte[] values within the keys are safely copied out by VectorHashKeyWrapperBatch.assignRowColumn(), which calls setVal() and not setRef().
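The difference between copying and referencing can be shown with a small stand-alone sketch. `BytesSlot` below is a hypothetical holder written for this illustration, not a Hive class; its `setVal` and `setRef` methods only mimic the copy-versus-alias behaviour described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SetValVsSetRefSketch {

    /** Hypothetical single-value holder; not a Hive class. */
    static final class BytesSlot {
        byte[] bytes;
        int start;
        int length;

        /** Stores a reference: later writes to source are visible here. */
        void setRef(byte[] source, int start, int length) {
            this.bytes = source;
            this.start = start;
            this.length = length;
        }

        /** Stores a private copy: later writes to source are not visible. */
        void setVal(byte[] source, int start, int length) {
            this.bytes = Arrays.copyOfRange(source, start, start + length);
            this.start = 0;
            this.length = length;
        }

        String asString() {
            return new String(bytes, start, length, StandardCharsets.UTF_8);
        }
    }

    /** Returns { value seen through setRef, value seen through setVal }. */
    static String[] demo() {
        byte[] batchBuffer = "key-A".getBytes(StandardCharsets.UTF_8);

        BytesSlot byRef = new BytesSlot();
        byRef.setRef(batchBuffer, 0, batchBuffer.length);

        BytesSlot byVal = new BytesSlot();
        byVal.setVal(batchBuffer, 0, batchBuffer.length);

        // Simulate the next batch reusing and overwriting the same buffer.
        batchBuffer[4] = 'B';

        return new String[] { byRef.asString(), byVal.asString() };
    }

    public static void main(String[] args) {
        String[] result = demo();
        System.out.println("setRef sees: " + result[0]);
        System.out.println("setVal sees: " + result[1]);
    }
}
```

In this sketch the slot filled through `setRef` reads `key-B` after the buffer is overwritten, while the slot filled through `setVal` still reads `key-A`. That is why a copied-out value stays valid after the batch is reused.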