Description
The PutKudu processor's existing implementation uses a map of KuduOperation -> FlowFile to keep track of which FlowFile was being processed when each KuduOperation was created. This mapping is eventually used to associate FlowFiles with a RowError (if any occurs), which is necessary for transferring FlowFiles to the success/failure relationships and for logging failures, among other things.
For very large inputs, the KuduOperation objects held in this map can consume a large amount of memory. This is not a memory leak, but it can still cause OutOfMemory errors on very large input data. It may be possible to avoid the KuduOperation -> FlowFile map for unbatched flush modes (e.g. the AUTO_FLUSH_SYNC flush mode, where KuduSession.apply() has already flushed the buffer before returning; see https://kudu.apache.org/apidocs/org/apache/kudu/client/SessionConfiguration.FlushMode.html).
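The difference between the two approaches can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PutKudu code: FlowFile, Operation, Result, routeBatched and routeSync are hypothetical stand-ins, and the rejects predicate stands in for whatever the Kudu client reports as a row error. It only shows that when the outcome of each operation is known as soon as it is applied, nothing needs to be retained between operations.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

class PutKuduRoutingSketch {

    // Hypothetical stand-ins for illustration; these are NOT the NiFi or Kudu classes.
    static final class FlowFile {
        final String id;
        FlowFile(String id) { this.id = id; }
    }

    static final class Operation {
        final String row;
        Operation(String row) { this.row = row; }
    }

    /** Ids of FlowFiles with at least one rejected row, plus how many operations were held at once. */
    static final class Result {
        final Set<String> failedFlowFileIds = new LinkedHashSet<>();
        int peakRetainedOperations = 0;
    }

    /** Batched modes: errors surface only at flush time, so every operation is remembered. */
    static Result routeBatched(Map<FlowFile, List<String>> input, Predicate<Operation> rejects) {
        Result result = new Result();
        Map<Operation, FlowFile> pending = new LinkedHashMap<>();
        for (Map.Entry<FlowFile, List<String>> entry : input.entrySet()) {
            for (String row : entry.getValue()) {
                pending.put(new Operation(row), entry.getKey());
            }
        }
        result.peakRetainedOperations = pending.size();
        // Simulated flush: each error is matched back to its FlowFile through the map.
        for (Map.Entry<Operation, FlowFile> entry : pending.entrySet()) {
            if (rejects.test(entry.getKey())) {
                result.failedFlowFileIds.add(entry.getValue().id);
            }
        }
        return result;
    }

    /** Synchronous mode: the outcome is known when the apply call returns, so no map is kept. */
    static Result routeSync(Map<FlowFile, List<String>> input, Predicate<Operation> rejects) {
        Result result = new Result();
        for (Map.Entry<FlowFile, List<String>> entry : input.entrySet()) {
            for (String row : entry.getValue()) {
                Operation op = new Operation(row); // unreferenced after this iteration
                if (rejects.test(op)) {
                    result.failedFlowFileIds.add(entry.getKey().id);
                }
            }
        }
        return result;
    }
}
```

Both methods identify the same failed FlowFiles; the synchronous variant simply never accumulates operations, which is the memory saving this issue is after.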
This Jira captures the effort to refactor the PutKudu processor to reduce its memory usage.