Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Add a SplitParquet processor that takes a FlowFile with Parquet content as input and a number of records per split as its configuration.
The processor would generate X flow files with unmodified content and add attributes containing the offsets required to read each group of rows in the flow file's content.
The Parquet Reader would then be improved to accept optional flow file attributes carrying this information, so that it reads only the required part of the data.
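The split step described above is essentially bookkeeping: the content is never copied, only attribute maps are produced. A minimal sketch of that logic, assuming hypothetical attribute names (`record.offset`, `record.count`, `fragment.index` are illustrative, not an agreed-upon contract):

```python
def split_attributes(total_records, records_per_split):
    """Return one attribute dict per split; the Parquet content itself is never copied."""
    splits = []
    offset = 0
    index = 0
    while offset < total_records:
        count = min(records_per_split, total_records - offset)
        splits.append({
            "record.offset": str(offset),   # first record this split should read
            "record.count": str(count),     # how many records this split covers
            "fragment.index": str(index),   # position of this split in the sequence
        })
        offset += count
        index += 1
    return splits
```

For example, a 1000-record file split every 300 records yields four splits, the last covering the remaining 100 records.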
Instead of having something like
X -> SplitRecord (Parquet / JSON) -> ...
It'd be something like
X -> SplitParquet -> ConvertRecord (Parquet / JSON) -> ...
The goal is to increase the overall efficiency of this operation for extremely large Parquet files (hundreds of GBs). With the second approach, multiple threads could process a single file concurrently.
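Since each split is described purely by an (offset, count) pair over the same file, the conversion of each slice is independent and can run in parallel. A sketch of that idea, where `process_slice` is a stand-in for ConvertRecord reading one slice (the function names and worker count are assumptions for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def process_slice(offset, count):
    # Placeholder for "read records [offset, offset + count) and convert them".
    return list(range(offset, offset + count))

def convert_in_parallel(total_records, records_per_split, workers=4):
    # Build the (offset, count) slices exactly as SplitParquet would describe them.
    slices = [(o, min(records_per_split, total_records - o))
              for o in range(0, total_records, records_per_split)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves slice order, so results reassemble cleanly.
        results = pool.map(lambda s: process_slice(*s), slices)
    return [record for chunk in results for record in chunk]
```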
The SplitParquet processor should also have a boolean property to write zero-content flow files. The existing FetchParquet processor should be enhanced to accept the flow file attributes providing the offsets. That would give something like
X -> SplitParquet -> FetchParquet (JSON Writer) -> ...
This way, a load-balanced connection could be used between SplitParquet and FetchParquet to distribute the work across the nodes of the cluster without transferring large amounts of data between them.
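The point of the zero-content variant is that only attributes cross the load-balanced connection; each node then fetches just its slice from the source. A sketch of the fetch side under the same hypothetical attribute names (`fetch_slice` and the in-memory record list are illustrative stand-ins for FetchParquet reading from storage):

```python
def fetch_slice(source_records, attributes):
    """Read only the records described by the split attributes of a zero-content flow file."""
    offset = int(attributes["record.offset"])
    count = int(attributes["record.count"])
    # Only this slice of the source is ever materialized on the fetching node.
    return source_records[offset:offset + count]
```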