Details
Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.23.0
Fix Version/s: None
Component/s: None
Environment: Docker
Description
We have an environment with two separate NiFi clusters, with no direct connectivity between them. We need to pass lots of small data packets between these instances.
I've tested the FlowFile Stream v3 format with MergeContent at the sending end and the same format at the receiving end, and it works great. The automatic packing of FlowFile attributes is exactly what we need. We have another requirement, though: the merged bundles must ultimately be archived in a platform-agnostic format - i.e. one that can be read in its original form using standard tooling (think Bash/Python scripts) or third-party applications. I don't believe the FlowFile Stream v3 format is suitable for this, as only NiFi can read it. Technically one could invoke the relevant Java class to read it, but that isn't workable where certain third-party tools are involved.
The Avro format is an option; however, the FlowFile attributes are aggregated (merged or combined, depending on configuration) into the single merged FlowFile, so the per-FlowFile attributes are lost. We need the original attributes preserved for each individual FlowFile within the merged archive.
The formats I have in mind are TAR and ZIP, both of which are already supported by MergeContent and UnpackContent. The missing part is the storage and retrieval of FlowFile attributes, which are currently discarded by the relevant TAR/ZIP implementations of these processors.
My proposal is to extend the basic TAR and ZIP functionality, giving the user the option of storing FlowFile attributes in files within the archive, using the FlowFile archive entry name as the base and adding a user-configurable extension. For instance, MergeContent would produce an archive like:
merged.zip:
|_ abc.txt
|_ abc.txt.attributes
|_ xyz.txt
|_ xyz.txt.attributes
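To illustrate the platform-agnostic goal, here is a minimal sketch of reading such an archive with nothing but the Python standard library. The entry names, the ".attributes" extension, and the JSON encoding of the attributes are assumptions for illustration; the proposal leaves the attributes format to a configurable RecordWriter.

```python
import io
import json
import zipfile

# Build a sample archive matching the proposed layout (illustrative data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("abc.txt", "hello")
    zf.writestr("abc.txt.attributes",
                json.dumps({"filename": "abc.txt", "mime.type": "text/plain"}))
    zf.writestr("xyz.txt", "world")
    zf.writestr("xyz.txt.attributes", json.dumps({"filename": "xyz.txt"}))

# Read it back: pair each content entry with its "<name>.attributes"
# sidecar, if one is present.
EXT = ".attributes"
records = {}
buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    names = set(zf.namelist())
    for name in sorted(n for n in names if not n.endswith(EXT)):
        attrs = {}
        if name + EXT in names:
            attrs = json.loads(zf.read(name + EXT))
        records[name] = (zf.read(name), attrs)
```

No NiFi (or Java) is involved on the reading side, which is exactly the point of the proposal.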
Similarly, when configured to parse attributes from files, UnpackContent would read the archive and, for each FlowFile entry, read the next file in sequence as an attributes file, merging any attributes defined in that file with the attributes the processor already writes.
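The sequential pairing and merge step described above can be sketched for a TAR archive as follows. The attribute names, the ".attributes" extension, and the JSON encoding are assumptions for illustration only.

```python
import io
import json
import tarfile

# Build a tiny TAR with one content entry followed by its attributes entry.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for name, data in [
        ("abc.txt", b"hi"),
        ("abc.txt.attributes", json.dumps({"path": "/incoming"}).encode()),
    ]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

buf.seek(0)
existing = {"filename": "abc.txt"}  # e.g. attributes set during unpacking
with tarfile.open(fileobj=buf, mode="r") as tf:
    members = iter(tf.getmembers())
    for member in members:
        content = tf.extractfile(member).read()
        # Peek at the next entry; treat it as a sidecar if the name matches.
        nxt = next(members, None)
        sidecar = {}
        if nxt is not None and nxt.name == member.name + ".attributes":
            sidecar = json.loads(tf.extractfile(nxt).read())
        # Merge parsed attributes over the ones already present.
        merged = {**existing, **sidecar}
```

An archive written without sidecar entries would still unpack cleanly: the name check simply never matches and only the existing attributes remain.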
The user would be able to configure a RecordWriter in MergeContent for writing out attributes, which would provide the flexibility to choose the output format (for instance, CSV or JSON). They would likewise be able to configure a RecordReader in UnpackContent.
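To make the format flexibility concrete, here is the same attribute map rendered both ways with standard-library tooling. The attribute names and the flat one-record schema are assumptions for illustration; the actual shape would follow from the configured RecordWriter.

```python
import csv
import io
import json

# One illustrative attribute map, serialized as JSON and as CSV.
attrs = {"filename": "abc.txt", "mime.type": "text/plain"}

as_json = json.dumps(attrs, sort_keys=True)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=sorted(attrs))
writer.writeheader()
writer.writerow(attrs)
as_csv = out.getvalue()
```

Either form stays trivially readable by scripts and third-party tools, which is what makes the RecordWriter/RecordReader pairing attractive here.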
The extension of attribute storage and retrieval for these common archive formats would enhance the ability of dataflow admins to store FlowFiles, along with their attributes, in such a way that they are (and remain) readable by current and future archiving systems - without being dependent on NiFi.