In Defragment mode, MergeContent can improperly reassemble the pieces of a split file. I understand this was previously discussed in NIFI-378, and the outcome was to update the documentation for fragment.index:
Applicable only if the <Merge Strategy> property is set to Defragment. This attribute indicates the order in which the fragments should be assembled. This attribute must be present on all FlowFiles when using the Defragment Merge Strategy and must be a unique (i.e., unique across all FlowFiles that have the same value for the "fragment.identifier" attribute) integer between 0 and the value of the fragment.count attribute. If two or more FlowFiles have the same value for the "fragment.identifier" attribute and the same value for the "fragment.index" attribute, the behavior of this Processor is undefined.
I believe this could (and probably should) be improved upon. Specifically, the discussion around NIFI-378 focused on the "improper" use of MergeContent, in using the same fragment.identifier to "pair up" files. The situation I've encountered isn't really unusual in any way...
I have a file being split and sent via PostHTTP to another NiFi instance. If something "goes wrong", the sending NiFi may not get an acknowledgement of success even if the segment made it to the receiving NiFi. It then sends the segment again. NiFi favors duplication over loss, so this is not unexpected. However, I now have a file broken into X fragments arriving on the other side as X+1 (or more). The reassembly may work... or both duplicates may be chosen, resulting in an incorrectly recreated file.
To satisfy the contract as it exists, you would need to use a DetectDuplicate before the MergeContent to filter these out. However, that could potentially incur a great deal of overhead. In contrast, simply checking that there are no duplicate fragment.index values in a bin should be relatively straightforward. How to handle duplicates is a legitimate question... are they ignored, or are they discarded (if they're actually the same)? If the duplicates' contents aren't identical, what is the behavior? Personally, I would say if you have actual duplicates, drop one and continue with the merge... if you have unequal "duplicates", fail the bin. But there's room for discussion there.
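To make the suggestion concrete, here is a minimal sketch of the kind of per-bin check I have in mind. It is not NiFi code: Fragment, FragmentDedup and dedupe are hypothetical names standing in for FlowFiles and whatever the bin logic actually looks like. It also assumes the fragment content is cheap to compare directly; a real implementation would more likely compare content hashes or content claims.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FragmentDedup {

    /** Hypothetical stand-in for a FlowFile: its fragment.index plus its content. */
    record Fragment(int index, byte[] content) {}

    /**
     * Takes the fragments of one bin (all sharing a fragment.identifier) and
     * returns them with exact duplicates dropped. Throws if two fragments claim
     * the same index but carry different content, i.e. "fail the bin".
     */
    static List<Fragment> dedupe(List<Fragment> bin) {
        Map<Integer, Fragment> byIndex = new LinkedHashMap<>();
        for (Fragment f : bin) {
            // putIfAbsent returns the fragment already stored for this index, or null
            Fragment seen = byIndex.putIfAbsent(f.index(), f);
            if (seen != null && !Arrays.equals(seen.content(), f.content())) {
                throw new IllegalStateException(
                        "Conflicting fragments for fragment.index " + f.index());
            }
            // seen != null with equal content: a true duplicate, silently dropped
        }
        return new ArrayList<>(byIndex.values());
    }

    public static void main(String[] args) {
        List<Fragment> bin = List.of(
                new Fragment(0, "part-a".getBytes()),
                new Fragment(1, "part-b".getBytes()),
                new Fragment(1, "part-b".getBytes())); // resent segment
        System.out.println(dedupe(bin).size() + " unique fragments");
    }
}
```

The cost is one map lookup per fragment in the bin, which is why I'd expect this to be much cheaper than routing everything through DetectDuplicate first.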
The point is, in this circumstance it is very easy for a user to do a very reasonable thing and end up with a corrupt file for reasons that are somewhat esoteric. Then, we would need to explain to them why "defragment" doesn't actually defragment, but just kind of sorts a bin of matching things. I think we can do better than that.