Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 1.27.0, 2.0.0-M4
- Fix Version/s: None
- Component/s: None
Description
Summary
There is a race condition when using "Batch Output" in conjunction with "Single Batch Per Node". The process group allows the output valve to be opened while FlowFiles are still being read from the inputs. This allows some FlowFiles from the same batch to be transferred out of the process group early, separating them from the batch.
The race condition can occur when an input port is connected directly to an output port within the affected process group (see example below).
Example
I simplified my use case to the following flow (attached as "Batch_Output_Bug_Flow.json"):
The input is one FlowFile that represents a batch of work. In the first process group I split the FlowFile into its parts, but also keep the original batch FlowFile around because I need it later. The parts and original are transferred via different ports.
In the second group I process the parts. I still want to keep the original batch file around so I pass it through the same process group unmodified. The "Single Batch per Node" and "Batch Output" settings of the process group should keep all the FlowFiles together. This is where the bug occurs, because sometimes the batch file is transferred in and out before the parts are transferred in.
In the third process group I want to re-merge the processed parts with the original batch file, which is why I need to bring it along. This works fine as long as the batch file and parts are processed together as a batch, but suddenly fails if the batch file is transferred before the part files.
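The mis-sequencing can be illustrated with a small, single-threaded toy model (plain Java; all names are mine, and none of this is actual NiFi code). It shows how a batch FlowFile that is wired straight from an input port to an output port escapes the group before the ten part FlowFiles arrive, unless opening the output valve is made to wait for the input valve to close:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the described sequencing bug; purely illustrative.
class BatchOutputModel {
    /**
     * Simulates transferring a batch (1 batch FlowFile + 10 parts) through
     * a process group whose input port is wired directly to an output port.
     * If checkInputValve is false (the buggy behavior), the batch FlowFile
     * escapes before the parts arrive; if true, it stays with the batch.
     * Returns the number of FlowFiles still queued together in the group.
     */
    static int flowFilesKeptTogether(boolean checkInputValve) {
        Queue<String> inGroup = new ArrayDeque<>();
        Queue<String> left = new ArrayDeque<>();
        boolean inputValveOpen = true;

        // The batch FlowFile is transferred in first and is immediately
        // eligible to leave through the directly connected output port.
        inGroup.add("batch");

        // The output valve asks to open while the input valve is still open.
        if (!checkInputValve || !inputValveOpen) {
            left.add(inGroup.poll()); // batch FlowFile escapes early
        }

        // The ten part FlowFiles arrive afterwards, separated from the batch.
        for (int i = 1; i <= 10; i++) {
            inGroup.add("part-" + i);
        }
        inputValveOpen = false; // input valve closes; the batch is complete

        return inGroup.size();
    }
}
```

With the check disabled, the model mirrors the race conditions (only 10 FlowFiles stay together); with it enabled, all 11 FlowFiles remain queued in the group, matching the expected outcome in the reproduction steps.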
This is what the second process group ("Process parts") looks like:
Reproduction steps
- Import the attached flow and start all processors/process groups except the first one ("Generate batch input").
- Run the input processor once.
- Immediately refresh the UI.
  - Expected: 11 FlowFiles are queued in "Process parts".
  - Race condition A: 10 FlowFiles are queued in "Process parts", 1 is queued in "Re-merge batch".
  - Race condition B: 10 FlowFiles are queued in "Process parts", 1 is queued in the "Batch" connection between "Process parts" and "Re-merge batch".
- Wait 7 seconds for the flow to complete.
- Refresh the UI.
  - Expected: 1 FlowFile is in the final queue ("Processed batch").
  - Race condition A: 2 FlowFiles are in the final queue (none of which contain the expected contents).
  - Race condition B: 1 FlowFile is in the final queue (the final output is unaffected by race condition B in this example, but the behavior is still incorrect).
Because this is a race condition, it's unreliable to reproduce. If you don't see the race condition, clear the final queue and try again from step 2. It can take me 10 or so tries to see the bug. It is possible that the likelihood of the race condition is influenced by outside factors, such as other flows/processes running on the node, but I haven't investigated that.
Log
I've enabled TRACE level logging for org.apache.nifi.groups and org.apache.nifi.connectable to diagnose the race condition. The full logs for the expected case, race condition A, and race condition B are attached.
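For anyone reproducing this, these loggers can be enabled in NiFi's `conf/logback.xml` (the surrounding appender configuration and file location may differ per installation):

```xml
<logger name="org.apache.nifi.groups" level="TRACE"/>
<logger name="org.apache.nifi.connectable" level="TRACE"/>
```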
The important events from the logs are summarized in the table below:
| RC A log line | PG | Valve | State | RC B log line | PG | Valve | State | Expected log line | PG | Valve | State |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 1 | IN | OPEN | 8 | 1 | IN | OPEN | 11 | 1 | IN | OPEN |
| 10 | 1 | IN | CLOSE | 11 | 1 | IN | CLOSE | 14 | 1 | IN | CLOSE |
| 13 | 1 | OUT | OPEN | 14 | 1 | OUT | OPEN | 18 | 1 | OUT | OPEN |
| 19 | 1 | OUT | CLOSE | 22 | 1 | OUT | CLOSE | 24 | 1 | OUT | CLOSE |
| 20 | 2 | IN | OPEN | 23 | 2 | IN | OPEN | 25 | 2 | IN | OPEN |
| **26** | **2** | **OUT** | **OPEN** | **29** | **2** | **OUT** | **OPEN** | | | | |
| 28 | 2 | OUT | CLOSE | | 2 | OUT | (stays OPEN) | | | | |
| 32 | 2 | IN | CLOSE | 36 | 2 | IN | CLOSE | 33 | 2 | IN | CLOSE |
| 34 | 3 | IN | OPEN | | | | | | | | |
| 37 | 3 | IN | CLOSE | | | | | | | | |
| 41 | 3 | OUT | OPEN | | | | | | | | |
| 43 | 3 | OUT | CLOSE | | | | | | | | |
| 47 | 2 | OUT | OPEN | | 2 | OUT | (already OPEN) | 670 | 2 | OUT | OPEN |
| 49 | 2 | OUT | CLOSE | 1244 | 2 | OUT | CLOSE | 676 | 2 | OUT | CLOSE |
| 50 | 3 | IN | OPEN | 1245 | 3 | IN | OPEN | 677 | 3 | IN | OPEN |
| 53 | 3 | IN | CLOSE | 1253 | 3 | IN | CLOSE | 685 | 3 | IN | CLOSE |
| 57 | 3 | OUT | OPEN | 1257 | 3 | OUT | OPEN | 689 | 3 | OUT | OPEN |
| 59 | 3 | OUT | CLOSE | 1259 | 3 | OUT | CLOSE | 691 | 3 | OUT | CLOSE |

(PG = Process Group, 1 = "Split batch", 2 = "Process parts", 3 = "Re-merge batch")

The bold row is where the bug happens: the output valve for process group 2 is opened while its input valve is still open. This should not be allowed for process groups with "Batch Output" and "Single Batch per Node". The remaining rows with no entry in the "Expected" columns are the unintended consequences of the bug.
Code pointers
The main check for process groups with "Batch Output" (the one that prevents FlowFiles from leaving until the whole batch has been processed) happens here:
Additional checking happens here:
This is also where it is possible to check whether the input valve is still open (`groupsWithDataFlowingIn.contains(sourceGroup.getIdentifier())`).
A check should be added for process groups in "Batch Output" mode that prevents the output valve from being opened while the input valve is still open.
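As a sketch of the kind of guard I mean (hypothetical plain Java, not NiFi code; only the `groupsWithDataFlowingIn` name is taken from the code base, and all other names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the proposed valve-state check.
class BatchOutputValveGuard {
    // IDs of process groups whose input valve is currently open,
    // analogous to groupsWithDataFlowingIn in the real implementation.
    private final Set<String> groupsWithDataFlowingIn = new HashSet<>();

    void openInputValve(String groupId) {
        groupsWithDataFlowingIn.add(groupId);
    }

    void closeInputValve(String groupId) {
        groupsWithDataFlowingIn.remove(groupId);
    }

    /**
     * Returns true only if the output valve may be opened, i.e. the group's
     * input valve is no longer open. This is the missing check: for "Batch
     * Output" groups, the output valve must stay closed while FlowFiles are
     * still being transferred in.
     */
    boolean mayOpenOutputValve(String groupId) {
        return !groupsWithDataFlowingIn.contains(groupId);
    }
}
```

With such a guard in place, the bold row in the table above (output valve of process group 2 opening while its input valve is open) could not occur.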
Attachments
Issue Links
- relates to
-
NIFI-13340 Flowfiles stopped to be ingested before a processor group
- Resolved