Uploaded image for project: 'Apache NiFi'
  1. Apache NiFi
  2. NIFI-13687

Race condition with "Batch Output" mode allows FlowFiles from the same batch to get separated

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.27.0, 2.0.0-M4
    • None
    • Core Framework
    • None

    Description

      Summary

      There is a race condition when using "Batch Output" in conjunction with "Single Batch Per Node". The process group allows the output valve to get opened while FlowFiles are still being read from the inputs. This allows some FlowFiles from the same batch to be transferred out of the process group early, separating them from the batch.

      The race condition can occur when an input port is connected directly to an output port within the affected process group (see example below).

      Example

      I simplified my use case to the following flow (that is attached in "Batch_Output_Bug_Flow.json"):

      The input is one FlowFile that represents a batch of work. In the first process group I split the FlowFile into its parts, but also keep the original batch FlowFile around because I need it later. The parts and original are transferred via different ports.

      In the second group I process the parts. I still want to keep the original batch file around so I pass it through the same process group unmodified. The "Single Batch per Node" and "Batch Output" settings of the process group should keep all the FlowFiles together. This is where the bug occurs, because sometimes the batch file is transferred in and out before the parts are transferred in.

      In the third process group I want to re-merge the processed parts with the original batch file, which is why I need to bring it along. This works fine as long as the batch file and parts are processed together as a batch, but suddenly fails if the batch file is transferred before the part files.

      This is how the second process group "Process parts" looks like:

      Reproduction steps

      1. Import the attached flow and start all processors/processor groups except the frist one ("Generate batch input").
      2. Run the input processor once.
      3. Immediately refresh the UI.
        Expected: 11 FlowFiles are queued in "Process parts"
        Race condition A: 10 FlowFiles are queued in "Process parts", 1 is queued in "Re-merge batch"
        Race condition B: 10 FlowFiles are queued in "Process parts", 1 is queued in the "Batch" connection between "Process parts" and "Re-merge batch"
      4. Wait 7 seconds for the flow to complete.
      5. Refresh the UI.
        Expected: 1 FlowFile is in the final queue ("Processed batch").
        Race condition A: 2 FlowFiles are in the final queue (none of which contain the expected contents).
        Race condition B: 1 FlowFile is in the final queue (the final output is unaffected by race condition B in this example, but the behavior is still incorrect).

      Because this is a race condition, it's unreliable to reproduce. If you don't see the race condition, clear the final queue and try again from step 2. It can take me 10 or so tries to see the bug. It is possible that the likelihood of the race condition is influenced by outside factors, such as other flows/processes running on the node, but I haven't investigated that.

      Log

      I've enabled TRACE level logging for org.apache.nifi.groups and org.apache.nifi.connectable to diagnose the race condition. The full logs for the expected case, race condition A, and race condition B are attached.

      The important events from the logs are summarized in the table below:

      RC A         RC B         Expected      
      Log line PG Valve State ** Log line PG Valve State ** Log line PG Valve State
      7 1 IN OPEN   8 1 IN OPEN   11 1 IN OPEN
      10 1 IN CLOSE   11 1 IN CLOSE   14 1 IN CLOSE
      13 1 OUT OPEN   14 1 OUT OPEN   18 1 OUT OPEN
      19 1 OUT CLOSE   22 1 OUT CLOSE   24 1 OUT CLOSE
      20 2 IN OPEN   23 2 IN OPEN   25 2 IN OPEN
      26 2 OUT OPEN   29 2 OUT OPEN          
      28 2 OUT CLOSE   stays OPEN 2 OUT            
      32 2 IN CLOSE   36 2 IN CLOSE   33 2 IN CLOSE
      34 3 IN OPEN                    
      37 3 IN CLOSE                    
      41 3 OUT OPEN                    
      43 3 OUT CLOSE                    
      47 2 OUT OPEN   already OPEN 2 OUT     670 2 OUT OPEN
      49 2 OUT CLOSE   1244 2 OUT CLOSE   676 2 OUT CLOSE
      50 3 IN OPEN   1245 3 IN OPEN   677 3 IN OPEN
      53 3 IN CLOSE   1253 3 IN CLOSE   685 3 IN CLOSE
      57 3 OUT OPEN   1257 3 OUT OPEN   689 3 OUT OPEN
      59 3 OUT CLOSE   1259 3 OUT CLOSE   691 3 OUT CLOSE

      (PG = Process Group, 1 = "Split batch", 2 = "Process parts", 3 = "Re-merge batch")

      The bold, red row is where the bug happens. The output valve for process group 2 is opened while its input valve is still open. This should not be allowed for process groups with "Batch Output" and "Single Batch per Node".

      The orange rows are the unintended consequences of the bug.

      Code pointers

      The main check for process groups with "Batch Output" - that doesn't allow FlowFiles to leave until the whole batch has been processed - happens here:

      https://github.com/apache/nifi/blob/5440c6d405643b627c48870acda1298961d4b1b2/nifi-framework-bundle/nifi-framework/nifi-framework-components/src/main/java/org/apache/nifi/connectable/LocalPort.java#L199-L220

      Additional checking happens here:

      https://github.com/apache/nifi/blob/5440c6d405643b627c48870acda1298961d4b1b2/nifi-framework-bundle/nifi-framework/nifi-framework-components/src/main/java/org/apache/nifi/groups/StandardDataValve.java#L190-L223

      This is also where it's possible to check whether the input valve is still open (groupsWithDataFlowingIn.contains(sourceGroup.getIdentifier())).

      A check should be added somewhere for process groups that are in "Batch Output" mode that doesn't allow the output valve to be opened while the input valve is still open.

      Attachments

        1. Batch_Output_Bug_Flow.json
          39 kB
          René Zeidler
        2. batch-output-expected.log
          223 kB
          René Zeidler
        3. batch-output-race-condition-A.log
          16 kB
          René Zeidler
        4. batch-output-race-condition-B.log
          514 kB
          René Zeidler
        5. image-2024-08-28-12-31-07-244.png
          116 kB
          René Zeidler
        6. image-2024-08-28-12-31-40-357.png
          81 kB
          René Zeidler

        Issue Links

          Activity

            People

              Unassigned Unassigned
              Rene_Z René Zeidler
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated: