Uploaded image for project: 'Apache NiFi'
  1. Apache NiFi
  2. NIFI-1527

Resource Claim counts not incremented on restart for FlowFiles that are swapped out

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.0, 0.5.1
    • Component/s: Core Framework
    • Labels:
      None

      Description

      When NiFi starts, it tallies the count of how many FlowFiles reference each Resource Claim and then removes or archives any file in the Content Repository for which there is no claim.

      However, the claim counts are not incremented for data that is swapped out. As a result, the content repository could delete or archive data on restart that it should not. This could potentially result in data loss, if NiFi is restarted while both of the following conditions are met:

      • All FlowFiles that reference a Resource Claim are swapped out
      • Content Repository's archive is disabled, or the archival threshold is already exceeded from data outside of the Content Repository

        Issue Links

          Activity

          Hide
          joewitt Joseph Witt added a comment -

          I think we should consider doing an 0.5.1 with this finding.

          Show
          joewitt Joseph Witt added a comment - I think we should consider doing an 0.5.1 with this finding.
          Hide
          markap14 Mark Payne added a comment -

          Joe: I am certainly not opposed to that.

          Show
          markap14 Mark Payne added a comment - Joe: I am certainly not opposed to that.
          Hide
          joewitt Joseph Witt added a comment -

          recommend changing this line

          + return new StandardSwapSummary(new QueueSize(0, 0L), null, Collections.<ResourceClaim> emptyList());

          To instead be
          return StandardSwapSummary.EMPTYSUMMARY

          Or something like that. Feels cleaner than having someone construct that concept externally.

          Show
          joewitt Joseph Witt added a comment - recommend changing this line + return new StandardSwapSummary(new QueueSize(0, 0L), null, Collections.<ResourceClaim> emptyList()); To instead be return StandardSwapSummary.EMPTYSUMMARY Or something like that. Feels cleaner than having someone construct that concept externally.
          Hide
          markap14 Mark Payne added a comment -

          +1 will incorporate this and add new patch.

          Show
          markap14 Mark Payne added a comment - +1 will incorporate this and add new patch.
          Hide
          JPercivall Joseph Percivall added a comment -

          Reviewing

          Show
          JPercivall Joseph Percivall added a comment - Reviewing
          Hide
          JPercivall Joseph Percivall added a comment -

          Need to update the comment on FlowFileQueue.recoverSwappedFlowFiles()

          Show
          JPercivall Joseph Percivall added a comment - Need to update the comment on FlowFileQueue.recoverSwappedFlowFiles()
          Hide
          JPercivall Joseph Percivall added a comment -

          The comment on line 138 of TestWriteAheadFlowFileRepository needs to be finished.

          Show
          JPercivall Joseph Percivall added a comment - The comment on line 138 of TestWriteAheadFlowFileRepository needs to be finished.
          Hide
          JPercivall Joseph Percivall added a comment -

          Verified the problem with a latest build of master. Then read through all the changes that were made, ran a contrib check build using this patch, and verified that the patch fixes the problem.

          +1 (pending the two comments being addressed)

          Show
          JPercivall Joseph Percivall added a comment - Verified the problem with a latest build of master. Then read through all the changes that were made, ran a contrib check build using this patch, and verified that the patch fixes the problem. +1 (pending the two comments being addressed)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 3bb18b965348b439cadc8f548429ab1347f9c305 in nifi's branch refs/heads/master from Mark Payne
          [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=3bb18b9 ]

          NIFI-1527: Ensure that we increment Claimant Counts for content claims that are referenced by Swapped-Out FlowFiles on restart of nifi

          Show
          jira-bot ASF subversion and git services added a comment - Commit 3bb18b965348b439cadc8f548429ab1347f9c305 in nifi's branch refs/heads/master from Mark Payne [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=3bb18b9 ] NIFI-1527 : Ensure that we increment Claimant Counts for content claims that are referenced by Swapped-Out FlowFiles on restart of nifi
          Hide
          markap14 Mark Payne added a comment -

          Joe: thanks for taking the time to review the patch! Good catches on the comments. Addressed them both and pushed to master.

          Show
          markap14 Mark Payne added a comment - Joe: thanks for taking the time to review the patch! Good catches on the comments. Addressed them both and pushed to master.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 773d7fbd189837b1d68ab8f2640a101502198233 in nifi's branch refs/heads/support/nifi-0.5.x from Mark Payne
          [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=773d7fb ]

          NIFI-1527: Ensure that we increment Claimant Counts for content claims that are referenced by Swapped-Out FlowFiles on restart of nifi

          Show
          jira-bot ASF subversion and git services added a comment - Commit 773d7fbd189837b1d68ab8f2640a101502198233 in nifi's branch refs/heads/support/nifi-0.5.x from Mark Payne [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=773d7fb ] NIFI-1527 : Ensure that we increment Claimant Counts for content claims that are referenced by Swapped-Out FlowFiles on restart of nifi
          Hide
          joewitt Joseph Witt added a comment -

          Re-opened to evaluate an observation. Might be nothing...but want to be sure.

          Mark Payne Here is a testing scenario I ran.

          Start a flow with GenerateFlowFile and UpdateAttribute
          UpdAtt was disabled
          caused say 1M flow files to backup
          then stopped gneerating data just to keep things simple
          so now 1M backed up
          then stopped nifi and restarted
          all good upon restart its all there again
          Then
          i made another copy of that flow on the flow
          and let that one generate data and terminate data
          what i wanted to observe is that the flow files generated were actually having their backing content deleted
          because the archive size was set at 10% threshold and i am on a 22% full disk
          so was intentionally saying we don't have archive space left
          anyway - the data on the second flow was being properly removed from the content repo
          However then I stopped that flow
          and I ensureed all that was left in the content repo was the original stuff I made
          verified by timetstamp on the files in the content repo anyway
          THEN
          i let UpdateAttribute on the original flow run so it would clear out all the flow files
          in this case, the content repo was NOT getting cleared out

          What I see in the content repo are files like this
          ./71/1456079610973-71
          ...nothing in ./71/archive/...

          This does not seem right.

          Thanks
          Joe

          Show
          joewitt Joseph Witt added a comment - Re-opened to evaluate an observation. Might be nothing...but want to be sure. Mark Payne Here is a testing scenario I ran. Start a flow with GenerateFlowFile and UpdateAttribute UpdAtt was disabled caused say 1M flow files to backup then stopped gneerating data just to keep things simple so now 1M backed up then stopped nifi and restarted all good upon restart its all there again Then i made another copy of that flow on the flow and let that one generate data and terminate data what i wanted to observe is that the flow files generated were actually having their backing content deleted because the archive size was set at 10% threshold and i am on a 22% full disk so was intentionally saying we don't have archive space left anyway - the data on the second flow was being properly removed from the content repo However then I stopped that flow and I ensureed all that was left in the content repo was the original stuff I made verified by timetstamp on the files in the content repo anyway THEN i let UpdateAttribute on the original flow run so it would clear out all the flow files in this case, the content repo was NOT getting cleared out What I see in the content repo are files like this ./71/1456079610973-71 ...nothing in ./71/archive/... This does not seem right. Thanks Joe
          Hide
          markap14 Mark Payne added a comment -

          Joe - hmm, that does seem odd... I will try to replicate problem and see if i can figure out why it's behaving this way - or if it's a bug that we need to address.

          Show
          markap14 Mark Payne added a comment - Joe - hmm, that does seem odd... I will try to replicate problem and see if i can figure out why it's behaving this way - or if it's a bug that we need to address.
          Hide
          markap14 Mark Payne added a comment -

          Joseph Witt - that was a great catch! With some of the refactoring that was done, we ended up incrementing the claim count twice on NiFi restart. So we ended up not removing the content claim when we should have. On restart, it did clear the content because there were no longer any FlowFiles referencing the claim. I have attached a new patch that addresses this issue.

          Thanks!

          Show
          markap14 Mark Payne added a comment - Joseph Witt - that was a great catch! With some of the refactoring that was done, we ended up incrementing the claim count twice on NiFi restart. So we ended up not removing the content claim when we should have. On restart, it did clear the content because there were no longer any FlowFiles referencing the claim. I have attached a new patch that addresses this issue. Thanks!
          Hide
          joewitt Joseph Witt added a comment -

          am reviewing the new patch. Please do make a new one that is based on the latest master because this one as-is would require a revert then add.

          Will advise of results. Will run same tests plus some other goodies.

          Show
          joewitt Joseph Witt added a comment - am reviewing the new patch. Please do make a new one that is based on the latest master because this one as-is would require a revert then add. Will advise of results. Will run same tests plus some other goodies.
          Hide
          markap14 Mark Payne added a comment -

          Good call Joe - I forgot that we'd already pushed to master - generating a new patch now.

          Show
          markap14 Mark Payne added a comment - Good call Joe - I forgot that we'd already pushed to master - generating a new patch now.
          Hide
          joewitt Joseph Witt added a comment -

          Applied latest patch. Seeing better behavior. There does seem to be some sort of off-by-one kind of thing.

          To recreate
          1) Have a content repo on a partition that is already using more space than the content archive allows.
          2) Have a GenerateFlowFile feeding UpdateAttribute.
          3) Let it run for a few seconds making tons of data and removing it. Then wait until checkpointing runs.
          4) Check how many files are left in the content repository. I consistently end up with one file. It goes away on restart but will not otherwise go away.

          In looking at the info in the content repo i see tons of files with this sort of info:

          • 1049600 Feb 21 21:56 ./95/1456109782914-9
            and one single file with this info:
          • 343040 Feb 21 21:56 ./786/1456109799488-786

          So it looks like the file created that didn't trigger the max entries in a segment is the one that won't go away until restart. Seems interesting.

          Show
          joewitt Joseph Witt added a comment - Applied latest patch. Seeing better behavior. There does seem to be some sort of off-by-one kind of thing. To recreate 1) Have a content repo on a partition that is already using more space than the content archive allows. 2) Have a GenerateFlowFile feeding UpdateAttribute. 3) Let it run for a few seconds making tons of data and removing it. Then wait until checkpointing runs. 4) Check how many files are left in the content repository. I consistently end up with one file. It goes away on restart but will not otherwise go away. In looking at the info in the content repo i see tons of files with this sort of info: 1049600 Feb 21 21:56 ./95/1456109782914-9 and one single file with this info: 343040 Feb 21 21:56 ./786/1456109799488-786 So it looks like the file created that didn't trigger the max entries in a segment is the one that won't go away until restart. Seems interesting.
          Hide
          joewitt Joseph Witt added a comment -

          Oooh. I think it is because that is the segment still being written to. I restarted that flow and it filled that one up then cleared it out on next checkpoint. Makes good sense. Testing continues but this is good.

          Show
          joewitt Joseph Witt added a comment - Oooh. I think it is because that is the segment still being written to. I restarted that flow and it filled that one up then cleared it out on next checkpoint. Makes good sense. Testing continues but this is good.
          Hide
          joewitt Joseph Witt added a comment - - edited

          Have tested numerous scenarios and all are giving solid results.

          • Sustained running tests
          • Tested with content repo w/archive where nothing in it and already beyond target archive retention
          • Tested with content repo w/archive where has some space to keep data
          • Tested where I changed from it having some archive space to not having any while data was in repo
          • Tested with backlogs including expiration on the queue
          • Tested with archiving turned off
          • Tested restarting NiFi halfway through a few of these other tests.

          +1 having tested the current support/nifi-0.5.x branch with the latest 2KB applied.

          However, in looking at the latest 37KB patch (as is on master & support/050), latest 39KB patch, latest 2KB patch I cannot quite tell that it is sufficient to push the 2KB patch now. There appears to be some Javadoc changes and some structural/code changes this misses between the 37KB+2KB patch and the previous 39KB patch. So Mark Payne please verify and clearly articulate the patch you mean to provide rebased against master (which presumably means it will apply to support/nifi-0.5.x)

          Show
          joewitt Joseph Witt added a comment - - edited Have tested numerous scenarios and all are giving solid results. Sustained running tests Tested with content repo w/archive where nothing in it and already beyond target archive retention Tested with content repo w/archive where has some space to keep data Tested where I changed from it having some archive space to not having any while data was in repo Tested with backlogs including expiration on the queue Tested with archiving turned off Tested restarting NiFi halfway through a few of these other tests. +1 having tested the current support/nifi-0.5.x branch with the latest 2KB applied. However, in looking at the latest 37KB patch (as is on master & support/050), latest 39KB patch, latest 2KB patch I cannot quite tell that it is sufficient to push the 2KB patch now. There appears to be some Javadoc changes and some structural/code changes this misses between the 37KB+2KB patch and the previous 39KB patch. So Mark Payne please verify and clearly articulate the patch you mean to provide rebased against master (which presumably means it will apply to support/nifi-0.5.x)
          Hide
          markap14 Mark Payne added a comment -

          Joe - yes, the 2 KB should be applied to master (and support). The 39 KB patch i think had some minor tweaks to the Javadocs that i unintentionally pushed without uploading the patch. However, the logic is all correct in both cases, and master & support branches are up to date with the appropriate Javadocs. You should be able to apply the 2 KB patch without issue. Please let me know if not.

          Thanks
          -Mark

          Show
          markap14 Mark Payne added a comment - Joe - yes, the 2 KB should be applied to master (and support). The 39 KB patch i think had some minor tweaks to the Javadocs that i unintentionally pushed without uploading the patch. However, the logic is all correct in both cases, and master & support branches are up to date with the appropriate Javadocs. You should be able to apply the 2 KB patch without issue. Please let me know if not. Thanks -Mark
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 5da9f985bfda7bdcbb102857899fdecb98658e64 in nifi's branch refs/heads/support/nifi-0.5.x from Mark Payne
          [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=5da9f98 ]

          NIFI-1527: Fixed issue that resulted in resource claims' claimant count getting incremented twice on restart

          Signed-off-by: joewitt <joewitt@apache.org>

          Show
          jira-bot ASF subversion and git services added a comment - Commit 5da9f985bfda7bdcbb102857899fdecb98658e64 in nifi's branch refs/heads/support/nifi-0.5.x from Mark Payne [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=5da9f98 ] NIFI-1527 : Fixed issue that resulted in resource claims' claimant count getting incremented twice on restart Signed-off-by: joewitt <joewitt@apache.org>
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 122cba0ee79b7a9bfdded094c76bc14f76735ca6 in nifi's branch refs/heads/master from Mark Payne
          [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=122cba0 ]

          NIFI-1527: Fixed issue that resulted in resource claims' claimant count getting incremented twice on restart

          Signed-off-by: joewitt <joewitt@apache.org>

          Show
          jira-bot ASF subversion and git services added a comment - Commit 122cba0ee79b7a9bfdded094c76bc14f76735ca6 in nifi's branch refs/heads/master from Mark Payne [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=122cba0 ] NIFI-1527 : Fixed issue that resulted in resource claims' claimant count getting incremented twice on restart Signed-off-by: joewitt <joewitt@apache.org>

            People

            • Assignee:
              markap14 Mark Payne
              Reporter:
              markap14 Mark Payne
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development