  Apache NiFi / NIFI-7992

Content Repository can fail to cleanup archive directory fast enough


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.13.0
    • Component/s: Core Framework
    • Labels: None

    Description

      In the scenario where a user is generating many small FlowFiles and has the "nifi.content.claim.max.appendable.size" property set to a small value, we can encounter a situation where data is constantly archived but not cleaned up quickly enough. As a result, the Content Repository can run out of space.

      The FileSystemRepository has a backpressure mechanism built in to prevent this, but under the above conditions it can sometimes fail to do so. The backpressure mechanism works by performing the following steps (a minimal sketch follows the list):

      1. When a new Content Claim is created, the Content Repository determines which 'container' to use.
      2. Content Repository checks if the amount of storage space used for the container is greater than the configured backpressure threshold.
      3. If so, the thread blocks until a background task completes cleanup of the archive directories.
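      Below is a minimal, self-contained sketch of that flow; the class and member names (BackpressureSketch, cachedBytesUsed, onCleanupComplete, and so on) are illustrative assumptions and do not mirror the actual FileSystemRepository internals:

      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.atomic.AtomicLong;

      // Illustrative sketch of the backpressure steps described above.
      public class BackpressureSketch {
          private final long backpressureThresholdBytes;
          // Step 2 reads this cached value; it is refreshed only by the cleanup task.
          private final AtomicLong cachedBytesUsed = new AtomicLong(0L);
          private final Object cleanupMonitor = new Object();

          public BackpressureSketch(final long backpressureThresholdBytes) {
              this.backpressureThresholdBytes = backpressureThresholdBytes;
          }

          // Called when a new Content Claim is created for a container (Step 1).
          public void waitForSpace() throws InterruptedException {
              synchronized (cleanupMonitor) {
                  // Step 2: compare the cached usage against the configured threshold.
                  while (cachedBytesUsed.get() >= backpressureThresholdBytes) {
                      // Step 3: block until the background archive cleanup reports progress.
                      cleanupMonitor.wait(TimeUnit.SECONDS.toMillis(10));
                  }
              }
          }

          // Invoked by the background archive-cleanup task after it finishes a pass.
          public void onCleanupComplete(final long bytesUsedAfterCleanup) {
              synchronized (cleanupMonitor) {
                  cachedBytesUsed.set(bytesUsedAfterCleanup);
                  cleanupMonitor.notifyAll();
              }
          }
      }

      The problem described next arises because waitForSpace() only ever sees the value that onCleanupComplete() last published.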

      However, in Step #2 above, it determines the amount of space currently being used by looking at a cached member variable. That cached member variable is updated only on the first iteration and whenever the aforementioned background task completes.

      So, now consider a case where there are millions of files in the content repository archive. The background task could take a massive amount of time performing cleanup. Meanwhile, processors are able to write to the repository without any backpressure being applied because the background task hasn't updated the cached variable for the amount of space used. This continues until the content repository fills.

      There are three simple but important changes that should be made:

      1. The background task should be faster in this case. While we cannot improve the amount of time it takes to destroy the files, we currently create an ArrayList to hold all of the file info and then remove entries through an Iterator, calling remove(). Under the hood, each removal shifts all of the remaining elements of the underlying array. On my laptop, performing this procedure on an ArrayList with 1 million elements took approximately 1 minute. Changing to a LinkedList took 15 milliseconds but required much more heap. Keeping an ArrayList and removing all of the elements at the end (via ArrayList.subList(0, n).clear()) resulted in performance similar to LinkedList with the memory footprint of ArrayList (see the first sketch after this list).
      2. The check of whether the content repository's usage has crossed the backpressure threshold should not rely entirely on a cache that is populated by a process that can take a long time to run. The repository should periodically calculate the disk usage itself (perhaps once per minute); see the second sketch after this list.
      3. When backpressure does get applied, it can appear that the system has frozen and is not performing any work. The background task that is clearing space should periodically log its progress at INFO level so that users understand that this action is taking place.
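      To illustrate the list-removal change from item 1, here is a small standalone comparison (not the actual archive-cleanup code): removing entries one at a time through an Iterator versus clearing the processed range in bulk with ArrayList.subList(0, n).clear(). With one million elements, the first strategy can take on the order of a minute, while the second completes in milliseconds.

      import java.util.ArrayList;
      import java.util.Iterator;
      import java.util.List;

      public class RemovalComparison {
          public static void main(final String[] args) {
              final int count = 1_000_000;

              // Strategy A: remove each element through the Iterator. On an ArrayList,
              // every remove() shifts all of the remaining elements, so the overall
              // cost is quadratic in the number of elements.
              final List<Integer> iteratorList = newList(count);
              long start = System.nanoTime();
              final Iterator<Integer> itr = iteratorList.iterator();
              while (itr.hasNext()) {
                  itr.next();
                  itr.remove();
              }
              System.out.println("Iterator removal: " + millisSince(start) + " ms");

              // Strategy B: keep the ArrayList, count how many elements were processed,
              // then remove the whole processed range in a single bulk operation.
              final List<Integer> bulkList = newList(count);
              start = System.nanoTime();
              int processed = 0;
              for (final Integer ignored : bulkList) {
                  processed++;
              }
              bulkList.subList(0, processed).clear();
              System.out.println("subList(0, n).clear(): " + millisSince(start) + " ms");
          }

          private static List<Integer> newList(final int count) {
              final List<Integer> list = new ArrayList<>(count);
              for (int i = 0; i < count; i++) {
                  list.add(i);
              }
              return list;
          }

          private static long millisSince(final long startNanos) {
              return (System.nanoTime() - startNanos) / 1_000_000;
          }
      }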

       
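      As a rough sketch of item 2, the usage check could fall back to querying the filesystem directly whenever its cached value is older than a fixed interval (one minute here). The ContainerUsageCheck class and its method names are hypothetical, not the repository's actual API:

      import java.io.IOException;
      import java.nio.file.FileStore;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.util.concurrent.TimeUnit;

      public class ContainerUsageCheck {
          private static final long MAX_CACHE_AGE_NANOS = TimeUnit.MINUTES.toNanos(1);

          private final Path containerPath;
          private volatile double cachedUsedRatio;
          private volatile long cacheTimestampNanos;

          public ContainerUsageCheck(final Path containerPath) throws IOException {
              this.containerPath = containerPath;
              refresh();
          }

          // Returns true if the container's disk usage has crossed the given threshold
          // (e.g., 0.5 for 50%), recomputing the usage if the cached value is stale.
          public boolean isBackpressureNeeded(final double thresholdRatio) throws IOException {
              if (System.nanoTime() - cacheTimestampNanos > MAX_CACHE_AGE_NANOS) {
                  refresh();
              }
              return cachedUsedRatio >= thresholdRatio;
          }

          private void refresh() throws IOException {
              final FileStore store = Files.getFileStore(containerPath);
              final long total = store.getTotalSpace();
              final long used = total - store.getUsableSpace();
              cachedUsedRatio = total == 0 ? 0.0 : (double) used / total;
              cacheTimestampNanos = System.nanoTime();
          }
      }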



    People

      Assignee: Mark Payne (markap14)
      Reporter: Mark Payne (markap14)
      Votes: 0
      Watchers: 3


    Time Tracking

      Estimated: Not Specified
      Remaining: 0h
      Logged: 0.5h