We sometimes see users (especially those with a mix of flows where some produce very large FlowFiles and some produce tons of tiny FlowFiles) run into an issue where the UI shows very little space is used up by FlowFiles but the content repository fills up.
Yesterday I was on a call with such a team. Their NiFi UI showed one node had about 200,000 FlowFiles totaling dozens of MB. However, the content repository on that node was using 300 GB, which was all of the space available to the content repo. As a result, their NiFi instance stopped processing data because the content repo was completely full.
We did some analysis to check whether "orphaned" FlowFiles were filling the content repository, but there were none. Instead, the nifi.sh diagnostics --verbose command showed us that a handful of queues were causing the content repo to retain those hundreds of GB of data, even though the FlowFiles themselves only amounted to a few MB.
This is a known issue and is caused by how we write FlowFile Content to disk, using the same file on disk for many content claims. By default, we allow up to 1 MB to be written to a file before we conclude that we should no longer write additional FlowFiles to it. This is controlled by the "nifi.content.claim.max.appendable.size" property.
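To make the mechanism concrete, here is a minimal sketch of the effect. It is a simplified model, not NiFi's actual implementation. The class, method, and parameter names (ClaimRetentionSketch, ResourceFile, retainedBytes, keepEveryNth) are hypothetical. The model assumes that a file keeps accepting content until it has reached the maximum appendable size, and that a file can only be removed once no queued FlowFile references any content in it.

```java
import java.util.ArrayList;
import java.util.List;

class ClaimRetentionSketch {

    /** Hypothetical model of one file on disk holding the content of many FlowFiles. */
    static final class ResourceFile {
        long bytesWritten = 0; // total bytes appended to this file
        int liveClaims = 0;    // content claims still referenced by queued FlowFiles
    }

    /**
     * Appends each FlowFile's content to the current file and starts a new file
     * once the current one has reached maxAppendableBytes. Only every
     * keepEveryNth FlowFile stays queued; the rest are treated as dropped.
     * Returns the bytes held on disk by files that still have a live claim.
     */
    static long retainedBytes(long[] flowFileSizes, long maxAppendableBytes, int keepEveryNth) {
        List<ResourceFile> files = new ArrayList<>();
        ResourceFile current = new ResourceFile();
        files.add(current);
        for (int i = 0; i < flowFileSizes.length; i++) {
            if (current.bytesWritten >= maxAppendableBytes) {
                current = new ResourceFile();
                files.add(current);
            }
            current.bytesWritten += flowFileSizes[i];
            if (i % keepEveryNth == 0) {
                current.liveClaims++; // this FlowFile is still sitting in a queue
            }
        }
        long retained = 0;
        for (ResourceFile file : files) {
            if (file.liveClaims > 0) {
                retained += file.bytesWritten; // the whole file stays on disk
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        long[] sizes = new long[1000];
        java.util.Arrays.fill(sizes, 100L); // 1,000 FlowFiles of 100 bytes each
        // Keep 1 in 10 FlowFiles queued, then compare two max appendable sizes.
        System.out.println("max 1000 bytes: " + retainedBytes(sizes, 1000L, 10) + " bytes retained");
        System.out.println("max 100 bytes:  " + retainedBytes(sizes, 100L, 10) + " bytes retained");
    }
}
```

In this toy model the queued FlowFiles hold the same amount of live content in both runs. Only the amount of already-dropped content that shares a file with them changes.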
The support team indicates that this happens frequently. We need to change the default value of this property from "1 MB" to "50 KB". This will dramatically decrease the incidence rate.
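For anyone hitting this before the default changes, the same value can be set explicitly. The fragment below assumes the standard conf/nifi.properties file. Changes to that file generally take effect only after NiFi is restarted.

```properties
# conf/nifi.properties
# Proposed new default; the current default is 1 MB
nifi.content.claim.max.appendable.size=50 KB
```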
I set up a flow to test this locally. I queued up 5,000 FlowFiles totaling 610 KB, and the Content Repo was taking 45 GB of disk space. I then dropped all data, changed this property from the default 1 MB to 50 KB, and repeated the test. As expected, with the same number of FlowFiles queued (610 KB worth), the content repo occupied only 2.6 GB of disk space. I.e., making the value 5% of the original value resulted in occupying only about 5-6% as much "unnecessary" disk space (2.6 GB vs. 45 GB).
My performance tests indicate that performance was approximately the same, regardless of whether I used "1 MB" or "50 KB".
Additionally, when running the nifi.sh diagnostics --verbose command, the information that was necessary for tracking down the root cause of this was made available but took tremendous effort to decipher. We should update the diagnostics output when scanning the content repo to show the amount of data in the content repo that is being retained by each queue in the flow.
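As a rough illustration of the per-queue figure proposed above, the sketch below sums, for each queue, the size of every distinct content repo file that at least one of its FlowFiles still references. Everything here is hypothetical: the QueuedFlowFile record, the retainedBytesByQueue method, and the queue and file identifiers are made up for the example and do not reflect NiFi's internal classes or the actual diagnostics output. It assumes Java 16 or later for records.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class QueueRetentionSketch {

    /** Hypothetical view of one queued FlowFile: its queue and the file holding its content. */
    record QueuedFlowFile(String queueName, String resourceFileId) {}

    /**
     * For each queue, sums the on-disk size of every distinct file referenced
     * by at least one FlowFile in that queue. A file shared by two queues is
     * counted once for each of them.
     */
    static Map<String, Long> retainedBytesByQueue(List<QueuedFlowFile> flowFiles,
                                                  Map<String, Long> resourceFileSizes) {
        Map<String, Set<String>> filesByQueue = new LinkedHashMap<>();
        for (QueuedFlowFile ff : flowFiles) {
            filesByQueue.computeIfAbsent(ff.queueName(), k -> new HashSet<>())
                        .add(ff.resourceFileId());
        }
        Map<String, Long> retained = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : filesByQueue.entrySet()) {
            long total = 0;
            for (String fileId : entry.getValue()) {
                total += resourceFileSizes.getOrDefault(fileId, 0L);
            }
            retained.put(entry.getKey(), total);
        }
        return retained;
    }

    public static void main(String[] args) {
        List<QueuedFlowFile> queued = List.of(
                new QueuedFlowFile("queue-A", "file-1"),
                new QueuedFlowFile("queue-A", "file-1"),
                new QueuedFlowFile("queue-A", "file-2"),
                new QueuedFlowFile("queue-B", "file-3"));
        Map<String, Long> fileSizes = Map.of(
                "file-1", 1_048_576L,
                "file-2", 5_000_000L,
                "file-3", 2_000_000L);
        retainedBytesByQueue(queued, fileSizes)
                .forEach((queue, bytes) -> System.out.println(queue + " retains " + bytes + " bytes"));
    }
}
```

Sorting the result by retained bytes would put the handful of offending queues at the top, which is the information that currently takes so much effort to extract.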