Apache NiFi / NIFI-8676

ListS3 and ListGCSBucket sometimes miss objects in very active buckets



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13.2
    • Fix Version/s: 1.16.0
    • Component/s: Extensions


      ListS3 and ListGCSBucket occasionally miss some objects in very active buckets and never list them. Through testing, it appears that using only an object's last modified date for state tracking is unreliable when a large batch of objects of various sizes is uploaded simultaneously. For some reason, newer but smaller files are sometimes listed before older but larger files. This breaks the timestamp tracking state of the ListS3 and ListGCSBucket processors: once the tracked timestamp has moved past an older object's last modified date, that object is never listed.
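The failure mode described above can be illustrated with a minimal simulation. This is a simplified model, not NiFi's actual listing code: the `StoredObject` class, the `visible_at` field, and the numeric timestamps are all hypothetical, and the sketch assumes that an object's last modified date can be earlier than the moment it first appears in a listing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    key: str
    last_modified: int  # timestamp recorded on the object (arbitrary units)
    visible_at: int     # when the object first appears in a listing (assumption)


def list_by_timestamp(objects, now, tracked_ts):
    """Return (newly listed keys, updated tracked timestamp).

    Only objects visible at `now` whose last_modified is strictly newer
    than the tracked timestamp are listed.
    """
    visible = [o for o in objects if o.visible_at <= now]
    fresh = [o for o in visible if o.last_modified > tracked_ts]
    new_ts = max([tracked_ts] + [o.last_modified for o in fresh])
    return sorted(o.key for o in fresh), new_ts


# Hypothetical bucket: the large object is older but becomes visible later.
bucket = [
    StoredObject("large.bin", last_modified=100, visible_at=130),
    StoredObject("small.txt", last_modified=105, visible_at=106),
]

tracked = 0
first_run, tracked = list_by_timestamp(bucket, now=110, tracked_ts=tracked)
second_run, tracked = list_by_timestamp(bucket, now=140, tracked_ts=tracked)

print(first_run)   # only the small object is visible and listed
print(second_run)  # the large object is now visible but older than the tracked timestamp
```

In this model the tracked timestamp is 105 after the first run, so `large.bin` (last modified 100) is skipped on every later run even though it is present in the bucket.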

      We have flows that operate as ListS3 -> FetchS3Object -> DeleteS3Object -> (downstream processing) and ListGCSBucket -> FetchGCSObject -> DeleteGCSObject -> (downstream processing). We often notice files remain in the bucket until we manually clear the state of the relevant List processor and restart it. Examining the provenance logs shows that the objects that remained were never listed, which is confirmed by logs within the downstream processing showing the objects never made it there.

      Attached is a sample flow.xml.gz file which replicates this problem by simulating extreme conditions for both GCS and S3. Two GenerateFlowFile processors run with a schedule of 0.01 seconds. One of them generates flow files of size 1 B and the other generates flow files of size 1 GB. These feed into a PutS3Object or PutGCSObject processor which is set to use 10 concurrent threads, thus allowing 10 files to be uploaded simultaneously. The queue connected to the Put processors does not limit the number or size of flow files, in order to prevent backpressure from unbalancing the mix of small and large sample flow files being uploaded simultaneously.

      Another flow within the attached sample flow.xml.gz file uses ListS3/ListGCSBucket -> DeleteS3Object/DeleteGCSObject to mimic the receiving end where objects are missed. The List processors are set to a run schedule of 0 seconds to cause listing to occur as frequently as possible. After starting both the sending and receiving flows, you should see within a few seconds to a minute that the counts of flow files put into GCS or S3 are higher than the count of flow files output by the List processors. Additionally, if you stop the Put flow but let the receiving flow with its Delete processor continue to run, objects will remain in the bucket even after all queues are flushed. Examining provenance logs will confirm that those objects were never listed. Stopping the List processor, clearing its state, and restarting it will cause these remaining objects to be listed and then deleted by the Delete processor.
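The effect of clearing the List processor's state can be sketched the same way. This is an illustrative model under stated assumptions, not NiFi code: `run_listing`, the object keys, and the timestamps are hypothetical, and "clearing state" is modeled simply as resetting the stored timestamp to 0.

```python
def run_listing(objects, tracked_ts):
    """List (key, last_modified) pairs newer than tracked_ts.

    Returns the listed keys and the updated tracked timestamp.
    """
    fresh = [(key, ts) for key, ts in objects if ts > tracked_ts]
    new_ts = max([tracked_ts] + [ts for _, ts in fresh])
    return [key for key, _ in fresh], new_ts


# Hypothetical objects left in the bucket after the Put flow was stopped.
# Each one is older than the timestamp the processor has already stored.
leftover = [("large-0001", 100), ("large-0002", 102)]
stored_ts = 105

# With the existing state, nothing is listed no matter how often this runs.
listed_before, stored_ts = run_listing(leftover, stored_ts)

# Clearing the state (modeled as resetting the timestamp) lets them through.
stored_ts = 0
listed_after, stored_ts = run_listing(leftover, stored_ts)

print(listed_before)
print(listed_after)
```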

      We do not run into this problem with ListAzureBlobStorage since we can set it to track entities and not just track timestamps. ListS3 and ListGCSBucket do not allow tracking by entities and are hard-coded to only track timestamps. It'd be great if they could track by entities or if the timestamp issue could be resolved.
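For comparison, a rough sketch of entity-based tracking follows. It assumes the tracker remembers every key together with the timestamp it was last seen with; the function name and sample data are hypothetical, and a real entity-tracking strategy would typically also bound how much history it keeps, which this sketch ignores.

```python
def list_by_entities(visible_objects, seen):
    """Return keys that are new or changed, plus the updated seen-map.

    `visible_objects` is a list of (key, last_modified) pairs and
    `seen` maps each already-listed key to its last_modified value.
    """
    fresh = [(key, ts) for key, ts in visible_objects if seen.get(key) != ts]
    updated = dict(seen)
    updated.update(fresh)
    return sorted(key for key, _ in fresh), updated


seen = {}

# First run: only the small, newer object is visible.
first, seen = list_by_entities([("small.txt", 105)], seen)

# Second run: the older, larger object has appeared and is still picked up,
# because the decision is based on what was seen rather than on a timestamp.
second, seen = list_by_entities([("small.txt", 105), ("large.bin", 100)], seen)

# Third run: nothing changed, so nothing is listed again.
third, seen = list_by_entities([("small.txt", 105), ("large.bin", 100)], seen)

print(first, second, third)
```

The trade-off in this model is memory: the seen-map grows with the number of objects, whereas timestamp tracking stores a single value.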


        Attachments

        1. flow.xml.gz (3 kB, attached by Paul Kelly)

              Assignee: tpalfy Tamas Palfy
              Reporter: pkelly.nifi Paul Kelly



                Time Tracking

                  Original Estimate - Not Specified
                  Remaining Estimate - 0h
                  Time Spent - 4.5h