Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-12950

Missing events when using Python WriteToFiles in streaming pipeline

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • P1
    • Resolution: Fixed
    • 2.32.0
    • 2.34.0
    • io-py-files
    • None

    Description

      We have a Python streaming pipeline consuming events from PubSub and writing them into GCS, in Dataflow.

      After performing some tests, we realized that we were missing events. The reason was that some files were being deleted from the temporary folder before moving them to the destination.

      In the logs we can see that there might be a race condition: the code checks if the file exists in temp folder before it has been actually created, Thus it’s not moved to the destination. Afterwards, the file is considered orphaned and is deleted from the temp folder in this line: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L677.

      Since files are being moved from the temp folder to the final destination, they shouldn’t be deleted in any case, otherwise we would lose events. For what purpose we have “orphaned files”? Should they exist?

      We know that Python SDK WriteToFiles is still experimental, but missing events looks like a big deal and that's why I've created it as P1. If you think it should be lowered let me know.

      The easiest and safest approach right now would be to not delete any files and log a message to warn that some files might be left orphaned in the temporary folder. Eventually, in the next execution, that orphaned file will be moved to the destination, so we don’t lose any event. In addition, the log level should be always INFO, because Dataflow doesn’t accept DEBUG.level. 

      Attachments

        Issue Links

          Activity

            People

              davidpr David
              davidpr David
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 4.5h
                  4.5h