  1. Beam
  2. BEAM-2858

temp file garbage collection in BigQuery sink should be in a separate DoFn

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.2.0
    • Component/s: io-java-gcp
    • Labels:
      None

      Description

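      Currently, the WriteTables transform deletes its set of input files as soon as the load() job completes. This is incorrect: if the task fails partway through deleting the files (e.g. if the worker crashes), the task will be retried, and the retry will re-run the load against input files that have already been partially deleted. If the write disposition is WRITE_TRUNCATE, the retried load can then replace the table contents using an incomplete set of input files.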

      The resulting behavior depends on what BQ does when one of the input files is missing (because we had previously deleted it). In the best case, BQ fails the load, and the step keeps failing until the runner finally fails the entire job. If, however, BQ ignores the missing file, the load overwrites the previously written table with the smaller set of files and the job succeeds. This is the worst-case scenario, as it results in data loss.
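
      The failure mode above, and why moving cleanup into its own step avoids it, can be illustrated with a small plain-Java simulation. This is only a sketch under stated assumptions, not the Beam implementation: the class and every method in it (lenientLoad, strictLoad, deleteAll, and so on) are hypothetical stand-ins, the "file system" is an in-memory set, and a retry is modeled with a catch block. It does not call BigQuery or the Beam SDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class TempFileCleanupSketch {

    /** Hypothetical WRITE_TRUNCATE load that silently skips missing files (worst case). */
    static List<String> lenientLoad(List<String> inputs, Set<String> fs) {
        List<String> table = new ArrayList<>();
        for (String file : inputs) {
            if (fs.contains(file)) {
                table.add("rows-of-" + file);
            }
        }
        return table; // stands for the new table contents after truncation
    }

    /** Hypothetical load that fails when any input file is missing (best case). */
    static List<String> strictLoad(List<String> inputs, Set<String> fs) {
        if (!fs.containsAll(inputs)) {
            throw new IllegalStateException("load failed: input file missing");
        }
        return lenientLoad(inputs, fs);
    }

    /** Deletes inputs in order; crashes once crashAfter files are gone (negative: never). */
    static void deleteAll(List<String> inputs, Set<String> fs, int crashAfter) {
        int deleted = 0;
        for (String file : inputs) {
            if (deleted == crashAfter) {
                throw new IllegalStateException("simulated worker crash");
            }
            fs.remove(file); // no-op if an earlier attempt already removed it
            deleted++;
        }
    }

    /** Load and cleanup as one retried unit, as described in this issue. */
    static List<String> combinedStepWithRetry(List<String> inputs, Set<String> fs) {
        try {
            List<String> table = lenientLoad(inputs, fs);
            deleteAll(inputs, fs, 1); // crash after one file is deleted
            return table;
        } catch (IllegalStateException crash) {
            // The whole step is retried, so the load runs again on fewer files.
            List<String> table = lenientLoad(inputs, fs);
            deleteAll(inputs, fs, -1);
            return table;
        }
    }

    /** Load is committed first; cleanup is a separate step, so a retry only repeats deletion. */
    static List<String> separateStepsWithRetry(List<String> inputs, Set<String> fs) {
        List<String> table = lenientLoad(inputs, fs); // step 1, result committed
        try {
            deleteAll(inputs, fs, 1); // step 2 crashes partway through
        } catch (IllegalStateException crash) {
            deleteAll(inputs, fs, -1); // retry of step 2 only
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("a.json", "b.json", "c.json");
        // Combined step: the retried load sees only two of the three files.
        System.out.println(combinedStepWithRetry(inputs, new LinkedHashSet<>(inputs)).size());
        // Separate steps: the table keeps all three files' rows.
        System.out.println(separateStepsWithRetry(inputs, new LinkedHashSet<>(inputs)).size());
    }
}
```

      In this model the combined step ends with rows from two files while the separated version keeps all three. With strictLoad in place of lenientLoad, the retried combined step would instead throw on every attempt, which corresponds to the best case where the runner eventually fails the job.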

        Attachments

        1. delete_file_diff.txt
          2 kB
          Chamikara Madhusanka Jayalath

              People

              • Assignee:
                reuvenlax Reuven Lax
                Reporter:
                reuvenlax Reuven Lax
