Details
Type: Bug
Status: Resolved
Priority: P0
Resolution: Fixed
Description
Unique IDs ensure (best effort) that writes to BigQuery are idempotent, so that, for example, the same record is not written twice after a VM failure.
Currently the Python BQ sink inserts the unique IDs here, but they will be re-generated after a VM failure, resulting in data duplication:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L766
The correct fix is to apply a Reshuffle to checkpoint the unique IDs once they are generated, similar to how the Java BQ sink operates.
Pablo, can you do an initial assessment here?
I think this is a relatively small fix, but I might be wrong.
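For illustration, a minimal sketch of the proposed shape of the fix (not the actual bigquery.py code; names such as GenerateInsertIds and StreamToBigQuery are hypothetical): generate the insert IDs in one step, then apply a Reshuffle so the IDs are checkpointed before the streaming-insert step. After a VM failure, retries downstream of the Reshuffle reuse the same IDs, so BigQuery can deduplicate the rows.

```python
import uuid

import apache_beam as beam


class GenerateInsertIds(beam.DoFn):
    """Tags each row with a random unique insert ID (illustrative)."""

    def process(self, row):
        yield uuid.uuid4().hex, row


class StreamToBigQuery(beam.DoFn):
    """Placeholder for the step that performs streaming inserts using
    the (insert_id, row) pairs it receives."""

    def process(self, id_and_row):
        insert_id, row = id_and_row
        # ... issue the BigQuery streaming insert with insert_id ...
        yield insert_id


def write_rows(rows):
    return (
        rows
        | 'GenerateInsertIds' >> beam.ParDo(GenerateInsertIds())
        # Reshuffle acts as a checkpoint: IDs generated upstream are
        # materialized, so they are not regenerated if the insert step
        # is retried after a VM failure.
        | 'CheckpointInsertIds' >> beam.Reshuffle()
        | 'StreamToBigQuery' >> beam.ParDo(StreamToBigQuery()))
```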