Beam / BEAM-8367

Python BigQuery sink should use unique IDs for mode STREAMING_INSERTS

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P0
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.17.0
    • Component/s: sdk-py-core
    • Labels: None

    Description

      Unique IDs ensure (on a best-effort basis) that writes to BigQuery are idempotent, so that, for example, we do not write the same record twice after a VM failure.

      Currently, the Python BQ sink generates BQ insert IDs here, but they will be re-generated after a VM failure, resulting in data duplication.

      https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L766

      The correct fix is to perform a Reshuffle to checkpoint the unique IDs once they are generated, similar to how the Java BQ sink operates.

      https://github.com/apache/beam/blob/dcf6ad301069e4d2cfaec5db6b178acb7bb67f49/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L225
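      The failure mode and the fix can be illustrated without the Beam SDK. The sketch below is a hypothetical, pure-Python model: `FakeBigQueryTable` stands in for BigQuery's best-effort deduplication on `insertId`, and "checkpointing" is modeled by reusing the already-generated (id, row) pairs on retry, the way a Reshuffle would materialize them in the real pipeline. None of these names come from the Beam codebase.

```python
import uuid


class FakeBigQueryTable:
    """Simulates BigQuery streaming inserts, which deduplicate rows
    (best effort) by their insertId."""

    def __init__(self):
        self.rows = []
        self.seen_ids = set()

    def insert_all(self, rows_with_ids):
        for insert_id, row in rows_with_ids:
            if insert_id in self.seen_ids:
                continue  # duplicate insertId: row is dropped
            self.seen_ids.add(insert_id)
            self.rows.append(row)


def tag_with_fresh_ids(rows):
    # IDs generated at write time; a retry re-runs this and gets new IDs.
    return [(str(uuid.uuid4()), row) for row in rows]


rows = [{"v": 1}, {"v": 2}]

# Buggy behavior: a retry after a worker failure re-generates the IDs,
# so BigQuery cannot recognize the rows as duplicates.
table = FakeBigQueryTable()
table.insert_all(tag_with_fresh_ids(rows))  # first attempt
table.insert_all(tag_with_fresh_ids(rows))  # retry after failure
assert len(table.rows) == 4                 # duplicated

# Fixed behavior: IDs are generated once and checkpointed (in Beam, by a
# Reshuffle after ID generation), so the retry replays the same IDs.
table = FakeBigQueryTable()
tagged = tag_with_fresh_ids(rows)           # generated once, then checkpointed
table.insert_all(tagged)                    # first attempt
table.insert_all(tagged)                    # retry replays the same IDs
assert len(table.rows) == 2                 # deduplicated
```

      The essential point is that the unique IDs must be part of the checkpointed data, not regenerated by the step that performs the streaming insert.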

       

      Pablo, can you do an initial assessment here?

      I think this is a relatively small fix but I might be wrong.


            People

              Assignee: Pablo Estrada (pabloem)
              Reporter: Chamikara Madhusanka Jayalath (chamikara)


              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 2h
