Description
`WriteToBigQuery` returns an `errors` result when it tries to insert rows that do not match the BigQuery table schema. `errors` is a dictionary that contains a single `FailedRows` key. `FailedRows` maps to a list of two-element tuples: the BigQuery table name and the row that did not match the schema.
This can be verified by running the BigQueryIO dead letter pattern: https://beam.apache.org/documentation/patterns/bigqueryio/
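For reference, a minimal sketch along the lines of that pattern; the table spec, schema, and element values are placeholders, not part of the issue:

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery_tools import RetryStrategy

with beam.Pipeline() as pipeline:
    errors = (
        pipeline
        | 'CreateInvalidRow' >> beam.Create([{'id': 'not-an-int'}])  # violates the schema below
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            'my-project:my_dataset.my_table',  # placeholder table spec
            schema='id:INTEGER',
            method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
            insert_retry_strategy=RetryStrategy.RETRY_NEVER,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

    # Each element of FailedRows is a (table, row) tuple. The error
    # reason only appears in the worker logs today.
    _ = (
        errors['FailedRows']
        | 'PrintFailedRows' >> beam.Map(print))
```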
Using this approach I can print the failed rows in a pipeline. When running the job, the logger simultaneously prints out the reason why the rows were invalid. That reason should also be included in the tuple, in addition to the BigQuery table name and the raw row, so that a subsequent pipeline stage can process both the invalid row and the reason it was rejected (see the sketch below).
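A sketch of the downstream handling this issue is asking for, extending the example above; the three-element `(table, row, reason)` shape is the proposal here, not current Beam behaviour:

```python
# Hypothetical: each FailedRows element would be (table, row, reason)
# rather than (table, row); 'reason' does not exist in Beam today and is
# the change this issue proposes.
def unpack_failed_row(failed):
    table, row, reason = failed  # proposed tuple shape
    return {'table': table, 'failed_row': str(row), 'error_reason': str(reason)}

_ = (
    errors['FailedRows']
    | 'UnpackFailures' >> beam.Map(unpack_failed_row)
    | 'HandleFailures' >> beam.Map(print))  # or route to a dead letter sink
```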
During my research I found a couple of alternative solutions, but I think they are more complex than they need to be. That is why I explored the Beam source code and found that this would be an easy and simple change.
Issue Links
- is related to: BEAM-14447 BigQueryWriteIntegrationTests.test_big_query_write_insert_errors_reporting failing in Python PostCommit (Resolved)
- links to