Description
`WriteToBigQuery` returns an `errors` result when it tries to insert rows that do not match the BigQuery table schema. `errors` is a dictionary that contains a single `FailedRows` key. `FailedRows` maps to a list of two-element tuples: the BigQuery table name and the row that did not match the schema.
This can be verified by running the BigQueryIO dead letter pattern: https://beam.apache.org/documentation/patterns/bigqueryio/
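For reference, a minimal sketch along the lines of that pattern; the table spec, schema, and element values are placeholders, not part of the issue:

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery_tools import RetryStrategy

with beam.Pipeline() as pipeline:
    errors = (
        pipeline
        | 'CreateInvalidRow' >> beam.Create([{'id': 'not-an-int'}])  # violates the schema below
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            'my-project:my_dataset.my_table',  # placeholder table spec
            schema='id:INTEGER',
            method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
            insert_retry_strategy=RetryStrategy.RETRY_NEVER,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

    # Each element of FailedRows is a (table, row) tuple. The error
    # reason only appears in the worker logs today.
    _ = (
        errors['FailedRows']
        | 'PrintFailedRows' >> beam.Map(print))
```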
Using this approach I can print the failed rows in a pipeline. When running the job, the logger simultaneously prints out the reason why the rows were invalid. That reason should also be included in the tuple, in addition to the BigQuery table name and the raw row, so that a subsequent pipeline stage can process both the invalid row and the reason it was rejected (see the sketch below).
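A sketch of the downstream handling this issue is asking for, extending the example above; the three-element `(table, row, reason)` shape is the proposal here, not current Beam behaviour:

```python
# Hypothetical: each FailedRows element would be (table, row, reason)
# rather than (table, row); 'reason' does not exist in Beam today and is
# the change this issue proposes.
def unpack_failed_row(failed):
    table, row, reason = failed  # proposed tuple shape
    return {'table': table, 'failed_row': str(row), 'error_reason': str(reason)}

_ = (
    errors['FailedRows']
    | 'UnpackFailures' >> beam.Map(unpack_failed_row)
    | 'HandleFailures' >> beam.Map(print))  # or route to a dead letter sink
```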
During my research I found a couple of alternative solutions, but I think they are more complex than they need to be. That is why I explored the Beam source code and found that this would be an easy and simple change.
Issue Links
- is related to: BEAM-14447 BigQueryWriteIntegrationTests.test_big_query_write_insert_errors_reporting failing in Python PostCommit (Resolved)
- links to