Details
Type: Bug
Status: Open
Priority: P3
Resolution: Unresolved
2.6.0
None
Description
I checked the Beam code and found that Dataflow queries BigQuery and retrieves the results using pagination [1]. As we understand it, this means a BigQuery table is read without any parallelism, which contradicts what the documentation says [2].
Is this a work in progress? I'm filing this as a bug because the documentation says the read goes through GCS, while the code uses a NativeSourceReader, which yields data one row at a time as an iterator.
[1] https://github.com/apache/beam/blob/520b3a24e49306c30940ceab09100d775a04d28e/sdks/python/apache_beam/io/gcp/bigquery.py#L1083
[2] https://github.com/apache/beam/blob/520b3a24e49306c30940ceab09100d775a04d28e/sdks/python/apache_beam/io/gcp/bigquery.py#L60
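
To make the concern concrete, here is a minimal sketch of what a paginated, iterator-based read looks like in general. It is not the Beam or Dataflow implementation: `iter_rows`, `make_fake_fetcher`, and the integer-offset page tokens are hypothetical names and assumptions used only for illustration. The point is that each page can only be requested after the previous one has returned its token, so a single reader walks the whole result set sequentially.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Row = Dict[str, object]
# A page fetcher takes an optional page token and returns (rows, next_token).
PageFetcher = Callable[[Optional[str]], Tuple[List[Row], Optional[str]]]


def iter_rows(fetch_page: PageFetcher) -> Iterator[Row]:
    """Yield rows one at a time, following page tokens in order.

    The next page cannot be requested until the current page has
    returned its token, so the read is inherently sequential.
    """
    token: Optional[str] = None
    while True:
        rows, token = fetch_page(token)
        for row in rows:
            yield row
        if token is None:
            break


def make_fake_fetcher(table: List[Row], page_size: int) -> PageFetcher:
    """Hypothetical in-memory stand-in for a paginated query-results API."""

    def fetch_page(token: Optional[str]) -> Tuple[List[Row], Optional[str]]:
        # Assumption for this sketch: the token is just the next row offset.
        start = int(token) if token is not None else 0
        end = start + page_size
        next_token = str(end) if end < len(table) else None
        return table[start:end], next_token

    return fetch_page


if __name__ == "__main__":
    table = [{"id": i} for i in range(5)]
    for row in iter_rows(make_fake_fetcher(table, page_size=2)):
        print(row)
```

By contrast, a read that first exports the table to files (for example on GCS, as the documentation describes) can hand different files to different workers, which is the parallelism the description expects.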