Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
To query a Hudi table from BigQuery, the current BigQuerySyncTool creates two BigQuery external tables: one over the data files and one over a manifest file that lists the data file names. On top of these two tables, it creates a view that reflects the latest version of the data using the following query: "SELECT * FROM data_table WHERE _hoodie_file_name IN ( SELECT filename FROM manifest_file_table)".
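The view-based workaround can be sketched as a small helper that assembles the query quoted above. This is only an illustration: the function names and the table and view identifiers are hypothetical, and how BigQuerySyncTool actually submits the statement is not shown in this ticket.

```python
def build_view_query(data_table: str, manifest_table: str) -> str:
    """Select only the rows whose file is listed in the manifest table."""
    return (
        f"SELECT * FROM `{data_table}` "
        f"WHERE _hoodie_file_name IN (SELECT filename FROM `{manifest_table}`)"
    )


def build_create_view_statement(view: str, data_table: str, manifest_table: str) -> str:
    """Wrap the filter query in a CREATE OR REPLACE VIEW statement."""
    return f"CREATE OR REPLACE VIEW `{view}` AS {build_view_query(data_table, manifest_table)}"


# Hypothetical identifiers, for illustration only.
statement = build_create_view_statement(
    "my_dataset.trips",
    "my_dataset.trips_versions",
    "my_dataset.trips_manifest",
)
print(statement)
```

Because the view filters on _hoodie_file_name, only rows from data files named in the current manifest are visible, which is how the view tracks the latest version of the data.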
The direct reason for this workaround is that BigQuery has not supported manifest files. However, BigQuery is rolling out manifest file support, allowing users to specify a manifest file as the source URIs. Right now the feature[1] roll-out seems to cover only non-partitioned external tables (using Hive partitioning returns the error "file_set_spec_type option is not supported for hive partition"); it should cover partitioned external tables soon.
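As a rough sketch of what pointing an external table at a manifest could look like, the helper below builds a DDL statement that sets the file_set_spec_type option named in the error message. The value 'NEW_LINE_DELIMITED_MANIFEST' is taken from the BigQuery table options reference as I understand it and should be checked against [1]; the function name, table name, and manifest URI are hypothetical.

```python
def build_manifest_external_table_ddl(table: str, manifest_uri: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement whose source URI is a manifest file."""
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{table}`\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['{manifest_uri}'],\n"
        # Assumption: tells BigQuery that the URI is a newline-delimited list of files.
        "  file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST'\n"
        ")"
    )


# Hypothetical table name and manifest location, for illustration only.
print(build_manifest_external_table_ddl(
    "my_dataset.trips",
    "gs://my-bucket/trips/.hoodie/manifest/latest-snapshot.csv",
))
```

The statement deliberately omits any Hive partitioning options, since combining them with file_set_spec_type is what triggers the error quoted above.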
Given this new BigQuery feature, it would be better to update BigQuerySyncTool accordingly:
- Allow creating a BigQuery-compatible manifest file, which expects absolute paths of the data files. This has been done in HUDI-6254.
- Allow using the new manifest file to create an external table directly. This can be done by issuing one "CREATE EXTERNAL TABLE" query to BigQuery.
- Avoid breaking existing user workflows. In case some users rely on the view-based workaround, it probably makes sense to keep the workaround alive, at least for now. That would require maintaining two versions of the manifest file.
- Provide a temporary workaround for using BigQuery manifest file support until the feature extends to partitioned tables. Since it currently does not support Hive partitioning, "CREATE EXTERNAL TABLE" can only create a table over all the Parquet data files without recognizing the partition columns. To keep the partition columns, a possible workaround is to set "hoodie.datasource.write.drop.partition.columns" to false and allow users to leave "hoodie.gcp.bigquery.sync.source_uri_prefix" unspecified, so that the partition columns are written into the Parquet files and BigQuerySyncTool does not try to create a Hive-partitioned external table.
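The difference between the two manifest versions and the temporary configuration can be sketched as follows. This is a minimal illustration, not the HUDI-6254 implementation: the helper names and the bucket path are hypothetical, and only the two configuration keys come from this ticket.

```python
from typing import Dict, List


def to_absolute_manifest_lines(base_path: str, relative_files: List[str]) -> List[str]:
    """Turn partition-relative file paths into the absolute URIs the new manifest needs."""
    base = base_path.rstrip("/")
    return [f"{base}/{rel.lstrip('/')}" for rel in relative_files]


def to_file_name_manifest_lines(relative_files: List[str]) -> List[str]:
    """Keep only the file names, as the view-based workaround matches on _hoodie_file_name."""
    return [rel.rsplit("/", 1)[-1] for rel in relative_files]


def temporary_workaround_properties() -> Dict[str, str]:
    """Write partition columns into the Parquet files so they survive without Hive partitioning."""
    # hoodie.gcp.bigquery.sync.source_uri_prefix is intentionally left unset here,
    # so that no Hive-partitioned external table is requested.
    return {"hoodie.datasource.write.drop.partition.columns": "false"}


files = ["year=2023/part-0001.parquet", "year=2024/part-0002.parquet"]
print(to_absolute_manifest_lines("gs://my-bucket/trips/", files))
print(to_file_name_manifest_lines(files))
print(temporary_workaround_properties())
```

Keeping both helpers side by side mirrors the point about maintaining two versions of the manifest file: one with bare file names for the existing view, and one with absolute paths for the manifest-backed external table.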
[1] https://cloud.google.com/bigquery/docs/information-schema-table-options#options_table