[HUDI-6333] allow using the manifest file with absolute path to directly create one bigquery external table over the Hudi table - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.14.0
Component/s: meta-sync
Labels:
- pull-request-available

Description

To query a Hudi table from BigQuery, the current BigQuerySyncTool creates two BigQuery external tables: one over the data files and one over a manifest file that lists the data file names. Based on these two tables, it creates a view that reflects the latest version of the data using the following query: "SELECT * FROM data_table WHERE _hoodie_file_name IN (SELECT filename FROM manifest_file_table)".
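The view-based workaround above can be sketched as a small query builder. This is only an illustration: the function names and the `CREATE OR REPLACE VIEW` wrapper are assumptions, not the actual BigQuerySyncTool code; only the inner SELECT comes from the description.

```python
def build_view_query(data_table: str, manifest_table: str) -> str:
    """Return the SELECT that keeps only rows from files listed in the manifest table."""
    return (
        f"SELECT * FROM `{data_table}` "
        f"WHERE _hoodie_file_name IN (SELECT filename FROM `{manifest_table}`)"
    )


def build_create_view(view: str, data_table: str, manifest_table: str) -> str:
    """Wrap the SELECT in a view definition (hypothetical helper, for illustration)."""
    return (
        f"CREATE OR REPLACE VIEW `{view}` AS "
        f"{build_view_query(data_table, manifest_table)}"
    )


if __name__ == "__main__":
    # Table names below are placeholders.
    print(build_create_view("ds.hudi_view", "ds.hudi_data", "ds.hudi_manifest"))
```

The view filters on `_hoodie_file_name`, so only rows from the data files named in the manifest table are visible.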

The direct reason for this workaround is that BigQuery did not support manifest files. However, BigQuery is rolling out manifest file support, allowing users to specify a manifest file as the source URI. Right now the feature[1] roll-out seems to cover only non-partitioned external tables (using Hive partitioning returns the error "file_set_spec_type option is not supported for hive partition"); it should cover partitioned external tables soon.
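For reference, a manifest-backed external table could be declared roughly as follows. This is a hedged sketch: the helper name, table name, and manifest URI are placeholders, and the option value `NEW_LINE_DELIMITED_MANIFEST` reflects my reading of the BigQuery external table options, not anything stated in this issue.

```python
def build_external_table_ddl(table: str, manifest_uri: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement whose source URI is a manifest file.

    Assumption: file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST' tells BigQuery
    to treat the URI as a newline-delimited list of data file URIs.
    """
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{table}`\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['{manifest_uri}'],\n"
        "  file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST'\n"
        ")"
    )


if __name__ == "__main__":
    # Placeholder names; substitute a real dataset, table, and manifest location.
    print(build_external_table_ddl("ds.hudi_table", "gs://my-bucket/my_table/manifest.csv"))
```

No hive-partitioning clause is included, in line with the current restriction to non-partitioned external tables.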

Given this new BigQuery feature, it would be better to update BigQuerySyncTool accordingly:

- Allow creating a BigQuery-compatible manifest file, which expects the absolute paths of the data files. This has been done in ~~HUDI-6254~~.
- Allow using the new manifest file to create the external table directly. This can be done by issuing one "CREATE EXTERNAL TABLE" query to BigQuery.
- Avoid breaking existing user workflows. In case some users rely on the view-based workaround, it probably makes sense to keep the workaround alive, at least for now. That would require maintaining two versions of the manifest file.
- Provide a temporary workaround for using BigQuery manifest file support until the feature extends to partitioned tables. Since it currently does not support Hive partitioning, "CREATE EXTERNAL TABLE" can only create a table over all the parquet data files without recognizing the partition columns. To keep the partition columns, a possible workaround is to set "hoodie.datasource.write.drop.partition.columns" to false and allow users to leave "hoodie.gcp.bigquery.sync.source_uri_prefix" unspecified, so that the partition columns are written into the parquet files and BigQuerySyncTool does not try to create a hive-partitioned external table.
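The two manifest versions and the temporary workaround settings can be illustrated with a short sketch. The helper names, bucket path, and file names are hypothetical, and the exact layout of the existing manifest (assumed here to be bare file names, one per line) is an assumption based on the description rather than on the Hudi source.

```python
import posixpath


def legacy_manifest(relative_files):
    """Existing manifest for the view-based workaround: data file names only (assumed layout)."""
    return "\n".join(posixpath.basename(f) for f in relative_files)


def absolute_manifest(base_path, relative_files):
    """BigQuery-compatible manifest: one absolute data file URI per line."""
    base = base_path.rstrip("/")
    return "\n".join(f"{base}/{f.lstrip('/')}" for f in relative_files)


# Temporary workaround settings named in the issue: keep partition columns in the
# parquet files, and leave hoodie.gcp.bigquery.sync.source_uri_prefix unset.
WORKAROUND_PROPS = {
    "hoodie.datasource.write.drop.partition.columns": "false",
}


if __name__ == "__main__":
    files = ["region=us/a.parquet", "region=eu/b.parquet"]  # placeholder file names
    print(legacy_manifest(files))
    print(absolute_manifest("gs://my-bucket/my_table/", files))
```

Keeping both functions side by side mirrors the point about backward compatibility: the file-name manifest keeps the view working, while the absolute-path manifest feeds the direct external table.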

[1] https://cloud.google.com/bigquery/docs/information-schema-table-options#options_table

Issue Links

links to

GitHub Pull Request #8898

People

Assignee: Unassigned

Reporter: Jinpeng Zhou

Votes: 0

Watchers: 2

Dates

Created: 07/Jun/23 18:09

Updated: 22/Jun/23 03:57

Resolved: 22/Jun/23 03:57