
Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.15.0, 1.0.0
    • Component/s: None

    Description

      Ticket to track
      https://github.com/apache/hudi/pull/10460

      Currently, AWS Glue sync provides two interfaces: HoodieHiveSyncClient, which goes through Hive and then AWS's hidden Glue-to-Hive implementation, and AWSGlueCatalogSyncClient.
      Both have limitations: although Hive has improved somewhat, pushdown filtering still fails at large scale, and the fallback may not work for certain partitioning schemes.
      When syncing to Glue through HoodieHiveSyncClient, the limitations are:

      1. Create/update is not parallelized under the hood, so big partition sets are very slow to sync: empirically it peaks at about 40 partitions/sec, which translates to minutes at larger scale.
      2. The pushdown filter is not really effective. The first form (specifying exact partitions) works unpredictably: the longer the partition values, the fewer partitions fit into one request - in our case we could not specify more than 100 partitions - so it falls back to the min-max predicate.
      3. The min-max predicate does not work when the number of partitions grows with nesting, e.g. 10 partitions at level 1, 100 at level 2, 1000 at level 3. Min-max cuts down the high-level ones but still loads everything at the lower levels, so it does not really optimize anything.
      4. On a schema change, the Hive-Glue path cascades the update to every partition, which makes big tables impossible to sync in meaningful time - and since Hudi does not specify a schema at the partition level in Glue, this is wasted effort.
      This is why AWSGlueCatalogSyncClient is preferable. But it has its own problems:
      1. Create/update/delete calls were not optimized before. They are now async, but without a meaningful upper bound they simply hit the Glue request limit and stay there. This change adds a parameter for the parallelism and the batching logic behind it (see the first sketch after this list).
      2. AWSGlueCatalogSyncClient always lists all partitions. This is far from optimal: the goal is only to distinguish which of the partitions changed since the last sync were created and which were deleted, so the more targeted BatchGetPartition API can be used instead, and it parallelizes easily. I added a new method to the sync client classes, moved the Hive pushdown logic into a Hive-specific class, and implemented the new method for the AWS Glue client class, again with a parameter controlling the parallelism (see the second sketch after this list).
      3. Listing all partitions is itself suboptimal. It is still needed for the initial sync/resync, but the straightforward implementation pages with the basic nextToken, which is sequential and slow on heavily partitioned tables. AWS offers an improvement for this particular method, called segments, which lets us create 1 to 10 starting positions and page through each of them with the standard nextToken API; the last public version of the Hive-to-Glue interface implementation uses it as well. When we switched from the Hive sync class to the AWS Glue-specific one, the first thing we hit was this listing performance degradation, so I added segment API usage plus a parameter controlling the parallelism (see the third sketch after this list).
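
      For illustration, here is a minimal sketch of the bounded-parallelism writes from item 1, using the AWS SDK for Java v2 async Glue client. The class and the maxParallelism parameter are hypothetical, not the PR's actual code; the chunk size of 100 reflects Glue's per-request limit for BatchCreatePartition.

      {code:java}
      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.Semaphore;

      import software.amazon.awssdk.services.glue.GlueAsyncClient;
      import software.amazon.awssdk.services.glue.model.BatchCreatePartitionRequest;
      import software.amazon.awssdk.services.glue.model.PartitionInput;

      public class ParallelPartitionWriter {
        private static final int MAX_PARTITIONS_PER_REQUEST = 100; // Glue BatchCreatePartition limit

        private final GlueAsyncClient glue;
        private final Semaphore inFlight; // the "meaningful high border" on concurrent requests

        public ParallelPartitionWriter(GlueAsyncClient glue, int maxParallelism) {
          this.glue = glue;
          this.inFlight = new Semaphore(maxParallelism);
        }

        public void createPartitions(String database, String table, List<PartitionInput> partitions)
            throws InterruptedException {
          List<CompletableFuture<?>> futures = new ArrayList<>();
          for (int i = 0; i < partitions.size(); i += MAX_PARTITIONS_PER_REQUEST) {
            List<PartitionInput> chunk =
                partitions.subList(i, Math.min(i + MAX_PARTITIONS_PER_REQUEST, partitions.size()));
            inFlight.acquire(); // block until a slot frees, so we never exceed maxParallelism
            futures.add(glue.batchCreatePartition(BatchCreatePartitionRequest.builder()
                    .databaseName(database)
                    .tableName(table)
                    .partitionInputList(chunk)
                    .build())
                .whenComplete((resp, err) -> inFlight.release()));
          }
          // Wait for all chunks; a real implementation would also inspect the
          // per-partition error list in each BatchCreatePartitionResponse.
          CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0])).join();
        }
      }
      {code}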
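
      Next, a sketch of the BatchGetPartition lookup from item 2, again with hypothetical names. It assumes a per-request limit of 1000 partition keys; keys missing from the responses are the partitions that do not yet exist in Glue.

      {code:java}
      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.CompletableFuture;
      import java.util.stream.Collectors;

      import software.amazon.awssdk.services.glue.GlueAsyncClient;
      import software.amazon.awssdk.services.glue.model.BatchGetPartitionRequest;
      import software.amazon.awssdk.services.glue.model.BatchGetPartitionResponse;
      import software.amazon.awssdk.services.glue.model.Partition;
      import software.amazon.awssdk.services.glue.model.PartitionValueList;

      public class ChangedPartitionFetcher {
        private static final int MAX_KEYS_PER_REQUEST = 1000; // assumed BatchGetPartition limit

        // Fetch only the partitions changed since the last sync. Keys absent from the
        // responses are the ones that do not exist in Glue yet, i.e. must be created.
        public static List<Partition> fetchExisting(GlueAsyncClient glue, String database,
            String table, List<List<String>> changedPartitionValues) {
          List<CompletableFuture<BatchGetPartitionResponse>> futures = new ArrayList<>();
          for (int i = 0; i < changedPartitionValues.size(); i += MAX_KEYS_PER_REQUEST) {
            List<PartitionValueList> keys = changedPartitionValues
                .subList(i, Math.min(i + MAX_KEYS_PER_REQUEST, changedPartitionValues.size()))
                .stream()
                .map(values -> PartitionValueList.builder().values(values).build())
                .collect(Collectors.toList());
            // Chunks are issued concurrently; the async client handles the I/O threads.
            futures.add(glue.batchGetPartition(BatchGetPartitionRequest.builder()
                .databaseName(database)
                .tableName(table)
                .partitionsToGet(keys)
                .build()));
          }
          List<Partition> existing = new ArrayList<>();
          for (CompletableFuture<BatchGetPartitionResponse> f : futures) {
            existing.addAll(f.join().partitions());
          }
          return existing;
        }
      }
      {code}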
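
      Finally, a sketch of the segmented listing from item 3: GetPartitions accepts a Segment (at most 10 total segments), and each segment can be paged independently with the standard nextToken. Names are again illustrative.

      {code:java}
      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.CompletableFuture;

      import software.amazon.awssdk.services.glue.GlueClient;
      import software.amazon.awssdk.services.glue.model.GetPartitionsRequest;
      import software.amazon.awssdk.services.glue.model.GetPartitionsResponse;
      import software.amazon.awssdk.services.glue.model.Partition;
      import software.amazon.awssdk.services.glue.model.Segment;

      public class SegmentedPartitionLister {

        // List all partitions by splitting the table into totalSegments slices
        // (Glue allows 1 to 10) and paging through each slice concurrently.
        public static List<Partition> listAll(GlueClient glue, String database, String table,
            int totalSegments) {
          List<CompletableFuture<List<Partition>>> futures = new ArrayList<>();
          for (int i = 0; i < totalSegments; i++) {
            final Segment segment =
                Segment.builder().segmentNumber(i).totalSegments(totalSegments).build();
            futures.add(CompletableFuture.supplyAsync(() -> listSegment(glue, database, table, segment)));
          }
          List<Partition> all = new ArrayList<>();
          futures.forEach(f -> all.addAll(f.join()));
          return all;
        }

        private static List<Partition> listSegment(GlueClient glue, String database, String table,
            Segment segment) {
          List<Partition> partitions = new ArrayList<>();
          String nextToken = null;
          do { // standard nextToken paging, but confined to this segment only
            GetPartitionsResponse resp = glue.getPartitions(GetPartitionsRequest.builder()
                .databaseName(database)
                .tableName(table)
                .segment(segment)
                .nextToken(nextToken)
                .build());
            partitions.addAll(resp.partitions());
            nextToken = resp.nextToken();
          } while (nextToken != null);
          return partitions;
        }
      }
      {code}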

      All of this has been tested on a partitioned table with >200k partitions. I managed to improve the sync time from 2-3 minutes to 3 seconds. Let me know if you are interested in the detailed numbers.


          People

            Assignee: Unassigned
            Reporter: Vitali Makarevich (vmakarevich)
            Votes: 0
            Watchers: 2
