[HUDI-3625] [RFC-60] Optimized storage layout for cloud object stores - ASF JIRA

Details

Type: Epic
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: 1.0.0
Component/s: core
Labels:
- hudi-umbrellas
- pull-request-available

Epic Name:
RFC-60 Cloud storage layout

Description

Amazon S3, among other cloud object stores, throttles requests based on object prefix (see https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/). Hudi follows the traditional Hive storage layout, in which files are stored under separate partition paths beneath a common table path/prefix. When a significant number of files are written concurrently, the request limits for that common table path/prefix can be reached, which leads to throttling.

We propose implementing an alternate storage layout that is better suited to cloud object stores like S3, to avoid running into throttling issues as the data scales. At a high level, we need to distribute data files evenly across randomly generated prefixes, so that requests count against the limits of many prefixes instead of a single table prefix.
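
To make the idea concrete, here is a minimal sketch contrasting the Hive-style path with a prefixed path. It is not Hudi's implementation: the function names, the path order (storage root, prefix, table name, partition, file), and the 4-character prefix length are all assumptions made for illustration. It also derives the prefix from a hash of the partition path and file name rather than generating it at random, so that the location can be recomputed later without a lookup; that substitution is an assumption as well.

```python
import hashlib


def hive_style_location(table_base_path: str, partition_path: str, file_name: str) -> str:
    """Traditional layout: every file shares the common table path/prefix."""
    return f"{table_base_path}/{partition_path}/{file_name}"


def hashed_prefix(partition_path: str, file_name: str, num_hex_chars: int = 4) -> str:
    """Derive a short, deterministic prefix from the file's logical identity.

    Hypothetical helper: a SHA-256 digest is used only because it spreads
    keys uniformly; the prefix length is an arbitrary choice.
    """
    key = f"{partition_path}/{file_name}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:num_hex_chars]


def prefixed_location(storage_root: str, table_name: str,
                      partition_path: str, file_name: str) -> str:
    """Alternate layout (assumed shape): the hashed prefix comes first,
    so files of one table are spread across many object-store prefixes."""
    prefix = hashed_prefix(partition_path, file_name)
    return f"{storage_root}/{prefix}/{table_name}/{partition_path}/{file_name}"


if __name__ == "__main__":
    # Hypothetical bucket, table, partition, and file names.
    print(hive_style_location("s3://bucket/my_table", "date=2022-03-01", "file-0001.parquet"))
    print(prefixed_location("s3://bucket", "my_table", "date=2022-03-01", "file-0001.parquet"))
```

With this shape, two files in the same partition usually land under different prefixes, while the same file always maps to the same location. A layout that uses truly random prefixes would instead need to record the mapping from each file to its physical path somewhere.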