Details
Type: Epic
Status: Open
Priority: Major
Resolution: Unresolved
RFC-60 Cloud storage layout
Description
Amazon S3, among other cloud object stores, throttles requests based on object prefix (see https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/). Hudi follows the traditional Hive storage layout, with files stored under separate partition paths beneath a common table path/prefix. Because every request then targets the same table path/prefix, writing a significant number of files concurrently can reach that prefix's request limits and trigger throttling.
We propose implementing an alternate storage layout that is better suited to cloud object stores like S3 and avoids throttling as the data scales. At a high level, we need to distribute data files evenly across randomly generated prefixes, so that request load is spread across those prefixes instead of concentrating on a single table prefix.
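As a rough illustration of the idea, the sketch below places each data file under a short prefix derived from a hash of its logical path. This is an assumption made for the example, not the layout this RFC specifies: a hash is used in place of a truly random prefix so that the physical location can be recomputed from the logical path without a lookup table. The function name `prefixed_object_key`, the `prefix_len` parameter, and the key layout `<prefix>/<table>/<partition>/<file>` are all hypothetical.

```python
import hashlib


def prefixed_object_key(table_name: str, partition_path: str,
                        file_name: str, prefix_len: int = 8) -> str:
    """Return an object key that starts with a hash-derived prefix.

    Hypothetical helper: the layout and names are illustrative only.
    """
    # Logical location of the file in the traditional Hive-style layout.
    logical_path = f"{partition_path}/{file_name}"
    # Hash the logical path; hex digits of SHA-256 are close to uniformly
    # distributed, so files spread evenly across the possible prefixes.
    digest = hashlib.sha256(logical_path.encode("utf-8")).hexdigest()
    prefix = digest[:prefix_len]
    # The prefix comes first in the key, so request limits apply per prefix
    # rather than to one shared table prefix.
    return f"{prefix}/{table_name}/{logical_path}"


# Example: two files in the same partition get unrelated leading prefixes.
for name in ("file-0001.parquet", "file-0002.parquet"):
    print(prefixed_object_key("trips", "date=2023-01-01", name))
```

The trade-off in this sketch is that files from one partition no longer share a common path, so listing a partition would need some index of where its files live; how that mapping is kept is outside the scope of the example.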