  1. Apache Hudi
  2. HUDI-3625

[RFC-60] Optimized storage layout for cloud object stores

Details

    • RFC-60 Cloud storage layout

    Description

      Amazon S3, among other cloud object stores, throttles requests based on object prefix (see https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/). Hudi follows the traditional Hive storage layout, in which files are stored under separate partition paths beneath a common table path/prefix. When a significant number of files are written concurrently, the request limits for that common table path/prefix can be reached, which leads to throttling.

      We propose implementing an alternate storage layout that is better suited to cloud object stores like S3, to avoid throttling as the data scales. At a high level, data files need to be distributed evenly across randomly generated prefixes, so that requests are spread across those prefixes rather than counted against a single table prefix.
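
      The idea of spreading files across prefixes can be illustrated with a minimal sketch. It is not the layout proposed in the RFC: it assumes a prefix derived from a hash of the file's logical path (rather than a purely random one) so that the same file always maps to the same location, and the function name, the table and partition names, and the 8-character prefix length are all hypothetical choices made for the illustration.

```python
import hashlib
from collections import Counter


def prefixed_object_key(table_name, partition_path, file_name, prefix_len=8):
    """Build an object key that starts with a hash-derived prefix.

    Hypothetical helper for illustration only; the names and the
    prefix length are assumptions, not part of the Hudi API.
    """
    # Logical (Hive-style) location of the file under the table path.
    logical_path = f"{table_name}/{partition_path}/{file_name}"
    # Hash the logical path so the prefix is deterministic and well spread.
    digest = hashlib.sha256(logical_path.encode("utf-8")).hexdigest()
    # Put the hash prefix first, so keys no longer share one leading prefix.
    return f"{digest[:prefix_len]}/{logical_path}"


# Map 1000 files of one partition and count how many distinct prefixes result.
keys = [
    prefixed_object_key("trips", "date=2022-03-14", f"file-{i}.parquet")
    for i in range(1000)
]
distinct_prefixes = Counter(key.split("/", 1)[0] for key in keys)
print(len(distinct_prefixes), "distinct prefixes for", len(keys), "files")
```

      With a layout like this, files from the same partition land under many different leading prefixes. The trade-off is that the physical location of a file can no longer be inferred from the partition path alone, so some mapping from logical to physical paths is needed.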

          People

            yc2523 Shawn Chang
            uditme Udit Mehrotra