  1. Hadoop Common
  2. HADOOP-13345

S3Guard: Improved Consistency for S3A

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.8.1
    • Fix Version/s: 2.9.0, 3.0.0-beta1
    • Component/s: fs/s3
    • Labels: None
    • Release Note:
      S3Guard (pronounced see-guard) is a new feature for the S3A connector to Amazon S3 that uses DynamoDB as a high-performance, consistent metadata repository. In essence, S3Guard caches directory information, so your S3A clients get faster lookups and resilience to inconsistency between S3 list operations and the status of objects. With S3Guard, files are always found once they have been created.

      S3Guard does not address update consistency: if a file is updated, the directory information is updated too, but calling open() on the path may still return the old data. Similarly, it may still be possible to open deleted objects.

      Please consult the S3Guard documentation in the Amazon S3 section of our documentation.

      Note: this update moves to a new version of the AWS SDK 1.11, one which includes the DynamoDB client and a shaded version of Jackson 2. The large aws-sdk-bundle JAR is needed to use the S3A client, with or without S3Guard enabled. The good news: because Jackson is shaded, there is no conflict between the Jackson version used in your application and the one the AWS SDK needs.
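
      The update-consistency caveat in the release note can be illustrated with a toy model. This is a minimal sketch and not Hadoop code: ToyObjectStore and ToyGuardedClient are hypothetical names, and the sketch simply assumes that an overwrite stays invisible to readers until the store "settles". It shows the metadata already reflecting a new write while open() still returns the old bytes.

```python
class ToyObjectStore:
    """Hypothetical stand-in for an eventually consistent object store."""

    def __init__(self):
        self._visible = {}   # data that readers currently see
        self._pending = {}   # overwrites that are not yet visible

    def put(self, key, data):
        if key in self._visible:
            # Assumption of this sketch: an overwrite is delayed.
            self._pending[key] = data
        else:
            self._visible[key] = data

    def get(self, key):
        return self._visible[key]

    def settle(self):
        """Make all pending overwrites visible."""
        self._visible.update(self._pending)
        self._pending.clear()


class ToyGuardedClient:
    """Hypothetical client that records metadata in a consistent store."""

    def __init__(self, store):
        self.store = store
        self.metadata = {}   # path -> length, always up to date

    def create(self, path, data):
        self.store.put(path, data)
        self.metadata[path] = len(data)

    def list_paths(self):
        # Listings come from the consistent metadata, so new files appear.
        return sorted(self.metadata)

    def open(self, path):
        # Reads still go to the object store, so they may be stale.
        return self.store.get(path)


store = ToyObjectStore()
client = ToyGuardedClient(store)
client.create("data/part-0000", b"old")
client.create("data/part-0000", b"newer")

recorded_length = client.metadata["data/part-0000"]  # length of b"newer"
stale = client.open("data/part-0000")                # still the old bytes
store.settle()
fresh = client.open("data/part-0000")                # now the new bytes
```

      In this model the listing and the recorded length are correct immediately, while the content returned by open() catches up only later; that gap is the one the release note says S3Guard does not close.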

    Description

      This issue proposes S3Guard, a new feature of S3A, to provide an option for a stronger consistency model than what is currently offered. The solution coordinates with a strongly consistent external store to resolve inconsistencies caused by the S3 eventual consistency model.
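
      The idea of coordinating with a consistent external store can be sketched as a merge of two listings. This is an illustrative sketch only, not the implementation in Hadoop: listing_union is a hypothetical function, and the deleted flag is an assumed way of representing the delete tracking mentioned in the sub-tasks. Entries from the metadata store override what the object-store listing reports.

```python
def listing_union(store_listing, metadata_entries):
    """Merge an object-store listing with metadata-store entries.

    store_listing: names returned by the (possibly stale) object store.
    metadata_entries: dict of name -> is_deleted from the consistent store.
    """
    names = set(store_listing)
    for name, is_deleted in metadata_entries.items():
        if is_deleted:
            # The consistent store knows about a delete the listing missed.
            names.discard(name)
        else:
            # The consistent store knows about a file the listing missed.
            names.add(name)
    return sorted(names)


# The stale listing lacks the new file and still shows the deleted one.
stale_listing = ["a.csv", "old.csv"]
metadata = {"b.csv": False, "old.csv": True}
merged = listing_union(stale_listing, metadata)
```

      Here merged contains a.csv and b.csv: the newly created file is added and the deleted one is dropped, even though the object-store listing reported neither change.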

      Attachments

        1. S3GuardImprovedConsistencyforS3AV2.pdf
          328 kB
          Chris Nauroth
        2. s3c.001.patch
          61 kB
          Lei (Eddy) Xu
        3. S3C-ConsistentListingonS3-Design.pdf
          245 kB
          Lei (Eddy) Xu
        4. HADOOP-13345.prototype1.patch
          76 kB
          Chris Nauroth
        5. S3GuardImprovedConsistencyforS3A.pdf
          431 kB
          Chris Nauroth

        Issue Links

          1.
          Support running isolated unit tests separate from AWS integration tests. Sub-task Resolved Chris Nauroth  
          2.
          S3Guard: Define MetadataStore interface. Sub-task Resolved Chris Nauroth  
          3.
          S3Guard: Implement DynamoDBMetadataStore. Sub-task Resolved Mingliang Liu  
          4.
          S3Guard: Implement access policy providing strong consistency with S3 as source of truth. Sub-task Closed Unassigned  
          5.
          S3Guard: Implement access policy using metadata store as source of truth. Sub-task Closed Unassigned  
          6.
          S3Guard: Implement access policy for intra-client consistency with in-memory metadata store. Sub-task Resolved Aaron Fabbri  
          7.
          S3Guard: Instrument new functionality with Hadoop metrics. Sub-task Resolved Ai Deng  
          8.
          S3Guard: Write end user docs, change table autocreate default. Sub-task Resolved Aaron Fabbri  
          9.
          S3Guard: create basic contract tests for MetadataStore implementations Sub-task Resolved Aaron Fabbri  
          10.
          S3Guard: implement move() for LocalMetadataStore, add unit tests Sub-task Resolved Aaron Fabbri  
          11.
          S3Guard: Allow execution of all S3A integration tests with S3Guard enabled. Sub-task Resolved Steve Loughran  
          12.
          S3Guard: S3AFileSystem Integration with MetadataStore Sub-task Resolved Aaron Fabbri  
          13.
          S3Guard: Provide command line tools to manipulate metadata store. Sub-task Resolved Lei (Eddy) Xu  
          14.
          Change PathMetadata to hold S3AFileStatus instead of FileStatus. Sub-task Resolved Lei (Eddy) Xu  
          15.
          S3Guard: better support for multi-bucket access Sub-task Resolved Aaron Fabbri  
          16.
          S3Guard: add delete tracking Sub-task Resolved Aaron Fabbri  
          17.
          s3guard: add inconsistency injection, integration tests Sub-task Resolved Aaron Fabbri  
          18.
          s3guard to log choice of metadata store at debug Sub-task Resolved Mingliang Liu  
          19.
          S3Guard: fix TestDynamoDBMetadataStore when fs.s3a.s3guard.ddb.table is set Sub-task Resolved Aaron Fabbri  
          20.
          s3guard: ITestS3AFileOperationCost.testFakeDirectoryDeletion failure Sub-task Resolved Mingliang Liu  
          21.
          dynamodb dependency -> compile Sub-task Resolved Mingliang Liu  
          22.
          DynamoDBMetadataStore to handle DDB throttling failures through retry policy Sub-task Resolved Aaron Fabbri  
          23.
          tune dynamodb client & tests Sub-task Resolved Steve Loughran  
          24.
          s3guard: improve S3AFileStatus#isEmptyDirectory handling Sub-task Resolved Aaron Fabbri  
          25.
          S3Guard: Existing tables may not be initialized correctly in DynamoDBMetadataStore Sub-task Resolved Mingliang Liu  
          26.
          S3Guard: NPE when table is already populated in dynamodb and user specifies "fs.s3a.s3guard.ddb.table.create=false" Sub-task Closed Mingliang Liu  
          27.
          S3AGuard: Use BatchWriteItem in DynamoDBMetadataStore#put() Sub-task Resolved Mingliang Liu  
          28.
          S3Guard: S3AFileSystem::listLocatedStatus() to employ MetadataStore Sub-task Resolved Mingliang Liu  
          29.
          S3Guard: DynamoDBMetadataStore#move() could be throwing exception due to BatchWriteItem limits Sub-task Resolved Mingliang Liu  
          30.
          Mock bucket locations in MockS3ClientFactory Sub-task Resolved Mingliang Liu  
          31.
          S3guard: replace dynamo.describe() call in init with more efficient query Sub-task Closed Mingliang Liu  
          32.
          Initialize DynamoDBMetadataStore without associated S3AFileSystem Sub-task Resolved Mingliang Liu  
          33.
          Add ability to start DDB local server in every test Sub-task Resolved Mingliang Liu  
          34.
          s3guard: add a version marker to every table Sub-task Resolved Steve Loughran  
          35.
          S3Guard CLI: Add documentation Sub-task Resolved Aaron Fabbri  
          36.
          s3guard cli: make tests easier to run and address failure Sub-task Resolved Sean Mackrory  
          37.
          Merge initial S3guard release into trunk Sub-task Resolved Steve Loughran  
          38.
          cli to list info about a bucket (S3guard or not) Sub-task Resolved Unassigned  
          39.
          Handled dynamo exceptions in translateException Sub-task Resolved Unassigned  
          40.
          S3Guard: fix multi-bucket integration tests Sub-task Resolved Aaron Fabbri  
          41.
          Optimize dirListingUnion Sub-task Resolved Sean Mackrory  
          42.
          S3Guard: DynamoDBMetadataStore logs nonsense region Sub-task Resolved Sean Mackrory  
          43.
          Implicitly creating DynamoDB table ignores endpoint config Sub-task Resolved Sean Mackrory  
          44.
          S3Guard: intermittent duplicate item keys failure Sub-task Resolved Mingliang Liu  
          45.
          CLI command to prune old metadata Sub-task Resolved Sean Mackrory  
          46.
          Metastore destruction test creates table without version marker Sub-task Resolved Sean Mackrory  
          47.
          S3Guard: link docs from index, fix typos Sub-task Resolved Aaron Fabbri  
          48.
          Fix breaking link in s3guard.md Sub-task Resolved Mingliang Liu  
          49.
          Drop unnecessary type assertion and cast Sub-task Resolved Sean Mackrory  
          50.
          Allow users to specify region for DynamoDB table instead of endpoint Sub-task Resolved Sean Mackrory  
          51.
          Rethink S3GuardTool options Sub-task Resolved Sean Mackrory  
          52.
          s3guard: regression in dirListingUnion Sub-task Resolved Aaron Fabbri  
          53.
          ITestS3GuardListConsistency fails intermittently Sub-task Resolved Mingliang Liu  
          54.
          In S3AFileSystem, make getAmazonClient() package private; export getBucketLocation() Sub-task Resolved Steve Loughran  
          55.
          s3guard tool tests aren't isolated; can't run in parallel Sub-task Resolved Sean Mackrory  
          56.
          ITestS3ACredentialsInURL sometimes fails Sub-task Resolved Sean Mackrory  
          57.
          Simplify DynamoDBClientFactory for creating Amazon DynamoDB clients Sub-task Resolved Mingliang Liu  
          58.
          s3guard: CLI diff non-empty after import on new table Sub-task Resolved Sean Mackrory  
          59.
          Ensure GenericOptionParser is used for S3Guard CLI Sub-task Resolved Sean Mackrory  
          60.
          Add S3Guard.dirListingUnion in S3AFileSystem#listFiles, listLocatedStatus Sub-task Closed Unassigned  
          61.
          S3GuardTool tests should not run if S3Guard is not set up Sub-task Resolved Sean Mackrory  
          62.
          S3Guard: import does not import empty directory Sub-task Resolved Sean Mackrory  
          63.
          Add validation of DynamoDB region Sub-task Resolved Sean Mackrory  
          64.
          DynamoDB client should waitForActive on existing tables Sub-task Resolved Sean Mackrory  
          65.
          Add s3guardtool dump command Sub-task Resolved Unassigned  
          66.
          S3Guard: DynamoDBMetadataStore::move() should populate ancestor directories of destination paths Sub-task Resolved Mingliang Liu  
          67.
          S3Guard: S3AFileSystem::rename() should move non-listed sub-directory entries in metadata store Sub-task Resolved Mingliang Liu  
          68.
          S3Guard: ITestS3AConcurrentOps is not cleaning up test data Sub-task Resolved Mingliang Liu  
          69.
          TestS3GuardTool hangs/fails when offline: it's an IT test Sub-task Resolved Mingliang Liu  
          70.
          S3Guard: S3AFileSystem::listFiles() to employ MetadataStore Sub-task Resolved Mingliang Liu  
          71.
          S3Guard: DynamoDBMetadata::prune() should self interrupt correctly Sub-task Resolved Mingliang Liu  
          72.
          TestDynamoDBMetadataStore is broken unless we can fail faster without a table version Sub-task Resolved Sean Mackrory  
          73.
          ITestS3GuardListConsistency failure w/ Local, authoritative metadata store Sub-task Resolved Aaron Fabbri  
          74.
          S3Guard: S3GuardTool to support provisioning existing metadata store Sub-task Resolved Steve Loughran  
          75.
          s3guard will set file length to -1 on a putObjectDirect(stream, -1) call Sub-task Resolved Steve Loughran  
          76.
          ITestS3GuardConcurrentOps.testConcurrentTableCreations fails without table name configured Sub-task Resolved Sean Mackrory  
          77.
          Play nice with ITestS3AEncryptionSSEC Sub-task Resolved Sean Mackrory  
          78.
          create() does not notify metadataStore of parent directories or ensure they're not existing files Sub-task Resolved Sean Mackrory  
          79.
          S3Guard: Improve FNFE message when opening a stream Sub-task Resolved Aaron Fabbri  
          80.
          make InconsistentAmazonS3Client usable in downstream tests Sub-task Resolved Aaron Fabbri  
          81.
          Ensure deleted parent directory tombstones are overwritten when implicitly recreated Sub-task Resolved Sean Mackrory  
          82.
          DirListingMetadata precondition failure messages to include path at fault Sub-task Resolved Steve Loughran  
          83.
          s3guard w/ failure injection: listStatus fails after renaming file into directory Sub-task Resolved Sean Mackrory  
          84.
          ITestS3GuardConcurrentOps requires explicit DynamoDB table name to be configured Sub-task Resolved Sean Mackrory  
          85.
          Findbugs warning in LocalMetadataStore Sub-task Resolved Sean Mackrory  
          86.
          ProvidedFileStatusIterator#next() may throw IndexOutOfBoundsException Sub-task Resolved Aaron Fabbri  
          87.
          simplify mkdirs() after S3Guard delete tracking change Sub-task Resolved Sean Mackrory  
          88.
          InconsistentAmazonS3Client adds extra paths to listStatus() after delete. Sub-task Resolved Sean Mackrory  
          89.
          ITestS3GuardListConsistency is too slow Sub-task Resolved Aaron Fabbri  
          90.
          S3Guard: issues running parallel tests w/ S3N Sub-task Resolved Aaron Fabbri  
          91.
          LocalDynamoDB missing from latest AWS SDK releases Sub-task Resolved Steve Loughran  
          92.
          S3Guard: optimize create codepath Sub-task Resolved Aaron Fabbri  
          93.
          add a predicate/option to probe an S3A FS for being consistent Sub-task Resolved Unassigned  
          94.
          ITestS3GuardConcurrentOps failing with -Ddynamodblocal -Ds3guard Sub-task Resolved Steve Loughran  
          95.
          ITestS3AEncryptionSSEC failing in parallel s3guard runs Sub-task Resolved Steve Loughran  
          96.
          Review S3guard docs & code prior to merge Sub-task Resolved Steve Loughran

          97.
          S3Guard premerge changes: java 7 build & test tuning Sub-task Resolved Steve Loughran  
          98.
          s3guard diff demand creates a new table Sub-task Resolved Unassigned  
          99.
          hadoop-aws shell profile not being built Sub-task Resolved Allen Wittenauer  
          100.
          S3Guard: handle provisioning failure through backoff & retry (& metrics) Sub-task Resolved Unassigned  
          101.
          s3guard usage calls function incorrectly Sub-task Resolved Allen Wittenauer  
          102.
          backport S3guard to branch-2 Sub-task Resolved Steve Loughran  

            People

              cnauroth Chris Nauroth
              cnauroth Chris Nauroth
              Votes: 8
              Watchers: 73

                Time Tracking

                  Estimated: 24h
                  Remaining: 24h
                  Logged: Not Specified