Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.8.1
    • Fix Version/s: 2.9.0, 3.0.0-beta1
    • Component/s: fs/s3
    • Labels:
      None
    • Target Version/s:
    • Release Note:
      S3Guard (pronounced see-guard) is a new feature for the S3A connector to Amazon S3, which uses DynamoDB as a high-performance, consistent metadata repository. Essentially: S3Guard caches directory information, so your S3A clients get faster lookups and resilience to inconsistency between S3 list operations and the status of objects. With S3Guard, newly created files will always be found.

      S3Guard does not address update consistency: if a file is updated, the directory information is updated too, but calling open() on the path may still return the old data. Similarly, it may still be possible to open deleted objects.

      Please consult the S3Guard documentation in the Amazon S3 section of our documentation.

      Note: part of this update includes moving to a new version of the AWS SDK 1.11, one which includes the DynamoDB client and a shaded version of Jackson 2. The large aws-sdk-bundle JAR is needed to use the S3A client with or without S3Guard enabled. The good news: because Jackson is shaded, there will be no conflict between any Jackson version used in your application and that which the AWS SDK needs.

      Description

      This issue proposes S3Guard, a new feature of S3A, to provide an option for a stronger consistency model than what is currently offered. The solution coordinates with a strongly consistent external store to resolve inconsistencies caused by the S3 eventual consistency model.
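
      The idea of coordinating with a consistent store can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class and method names below (LaggingObjectStore, ConsistentMetadataStore, GuardedClient) are hypothetical and are not Hadoop or S3Guard APIs. The lagging store stands in for S3 listings that do not yet show a new object, and the guarded listing is the union of the two sources.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only: none of these names are Hadoop or S3Guard APIs. */
public class ListingUnionSketch {

    /** Stands in for S3: new keys stay out of listings until sync() is called. */
    static class LaggingObjectStore {
        private final Set<String> visible = new TreeSet<>();
        private final List<String> pending = new ArrayList<>();

        void put(String key) { pending.add(key); }   // written, but not yet listed
        void sync() { visible.addAll(pending); pending.clear(); }
        Set<String> list() { return new TreeSet<>(visible); }
    }

    /** Stands in for the strongly consistent external store. */
    static class ConsistentMetadataStore {
        private final Set<String> entries = new TreeSet<>();

        void put(String key) { entries.add(key); }
        Set<String> list() { return new TreeSet<>(entries); }
    }

    /** Records every create in both stores and lists the union of the two. */
    static class GuardedClient {
        final LaggingObjectStore store = new LaggingObjectStore();
        final ConsistentMetadataStore metadata = new ConsistentMetadataStore();

        void create(String key) {
            store.put(key);
            metadata.put(key);
        }

        Set<String> list() {
            Set<String> merged = store.list();   // possibly stale view
            merged.addAll(metadata.list());      // add entries the stale view misses
            return merged;
        }
    }

    public static void main(String[] args) {
        GuardedClient client = new GuardedClient();
        client.create("data/part-0000");
        System.out.println("raw listing:     " + client.store.list());
        System.out.println("guarded listing: " + client.list());
    }
}
```

      In this model the raw listing is empty immediately after the create, while the guarded listing already contains the new key. The sketch covers listing only; it does not model updates or deletes.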

        Attachments

        1. HADOOP-13345.prototype1.patch
          76 kB
          Chris Nauroth
        2. s3c.001.patch
          61 kB
          Lei (Eddy) Xu
        3. S3C-ConsistentListingonS3-Design.pdf
          245 kB
          Lei (Eddy) Xu
        4. S3GuardImprovedConsistencyforS3A.pdf
          431 kB
          Chris Nauroth
        5. S3GuardImprovedConsistencyforS3AV2.pdf
          328 kB
          Chris Nauroth

          Issue Links

          1.
          Support running isolated unit tests separate from AWS integration tests. Sub-task Resolved Chris Nauroth  
          2.
          Refactor S3AFileSystem to support introduction of separate metadata repository and tests. Sub-task Resolved Chris Nauroth  
          3.
          S3Guard: Define MetadataStore interface. Sub-task Resolved Chris Nauroth  
          4.
          S3Guard: Implement DynamoDBMetadataStore. Sub-task Resolved Mingliang Liu  
          5.
          S3Guard: Implement access policy providing strong consistency with S3 as source of truth. Sub-task Closed Unassigned  
          6.
          S3Guard: Implement access policy using metadata store as source of truth. Sub-task Closed Unassigned  
          7.
          S3Guard: Implement access policy for intra-client consistency with in-memory metadata store. Sub-task Resolved Aaron Fabbri  
          8.
          S3Guard: Instrument new functionality with Hadoop metrics. Sub-task Resolved Ai Deng  
          9.
          S3Guard: Write end user docs, change table autocreate default. Sub-task Resolved Aaron Fabbri  
          10.
          S3Guard: create basic contract tests for MetadataStore implementations Sub-task Resolved Aaron Fabbri  
          11.
          S3Guard: implement move() for LocalMetadataStore, add unit tests Sub-task Resolved Aaron Fabbri  
          12.
          S3Guard: Allow execution of all S3A integration tests with S3Guard enabled. Sub-task Resolved Steve Loughran  
          13.
          S3Guard: S3AFileSystem Integration with MetadataStore Sub-task Resolved Aaron Fabbri  
          14.
          S3Guard: Provide command line tools to manipulate metadata store. Sub-task Resolved Lei (Eddy) Xu  
          15.
          Change PathMetadata to hold S3AFileStatus instead of FileStatus. Sub-task Resolved Lei (Eddy) Xu  
          16.
          S3Guard: better support for multi-bucket access Sub-task Resolved Aaron Fabbri  
          17.
          S3Guard: add delete tracking Sub-task Resolved Aaron Fabbri  
          18.
          s3guard: add inconsistency injection, integration tests Sub-task Resolved Aaron Fabbri  
          19.
          s3guard to log choice of metadata store at debug Sub-task Resolved Mingliang Liu  
          20.
          S3Guard: fix TestDynamoDBMetadataStore when fs.s3a.s3guard.ddb.table is set Sub-task Resolved Aaron Fabbri  
          21.
          s3guard: ITestS3AFileOperationCost.testFakeDirectoryDeletion failure Sub-task Resolved Mingliang Liu  
          22.
          dynamodb dependency -> compile Sub-task Resolved Mingliang Liu  
          23.
          DynamoDBMetadataStore to handle DDB throttling failures through retry policy Sub-task Resolved Aaron Fabbri  
          24.
          tune dynamodb client & tests Sub-task Resolved Steve Loughran  
          25.
          s3guard: improve S3AFileStatus#isEmptyDirectory handling Sub-task Resolved Aaron Fabbri  
          26.
          S3Guard: Existing tables may not be initialized correctly in DynamoDBMetadataStore Sub-task Resolved Mingliang Liu  
          27.
          S3Guard: NPE when table is already populated in dynamodb and user specifies "fs.s3a.s3guard.ddb.table.create=false" Sub-task Closed Mingliang Liu  
          28.
          S3AGuard: Use BatchWriteItem in DynamoDBMetadataStore#put() Sub-task Resolved Mingliang Liu  
          29.
          S3Guard: S3AFileSystem::listLocatedStatus() to employ MetadataStore Sub-task Resolved Mingliang Liu  
          30.
          S3Guard: DynamoDBMetadataStore#move() could be throwing exception due to BatchWriteItem limits Sub-task Resolved Mingliang Liu  
          31.
          Mock bucket locations in MockS3ClientFactory Sub-task Resolved Mingliang Liu  
          32.
          S3guard: replace dynamo.describe() call in init with more efficient query Sub-task Closed Mingliang Liu  
          33.
          Initialize DynamoDBMetadataStore without associated S3AFileSystem Sub-task Resolved Mingliang Liu  
          34.
          Add ability to start DDB local server in every test Sub-task Resolved Mingliang Liu  
          35.
          s3guard: add a version marker to every table Sub-task Resolved Steve Loughran  
          36.
          S3Guard CLI: Add documentation Sub-task Resolved Aaron Fabbri  
          37.
          s3guard cli: make tests easier to run and address failure Sub-task Resolved Sean Mackrory  
          38.
          Merge initial S3guard release into trunk Sub-task Resolved Steve Loughran  
          39.
          cli to list info about a bucket (S3guard or not) Sub-task Resolved Unassigned  
          40.
          Handled dynamo exceptions in translateException Sub-task Resolved Unassigned  
          41.
          S3Guard: fix multi-bucket integration tests Sub-task Resolved Aaron Fabbri  
          42.
          Optimize dirListingUnion Sub-task Resolved Sean Mackrory  
          43.
          S3Guard: DynamoDBMetadataStore logs nonsense region Sub-task Resolved Sean Mackrory  
          44.
          Implicitly creating DynamoDB table ignores endpoint config Sub-task Resolved Sean Mackrory  
          45.
          S3Guard: intermittent duplicate item keys failure Sub-task Resolved Mingliang Liu  
          46.
          CLI command to prune old metadata Sub-task Resolved Sean Mackrory  
          47.
          Metastore destruction test creates table without version marker Sub-task Resolved Sean Mackrory  
          48.
          S3Guard: link docs from index, fix typos Sub-task Resolved Aaron Fabbri  
          49.
          Fix breaking link in s3guard.md Sub-task Resolved Mingliang Liu  
          50.
          Drop unnecessary type assertion and cast Sub-task Resolved Sean Mackrory  
          51.
          Allow users to specify region for DynamoDB table instead of endpoint Sub-task Resolved Sean Mackrory  
          52.
          Rethink S3GuardTool options Sub-task Resolved Sean Mackrory  
          53.
          s3guard: regression in dirListingUnion Sub-task Resolved Aaron Fabbri  
          54.
          ITestS3GuardListConsistency fails intermittently Sub-task Resolved Mingliang Liu  
          55.
          In S3AFileSystem, make getAmazonClient() package private; export getBucketLocation() Sub-task Resolved Steve Loughran  
          56.
          s3guard tool tests aren't isolated; can't run in parallel Sub-task Resolved Sean Mackrory  
          57.
          ITestS3ACredentialsInURL sometimes fails Sub-task Resolved Sean Mackrory  
          58.
          Simplify DynamoDBClientFactory for creating Amazon DynamoDB clients Sub-task Resolved Mingliang Liu  
          59.
          s3guard: CLI diff non-empty after import on new table Sub-task Resolved Sean Mackrory  
          60.
          Ensure GenericOptionParser is used for S3Guard CLI Sub-task Resolved Sean Mackrory  
          61.
          Add S3Guard.dirListingUnion in S3AFileSystem#listFiles, listLocatedStatus Sub-task Closed Unassigned  
          62.
          S3GuardTool tests should not run if S3Guard is not set up Sub-task Resolved Sean Mackrory  
          63.
          S3Guard: import does not import empty directory Sub-task Resolved Sean Mackrory  
          64.
          Add validation of DynamoDB region Sub-task Resolved Sean Mackrory  
          65.
          DynamoDB client should waitForActive on existing tables Sub-task Resolved Sean Mackrory  
          66.
          Add s3guardtool dump command Sub-task Resolved Unassigned  
          67.
          S3Guard: DynamoDBMetadataStore::move() should populate ancestor directories of destination paths Sub-task Resolved Mingliang Liu  
          68.
          S3Guard: S3AFileSystem::rename() should move non-listed sub-directory entries in metadata store Sub-task Resolved Mingliang Liu  
          69.
          S3Guard: ITestS3AConcurrentOps is not cleaning up test data Sub-task Resolved Mingliang Liu  
          70.
          TestS3GuardTool hangs/fails when offline: it's an IT test Sub-task Resolved Mingliang Liu  
          71.
          S3Guard: S3AFileSystem::listFiles() to employ MetadataStore Sub-task Resolved Mingliang Liu  
          72.
          S3Guard: DynamoDBMetadata::prune() should self interrupt correctly Sub-task Resolved Mingliang Liu  
          73.
          TestDynamoDBMetadataStore is broken unless we can fail faster without a table version Sub-task Resolved Sean Mackrory  
          74.
          ITestS3GuardListConsistency failure w/ Local, authoritative metadata store Sub-task Resolved Aaron Fabbri  
          75.
          S3Guard: S3GuardTool to support provisioning existing metadata store Sub-task Resolved Steve Loughran  
          76.
          s3guard will set file length to -1 on a putObjectDirect(stream, -1) call Sub-task Resolved Steve Loughran  
          77.
          ITestS3GuardConcurrentOps.testConcurrentTableCreations fails without table name configured Sub-task Resolved Sean Mackrory  
          78.
          Play nice with ITestS3AEncryptionSSEC Sub-task Resolved Sean Mackrory  
          79.
          create() does not notify metadataStore of parent directories or ensure they're not existing files Sub-task Resolved Sean Mackrory  
          80.
          S3Guard: Improve FNFE message when opening a stream Sub-task Resolved Aaron Fabbri  
          81.
          make InconsistentAmazonS3Client usable in downstream tests Sub-task Resolved Aaron Fabbri  
          82.
          Ensure deleted parent directory tombstones are overwritten when implicitly recreated Sub-task Resolved Sean Mackrory  
          83.
          DirListingMetadata precondition failure messages to include path at fault Sub-task Resolved Steve Loughran  
          84.
          s3guard w/ failure injection: listStatus fails after renaming file into directory Sub-task Resolved Sean Mackrory  
          85.
          ITestS3GuardConcurrentOps requires explicit DynamoDB table name to be configured Sub-task Resolved Sean Mackrory  
          86.
          Findbugs warning in LocalMetadataStore Sub-task Resolved Sean Mackrory  
          87.
          ProvidedFileStatusIterator#next() may throw IndexOutOfBoundsException Sub-task Resolved Aaron Fabbri  
          88.
          simplify mkdirs() after S3Guard delete tracking change Sub-task Resolved Sean Mackrory  
          89.
          InconsistentAmazonS3Client adds extra paths to listStatus() after delete. Sub-task Resolved Sean Mackrory  
          90.
          ITestS3GuardListConsistency is too slow Sub-task Resolved Aaron Fabbri  
          91.
          S3Guard: issues running parallel tests w/ S3N Sub-task Resolved Aaron Fabbri  
          92.
          LocalDynamoDB missing from latest AWS SDK releases Sub-task Resolved Steve Loughran  
          93.
          S3Guard: optimize create codepath Sub-task Resolved Aaron Fabbri  
          94.
          add a predicate/option to probe an S3A FS for being consistent Sub-task Resolved Unassigned  
          95.
          ITestS3GuardConcurrentOps failing with -Ddynamodblocal -Ds3guard Sub-task Resolved Steve Loughran  
          96.
          ITestS3AEncryptionSSEC failing in parallel s3guard runs Sub-task Resolved Steve Loughran  
          97.
          Review S3guard docs & code prior to merge Sub-task Resolved Steve Loughran  
          98.
          S3Guard premerge changes: java 7 build & test tuning Sub-task Resolved Steve Loughran  
          99.
          s3guard diff demand creates a new table Sub-task Resolved Unassigned  
          100.
          hadoop-aws shell profile not being built Sub-task Resolved Allen Wittenauer  
          101.
          S3Guard: handle provisioning failure through backoff & retry (& metrics) Sub-task Resolved Unassigned  
          102.
          s3guard usage calls function incorrectly Sub-task Resolved Allen Wittenauer  
          103.
          backport S3guard to branch-2 Sub-task Resolved Steve Loughran  

              People

              • Assignee:
                Chris Nauroth (cnauroth)
                Reporter:
                Chris Nauroth (cnauroth)
              • Votes:
                8
                Watchers:
                81

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated: 24h
                  Remaining: 24h
                  Logged: Not Specified