  1. Hadoop Common
  2. HADOOP-15619 Über-JIRA: S3Guard Phase IV: Hadoop 3.3 features
  3. HADOOP-16085

S3Guard: use object version or etags to protect against inconsistent read after replace/overwrite

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.3.0
    • Component/s: fs/s3
    • Labels:
      None
    • Release Note:
      S3Guard will now track the etag of uploaded files and, if an S3 bucket is versioned, the object version. You can then control how to react to a mismatch between the data in the DynamoDB table and that in the store: warn, fail, or, when using versions, return the original value.

      This adds two new columns to the table: etag and version. This is transparent to older S3A clients, but when such clients add or update data in the S3Guard table, they will not set these values. As a result, the etag/version checks will not work with files uploaded by older clients.

      For a consistent experience, upgrade all clients to the latest Hadoop version.
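The three reactions named in the release note (warn, fail, or return the original value when versions are available) can be illustrated with a small sketch. This is not the S3A implementation: `ChangeDetectionSketch`, `Mode`, and `versionToRead` are hypothetical names invented for illustration, and the logic is a simplification that assumes a plain etag comparison. It also shows why entries written by older clients, which carry no etag, cannot be checked.

```java
import java.util.Optional;

public class ChangeDetectionSketch {
    /** Hypothetical reactions to a mismatch: warn, fail, or read the recorded version. */
    enum Mode { WARN, FAIL, READ_RECORDED_VERSION }

    /**
     * Decides which object version a reader should request.
     *
     * @param recordedEtag    etag stored in the metadata table, or null if an older client wrote the entry
     * @param recordedVersion version ID stored in the table, or null if the bucket is unversioned
     * @param storeEtag       etag the store currently reports
     * @return the version ID to request, or empty to read whatever the store serves
     */
    static Optional<String> versionToRead(String recordedEtag, String recordedVersion,
                                          String storeEtag, Mode mode) {
        if (recordedEtag == null || recordedEtag.equals(storeEtag)) {
            // Nothing to compare against (entry from an older client) or no mismatch.
            return Optional.empty();
        }
        switch (mode) {
            case WARN:
                System.err.println("etag mismatch: expected " + recordedEtag
                        + ", store has " + storeEtag);
                return Optional.empty();
            case READ_RECORDED_VERSION:
                if (recordedVersion != null) {
                    return Optional.of(recordedVersion);
                }
                throw new IllegalStateException("no version ID recorded; cannot read by version");
            case FAIL:
            default:
                throw new IllegalStateException("etag mismatch: expected " + recordedEtag);
        }
    }

    public static void main(String[] args) {
        // Mismatch on a versioned bucket: ask for the recorded version.
        System.out.println(versionToRead("etag-2", "v2", "etag-1", Mode.READ_RECORDED_VERSION));
        // Entry written by an older client: no etag, so no check is possible.
        System.out.println(versionToRead(null, null, "etag-1", Mode.FAIL));
    }
}
```

In this sketch a missing etag silently disables the check, which mirrors the note above that files uploaded by older clients are not protected.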

      Description

      Currently S3Guard doesn't track S3 object versions.  If a file is written in S3A with S3Guard and then subsequently overwritten, there is no protection against the next reader seeing the old version of the file instead of the new one.

      It seems like the S3Guard metadata could track the S3 object version.  When a file is created or updated, the object version could be written to the S3Guard metadata.  When a file is read, the read out of S3 could be performed by object version, ensuring the correct version is retrieved.
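The proposal in the paragraph above can be sketched with a toy model. Everything here is hypothetical and in-memory: `ToyStore` stands in for a versioned bucket whose unversioned reads may be stale, and `GuardedClient` stands in for a client that records the version ID on every write and reads by that version. It makes no claim about how S3A or S3Guard are actually implemented.

```java
import java.util.HashMap;
import java.util.Map;

public class VersionedReadSketch {
    /** Toy versioned store that keeps every version but may serve a stale "latest". */
    static class ToyStore {
        private final Map<String, String> versions = new HashMap<>();
        private final Map<String, String> staleLatest = new HashMap<>();
        private int counter = 0;

        /** Stores the data and returns a fresh version ID, as a versioned bucket would. */
        String put(String key, String data) {
            String versionId = "v" + (++counter);
            versions.put(key + "@" + versionId, data);
            // Simulate the inconsistency: unversioned reads keep seeing the first write.
            staleLatest.putIfAbsent(key, data);
            return versionId;
        }

        String getLatest(String key) {
            return staleLatest.get(key);
        }

        String getVersion(String key, String versionId) {
            return versions.get(key + "@" + versionId);
        }
    }

    /** Toy client pairing the store with a path -> version ID table. */
    static class GuardedClient {
        final ToyStore store = new ToyStore();
        private final Map<String, String> recordedVersion = new HashMap<>();

        /** On create or overwrite, record the version ID the store handed back. */
        void write(String key, String data) {
            recordedVersion.put(key, store.put(key, data));
        }

        /** On read, request the recorded version; fall back to latest if none is recorded. */
        String read(String key) {
            String versionId = recordedVersion.get(key);
            return versionId == null ? store.getLatest(key) : store.getVersion(key, versionId);
        }
    }

    public static void main(String[] args) {
        GuardedClient client = new GuardedClient();
        client.write("data/part-0000", "old contents");
        client.write("data/part-0000", "new contents");
        System.out.println("unguarded read: " + client.store.getLatest("data/part-0000"));
        System.out.println("guarded read:   " + client.read("data/part-0000"));
    }
}
```

After an overwrite, the unguarded read in this model still returns the first write, while the read that goes through the recorded version ID returns the new data.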

      I don't have a lot of direct experience with this yet, but this is my impression from looking through the code.  My organization is looking to shift some datasets stored in HDFS over to S3 and is concerned about this potential issue as there are some cases in our codebase that would do an overwrite.

      I imagine this idea may have been considered before but I couldn't quite track down any JIRAs discussing it.  If there is one, feel free to close this with a reference to it.

      Am I understanding things correctly?  Is this idea feasible?  Any feedback that could be provided would be appreciated.  We may consider crafting a patch.

        Attachments

        1. HADOOP-16085_002.patch
          63 kB
          Ben Roling
        2. HADOOP-16085_3.2.0_001.patch
          56 kB
          Ben Roling
        3. HADOOP-16085-003.patch
          129 kB
          Ben Roling

              People

              • Assignee:
                Ben Roling
              • Reporter:
                Ben Roling
              • Votes:
                0
              • Watchers:
                10
