Hadoop Common
HADOOP-16259

Distcp to set S3 Storage Class


    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.8.4
    • Fix Version/s: None
    • Component/s: hadoop-aws, tools/distcp
    • Labels:
      None
    • Target Version/s:
    • Flags:
      Patch
    • Docs Text:
      ENHANCE HADOOP DISTCP FOR CUSTOM S3 STORAGE CLASS
      Problem statement:
      The Hadoop distcp implementation has no property to override the storage class when transferring data to Amazon S3; it does not set any storage class at all. As a result, all objects moved from a cluster to S3 using Hadoop distcp are stored in the default storage class “STANDARD”.
      Because of this limitation, clusters that depend heavily on distcp to transfer data to S3 are forced to PUT objects into the high-cost “STANDARD” storage class and then use S3 lifecycle policies to transition the data to a cost-effective archive tier such as “GLACIER”.
      This considerably increases billing, since data is staged in the “STANDARD” tier before transitioning to the “GLACIER” tier even for use cases where archival is the only business need.
      This problem can be rectified by implementing the changes below in hadoop-aws-x.x.x.jar.
      Design:
      The hadoop-aws jar is part of the Hadoop distribution and provides the s3, s3n and s3a protocols for accessing objects stored in S3. To enable the storage class override feature for all three protocols, we have to implement the changes in each protocol as described below.
      Note: based on the Hadoop version of the cluster, we have to get the matching source code version of hadoop-aws-x.x.x.jar from the Apache download site.
      We will introduce a storage class property “fs.s3.storage.class”. This property will default to “STANDARD”, but the feature will allow overriding it to any of the valid S3 storage classes (STANDARD | REDUCED_REDUNDANCY | GLACIER | STANDARD_IA | ONEZONE_IA | INTELLIGENT_TIERING | DEEP_ARCHIVE).
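      To make the intent concrete, here is a minimal, self-contained sketch of how the proposed “fs.s3.storage.class” value could be validated against the storage classes listed above. The property name and default come from this proposal; the helper class and method names are hypothetical illustrations, not existing Hadoop code.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper sketching validation of the proposed property. */
public class StorageClassConfig {
    // Names proposed in this JIRA for the Constants class.
    public static final String S3_STORAGE_CLASS = "fs.s3.storage.class";
    public static final String S3_STORAGE_CLASS_DEFAULT = "STANDARD";

    // Valid S3 storage classes enumerated in the proposal above.
    private static final List<String> VALID_CLASSES = Arrays.asList(
        "STANDARD", "REDUCED_REDUNDANCY", "GLACIER", "STANDARD_IA",
        "ONEZONE_IA", "INTELLIGENT_TIERING", "DEEP_ARCHIVE");

    /** Returns the configured value, falling back to the default when unset or invalid. */
    public static String resolveStorageClass(String configured) {
        if (configured == null || !VALID_CLASSES.contains(configured)) {
            return S3_STORAGE_CLASS_DEFAULT;
        }
        return configured;
    }

    public static void main(String[] args) {
        System.out.println(resolveStorageClass("GLACIER")); // GLACIER
        System.out.println(resolveStorageClass(null));      // STANDARD
    }
}
```

      In the real patch this lookup would be backed by `conf.get(S3_STORAGE_CLASS, S3_STORAGE_CLASS_DEFAULT)` on the Hadoop `Configuration` object, as described in the steps below.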
      S3A:
      1. In the class “Constants” under the package “org.apache.hadoop.fs.s3a”, define the storage class property name, its default value, and the S3 header name as below:
      a. public static final String S3_STORAGE_CLASS="fs.s3.storage.class";
      b. public static final String S3_STORAGE_CLASS_DEFAULT="STANDARD";
      c. public static final String S3_STORAGE_CLASS_HEADER="x-amz-storage-class";
      2. In the class “S3AOutputStream” under the package “org.apache.hadoop.fs.s3a”, introduce an “s3StorageClass” field and initialize it in the constructor as below:
      a. this.s3StorageClass=conf.get(S3_STORAGE_CLASS, S3_STORAGE_CLASS_DEFAULT); - This reads the Hadoop configuration and checks whether an override value has been provided for the property “fs.s3.storage.class”. If so, that value is used while uploading the object to S3; otherwise the default value “STANDARD” is used.
      b. Then, in the close method, set the storage class on the object metadata as follows:
      i. om.setHeader(S3_STORAGE_CLASS_HEADER, this.s3StorageClass);
      3. In the class “S3AFastOutputStream” under the package “org.apache.hadoop.fs.s3a”, introduce an “s3StorageClass” field and initialize it in the constructor as below:
      a. this.s3StorageClass= fs.getConf().get(S3_STORAGE_CLASS, S3_STORAGE_CLASS_DEFAULT); - This reads the Hadoop configuration and checks whether an override value has been provided for the property “fs.s3.storage.class”. If so, that value is used while uploading the object to S3; otherwise the default value “STANDARD” is used.
      b. Then, in the close method, set the storage class on the object metadata as follows:
      i. om.setHeader(S3_STORAGE_CLASS_HEADER, this.s3StorageClass);
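      The steps above can be sketched as follows. `ObjectMetadata` here is a simplified stand-in for the AWS SDK class of the same name, and the surrounding stream class is reduced to the pieces relevant to this proposal, so this is an illustration of the intended change rather than the actual patch:

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified sketch of the constructor and close-time changes proposed above. */
public class StorageClassHeaderSketch {
    // Names proposed in this JIRA for the Constants class.
    public static final String S3_STORAGE_CLASS_HEADER = "x-amz-storage-class";
    public static final String S3_STORAGE_CLASS_DEFAULT = "STANDARD";

    /** Stand-in for com.amazonaws.services.s3.model.ObjectMetadata. */
    static class ObjectMetadata {
        private final Map<String, Object> headers = new HashMap<>();
        void setHeader(String key, Object value) { headers.put(key, value); }
        Object getHeader(String key) { return headers.get(key); }
    }

    private final String s3StorageClass;

    /** Mirrors: this.s3StorageClass = conf.get(S3_STORAGE_CLASS, S3_STORAGE_CLASS_DEFAULT); */
    StorageClassHeaderSketch(String configuredClass) {
        this.s3StorageClass =
            configuredClass != null ? configuredClass : S3_STORAGE_CLASS_DEFAULT;
    }

    /** Mirrors the proposed addition to close(): tag the metadata before the PUT. */
    ObjectMetadata buildMetadata() {
        ObjectMetadata om = new ObjectMetadata();
        om.setHeader(S3_STORAGE_CLASS_HEADER, this.s3StorageClass);
        return om;
    }
}
```

      With metadata tagged this way, every PUT issued by the stream carries the “x-amz-storage-class” header, so objects land directly in the configured storage class instead of “STANDARD”.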


      Advantage:
      • Objects can be uploaded directly into the GLACIER storage class, considerably reducing billing by eliminating unneeded staging of data in the STANDARD tier.

      Description

      The Hadoop distcp implementation has no property to override the storage class when transferring data to Amazon S3; it does not set any storage class at all, so all objects moved from a cluster to S3 using distcp are stored in the default storage class “STANDARD”. A new feature to override the default S3 storage class through configuration properties would make it possible to upload objects into other storage classes. I have described a design for implementing this feature in a design document and uploaded it to this JIRA. Kindly review and let me know your suggestions.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                Prakash Gopalsamy
              • Votes:
                0
                Watchers:
                6

                Dates

                • Created:
                  Updated:

                  Time Tracking

                  Estimated:
                  Original Estimate - 168h
                  Remaining:
                  Remaining Estimate - 168h
                  Logged:
                  Time Spent - Not Specified