Details

    • Hadoop Flags: Reviewed

      Description

      There's no specific documentation for working with object stores from the Hadoop fs shell or in DistCp; people either suffer the consequences (performance, billing) or learn what to do through trial and error.

      Add a section to both the fs shell and DistCp docs covering use with object stores.
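For context, the kind of usage the new docs cover looks roughly like the sketch below. The bucket and namenode names are hypothetical placeholders, and an object-store connector (e.g. the hadoop-aws JAR for `s3a://`) must be on the classpath with credentials configured:

```shell
#!/usr/bin/env bash
# Sketch only: "example-bucket" and "nn1" are placeholder names, not real hosts.
# Requires a Hadoop install with an object-store connector on the classpath
# and the store's credentials available in the cluster configuration.
if command -v hadoop >/dev/null 2>&1; then
  # List a bucket with the fs shell:
  hadoop fs -ls s3a://example-bucket/datasets/
  # Copy a directory tree from HDFS into the store with DistCp:
  hadoop distcp hdfs://nn1:8020/datasets/set1 s3a://example-bucket/datasets/set1
else
  echo "hadoop not on PATH; the commands above are illustrative only"
fi
```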

        Issue Links

          Activity

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10875 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10875/)
          HADOOP-13655. document object store use with fs shell and distcp. (liuml07: rev beb70fed4f15cd4afe8ea23e6068a8344d3557b1)

          • (edit) hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md
          • (edit) hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm
          • (edit) hadoop-common-project/hadoop-common/src/site/markdown/FileSystemShell.md
          liuml07 Mingliang Liu added a comment -

          Per offline discussion with Steve, the patch should go to trunk as well. The changes are pretty much the same across branches.

          liuml07 Mingliang Liu added a comment -

          Just committed to trunk through the branch-2.7 branches. As HDFS-9820 was only committed to the branch-2.9+ branches, I slightly changed the patch for branch-2.7 and branch-2.8 by removing the "-rdiff" statement from the DistCp doc change.

          Thanks Steve Loughran for your contribution, and thanks Yuanbo Liu for your review.

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/hadoop/pull/131

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 13m 49s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          0 mvndep 0m 19s Maven dependency ordering for branch
          +1 mvninstall 7m 1s branch-2 passed
          +1 mvnsite 1m 21s branch-2 passed
          0 mvndep 0m 15s Maven dependency ordering for patch
          +1 mvnsite 1m 17s the patch passed
          -1 whitespace 0m 0s The patch has 10 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 asflicense 0m 17s The patch does not generate ASF License warnings.
          24m 50s



          Subsystem Report/Notes
          Docker Image: yetus/hadoop:b59b8b7
          JIRA Issue HADOOP-13655
          GITHUB PR https://github.com/apache/hadoop/pull/131
          Optional Tests asflicense mvnsite
          uname Linux 8b08858dd925 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2 / 4b289d5
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/11113/artifact/patchprocess/whitespace-eol.txt
          modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-distcp U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11113/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          liuml07 Mingliang Liu added a comment -

          Attaching a patch for the trunk branch. It's largely the same as the original PR patch, which is for the branch-2 branch.

          Hopefully this will trigger Jenkins. Pinging Steve Loughran for a double check.

          githubbot ASF GitHub Bot added a comment -

          Github user steveloughran commented on a diff in the pull request:

          https://github.com/apache/hadoop/pull/131#discussion_r88898982

          — Diff: hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm —
          @@ -470,6 +470,105 @@ $H3 SSL Configurations for HSFTP sources

          The SSL configuration file must be in the class-path of the DistCp program.

          +$H3 DistCp and Object Stores
          +
          +DistCp works with Object Stores such as Amazon S3, Azure WASB and OpenStack Swift.
          +
          +Prequisites
          +
          +1. The JAR containing the object store implementation is on the classpath,
          +along with all of its dependencies.
          +1. Unless the JAR automatically registers its bundled filesystem clients,
          +the configuration may need to be modified to state the class which
          +implements the filesystem schema. All of the ASF's own object store clients
          +are self-registering.
          +1. The relevant object store access credentials must be available in the cluster
          +configuration, or be otherwise available in all cluster hosts.
          +
          +DistCp can be used to upload data
          +
          +```bash
          +hadoop distcp hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
          +```
          +
          +To download data
          +
          +```bash
          +hadoop distcp s3a://bucket/generated/results hdfs://nn1:8020/results
          +```
          +
          +To copy data between object stores
          +
          +```bash
          +hadoop distcp s3a://bucket/generated/results \
          + wasb://updates@example.blob.core.windows.net
          +```
          +
          +And do copy data within an object store
          +
          +```bash
          +hadoop distcp wasb://updates@example.blob.core.windows.net/current \
          + wasb://updates@example.blob.core.windows.net/old
          +```
          +
          +And to use `-update` to only copy changed files.
          +
          +```bash
          +hadoop distcp -update -numListstatusThreads 20 \
          + swift://history.cluster1/2016 \
          + hdfs://nn1:8020/history/2016
          +```
          +
          +Because object stores are slow to list files, consider setting the `-numListstatusThreads` option when performing a `-update` operation
          +on a large directory tree (the limit is 40 threads).
          +
          +When `DistCp -update` is used with objec stores,
          +generally only the modification time and length of the individual files are compared,
          +not any checksums. The fact that most object stores do have valid timestamps
          +for directories is irrelevant; only the file timestamps are compared.
          +However, it is important to have the clock of the client computers close
          +to that of the infrastructure, so that timestamps are consistent between
          +the client/HDFS cluster and that of the object store. Otherwise, changed files may be
          +missed/copied too often.
          +
          +*Notes*
          +
          +* The `-atomic` option causes a rename of the temporary data, so significantly
          +increases the time to commit work at the end of the operation. Furthermore,
          +as Object Stores other than (optionally) `wasb://` do not offer atomic renames of directories
          +the `-atomic` operation doesn't actually deliver what is promised. Avoid.
          +
          +* The `-append` option is not supported.
          +
          +* The `-diff` option is not supported
          — End diff –

          ok

          githubbot ASF GitHub Bot added a comment -

          Github user liuml07 commented on a diff in the pull request:

          https://github.com/apache/hadoop/pull/131#discussion_r88332309

          — Diff: hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm —
          @@ -470,6 +470,105 @@ $H3 SSL Configurations for HSFTP sources

          The SSL configuration file must be in the class-path of the DistCp program.

          +$H3 DistCp and Object Stores
          +
          +DistCp works with Object Stores such as Amazon S3, Azure WASB and OpenStack Swift.
          +
          +Prequisites
          +
          +1. The JAR containing the object store implementation is on the classpath,
          +along with all of its dependencies.
          +1. Unless the JAR automatically registers its bundled filesystem clients,
          +the configuration may need to be modified to state the class which
          +implements the filesystem schema. All of the ASF's own object store clients
          +are self-registering.
          +1. The relevant object store access credentials must be available in the cluster
          +configuration, or be otherwise available in all cluster hosts.
          +
          +DistCp can be used to upload data
          +
          +```bash
          +hadoop distcp hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
          +```
          +
          +To download data
          +
          +```bash
          +hadoop distcp s3a://bucket/generated/results hdfs://nn1:8020/results
          +```
          +
          +To copy data between object stores
          +
          +```bash
          +hadoop distcp s3a://bucket/generated/results \
          + wasb://updates@example.blob.core.windows.net
          +```
          +
          +And do copy data within an object store
          +
          +```bash
          +hadoop distcp wasb://updates@example.blob.core.windows.net/current \
          + wasb://updates@example.blob.core.windows.net/old
          +```
          +
          +And to use `-update` to only copy changed files.
          +
          +```bash
          +hadoop distcp -update -numListstatusThreads 20 \
          + swift://history.cluster1/2016 \
          + hdfs://nn1:8020/history/2016
          +```
          +
          +Because object stores are slow to list files, consider setting the `-numListstatusThreads` option when performing a `-update` operation
          +on a large directory tree (the limit is 40 threads).
          +
          +When `DistCp -update` is used with objec stores,
          — End diff –

          objec -> object

          githubbot ASF GitHub Bot added a comment -

          Github user liuml07 commented on a diff in the pull request:

          https://github.com/apache/hadoop/pull/131#discussion_r88338833

          — Diff: hadoop-common-project/hadoop-common/src/site/markdown/FileSystemShell.md —
          @@ -729,3 +757,280 @@ usage
          Usage: `hadoop fs -usage command`

          Return the help for an individual command.
          +
          +
          +<a name="ObjectStores" />Working with Object Storage
          — End diff –

          `<a name="ObjectStores" />` is accidentally here, I guess?

          githubbot ASF GitHub Bot added a comment -

          Github user liuml07 commented on a diff in the pull request:

          https://github.com/apache/hadoop/pull/131#discussion_r88342205

          — Diff: hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm —
          @@ -470,6 +470,105 @@ $H3 SSL Configurations for HSFTP sources

          The SSL configuration file must be in the class-path of the DistCp program.

          +$H3 DistCp and Object Stores
          +
          +DistCp works with Object Stores such as Amazon S3, Azure WASB and OpenStack Swift.
          +
          +Prequisites
          +
          +1. The JAR containing the object store implementation is on the classpath,
          +along with all of its dependencies.
          +1. Unless the JAR automatically registers its bundled filesystem clients,
          +the configuration may need to be modified to state the class which
          +implements the filesystem schema. All of the ASF's own object store clients
          +are self-registering.
          +1. The relevant object store access credentials must be available in the cluster
          +configuration, or be otherwise available in all cluster hosts.
          +
          +DistCp can be used to upload data
          +
          +```bash
          +hadoop distcp hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
          +```
          +
          +To download data
          +
          +```bash
          +hadoop distcp s3a://bucket/generated/results hdfs://nn1:8020/results
          +```
          +
          +To copy data between object stores
          +
          +```bash
          +hadoop distcp s3a://bucket/generated/results \
          + wasb://updates@example.blob.core.windows.net
          +```
          +
          +And do copy data within an object store
          +
          +```bash
          +hadoop distcp wasb://updates@example.blob.core.windows.net/current \
          + wasb://updates@example.blob.core.windows.net/old
          +```
          +
          +And to use `-update` to only copy changed files.
          +
          +```bash
          +hadoop distcp -update -numListstatusThreads 20 \
          + swift://history.cluster1/2016 \
          + hdfs://nn1:8020/history/2016
          +```
          +
          +Because object stores are slow to list files, consider setting the `-numListstatusThreads` option when performing a `-update` operation
          +on a large directory tree (the limit is 40 threads).
          +
          +When `DistCp -update` is used with objec stores,
          +generally only the modification time and length of the individual files are compared,
          +not any checksums. The fact that most object stores do have valid timestamps
          +for directories is irrelevant; only the file timestamps are compared.
          +However, it is important to have the clock of the client computers close
          +to that of the infrastructure, so that timestamps are consistent between
          +the client/HDFS cluster and that of the object store. Otherwise, changed files may be
          +missed/copied too often.
          +
          +*Notes*
          +
          +* The `-atomic` option causes a rename of the temporary data, so significantly
          +increases the time to commit work at the end of the operation. Furthermore,
          +as Object Stores other than (optionally) `wasb://` do not offer atomic renames of directories
          +the `-atomic` operation doesn't actually deliver what is promised. Avoid.
          +
          +* The `-append` option is not supported.
          +
          +* The `-diff` option is not supported
          — End diff –

          The `-diff/-rdiff` option is not supported

          Yes, there is an `-rdiff` option that was just added.

          githubbot ASF GitHub Bot added a comment -

          Github user liuml07 commented on a diff in the pull request:

          https://github.com/apache/hadoop/pull/131#discussion_r88343643

          — Diff: hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm —
          @@ -470,6 +470,105 @@ $H3 SSL Configurations for HSFTP sources

          The SSL configuration file must be in the class-path of the DistCp program.

          +$H3 DistCp and Object Stores
          +
          +DistCp works with Object Stores such as Amazon S3, Azure WASB and OpenStack Swift.
          +
          +Prequisites
          +
          +1. The JAR containing the object store implementation is on the classpath,
          +along with all of its dependencies.
          +1. Unless the JAR automatically registers its bundled filesystem clients,
          +the configuration may need to be modified to state the class which
          +implements the filesystem schema. All of the ASF's own object store clients
          +are self-registering.
          +1. The relevant object store access credentials must be available in the cluster
          +configuration, or be otherwise available in all cluster hosts.
          +
          +DistCp can be used to upload data
          +
          +```bash
          +hadoop distcp hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
          +```
          +
          +To download data
          +
          +```bash
          +hadoop distcp s3a://bucket/generated/results hdfs://nn1:8020/results
          +```
          +
          +To copy data between object stores
          +
          +```bash
          +hadoop distcp s3a://bucket/generated/results \
          + wasb://updates@example.blob.core.windows.net
          +```
          +
          +And do copy data within an object store
          +
          +```bash
          +hadoop distcp wasb://updates@example.blob.core.windows.net/current \
          + wasb://updates@example.blob.core.windows.net/old
          +```
          +
          +And to use `-update` to only copy changed files.
          +
          +```bash
          +hadoop distcp -update -numListstatusThreads 20 \
          + swift://history.cluster1/2016 \
          + hdfs://nn1:8020/history/2016
          +```
          +
          +Because object stores are slow to list files, consider setting the `-numListstatusThreads` option when performing a `-update` operation
          +on a large directory tree (the limit is 40 threads).
          +
          +When `DistCp -update` is used with objec stores,
          +generally only the modification time and length of the individual files are compared,
          +not any checksums. The fact that most object stores do have valid timestamps
          +for directories is irrelevant; only the file timestamps are compared.
          +However, it is important to have the clock of the client computers close
          +to that of the infrastructure, so that timestamps are consistent between
          +the client/HDFS cluster and that of the object store. Otherwise, changed files may be
          +missed/copied too often.
          +
          +*Notes*
          +
          +* The `-atomic` option causes a rename of the temporary data, so significantly
          +increases the time to commit work at the end of the operation. Furthermore,
          +as Object Stores other than (optionally) `wasb://` do not offer atomic renames of directories
          +the `-atomic` operation doesn't actually deliver what is promised. Avoid.
          +
          +* The `-append` option is not supported.
          +
          +* The `-diff` option is not supported
          +
          +* CRC checking will not be performed, irrespective of the value of the `-skipCrc`
          +flag.
          +
          +* All `-p` options, including those to preserve permissions, user and group information, attributes
          +checksums and replication are generally ignored. The `wasb://` connector will
          +preserve the information, but not enforce the permissions.
          +
          +* Some object store connectors offer an option for in-memory buffering of
          +output —for example the S3A connector. Using such option while copying
          +large files may trigger some form of out of memory event,
          +be it a heap overflow or a YARN container termination.
          +This is particularly common if the network bandwidth
          +between the cluster and the object store is limited (such as when working
          +with remote object stores). It is best to disable/avoid such options and
          +rely on disk buffering.
          +
          +* Copy operations within a single object store still take place in the Hadoop cluster
          +—even when the object store implements a more efficient COPY operation internally
          +
          + That is, an operation such as
          — End diff –

          The indentation is unnecessary?

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 18s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          0 mvndep 0m 54s Maven dependency ordering for branch
          +1 mvninstall 6m 38s branch-2 passed
          +1 mvnsite 1m 25s branch-2 passed
          0 mvndep 0m 14s Maven dependency ordering for patch
          +1 mvnsite 1m 22s the patch passed
          -1 whitespace 0m 0s The patch has 57 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 asflicense 0m 18s The patch does not generate ASF License warnings.
          11m 35s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:b59b8b7
          JIRA Issue HADOOP-13655
          GITHUB PR https://github.com/apache/hadoop/pull/131
          Optional Tests asflicense mvnsite
          uname Linux 997497522d2b 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2 / 086577c
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/10871/artifact/patchprocess/whitespace-eol.txt
          modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-distcp U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10871/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          githubbot ASF GitHub Bot added a comment -

          Github user steveloughran commented on a diff in the pull request:

          https://github.com/apache/hadoop/pull/131#discussion_r80909422

          — Diff: hadoop-common-project/hadoop-common/src/site/markdown/FileSystemShell.md —
          @@ -315,7 +324,11 @@ Returns 0 on success and -1 on error.

          Options:

          -The -f option will overwrite the destination if it already exists.
          +* `-p` : Preserves access and modification times, ownership and the permissions.
          +(assuming the permissions can be propagated across filesystems)
          +* `-f` : Overwrites the destination if it already exists.
          +* `-ignorecrc` : Skip CRC checks on the file(s) downloaded.
          +* `crc`: write CRC checksums for the files downloaded.
          — End diff –

          fixed

          yuanbo Yuanbo Liu added a comment -

          Steve Loughran I've reviewed your pull request in GitHub. Great work! Since I don't have much knowledge about object stores, I just found some trivial mistakes there. I would be glad to test those commands if I had an object store environment.
          Thanks again for your work, well done!

          githubbot ASF GitHub Bot added a comment -

          Github user yuanboliu commented on a diff in the pull request:

          https://github.com/apache/hadoop/pull/131#discussion_r80839392

          — Diff: hadoop-common-project/hadoop-common/src/site/markdown/FileSystemShell.md —
          @@ -729,3 +757,278 @@ usage
          Usage: `hadoop fs -usage command`

          Return the help for an individual command.
          +
          +
          +<a name="ObjectStores" />Working with Object Storage
          +====================================================
          +
          +The Hadoop FileSystem shell works with Object Stores such as Amazon S3,
          +Azure WASB and OpenStack Swift.
          +
          +
          +
          +```bash
          +# Create a directory
          +hadoop fs -mkdir s3a://bucket/datasets/
          +
          +# Upload a file from the cluster filesystem
          +hadoop fs -put /datasets/example.orc s3a://bucket/datasets/
          +
          +# touch a file
          +hadoop fs -touchz wasb://yourcontainer@youraccount.blob.core.windows.net/touched
          +```
          +
          +Unlike a normal filesystem, renaming files and directories in an object store
          +usually takes time proportional to the size of the objects being manipulated.
          +As many of the filesystem shell operations
          +use renaming as the final stage in operations, skipping that stage
          +can avoid long delays.
          +
          +In particular, the `put` and `copyFromLocal` commands should
          +both have the `-d` options set for a direct upload.
          +
          +
          +```bash
          +# Upload a file from the cluster filesystem
          +hadoop fs -put -d /datasets/example.orc s3a://bucket/datasets/
          +
          +# Upload a file from the local filesystem
          +hadoop fs -copyFromLocal -d -f ~/datasets/devices.orc s3a://bucket/datasets/
          — End diff –

          hadoop fs -copyFromLocal -d -f ~/datasets/devices.orc s3a://bucket/datasets/
          The symbol "~" is redundant, right?

          githubbot ASF GitHub Bot added a comment -

          Github user yuanboliu commented on a diff in the pull request:

          https://github.com/apache/hadoop/pull/131#discussion_r80836707

          — Diff: hadoop-common-project/hadoop-common/src/site/markdown/FileSystemShell.md —
          @@ -315,7 +324,11 @@ Returns 0 on success and -1 on error.

          Options:

          -The -f option will overwrite the destination if it already exists.
          +* `-p` : Preserves access and modification times, ownership and the permissions.
          +(assuming the permissions can be propagated across filesystems)
          +* `-f` : Overwrites the destination if it already exists.
          +* `-ignorecrc` : Skip CRC checks on the file(s) downloaded.
          +* `crc`: write CRC checksums for the files downloaded.
          — End diff –

          `crc` should be `-crc`

          githubbot ASF GitHub Bot added a comment -

          Github user yuanboliu commented on a diff in the pull request:

          https://github.com/apache/hadoop/pull/131#discussion_r80839511

          — Diff: hadoop-common-project/hadoop-common/src/site/markdown/FileSystemShell.md —
          @@ -729,3 +757,278 @@ usage
          Usage: `hadoop fs -usage command`

          Return the help for an individual command.
          +
          +
          +<a name="ObjectStores" />Working with Object Storage
          +====================================================
          +
          +The Hadoop FileSystem shell works with Object Stores such as Amazon S3,
          +Azure WASB and OpenStack Swift.
          +
          +
          +
          +```bash
          +# Create a directory
          +hadoop fs -mkdir s3a://bucket/datasets/
          +
          +# Upload a file from the cluster filesystem
          +hadoop fs -put /datasets/example.orc s3a://bucket/datasets/
          +
          +# touch a file
          +hadoop fs -touchz wasb://yourcontainer@youraccount.blob.core.windows.net/touched
          +```
          +
          +Unlike a normal filesystem, renaming files and directories in an object store
          +usually takes time proportional to the size of the objects being manipulated.
          +As many of the filesystem shell operations
          +use renaming as the final stage in operations, skipping that stage
          +can avoid long delays.
          +
          +In particular, the `put` and `copyFromLocal` commands should
          +both have the `-d` options set for a direct upload.
          +
          +
          +```bash
          +# Upload a file from the cluster filesystem
          +hadoop fs -put -d /datasets/example.orc s3a://bucket/datasets/
          +
          +# Upload a file from the local filesystem
          +hadoop fs -copyFromLocal -d -f ~/datasets/devices.orc s3a://bucket/datasets/
          +
          +# create a file from stdin
          +echo "hello" | hadoop fs -put -d -f - wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt
          — End diff –

          `hadoop fs -put -d -f - wasb:` should be `hadoop fs -put -d -f wasb:`

          githubbot ASF GitHub Bot added a comment -

          GitHub user steveloughran opened a pull request:

          https://github.com/apache/hadoop/pull/131

          HADOOP-13655

          Patch of filesystem shell & distcp docs to cover object stores. Also updated some references in filesystem/index.md which were out of date

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/steveloughran/hadoop s3/HADOOP-13655-shell-docs

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/hadoop/pull/131.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #131


          commit 0a76336a0515e136474cc62b7e1b97aa175f7d10
          Author: Steve Loughran <stevel@apache.org>
          Date: 2016-09-26T12:44:59Z

          HADOOP-13655 patch 001 of docs



            People

            • Assignee:
              stevel@apache.org Steve Loughran
              Reporter:
              stevel@apache.org Steve Loughran
            • Votes:
              0
              Watchers:
              7

              Dates

              • Created:
                Updated:
                Resolved:

                Development