HDFS-31

Hadoop distcp tool fails if a file path contains the special characters +, &, or !

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 0.20.2, 0.21.0, 0.22.0
    • Fix Version/s: None
    • Component/s: tools
    • Labels: None

      Description

      Copying folders whose names contain the characters +, &, or ! between HDFS clusters (using hftp) does not work with distcp.

      For example:
      Copying the folder "string1+string2" from "namenode.address.com" (hftp port myport) to "/myotherhome/folder" on "myothermachine" fails (a sketch of the failure mechanism follows the logs below):

      myothermachine prompt>>> hadoop --config ~/mycluster/ distcp "hftp://namenode.address.com:myport/myhome/dir/string1+string2" /myotherhome/folder/
      --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Error output for the Hadoop job (job1):
      --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      08/07/16 00:27:39 INFO tools.DistCp: srcPaths=[hftp://namenode.address.com:myport/myhome/dir/string1+string2]
      08/07/16 00:27:39 INFO tools.DistCp: destPath=/myotherhome/folder/
      08/07/16 00:27:41 INFO tools.DistCp: srcCount=2
      08/07/16 00:27:42 INFO mapred.JobClient: Running job: job1
      08/07/16 00:27:43 INFO mapred.JobClient: map 0% reduce 0%
      08/07/16 00:27:58 INFO mapred.JobClient: Task Id : attempt_1_m_000000_0, Status : FAILED
      java.io.IOException: Copied: 0 Skipped: 0 Failed: 1
      at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:538)
      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:226)
      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2208)

      08/07/16 00:28:14 INFO mapred.JobClient: Task Id : attempt_1_m_000000_1, Status : FAILED
      java.io.IOException: Copied: 0 Skipped: 0 Failed: 1
      at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:538)
      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:226)
      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2208)

      08/07/16 00:28:28 INFO mapred.JobClient: Task Id : attempt_1_m_000000_2, Status : FAILED
      java.io.IOException: Copied: 0 Skipped: 0 Failed: 1
      at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:538)
      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:226)
      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2208)

      With failures, global counters are inaccurate; consider running with -i
      Copy failed: java.io.IOException: Job failed!
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1053)
      at org.apache.hadoop.tools.DistCp.copy(DistCp.java:615)
      at org.apache.hadoop.tools.DistCp.run(DistCp.java:764)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
      at org.apache.hadoop.tools.DistCp.main(DistCp.java:784)
      --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Error log for the map task which failed
      --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      INFO org.apache.hadoop.tools.DistCp: FAIL string1+string2/myjobtrackermachine.com-joblog.tar.gz : java.io.IOException: Server returned HTTP response code: 500 for URL: http://mymachine.com:myport/streamFile?filename=/myhome/dir/string1+string2/myjobtrackermachine.com-joblog.tar.gz&ugi=myid,mygroup
      at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1241)
      at org.apache.hadoop.dfs.HftpFileSystem.open(HftpFileSystem.java:117)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:371)
      at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:377)
      at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:504)
      at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:279)
      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:226)
      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2208)
      --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
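
      A minimal Java sketch of the failure mechanism behind the HTTP 500 above (illustrative only; the path and class name are hypothetical): the streamFile servlet receives the filename query parameter as form-encoded data, in which a literal + stands for a space, so the decoded path no longer matches the file in HDFS.

      import java.net.URLDecoder;

      public class PlusDecodingDemo {
          public static void main(String[] args) throws Exception {
              // The path exactly as distcp places it in the streamFile URL.
              String rawParam = "/myhome/dir/string1+string2/joblog.tar.gz";

              // Form-style decoding, as applied to query parameters:
              // a literal '+' decodes to a space.
              String decoded = URLDecoder.decode(rawParam, "UTF-8");

              // Prints "/myhome/dir/string1 string2/joblog.tar.gz": the '+'
              // is lost, the lookup fails, and the server returns HTTP 500.
              System.out.println(decoded);
          }
      }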


          Activity

          Tsz Wo Nicholas Sze added a comment -

          Closing as invalid. Please feel free to reopen if the problem still exists.
          Tsz Wo Nicholas Sze added a comment -

          This should be fixed by HDFS-1109. Could you check again?
          Kris Jirapinyo added a comment -

          Yes, that would be nice.

          I was using hftp to copy from a 0.20.1 cluster to a CDH3 cluster (running distcp on the CDH3 cluster), and I ran into the same 500 error. It seems that the URL-escaping mechanism produces an incorrect final fetch URL.

          e.g.

          file in HDFS:
          /test/twitteruserout2/_logs/history/mi-prod-app01.ec2.biz360.com_1269013964063_job_201003190852_17784_hadoop_twitter+users+extraction+from+source+on+Tue+Apr+20

          Filename the server actually tries to open (after + is decoded to a space):
          /test/twitteruserout2/_logs/history/mi-prod-app01.ec2.biz360.com_1269013964063_job_201003190852_17784_hadoop_twitter users extraction from source on Tue Apr 20

          Error logged on the serving machine:
          2010-08-16 14:33:06,765 WARN org.mortbay.log: /streamFile: java.io.IOException: Cannot open filename /test/twitteruserout2/_logs/history/mi-prod-app01.ec2.biz360.com_1269013964063_job_201003190852_17784_hadoop_twitter users extraction from source on Tue Apr 20

          Fetching the URL directly over HTTP:

          http://mi-prod-app28:50075/streamFile?filename=/test/twitteruserout2/_logs/history/mi-prod-app01.ec2.biz360.com_1269013964063_job_201003190852_17784_hadoop_twitter+users+extraction+from+source+on+Tue+Apr+20&ugi=hadoop,hadoop

          This doesn't work and gives the same error as above.
          However, if I replace the + with %2B, the GET works.
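
          A minimal sketch of the workaround Kris describes (hypothetical helper, not HftpFileSystem's actual code; the host name is illustrative): percent-encode each path segment before placing it in the filename query parameter, so a literal + reaches the server as %2B and survives form-style decoding.

          import java.net.URLEncoder;

          public class StreamFileUrlDemo {
              // Encode each segment of an HDFS path for use in a query string.
              // URLEncoder turns '+' into "%2B" (and a space into '+'), which
              // is exactly the substitution that makes the GET succeed.
              static String encodePath(String path) throws Exception {
                  String[] segments = path.split("/", -1);
                  StringBuilder sb = new StringBuilder();
                  for (int i = 0; i < segments.length; i++) {
                      if (i > 0) sb.append('/');
                      sb.append(URLEncoder.encode(segments[i], "UTF-8"));
                  }
                  return sb.toString();
              }

              public static void main(String[] args) throws Exception {
                  String file = "/dir/twitter+users+extraction";
                  // Yields filename=/dir/twitter%2Busers%2Bextraction, which the
                  // server decodes back to the original path and can open.
                  System.out.println("http://datanode:50075/streamFile?filename="
                          + encodePath(file) + "&ugi=hadoop,hadoop");
              }
          }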

          Allen Wittenauer added a comment -

          This really needs to get fixed.

            People

    • Assignee: Unassigned
    • Reporter: Viraj Bhat
    • Votes: 0
    • Watchers: 5
