  1. Hadoop Map/Reduce
  2. MAPREDUCE-646

distcp should place the file distcp_src_files in distributed cache

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: distcp
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      The patch increases the replication factor of _distcp_src_files to sqrt(min(maxMapsOnCluster, totalMapsInThisJob)) so that many maps won't access the same replica of the file _distcp_src_files at the same time.
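
      As a rough illustration of the formula in the release note, the sketch below computes sqrt(min(maxMapsOnCluster, totalMapsInThisJob)) in plain Java. The class and method names are hypothetical, and the rounding and the lower bound of 1 are assumptions; the attached patch may handle these details differently. In Hadoop, a value like this could then be applied to the file with FileSystem.setReplication(Path, short).

```java
public class SrcFileListReplication {

    /**
     * Hypothetical helper: replication factor for the source file list,
     * following sqrt(min(maxMapsOnCluster, totalMapsInThisJob)).
     */
    static short replicationFor(int maxMapsOnCluster, int totalMapsInThisJob) {
        // Upper bound on the number of maps that can read the file concurrently.
        int concurrentMaps = Math.min(maxMapsOnCluster, totalMapsInThisJob);

        // Assumption: round to the nearest integer and never go below 1 replica.
        long factor = Math.round(Math.sqrt(Math.max(concurrentMaps, 1)));

        // Replication is a short in HDFS APIs, so clamp before narrowing.
        return (short) Math.min(Math.max(factor, 1L), (long) Short.MAX_VALUE);
    }

    public static void main(String[] args) {
        // Example: a cluster with 400 map slots running a job with 10000 maps.
        System.out.println(replicationFor(400, 10000));
    }
}
```

      With the square root, the number of replicas grows much more slowly than the number of concurrent maps, so each replica is still shared by roughly sqrt(n) readers instead of all n.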

      Description

      When a large number of files is being copied by distcp, accessing distcp_src_files seems to be a problem, because every map task reads this file. The error message seen is:

      09/06/16 10:13:16 INFO mapred.JobClient: Task Id : attempt_200906040559_0110_m_003348_0, Status : FAILED
      java.io.IOException: Could not obtain block: blk_-4229860619941366534_1500174
      file=/mapredsystem/hadoop/mapredsystem/distcp_7fiyvq/_distcp_src_files
      at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1757)
      at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1585)
      at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1712)
      at java.io.DataInputStream.readFully(DataInputStream.java:178)
      at java.io.DataInputStream.readFully(DataInputStream.java:152)
      at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
      at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
      at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
      at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
      at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
      at org.apache.hadoop.tools.DistCp$CopyInputFormat.getRecordReader(DistCp.java:299)
      at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:336)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
      at org.apache.hadoop.mapred.Child.main(Child.java:170)

      This could be because of HADOOP-6038 and/or HADOOP-4681.

      If distcp placed this special file, distcp_src_files, in the distributed cache, that could solve the problem.

        Attachments

        1. d_replica_srcfilelist_v1.patch
          2 kB
          Ravi Gummadi
        2. d_replica_srcfilelist_v2.patch
          3 kB
          Ravi Gummadi
        3. d_replica_srcfilelist.patch
          2 kB
          Ravi Gummadi

            People

            • Assignee:
              Ravi Gummadi (ravidotg)
            • Reporter:
              Ravi Gummadi (ravidotg)
            • Votes:
              0
            • Watchers:
              2
