  1. Hadoop Common
  2. HADOOP-15898

1 - 1.5 TB Data size fails to run with the following error

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 2.6.0
    • 2.6.0
    • performance
    • Hadoop 2.6.0-cdh5.5.1 Express edition.

       

       

    • Important

    Description

      We have a business-critical MapReduce job that runs every day at 2:00 PM PST on about 1 to 1.5 TB of data (the size depends on the business day). The job should ideally finish in 4 hours, but multiple mappers fail simultaneously with the error below, so the job sometimes takes 11 or even 13 hours.

      Steps taken so far to prevent this problem:

      1. Migrated the environment to YARN.

      2. Increased the ulimit.

      3. Added extra nodes to the cluster.

      4. Disks are replaced regularly.

      5. The cluster is monitored, and other jobs that impact this job are terminated.

      We also tried increasing the following values:

      1. The open files limit

      2. dfs.datanode.handler.count

      3. dfs.datanode.max.xcievers

      4. dfs.datanode.max.transfer.threads

      None of these helped.
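A raised open-file limit only applies to processes started after the change, so it can be worth confirming what a process actually sees. The following is a minimal sketch using Python's standard `resource` module (Unix only). It reports the limits of the Python process itself, not of the DataNode. On Linux, the limits of an already running process can be read from `/proc/<pid>/limits`. The helper names `open_file_limits` and `describe` are illustrative.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


soft, hard = open_file_limits()
print("open files: soft=%s hard=%s" % (describe(soft), describe(hard)))
```

Running this under the same user and shell setup that launches the Hadoop daemons shows whether the new limit is picked up in that environment.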

      org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-854530680-69.194.253.58-1430267558563:blk_4683766046_1108754130089 in pipeline DatanodeInfoWithStorage[10.0.1.37:50010,DS-ed333d2e-839a-4029-a1c9-b6615c322ed2,DISK], DatanodeInfoWithStorage[74.120.143.19:50010,DS-5d10576e-adc3-474f-bc9d-f0d6fb3ae4c3,DISK], DatanodeInfoWithStorage[74.120.143.6:50010,DS-a5299d68-2858-46c3-8e37-d2559895f979,DISK]: bad datanode DatanodeInfoWithStorage[10.0.1.37:50010,DS-ed333d2e-839a-4029-a1c9-b6615c322ed2,DISK]
       
      org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-854530680-69.194.253.58-1430267558563:blk_4683766046_1108754130089 in pipeline DatanodeInfoWithStorage[74.120.143.19:50010,DS-5d10576e-adc3-474f-bc9d-f0d6fb3ae4c3,DISK], DatanodeInfoWithStorage[74.120.143.6:50010,DS-a5299d68-2858-46c3-8e37-d2559895f979,DISK]: bad datanode DatanodeInfoWithStorage[74.120.143.19:50010,DS-5d10576e-adc3-474f-bc9d-f0d6fb3ae4c3,DISK]

      org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: java.io.IOException: All datanodes DatanodeInfoWithStorage[74.120.143.6:50010,DS-a5299d68-2858-46c3-8e37-d2559895f979,DISK] are bad. Aborting... at
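The messages above show the write pipeline for one block shrinking: each recovery names one bad datanode, and the next message lists the pipeline without it, until the last remaining node is also reported bad. The sketch below is one way to pull the block ID, pipeline, and bad datanode out of such a message so the sequence is easier to follow across many task logs. The regular expressions are written against the message layout shown in this issue only, and `parse_recovery` is an illustrative name, not part of Hadoop.

```python
import re

# Matches one datanode entry such as
# DatanodeInfoWithStorage[10.0.1.37:50010,DS-...,DISK] and captures host:port.
NODE_RE = re.compile(r"DatanodeInfoWithStorage\s*\[\s*([0-9.]+:\d+),")


def parse_recovery(message):
    """Split one 'Error Recovery for block' message into its parts.

    Returns (block, pipeline, bad_node), or None if the message is not
    a recovery message.
    """
    # A message may be wrapped across lines; collapse whitespace first.
    text = " ".join(message.split())
    block = re.search(r"Error Recovery for block (\S+) in pipeline", text)
    if block is None:
        return None
    # Nodes before "bad datanode" form the pipeline; the node after it failed.
    head, sep, tail = text.partition("bad datanode")
    pipeline = NODE_RE.findall(head)
    bad = NODE_RE.findall(tail) if sep else []
    return block.group(1), pipeline, (bad[0] if bad else None)


# Second recovery message from this issue, copied verbatim.
SAMPLE = (
    "org.apache.hadoop.hdfs.DFSClient: Error Recovery for block "
    "BP-854530680-69.194.253.58-1430267558563:blk_4683766046_1108754130089 "
    "in pipeline "
    "DatanodeInfoWithStorage[74.120.143.19:50010,DS-5d10576e-adc3-474f-bc9d-f0d6fb3ae4c3,DISK], "
    "DatanodeInfoWithStorage[74.120.143.6:50010,DS-a5299d68-2858-46c3-8e37-d2559895f979,DISK]: "
    "bad datanode "
    "DatanodeInfoWithStorage[74.120.143.19:50010,DS-5d10576e-adc3-474f-bc9d-f0d6fb3ae4c3,DISK]"
)

block, pipeline, bad = parse_recovery(SAMPLE)
remaining = [node for node in pipeline if node != bad]
print(block)
print("pipeline:", pipeline)
print("bad:", bad, "remaining:", remaining)
```

For the sample message, the only node left after recovery is 74.120.143.6:50010, which is the node named in the final "All datanodes ... are bad" exception.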
       

       


          People

            Assignee: Unassigned
            Reporter: Srinivas (snarava@gmail.com)
            Votes: 0
            Watchers: 2


              Time Tracking

                Original Estimate: 96h
                Remaining Estimate: 96h
                Time Spent: Not Specified