Uploaded image for project: 'Apache Tez'
  1. Apache Tez
  2. TEZ-2552

CRC errors can cause job to run for very long time in large jobs

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 0.6.2, 0.8.0-alpha, 0.7.1, 0.5.5
    • None
    • None

    Description

      Ran a fairly large job at 10 TB scale which had 1009 reducers.

      One of the machine had bad disk and NM did not delist that disk. Machine hosting NM has disk issues (sdf & sde holds shuffle data). exceptions.

      Info fld=0x8960894
      sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
      sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 08 96 08 90 00 00 08 00
      end_request: critical medium error, dev sdf, sector 144050320
      sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
      Info fld=0x895a2b9
      sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
      sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 08 95 a2 b8 00 00 08 00
      end_request: critical medium error, dev sdf, sector 144024248
      sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
      Info fld=0x895a2b9
      sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
      sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 08 95 a2 b8 00 00 08 00
      end_request: critical medium error, dev sdf, sector 144024248
      sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
      Info fld=0x8849edb
      sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
      sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 08 84 9e d8 00 00 08 00
      end_request: critical medium error, dev sdf, sector 142909144
      sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
      Info fld=0x8849edb
      sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
      sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 08 84 9e d8 00 00 08 00
      end_request: critical medium error, dev sdf, sector 142909144
      

      In-memory fetches start throwing CRC as follows.

      2015-06-11 01:01:03,728 INFO [ShuffleAndMergeRunner [Map_11]] orderedgrouped.ShuffleScheduler: PendingHosts=[]
      2015-06-11 01:01:03,730 INFO [Fetcher [Map_11] #0] http.HttpConnection: for url=http://cn056-10.l42scl.hortonworks.com:13562/mapOutput?job=job_1433813751839_0124&reduce=3&map=attempt_1433813751839_0124_1_04_000446_0_10027&keepAlive=true sent hash and receievd reply 0 ms
      2015-06-11 01:01:03,730 INFO [Fetcher [Map_11] #0] orderedgrouped.FetcherOrderedGrouped: fetcher#439 about to shuffle output of map InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=446], attemptNumber=0, pathComponent=attempt_1433813751839_0124_1_04_000446_0_10027, fetchTypeInfo=FINAL_MERGE_ENABLED, spillEventId=-1] decomp: 45475 len: 23974 to MEMORY
      2015-06-11 01:01:07,206 INFO [Fetcher [Map_11] #0] impl.IFileInputStream:  CurrentOffset=2510, offset=2510, off=2502, dataLength=23966, origLen=21456, len=21456, length=23970, checksumSize=4
      2015-06-11 01:01:07,207 INFO [Fetcher [Map_11] #0] impl.IFileInputStream:  CurrentOffset=2510, offset=2510, off=0, dataLength=23966, origLen=21456, len=21456, length=23970, checksumSize=4
      2015-06-11 01:01:07,207 WARN [Fetcher [Map_11] #0] orderedgrouped.FetcherOrderedGrouped: Failed to shuffle output of InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=446], attemptNumber=0, pathComponent=attempt_1433813751839_0124_1_04_000446_0_10027, fetchTypeInfo=FINAL_MERGE_ENABLED, spillEventId=-1] from cn056-10.l42scl.hortonworks.com:13562
      org.apache.hadoop.fs.ChecksumException: Checksum Error:  CurrentOffset=2510, offset=2510, off=2502, dataLength=23966, origLen=21456, len=21456, length=23970, checksumSize=4
      	at org.apache.tez.runtime.library.common.sort.impl.IFileInputStream.doRead(IFileInputStream.java:255)
      	at org.apache.tez.runtime.library.common.sort.impl.IFileInputStream.read(IFileInputStream.java:185)
      	at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:127)
      	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
      	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
      	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
      	at org.apache.tez.runtime.library.common.sort.impl.IFile$Reader.readToMemory(IFile.java:619)
      	at org.apache.tez.runtime.library.common.shuffle.ShuffleUtils.shuffleToMemory(ShuffleUtils.java:113)
      	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:471)
      	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:267)
      	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:164)
      	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:177)
      	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:52)
      	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      

      TaskAttemptImpl didn't fail it due to the following code

      float failureFraction = ((float) attempt.uniquefailedOutputReports.size())
                / outputFailedEvent.getConsumerTaskNumber();
      

      In this case, reducer ran in 180 slot waves. So even if all 180 tasks report the error, it would be around 180/1009 = 0.17 (which is less than 0.25 MAX_ALLOWED_OUTPUT_FAILURES_FRACTION) and the job runs for ever (killed the job after 2 hours; normally run in couple of minutes)

      In fetcher side, reducer state would be healthy and it would continue to wait.

      Env: Tez master & Hive master
      Ref: Query_88 @ 10 TB scale.

      Attachments

        1. Screen Shot 2015-06-11 at 3.07.47 PM.png
          202 kB
          Rajesh Balamohan
        2. TEZ-2552.1.patch
          7 kB
          Rajesh Balamohan
        3. TEZ-2552.2.patch
          8 kB
          Rajesh Balamohan
        4. TEZ-2552.3.patch
          10 kB
          Rajesh Balamohan
        5. TEZ-2552.branch-0.5.patch
          11 kB
          Rajesh Balamohan
        6. TEZ-2552.branch-0.6.patch
          10 kB
          Rajesh Balamohan
        7. TEZ-2552.wip.patch
          2 kB
          Rajesh Balamohan
        8. with_patch.png
          89 kB
          Rajesh Balamohan
        9. without_patch.png
          75 kB
          Rajesh Balamohan

        Activity

          People

            rajesh.balamohan Rajesh Balamohan
            rajesh.balamohan Rajesh Balamohan
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: