  1. Hadoop Map/Reduce
  2. MAPREDUCE-7003

Indefinite retries of getJobSummary() if a job summary file is corrupt

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: jobhistoryserver
    • Labels:
      None

      Description

      If a job summary file in the /user/history/done_intermediate directory in HDFS (e.g. /user/history/done_intermediate/oozie/job_1111111111111_111111.summary) becomes corrupt before it is moved to /user/history/done, the JobHistoryServer (JHS) retries org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getJobSummary() indefinitely. The JHS log then fills with recurring exceptions like:

      2017-11-03 01:01:01,124 WARN org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
      java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR, self=/ABC.DEF.GHI:JKLMN, remote=/ABC.DEF.GHI:JKLMN, for file /user/history/done_intermediate/admin/job_1111111111111_1111.summary, for pool XX-999999999-ABC.DEF.GHI-1111111111111 block 1111111111_22222
      	at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
      	at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
      	at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
      	at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
      	at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
      	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:652)
      	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:879)
      	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:932)
      	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:732)
      	at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:337)
      	at java.io.DataInputStream.readUTF(DataInputStream.java:589)
      	at java.io.DataInputStream.readUTF(DataInputStream.java:564)
      	at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getJobSummary(HistoryFileManager.java:1059)
      

      (INFO and ERROR logs are omitted)
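
      The stack trace shows the read failing inside DataInputStream.readUTF(), called from getJobSummary(). As an illustration only, the following minimal sketch shows one way a read of this kind could be capped at a fixed number of attempts. It is not the HistoryFileManager code and not a proposed patch: the class BoundedSummaryRead, the SummarySource interface, and the maxAttempts parameter are hypothetical names, and a truncated in-memory byte array stands in for the unreadable HDFS block.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical sketch; this is not the HistoryFileManager implementation. */
class BoundedSummaryRead {

    /** Hypothetical stand-in for whatever opens the summary file. */
    interface SummarySource {
        DataInputStream open() throws IOException;
    }

    /**
     * Tries to read the summary string at most maxAttempts times.
     * Returns null once the limit is reached, so the caller can skip
     * the file instead of retrying forever.
     */
    static String readSummary(SummarySource source, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (DataInputStream in = source.open()) {
                return in.readUTF();
            } catch (IOException e) {
                System.err.println("attempt " + attempt + " of " + maxAttempts
                        + " failed: " + e);
            }
        }
        return null;
    }

    /** Encodes a string in the format that readUTF() expects. */
    static byte[] encode(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] good = encode("jobId=job_1111111111111_1111");
        // Cut the payload in half to simulate a summary that cannot be read.
        byte[] corrupt = Arrays.copyOf(good, good.length / 2);

        System.out.println(readSummary(
                () -> new DataInputStream(new ByteArrayInputStream(good)), 3));
        System.out.println(readSummary(
                () -> new DataInputStream(new ByteArrayInputStream(corrupt)), 3));
    }
}
```

      In this sketch the intact input is returned on the first attempt, while the truncated input fails every attempt and yields null after the third. What the JHS should do with a file that still cannot be read (skip it, move it aside, or log it once) is a separate design decision that this report leaves open.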

      To reproduce it:

      • Start JHS in debug mode (pass the JVM parameter -agentlib:jdwp=transport=dt_socket,server=y,address=45555,suspend=n when starting it).
      • Attach a debugger to the process and set a breakpoint in org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getJobSummary().
      • Start a MapReduce job and wait until the breakpoint is hit.
      • Delete or rename the physical block of the job summary file on the datanode(s). For example, run hdfs fsck /user/history/done_intermediate/oozie/job_1111111111111_111111.summary -blocks -locations -files to get the block name, then find the block on the datanode(s) and remove or rename it.
      • Detach the debugger.
      • Examine the JHS log files.

            People

            • Assignee:
              Unassigned
            • Reporter:
              Attila Sasvári (asasvari)
            • Votes:
              0
            • Watchers:
              3
