HIVE-1032: Better Error Messages for Execution Errors

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.0
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Three common errors that occur during execution are:

      1. Map-side group-by causing an out of memory exception due to large aggregation hash tables

      2. ScriptOperator failing due to the user's script throwing an exception or otherwise returning a non-zero error code

      3. Incorrectly specifying the join order of small and large tables, causing the large table to be loaded into memory and producing an out of memory exception.

      These errors are typically discovered by manually examining the error log files of the failed task. This task proposes to create a feature that would automatically read the error logs and output a probable cause and solution to the command line.
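      For illustration, a heuristic of this kind can scan each log line for a known symptom and map it to a canned cause and remedy. The sketch below is hypothetical (the class and method names are placeholders, not the committed code, though hive.map.aggr and hive.map.aggr.hash.percentmemory are real Hive settings):

          import java.util.regex.Pattern;

          // Hypothetical sketch of one such heuristic: scan the failed task's log
          // for a known symptom and report a probable cause and suggested fix.
          public class MapAggrOomHeuristic {
            // Symptom for case 1: the map-side group-by hash table exhausted memory.
            private static final Pattern SYMPTOM =
                Pattern.compile("java\\.lang\\.OutOfMemoryError");

            private boolean matched = false;

            // Called once per line read from the task log.
            public void processLogLine(String line) {
              if (SYMPTOM.matcher(line).find()) {
                matched = true;
              }
            }

            // After the whole log is scanned, return the advice to print, or null.
            public String getProbableCauseAndSolution() {
              if (!matched) {
                return null;
              }
              return "Out of memory due to map-side aggregation. Try lowering "
                  + "hive.map.aggr.hash.percentmemory, or disable map-side "
                  + "aggregation with: set hive.map.aggr = false;";
            }
          }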

      Attachments

      1. HIVE-1032.6.patch (32 kB, Paul Yang)
      2. HIVE-1032.5.patch (28 kB, Paul Yang)
      3. HIVE-1032.4.patch (28 kB, Paul Yang)
      4. HIVE-1032.3.patch (21 kB, Paul Yang)
      5. HIVE-1032.2.patch (21 kB, Paul Yang)
      6. HIVE-1032.1.patch (21 kB, Paul Yang)

          Activity

          Paul Yang added a comment -
          • Resolved conflict
          Namit Jain added a comment -

          Had a discussion with Paul offline - it might be a good idea to separate the mapper logs and reducer logs and then possibly run different heuristics on each.
          For example, if a query contains a join and a group-by but no mapjoin, and an out-of-memory error occurs on the reducer, it is likely due to the join order.

          Paul Yang added a comment -
          • Removed the join order error message as it seems that HIVE-963 fixed it
          • Added flag to indicate whether the log came from a map or reduce task
          Zheng Shao added a comment -

          I like the idea of having interfaces/base classes like ErrorHeuristics and Error.

          Some questions about these new interfaces:

          1. ErrorHeuristics:
          1.1 The comment says it usually returns a single error, but the interface returns List<Error>. I think it should be good enough to return a single error, and the current code already assumes that (when excluding reported ErrorHeuristics). This simplifies the concepts; we can always have multiple ErrorHeuristics, one for each type of error.
          1.2 readLogLine(String) should be named processLogLine(String) - "read" usually suggests an InputStream returning the line.
          1.3 Can we say "getError()" implies "reset"? Maybe name it "getErrorAndReset()". That simplifies the interface.

          2. TaskLogProcessor:
          2.1 Can we rename addTaskLogUrl to addTaskAttemptLogUrl?
          2.2 Can we return only the error that is detected in the largest number of task attempts? We can output multiple errors if they have the same counts.
          2.3 Let's add a comment to each level of the loop. There are two "all"s here referring to two loops, without saying whether they are nested (and in which order) or parallel:
          + // Read the lines from all the task logs and feed them to all the
          + // error heuristics

          3. Error:
          3.1 Let's rename it to ErrorAndSolution? That's more appropriate I think. Accordingly we can modify all function names.

          For the case that Namit mentions, I think we should just let the Operators output different kinds of error messages, so ErrorHeuristics can capture that. I don't see how the added flag helps solve the problem (and it's actually never used in the code), so I would prefer doing it the old way.
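          To make the suggested renames concrete, here is one reading of points 1.1-1.3, 2.1-2.3, and 3.1 as code. This is a sketch of the proposed shape only, not the committed patch:

              import java.util.ArrayList;
              import java.util.List;

              // 3.1: "Error" renamed to the more descriptive "ErrorAndSolution".
              class ErrorAndSolution {
                private final String error;
                private final String solution;
                ErrorAndSolution(String error, String solution) {
                  this.error = error;
                  this.solution = solution;
                }
                String getError() { return error; }
                String getSolution() { return solution; }
              }

              interface ErrorHeuristic {
                // 1.2: "process", not "read" -- the caller reads the stream.
                void processLogLine(String line);
                // 1.1 + 1.3: at most one error per heuristic, and calling this
                // resets state so the heuristic can scan the next attempt's log.
                ErrorAndSolution getErrorAndReset();
              }

              class TaskLogProcessor {
                private final List<String> taskAttemptLogUrls = new ArrayList<String>();
                private final List<ErrorHeuristic> heuristics = new ArrayList<ErrorHeuristic>();

                // 2.1: one URL per task attempt, hence addTaskAttemptLogUrl.
                void addTaskAttemptLogUrl(String url) {
                  taskAttemptLogUrls.add(url);
                }

                // 2.3: a comment on each loop level makes the nesting unambiguous.
                List<ErrorAndSolution> getErrors() {
                  List<ErrorAndSolution> found = new ArrayList<ErrorAndSolution>();
                  // Outer loop: over the logs of all task attempts.
                  for (String url : taskAttemptLogUrls) {
                    for (String line : fetchLines(url)) {
                      // Inner loop: feed the current line to all the error heuristics.
                      for (ErrorHeuristic h : heuristics) {
                        h.processLogLine(line);
                      }
                    }
                    // Collect per-attempt results; per 2.2, a fuller version would
                    // count how many attempts hit each error and keep the most common.
                    for (ErrorHeuristic h : heuristics) {
                      ErrorAndSolution e = h.getErrorAndReset();
                      if (e != null) {
                        found.add(e);
                      }
                    }
                  }
                  return found;
                }

                // Elided: fetch the task attempt log over HTTP and split into lines.
                private List<String> fetchLines(String url) {
                  return new ArrayList<String>();
                }
              }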

          Namit Jain added a comment -

          I think the same error message can mean different things - it depends on the context.

          Going forward, users will write more complex queries containing, say, a script and a join.
          Both the join and the script operator can cause an out-of-memory exception, but if the
          script runs in the mapper and the join in the reducer, we can use that to tell them apart.

          As you said, the operator can output different messages, but whether we are processing a
          mapper or a reducer log is also part of the same context. Today, we run all error processors
          for all queries, but we could determine which ones to run at compile time (based on the query).

          Zheng Shao added a comment -

          Whether it's in the mapper or reducer is not useful by itself - we would have to hook ErrorHeuristics up to the operator tree to find that out.

          Instead of going that route, it would be much easier to have the ErrorHeuristics capture all the information from the log. We already have operator initialization messages in the log, so the ErrorHeuristics should already know which operators could have caused the problem.

          In this way, we can make the ErrorHeuristics loosely coupled with the compilation/query execution.
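          For illustration, a heuristic could recover the set of operators that ran from the log alone; the init-line format matched below is only a placeholder for whatever the operators actually print at initialization time:

              import java.util.HashSet;
              import java.util.Set;
              import java.util.regex.Matcher;
              import java.util.regex.Pattern;

              // Sketch: recover which operators ran in a task purely from the
              // log, with no hook into the operator tree. The line format is an
              // assumption for illustration, not the real Hive log format.
              public class OperatorTracker {
                private static final Pattern INIT_LINE =
                    Pattern.compile("Initializing \\S+ \\d+ ([A-Z]+)");
                private final Set<String> seenOperators = new HashSet<String>();

                public void processLogLine(String line) {
                  Matcher m = INIT_LINE.matcher(line);
                  if (m.find()) {
                    seenOperators.add(m.group(1));  // e.g. a group-by or join operator
                  }
                }

                // A heuristic can then condition on this, e.g. only blame map-side
                // aggregation for an OutOfMemoryError if a group-by operator ran.
                public boolean sawOperator(String name) {
                  return seenOperators.contains(name);
                }
              }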

          Zheng Shao added a comment -

          I went over the code again. Actually, the flag "isMapTask" is used in "MapAggrMemErrorHeuristic".
          However, "MapAggr" can actually happen in the reducer: we always put the operators in the reducer of the former map-reduce job, rather than the mapper of the latter one. We should remove that check.

          Zheng Shao added a comment -

          Another error that we might want to include in the same patch.

          The solution message for this error would be: "Data file split hdfs://dfs:9000/user/hive/warehouse/mytable/ds=2009-10-04/part-00232, range: 0-0 is corrupted".

          2010-01-19 11:53:30,581 INFO org.apache.hadoop.mapred.MapTask: split: hdfs://dfs:9000/user/hive/warehouse/mytable/ds=2009-10-04/part-00232, range: 0-0
          2010-01-19 11:53:30,795 WARN org.apache.hadoop.mapred.Child: Error running child
          java.io.EOFException
          	at java.io.DataInputStream.readFully(DataInputStream.java:180)
          	at java.io.DataInputStream.readFully(DataInputStream.java:152)
          	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
          	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
          	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
          	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
          	at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
          	at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:63)
          	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:236)
          	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
          	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
          	at org.apache.hadoop.mapred.Child.main(Child.java:159)
          2010-01-19 11:53:30,801 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
          
          Zheng Shao added a comment -

          Talked with Namit offline some time ago about this issue.
          Let's remove the "isMapTask" flag from the constructor.

          Paul Yang added a comment -
          • Incorporated Zheng's comments, removed isMapTask flag.

          For the data split error condition, I only checked for the EOFException. Is that good enough?
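          For reference, a minimal version of that check might pair the EOFException with the most recent "split:" line from the log, so the message can name the suspect file along the lines of Zheng's example above (a sketch, not the committed code):

              import java.util.regex.Matcher;
              import java.util.regex.Pattern;

              // Sketch of the data-corruption check under discussion: remember the
              // last "split: ..." line, and if an EOFException follows, blame it.
              public class CorruptSplitHeuristic {
                private static final Pattern SPLIT_LINE = Pattern.compile("split: (.*)");
                private String lastSplit = null;
                private boolean sawEof = false;

                public void processLogLine(String line) {
                  Matcher m = SPLIT_LINE.matcher(line);
                  if (m.find()) {
                    lastSplit = m.group(1);
                  } else if (line.contains("java.io.EOFException")) {
                    sawEof = true;
                  }
                }

                // Returns the message to print (or null), then resets for the next log.
                public String getErrorAndReset() {
                  String msg = (sawEof && lastSplit != null)
                      ? "Data file split " + lastSplit + " is corrupted"
                      : null;
                  lastSplit = null;
                  sawEof = false;
                  return msg;
                }
              }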

          Zheng Shao added a comment -

          The TaskInfo class still has the field "isMapTask", which is not used anywhere. Shall we remove it?

          Zheng Shao added a comment -

          Hi Paul, one last thing: can you try it out on both Hadoop 0.17 and Hadoop 0.20? We need to make sure this works with both.

          Paul Yang added a comment -

          Because this patch uses features of HIVE-873, it will not work with Hadoop 0.17. If you want, I can send you the broken queries I used to test on 0.20.

          Zheng Shao added a comment -

          That makes sense to me. As long as it compiles with 0.17, it should be OK.

          Sorry, there is another last thing: can you run "ant checkstyle" and fix the checkstyle warnings introduced by this patch (especially in the new files)?

          Paul Yang added a comment -
          • Fixed checkstyle issues
          Zheng Shao added a comment -

          Committed. Thanks Paul!


            People

            • Assignee: Paul Yang
            • Reporter: Paul Yang
            • Votes: 1
            • Watchers: 2
