Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-11069

ColumnStatsTask doesn't work with hive.exec.parallel

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 1.2.0
    • None
    • None
    • None

    Description

      Try a simple query:

      hive> set hive.exec.parallel=true;
      hive> analyze table src compute statistics for columns;
      

      It fails with errors similar to:

      FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
      hive> java.lang.RuntimeException: Error caching map.xml: java.io.IOException: java.lang.InterruptedException
      	at org.apache.hadoop.hive.ql.exec.Utilities.setBaseWork(Utilities.java:747)
      	at org.apache.hadoop.hive.ql.exec.Utilities.setMapWork(Utilities.java:682)
      	at org.apache.hadoop.hive.ql.exec.Utilities.setMapRedWork(Utilities.java:674)
      	at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:375)
      	at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:137)
      	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
      	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
      	at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:75)
      Caused by: java.io.IOException: java.lang.InterruptedException
      	at org.apache.hadoop.ipc.Client.call(Client.java:1450)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1402)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
      	at com.sun.proxy.$Proxy14.mkdirs(Unknown Source)
      	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
      	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
      	at com.sun.proxy.$Proxy15.mkdirs(Unknown Source)
      	at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2758)
      	at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2729)
      	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:870)
      	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:866)
      	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      	at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:866)
      	at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:859)
      	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1954)
      	at org.apache.hadoop.hive.ql.exec.Utilities.setPlanPath(Utilities.java:765)
      	at org.apache.hadoop.hive.ql.exec.Utilities.setBaseWork(Utilities.java:691)
      	... 7 more
      Caused by: java.lang.InterruptedException
      	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
      	at java.util.concurrent.FutureTask.get(FutureTask.java:187)
      	at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1049)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1444)
      	... 28 more
      Job Submission failed with exception 'java.lang.RuntimeException(Error caching map.xml: java.io.IOException: java.lang.InterruptedException)'
      

      The problem is the Column Stats Task doesn't depend on the root task which causes errors. Here's the explain output:

      hive> explain analyze table src compute statistics for columns;
      OK
      STAGE DEPENDENCIES:
        Stage-0 is a root stage
        Stage-1 is a root stage
      
      STAGE PLANS:
        Stage: Stage-0
          Map Reduce
            Map Operator Tree:
                TableScan
                  alias: src
                  Select Operator
                    expressions: key (type: string), value (type: string)
                    outputColumnNames: key, value
                    Group By Operator
                      aggregations: compute_stats(key, 16), compute_stats(value, 16)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Reduce Output Operator
                        sort order:
                        value expressions: _col0 (type: struct<columntype:string,maxlength:bigint,sumlength:bigint,count:bigint,countnulls:bigint,bitvector:string,numbitvectors:int>), _col1 (type: struct<columntype:string,maxlength:bigint,sumlength:bigint,count:bigint,countnulls:bigint,bitvector:string,numbitvectors:int>)
            Reduce Operator Tree:
              Group By Operator
                aggregations: compute_stats(VALUE._col0), compute_stats(VALUE._col1)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      
        Stage: Stage-1
          Column Stats Work
            Column Stats Desc:
                Columns: key, value
                Column Types: string, string
                Table: default.src
      
      Time taken: 0.761 seconds, Fetched: 39 row(s)
      

      For reference, here's the corresponding output in Hive 0.13:

      hive> explain analyze table orders compute statistics for columns;
      OK
      STAGE DEPENDENCIES:
        Stage-0 is a root stage
        Stage-1 depends on stages: Stage-0
      
      STAGE PLANS:
        Stage: Stage-0
          Map Reduce
            Map Operator Tree:
                TableScan
                  alias: orders
                  Statistics: Num rows: 0 Data size: 13310103552 Basic stats: PARTIAL Column stats: COMPLETE
      
        Stage: Stage-1
          Stats-Aggr Operator
      
      Time taken: 2.142 seconds, Fetched: 15 row(s)
      

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              rjainqb Rajat Jain
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: