Hive / HIVE-15575

ALTER TABLE CONCATENATE and hive.merge.tezfiles seems busted for UNION ALL output


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved

    Description

      Hive UNION ALL produces data in sub-directories under the table/partition directories. E.g.

      hive (mythdb_hadooppf_17544)> create table source ( foo string, bar string, goo string ) stored as textfile;
      OK
      Time taken: 0.322 seconds
      hive (mythdb_hadooppf_17544)> create table results_partitioned( foo string, bar string, goo string ) partitioned by ( dt string ) stored as orcfile;
      OK
      Time taken: 0.322 seconds
      hive (mythdb_hadooppf_17544)> set hive.merge.tezfiles=false; insert overwrite table results_partitioned partition( dt ) select 'goo', 'bar', 'foo', '1' from source UNION ALL select 'go', 'far', 'moo', '1' from source;
      ...
      Loading data to table mythdb_hadooppf_17544.results_partitioned partition (dt=null)
               Time taken for load dynamic partitions : 311
              Loading partition {dt=1}
               Time taken for adding to write entity : 3
      OK
      Time taken: 27.659 seconds
      hive (mythdb_hadooppf_17544)> dfs -ls -R /tmp/mythdb_hadooppf_17544/results_partitioned;
      drwxrwxrwt   - dfsload hdfs          0 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
      drwxrwxrwt   - dfsload hdfs          0 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1
      -rwxrwxrwt   3 dfsload hdfs        349 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1/000000_0
      drwxrwxrwt   - dfsload hdfs          0 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2
      -rwxrwxrwt   3 dfsload hdfs        368 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2/000000_0
      

      These results can only be read if mapred.input.dir.recursive=true, which TezCompiler::init() appears to set. But the Hadoop default for this setting is false. This leads to the following problems:
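      The effect of the non-recursive default can be illustrated locally, without Hive. This is a sketch using a mock of the dt=1 partition layout shown above (the directory names and file contents are stand-ins):

```shell
# Mock the UNION ALL output layout: data files sit one level below dt=1.
DEMO=$(mktemp -d)
mkdir -p "$DEMO/dt=1/1" "$DEMO/dt=1/2"
printf 'goo\tbar\tfoo\n' > "$DEMO/dt=1/1/000000_0"
printf 'go\tfar\tmoo\n'  > "$DEMO/dt=1/2/000000_0"

# A non-recursive listing (the Hadoop default) sees no data files at all:
find "$DEMO/dt=1" -maxdepth 1 -type f | wc -l   # 0
# Only a recursive listing finds both files:
find "$DEMO/dt=1" -type f | wc -l               # 2
```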
      1. Running CONCATENATE on the partition causes data loss.

      hive --database mythdb_hadooppf_17544 -e " set mapred.input.dir.recursive; alter table results_partitioned partition ( dt='1' ) concatenate ; set mapred.input.dir.recursive; "
      ...
      OK
      Time taken: 2.151 seconds
      mapred.input.dir.recursive=false
      
      
      Status: Running (Executing on YARN cluster with App id application_1481756273279_5088754)
      
      --------------------------------------------------------------------------------
              VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
      --------------------------------------------------------------------------------
      File Merge         SUCCEEDED      0          0        0        0       0       0
      --------------------------------------------------------------------------------
      VERTICES: 01/01  [>>--------------------------] 0%    ELAPSED TIME: 0.35 s
      --------------------------------------------------------------------------------
      Loading data to table mythdb_hadooppf_17544.results_partitioned partition (dt=1)
      Moved: 'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1' to trash at: hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
      Moved: 'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2' to trash at: hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
      OK
      Time taken: 25.873 seconds
      
      $ hdfs dfs -count -h /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
                 1            0                  0 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
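      Until this is fixed, one could guard against the data loss by checking for subdirectories before running CONCATENATE. A hypothetical sketch, shown against a local mock of the partition layout (with HDFS the listing would be `hdfs dfs -ls "$PART_DIR"`):

```shell
# Mock a partition directory that contains UNION ALL subdirectories.
PART_DIR=$(mktemp -d)/dt=1
mkdir -p "$PART_DIR/1" "$PART_DIR/2"
touch "$PART_DIR/1/000000_0" "$PART_DIR/2/000000_0"

# If the partition holds subdirectories, CONCATENATE would drop their files.
if [ -n "$(find "$PART_DIR" -mindepth 1 -maxdepth 1 -type d)" ]; then
  echo "UNSAFE: partition contains subdirectories; skipping CONCATENATE"
fi
```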
      

      2. hive.merge.tezfiles is busted, because the merge task attempts to merge files across results_partitioned/dt=1/1 and results_partitioned/dt=1/2:

      $ hive --database mythdb_hadooppf_17544 -e " set hive.merge.tezfiles=true; insert overwrite table results_partitioned partition( dt ) select 'goo', 'bar', 'foo', '1' from source UNION ALL select 'go', 'far', 'moo', '1' from source; "
      ...
      Query ID = dfsload_20170110233558_51289333-d9da-4851-8671-bfe653d26e45
      Total jobs = 3
      Launching Job 1 out of 3
      
      
      Status: Running (Executing on YARN cluster with App id application_1481756273279_5089989)
      
      --------------------------------------------------------------------------------
              VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
      --------------------------------------------------------------------------------
      Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
      Map 3 ..........   SUCCEEDED      1          1        0        0       0       0
      --------------------------------------------------------------------------------
      VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 13.07 s
      --------------------------------------------------------------------------------
      Stage-4 is filtered out by condition resolver.
      Stage-3 is selected by condition resolver.
      Stage-5 is filtered out by condition resolver.
      Launching Job 3 out of 3
      
      
      Status: Running (Executing on YARN cluster with App id application_1481756273279_5089989)
      
      --------------------------------------------------------------------------------
              VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
      --------------------------------------------------------------------------------
      File Merge           RUNNING      1          0        1        0       2       0
      --------------------------------------------------------------------------------
      VERTICES: 00/01  [>>--------------------------] 0%    ELAPSED TIME: 3.06 s
      --------------------------------------------------------------------------------
      ...
      

      The File Merge fails with the following:

      TaskAttempt 3 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Multiple partitions for one merge mapper: hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2 NOT EQUAL TO hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
              at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
              at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
              at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
              at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:192)
              at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:184)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
              at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:184)
              at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:180)
              at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Multiple partitions for one merge mapper: hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2 NOT EQUAL TO hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
              at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:217)
              at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.run(MergeFileRecordProcessor.java:151)
              at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
              ... 14 more
      Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Multiple partitions for one merge mapper: hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2 NOT EQUAL TO hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
              at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:159)
              at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.process(OrcFileMergeOperator.java:62)
              at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:208)
              ... 16 more
      Caused by: java.io.IOException: Multiple partitions for one merge mapper: hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2 NOT EQUAL TO hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
              at org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.checkPartitionsMatch(AbstractFileMergeOperator.java:174)
              at org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.fixTmpPath(AbstractFileMergeOperator.java:191)
              at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:86)
              ... 18 more
      ]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1481756273279_5089989_2_00 [File Merge] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
      

      3. Data produced with Hive UNION ALL will not be readable by Pig/HCatalog unless mapred.input.dir.recursive is set.

      Setting mapred.input.dir.recursive=true in hive-site.xml should resolve the first and third problems, but is that the recommendation? It is intrusive, and it doesn't solve #2. As far as I understand, Pig's UNION doesn't lay out its output this way.
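      For reference, the workaround in question would be a hive-site.xml fragment along these lines (again: this only papers over #1 and #3, affects the whole cluster, and does nothing for the merge-task failure in #2):

```xml
<property>
  <name>mapred.input.dir.recursive</name>
  <value>true</value>
</property>
```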

      People

        Assignee: Unassigned
        Reporter: Mithun Radhakrishnan