Details
Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Description
Hive UNION ALL produces data in sub-directories under the table/partition directories. For example:
hive (mythdb_hadooppf_17544)> create table source ( foo string, bar string, goo string ) stored as textfile;
OK
Time taken: 0.322 seconds
hive (mythdb_hadooppf_17544)> create table results_partitioned( foo string, bar string, goo string ) partitioned by ( dt string ) stored as orcfile;
OK
Time taken: 0.322 seconds
hive (mythdb_hadooppf_17544)> set hive.merge.tezfiles=false;
                              insert overwrite table results_partitioned partition( dt )
                              select 'goo', 'bar', 'foo', '1' from source
                              UNION ALL
                              select 'go', 'far', 'moo', '1' from source;
...
Loading data to table mythdb_hadooppf_17544.results_partitioned partition (dt=null)
	Time taken for load dynamic partitions : 311
	Loading partition {dt=1}
	Time taken for adding to write entity : 3
OK
Time taken: 27.659 seconds
hive (mythdb_hadooppf_17544)> dfs -ls -R /tmp/mythdb_hadooppf_17544/results_partitioned;
drwxrwxrwt   - dfsload hdfs          0 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
drwxrwxrwt   - dfsload hdfs          0 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1
-rwxrwxrwt   3 dfsload hdfs        349 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1/000000_0
drwxrwxrwt   - dfsload hdfs          0 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2
-rwxrwxrwt   3 dfsload hdfs        368 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2/000000_0
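As a quick illustration (a sketch, not verbatim output), reading the partition back depends entirely on whether recursive input listing is enabled. This assumes an execution path that honours the Hadoop default rather than having the flag forced on (e.g. a plain MapReduce read); the exact failure mode (an exception versus missing rows) depends on the engine and input format:

set mapred.input.dir.recursive=false;
select count(*) from results_partitioned where dt='1';   -- the files under dt=1/1 and dt=1/2 are not read
set mapred.input.dir.recursive=true;
select count(*) from results_partitioned where dt='1';   -- both sub-directories are picked up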
These results can only be read if mapred.input.dir.recursive=true, which TezCompiler::init() appears to set. However, the Hadoop default for this setting is false, which leads to the following problems:
1. Running CONCATENATE on the partition causes data loss (a session-level workaround is sketched after this list).
hive --database mythdb_hadooppf_17544 -e "
  set mapred.input.dir.recursive;
  alter table results_partitioned partition ( dt='1' ) concatenate ;
  set mapred.input.dir.recursive;
"
...
OK
Time taken: 2.151 seconds
mapred.input.dir.recursive=false
Status: Running (Executing on YARN cluster with App id application_1481756273279_5088754)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
File Merge           SUCCEEDED      0          0        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [>>--------------------------] 0%    ELAPSED TIME: 0.35 s
--------------------------------------------------------------------------------
Loading data to table mythdb_hadooppf_17544.results_partitioned partition (dt=1)
Moved: 'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1' to trash at: hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
Moved: 'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2' to trash at: hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
OK
Time taken: 25.873 seconds

$ hdfs dfs -count -h /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
           1            0                  0 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
2. hive.merge.tezfiles is broken, because the merge task attempts to merge files across results_partitioned/dt=1/1 and results_partitioned/dt=1/2:
$ hive --database mythdb_hadooppf_17544 -e "
  set hive.merge.tezfiles=true;
  insert overwrite table results_partitioned partition( dt )
  select 'goo', 'bar', 'foo', '1' from source
  UNION ALL
  select 'go', 'far', 'moo', '1' from source;
"
...
Query ID = dfsload_20170110233558_51289333-d9da-4851-8671-bfe653d26e45
Total jobs = 3
Launching Job 1 out of 3
Status: Running (Executing on YARN cluster with App id application_1481756273279_5089989)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 3 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 13.07 s
--------------------------------------------------------------------------------
Stage-4 is filtered out by condition resolver.
Stage-3 is selected by condition resolver.
Stage-5 is filtered out by condition resolver.
Launching Job 3 out of 3
Status: Running (Executing on YARN cluster with App id application_1481756273279_5089989)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
File Merge            RUNNING      1          0        1        0       2       0
--------------------------------------------------------------------------------
VERTICES: 00/01  [>>--------------------------] 0%    ELAPSED TIME: 3.06 s
--------------------------------------------------------------------------------
...
The File Merge fails with the following:
TaskAttempt 3 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Multiple partitions for one merge mapper: hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2 NOT EQUAL TO hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
	at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:192)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:184)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:184)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:180)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Multiple partitions for one merge mapper: hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2 NOT EQUAL TO hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
	at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:217)
	at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.run(MergeFileRecordProcessor.java:151)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
	... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Multiple partitions for one merge mapper: hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2 NOT EQUAL TO hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
	at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:159)
	at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.process(OrcFileMergeOperator.java:62)
	at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:208)
	... 16 more
Caused by: java.io.IOException: Multiple partitions for one merge mapper: hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2 NOT EQUAL TO hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
	at org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.checkPartitionsMatch(AbstractFileMergeOperator.java:174)
	at org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.fixTmpPath(AbstractFileMergeOperator.java:191)
	at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:86)
	... 18 more
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1481756273279_5089989_2_00 [File Merge] killed/failed due to:OWN_TASK_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
3. Data produced with Hive UNION ALL will not be readable by Pig/HCatalog without mapred.input.dir.recursive=true.
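For problems 1 and 3, the session-level equivalent of the hive-site.xml change discussed below would be to force recursive listing explicitly before touching the partition. A minimal sketch (I have not verified this end to end):

hive --database mythdb_hadooppf_17544 -e "
  set mapred.input.dir.recursive=true;
  alter table results_partitioned partition ( dt='1' ) concatenate;
"

Readers outside Hive (e.g. Pig/HCatalog, problem 3) would presumably need the same property set on their side before loading the table.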
Setting mapred.input.dir.recursive=true in hive-site.xml should resolve the first and third problems, but is that the recommended fix? It is intrusive, and it does not solve #2. As far as I understand, Pig's UNION does not produce this kind of layout.
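For what it's worth, a slightly less intrusive variant of the same workaround is to pass the flag per invocation rather than globally; this is only a sketch and, like the hive-site.xml change, does nothing for #2:

hive --hiveconf mapred.input.dir.recursive=true \
     --database mythdb_hadooppf_17544 \
     -e "select count(*) from results_partitioned where dt='1';"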