[HIVE-23758] OrcInputFormat.getSargColumnNames might be more failsafe in case of schema mismatch - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Invalid
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- pull-request-available

Description

There was a customer case, where a bucket file was somehow placed into a partition directory, which contained another bucket file with valid acid schema (refer orc_dump.log for details), and query failed while split generation with below error at this line

Caused by: java.lang.RuntimeException: ORC split generation failed with exception: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1871)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1959)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:532)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:789)
        at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
        ... 4 more
Caused by: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1865)
        ... 17 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6
        at java.util.ArrayList.rangeCheck(ArrayList.java:653)
        at java.util.ArrayList.get(ArrayList.java:429)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSargColumnNames(OrcInputFormat.java:482)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.extractNeededColNames(OrcInputFormat.java:539)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.extractNeededColNames(OrcInputFormat.java:534)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.access$2900(OrcInputFormat.java:158)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1556)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2700(OrcInputFormat.java:1337)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1522)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1519)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1519)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1337)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)

haven't figured out the origin of the second, empty file, but a sanity check of length helped to skip this issue by ignoring that file while split generation, which I'm about to try out in the first version of the pull request:
in tez app logs after the patch:

2020-06-24 13:13:21,331 [WARN] [ORC_GET_SPLITS #2] |orc.OrcInputFormat|: possible schema mismatch, asked for column with index:6. column but there is only 6 types defined (isOriginal: false, originalColumnNames.length: 1), cannot get sarg col names...
2020-06-24 13:13:21,331 [WARN] [ORC_GET_SPLITS #2] |orc.OrcInputFormat|: Skipping split elimination for hdfs://ns1/warehouse/tablespace/managed/hive/bdaa28846/cda_date=20200601/cda_job_name=core_base/base_0000001/bucket_00001 as column names is null

where bucket_00001 was the second, problematic file, so the patch helped split generation recover from this strange state...

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

orc_dump.log
24/Jun/20 13:24
6 kB
László Bodor

Issue Links

is fixed by

HIVE-23889 Empty bucket files are inserted with invalid schema after HIVE-21784

Resolved

links to

GitHub Pull Request #1174

Activity

People

Assignee:: László Bodor

Reporter:: László Bodor

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 24/Jun/20 13:22

Updated:: 05/Aug/20 18:04

Resolved:: 05/Aug/20 18:03

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

40m