[SPARK-25175] Field resolution should fail if there's ambiguity for ORC native reader - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.1
Fix Version/s: 2.4.0
Component/s: SQL
Labels: None

Description

SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet files. We found that ORC files have similar, but not identical, issues. Spark has two OrcFileFormat implementations.

Since SPARK-2883, Spark has supported ORC inside the sql/hive module, with a Hive dependency. This Hive OrcFileFormat always does case-insensitive field resolution, regardless of the case sensitivity mode. When there is ambiguity, the Hive OrcFileFormat always returns the first matched field rather than failing the read.
SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports case-insensitive field resolution; however, it cannot handle duplicate fields.
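To illustrate the "first matched field" behavior described above, here is a minimal Python sketch. It is not Spark or Hive code; `resolve_first_match` and `find_ambiguous` are hypothetical helpers that only model the resolution rule on a list of physical field names.

```python
def resolve_first_match(file_fields, requested):
    """Model of 'first match wins': always compare case-insensitively
    and return the first physical field that matches, or None."""
    wanted = requested.lower()
    for name in file_fields:
        if name.lower() == wanted:
            return name
    return None


def find_ambiguous(file_fields):
    """Group physical field names that differ only by case.
    Returns only the groups that contain more than one name."""
    groups = {}
    for name in file_fields:
        groups.setdefault(name.lower(), []).append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}
```

With physical fields `["a", "A", "b"]`, a request for `A` resolves to `a` under this rule, so the data of the other column is silently ignored. `find_ambiguous` shows the condition under which a stricter reader would fail instead.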

Besides data source tables, Hive serde tables also have issues. If an ORC data file has more fields than the table schema, Hive serde tables cannot be read at all. If the ORC data file does not have more fields than the table schema, Hive serde tables always resolve fields by ordinal rather than by name.
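The difference between resolution by ordinal and resolution by name can be sketched as follows. This is an illustrative model only: both functions are hypothetical, and raising `ValueError` when the file has extra fields is an assumption standing in for "the table cannot be read".

```python
def resolve_by_ordinal(table_columns, file_fields):
    """Pair the i-th table column with the i-th physical field.
    Assumption: a file with more fields than the table schema is rejected."""
    if len(file_fields) > len(table_columns):
        raise ValueError("file has more fields than the table schema")
    mapping = {}
    for i, column in enumerate(table_columns):
        # Columns beyond the end of the file have no physical field.
        mapping[column] = file_fields[i] if i < len(file_fields) else None
    return mapping


def resolve_by_name(table_columns, file_fields):
    """Pair each table column with the physical field of the same name."""
    return {
        column: (column if column in file_fields else None)
        for column in table_columns
    }
```

For a table declared as `["id", "name"]` over a file written as `["name", "id"]`, ordinal resolution maps `id` to the physical `name` field, whereas name-based resolution maps each column to its namesake.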

Both the Hive implementation of the ORC data source and Hive serde tables rely on the Hive ORC InputFormat/SerDe to read tables. I'm not sure whether we can change the underlying Hive classes to make all ORC read behaviors consistent.

This ticket aims to make the read behavior of the native ORC data source implementation consistent with the Parquet data source.
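A rough sketch of the intended rule, as suggested by the ticket title, might look like the following. It is a model in plain Python, not the actual Spark implementation: `resolve_field` and `AmbiguousFieldError` are hypothetical names, the `case_sensitive` flag stands in for the session's case sensitivity mode, and returning `None` for a missing field is an assumption.

```python
class AmbiguousFieldError(Exception):
    """Hypothetical error raised when a requested field matches
    more than one physical field in case-insensitive mode."""


def resolve_field(file_fields, requested, case_sensitive):
    """Resolve one requested field name against the physical field names."""
    if case_sensitive:
        # Exact match only; names that differ by case are distinct fields.
        return requested if requested in file_fields else None

    matches = [n for n in file_fields if n.lower() == requested.lower()]
    if len(matches) > 1:
        # Fail instead of silently picking one of the candidates.
        raise AmbiguousFieldError(
            f"field {requested!r} matches multiple fields: {matches}"
        )
    return matches[0] if matches else None
```

Under this rule, a file with fields `["a", "A"]` is readable in case-sensitive mode, where `a` and `A` are resolved separately, but any request for either name fails in case-insensitive mode.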

Issue Links

blocks: SPARK-20901 Feature parity for ORC with Parquet (Open)

relates to: SPARK-25132 Case-insensitive field resolution when reading from Parquet (Resolved)

links to: [Github] Pull Request #22262 (seancxmao)

People

Assignee: Chenxiao Mao

Reporter: Chenxiao Mao

Votes: 0

Watchers: 5

Dates

Created: 21/Aug/18 09:23

Updated: 10/Sep/18 02:26

Resolved: 10/Sep/18 02:24