Spark / SPARK-14543

SQL/Hive insertInto has unexpected results


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Updated description
      There should be an option to match input data to output columns by name. The API allows operations on tables, which hides the column resolution problem: it is easy to copy from one table to another without listing the columns, and in the API it is common to work with columns by name rather than by position. Matching columns by name is closer to what users expect, so I propose adding something like this:

      CREATE TABLE src (id bigint, count int, total bigint)
      CREATE TABLE dst (id bigint, total bigint, count int)
      
      sqlContext.table("src").write.byName.insertInto("dst")
      

      Original description
      The Hive write path adds a pre-insertion cast (projection) to reconcile incoming data columns with the outgoing table schema. Columns are matched by position, and casts are inserted wherever the two schemas' types differ.

      When columns aren't correctly aligned, this causes unexpected results. I ran into this by not using a correct partitionBy call (addressed by SPARK-14459), which caused an error message that an int could not be cast to an array. However, if the columns are vaguely compatible, for example string and float, then no error or warning is produced and data is written to the wrong columns using unexpected casts (string -> bigint -> float).
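      The by-position pairing can be illustrated with a plain-Python sketch (this is not Spark's actual code; the tables and column names are hypothetical):

```python
# By-position matching: incoming value i goes to destination column i,
# with no check that the column names agree.
incoming_columns = ["id", "count", "total"]   # source column order
incoming_values = [1, 10, 100]                # one row of source data

dest_columns = ["id", "total", "count"]       # destination order differs

written = dict(zip(dest_columns, incoming_values))
print(written)  # {'id': 1, 'total': 10, 'count': 100} -- count and total swapped
```

      Because the values here are "vaguely compatible" with both destination columns, nothing fails: the data is simply written to the wrong columns.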

      A real-world use case that will hit this is when a table definition changes by adding a column in the middle of a table. Spark SQL statements that copy from that table to a destination table will then map the columns differently, but the inserted casts mask the problem. The last column's data will be dropped without a reliable warning for the user.

      This highlights a few problems:

      • Too many or too few incoming data columns should cause an AnalysisException to be thrown
      • Only "safe" casts should be inserted automatically, like int -> long, using UpCast
      • Pre-insertion casts currently ignore extra columns by using zip
      • The pre-insertion cast logic differs between Hive's MetastoreRelation and LogicalRelation
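
      The zip behavior in the third point can be demonstrated directly (plain Python, not Spark's implementation):

```python
# zip truncates to the shorter sequence, so an extra incoming column is
# paired with nothing and silently disappears instead of raising an error.
incoming = ["id", "count", "total", "extra"]   # one column too many
table_schema = ["id", "count", "total"]

pairs = list(zip(incoming, table_schema))
print(pairs)  # 'extra' never appears in the pairing
```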

      Also, there should be an option to match input data to output columns by name. The API allows operations on tables, which hides the column resolution problem: it is easy to copy from one table to another without listing the columns, and in the API it is common to work with columns by name rather than by position. Matching columns by name is closer to what users expect, so I propose adding something like this:

      CREATE TABLE src (id bigint, count int, total bigint)
      CREATE TABLE dst (id bigint, total bigint, count int)
      
      sqlContext.table("src").write.byName.insertInto("dst")
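
      As a sketch of what byName resolution would do (plain Python; the byName API itself is a proposal and does not exist):

```python
# By-name matching: look up each destination column in the incoming row
# by name. A missing column raises KeyError here instead of silently
# writing data to the wrong place.
incoming_row = {"id": 1, "count": 10, "total": 100}
dest_columns = ["id", "total", "count"]

reordered = [incoming_row[c] for c in dest_columns]
print(reordered)  # [1, 100, 10] -- values now follow the destination schema
```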
      


            People

              Assignee: Ryan Blue (rdblue)
              Reporter: Ryan Blue (rdblue)
              Votes: 2
              Watchers: 7
