Description
Updated description
There should be an option to match input data to output columns by name. The API allows operations on tables, which hide the column resolution problem. It's easy to copy from one table to another without listing the columns, and in the API it is common to work with columns by name rather than by position. I think the API should add a way to match columns by name, which is closer to what users expect. I propose adding something like this:
CREATE TABLE src (id: bigint, count: int, total: bigint) CREATE TABLE dst (id: bigint, total: bigint, count: int) sqlContext.table("src").write.byName.insertInto("dst")
Original description
The Hive write path adds a pre-insertion cast (projection) to reconcile incoming data columns with the outgoing table schema. Columns are matched by position and casts are inserted to reconcile the two column schemas.
When columns aren't correctly aligned, this causes unexpected results. I ran into this by not using a correct partitionBy call (addressed by SPARK-14459), which caused an error message that an int could not be cast to an array. However, if the columns are vaguely compatible, for example string and float, then no error or warning is produced and data is written to the wrong columns using unexpected casts (string -> bigint -> float).
A real-world use case that will hit this is when a table definition changes by adding a column in the middle of a table. Spark SQL statements that copied from that table to a destination table will then map the columns differently but insert casts that mask the problem. The last column's data will be dropped without a reliable warning for the user.
This highlights a few problems:
- Too many or too few incoming data columns should cause an AnalysisException to be thrown
- Only "safe" casts should be inserted automatically, like int -> long, using UpCast
- Pre-insertion casts currently ignore extra columns by using zip
- The pre-insertion cast logic differs between Hive's MetastoreRelation and LogicalRelation
Also, I think there should be an option to match input data to output columns by name. The API allows operations on tables, which hide the column resolution problem. It's easy to copy from one table to another without listing the columns, and in the API it is common to work with columns by name rather than by position. I think the API should add a way to match columns by name, which is closer to what users expect. I propose adding something like this:
CREATE TABLE src (id: bigint, count: int, total: bigint) CREATE TABLE dst (id: bigint, total: bigint, count: int) sqlContext.table("src").write.byName.insertInto("dst")
Attachments
Issue Links
- depends upon
-
SPARK-24251 DataSourceV2: Add AppendData logical operation
- Resolved
- is duplicated by
-
SPARK-19286 Spark/Hive insert Into has unexpected results
- Closed
- links to