Spark / SPARK-14543

SQL/Hive insertInto has unexpected results


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Updated description
      There should be an option to match input data to output columns by name. The API allows operations on tables, which hides the column resolution problem: it is easy to copy from one table to another without listing the columns, and in the API it is common to work with columns by name rather than by position. Matching columns by name is closer to what users expect, so I propose adding something like this:

      CREATE TABLE src (id bigint, count int, total bigint)
      CREATE TABLE dst (id bigint, total bigint, count int)
      
      sqlContext.table("src").write.byName.insertInto("dst")
      

      Original description
      The Hive write path adds a pre-insertion cast (projection) to reconcile incoming data columns with the outgoing table schema. Columns are matched by position, and casts are inserted wherever the two schemas' types differ.

      When columns aren't correctly aligned, this causes unexpected results. I ran into this by not using a correct partitionBy call (addressed by SPARK-14459), which caused an error message that an int could not be cast to an array. However, if the columns are vaguely compatible, for example string and float, then no error or warning is produced and data is written to the wrong columns using unexpected casts (string -> bigint -> float).
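      The by-position pairing can be illustrated with a plain-Python sketch (this is not Spark's actual code; the tables and column names are hypothetical):

```python
# By-position matching: incoming value i goes to destination column i,
# with no check that the column names agree.
incoming_columns = ["id", "count", "total"]   # source column order
incoming_values = [1, 10, 100]                # one row of source data

dest_columns = ["id", "total", "count"]       # destination order differs

written = dict(zip(dest_columns, incoming_values))
print(written)  # {'id': 1, 'total': 10, 'count': 100} -- count and total swapped
```

      Because the values here are "vaguely compatible" with both destination columns, nothing fails: the data is simply written to the wrong columns.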

      A real-world use case that will hit this is when a table definition changes by adding a column in the middle of a table. Spark SQL statements that copy from that table to a destination table will then map the columns differently, but the inserted casts mask the problem. The last column's data will be dropped without a reliable warning for the user.

      This highlights a few problems:

      • Too many or too few incoming data columns should cause an AnalysisException to be thrown
      • Only "safe" casts should be inserted automatically, like int -> long, using UpCast
      • Pre-insertion casts currently ignore extra columns by using zip
      • The pre-insertion cast logic differs between Hive's MetastoreRelation and LogicalRelation
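
      The zip behavior in the third point can be demonstrated directly (plain Python, not Spark's implementation):

```python
# zip truncates to the shorter sequence, so an extra incoming column is
# paired with nothing and silently disappears instead of raising an error.
incoming = ["id", "count", "total", "extra"]   # one column too many
table_schema = ["id", "count", "total"]

pairs = list(zip(incoming, table_schema))
print(pairs)  # 'extra' never appears in the pairing
```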

      Also, there should be an option to match input data to output columns by name. The API allows operations on tables, which hides the column resolution problem: it is easy to copy from one table to another without listing the columns, and in the API it is common to work with columns by name rather than by position. Matching columns by name is closer to what users expect, so I propose adding something like this:

      CREATE TABLE src (id bigint, count int, total bigint)
      CREATE TABLE dst (id bigint, total bigint, count int)
      
      sqlContext.table("src").write.byName.insertInto("dst")
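
      As a sketch of what byName resolution would do (plain Python; the byName API itself is a proposal and does not exist):

```python
# By-name matching: look up each destination column in the incoming row
# by name. A missing column raises KeyError here instead of silently
# writing data to the wrong place.
incoming_row = {"id": 1, "count": 10, "total": 100}
dest_columns = ["id", "total", "count"]

reordered = [incoming_row[c] for c in dest_columns]
print(reordered)  # [1, 100, 10] -- values now follow the destination schema
```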
      


            People

              Assignee: Ryan Blue (rdblue)
              Reporter: Ryan Blue (rdblue)
              Votes: 2
              Watchers: 7
