Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-26836

Columns get switched in Spark SQL using Avro backed Hive table if schema evolves

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0
    • 3.2.0
    • SQL
    • I tested with Hive and HCatalog which runs on version 2.3.4 and with Spark 2.3.1 and 2.4

    Description

      I have a hive avro table where the avro schema is stored on s3 next to the avro files. 

      In the table definiton the avro.schema.url always points to the latest partition's _schema.avsc file which is always the lates schema. (Avro schemas are backward and forward compatible in a table)

      When new data comes in, I always add a new partition where the avro.schema.url properties also set to the _schema.avsc which was used when it was added and of course I always update the table avro.schema.url property to the latest one.

      Querying this table works fine until the schema evolves in a way that a new optional property is added in the middle. 

      When this happens then after the spark sql query the columns in the old partition gets mixed up and it shows the wrong data for the columns.

      If I query the table with Hive then everything is perfectly fine and it gives me back the correct columns for the partitions which were created the old schema and for the new which was created the evolved schema.

       

      Here is how I could reproduce with the doctors.avro example data in sql test suite.

      1. I have created two partition folder:
        [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
        Found 2 items
        -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors
        /dt=2019-02-05/_schema.avsc
        -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors
        /dt=2019-02-05/doctors.avro
        Found 2 items
        -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors
        /dt=2019-02-06/_schema.avsc
        -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors
        /dt=2019-02-06/doctors_evolved.avro

        Here the first partition had data which was created with the schema before evolving and the second one had the evolved one. (the evolved schema is the same as in your testcase except I moved the extra_field column to the last from the second and I generated two lines of avro data with the evolved schema.

      2. I have created a hive table with the following command:

       

      CREATE EXTERNAL TABLE `default.doctors`
       PARTITIONED BY (
       `dt` string
       )
       ROW FORMAT SERDE
       'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
       WITH SERDEPROPERTIES (
       'avro.schema.url'='s3://somelocation/doctors/
      /dt=2019-02-06/_schema.avsc')
       STORED AS INPUTFORMAT
       'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
       OUTPUTFORMAT
       'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
       LOCATION
       's3://somelocation/doctors/'
       TBLPROPERTIES (
       'transient_lastDdlTime'='1538130975')

       

      Here as you can see the table schema url points to the latest schema

      3. I ran an msck repair table to pick up all the partitions.

      Fyi: If I run my select * query from here then everything is fine and no columns switch happening.

      4. Then I changed the first partition's avro.schema.url url to points to the schema which is under the partition folder (non-evolved one -> s3://somelocation/doctors/
      /dt=2019-02-05/_schema.avsc)

      Then if you ran a select * from default.spark_test then the columns will be mixed up (on the data below the first name column becomes the extra_field column. I guess because in the latest schema it is the second column):

       

      number,extra_field,first_name,last_name,dt 
      6,Colin,Baker,null,2019-02-05 
      3,Jon,Pertwee,null,2019-02-05 
      4,Tom,Baker,null,2019-02-05 
      5,Peter,Davison,null,2019-02-05 
      11,Matt,Smith,null,2019-02-05 
      1,William,Hartnell,null,2019-02-05 
      7,Sylvester,McCoy,null,2019-02-05 
      8,Paul,McGann,null,2019-02-05 
      2,Patrick,Troughton,null,2019-02-05 
      9,Christopher,Eccleston,null,2019-02-05 
      10,David,Tennant,null,2019-02-05 
      21,fishfinger,Jim,Baker,2019-02-06 
      24,fishfinger,Bean,Pertwee,2019-02-06
      
      

      If I try the same query from Hive and not from spark sql then everything is fine and it never switches the columns.

       

      Attachments

        1. doctors_evolved.avro
          0.6 kB
          Tamás Németh
        2. doctors_evolved.json
          0.2 kB
          Tamás Németh
        3. doctors.avro
          0.5 kB
          Tamás Németh
        4. original.avsc
          0.4 kB
          Tamás Németh

        Activity

          People

            attilapiros Attila Zsolt Piros
            treff7es Tamás Németh
            Votes:
            2 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: