  Spark / SPARK-27695

SELECT * returns null column when reading from Hive / ORC and spark.sql.hive.convertMetastoreOrc=true


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.1, 2.3.2, 2.3.3
    • Fix Version/s: None
    • Component/s: Input/Output, Spark Core
    • Labels:

      Description

      If you run

      select * from hive.some_table

      and the underlying data does not exactly match the table schema, the last column is returned as null.

      Example 

      from pyspark import SparkConf
      from pyspark.sql import SparkSession
      from pyspark.sql.types import *

      conf = SparkConf().set('spark.sql.hive.convertMetastoreOrc', 'true')

      spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()

      data = [{'a': i, 'b': i + 10, 'd': {'a': i, 'b': i + 10}} for i in range(1, 100)]
      data_schema = StructType([StructField('a', LongType(), True),
                                StructField('b', LongType(), True),
                                StructField('d', MapType(StringType(), LongType(), True), True)])
      rdd = spark.sparkContext.parallelize(data)

      df = rdd.toDF(data_schema)

      df.write.format("orc").save("./sample_data/")

      spark.sql("""create external table tmp(
      a bigint,
      b bigint,
      d map<string, bigint>)
      stored as orc
      location 'sample_data/'
      """)

      spark.sql("select * from tmp").show()
      

      This returns correctly:

      +---+---+-------------------+
      |  a|  b|                  d|
      +---+---+-------------------+
      | 85| 95| [a -> 85, b -> 95]|
      | 86| 96| [a -> 86, b -> 96]|
      | 87| 97| [a -> 87, b -> 97]|
      | 88| 98| [a -> 88, b -> 98]|
      | 89| 99| [a -> 89, b -> 99]|
      | 90|100|[a -> 90, b -> 100]|
      
      

      However, if you add a new column to the underlying data without altering the Hive schema,
      the last column of the Hive schema is returned as null:

      from pyspark import SparkConf
      from pyspark.sql import SparkSession
      from pyspark.sql.types import *

      conf = SparkConf().set('spark.sql.hive.convertMetastoreOrc', 'true')

      spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()

      data = [{'a': i, 'b': i + 10, 'c': i + 5, 'd': {'a': i, 'b': i + 10, 'c': i + 5}} for i in range(1, 100)]
      data_schema = StructType([StructField('a', LongType(), True),
                                StructField('b', LongType(), True),
                                StructField('c', LongType(), True),
                                StructField('d', MapType(StringType(), LongType(), True), True)])
      rdd = spark.sparkContext.parallelize(data)

      df = rdd.toDF(data_schema)

      df.write.format("orc").mode("overwrite").save("./sample_data/")

      spark.sql("select * from tmp").show()

      spark.read.orc("./sample_data/").show()
      

      The first show() returns

      +---+---+----+
      |  a|  b|   d|
      +---+---+----+
      | 85| 95|null|
      | 86| 96|null|
      | 87| 97|null|
      | 88| 98|null|
      | 89| 99|null|
      | 90|100|null|
      | 91|101|null|
      | 92|102|null|
      | 93|103|null|
      | 94|104|null|
      

      However, the data on disk is correct, as the direct ORC read (the second show()) confirms:

      +---+---+---+--------------------+
      |  a|  b|  c|                   d|
      +---+---+---+--------------------+
      | 85| 95| 90|[a -> 85, b -> 95...|
      | 86| 96| 91|[a -> 86, b -> 96...|
      | 87| 97| 92|[a -> 87, b -> 97...|
      | 88| 98| 93|[a -> 88, b -> 98...|
      | 89| 99| 94|[a -> 89, b -> 99...|
      | 90|100| 95|[a -> 90, b -> 10...|
      | 91|101| 96|[a -> 91, b -> 10...|
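
      One possible mechanism (my assumption; the ticket itself does not identify the cause) is that with
      spark.sql.hive.convertMetastoreOrc=true the native ORC reader ends up resolving file columns by
      position against the metastore schema, so once the file gains the extra column c, the table's third
      column d is read from the wrong physical slot, its type no longer matches, and the value becomes
      null. A plain-Python sketch of that hypothetical model:

```python
# Hypothetical model of positional vs. name-based column resolution.
# This is NOT Spark source code, only an illustration of the suspected bug.

file_schema = ["a", "b", "c", "d"]  # columns physically present in the ORC file
file_types = {"a": "bigint", "b": "bigint",
              "c": "bigint", "d": "map<string,bigint>"}
table_schema = [("a", "bigint"), ("b", "bigint"),
                ("d", "map<string,bigint>")]  # Hive metastore schema (no 'c')

def resolve_positional(table_schema, file_schema, file_types):
    """Match the i-th table column to the i-th file column; a type
    mismatch means the value cannot be decoded and becomes null."""
    out = {}
    for i, (name, wanted_type) in enumerate(table_schema):
        file_col = file_schema[i]
        out[name] = file_col if file_types[file_col] == wanted_type else None
    return out

def resolve_by_name(table_schema, file_schema, file_types):
    """Match table columns to file columns by name (what the user expects)."""
    return {name: name if name in file_schema and file_types[name] == wanted_type else None
            for name, wanted_type in table_schema}

print(resolve_positional(table_schema, file_schema, file_types))
# table column 'd' lands on file column 'c' (bigint vs. map), so it is nulled
print(resolve_by_name(table_schema, file_schema, file_types))
# all three columns resolve correctly, matching the spark.read.orc(...) result
```

      If positional resolution is indeed the cause then, consistent with the ticket title, setting
      spark.sql.hive.convertMetastoreOrc=false (falling back to the Hive SerDe reader) should avoid the
      problem; that workaround is untested here.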
      


            People

            • Assignee: Unassigned
            • Reporter: Oscar Cassetti (unnamed101)
            • Votes: 0
            • Watchers: 4
