- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 2.3.1, 2.3.2, 2.3.3
- Fix Version/s: None
- Component/s: Input/Output, Spark Core
- Labels: None
If you run
select * from hive.some_table
and the underlying data does not exactly match the Hive schema, the last column is returned as null.
Example
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import *

conf = SparkConf().set('spark.sql.hive.convertMetastoreOrc', 'true')
spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()

data = [{'a': i, 'b': i + 10, 'd': {'a': i, 'b': i + 10}} for i in range(1, 100)]
data_schema = StructType([StructField('a', LongType(), True),
                          StructField('b', LongType(), True),
                          StructField('d', MapType(StringType(), LongType(), True), True)])

rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(data_schema)
df.write.format("orc").save("./sample_data/")

spark.sql("""create external table tmp(
    a bigint,
    b bigint,
    d map<string, bigint>)
    stored as orc
    location 'sample_data/'
""")
spark.sql("select * from tmp").show()
This returns correctly:
+---+---+-------------------+
|  a|  b|                  d|
+---+---+-------------------+
| 85| 95| [a -> 85, b -> 95]|
| 86| 96| [a -> 86, b -> 96]|
| 87| 97| [a -> 87, b -> 97]|
| 88| 98| [a -> 88, b -> 98]|
| 89| 99| [a -> 89, b -> 99]|
| 90|100|[a -> 90, b -> 100]|
However, if a new column is added to the underlying data without altering the Hive schema, the last column of the Hive schema is set to null:
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import *

conf = SparkConf().set('spark.sql.hive.convertMetastoreOrc', 'true')
spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()

data = [{'a': i, 'b': i + 10, 'c': i + 5, 'd': {'a': i, 'b': i + 10, 'c': i + 5}} for i in range(1, 100)]
data_schema = StructType([StructField('a', LongType(), True),
                          StructField('b', LongType(), True),
                          StructField('c', LongType(), True),
                          StructField('d', MapType(StringType(), LongType(), True), True)])

rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(data_schema)
df.write.format("orc").mode("overwrite").save("./sample_data/")

spark.sql("select * from tmp").show()
spark.read.orc("./sample_data/").show()
The first show() returns:
+---+---+----+
|  a|  b|   d|
+---+---+----+
| 85| 95|null|
| 86| 96|null|
| 87| 97|null|
| 88| 98|null|
| 89| 99|null|
| 90|100|null|
| 91|101|null|
| 92|102|null|
| 93|103|null|
| 94|104|null|
But the data on disk is correct, as the second show() (reading the ORC files directly) confirms:
+---+---+---+--------------------+
|  a|  b|  c|                   d|
+---+---+---+--------------------+
| 85| 95| 90|[a -> 85, b -> 95...|
| 86| 96| 91|[a -> 86, b -> 96...|
| 87| 97| 92|[a -> 87, b -> 97...|
| 88| 98| 93|[a -> 88, b -> 98...|
| 89| 99| 94|[a -> 89, b -> 99...|
| 90|100| 95|[a -> 90, b -> 10...|
| 91|101| 96|[a -> 91, b -> 10...|
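For what it's worth, here is a minimal diagnostic sketch, assuming the tmp table and sample_data/ files created above and assuming spark.sql.hive.convertMetastoreOrc can be toggled at runtime (otherwise set it when building the session, as in the reproduction): it compares the converted/native ORC read path against the Hive SerDe path, and also reads the files directly with the Hive table's narrower schema so the reader has to resolve columns a, b, d against files that additionally contain c. No claim is made about which reads return correct results; these are comparison reads only.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, MapType, StringType

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the table through the Hive SerDe path instead of the native ORC reader
# (diagnostic comparison only; not presented as a fix).
spark.conf.set('spark.sql.hive.convertMetastoreOrc', 'false')
spark.sql("select * from tmp").show()

# Back to the converted/native ORC reader, as in the report.
spark.conf.set('spark.sql.hive.convertMetastoreOrc', 'true')
spark.sql("select * from tmp").show()

# Read the ORC files directly, but with the Hive table's (narrower) schema,
# so column resolution against the wider on-disk files can be observed in isolation.
hive_schema = StructType([StructField('a', LongType(), True),
                          StructField('b', LongType(), True),
                          StructField('d', MapType(StringType(), LongType(), True), True)])
spark.read.schema(hive_schema).orc("./sample_data/").show()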