Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Not A Problem
- Affects Version/s: 1.5.0
- Fix Version/s: None
- Component/s: None
Description
Using a defined schema to load a JSON RDD works as expected, but loading the same JSON records from a file does not apply the supplied schema. In particular, the nullable flag is ignored: loading from a file reports nullable = true on all fields regardless of the schema passed to the reader.
Code to reproduce:
import org.apache.spark.sql.types._

val jsonRdd = sc.parallelize(List(
  """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": "WQT648", "Qty": 5}""",
  """{"OrderID": 2, "CustomerID":16 , "OrderDate": "2015-07-11", "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}"""))

val mySchema = StructType(Array(
  StructField(name = "OrderID", dataType = LongType, nullable = false),
  StructField("CustomerID", IntegerType, false),
  StructField("OrderDate", DateType, false),
  StructField("ProductCode", StringType, false),
  StructField("Qty", IntegerType, false),
  StructField("Discount", FloatType, true),
  StructField("expressDelivery", BooleanType, true)))

val myDF = sqlContext.read.schema(mySchema).json(jsonRdd)
val schema1 = myDF.printSchema

val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json")
val schema2 = dfDFfromFile.printSchema
Orders.json
{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": "WQT648", "Qty": 5}
{"OrderID": 2, "CustomerID":16 , "OrderDate": "2015-07-11", "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}
The behavior should be consistent.
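To make the inconsistency explicit, the two DataFrames from the reproduction above can be compared field by field. This is a sketch building on the values (myDF, dfDFfromFile) defined in the reproduction, and it assumes the behavior described in this report: per the description, every field of the file-based DataFrame comes back with nullable = true, so each non-nullable field should be printed as a mismatch.

```scala
// Pair up corresponding fields of the two schemas and report any
// fields whose nullability disagrees between the RDD-based and
// file-based loads.
val mismatches = myDF.schema.fields.zip(dfDFfromFile.schema.fields)
  .filter { case (rddField, fileField) => rddField.nullable != fileField.nullable }

mismatches.foreach { case (rddField, fileField) =>
  println(s"${rddField.name}: rdd nullable=${rddField.nullable}, " +
    s"file nullable=${fileField.nullable}")
}
```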
Issue Links
- relates to SPARK-23173: from_json can produce nulls for fields which are marked as non-nullable (Resolved)
- relates to SPARK-25545: CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not conform to non-nullable schema fields (Resolved)