Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version: 2.4.0
- Fix Version: None
Description
When reading a JSON blob with duplicate fields, Spark appears to drop the value of the first one. The JSON spec recommends unique names but does not require them; since both JSON and Spark SQL allow duplicate field names, we should fix the bug where the first column's value is dropped.
My guess is that somewhere during JSON parsing we convert the object into a Map, which causes the first value for a repeated key to be overwritten by the last.
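As a quick illustration of the suspected mechanism (an assumption about the cause, not confirmed against the Spark source), converting key/value pairs with a repeated key into a Scala Map silently keeps only the last value:

```scala
object DuplicateKeyDemo {
  def main(args: Array[String]): Unit = {
    // Pairs with a duplicate key, roughly how a JSON object with
    // repeated field names might be tokenized into (name, value) pairs.
    val fields = Seq("a" -> "blah", "a" -> "blah2")

    // .toMap keeps only the last value for each key, so "blah" is lost.
    val asMap = fields.toMap
    println(asMap)   // Map(a -> blah2)

    // Keeping the pairs as a sequence preserves both values.
    println(fields)  // List((a,blah), (a,blah2))
  }
}
```

If the parser buffers field values in such a Map keyed by field name, the first occurrence would be lost exactly as observed below.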
Repro (Scala shell, Spark 2.4):
scala> val jsonRDD = spark.sparkContext.parallelize(Seq("[{ \"a\": \"blah\", \"a\": \"blah2\"} ]"))
jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:23

scala> val df = spark.read.json(jsonRDD)
df: org.apache.spark.sql.DataFrame = [a: string, a: string]

scala> df.show
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+
The expected output would be:

+----+-----+
|   a|    a|
+----+-----+
|blah|blah2|
+----+-----+