Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version: 2.4.0
- Fix Version: None
Description
When reading a JSON blob with duplicate fields, Spark appears to drop the value of the first one. The JSON spec recommends unique names but does not require them; since both JSON and Spark SQL allow duplicate field names, we should fix the bug where the first column's value is dropped.
My guess is that somewhere during JSON parsing we convert the object into a Map, which causes the first value for a repeated key to be overwritten by the last.
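As a quick illustration of the suspected mechanism (an assumption about the cause, not confirmed against the Spark source), converting key/value pairs with a repeated key into a Scala Map silently keeps only the last value:

```scala
object DuplicateKeyDemo {
  def main(args: Array[String]): Unit = {
    // Pairs with a duplicate key, roughly how a JSON object with
    // repeated field names might be tokenized into (name, value) pairs.
    val fields = Seq("a" -> "blah", "a" -> "blah2")

    // .toMap keeps only the last value for each key, so "blah" is lost.
    val asMap = fields.toMap
    println(asMap)   // Map(a -> blah2)

    // Keeping the pairs as a sequence preserves both values.
    println(fields)  // List((a,blah), (a,blah2))
  }
}
```

If the parser buffers field values in such a Map keyed by field name, the first occurrence would be lost exactly as observed below.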
Repro (Scala shell, Spark 2.4):
scala> val jsonRDD = spark.sparkContext.parallelize(Seq("[{ \"a\": \"blah\", \"a\": \"blah2\"} ]"))
jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:23

scala> val df = spark.read.json(jsonRDD)
df: org.apache.spark.sql.DataFrame = [a: string, a: string]

scala> df.show
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+
The expected output would be:

+----+-----+
|   a|    a|
+----+-----+
|blah|blah2|
+----+-----+