Spark / SPARK-28043

Reading json with duplicate columns drops the first column value


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      When reading a JSON blob with duplicate fields, Spark appears to ignore the value of the first one. The JSON spec recommends unique names but does not require them; since both JSON and Spark SQL allow duplicate field names, we should fix the bug where the first column's value gets dropped.
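
      For instance (a quick shell check, not part of the original report), Spark SQL itself happily produces a DataFrame with two columns named a:

      scala> spark.sql("SELECT 1 AS a, 2 AS a").show
      +---+---+
      |  a|  a|
      +---+---+
      |  1|  2|
      +---+---+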


      I'm guessing that somewhere while parsing the JSON we turn each record into a Map keyed by field name, which causes the first value to be overwritten.
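
      A minimal sketch of that hypothesis (illustrative Scala, not Spark's actual parser code): a Map keeps only the last binding for a duplicated key, which matches the observed behavior:

      scala> Seq("a" -> "blah", "a" -> "blah2").toMap
      res0: scala.collection.immutable.Map[String,String] = Map(a -> blah2)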


      Repro (Scala shell, Spark 2.4):

      scala> val jsonRDD = spark.sparkContext.parallelize(Seq("[{ \"a\": \"blah\", \"a\": \"blah2\"} ]"))
      jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:23
      scala> val df = spark.read.json(jsonRDD)
      df: org.apache.spark.sql.DataFrame = [a: string, a: string]                     
      scala> df.show
      +----+-----+
      |   a|    a|
      +----+-----+
      |null|blah2|
      +----+-----+
      


      The expected output would be:

      +----+-----+
      |   a|    a|
      +----+-----+
      |blah|blah2|
      +----+-----+
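
      As a cross-check (not part of the original report), streaming the record through Jackson, which Spark's JSON data source is built on, shows both duplicate fields reported in document order, suggesting the value is lost while materializing the row, not while tokenizing. A minimal sketch, assuming Jackson is on the classpath (it ships with Spark):

      import com.fasterxml.jackson.core.{JsonFactory, JsonToken}

      // Walk the token stream of the problematic record and print each field.
      val parser = new JsonFactory().createParser("""{ "a": "blah", "a": "blah2" }""")
      var token = parser.nextToken()
      while (token != null) {
        if (token == JsonToken.FIELD_NAME) {
          val name = parser.getCurrentName
          parser.nextToken()                      // advance to the field's value
          println(s"$name -> ${parser.getText}")  // prints: a -> blah, a -> blah2
        }
        token = parser.nextToken()
      }
      parser.close()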
      

    People

      Assignee: Unassigned
      Reporter: Mukul Murthy (mukulmurthy)