Spark / SPARK-32176

Automatic type promotion to ArrayType in defined schema in from_json is broken

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

       

      In Spark 2.4, I can read data of mixed types (a JSON array of objects in some rows, a single JSON object in others) by defining the column "stats" as StringType and parsing the inner data afterwards:
       
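      from pyspark.sql import functions as f
      from pyspark.sql.types import ArrayType, IntegerType, StructType

      # df is assumed to have a string column "stats" holding raw JSON
      stats_def = StructType().add("hour", IntegerType(), True).add("hits", IntegerType(), True)
      df2 = df.select(f.col("stats"), f.from_json(f.col("stats"), ArrayType(stats_def)).alias("stats_array"))
      df2.show(5, False)
      df2.printSchema  # called without parentheses, hence the bound-method output below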
       

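      +-----------------------------------------+----------------+
      |stats                                    |stats_array     |
      +-----------------------------------------+----------------+
      |[{"hour":3,"hits":1},{"hour":4,"hits":1}]|[[3, 1], [4, 1]]|
      |{"hits":20}                              |[[, 20]]        |
      +-----------------------------------------+----------------+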

      <bound method DataFrame.printSchema of DataFrame[stats: string, stats_array: array<struct<hour:int,hits:int>>]>
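To make the reported 2.4 behaviour concrete, here is a minimal pure-Python sketch of the promotion visible in the output above: a lone JSON object is treated as a one-element array, and fields missing from the schema become null. The function name `parse_stats` and the handling of malformed input are assumptions made for illustration. This is a model of the observed output, not Spark's implementation.

```python
import json
from typing import List, Optional, Tuple

# (hour, hits); None models a null field in the Spark output
Row = Tuple[Optional[int], Optional[int]]


def parse_stats(raw: str) -> Optional[List[Row]]:
    """Hypothetical model of the Spark 2.4 output reported in this issue."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return None  # assumption: malformed input maps to null
    if isinstance(value, dict):
        value = [value]  # promote a lone object to a one-element array
    if not isinstance(value, list) or not all(isinstance(i, dict) for i in value):
        return None  # assumption: anything else maps to null
    # Project each object onto the declared schema; absent keys become None
    return [(item.get("hour"), item.get("hits")) for item in value]


print(parse_stats('[{"hour":3,"hits":1},{"hour":4,"hits":1}]'))  # [(3, 1), (4, 1)]
print(parse_stats('{"hits":20}'))  # [(None, 20)]
```

The two printed results correspond to the two rows of the table: `[[3, 1], [4, 1]]` and `[[, 20]]`.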
       
      In Spark 3.0.0, the same code throws:
      java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
       
      I think this was an important feature and should remain supported, perhaps through a from_json option.
       

          People

            Assignee: Unassigned
            Reporter: Abhishek Adhikari (abhi92544)