SPARK-23448: Dataframe returns wrong result when column doesn't respect datatype


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2
    • Fix Version/s: 2.3.1
    • Component/s: SQL
    • Labels: None
    • Environment: Local

      Description

      I have the following JSON file that contains some noisy data (a String instead of an Array):

      {"attr1":"val1","attr2":"[\"val2\"]"}
      {"attr1":"val1","attr2":["val2"]}
      

      And I need to specify the schema programmatically like this:

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.types._
      
      implicit val spark = SparkSession
        .builder()
        .master("local[*]")
        .config("spark.ui.enabled", false)
        .config("spark.sql.caseSensitive", "True")
        .getOrCreate()
      import spark.implicits._
      
      val schema = StructType(
        Seq(StructField("attr1", StringType, true),
            StructField("attr2", ArrayType(StringType, true), true)))
      
      // "input" is the path to the JSON file shown above
      spark.read.schema(schema).json(input).collect().foreach(println)
      

      The result given by this code is:

      [null,null]
      [val1,WrappedArray(val2)]
      

      Instead of putting null only in the corrupted column, all columns of the first record are set to null.
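
      A minimal sketch of a workaround, assuming Spark's PERMISSIVE parse mode and a placeholder file path "noisy.json" (neither is from this report): adding a corrupt-record column to the schema keeps the raw malformed line visible instead of silently nulling the whole row.

      import org.apache.spark.sql.types._
      
      // Reuses the "spark" session built above; "_corrupt_record" and "noisy.json" are placeholders.
      val schemaWithCorrupt = StructType(
        Seq(StructField("attr1", StringType, true),
            StructField("attr2", ArrayType(StringType, true), true),
            StructField("_corrupt_record", StringType, true)))  // captures the raw malformed JSON line
      
      spark.read
        .option("mode", "PERMISSIVE")                            // keep malformed rows instead of failing
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .schema(schemaWithCorrupt)
        .json("noisy.json")
        .show(false)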


            People

            • Assignee: L. C. Hsieh (viirya)
            • Reporter: Ahmed ZAROUI (azaroui)
            • Votes: 0
            • Watchers: 4
