Description
I have the following JSON file, which contains some noisy data (a String where an Array is expected):
{"attr1":"val1","attr2":"[\"val2\"]"}
{"attr1":"val1","attr2":["val2"]}
I need to specify the schema programmatically, like this:
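import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

implicit val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.ui.enabled", false)
  .config("spark.sql.caseSensitive", "True")
  .getOrCreate()
import spark.implicits._

// attr2 is declared as an array of strings, but the first record stores it as a string
val schema = StructType(
  Seq(StructField("attr1", StringType, true),
      StructField("attr2", ArrayType(StringType, true), true)))

// `input` is the path to the JSON file shown above
spark.read.schema(schema).json(input).collect().foreach(println)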
The result given by this code is:
[null,null]
[val1,WrappedArray(val2)]
Instead of putting null only in the corrupted column, Spark sets all columns of the first record to null.
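One possible workaround, offered as a sketch and not as a confirmed fix: declare attr2 as StringType so that a type mismatch cannot invalidate the row, then parse that column into an array in a second step with from_json. This assumes a Spark version in which from_json accepts an ArrayType of strings and in which the JSON reader returns the raw JSON text of an array when the field is declared as a string; both should be checked against the version in use. The object name NoisyJsonSketch, the value looseSchema, and the inline input dataset are illustrative and not part of the original code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object NoisyJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .config("spark.ui.enabled", false)
      .getOrCreate()
    import spark.implicits._

    // The same two records as in the report, supplied inline instead of from a file.
    val input = Seq(
      """{"attr1":"val1","attr2":"[\"val2\"]"}""",
      """{"attr1":"val1","attr2":["val2"]}"""
    ).toDS()

    // Assumption: reading attr2 as a string keeps both the quoted form and
    // the real array as JSON text, so neither record is rejected.
    val looseSchema = StructType(Seq(
      StructField("attr1", StringType, true),
      StructField("attr2", StringType, true)))

    // Second pass: parse the JSON text in attr2 into an array of strings.
    val parsed = spark.read
      .schema(looseSchema)
      .json(input)
      .withColumn("attr2", from_json(col("attr2"), ArrayType(StringType, true)))

    parsed.collect().foreach(println)
    spark.stop()
  }
}
```

The trade-off is an extra parsing pass over attr2; in exchange, a malformed attr2 value should only affect that one column and leave attr1 intact.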