[SPARK-23448] Dataframe returns wrong result when column don't respect datatype - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.2
Fix Version/s: 2.3.1
Component/s: SQL
Labels: None
Environment: Local

Description

I have the following JSON file, which contains some noisy data (in the first record, attr2 is a String instead of an Array):

{"attr1":"val1","attr2":"[\"val2\"]"}
{"attr1":"val1","attr2":["val2"]}

I need to specify the schema programmatically, like this:

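import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Local session with the UI disabled and case-sensitive column resolution.
implicit val spark: SparkSession = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.ui.enabled", false)
  .config("spark.sql.caseSensitive", "true")
  .getOrCreate()
import spark.implicits._

// attr2 is declared as an array of strings; both fields are nullable.
val schema = StructType(
  Seq(StructField("attr1", StringType, true),
      StructField("attr2", ArrayType(StringType, true), true)))

// `input` is the path to the JSON file shown above (not defined in this report).
spark.read.schema(schema).json(input).collect().foreach(println)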

The result given by this code is:

[null,null]
[val1,WrappedArray(val2)]

Instead of putting null only in the corrupted column (attr2), Spark sets every column of the first record to null, so the valid attr1 value is lost.
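
To see which input line the reader treated as malformed, one option is to add a corrupt-record column to the schema. The sketch below is not part of the original report. It assumes the default PERMISSIVE mode and the documented columnNameOfCorruptRecord option of the JSON reader, and it reads the two records from an in-memory Dataset[String] (available since Spark 2.2) instead of the report's `input` file so that it is self-contained. The names lines, schemaWithCorrupt and df are illustrative. Whether attr1 is preserved in the malformed row depends on the Spark version, so no expected output is shown.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.ui.enabled", false)
  .getOrCreate()
import spark.implicits._

// The two records from the report, held in memory (assumption: the
// original code reads them from a file referred to as `input`).
val lines = Seq(
  """{"attr1":"val1","attr2":"[\"val2\"]"}""",
  """{"attr1":"val1","attr2":["val2"]}"""
).toDS()

// Same schema as in the report, plus a string column that receives the
// raw text of any record the parser considers malformed.
val schemaWithCorrupt = StructType(Seq(
  StructField("attr1", StringType, nullable = true),
  StructField("attr2", ArrayType(StringType, containsNull = true), nullable = true),
  StructField("_corrupt_record", StringType, nullable = true)))

val df = spark.read
  .schema(schemaWithCorrupt)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json(lines)

// Print all columns without truncating the raw record text.
df.show(truncate = false)
```

Rows that parse cleanly should have null in _corrupt_record, while the row with the mistyped attr2 should carry its original JSON text there, which makes the silently nulled record easy to spot.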

Issue Links

links to

[Github] Pull Request #20648 (viirya)

[Github] Pull Request #20666 (viirya)

People

Assignee: L. C. Hsieh

Reporter: Ahmed ZAROUI

Votes: 0

Watchers: 3

Dates

Created: 16/Feb/18 13:39

Updated: 12/Dec/22 18:11

Resolved: 28/Feb/18 02:01