[SPARK-31772] Json schema reading is not consistent between int and string types - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.4.4
Fix Version/s: None
Component/s: PySpark
Labels: None

Description

When reading a JSON file with an explicit schema, an int value is converted to a string if the field is declared as string, but a string value is not converted to an int if the field is declared as int.

Sample Code:

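from pyspark.sql.types import IntegerType, StringType, StructField, StructType
# Explicit schema: "a" is declared as int, "b" as string.
read_schema = StructType([StructField("a", IntegerType()),
                          StructField("b", StringType())])
# self.spark_session is assumed to be an existing SparkSession.
df = self.spark_session.read.schema(read_schema).json("input/json/temp_test")
df.show()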

JSON file temp_test:

{"a": 1,"b": "b1"}
{"a": 2,"b": "b2"}
{"a": 3,"b": 3}
{"a": "4","b": 4}

actual:

+----+----+
|   a|   b|
+----+----+
|   1|  b1|
|   2|  b2|
|   3|   3|
|null|null|
+----+----+

expected:

The third row should be nulled, like the fourth row, because b is an int in the data while the schema declares it as a string.
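
The asymmetry between the actual and expected results can be mimicked in plain Python. This is only a sketch of the behavior observed in this report, not Spark's implementation: `SCHEMA`, `LINES`, `coerce`, `parse_row`, and the `strict_strings` flag are hypothetical names introduced for illustration, and the sketch assumes that one bad field nulls the whole row, as the fourth row shows.

```python
import json

# Hypothetical schema notation for this sketch: field name -> "int" or "string".
SCHEMA = {"a": "int", "b": "string"}

# The four records from temp_test, one JSON document per line.
LINES = [
    '{"a": 1,"b": "b1"}',
    '{"a": 2,"b": "b2"}',
    '{"a": 3,"b": 3}',
    '{"a": "4","b": 4}',
]


def coerce(value, target, strict_strings=False):
    """Return the value coerced to the target type, or raise ValueError."""
    if value is None:
        return None
    if target == "string":
        if isinstance(value, str):
            return value
        if strict_strings:
            raise ValueError("non-string value in string field")
        # Lenient path: keep the JSON token's text, as the third row shows.
        return json.dumps(value)
    if target == "int":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool):
            return value
        raise ValueError("non-integer value in int field")
    raise ValueError(f"unsupported type: {target}")


def parse_row(line, schema, strict_strings=False):
    """Parse one JSON line; a single bad field nulls the whole row."""
    record = json.loads(line)
    try:
        return {name: coerce(record.get(name), kind, strict_strings)
                for name, kind in schema.items()}
    except ValueError:
        return {name: None for name in schema}


actual = [parse_row(line, SCHEMA) for line in LINES]
expected = [parse_row(line, SCHEMA, strict_strings=True) for line in LINES]
```

With `strict_strings=False` the third record keeps `b` as the text `"3"`, which corresponds to the "actual" output; with `strict_strings=True` the third record is nulled as well, which corresponds to what the reporter expected.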

People

Assignee: Unassigned

Reporter: yaniv oren

Votes: 1

Watchers: 5

Dates

Created: 20/May/20 16:27

Updated: 12/Dec/22 18:10

Resolved: 25/May/20 07:19