Details
Description
When the schema is inconsistent (but is the same for the first 10 rows), it is possible to create a DataFrame from dictionaries, and if a key is missing its value is None. But when trying to create a DataFrame from the corresponding Rows, we get inconsistent behavior (wrong values for keys) without an exception. See the example below.
The problems seem to be:
1. Not verifying all rows in schema.
2. In pyspark.sql.types._create_converter, None is set when converting a dictionary and a field does not exist:
return tuple([conv(d.get(name)) for name, conv in zip(names, converters)])
But for Rows, it is simply assumed that the number of fields in the tuple equals the number of fields in the inferred schema, so wrong values end up under wrong keys otherwise:
return tuple(conv(v) for v, conv in zip(obj, converters))
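The mismatch between the two code paths can be sketched in plain Python (no Spark needed). The field names and identity converters below are assumptions for illustration, standing in for the inferred schema and the converters built by _create_converter:

```python
# Minimal sketch of how the two converter paths diverge when a row has
# fewer fields than the inferred schema. 'names' and 'converters' here
# are hypothetical stand-ins for the inferred schema fields.

names = ['1', '2', '3']             # inferred schema field names
converters = [lambda v: v] * 3      # identity converters, for simplicity

# Dict path: a missing key ('2') cleanly becomes None
d = {'1': 1, '3': 3}
dict_result = tuple(conv(d.get(name)) for name, conv in zip(names, converters))
# dict_result == (1, None, 3)

# Row/tuple path: values are paired positionally, so the value for key '3'
# silently lands under field '2', and zip() truncates without any error
obj = (1, 3)                        # Row(**{'1': 1, '3': 3}) yields values (1, 3)
row_result = tuple(conv(v) for v, conv in zip(obj, converters))
# row_result == (1, 3) -- field '2' receives 3, field '3' gets nothing
```

This is why the Row-based DataFrame below reports Row(2=3) while the dict-based one correctly reports Row(2=None).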
Thanks.
example:
dicts = [{'1': 1, '2': 2, '3': 3}] * 10 + [{'1': 1, '3': 3}]
rows = [pyspark.sql.Row(**r) for r in dicts]
rows_rdd = sc.parallelize(rows)
dicts_rdd = sc.parallelize(dicts)
rows_df = sqlContext.createDataFrame(rows_rdd)
dicts_df = sqlContext.createDataFrame(dicts_rdd)
print(rows_df.select(['2']).collect()[10])
print(dicts_df.select(['2']).collect()[10])
output:
Row(2=3)
Row(2=None)
Issue Links
- relates to SPARK-11319 PySpark silently accepts null values in non-nullable DataFrame fields. (Resolved)