[SPARK-30941] PySpark Row can be instantiated with duplicate field names - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
Fix Version/s: 2.4.6, 3.0.0
Component/s: PySpark
Labels:
- correctness

Description

It is possible to create a Row that has fields with the same name when calling `collect()` after a join. Given that the Row constructor itself doesn't allow this, this seems to be undesired behavior.

This can cause correctness issues because different ways of reading a value produce different results: `__getitem__` returns the leftmost value, while `asDict()` returns the rightmost value. The former looks up the field name with an index search, which stops at the first match; the latter builds a dictionary, in which a later duplicate key overwrites an earlier one.

>>> manual_output_row = Row(a=1, b=1, b=2)
  File "<stdin>", line 1
SyntaxError: keyword argument repeated

>>> input_rows = Row(a=1, b=1), Row(a=1, b=2)
>>> df1, df2 = (spark.createDataFrame([r]) for r in input_rows)
>>> df3 = df1.join(df2, "a")
>>> output_row = df3.collect()[0]
>>> output_row
Row(a=1, b=1, b=2)
>>> output_row["b"]
1
>>> output_row.asDict()["b"]
2
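
The discrepancy can be reproduced without Spark. The following is a minimal sketch in plain Python, assuming only what the description states (an index search on one path, a dictionary built from name/value pairs on the other). The names `fields`, `values`, `lookup_by_index`, and `lookup_by_dict` are hypothetical and are not part of PySpark.

```python
# Field names and values as they appear in the joined row from the report.
fields = ["a", "b", "b"]
values = [1, 1, 2]


def lookup_by_index(name):
    # list.index returns the position of the first match,
    # so the leftmost "b" wins.
    return values[fields.index(name)]


def lookup_by_dict(name):
    # A dict built from pairs keeps the last value seen for a repeated key,
    # so the rightmost "b" wins.
    return dict(zip(fields, values))[name]


print(lookup_by_index("b"))  # 1
print(lookup_by_dict("b"))   # 2
```

Both lookups agree for `a`, which is unique; they diverge only for the duplicated name.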

Spark 1.6.3:

>>> from pyspark.sql.types import Row
>>> input_rows = Row(a=1, b=1), Row(a=1, b=2)
>>> df1, df2 = (sqlContext.createDataFrame([r]) for r in input_rows)
>>> df3 = df1.join(df2, "a")
>>> output_row = df3.collect()[0]
>>> output_row
Row(a=1, b=1, b=2)
>>> output_row["b"]
1
>>> output_row.asDict()["b"]
2
>>> sc.version
u'1.6.3'
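
On affected versions, one way to guard against this is to check for repeated field names before relying on name-based access. The helper below is a hedged sketch, not part of PySpark: `duplicate_fields` is a hypothetical name, and it takes a plain list of names (for example, the list returned by a DataFrame's `columns` attribute).

```python
from collections import Counter


def duplicate_fields(names):
    """Return the field names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]


# Field names of the joined row from the report.
print(duplicate_fields(["a", "b", "b"]))  # ['b']
```

If the result is non-empty, renaming or aliasing the clashing columns before the join avoids the ambiguity.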

Issue Links

links to GitHub Pull Request #27853

People

Assignee: Hyukjin Kwon
Reporter: David Roher
Votes: 0
Watchers: 4

Dates

Created: 24/Feb/20 16:38
Updated: 12/Dec/22 18:10
Resolved: 09/Mar/20 18:07