Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
Description
It is possible to create a Row that has fields with the same name when calling `collect()` after a join. Given that the Row constructor itself doesn't allow this, this seems to be undesired behavior.
This can cause correctness issues because different ways of getting values produce different results: `__getitem__` returns the leftmost value, while `asDict()` returns the rightmost value (the former uses an index search, which stops at the first match; the latter builds a dictionary, in which a later duplicate key overwrites an earlier one).
>>> manual_output_row = Row(a=1, b=1, b=2)
  File "<stdin>", line 1
SyntaxError: keyword argument repeated
>>> input_rows = Row(a=1, b=1), Row(a=1, b=2)
>>> df1, df2 = (spark.createDataFrame([r]) for r in input_rows)
>>> df3 = df1.join(df2, "a")
>>> output_row = df3.collect()[0]
>>> output_row
Row(a=1, b=1, b=2)
>>> output_row["b"]
1
>>> output_row.asDict()["b"]
2
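The discrepancy can be illustrated without a Spark session. The following is a minimal sketch, assuming the name lookup behaves like `list.index` and `asDict()` behaves like `dict(zip(...))`; the names `fields`, `values`, `get_by_index_search` and `as_dict` are hypothetical stand-ins, not PySpark's actual code.

```python
# Hypothetical stand-ins mirroring the joined Row(a=1, b=1, b=2).
fields = ["a", "b", "b"]
values = (1, 1, 2)


def get_by_index_search(name):
    """Index search: list.index returns the position of the first (leftmost) match."""
    return values[fields.index(name)]


def as_dict():
    """Dictionary construction: a later duplicate key overwrites an earlier one."""
    return dict(zip(fields, values))


leftmost = get_by_index_search("b")
rightmost = as_dict()["b"]
print(leftmost, rightmost)  # 1 2
```

Under these assumptions the two access paths disagree for any duplicated field name, which matches the session above.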
*SPARK 1.6.3*
>>> from pyspark.sql.types import Row
>>> input_rows = Row(a=1, b=1), Row(a=1, b=2)
>>> df1, df2 = (sqlContext.createDataFrame([r]) for r in input_rows)
>>> df3 = df1.join(df2, "a")
>>> output_row = df3.collect()[0]
>>> output_row
Row(a=1, b=1, b=2)
>>> output_row["b"]
1
>>> output_row.asDict()["b"]
2
>>> sc.version
u'1.6.3'