Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 4.0.0
Description
There is a bug that results in an internal error for certain combinations of the "select" and "partitionBy" fields in the AnalyzeResult returned by a Python UDTF's "analyze" method.
To reproduce:
from pyspark.sql.functions import (
    AnalyzeArgument,
    AnalyzeResult,
    PartitioningColumn,
    SelectedColumn,
    udtf,
)
from pyspark.sql.types import (
    DoubleType,
    StringType,
    StructType,
)


@udtf
class TestTvf:
    @staticmethod
    def analyze(observed: AnalyzeArgument) -> AnalyzeResult:
        out_schema = StructType()
        out_schema.add("partition_col", StringType())
        out_schema.add("double_col", DoubleType())

        return AnalyzeResult(
            schema=out_schema,
            partitionBy=[PartitioningColumn("partition_col")],
            select=[
                SelectedColumn("partition_col"),
                SelectedColumn("double_col"),
            ],
        )

    def eval(self, *args, **kwargs):
        pass

    def terminate(self):
        for _ in range(10):
            yield {
                "partition_col": None,
                "double_col": 1.0,
            }


spark.udtf.register("serialize_test", TestTvf)

# Fails
(
    spark
    .sql(
        """
        SELECT *
        FROM serialize_test(
            TABLE(
                SELECT 5 AS unused_col, 'hi' AS partition_col, 1.0 AS double_col
                UNION ALL
                SELECT 4 AS unused_col, 'hi' AS partition_col, 1.0 AS double_col
            )
        )
        """
    )
    .toPandas()
)
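For context, the "select" and "partitionBy" fields are meant to prune the input table to the named columns and then group its rows by the partitioning column before each partition reaches the UDTF. A minimal plain-Python sketch of that intended behavior (illustrative only, not Spark internals; the row data mirrors the repro query above):

```python
from itertools import groupby
from operator import itemgetter

# Input rows as produced by the TABLE(...) subquery in the repro.
rows = [
    {"unused_col": 5, "partition_col": "hi", "double_col": 1.0},
    {"unused_col": 4, "partition_col": "hi", "double_col": 1.0},
]

# Column pruning, as requested via AnalyzeResult.select.
selected = ["partition_col", "double_col"]
pruned = [{k: r[k] for k in selected} for r in rows]

# Grouping, as requested via AnalyzeResult.partitionBy.
pruned.sort(key=itemgetter("partition_col"))
partitions = {
    key: list(group)
    for key, group in groupby(pruned, key=itemgetter("partition_col"))
}

# Expected outcome: a single "hi" partition whose rows no longer
# carry "unused_col" -- this is the combination that instead
# triggers the internal error in Spark.
```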