Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Version/s: 1.6.0, 2.0.0, 2.1.0, 2.2.0
Description
Right now there are a few ways we can create a UDF:
- With a standalone function:

  ```python
  def _add_one(x):
      """Adds one"""
      if x is not None:
          return x + 1

  add_one = udf(_add_one, IntegerType())
  ```

  This allows for full control flow, including exception handling, but duplicates names (`_add_one` vs. `add_one`).
- With a `lambda` expression:

  ```python
  add_one = udf(lambda x: x + 1 if x is not None else None, IntegerType())
  ```

  No name duplication, but only pure expressions are possible.
- Using a nested function with an immediate call:

  ```python
  def add_one(c):
      def add_one_(x):
          if x is not None:
              return x + 1
      return udf(add_one_, IntegerType())(c)
  ```

  Quite verbose, but enables full control flow and clearly indicates the expected number of arguments.
- Using `udf` as a decorator:

  ```python
  @udf
  def add_one(x):
      """Adds one"""
      if x is not None:
          return x + 1
  ```

  Possible, but only with the default `returnType` (or curried, e.g. `@partial(udf, returnType=IntegerType())`).
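To make the curried variant concrete, here is a minimal, self-contained sketch of how `functools.partial` turns the two-argument `udf` into a decorator. Note that `udf` and `IntegerType` below are toy stand-ins for `pyspark.sql.functions.udf` and `pyspark.sql.types.IntegerType`, reduced to just the behavior needed to show the mechanics:

```python
from functools import partial

class IntegerType:
    """Stand-in for pyspark.sql.types.IntegerType (assumption for this sketch)."""

def udf(f, returnType=None):
    """Toy stand-in for pyspark.sql.functions.udf(f, returnType).

    The real function returns a Spark UserDefinedFunction; here we just
    tag the Python function with its declared return type.
    """
    f.returnType = returnType
    return f

# partial(udf, returnType=...) is a one-argument callable, so it can be
# applied as a decorator: it receives the function as its first argument.
@partial(udf, returnType=IntegerType())
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1
```

The decorated `add_one` keeps its name and docstring, since the stand-in returns the original function object.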
Proposed
Add a `udf` decorator that can be used as follows:

```python
from pyspark.sql.decorators import udf

@udf(IntegerType())
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1
```
or

```python
@udf()
def strip(x):
    """Strips String"""
    if x is not None:
        return x.strip()
```
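One way such a parameterized decorator could be implemented is as a decorator factory: `udf(returnType)` returns the actual decorator. This is a minimal sketch only; `StringType` is a stand-in for `pyspark.sql.types.StringType`, and the real implementation would return a Spark `UserDefinedFunction` rather than a plain wrapper:

```python
from functools import wraps

class StringType:
    """Stand-in for pyspark.sql.types.StringType (assumption for this sketch)."""

def udf(returnType=StringType()):
    """Decorator factory: supports both @udf(IntegerType()) and bare @udf()."""
    def decorator(f):
        @wraps(f)  # preserve the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            return f(*args, **kwargs)
        wrapper.returnType = returnType  # metadata the real udf would register
        return wrapper
    return decorator

@udf()
def strip(x):
    """Strips String"""
    if x is not None:
        return x.strip()
```

Because of `functools.wraps`, the decorated `strip` still exposes its original `__name__` and `__doc__`, which matters for introspection and error messages.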