Description
Hive GenericUDFs are superior to normal UDFs in the following ways:
- They can accept arguments of complex types and return complex types.
- They can accept a variable number of arguments.
- They can accept an unbounded number of function signatures. For example, it's easy to write a GenericUDF that accepts array<int>, array<array<int>>, and so on (arbitrary levels of nesting).
- They can do short-circuit evaluation using DeferredObject. Arguments can be of any type and can be evaluated lazily.
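To make the short-circuit point concrete, here is a minimal C++ sketch of lazy argument evaluation. The `Deferred` class and `FirstNonNull` function are hypothetical names invented for illustration; they only mimic the idea behind Hive's DeferredObject and are not part of Hive or Impala.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a deferred argument: it wraps the argument
// expression and evaluates it only when Get() is called.
class Deferred {
 public:
  explicit Deferred(std::function<std::optional<int>()> thunk)
      : thunk_(std::move(thunk)) {}

  // Evaluates the wrapped expression; std::nullopt models SQL NULL.
  std::optional<int> Get() const {
    assert(thunk_ && "Deferred requires a callable");
    return thunk_();
  }

 private:
  std::function<std::optional<int>()> thunk_;
};

// COALESCE-like function: returns the first non-NULL argument.
// Arguments after the first non-NULL one are never evaluated.
std::optional<int> FirstNonNull(const std::vector<Deferred>& args) {
  for (const Deferred& arg : args) {
    std::optional<int> value = arg.Get();
    if (value.has_value()) return value;
  }
  return std::nullopt;
}
```

With three deferred arguments where the second is the first non-NULL one, only the first two are evaluated. That is the behavior a UDF with eagerly evaluated arguments cannot provide.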
The masking functions added for Ranger column masking are important examples of GenericUDFs. For instance, there are hundreds of ways to call mask_show_first_n:
mask_show_first_n(val)
mask_show_first_n(val, 8)
mask_show_first_n(val, 8, 'X', 'x', 'n')
mask_show_first_n(val, 8, 'x', 'x', 'x', 'x', -1)
mask_show_first_n(val, 8, 'x', -1, 'x', 'x', '9')
...
Without GenericUDF-style support, we would have to implement hundreds of overloads to cover all possible combinations.
Currently we don't support complex types in UDF arguments or return types, so we should at least provide a framework to support UDFs that:
- Can accept a variable number of arguments.
- Can accept arguments of any type. Their actual values are extracted inside the UDF (lazy evaluation).
For the second point, adding a field to impala_udf::AnyVal that reflects the actual type may be enough.
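As a rough illustration of the two requirements above, the following self-contained C++ sketch uses a value that carries a runtime type tag and a function that takes a variable-length argument list. Everything here is an assumption made for illustration: `ValType`, `TaggedVal`, `MaskCharOr`, and `MaskShowFirstN` are hypothetical and are not part of impala_udf. The defaults (keep 4 characters, mask with 'X', 'x', 'n') are likewise assumed, and the -1 "do not mask" convention from the calls above is not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical runtime type tag; not part of impala_udf.
enum class ValType { kNull, kInt, kString };

// Hypothetical type-tagged value. The UDF inspects `type` at run time
// instead of relying on one overload per signature.
struct TaggedVal {
  ValType type = ValType::kNull;
  std::variant<std::monostate, int64_t, std::string> data;

  static TaggedVal Null() { return TaggedVal{}; }
  static TaggedVal Int(int64_t v) { return TaggedVal{ValType::kInt, v}; }
  static TaggedVal String(std::string s) {
    return TaggedVal{ValType::kString, std::move(s)};
  }
};

// Reads an optional mask character at position i. Falls back to `def`
// when the argument is missing, not a string, or an empty string.
char MaskCharOr(const std::vector<TaggedVal>& args, std::size_t i, char def) {
  if (i >= args.size() || args[i].type != ValType::kString) return def;
  const std::string& s = std::get<std::string>(args[i].data);
  return s.empty() ? def : s[0];
}

// Simplified mask_show_first_n: args are (val [, n [, upper [, lower [, digit]]]]).
// Returns NULL when the first argument is not a string.
TaggedVal MaskShowFirstN(const std::vector<TaggedVal>& args) {
  assert(!args.empty() && "at least the value to mask is required");
  if (args[0].type != ValType::kString) return TaggedVal::Null();

  std::size_t keep = 4;  // assumed default number of unmasked characters
  if (args.size() > 1 && args[1].type == ValType::kInt) {
    int64_t n = std::get<int64_t>(args[1].data);
    keep = n < 0 ? 0 : static_cast<std::size_t>(n);
  }
  const char upper = MaskCharOr(args, 2, 'X');
  const char lower = MaskCharOr(args, 3, 'x');
  const char digit = MaskCharOr(args, 4, 'n');

  std::string out = std::get<std::string>(args[0].data);
  for (std::size_t i = keep; i < out.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(out[i]);
    if (std::isupper(c)) out[i] = upper;
    else if (std::islower(c)) out[i] = lower;
    else if (std::isdigit(c)) out[i] = digit;
  }
  return TaggedVal::String(std::move(out));
}
```

A single entry point then covers calls such as `MaskShowFirstN({TaggedVal::String("Hello123")})` and `MaskShowFirstN({TaggedVal::String("Hello123"), TaggedVal::Int(2), TaggedVal::String("U")})`, because argument count and types are resolved inside the function rather than through separate overloads.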
Issue Links
- relates to
  - IMPALA-11162 Provide support for Hive Generic UDFs (Resolved)
  - IMPALA-7877 Support Hive GenericUDF (Resolved)