The objective is to list the tasks required to provide UDF support for the Apache Arrow streaming execution engine. In the first iteration we will focus on supporting Python-based UDFs.
The UDF integration will proceed as a series of sub-tasks covering development and PoCs. Note that this is the first iteration of UDF integration, with a limited scope. This ticket will cover the following topics:
- POC for UDF integration: evaluate the existing components in the source tree and identify the modifications and new building blocks required to integrate UDFs.
- The languages will be limited to C++ and Python: users can register a Python function as a UDF and use it via an `apply` method on Arrow Tables, or expose it as a computation endpoint via the `arrow::compute` API. Note that the C++ API already provides a way to register custom functions via the function registry API; at the moment this is not exposed to Python.
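To illustrate the registry concept the C++ API already provides, here is a minimal, self-contained Python sketch of how a name-based function registry could behave once exposed. The `FunctionRegistry` class and its `register`/`call` methods are hypothetical names for this illustration, not the actual Arrow API:

```python
# Hypothetical sketch of a UDF registry, loosely modeled on the idea
# of exposing Arrow's C++ function registry to Python. The class and
# method names here are illustrative only.

class FunctionRegistry:
    def __init__(self):
        self._functions = {}

    def register(self, name, func):
        """Register a Python callable as a named UDF."""
        if name in self._functions:
            raise ValueError(f"function '{name}' is already registered")
        self._functions[name] = func

    def call(self, name, *args):
        """Look up a registered UDF by name and invoke it."""
        if name not in self._functions:
            raise KeyError(f"no function registered under '{name}'")
        return self._functions[name](*args)


# Usage: register a simple scalar function and invoke it by name.
registry = FunctionRegistry()
registry.register("add_one", lambda x: x + 1)
print(registry.call("add_one", 41))  # -> 42
```

The design point this sketch captures is that UDFs are looked up by name at call time, so a registered function can later be referenced from `apply` or from a compute expression without the caller holding the Python object directly.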
- Planned features for this ticket are:
  - Scalar UDFs: UDFs executed per value (per row)
  - Vector UDFs: UDFs executed per batch (a full array or a partial array)
  - Aggregate UDFs: UDFs associated with an aggregation operation
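The three planned UDF kinds differ in their call convention, which the following sketch makes concrete. It uses plain Python lists standing in for Arrow arrays, and the three `run_*` driver functions are hypothetical illustrations of the execution semantics, not Arrow APIs:

```python
# Illustrative call conventions for the three planned UDF kinds.
# A plain Python list stands in for an Arrow array; the run_* helpers
# are hypothetical drivers, not part of Arrow.

def run_scalar_udf(func, values):
    # Scalar UDF: the function is invoked once per value (per row).
    return [func(v) for v in values]

def run_vector_udf(func, values, batch_size=2):
    # Vector UDF: the function is invoked once per batch
    # (a full array or a partial array) and returns a batch.
    out = []
    for i in range(0, len(values), batch_size):
        out.extend(func(values[i:i + batch_size]))
    return out

def run_aggregate_udf(func, values, init):
    # Aggregate UDF: the function folds the whole column
    # into a single accumulated value.
    acc = init
    for v in values:
        acc = func(acc, v)
    return acc


data = [1, 2, 3, 4]
print(run_scalar_udf(lambda v: v * 10, data))                      # [10, 20, 30, 40]
print(run_vector_udf(lambda batch: [v + 1 for v in batch], data))  # [2, 3, 4, 5]
print(run_aggregate_udf(lambda a, v: a + v, data, 0))              # 10
```

The practical difference is granularity: a scalar UDF sees one element at a time, a vector UDF can amortize per-call overhead across a batch, and an aggregate UDF produces a single result for the whole input.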
- Integration limitations:
  - Custom data types that are not supported by NumPy or Pandas are not supported.
  - Complex processing with parallelism inside a UDF is not supported.
  - Parallel UDFs are not supported in this initial version, although we will document the requirements and a rough design sketch for the next phase.