Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Currently, if an R expression is not entirely supported by the arrow compute engine, the entire input is pulled into memory for native R to operate on. It would instead be possible to add a custom compute function to the registry (inside R_init_arrow, probably) which evaluates any sub-expressions that could not be translated to native arrow compute expressions.
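As a rough illustration of the registry idea, here is a minimal sketch in C++. The names (Registry, Kernel, RegisterRExpr, Datum) are hypothetical stand-ins, not the actual arrow::compute API; the lambda plays the role of a cpp11::function wrapping an R closure.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-ins for arrow::Datum and an elementwise compute kernel.
using Datum = std::vector<double>;
using Kernel = std::function<Datum(const Datum&)>;

// Hypothetical minimal function registry in the spirit of
// arrow::compute::FunctionRegistry.
struct Registry {
  std::map<std::string, Kernel> fns;
  void Add(const std::string& name, Kernel k) { fns[name] = std::move(k); }
  Datum Call(const std::string& name, const Datum& in) { return fns.at(name)(in); }
};

// Register an opaque "r_expr" kernel: it wraps a black-box callback
// (standing in for an R closure held via cpp11::function) and applies
// it to each value of the input batch.
void RegisterRExpr(Registry& reg, std::function<double(double)> r_callback) {
  reg.Add("r_expr", [r_callback](const Datum& in) {
    Datum out;
    out.reserve(in.size());
    for (double v : in) out.push_back(r_callback(v));
    return out;
  });
}
```

The key point is that compute never needs to understand the wrapped callback; it only needs a kernel with the usual batch-in/batch-out shape.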
This would, for example, allow a filter expression that includes a call to an R function baz to be evaluated on a dataset larger than memory, with predicate and projection pushdown proceeding as normal for the sub-expressions that are translatable. The resulting expression might look something like this in C++:
call("and_kleene", {
    call("greater", {field_ref("a"), scalar(1)}),
    call("r_expr", {field_ref("b")},
         /*options=*/RexprOptions{cpp11::function(baz_sexp)}),
});
In this case, although the "r_expr" function is opaque to compute and datasets, we would still recognize that only fields "a" and "b" need to be materialized. Furthermore, the first member of the filter's conjunction is a > 1, which is translatable and could be used for predicate pushdown, for checking against parquet statistics, etc.
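The split described above can be sketched as a simple partition of the conjunction's members: translatable members form the pushdown filter (usable against parquet statistics), while opaque "r_expr" members become a residual filter applied after batches are materialized. The Expr struct and SplitConjunction are illustrative assumptions, not the actual arrow expression types.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a compute expression: just the function
// name plus whether it could be translated to native arrow compute.
struct Expr {
  std::string fn;
  bool translatable;
};

// Partition a conjunction's members: translatable ones can be pushed
// down to the scan; the rest must run as a residual filter after
// materialization.
std::pair<std::vector<Expr>, std::vector<Expr>>
SplitConjunction(const std::vector<Expr>& members) {
  std::vector<Expr> pushdown, residual;
  for (const auto& m : members) {
    (m.translatable ? pushdown : residual).push_back(m);
  }
  return {pushdown, residual};
}
```

For the example filter, the "greater" member would land in the pushdown set and the "r_expr" member in the residual set, so row groups excluded by statistics are never read at all.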
Since R is not multithreaded, the compute function would need to take a global lock to ensure only a single thread of R execution at a time. This would also block the interpreter, so it's not a high-performance solution, but it would block the interpreter less than doing everything in pure native R: at least some of the work could be offloaded to worker threads, and we could take advantage of batched input. Still, it seems like a worthwhile option to consider.
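The locking scheme might look like the following sketch: worker threads process batches concurrently, and only the calls into the single-threaded R interpreter serialize on a global mutex (analogous to Python's GIL). All names here are illustrative, and a plain callback stands in for actual R evaluation.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Global lock guarding every call into the (single-threaded) R
// interpreter; a stand-in for whatever lock the real binding would use.
std::mutex r_interpreter_mutex;

// Serialize "R" evaluation: only one thread may run R code at a time.
double CallIntoR(const std::function<double(double)>& r_fn, double v) {
  std::lock_guard<std::mutex> lock(r_interpreter_mutex);
  return r_fn(v);
}

// One worker thread per batch: the translatable part of the work (here,
// summing the batch) runs concurrently without the lock; only the
// opaque callback into R is serialized.
std::vector<double> EvalBatches(const std::vector<std::vector<double>>& batches,
                                const std::function<double(double)>& r_fn) {
  std::vector<double> results(batches.size(), 0.0);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < batches.size(); ++i) {
    workers.emplace_back([&, i] {
      double sum = 0;                     // lock-free, parallel work
      for (double v : batches[i]) sum += v;
      results[i] = CallIntoR(r_fn, sum);  // serialized via the global lock
    });
  }
  for (auto& t : workers) t.join();
  return results;
}
```

This is exactly the trade-off described above: the R callback is a serial bottleneck, but everything translatable around it still parallelizes across batches.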
Issue Links
- is related to: ARROW-14071 [R] Try to arrow_eval user-defined functions (In Progress)