Description
In SPARK-27463, some refactoring was done, and two common abstract base classes were introduced:
1. BaseArrowPythonRunner
Before:
└── BasePythonRunner
    ├── ArrowPythonRunner
    ├── CoGroupedArrowPythonRunner
    ├── PythonRunner
    └── PythonUDFRunner
After:
BasePythonRunner
├── BaseArrowPythonRunner
│   ├── ArrowPythonRunner
│   └── CoGroupedArrowPythonRunner
├── PythonRunner
└── PythonUDFRunner
The problem is that the R code path is meant to mirror the Python side:
└── BaseRRunner
    ├── ArrowRRunner
    └── RRunner
I would like to match the hierarchy and decouple the other parts for now. Ideally, we should deduplicate both code paths; the internal implementations are also intentionally similar.
2. BasePandasGroupExec
Before:
├── FlatMapGroupsInPandasExec
└── FlatMapCoGroupsInPandasExec
After:
└── BasePandasGroupExec
    ├── FlatMapGroupsInPandasExec
    └── FlatMapCoGroupsInPandasExec
The problem is that R (with Arrow optimization, in particular) has some code duplicated with Pandas UDFs:
FlatMapGroupsInRWithArrowExec <> FlatMapGroupsInPandasExec
MapPartitionsInRWithArrowExec <> ArrowEvalPythonExec
To prepare for deduplication here as well, it might be better to avoid changing the hierarchy on the Python side alone, and instead just decouple it.
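As a rough illustration of what "decouple" could mean here, the sketch below moves shared behavior into a mix-in trait instead of an intermediate abstract base class, so the runner hierarchy stays flat. This is only one possible reading and the issue does not prescribe it. All names (`BaseRunner`, `Utf8Output`, `PlainRunner`, `PairRunner`) and signatures are hypothetical, simplified stand-ins, not Spark's actual classes.

```scala
// Hypothetical, simplified stand-ins for illustration; not Spark's real classes or signatures.
abstract class BaseRunner[IN, OUT] {
  protected def encode(input: IN): Array[Byte]
  protected def decode(bytes: Array[Byte]): OUT

  // Round-trips the input through the encode/decode pair.
  def run(input: IN): OUT = decode(encode(input))
}

// Shared output handling lives in a mix-in rather than in an intermediate
// abstract class, so both runners below extend BaseRunner directly.
trait Utf8Output { self: BaseRunner[_, String] =>
  protected def decode(bytes: Array[Byte]): String = new String(bytes, "UTF-8")
}

class PlainRunner extends BaseRunner[String, String] with Utf8Output {
  protected def encode(input: String): Array[Byte] = input.getBytes("UTF-8")
}

class PairRunner extends BaseRunner[(String, String), String] with Utf8Output {
  protected def encode(input: (String, String)): Array[Byte] =
    (input._1 + "|" + input._2).getBytes("UTF-8")
}
```

With this shape, a sibling hierarchy (for example, an R-side base runner in this analogy) could reuse the same kind of mix-in without having to grow a matching intermediate class.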
Issue Links
- relates to SPARK-27463 Support Dataframe Cogroup via Pandas UDFs (Resolved)