Details
Description
Add support to createDataFrame from a distributed collection of pandas.DataFrames by converting the RDD of pd.DFs to an RDD of arrow records batches, then directly creating the spark DataFrame from it.
Performance is significantly better (vectorized) than creating a spark DF by converting each df to a list of rows, similar to the improvement of SPARK-20791.
Initial example & benchmark for older spark versions: https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
I'm currently working on a PR and will post it soon.
Extends the work done in:
https://issues.apache.org/jira/browse/SPARK-20791
https://issues.apache.org/jira/browse/SPARK-23030