Details
- Bug
- Status: Resolved
- Minor
- Resolution: Incomplete
- 1.5.1
- None
- Linux Debian, PySpark, in local testing.
- Patch
Description
In PySpark's SQLContext, when createDataFrame() is called with a pandas.DataFrame and a 'schema' made of StructFields, the function _createFromLocal() converts the pandas.DataFrame but ignores two things:
- The index column, because to_records() is called with the flag index=False.
- Timestamp records, because a date column can't be the index and pandas doesn't convert its records to the Timestamp type.
So converting a DataFrame from pandas to Spark SQL works poorly in scenarios with temporal records.
Doc: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
Affected code:
def _createFromLocal(self, data, schema):
    """
    Create an RDD for DataFrame from an list or pandas.DataFrame, returns
    the RDD and schema.
    """
    if has_pandas and isinstance(data, pandas.DataFrame):
        if schema is None:
            schema = [str(x) for x in data.columns]
        data = [r.tolist() for r in data.to_records(index=False)]  # HERE
    ...
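To make the report easier to reproduce, here is a minimal sketch that uses only pandas, with no Spark session. It shows the to_records(index=False) call flagged above dropping the index, and one possible way to keep the index and get plain Python datetimes instead. The sample data and the helper name to_local_rows are hypothetical and are not part of PySpark. How the resulting rows behave when passed to createDataFrame() with a TimestampType field is not covered here.

```python
import datetime

import pandas as pd

# Hypothetical sample data: a DatetimeIndex named "ts" plus one value column.
pdf = pd.DataFrame(
    {"value": [1.0, 2.0]},
    index=pd.DatetimeIndex(
        ["2015-10-01 12:00:00", "2015-10-02 12:00:00"], name="ts"
    ),
)

# Same expression as the line marked HERE: the index is not part of the rows.
rows_without_index = [r.tolist() for r in pdf.to_records(index=False)]


def to_local_rows(frame):
    """Hypothetical helper: keep the index and emit Python datetimes."""
    # Move the index into a regular leading column.
    flat = frame.reset_index()
    rows = []
    for record in flat.itertuples(index=False, name=None):
        # pandas yields Timestamp objects here; convert them explicitly.
        rows.append(
            tuple(
                v.to_pydatetime() if isinstance(v, pd.Timestamp) else v
                for v in record
            )
        )
    return rows


rows_with_index = to_local_rows(pdf)

print(rows_without_index)
print(rows_with_index)
```

This is only one possible workaround under the assumptions above. Calling reset_index() before the conversion changes the column order, so a schema passed alongside would need the index field first.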
Issue Links
- relates to SPARK-20791 Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame (Resolved)