Description
When Arrow optimization is enabled in Python 2.7 and I run:

import pandas
pdf = pandas.DataFrame(["test1", "test2"])
df = spark.createDataFrame(pdf)
df.show()
I got the following output:
+----------------+
|               0|
+----------------+
|[74 65 73 74 31]|
|[74 65 73 74 32]|
+----------------+
This appears to be because Python 2's str and bytes are the same type, so the result is technically correct:
>>> str == bytes
True
>>> isinstance("a", bytes)
True
1. Python 2 treats `str` as `bytes`.
2. PySpark added special code and hacks to recognize `str` as a string type.
3. PyArrow / Pandas follow Python 2's semantics, so `str` values are treated as binary.
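The hex values in the binary output above are simply the ASCII bytes of the original strings. A minimal Python 3 sketch (illustrative only, not PySpark code) showing the encoding, and that `str` and `bytes` are distinct types in Python 3:

```python
# In Python 3, unlike Python 2, str and bytes are separate types.
assert str is not bytes
assert not isinstance("a", bytes)

# The binary column values shown above are the ASCII bytes of the strings.
data = ["test1", "test2"]
encoded = [s.encode("ascii") for s in data]
hex_view = [" ".join("%02x" % b for b in e) for e in encoded]
print(hex_view)  # ['74 65 73 74 31', '74 65 73 74 32']
```

This matches the `[74 65 73 74 31]` rows in the DataFrame output, confirming the strings were round-tripped as raw bytes rather than as a string type.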
We might have to match the behaviour to PySpark's, but Python 2 is deprecated anyway, so I think it is better to just document this.
Attachments
Issue Links
- links to