[SPARK-27995] Note the difference between str of Python 2 and 3 at Arrow optimized - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 3.0.0
Component/s: PySpark
Labels: None

Description

When Arrow optimization is enabled in Python 2.7,

import pandas
pdf = pandas.DataFrame(["test1", "test2"])
df = spark.createDataFrame(pdf)
df.show()

I got the following output:

+----------------+
|               0|
+----------------+
|[74 65 73 74 31]|
|[74 65 73 74 32]|
+----------------+

This appears to happen because `str` and `bytes` are the same type in Python 2, so the binary output is technically correct:

>>> str == bytes
True
>>> isinstance("a", bytes)
True

1. Python 2 treats `str` as `bytes`.
2. PySpark added special-case code and hacks to recognize `str` as a string type.
3. PyArrow / Pandas follow Python 2's behaviour and treat `str` as binary.
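
To make the table above easier to read: the bracketed values are simply the bytes of the input strings written in hexadecimal. The following is a minimal sketch in plain Python, not Spark's own formatting code; `render_binary` and `parse_binary` are hypothetical helper names introduced only for illustration.

```python
def render_binary(value):
    """Format a byte string as space-separated hex pairs in brackets."""
    # bytearray() yields integers on both Python 2 and Python 3.
    return "[" + " ".join("{:02X}".format(b) for b in bytearray(value)) + "]"


def parse_binary(rendered):
    """Turn a '[74 65 ...]' style string back into the bytes it represents."""
    return bytes(bytearray.fromhex(rendered.strip("[]")))


print(render_binary(b"test1"))
print(parse_binary("[74 65 73 74 32]"))
```

Applied to the two rows shown in the output, this recovers the bytes of "test1" and "test2", which is consistent with the column being inferred as binary rather than string.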

We might have to match PySpark's behaviour here, but Python 2 is deprecated anyway, so I think it is better to just document the difference.
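
For readers who still need a string column under Python 2, one possible workaround is to decode byte strings to text before the data reaches Arrow. This is a sketch under assumptions, not something the issue prescribes: `ensure_text` is a hypothetical helper, UTF-8 is an assumed encoding, and applying it to each value of the pandas column before calling `spark.createDataFrame` has not been verified here.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string.

    On Python 2, a plain "..." literal is a byte string and is decoded to
    unicode; on Python 3, a plain "..." literal is already text and is
    returned unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical input mirroring the values from the report.
rows = [b"test1", b"test2"]
text_rows = [ensure_text(v) for v in rows]
print(text_rows)
```

The helper leaves non-bytes values untouched, so it can be applied to a mixed column without changing values that are already text.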

Issue Links

links to

GitHub Pull Request #24838

People

Assignee: Hyukjin Kwon

Reporter: Hyukjin Kwon

Votes: 0

Watchers: 0

Dates

Created: 11/Jun/19 03:05

Updated: 12/Dec/22 18:10

Resolved: 11/Jun/19 09:44