  1. Spark
  2. SPARK-27995

Note the difference between str in Python 2 and 3 when Arrow optimization is enabled


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark
    • Labels: None

    Description

      When Arrow optimization is enabled in Python 2.7,

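      # Assumes an existing SparkSession named `spark` (e.g. the pyspark shell)
      # with Arrow optimization already enabled.
      import pandas
      pdf = pandas.DataFrame(["test1", "test2"])  # Python 2 `str` values
      df = spark.createDataFrame(pdf)
      df.show()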
      

      I got the following output:

      +----------------+
      |               0|
      +----------------+
      |[74 65 73 74 31]|
      |[74 65 73 74 32]|
      +----------------+
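The bracketed values in the output above are the hexadecimal bytes of the original strings, which suggests the column was inferred as binary rather than string. A minimal check, assuming Python 3 (where `bytes.fromhex` is available) and with the hex text copied by hand from the first row:

```python
# Hex byte values copied from the first row of the output above.
row = "74 65 73 74 31"

# bytes.fromhex skips the spaces between the two-digit hex pairs.
raw = bytes.fromhex(row)
text = raw.decode("ascii")

print(raw)   # b'test1'
print(text)  # test1
```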
      

      The column is shown as binary apparently because Python 2's `str` and `bytes` are the same type, so the output is technically correct:

      >>> str == bytes
      True
      >>> isinstance("a", bytes)
      True
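For contrast, the same two checks can be run under Python 3, where `str` is a text type distinct from `bytes`. This is only a sketch of the language-level difference; the variable names are illustrative and it says nothing about Spark itself:

```python
# Run under Python 3; under Python 2 both results would be True,
# as in the session above.
same_type = (str == bytes)
is_bytes = isinstance("a", bytes)

print(same_type)  # False
print(is_bytes)   # False
```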
      

      1. Python 2 treats `str` as `bytes`.
      2. PySpark added special-case code and hacks to recognize `str` as a string type.
      3. PyArrow / Pandas follow Python 2's semantics and treat `str` as bytes.
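Given the three points above, one possible way to sidestep the difference on the user side is to decode byte strings to text before building the pandas DataFrame. The sketch below is an assumption, not something this issue proposes: `ensure_text` is a hypothetical helper that exists in neither PySpark nor pandas, and whether Arrow then infers a string column has not been verified here.

```python
def ensure_text(values, encoding="utf-8"):
    """Return a list in which every byte string has been decoded to text.

    `ensure_text` is a hypothetical helper, not part of PySpark or pandas.
    Under Python 2 a plain "test1" literal is a byte string, so it is
    decoded to `unicode`; under Python 3 a `str` passes through unchanged.
    """
    return [v.decode(encoding) if isinstance(v, bytes) else v for v in values]


values = ensure_text([b"test1", "test2"])
# Assumed usage: pandas.DataFrame(values) would then carry text
# values rather than byte strings into spark.createDataFrame.
```

Decoding at the boundary keeps the rest of the code identical on both Python versions, at the cost of having to choose an encoding explicitly.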

      We could make this match PySpark's behaviour, but Python 2 is deprecated anyway, so I think it's better to just document the difference.

            People

              Assignee: gurwls223 Hyukjin Kwon
              Reporter: gurwls223 Hyukjin Kwon