  1. Spark
  2. SPARK-21187 Complete support for remaining Spark data types in Arrow Converters
  3. SPARK-25351

Handle Pandas category type when converting from Python with Arrow

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 3.1.0
    • Component/s: PySpark

      Description

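      Pandas category types need to be handled when calling createDataFrame with Arrow enabled, and when a pandas_udf returns them. Without Arrow, Spark converts each element to the type of the category's values (string in the example below). With Arrow enabled, the same call fails with a TypeError because the Arrow dictionary type is not supported. For example: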

      In [1]: import pandas as pd
      
      In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
      
      In [3]: pdf["B"] = pdf["A"].astype('category')
      
      In [4]: pdf
      Out[4]: 
         A  B
      0  a  a
      1  b  b
      2  c  c
      3  a  a
      
      In [5]: pdf.dtypes
      Out[5]: 
      A      object
      B    category
      dtype: object
      
      In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
      
      In [8]: df = spark.createDataFrame(pdf)
      
      In [9]: df.show()
      +---+---+
      |  A|  B|
      +---+---+
      |  a|  a|
      |  b|  b|
      |  c|  c|
      |  a|  a|
      +---+---+
      
      
      In [10]: df.printSchema()
      root
       |-- A: string (nullable = true)
       |-- B: string (nullable = true)
      
      In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
      
      In [19]: df = spark.createDataFrame(pdf)   
      
         1667         spark_type = ArrayType(from_arrow_type(at.value_type))
         1668     else:
      -> 1669         raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
         1670     return spark_type
         1671 
      
      TypeError: Unsupported type in conversion from Arrow: dictionary<values=string, indices=int8, ordered=0>
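
      Until category columns are handled during the Arrow conversion, one possible workaround is to cast them to the dtype of their categories before calling createDataFrame. The following is a minimal sketch under that assumption; `decategorize` is a hypothetical helper written for illustration and is not part of PySpark or pandas.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def decategorize(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of pdf with every category column cast to its categories' dtype.

    Hypothetical helper for illustration only; assumes unique column names.
    """
    out = pdf.copy()
    for name in out.columns:
        dtype = out[name].dtype
        if isinstance(dtype, CategoricalDtype):
            # A category of strings becomes a plain string column, so Arrow
            # no longer sees a dictionary<values=..., indices=...> type.
            out[name] = out[name].astype(dtype.categories.dtype)
    return out


# Same frame as in the session above.
pdf = pd.DataFrame({"A": ["a", "b", "c", "a"]})
pdf["B"] = pdf["A"].astype("category")

plain = decategorize(pdf)
```

      The resulting `plain` frame no longer contains a category column, so it could then be passed to spark.createDataFrame(plain) with Arrow enabled. The original `pdf` is left unchanged.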
      

              People

              • Assignee:
                jalpan.randeri Jalpan Randeri
                Reporter:
                bryanc Bryan Cutler