Spark / SPARK-23009

PySpark should not assume Pandas cols are a basestring type

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      When SparkSession.createDataFrame is called with a Pandas DataFrame as input and no schema is supplied, Spark assumes that every column label is either a str or a unicode object. Column labels can actually be any hashable type, for example the integer labels of a default RangeIndex. If a label is not a basestring type, a confusing AttributeError is thrown:

      In [16]: pdf = pd.DataFrame(np.random.rand(4, 2))
      
      In [17]: pdf
      Out[17]: 
                0         1
      0  0.145171  0.482940
      1  0.151336  0.299861
      2  0.220338  0.830133
      3  0.001659  0.513787
      
      In [18]: pdf.columns
      Out[18]: RangeIndex(start=0, stop=2, step=1)
      
      In [19]: df = spark.createDataFrame(pdf)
      ---------------------------------------------------------------------------
      AttributeError                            Traceback (most recent call last)
      <ipython-input-18-11bcb07e0e39> in <module>()
      ----> 1 df = spark.createDataFrame(pdf)
      
      /home/bryan/git/spark/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
          646             # If no schema supplied by user then get the names of columns only
          647             if schema is None:
      --> 648                 schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in data.columns]
          649 
          650             if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \
      
      AttributeError: 'int' object has no attribute 'encode'
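
A minimal workaround sketch for affected versions, assuming Python 3: convert every column label to a string before handing the DataFrame to Spark. The helper name `normalize_column_names` is hypothetical and not part of PySpark, and this is not necessarily the change that resolved the issue.

```python
def normalize_column_names(columns):
    """Return the column labels as a list of str.

    Hypothetical helper (not a PySpark API): bytes labels are decoded as
    UTF-8, str labels are kept as-is, and any other label (int, float,
    tuple, ...) is converted with str().
    """
    names = []
    for label in columns:
        if isinstance(label, bytes):
            names.append(label.decode("utf-8"))
        elif isinstance(label, str):
            names.append(label)
        else:
            names.append(str(label))
    return names


# Integer labels, like those of a default RangeIndex, become strings.
print(normalize_column_names(range(2)))  # ['0', '1']
```

With a Pandas DataFrame this could be applied as `pdf.columns = normalize_column_names(pdf.columns)` before calling `spark.createDataFrame(pdf)`.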
      

            People

            • Assignee: Bryan Cutler (bryanc)
            • Reporter: Bryan Cutler (bryanc)
