Spark / SPARK-23018

PySpark createDataFrame causes Pandas warning of assignment to a copy of a reference

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      Calling SparkSession.createDataFrame with a Pandas DataFrame as input (with Arrow disabled) raises a Pandas SettingWithCopyWarning when the DataFrame is a slice of another DataFrame:

      In [1]: import numpy as np
         ...: import pandas as pd
         ...: pdf = pd.DataFrame(np.random.rand(100, 2))
         ...: 
      
      In [2]: df = spark.createDataFrame(pdf[:10])
      /home/bryan/git/spark/python/pyspark/sql/session.py:476: SettingWithCopyWarning: 
      A value is trying to be set on a copy of a slice from a DataFrame.
      Try using .loc[row_indexer,col_indexer] = value instead
      
      See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
        pdf[column] = s
      

      This does not appear to produce incorrect results in this case, but it might in others. The warning could be avoided by assigning the series back to the DataFrame only when it is a timestamp field that was actually modified.
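The suggested approach can be sketched with Pandas alone. This is a minimal illustration, not the PySpark code in session.py: the helper name normalize_timestamps and the choice of converting tz-aware columns to naive UTC are assumptions made for the example. The point it shows is that the caller's (possibly sliced) DataFrame is never assigned to, and a copy is made only when some column actually changes.

```python
import pandas as pd


def normalize_timestamps(pdf):
    """Return pdf with tz-aware datetime columns converted to naive UTC.

    Hypothetical helper for illustration only. The input frame is never
    written to; a copy is made lazily, the first time a column changes.
    """
    result = pdf
    copied = False
    for column in pdf.columns:
        series = pdf[column]
        # Only tz-aware timestamp columns need to be rewritten.
        if isinstance(series.dtype, pd.DatetimeTZDtype):
            converted = series.dt.tz_convert("UTC").dt.tz_localize(None)
            if not copied:
                # Copy once so the assignment never targets the caller's slice.
                result = pdf.copy()
                copied = True
            result[column] = converted
    return result


# A slice with no timestamp columns is passed through without any assignment.
plain_slice = pd.DataFrame({"x": [1, 2, 3, 4]})[:2]
same = normalize_timestamps(plain_slice)

# A slice with a tz-aware column is copied before the column is replaced.
tz_slice = pd.DataFrame(
    {"t": pd.date_range("2018-01-01", periods=4, tz="UTC")}
)[:2]
fixed = normalize_timestamps(tz_slice)
```

Because the assignment only ever targets a fresh copy, a sliced input such as pdf[:10] should not trigger the warning, and frames without timestamp columns avoid the cost of a copy entirely.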

            People

            • Assignee:
              bryanc Bryan Cutler
              Reporter:
              bryanc Bryan Cutler