SPARK-23290

inadvertent change in handling of DateType when converting to pandas dataframe

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: PySpark
    • Labels:
      None
    • Target Version/s:

      Description

      In this PR, the way `DateType` is returned to users changed (line 1968 in dataframe.py). This can cause client code to fail, as in the following example from a Python terminal:

      >>> import datetime as dt
      >>> import pandas as pd
      >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
      >>> pdf.dtypes
      date    object
      num      int64
      dtype: object
      >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
      0    2015-01-01
      Name: date, dtype: object
      >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
      >>> pdf.dtypes
      date    object
      num      int64
      dtype: object
      >>> pdf['date'] = pd.to_datetime(pdf['date'])
      >>> pdf.dtypes
      date    datetime64[ns]
      num              int64
      dtype: object
      >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", line 2355, in apply
          mapped = lib.map_infer(values, f, convert=convert_dtype)
        File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer
        File "<stdin>", line 1, in <lambda>
      TypeError: strptime() argument 1 must be string, not Timestamp

      The session above shows both the old behavior (an `object` column, on which the `strptime` call succeeds) and the new behavior (a `datetime64[ns]` column, on which the same call raises `TypeError`). Since user code may rely on the old behavior, I'd suggest reverting this specific part of the change. Also note that the NOTE in the docstring of `_to_corrected_pandas_type` seems to be off: it refers to the old behavior, not the current one.
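
      Until the behavior is settled, client code could be written to tolerate either column type. The following is only a minimal sketch of that idea, not something taken from the PR: `to_date` is a hypothetical helper name, and it assumes that string values use the `%Y-%m-%d` format from the example above and that the column contains no missing values.

```python
import datetime as dt


def to_date(value):
    """Return a datetime.date for a 'YYYY-MM-DD' string, a date, or a datetime."""
    # Check datetime before date: datetime.datetime is a subclass of datetime.date.
    # pandas.Timestamp subclasses datetime.datetime, so it takes this branch too.
    if isinstance(value, dt.datetime):
        return value.date()
    if isinstance(value, dt.date):
        return value
    # Otherwise assume a string in the format used in the example above.
    return dt.datetime.strptime(value, '%Y-%m-%d').date()
```

      With a helper like this, `pdf['date'].apply(to_date)` should work whether the column holds strings, `datetime.date` objects, or `Timestamp` values, instead of calling `strptime` on each element directly.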

            People

            • Assignee: Takuya Ueshin (ueshin)
            • Reporter: Andre Menck (amenck)
            • Votes: 0
            • Watchers: 10
