Apache Arrow / ARROW-3703

[Python] DataFrame.to_parquet crashes if datetime column has time zones

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.1
    • Fix Version/s: 0.12.0
    • Component/s: Python
    • Environment:
      pandas 0.23.4
      pyarrow 0.11.1
      Python 2.7, 3.5 - 3.7
      macOS High Sierra (10.13.6)

      Description

      On CPython 2.7.15, 3.5.6, 3.6.6, and 3.7.0, a pandas DataFrame holding datetime.datetime objects serializes to Parquet just fine, but DataFrame.to_parquet crashes with an AttributeError if those datetimes carry a standard-library tzinfo (such as datetime.timezone.utc on Python 3).

      To reproduce, on Python 3:

      import datetime as dt
      import pandas as pd
      
      df = pd.DataFrame({'foo': [dt.datetime(2018, 1, 1, 1, 23, 45, tzinfo=dt.timezone.utc)]})
      df.to_parquet('data.parq')
      

       

      On Python 2, which has no datetime.timezone, create a concrete subclass of datetime.tzinfo, use an instance of it as the tzinfo, and try the same thing.
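
For the Python 2 case, a concrete tzinfo subclass might look like the sketch below. The class name UTC and the ZERO constant are illustrative choices, not taken from the report; the code is intended to run unchanged on Python 2 and 3.

```python
import datetime as dt

ZERO = dt.timedelta(0)


class UTC(dt.tzinfo):
    """Minimal fixed-offset UTC tzinfo (illustrative name, not from the report)."""

    def utcoffset(self, d):
        return ZERO

    def tzname(self, d):
        return "UTC"

    def dst(self, d):
        return ZERO


# A timezone-aware value equivalent to the one in the Python 3 reproduction.
stamp = dt.datetime(2018, 1, 1, 1, 23, 45, tzinfo=UTC())
```

According to the report, putting a value like stamp into the DataFrame in place of the Python 3 datetime should then hit the same failure in to_parquet.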

       

      The following exception results:

      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/core/frame.py", line 1945, in to_parquet
          compression=compression, **kwargs)
        File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 257, in to_parquet
          return impl.write(df, path, compression=compression, **kwargs)
        File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 118, in write
          table = self.api.Table.from_pandas(df)
        File "pyarrow/table.pxi", line 1217, in pyarrow.lib.Table.from_pandas
        File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 381, in dataframe_to_arrays
          convert_types)]
        File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 380, in <listcomp>
          for c, t in zip(columns_to_convert,
        File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 370, in convert_column
          return pa.array(col, type=ty, from_pandas=True, safe=safe)
        File "pyarrow/array.pxi", line 167, in pyarrow.lib.array
        File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 409, in get_datetimetz_type
          type_ = pa.timestamp(unit, tz)
        File "pyarrow/types.pxi", line 1038, in pyarrow.lib.timestamp
        File "pyarrow/types.pxi", line 955, in pyarrow.lib.tzinfo_to_string
      AttributeError: 'datetime.timezone' object has no attribute 'zone'
      
      

       
      This doesn't happen if you use pytz.UTC as the timezone object.
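
The last traceback frame suggests that tzinfo_to_string reads a zone attribute, which pytz timezones provide and datetime.timezone does not; that would explain why pytz.UTC works. The sketch below illustrates the difference with the standard library only and shows one possible fallback. The helper tz_to_string and its "+HH:MM" output format are hypothetical, not pyarrow's code and not necessarily how the 0.12.0 fix works.

```python
import datetime as dt


def tz_to_string(tz):
    """Hypothetical helper: name a tzinfo, tolerating objects without `.zone`."""
    zone = getattr(tz, "zone", None)  # pytz timezones expose their name here
    if zone is not None:
        return zone
    # Fall back to the fixed UTC offset, formatted as +HH:MM or -HH:MM.
    offset = tz.utcoffset(None)
    if offset is None:
        raise ValueError("cannot derive a name for %r" % (tz,))
    total_minutes = int(offset.total_seconds() // 60)
    sign = "-" if total_minutes < 0 else "+"
    hours, minutes = divmod(abs(total_minutes), 60)
    return "%s%02d:%02d" % (sign, hours, minutes)


# Standard-library timezones have no `.zone`, which is what the AttributeError reports.
has_zone = hasattr(dt.timezone.utc, "zone")
utc_name = tz_to_string(dt.timezone.utc)
ist_name = tz_to_string(dt.timezone(dt.timedelta(hours=5, minutes=30)))
```

On an affected version, building the datetimes with pytz.UTC instead avoids the crash, as noted above.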

              People

              • Assignee: Krisztian Szucs (kszucs)
              • Reporter: Diego Argueta (yiannisliodakis)
              • Votes: 0
              • Watchers: 3

                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0h
                  • Time Spent: 1.5h