  1. Apache Arrow
  2. ARROW-7758

[Python] Wrong conversion of timestamps that are out of bounds for pandas (eg 0000-01-01)

Details

    Description

      Using pandas.read_parquet() with pyarrow as the engine raises a ValueError when the Parquet file contains a date column with the value 0000-01-01.

      PySpark reads the same Parquet file without issues, and PyArrow versions up to 0.11.1 could read it as well.

       

      // code placeholder
      
      ---------------------------------------------------------------------------
      ValueError                                Traceback (most recent call last)
      <ipython-input-7-06e3cce13e18> in <module>
      ----> 1 df_init_df = read_parquet_files('{}/DebtFacility'.format(ext_path))
      
      <ipython-input-4-f12125c1c8fe> in read_parquet_files(folder_path)
            2     files = [f for f in os.listdir(folder_path) if f.endswith('parquet')]
            3 
      ----> 4     df_list = [pd.read_parquet(os.path.join(folder_path, f)) for f in files]
            5 
            6     print(files)
      
      <ipython-input-4-f12125c1c8fe> in <listcomp>(.0)
            2     files = [f for f in os.listdir(folder_path) if f.endswith('parquet')]
            3 
      ----> 4     df_list = [pd.read_parquet(os.path.join(folder_path, f)) for f in files]
            5 
            6     print(files)
      
      /opt/conda/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
          294 
          295     impl = get_engine(engine)
      --> 296     return impl.read(path, columns=columns, **kwargs)
      
      /opt/conda/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
          123         kwargs["use_pandas_metadata"] = True
          124         result = self.api.parquet.read_table(
      --> 125             path, columns=columns, **kwargs
          126         ).to_pandas()
          127         if should_close:
      
      /opt/conda/lib/python3.6/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()
      
      /opt/conda/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()
      
      /opt/conda/lib/python3.6/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata)
          702 
          703     _check_data_column_metadata_consistency(all_columns)
      --> 704     blocks = _table_to_blocks(options, table, categories)
          705     columns = _deserialize_column_index(table, all_columns, column_indexes)
          706 
      
      /opt/conda/lib/python3.6/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories)
          974 
          975     # Convert an arrow table to Block from the internal pandas API
      --> 976     result = pa.lib.table_to_blocks(options, block_table, categories)
          977 
          978     # Defined above
      
      /opt/conda/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()
      
      ValueError: year -1 is out of range
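
      The failing value can be checked against the two representations pandas conversion could plausibly target. The sketch below uses only the standard library and rests on stated assumptions: that the date is stored as days since 1970-01-01, that pandas timestamps are int64 nanoseconds since the epoch, and that the fallback is a Python datetime.date object (whose minimum year is 1). The helper names are hypothetical and not part of pyarrow or pandas.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)
NS_PER_DAY = 86_400 * 10**9
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

# 0000-01-01 in the proleptic Gregorian calendar. Python's date type cannot
# represent year 0, so step back from date.min (0001-01-01) by the 366 days
# of year 0, which is a leap year in that calendar.
DAYS_0000_01_01 = (datetime.date.min - EPOCH).days - 366


def fits_in_int64_ns(days_since_epoch: int) -> bool:
    """Hypothetical helper: is the day count representable as int64 nanoseconds?"""
    # INT64_MIN itself is excluded because pandas reserves it for NaT.
    return INT64_MIN < days_since_epoch * NS_PER_DAY <= INT64_MAX


def fits_in_python_date(days_since_epoch: int) -> bool:
    """Hypothetical helper: is the day count representable as datetime.date?"""
    try:
        EPOCH + datetime.timedelta(days=days_since_epoch)
    except OverflowError:
        return False
    return True


if __name__ == "__main__":
    print("days since epoch:", DAYS_0000_01_01)
    print("fits int64 ns:", fits_in_int64_ns(DAYS_0000_01_01))
    print("fits datetime.date:", fits_in_python_date(DAYS_0000_01_01))
```

      Under these assumptions, 0000-01-01 fits neither representation, which is consistent with a range error being raised during to_pandas() rather than a value being returned.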
      
      

            People

              jorisvandenbossche Joris Van den Bossche
              vagos7 Evangelos Pertsinis