  Apache Arrow / ARROW-14104

Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to preserve the TimeZone - unlike in Arrow 4.0.0

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 5.0.0
    • Fix Version/s: None
    • Component/s: C++, Parquet, Python
    • Labels: None

    Description

      In Arrow 4.0.0, the time zone of List<Timestamp> columns round-trips through Parquet files:

      >>> import pyarrow as pa
      >>> import pyarrow.parquet as pq
      >>> import datetime

      >>> # One row holding a single-element list of time-zone-aware timestamps
      >>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], pa.list_(pa.timestamp('us', 'America/New_York')))

      >>> # The column is named 'Dates' to match the schema printed below
      >>> t = pa.Table.from_arrays([column], names=['Dates'])
      >>> pq.write_table(t, "example.parq")

      >>> t2 = pq.read_table("example.parq")
      >>> t2
      pyarrow.Table
      Dates: list<item: timestamp[us, tz=America/New_York]>
        child 0, item: timestamp[us, tz=America/New_York]
      

      However, reading the same Parquet file with pyarrow 5.0.0 returns the time zone as UTC:

      >>> t3 = pq.read_table("example.parq")
      >>> t3
      pyarrow.Table
      Dates: list<item: timestamp[us, tz=UTC]>
        child 0, item: timestamp[us, tz=UTC]

      I noticed that Arrow 5.0.0 does preserve the time zone when reading non-nested timestamp columns.

      Attachments

        1. exampleArrow4.parq
          0.8 kB
          Sarah Gilmore
        2. exampleArrow5.parq
          0.8 kB
          Sarah Gilmore

          People

            Assignee: Unassigned
            Reporter: Sarah Gilmore (sgilmore)