Apache Arrow / ARROW-8100

[Python] timestamp[ms] and date64 data types not working as expected on write


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.15.1, 0.16.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      I expect that either timestamp[ms] or date64 will give me a millisecond-precision datetime/timestamp when written to a Parquet file. Instead, this is the behavior I see:

       

      >>> arr = pa.array([datetime(2020, 12, 20)])

      (I have also used pa.array([datetime(2020, 12, 20)], type=pa.timestamp('ms')), with no later casting.)

      >>> arr.cast(pa.timestamp('ms'), safe=False)

      <pyarrow.lib.TimestampArray object at 0x117f3d4c8>
      [
      2020-12-20 00:00:00.000
      ]

       

      >>> table = pa.Table.from_arrays([arr],

                                names=["start_date"])

      >>> table
      pyarrow.Table
      start_date: timestamp[us]

       

      // just to make sure

       

      >>> table.column("start_date").cast(pa.timestamp('ms'), safe=False)
      <pyarrow.lib.ChunkedArray object at 0x117f5e9a8>
      [
      [
      2020-12-20 00:00:00.000
      ]
      ]

       

      // just to make extra sure

       

      >>> schema = pa.schema([pa.field("start_date", pa.timestamp("ms"))])

      >>> table.cast(schema, safe=False)

      >>> parquet.write_table(table,
      ...                     "sldkfjasldkfj.parquet",
      ...                     coerce_timestamps="ms",
      ...                     compression="SNAPPY",
      ...                     allow_truncated_timestamps=True)

      Result for the written file:

      Schema:

      {
      "type" : "record",
      "name" : "schema",
      "fields" : [ {
      "name" : "start_date",
      "type" : [ "null",

      { "type" : "long", "logicalType" : "timestamp-millis" }

      ],
      "default" : null
      } ]
      }

      Data:

      start_date  
      1608422400000  

       

      That is a microsecond [us] value, despite casting to [ms] and setting the appropriate options on the write_table method. If it were a millisecond timestamp, it should translate back to a datetime accurately with fromtimestamp, but:
      >>> from datetime import datetime
      >>> datetime.fromtimestamp(1608422400000)
      Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      ValueError: year 52938 is out of range
      >>> datetime.fromtimestamp(1608422400000 / 1000)
      datetime.datetime(2020, 12, 19, 16, 0)
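For reference, datetime.fromtimestamp takes seconds since the epoch, so a millisecond value has to be divided by 1000 before conversion. The sketch below uses only the standard library and converts in UTC; the 16:00 result above is presumably a local-time rendering (a UTC-8 zone is an assumption, not something stated in the report).

```python
from datetime import datetime, timezone

millis = 1608422400000  # value read back from the Parquet file

# fromtimestamp() expects seconds, so scale milliseconds down first.
# Passing tz=timezone.utc avoids the shift to the machine's local zone.
utc_dt = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
print(utc_dt)  # 2020-12-20 00:00:00+00:00

# For comparison: the same number read as microseconds lands in January 1970.
as_micros = datetime.fromtimestamp(millis / 1_000_000, tz=timezone.utc)
print(as_micros.year)  # 1970
```

Read as milliseconds, the stored number matches the original datetime(2020, 12, 20) exactly.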
       

       

      OK, so then we should use the date64() type; after all, the docs say "Create instance of 64-bit date (milliseconds since UNIX epoch 1970-01-01)":

       
      >>> arr = pa.array([datetime(2020, 12, 20, 0, 0, 0, 123)], type=pa.date64())
      >>> arr
      <pyarrow.lib.Date64Array object at 0x11da877c8>
      [
      2020-12-20
      ]

      >>> table = pa.Table.from_arrays([arr], names=["start_date"])
      >>> table
      pyarrow.Table

      start_date: date64[ms]

      >>> parquet.write_table(table,
      ...                     "bebedabeep.parquet",
      ...                     coerce_timestamps="ms",
      ...                     compression="SNAPPY",
      ...                     allow_truncated_timestamps=True)

       

      Result for the written file:

      Schema:

      {
      "type" : "record",
      "name" : "schema",
      "fields" : [ {
      "name" : "start_date",
      "type" : [ "null",

      { "type" : "int", "logicalType" : "date" }

      ],
      "default" : null
      } ]
      }

      Data:

       

      start_date  
      18616  

       
      That is "days since UNIX epoch 1970-01-01", just like the date32() type; the time information is stripped off. We can confirm this:
      >>> arr.to_pylist()
      [datetime.date(2020, 12, 20)]
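The day count can be checked with the standard library alone. This small sketch converts 18616 days back to a calendar date and shows the millisecond value that a whole-day date64 would hold; the constant names are illustrative.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)
MILLIS_PER_DAY = 86_400_000

days = 18616  # value stored in the Parquet "date" column

# Days since the epoch map straight back to a calendar date.
print(EPOCH + timedelta(days=days))  # 2020-12-20

# The same date expressed as milliseconds since the epoch.
millis = days * MILLIS_PER_DAY
print(millis)  # 1608422400000
```

Both representations identify a calendar day only, which is consistent with the time component being dropped.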
       

      How do I write a millisecond precision timestamp with pyarrow.parquet?

          People

            Assignee: Unassigned
            Reporter: paul hess (phess)