Apache Arrow / ARROW-15883

[C++] Support for fractional seconds in strptime()

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++

    Description

      Currently, we can't parse "our own" string representation of a timestamp array with the timestamp parser strptime:

      import datetime
      import pyarrow as pa
      import pyarrow.compute as pc
      
      >>> pa.array([datetime.datetime(2022, 3, 5, 9)])
      <pyarrow.lib.TimestampArray object at 0x7f00c1d53dc0>
      [
        2022-03-05 09:00:00.000000
      ]
      
      # trying to parse the above representation as string
      >>> pc.strptime(["2022-03-05 09:00:00.000000"], format="%Y-%m-%d %H:%M:%S", unit="us")
      ...
      ArrowInvalid: Failed to parse string: '2022-03-05 09:00:00.000000' as a scalar of type timestamp[us]
      

      The failure is caused by the fractional-second part; without it, parsing works:

      >>> pc.strptime(["2022-03-05 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="us")
      <pyarrow.lib.TimestampArray object at 0x7f00c1d6f940>
      [
        2022-03-05 09:00:00.000000
      ]
      

      I think this fails because strptime only supports parsing seconds as an integer (https://man7.org/linux/man-pages/man3/strptime.3.html).
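To illustrate the limitation outside of Arrow, here is a minimal sketch using only the Python standard library. It is not Arrow's C++ implementation: `parse_with_fraction` is a hypothetical helper, and the choice to split on the first "." and truncate to microseconds is an assumption made for the example. It parses the integer-second part with a plain `%S` format and then adds the fractional part separately.

```python
from datetime import datetime, timedelta


def parse_with_fraction(value: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> datetime:
    """Hypothetical helper: parse `value` with an integer-second format,
    then add an optional trailing ".ffffff" fraction by hand."""
    # Assumption: the format itself contains no "." so the first "."
    # separates the whole-second part from the fraction.
    whole, sep, fraction = value.partition(".")
    parsed = datetime.strptime(whole, fmt)
    if not sep:
        # No fractional part present; the plain format is enough.
        return parsed
    if not (fraction.isascii() and fraction.isdigit()):
        raise ValueError(f"invalid fractional seconds in {value!r}")
    # Pad or truncate to six digits (microsecond precision).
    micros = int(fraction[:6].ljust(6, "0"))
    return parsed + timedelta(microseconds=micros)


if __name__ == "__main__":
    text = "2022-03-05 09:00:00.000000"
    try:
        # "%S" only consumes integer seconds, so the ".000000" is left over.
        datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError as exc:
        print("plain %S format failed:", exc)
    print(parse_with_fraction(text))
```

A dedicated format flag for fractional seconds would remove the need for this kind of manual splitting, which is what this issue asks for.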

      But this creates a strange situation: the timestamp parser cannot parse the representation we ourselves use for timestamps.

      In addition, for CSV we have a custom ISO parser (used by default), so when parsing the strings while reading a CSV file, the same string with fractional seconds does work:

      import io
      from pyarrow import csv
      
      s = b"""a
      2022-03-05 09:00:00.000000"""
      
      >>> csv.read_csv(io.BytesIO(s))
      pyarrow.Table
      a: timestamp[ns]
      ----
      a: [[2022-03-05 09:00:00.000000000]]
      

      I realize that you can use the generic "cast" to do this string parsing:

      >>> pc.cast(["2022-03-05 09:00:00.000000"], pa.timestamp("us"))
      <pyarrow.lib.TimestampArray object at 0x7f00c1d53d60>
      [
        2022-03-05 09:00:00.000000
      ]
      

      But this was not the first approach I thought of. I think it is quite typical to reach for strptime first, and it is confusing that it doesn't work. The error message is also not helpful.
      cc apitrou rokm

            People

              Assignee: Unassigned
              Reporter: Joris Van den Bossche (jorisvandenbossche)