Apache Arrow / ARROW-10337

[C++] More liberal parsing of ISO8601 timestamps with fractional seconds

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++

      Description

      The current ISO8601 timestamp parser requires the fractional seconds to have exactly 3 digits for MILLI timestamps, 6 for MICRO, and 9 for NANO. From ParseTimestampISO8601 in cpp/src/arrow/util/value_parsing.h:

      {{ // We allow the following formats for all units:}}
      {{ // - "YYYY-MM-DD"}}
      {{ // - "YYYY-MM-DD[ T]hh"}}
      {{ // - "YYYY-MM-DD[ T]hhZ"}}
      {{ // - "YYYY-MM-DD[ T]hh:mm"}}
      {{ // - "YYYY-MM-DD[ T]hh:mmZ"}}
      {{ // - "YYYY-MM-DD[ T]hh:mm:ss"}}
      {{ // - "YYYY-MM-DD[ T]hh:mm:ssZ"}}
      {{ //}}
      {{ // We allow the following formats for unit==MILLI:}}
      {{ // - "YYYY-MM-DD[ T]hh:mm:ss.mmm"}}
      {{ // - "YYYY-MM-DD[ T]hh:mm:ss.mmmZ"}}
      {{ //}}
      {{ // We allow the following formats for unit==MICRO:}}
      {{ // - "YYYY-MM-DD[ T]hh:mm:ss.uuuuuu"}}
      {{ // - "YYYY-MM-DD[ T]hh:mm:ss.uuuuuuZ"}}
      {{ //}}
      {{ // We allow the following formats for unit==NANO:}}
      {{ // - "YYYY-MM-DD[ T]hh:mm:ss.nnnnnnnnn"}}
      {{ // - "YYYY-MM-DD[ T]hh:mm:ss.nnnnnnnnnZ"}}
      {{ //}}

      I propose that we change the parser to accept 1 to 3 digits for MILLI, 1 to 6 digits for MICRO, and 1 to 9 digits for NANO, as follows:

      {{ // We allow the following formats for all units:}}
      {{ // - "YYYY-MM-DD"}}
      {{ // - "YYYY-MM-DD[ T]hhZ?"}}
      {{ // - "YYYY-MM-DD[ T]hh:mmZ?"}}
      {{ // - "YYYY-MM-DD[ T]hh:mm:ssZ?"}}
      {{ //}}
      {{ // We allow the following formats for unit == MILLI, MICRO, or NANO:}}
      {{ // - "YYYY-MM-DD[ T]hh:mm:ss.s{1,3}Z?"}}
      {{ //}}
      {{ // We allow the following formats for unit == MICRO or NANO:}}
      {{ // - "YYYY-MM-DD[ T]hh:mm:ss.s{4,6}Z?"}}
      {{ //}}
      {{ // We allow the following formats for unit == NANO:}}
      {{ // - "YYYY-MM-DD[ T]hh:mm:ss.s{7,9}Z?"}}

      This will allow timestamps to be parsed when, for example, a CSV file omits trailing zeroes from the fractional seconds.
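      To illustrate the proposed rule, here is a minimal standalone sketch of how a variable-length fraction could be scaled to the target unit. It is not the actual change to ParseTimestampISO8601: the `Unit` enum and the `MaxFractionDigits` and `ParseSubseconds` helpers are hypothetical names invented for this example, and it assumes the digits after the `.` have already been split off from the rest of the timestamp.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the timestamp unit; not Arrow's actual type.
enum class Unit { MILLI, MICRO, NANO };

// Maximum number of fractional digits each unit can represent.
constexpr std::size_t MaxFractionDigits(Unit unit) {
  switch (unit) {
    case Unit::MILLI: return 3;
    case Unit::MICRO: return 6;
    case Unit::NANO: return 9;
  }
  return 0;
}

// Parse the digits that follow the '.' (e.g. "5" or "123456") into a count
// of `unit` subseconds. Returns std::nullopt if the string is empty, has more
// digits than the unit can hold, or contains a non-digit character.
std::optional<std::int64_t> ParseSubseconds(std::string_view digits, Unit unit) {
  const std::size_t max_digits = MaxFractionDigits(unit);
  if (digits.empty() || digits.size() > max_digits) return std::nullopt;

  std::int64_t value = 0;
  for (char c : digits) {
    if (c < '0' || c > '9') return std::nullopt;
    value = value * 10 + (c - '0');
  }
  // Scale as if the omitted trailing zeroes had been written out.
  for (std::size_t i = digits.size(); i < max_digits; ++i) value *= 10;
  return value;
}
```

      Under these assumptions, a fraction of "5" parsed as MILLI yields 500, "123" parsed as MICRO yields 123000, and "1234" parsed as MILLI is rejected because MILLI allows at most 3 digits.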

      I have almost finished implementing this functionality, so a PR will follow soon.

       

       
        


              People

              • Assignee: Frank Smith
              • Reporter: Frank Smith
              • Votes: 0
              • Watchers: 2


                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 0.5h