Apache Arrow / ARROW-10337

[C++] More liberal parsing of ISO8601 timestamps with fractional seconds

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • None
    • 3.0.0
    • C++

    Description

      The current ISO8601 timestamp parser requires exactly 3 fractional-second digits for MILLI timestamps, 6 for MICRO, and 9 for NANO. From the comments in ParseTimestampISO8601 in cpp/src/arrow/util/value_parsing.h:

      // We allow the following formats for all units:
      // - "YYYY-MM-DD"
      // - "YYYY-MM-DD[ T]hh"
      // - "YYYY-MM-DD[ T]hhZ"
      // - "YYYY-MM-DD[ T]hh:mm"
      // - "YYYY-MM-DD[ T]hh:mmZ"
      // - "YYYY-MM-DD[ T]hh:mm:ss"
      // - "YYYY-MM-DD[ T]hh:mm:ssZ"
      //
      // We allow the following formats for unit==MILLI:
      // - "YYYY-MM-DD[ T]hh:mm:ss.mmm"
      // - "YYYY-MM-DD[ T]hh:mm:ss.mmmZ"
      //
      // We allow the following formats for unit==MICRO:
      // - "YYYY-MM-DD[ T]hh:mm:ss.uuuuuu"
      // - "YYYY-MM-DD[ T]hh:mm:ss.uuuuuuZ"
      //
      // We allow the following formats for unit==NANO:
      // - "YYYY-MM-DD[ T]hh:mm:ss.nnnnnnnnn"
      // - "YYYY-MM-DD[ T]hh:mm:ss.nnnnnnnnnZ"
      //

      I propose that we change the parser to accept 1 to 3 digits for MILLI, 1 to 6 digits for MICRO, and 1 to 9 digits for NANO, as follows:

      // We allow the following formats for all units:
      // - "YYYY-MM-DD"
      // - "YYYY-MM-DD[ T]hhZ?"
      // - "YYYY-MM-DD[ T]hh:mmZ?"
      // - "YYYY-MM-DD[ T]hh:mm:ssZ?"
      //
      // We allow the following formats for unit == MILLI, MICRO, or NANO:
      // - "YYYY-MM-DD[ T]hh:mm:ss.s{1,3}Z?"
      //
      // We allow the following formats for unit == MICRO or NANO:
      // - "YYYY-MM-DD[ T]hh:mm:ss.s{4,6}Z?"
      //
      // We allow the following formats for unit == NANO:
      // - "YYYY-MM-DD[ T]hh:mm:ss.s{7,9}Z?"

      This allows timestamps to be parsed when, for example, a CSV file omits trailing zeros in the fractional seconds (such as "12:34:56.5" rather than "12:34:56.500").
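
To illustrate the proposed rule, here is a minimal standalone sketch of how a variable-length fractional part could be converted to a subsecond count. It is not the Arrow implementation: the `Unit` enum and the `MaxDigits` and `ParseSubseconds` helpers are hypothetical names invented for this example, and it assumes the caller has already isolated the digits that follow the decimal point.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical unit enum for this sketch (not Arrow's TimeUnit).
enum class Unit { MILLI, MICRO, NANO };

// Maximum number of fractional-second digits each unit can represent.
constexpr std::size_t MaxDigits(Unit unit) {
  switch (unit) {
    case Unit::MILLI: return 3;
    case Unit::MICRO: return 6;
    case Unit::NANO: return 9;
  }
  return 0;
}

// Convert the digits after the decimal point (e.g. "5" or "123456") into a
// count of `unit` subseconds. Returns std::nullopt for empty input, for
// non-digit characters, or for more digits than `unit` can hold.
std::optional<std::int64_t> ParseSubseconds(std::string_view digits, Unit unit) {
  const std::size_t max_digits = MaxDigits(unit);
  if (digits.empty() || digits.size() > max_digits) return std::nullopt;

  std::int64_t value = 0;
  for (char c : digits) {
    if (c < '0' || c > '9') return std::nullopt;
    value = value * 10 + (c - '0');
  }
  // Treat omitted trailing digits as zeros: ".5" at MILLI means 500 ms.
  for (std::size_t i = digits.size(); i < max_digits; ++i) value *= 10;
  return value;
}
```

Under these assumptions, `ParseSubseconds("5", Unit::MILLI)` yields 500 and `ParseSubseconds("123", Unit::MICRO)` yields 123000, while `ParseSubseconds("1234", Unit::MILLI)` is rejected because MILLI cannot represent a fourth digit.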

      I have nearly finished implementing this, so a PR will follow soon.

       

       
        

            People

              frank.smith Frank Smith
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h