Nutch / NUTCH-2642

MoreIndexingFilter parses ISO 8601 UTC dates in local time zone

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.1, 1.14, 1.15
    • Fix Version/s: 2.4, 1.16
    • Component/s: indexer, plugin
    • Labels:
      None

      Description

      The ISO 8601 pattern in MoreIndexingFilter.getTime is "yyyy-MM-dd'T'HH:mm:ss'Z'". Note the literal Z.

      https://github.com/apache/nutch/blob/b834b81/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java#L142

      Apache commons-lang's DateUtils parses in the local time zone by default. Because the pattern contains only a literal "Z" and no offset field, the parser cannot tell that a matching string specifies UTC:

      https://github.com/apache/commons-lang/blob/b610707/src/main/java/org/apache/commons/lang3/time/DateUtils.java#L370

      So, when parsing a date string such as "2018-09-04T12:34:56Z", the time is interpreted as local time rather than UTC:

      DateUtils.parseDate("2018-09-04T12:34:56Z", new String[] { "yyyy-MM-dd'T'HH:mm:ss'Z'" })
      => Tue Sep 04 12:34:56 PDT 2018 (1536089696000)

      I think a reasonable fix would be to use an offset pattern instead of a literal "Z": "yyyy-MM-dd'T'HH:mm:ssXXX". That pattern accepts "Z" as well as arbitrary offsets such as "-07:00":

      DateUtils.parseDate("2018-09-04T12:34:56Z", new String[] { "yyyy-MM-dd'T'HH:mm:ssXXX" })
      => Tue Sep 04 05:34:56 PDT 2018 (1536064496000)
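
The difference between the two patterns can be reproduced without depending on the machine's default time zone. The following is a minimal sketch, not the Nutch fix itself: it assumes that the JDK's `java.text.SimpleDateFormat` treats a quoted `'Z'` and the `XXX` offset field the same way the commons-lang parser does for these inputs, and it pins the zone to `America/Los_Angeles` to match the PDT output above. The class name `IsoZoneDemo` and the helper `parseMillis` are hypothetical names introduced for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

public class IsoZoneDemo {

    // Hypothetical helper: parse `text` with `pattern`, using `zoneId` as the
    // formatter's default zone, and return epoch milliseconds.
    static long parseMillis(String pattern, String text, String zoneId) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        // Stands in for the JVM's local time zone so the result is reproducible.
        format.setTimeZone(TimeZone.getTimeZone(zoneId));
        try {
            return format.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) {
        String input = "2018-09-04T12:34:56Z";
        String zone = "America/Los_Angeles";

        // Literal 'Z': the trailing Z is matched as plain text, so the
        // wall-clock time is read in the formatter's zone (12:34:56 PDT).
        System.out.println(parseMillis("yyyy-MM-dd'T'HH:mm:ss'Z'", input, zone));

        // Offset field XXX: the trailing Z is read as UTC (12:34:56 UTC).
        System.out.println(parseMillis("yyyy-MM-dd'T'HH:mm:ssXXX", input, zone));

        // XXX also accepts an explicit offset naming the same instant.
        System.out.println(parseMillis("yyyy-MM-dd'T'HH:mm:ssXXX",
                "2018-09-04T05:34:56-07:00", zone));
    }
}
```

With the zone pinned this way, the first call should yield 1536089696000 and the other two 1536064496000, matching the two values reported above. The seven-hour gap between them is the PDT offset.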

            People

            • Assignee: Unassigned
            • Reporter: John Lacey (johnl)