Pig / PIG-3341

Strict datetime parsing and improve performance of loading datetime values

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.1
    • Fix Version/s: 0.12.0, 0.11.2
    • Component/s: impl
    • Labels:
      None

      Description

      The performance of loading datetime values can be improved by about 25% by moving a single line in ToDate.java:

      public static DateTimeZone extractDateTimeZone(String dtStr) {
          Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

      should become:

      static Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
      public static DateTimeZone extractDateTimeZone(String dtStr) {

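      There is no need to recompile the regular expression for every value. I'm not sure whether this function is ever called concurrently, but compiled Pattern objects are immutable and thread-safe anyway. Only the Matcher created from the pattern is not thread-safe, and a new one is created on each call.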
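      To illustrate the idea of compiling the pattern once, here is a minimal standalone sketch. The class name TimeZoneSuffix and the method extract are hypothetical and not part of Pig; the sketch returns the matched zone suffix as a string instead of a Joda DateTimeZone so that it runs with the JDK alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not the actual ToDate.java code.
public class TimeZoneSuffix {
    // Compiled once at class-load time; Pattern is immutable and thread-safe.
    private static final Pattern TIMEZONE_PATTERN =
            Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

    /** Returns the trailing zone designator ("Z", "+23", "-08:00", ...) or null if absent. */
    public static String extract(String dtStr) {
        // A Matcher is not thread-safe, so a fresh one is created per call.
        Matcher matcher = TIMEZONE_PATTERN.matcher(dtStr);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("2000-01-01T00:00:00+23")); // +23
        System.out.println(extract("2000-01-01T00:00:00Z"));   // Z
        System.out.println(extract("2000-01-01"));             // null
    }
}
```

      Declaring the field private static final (rather than just static) is a small design choice on top of the change proposed above: it prevents the shared pattern from being reassigned.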

      As a test, I created a file of 10M timestamps:

      for i in 0..10000000
      puts '2000-01-01T00:00:00+23'
      end

      I then ran this script:

      grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;

      Before the change, the script took 160s.
      After the change, it took 120s, a 25% reduction.

      ----------------

      Another performance improvement can be made for invalid datetime values. If a datetime value is invalid, an exception is created and thrown, which is a costly way to fail a validity check. To test the performance impact, I created 10M invalid datetime values:

      for i in 0..10000000
      puts '2000-99-01T00:00:00+23'
      end

      In this test, the regex pattern was still recompiled for every value (that is, without the change proposed above). I then ran this script:

      grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump B;

      The script took 190s.

      I understand this could be considered an edge case and might not be worth changing. However, if there are use cases where invalid dates are part of normal processing, then you might consider fixing this.
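      One way to avoid paying for an exception on every bad value is to reject obviously invalid input with plain comparisons before handing it to the parser. The sketch below is only an illustration of that idea, under the assumption that values start with a fixed yyyy-MM-dd date; Pig accepts more datetime forms than this, and the class and method names (DatePrecheck, hasValidDatePrefix) are hypothetical. It is not what the attached patches do.

```java
import java.time.YearMonth;

// Hypothetical illustration class; not part of Pig.
public class DatePrecheck {
    /**
     * Returns true if s starts with a valid yyyy-MM-dd calendar date.
     * Invalid input is reported via the return value, so no exception is thrown.
     */
    public static boolean hasValidDatePrefix(String s) {
        if (s == null || s.length() < 10 || s.charAt(4) != '-' || s.charAt(7) != '-') {
            return false;
        }
        int year = digits(s, 0, 4);
        int month = digits(s, 5, 7);
        int day = digits(s, 8, 10);
        // digits() returns -1 for non-digit input, which these range checks reject.
        if (year < 0 || month < 1 || month > 12 || day < 1) {
            return false;
        }
        // The month is known to be in 1..12 here, so YearMonth.of does not throw.
        return day <= YearMonth.of(year, month).lengthOfMonth();
    }

    /** Parses s[from, to) as a non-negative decimal number; returns -1 on a non-digit. */
    private static int digits(String s, int from, int to) {
        int value = 0;
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return -1;
            }
            value = value * 10 + (c - '0');
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(hasValidDatePrefix("2000-01-01T00:00:00+23")); // true
        System.out.println(hasValidDatePrefix("2000-99-01T00:00:00+23")); // false
    }
}
```

      A pre-check like this only pays off if invalid values are common; for mostly valid data it adds a small cost to every value.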

      1. PIG-3341-1.patch
        14 kB
        Rohini Palaniswamy
      2. PIG-3341-2.patch
        14 kB
        Rohini Palaniswamy
      3. PIG-3341-3.patch
        16 kB
        Rohini Palaniswamy
      4. PIG-3341-3-branch11.patch
        16 kB
        Rohini Palaniswamy

          Activity

          pat chan created issue
          (Changes below are listed as Field: Original Value -> New Value; entries with a single value are shown as recorded.)
          Rohini Palaniswamy made changes
            Fix Version/s: 0.12 [ 12323380 ]
            Fix Version/s: 0.11.2 [ 12324296 ]
          Rohini Palaniswamy made changes
            Assignee: Rohini Palaniswamy [ rohini ]
          Rohini Palaniswamy made changes
            Attachment: PIG-3341-1.patch [ 12587430 ]
          Rohini Palaniswamy made changes
            Status: Open [ 1 ] -> Patch Available [ 10002 ]
          Rohini Palaniswamy made changes
            Summary: Improving performance of loading datetime values -> Strict datetime parsing and improve performance of loading datetime values
          Rohini Palaniswamy made changes
            Attachment: PIG-3341-2.patch [ 12587434 ]
          Rohini Palaniswamy made changes
            Attachment: PIG-3341-3.patch [ 12587479 ]
          Rohini Palaniswamy made changes
            Priority: Minor [ 4 ] -> Major [ 3 ]
          Rohini Palaniswamy made changes
            Attachment: PIG-3341-3-branch11.patch [ 12587618 ]
          Rohini Palaniswamy made changes
            Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
            Resolution: Fixed [ 1 ]
          Rohini Palaniswamy made changes
            Comment: [ Correction. I did not come across any doc which suggested yyyyMMdd is supported. ]
          Rohini Palaniswamy made changes
            Link: This issue is duplicated by PIG-3442 [ PIG-3442 ]
          Daniel Dai made changes
            Status: Resolved [ 5 ] -> Closed [ 6 ]

            People

            • Assignee:
              Rohini Palaniswamy
              Reporter:
              pat chan
            • Votes:
              0
              Watchers:
              4
