Pig
  1. Pig
  2. PIG-3341

Strict datetime parsing and improve performance of loading datetime values

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.1
    • Fix Version/s: 0.12.0, 0.11.2
    • Component/s: impl
    • Labels:
      None

      Description

      The performance of loading datetime values can be improved by about 25% by moving a single line in ToDate.java:

      public static DateTimeZone extractDateTimeZone(String dtStr) {
      Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]

      {0,12}))((\\+|-)
      d{2}(:?
      d{2})?))$");;

      should become:

      static Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}

      ))((\\+|-)
      d

      {2}(:?
      d{2}

      )?))$");
      public static DateTimeZone extractDateTimeZone(String dtStr) {

      There is no need to recompile the regular expression for every value. I'm not sure if this function is ever called concurrently, but Pattern objects are thread-safe anyways.

      As a test, I created a file of 10M timestamps:

      for i in 0..10000000
      puts '2000-01-01T00:00:00+23'
      end

      I then ran this script:

      grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;

      Before the change it took 160s.
      After the change, the script took 120s.

      ----------------

      Another performance improvement can be made for invalid datetime values. If a datetime value is invalid, an exception is created and thrown, which is a costly way to fail a validity check. To test the performance impact, I created 10M invalid datetime values:

      for i in 0..10000000
      puts '2000-99-01T00:00:00+23'
      end

      In this test, the regex pattern was always recompiled. I then ran this script:

      grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump B;

      The script took 190s.

      I understand this could be considered an edge case and might not be worth changing. However, if there are use cases where invalid dates are part of normal processing, then you might consider fixing this.

      1. PIG-3341-3-branch11.patch
        16 kB
        Rohini Palaniswamy
      2. PIG-3341-3.patch
        16 kB
        Rohini Palaniswamy
      3. PIG-3341-2.patch
        14 kB
        Rohini Palaniswamy
      4. PIG-3341-1.patch
        14 kB
        Rohini Palaniswamy

        Issue Links

          Activity

          Hide
          Cheolsoo Park added a comment -

          This is fixed by PIG-3339, isn't it?

          Show
          Cheolsoo Park added a comment - This is fixed by PIG-3339 , isn't it?
          Hide
          pat chan added a comment -

          Yes, it looks like 3339 addresses this issue.
          However, there is possibly an even better improvement.
          By rewriting:

          @Override
          public DateTime bytesToDateTime(byte[] b) throws IOException {
          if (b == null)

          { return null; }
          try {
          DateTimeZone dtz = ToDate.extractDateTimeZone(dtStr);
          if (dtz == null) { return new DateTime(dtStr); } else { return new DateTime(dtStr, dtz); }

          to

          private static final DateTimeFormatter isoDateTimeFormatter
          = ISODateTimeFormat.dateOptionalTimeParser().withOffsetParsed();

          @Override
          public DateTime bytesToDateTime(byte[] b) throws IOException {
          if (b == null) { return null; }

          try {
          String dtStr = new String(b);
          return isoDateTimeFormatter.parseDateTime(dtStr);

          you will see another 2X performance improvement. The static formatter is thread-safe.

          Running the test case with this change now take 60s.
          This is half the time of the already optimized Pattern fix.

          Moreover, it turns out that the regular expression has a bug.
          In particular, the time part can be more than 12 character long.
          And when tickled, will produced corrupted datetime values.
          The proposed change will fix this bug.

          Show
          pat chan added a comment - Yes, it looks like 3339 addresses this issue. However, there is possibly an even better improvement. By rewriting: @Override public DateTime bytesToDateTime(byte[] b) throws IOException { if (b == null) { return null; } try { DateTimeZone dtz = ToDate.extractDateTimeZone(dtStr); if (dtz == null) { return new DateTime(dtStr); } else { return new DateTime(dtStr, dtz); } to private static final DateTimeFormatter isoDateTimeFormatter = ISODateTimeFormat.dateOptionalTimeParser().withOffsetParsed(); @Override public DateTime bytesToDateTime(byte[] b) throws IOException { if (b == null) { return null; } try { String dtStr = new String(b); return isoDateTimeFormatter.parseDateTime(dtStr); you will see another 2X performance improvement. The static formatter is thread-safe. Running the test case with this change now take 60s. This is half the time of the already optimized Pattern fix. Moreover, it turns out that the regular expression has a bug. In particular, the time part can be more than 12 character long. And when tickled, will produced corrupted datetime values. The proposed change will fix this bug.
          Hide
          Cheolsoo Park added a comment -

          pat chan, thank you for the clarification. Do you mind submitting a patch for what you propose?

          Show
          Cheolsoo Park added a comment - pat chan , thank you for the clarification. Do you mind submitting a patch for what you propose?
          Hide
          Rohini Palaniswamy added a comment -

          pat chan,
          Using a known formatter instead of doing a regular expression on our own is a better thing. Already saw that using smaller z for timezone instead of Z does not work correctly. ToDate.extractDateTimeZone() is called in many places. It would be good to fix all the places to call a new method ToDate.extractDateTime() which returns DateTime directly. And I think we should put this patch in 0.11.2 as it not only fixes performance, but also fixes corrupted values.

          Added you as a Contributor to pig. If you will be putting up a patch, please assign the jira to yourself. Else I can work on this jira.

          Show
          Rohini Palaniswamy added a comment - pat chan , Using a known formatter instead of doing a regular expression on our own is a better thing. Already saw that using smaller z for timezone instead of Z does not work correctly. ToDate.extractDateTimeZone() is called in many places. It would be good to fix all the places to call a new method ToDate.extractDateTime() which returns DateTime directly. And I think we should put this patch in 0.11.2 as it not only fixes performance, but also fixes corrupted values. Added you as a Contributor to pig. If you will be putting up a patch, please assign the jira to yourself. Else I can work on this jira.
          Hide
          pat chan added a comment -

          Before making the fix, I think there needs to be a little more clarity around exactly what formats are supported. For example, pig 0.11.1 currently supports datetime strings with no date - "T00:00:00" produces a date in 1970. Is this intentional?

          The general issue is that the actual implementation supports many more formats than is specified in the iso8601 profile http://www.w3.org/TR/NOTE-datetime. What is the specified policy regarding these extra formats? I see three choices:

          1. specify precisely the entire range of supported formats. In the implementation I submitted above, the spec (taken from joda docs) is:

          date-opt-time = date-element ['T' [time-element] [offset]]
          date-element = std-date-element | ord-date-element | week-date-element
          std-date-element = yyyy ['-' MM ['-' dd]]
          ord-date-element = yyyy ['-' DDD]
          week-date-element = xxxx '-W' ww ['-' e]
          time-element = HH [minute-element] | [fraction]
          minute-element = ':' mm [second-element] | [fraction]
          second-element = ':' ss [fraction]
          fraction = ('.' | ',') digit+

          2. state that the implementation may parse formats beyond the w3c profile but such formats may not be supported in future releases.

          3. run all dates through a regex that matches exactly the w3c profile and dates that don't conform to the format are turned into null.

          Show
          pat chan added a comment - Before making the fix, I think there needs to be a little more clarity around exactly what formats are supported. For example, pig 0.11.1 currently supports datetime strings with no date - "T00:00:00" produces a date in 1970. Is this intentional? The general issue is that the actual implementation supports many more formats than is specified in the iso8601 profile http://www.w3.org/TR/NOTE-datetime . What is the specified policy regarding these extra formats? I see three choices: 1. specify precisely the entire range of supported formats. In the implementation I submitted above, the spec (taken from joda docs) is: date-opt-time = date-element ['T' [time-element] [offset] ] date-element = std-date-element | ord-date-element | week-date-element std-date-element = yyyy ['-' MM ['-' dd] ] ord-date-element = yyyy ['-' DDD] week-date-element = xxxx '-W' ww ['-' e] time-element = HH [minute-element] | [fraction] minute-element = ':' mm [second-element] | [fraction] second-element = ':' ss [fraction] fraction = ('.' | ',') digit+ 2. state that the implementation may parse formats beyond the w3c profile but such formats may not be supported in future releases. 3. run all dates through a regex that matches exactly the w3c profile and dates that don't conform to the format are turned into null.
          Hide
          pat chan added a comment -

          Correction on option 3.
          Implement a Joda date formatter using the spec in the w3c profile so only that format is supported.

          Show
          pat chan added a comment - Correction on option 3. Implement a Joda date formatter using the spec in the w3c profile so only that format is supported.
          Hide
          Rohini Palaniswamy added a comment -

          Before making the fix, I think there needs to be a little more clarity around exactly what formats are supported. For example, pig 0.11.1 currently supports datetime strings with no date - "T00:00:00" produces a date in 1970. Is this intentional?

          I don't think anyone is looking for such a behaviour. Not intuitive.

          I think we can go with option 1 (more is better) but also state which of those formats supported are not part of w3c profile. We also need to return null if it does not confirm to the format instead of throwing an error.

          Show
          Rohini Palaniswamy added a comment - Before making the fix, I think there needs to be a little more clarity around exactly what formats are supported. For example, pig 0.11.1 currently supports datetime strings with no date - "T00:00:00" produces a date in 1970. Is this intentional? I don't think anyone is looking for such a behaviour. Not intuitive. I think we can go with option 1 (more is better) but also state which of those formats supported are not part of w3c profile. We also need to return null if it does not confirm to the format instead of throwing an error.
          Hide
          pat chan added a comment -

          Hi, you bring up two good design points.

          1. are more formats the better for this use case? Some possible cons:

          a) the spec becomes more complicated for probably unused formats. The simplest spec would be to conform to the w3c profile.
          b) you will have to support all these formats forever
          c) there could be a performance overhead to support the possibly unused formats
          d) ToDate(s,f) and UDFs already give users the ability to handle any format that's needed.
          e) asymmetry: seems cleaner if the default parseable format is exactly the default printed format

          2. What is the design philosophy for invalid conversions? Quietly turning invalid values into null seems like it could be a possibly dangerous default since it would be really hard to know if your query on terabytes of data is encountering problems which are quietly being ignored. A safer philosophy would have the default be as strict with the data as possible and then if the user finds a legitimate case for null-conversions, provide a way for the user to enable it explicitly in the script.

          cheers

          Show
          pat chan added a comment - Hi, you bring up two good design points. 1. are more formats the better for this use case? Some possible cons: a) the spec becomes more complicated for probably unused formats. The simplest spec would be to conform to the w3c profile. b) you will have to support all these formats forever c) there could be a performance overhead to support the possibly unused formats d) ToDate(s,f) and UDFs already give users the ability to handle any format that's needed. e) asymmetry: seems cleaner if the default parseable format is exactly the default printed format 2. What is the design philosophy for invalid conversions? Quietly turning invalid values into null seems like it could be a possibly dangerous default since it would be really hard to know if your query on terabytes of data is encountering problems which are quietly being ignored. A safer philosophy would have the default be as strict with the data as possible and then if the user finds a legitimate case for null-conversions, provide a way for the user to enable it explicitly in the script. cheers
          Hide
          Rohini Palaniswamy added a comment -

          The current behavior returns null if there is a invalid value while loading as datetime. Pig as far as I have seen does not fail loading when there is invalid values. But UDFs do fail.

          Asking the old timers..

          Alan Gates/Daniel Dai/Dmitriy V. Ryaboy/Julien Le Dem/Thejas M Nair,
          How should we handle the invalid dates?

          Show
          Rohini Palaniswamy added a comment - The current behavior returns null if there is a invalid value while loading as datetime. Pig as far as I have seen does not fail loading when there is invalid values. But UDFs do fail. Asking the old timers.. Alan Gates / Daniel Dai / Dmitriy V. Ryaboy / Julien Le Dem / Thejas M Nair , How should we handle the invalid dates?
          Hide
          Dmitriy V. Ryaboy added a comment -

          I don't think we are completely consistent, but turning invalid into null has been pretty standard.

          My personal preference is also to increment a counter for # of such conversions, and to log the first N occurrences (when N errors are encountered, log something to the effect of "not logging this error any more because there's so much of it.")

          Show
          Dmitriy V. Ryaboy added a comment - I don't think we are completely consistent, but turning invalid into null has been pretty standard. My personal preference is also to increment a counter for # of such conversions, and to log the first N occurrences (when N errors are encountered, log something to the effect of "not logging this error any more because there's so much of it.")
          Hide
          pat chan added a comment -

          I was looking in the docs for any documentation on this topic. I found the following in http://wiki.apache.org/pig/UDFManual

          <quote>
          The first thing to decide is what to do with invalid data. This depends on the format of the data. If the data is of type bytearray it means that it has not yet been converted to its proper type. In this case, if the format of the data does not match the expected type, a null value should be returned. If, on the other hand, the input data is of another type, this means that the conversion has already happened and the data should be in the correct format. This is the case with our example and that's why it throws an error (line 16.) Note that WrappedIOException is a helper class to convert the actual exception to an IOException.

          Also, note that lines 10-11 check if the input data is null or empty and if so returns null.
          </quote>

          If I'm reading this correctly, it says that if the type of the input doesn't match the signature of the UDF, a null should be returned. However, I get this:

          grunt> A = load 'o' as (a:bytearray);
          grunt> B = foreach A generate ToDate(a); dump B;
          2013-06-03 17:15:09,253 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1046:
          <line 2, column 23> Multiple matching functions for org.apache.pig.builtin.ToDate with input schema: (

          {long}

          ,

          {chararray}

          ). Please use an explicit cast.

          It also seems to be saying that if the types are right and the format is invalid, an error should be thrown. I just checked and yes, I get an error. However, this doesn't match Rohini's proposal to return a null instead. Also, as Dmitriy hinted, it's not philosophically consistent with loading behavior where invalid things turn into nulls.

          2013-06-03 17:25:12,977 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
          2013-06-03 17:25:12,981 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B

          BTW, the note about lines 10-11 isn't quite right. The code in the example doesn't have a check for null and so a null would cause an exception.

          Show
          pat chan added a comment - I was looking in the docs for any documentation on this topic. I found the following in http://wiki.apache.org/pig/UDFManual <quote> The first thing to decide is what to do with invalid data. This depends on the format of the data. If the data is of type bytearray it means that it has not yet been converted to its proper type. In this case, if the format of the data does not match the expected type, a null value should be returned. If, on the other hand, the input data is of another type, this means that the conversion has already happened and the data should be in the correct format. This is the case with our example and that's why it throws an error (line 16.) Note that WrappedIOException is a helper class to convert the actual exception to an IOException. Also, note that lines 10-11 check if the input data is null or empty and if so returns null. </quote> If I'm reading this correctly, it says that if the type of the input doesn't match the signature of the UDF, a null should be returned. However, I get this: grunt> A = load 'o' as (a:bytearray); grunt> B = foreach A generate ToDate(a); dump B; 2013-06-03 17:15:09,253 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1046: <line 2, column 23> Multiple matching functions for org.apache.pig.builtin.ToDate with input schema: ( {long} , {chararray} ). Please use an explicit cast. It also seems to be saying that if the types are right and the format is invalid, an error should be thrown. I just checked and yes, I get an error. However, this doesn't match Rohini's proposal to return a null instead. Also, as Dmitriy hinted, it's not philosophically consistent with loading behavior where invalid things turn into nulls. 2013-06-03 17:25:12,977 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed! 2013-06-03 17:25:12,981 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B BTW, the note about lines 10-11 isn't quite right. The code in the example doesn't have a check for null and so a null would cause an exception.
          Hide
          Rohini Palaniswamy added a comment -

          pat chan,
          I am fine with throwing error in UDF instead of returning null. Came across PIG-2620 and it is a better solution to define a proper behavior on what to do when a exception is thrown by a UDF instead of the UDF skipping invalid records by itself.

          Show
          Rohini Palaniswamy added a comment - pat chan , I am fine with throwing error in UDF instead of returning null. Came across PIG-2620 and it is a better solution to define a proper behavior on what to do when a exception is thrown by a UDF instead of the UDF skipping invalid records by itself.
          Hide
          Rohini Palaniswamy added a comment -

          Went with the ISODateTimeFormat only. It is a strict ISO Parser only. As for the additional types Pat mentioned

          ord-date-element = yyyy ['-' DDD]
          week-date-element = xxxx '-W' ww ['-' e]

          Saw that they were mentioned as part of ISO 8601 standard in http://en.wikipedia.org/wiki/ISO_8601#Week_dates and http://en.wikipedia.org/wiki/ISO_8601#Ordinal_dates. They are also mentioned in http://www.cl.cam.ac.uk/~mgk25/iso-time.html which http://www.w3.org/TR/NOTE-datetime refers to. http://www.w3.org/TR/NOTE-datetime only defines a profile of ISO 8601, consisting of a few date/time formats from ISO 8601, likely to satisfy most requirements. It is not the full set.

          So as ISODateTimeFormat is totally ISO 8601 compliant went with that instead of spending time to cut down the scope to http://www.w3.org/TR/NOTE-datetime.

          Also added some missing documentation for DateTime.

          Show
          Rohini Palaniswamy added a comment - Went with the ISODateTimeFormat only. It is a strict ISO Parser only. As for the additional types Pat mentioned ord-date-element = yyyy ['-' DDD] week-date-element = xxxx '-W' ww ['-' e] Saw that they were mentioned as part of ISO 8601 standard in http://en.wikipedia.org/wiki/ISO_8601#Week_dates and http://en.wikipedia.org/wiki/ISO_8601#Ordinal_dates . They are also mentioned in http://www.cl.cam.ac.uk/~mgk25/iso-time.html which http://www.w3.org/TR/NOTE-datetime refers to. http://www.w3.org/TR/NOTE-datetime only defines a profile of ISO 8601, consisting of a few date/time formats from ISO 8601, likely to satisfy most requirements. It is not the full set. So as ISODateTimeFormat is totally ISO 8601 compliant went with that instead of spending time to cut down the scope to http://www.w3.org/TR/NOTE-datetime . Also added some missing documentation for DateTime.
          Hide
          Rohini Palaniswamy added a comment -

          Cleaned up some whitespace in PIG-3341-2.patch.
          https://reviews.apache.org/r/11834/

          Show
          Rohini Palaniswamy added a comment - Cleaned up some whitespace in PIG-3341 -2.patch. https://reviews.apache.org/r/11834/
          Hide
          Rohini Palaniswamy added a comment -

          Patch with unused imports removed

          Show
          Rohini Palaniswamy added a comment - Patch with unused imports removed
          Hide
          Cheolsoo Park added a comment -

          +1.

          Show
          Cheolsoo Park added a comment - +1.
          Hide
          pat chan added a comment -

          Just to be clear, although the parser is "strict", meaning it won't accept non-iso8601 formats, it is not a complete parser. It actually only accepts a small fraction of the iso8601 standard.

          When I was trying to work with dates, I couldn't tell what formats were accepted. For example, the documentation suggests that yyyyMMdd would be accepted (since it was listed in the iso8601 reference) but I couldn't get it work. If possible, it would be great if the docs either said

          • supports all of the w3c profile and a few undocumented iso8601-compliant formats
          • or literally specify the accepted formats.

          Either would help clear up the documentation around formats quite a bit.

          Show
          pat chan added a comment - Just to be clear, although the parser is "strict", meaning it won't accept non-iso8601 formats, it is not a complete parser. It actually only accepts a small fraction of the iso8601 standard. When I was trying to work with dates, I couldn't tell what formats were accepted. For example, the documentation suggests that yyyyMMdd would be accepted (since it was listed in the iso8601 reference) but I couldn't get it work. If possible, it would be great if the docs either said supports all of the w3c profile and a few undocumented iso8601-compliant formats or literally specify the accepted formats. Either would help clear up the documentation around formats quite a bit.
          Hide
          Rohini Palaniswamy added a comment - - edited

          Committed to 0.11 and trunk (0.12). Thanks Cheolsoo and Pat Chan.

          pat chan,
          I committed before seeing your comment. Can update the doc in a separate jira.

          For example, the documentation suggests that yyyyMMdd would be accepted (since it was listed in the iso8601 reference) but I couldn't get it work.

          Can you give the link to the documentation. I did not come across any doc which suggested yyyyMMdd is supported.

          Show
          Rohini Palaniswamy added a comment - - edited Committed to 0.11 and trunk (0.12). Thanks Cheolsoo and Pat Chan. pat chan , I committed before seeing your comment. Can update the doc in a separate jira. For example, the documentation suggests that yyyyMMdd would be accepted (since it was listed in the iso8601 reference) but I couldn't get it work. Can you give the link to the documentation. I did not come across any doc which suggested yyyyMMdd is supported.
          Hide
          pat chan added a comment -

          http://en.wikipedia.org/wiki/ISO_8601#Calendar_dates shows

          > YYYY-MM-DD or YYYYMMDD

          However, I just went back to the docs and found that it doesn't explicitly link to this information (although there are links to the other parts of this page). I must have incorrectly assumed all formats on this page was supported. At this point, I know what's going on so the patch looks great.

          A small change that I'm sure would have made things clearer for me would be:

          • And for the information of ISO date and time formats, please refer to Date and Time Formats.
            + And for the information of supported ISO date and time formats, please refer to Date and Time Formats.

          Thanks for pushing this out.

          Show
          pat chan added a comment - http://en.wikipedia.org/wiki/ISO_8601#Calendar_dates shows > YYYY-MM-DD or YYYYMMDD However, I just went back to the docs and found that it doesn't explicitly link to this information (although there are links to the other parts of this page). I must have incorrectly assumed all formats on this page was supported. At this point, I know what's going on so the patch looks great. A small change that I'm sure would have made things clearer for me would be: And for the information of ISO date and time formats, please refer to Date and Time Formats. + And for the information of supported ISO date and time formats, please refer to Date and Time Formats. Thanks for pushing this out.

            People

            • Assignee:
              Rohini Palaniswamy
              Reporter:
              pat chan
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development