[PIG-3341] Strict datetime parsing and improve performance of loading datetime values - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.11.1
Fix Version/s: 0.12.0, 0.11.2
Component/s: impl
Labels: None

Description

The performance of loading datetime values can be improved by about 25% by moving a single line in ToDate.java:

public static DateTimeZone extractDateTimeZone(String dtStr) {
    Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

should become:

static Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
public static DateTimeZone extractDateTimeZone(String dtStr) {

There is no need to recompile the regular expression for every value. I'm not sure whether this function is ever called concurrently, but Pattern objects are thread-safe anyway.
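For illustration, here is a minimal, self-contained sketch of the compile-once idiom using the same regular expression. The class name TimeZoneSuffix and the method extractZoneSuffix are hypothetical and not part of Pig. The real extractDateTimeZone returns a Joda-Time DateTimeZone, while this sketch returns only the matched suffix so that it needs nothing beyond the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the Pig code base.
class TimeZoneSuffix {
    // Compiled once when the class is loaded and shared by all callers.
    // Pattern is immutable and thread-safe; each call creates its own Matcher.
    private static final Pattern TZ_PATTERN =
        Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

    /** Returns the trailing zone designator (such as "Z" or "+23"), or null if there is none. */
    static String extractZoneSuffix(String dtStr) {
        Matcher matcher = TZ_PATTERN.matcher(dtStr);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractZoneSuffix("2000-01-01T00:00:00+23")); // prints +23
        System.out.println(extractZoneSuffix("2000-01-01T00:00:00Z"));   // prints Z
        System.out.println(extractZoneSuffix("2000-01-01"));             // prints null
    }
}
```

Only the Pattern is shared between calls. Matcher instances hold per-call state and are not thread-safe, so they stay local to the method.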

As a test, I created a file of 10M timestamps:

for i in 0..10000000
  puts '2000-01-01T00:00:00+23'
end

I then ran this script:

grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;

Before the change it took 160s.
After the change, the script took 120s.

----------------

Another performance improvement can be made for invalid datetime values. If a datetime value is invalid, an exception is created and thrown, which is a costly way to fail a validity check. To test the performance impact, I created 10M invalid datetime values:

for i in 0..10000000
  puts '2000-99-01T00:00:00+23'
end

In this test, the regex pattern was always recompiled. I then ran this script:

grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump B;

The script took 190s.

I understand this could be considered an edge case and might not be worth changing. However, if there are use cases where invalid dates are part of normal processing, then you might consider fixing this.
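One way to avoid paying for an exception on obviously bad input is a cheap pre-check that returns a boolean before the real parser is called. The sketch below is only an assumption about how such a check might look. It is not taken from the attached patches, and the class DateTimePrecheck, the method looksValid, and the accepted string shape are all hypothetical. It rejects out-of-range fields such as month 99 without constructing an exception. Values that pass would still need the full parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-check for illustration; not part of the Pig code base.
class DateTimePrecheck {
    // Assumed input shape: yyyy-MM-ddTHH:mm:ss followed by anything (fraction, zone).
    private static final Pattern SHAPE =
        Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}).*");

    /** Returns false for malformed or out-of-range input instead of throwing. */
    static boolean looksValid(String s) {
        if (s == null) {
            return false;
        }
        Matcher m = SHAPE.matcher(s);
        if (!m.matches()) {
            return false;
        }
        int month = Integer.parseInt(m.group(2));
        int day = Integer.parseInt(m.group(3));
        int hour = Integer.parseInt(m.group(4));
        int minute = Integer.parseInt(m.group(5));
        int second = Integer.parseInt(m.group(6));
        // Coarse ranges only: month lengths and leap years are left to the real parser.
        return month >= 1 && month <= 12
            && day >= 1 && day <= 31
            && hour <= 23 && minute <= 59 && second <= 59;
    }

    public static void main(String[] args) {
        System.out.println(looksValid("2000-01-01T00:00:00+23")); // prints true
        System.out.println(looksValid("2000-99-01T00:00:00+23")); // prints false
    }
}
```

Whether the extra check pays off depends on how common invalid values are, since valid values now go through both the pre-check and the parser.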

Attachments

PIG-3341-1.patch, 12/Jun/13 13:04, 14 kB, Rohini Palaniswamy
PIG-3341-2.patch, 12/Jun/13 13:22, 14 kB, Rohini Palaniswamy
PIG-3341-3.patch, 12/Jun/13 18:42, 16 kB, Rohini Palaniswamy
PIG-3341-3-branch11.patch, 13/Jun/13 13:05, 16 kB, Rohini Palaniswamy

Issue Links

is duplicated by: PIG-3442 ToDate() null pointer exception when date is NULL (Resolved)

People

Assignee: Rohini Palaniswamy

Reporter: pat chan

Votes: 0

Watchers: 4

Dates

Created: 31/May/13 06:11

Updated: 01/Oct/19 22:10

Resolved: 13/Jun/13 13:07