Bug 55611 - Performance improvement (~7% up to ~27%) by adding a cache to DateUtil.isADateFormat(int, String)
Summary: Performance improvement (~7% up to ~27%) by adding a cache to DateUtil.isADat...
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HSSF (show other bugs)
Version: 3.9-FINAL
Hardware: PC Linux
: P2 enhancement (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-09-30 09:11 UTC by Luca Della Toffola
Modified: 2013-10-19 15:32 UTC (History)
0 users



Attachments
Patch for poi-3.9 (2.92 KB, text/plain)
2013-09-30 09:11 UTC, Luca Della Toffola
Details
Patch for poi-3.9 (synchronized) (2.52 KB, patch)
2013-10-19 14:50 UTC, Luca Della Toffola
Details | Diff
Patch for poi-3.9 (synchronized + null-check) (2.54 KB, patch)
2013-10-19 15:32 UTC, Luca Della Toffola
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Luca Della Toffola 2013-09-30 09:11:33 UTC
Created attachment 30894 [details]
Patch for poi-3.9

We found an easy way to improve POI's performance. The idea is to avoid re-checking
in DateUtil.isADateFormat(int, String) if a given format string represents a date format if the same string is passed multiple times. 
This can be done safely by adding a single-static-entry cache and check if the parameters did change from the previous call, and in case invalidate the cache.
Our attached patch first checks that the format and format index are the same as in the previous call, otherwise execute the real check and store the required data.
For example, when running Poi 3.9 on a small document (~40 KB) and on a larger document (~13.5 MB), the patch reduces the running time
giving a speedup of ~7% in the first case and ~12% in the second case.
Additionally we executed an experiment using this patch with Tika 1.3.
We ran a test with a set of nine documents (~13.9MB) obtaining a 27% speedup.
Comment 1 Yegor Kozlov 2013-10-13 06:58:20 UTC
Thanks for the patch. The idea is cool but not thread safe. What if DateUtil.isADateFormat is called from two threads in parallel? I think you will have a race condition in this case. 

I would say we should declare this method as synchronized to be on the safe side. 
The same concern applies to https://issues.apache.org/bugzilla/show_bug.cgi?id=55612
Comment 2 Luca Della Toffola 2013-10-17 21:20:37 UTC
(In reply to Yegor Kozlov from comment #1)
> Thanks for the patch. The idea is cool but not thread safe. What if
> DateUtil.isADateFormat is called from two threads in parallel? I think you
> will have a race condition in this case. 
> 
> I would say we should declare this method as synchronized to be on the safe
> side. 
> The same concern applies to
> https://issues.apache.org/bugzilla/show_bug.cgi?id=55612

I changed the method DateUtil.isADateFormat as synchronized and re-done the tests.
We obtain similar results (in the range of -2%) with the same single-thread setups (I don't have any multi-threaded test-case to use).

More results for https://issues.apache.org/bugzilla/show_bug.cgi?id=55612 are
coming.
Comment 3 Yegor Kozlov 2013-10-19 14:12:53 UTC
Patch applied in r1533764

Yegor
Comment 4 Luca Della Toffola 2013-10-19 14:50:12 UTC
Created attachment 30946 [details]
Patch for poi-3.9 (synchronized)
Comment 5 Luca Della Toffola 2013-10-19 14:54:15 UTC
Hi Yegor:

Thanks for applying the patch!
I saw that in r1533764 from the method signature that isADateFormat is not synchronized. Is that on purpose? In case I submitted the patch that I used to do the tests in my previous message.

(In reply to Yegor Kozlov from comment #3)
> Patch applied in r1533764
> 
> Yegor
Comment 6 Luca Della Toffola 2013-10-19 15:26:25 UTC
Comment on attachment 30946 [details]
Patch for poi-3.9 (synchronized)

>diff -rupN -x '.*' poi-3.9/src/java/org/apache/poi/ss/usermodel/DateUtil.java poi-3.9-patch/src/java/org/apache/poi/ss/usermodel/DateUtil.java
>--- poi-3.9/src/java/org/apache/poi/ss/usermodel/DateUtil.java	2012-11-25 13:28:05.000000000 +0100
>+++ poi-3.9-patch/src/java/org/apache/poi/ss/usermodel/DateUtil.java	2013-10-17 17:32:51.000000000 +0200
>@@ -288,17 +288,31 @@ public class DateUtil {
>      * @param formatIndex The index of the format, eg from ExtendedFormatRecord.getFormatIndex
>      * @param formatString The format string, eg from FormatRecord.getFormatString
>      * @see #isInternalDateFormat(int)
>-     */
>-    public static boolean isADateFormat(int formatIndex, String formatString) {
>-        // First up, is this an internal date format?
>-        if(isInternalDateFormat(formatIndex)) {
>-            return true;
>-        }
>-
>-        // If we didn't get a real string, it can't be
>-        if(formatString == null || formatString.length() == 0) {
>-            return false;
>-        }
>+     */  
>+    private static int lastFormatIndex = Integer.MIN_VALUE;
>+    private static String lastFormatString = null;
>+    private static boolean isADateFormatCache = false;
>+    
>+    public static synchronized boolean isADateFormat(int formatIndex, String formatString) {
>+       
>+       if (formatIndex == lastFormatIndex && formatString != null && formatString.equals(lastFormatString)) {
>+          return isADateFormatCache;
>+       }
>+       
>+       lastFormatIndex = formatIndex;
>+       lastFormatString = formatString;
>+       
>+       // First up, is this an internal date format?
>+       if (isInternalDateFormat(formatIndex)) {
>+          isADateFormatCache = true;
>+          return true;
>+       }
>+
>+       // If we didn't get a real string, it can't be
>+       if(formatString == null || formatString.length() == 0) {
>+          isADateFormatCache = false;
>+          return false;
>+       }
> 
>         String fs = formatString;
>         /*if (false) {
>@@ -348,6 +362,7 @@ public class DateUtil {
> 
>         // short-circuit if it indicates elapsed time: [h], [m] or [s]
>         if(date_ptrn4.matcher(fs).matches()){
>+            isADateFormatCache = true;
>             return true;
>         }
> 
>@@ -367,7 +382,9 @@ public class DateUtil {
>         // Otherwise, check it's only made up, in any case, of:
>         //  y m d h s - \ / , . :
>         // optionally followed by AM/PM
>-        return date_ptrn3.matcher(fs).matches();
>+        boolean result = date_ptrn3.matcher(fs).matches();
>+        isADateFormatCache = result;
>+        return result;
>     }
> 
>     /**
Comment 7 Luca Della Toffola 2013-10-19 15:28:02 UTC
Comment on attachment 30946 [details]
Patch for poi-3.9 (synchronized)

>diff -rupN -x '.*' poi-3.9/src/java/org/apache/poi/ss/usermodel/DateUtil.java poi-3.9-patch/src/java/org/apache/poi/ss/usermodel/DateUtil.java
>--- poi-3.9/src/java/org/apache/poi/ss/usermodel/DateUtil.java	2012-11-25 13:28:05.000000000 +0100
>+++ poi-3.9-patch/src/java/org/apache/poi/ss/usermodel/DateUtil.java	2013-10-17 17:32:51.000000000 +0200
>@@ -288,17 +288,31 @@ public class DateUtil {
>      * @param formatIndex The index of the format, eg from ExtendedFormatRecord.getFormatIndex
>      * @param formatString The format string, eg from FormatRecord.getFormatString
>      * @see #isInternalDateFormat(int)
>-     */
>-    public static boolean isADateFormat(int formatIndex, String formatString) {
>-        // First up, is this an internal date format?
>-        if(isInternalDateFormat(formatIndex)) {
>-            return true;
>-        }
>-
>-        // If we didn't get a real string, it can't be
>-        if(formatString == null || formatString.length() == 0) {
>-            return false;
>-        }
>+     */  
>+    private static int lastFormatIndex = Integer.MIN_VALUE;
>+    private static String lastFormatString = null;
>+    private static boolean isADateFormatCache = false;
>+    
>+    public static synchronized boolean isADateFormat(int formatIndex, String formatString) {
>+       
>+       if (formatIndex == lastFormatIndex && formatString != null && formatString.equals(lastFormatString)) {
>+          return isADateFormatCache;
>+       }
>+       
>+       lastFormatIndex = formatIndex;
>+       lastFormatString = formatString;
>+       
>+       // First up, is this an internal date format?
>+       if (isInternalDateFormat(formatIndex)) {
>+          isADateFormatCache = true;
>+          return true;
>+       }
>+
>+       // If we didn't get a real string, it can't be
>+       if(formatString == null || formatString.length() == 0) {
>+          isADateFormatCache = false;
>+          return false;
>+       }
> 
>         String fs = formatString;
>         /*if (false) {
>@@ -348,6 +362,7 @@ public class DateUtil {
> 
>         // short-circuit if it indicates elapsed time: [h], [m] or [s]
>         if(date_ptrn4.matcher(fs).matches()){
>+            isADateFormatCache = true;
>             return true;
>         }
> 
>@@ -367,7 +382,9 @@ public class DateUtil {
>         // Otherwise, check it's only made up, in any case, of:
>         //  y m d h s - \ / , . :
>         // optionally followed by AM/PM
>-        return date_ptrn3.matcher(fs).matches();
>+        boolean result = date_ptrn3.matcher(fs).matches();
>+        isADateFormatCache = result;
>+        return result;
>     }
> 
>     /**
Comment 8 Luca Della Toffola 2013-10-19 15:32:29 UTC
Created attachment 30947 [details]
Patch for poi-3.9 (synchronized + null-check)

Sorry for the additional long comments (Clicked wrong button)