Created attachment 30894 [details] Patch for poi-3.9 We found an easy way to improve POI's performance. The idea is to avoid re-checking in DateUtil.isADateFormat(int, String) if a given format string represents a date format if the same string is passed multiple times. This can be done safely by adding a single-static-entry cache and check if the parameters did change from the previous call, and in case invalidate the cache. Our attached patch first checks that the format and format index are the same as in the previous call, otherwise execute the real check and store the required data. For example, when running Poi 3.9 on a small document (~40 KB) and on a larger document (~13.5 MB), the patch reduces the running time giving a speedup of ~7% in the first case and ~12% in the second case. Additionally we executed an experiment using this patch with Tika 1.3. We ran a test with a set of nine documents (~13.9MB) obtaining a 27% speedup.
Thanks for the patch. The idea is cool but not thread safe. What if DateUtil.isADateFormat is called from two threads in parallel? I think you will have a race condition in this case. I would say we should declare this method as synchronized to be on the safe side. The same concern applies to https://issues.apache.org/bugzilla/show_bug.cgi?id=55612
(In reply to Yegor Kozlov from comment #1) > Thanks for the patch. The idea is cool but not thread safe. What if > DateUtil.isADateFormat is called from two threads in parallel? I think you > will have a race condition in this case. > > I would say we should declare this method as synchronized to be on the safe > side. > The same concern applies to > https://issues.apache.org/bugzilla/show_bug.cgi?id=55612 I changed the method DateUtil.isADateFormat as synchronized and re-done the tests. We obtain similar results (in the range of -2%) with the same single-thread setups (I don't have any multi-threaded test-case to use). More results for https://issues.apache.org/bugzilla/show_bug.cgi?id=55612 are coming.
Patch applied in r1533764 Yegor
Created attachment 30946 [details] Patch for poi-3.9 (synchronized)
Hi Yegor: Thanks for applying the patch! I saw that in r1533764 from the method signature that isADateFormat is not synchronized. Is that on purpose? In case I submitted the patch that I used to do the tests in my previous message. (In reply to Yegor Kozlov from comment #3) > Patch applied in r1533764 > > Yegor
Comment on attachment 30946 [details] Patch for poi-3.9 (synchronized) >diff -rupN -x '.*' poi-3.9/src/java/org/apache/poi/ss/usermodel/DateUtil.java poi-3.9-patch/src/java/org/apache/poi/ss/usermodel/DateUtil.java >--- poi-3.9/src/java/org/apache/poi/ss/usermodel/DateUtil.java 2012-11-25 13:28:05.000000000 +0100 >+++ poi-3.9-patch/src/java/org/apache/poi/ss/usermodel/DateUtil.java 2013-10-17 17:32:51.000000000 +0200 >@@ -288,17 +288,31 @@ public class DateUtil { > * @param formatIndex The index of the format, eg from ExtendedFormatRecord.getFormatIndex > * @param formatString The format string, eg from FormatRecord.getFormatString > * @see #isInternalDateFormat(int) >- */ >- public static boolean isADateFormat(int formatIndex, String formatString) { >- // First up, is this an internal date format? >- if(isInternalDateFormat(formatIndex)) { >- return true; >- } >- >- // If we didn't get a real string, it can't be >- if(formatString == null || formatString.length() == 0) { >- return false; >- } >+ */ >+ private static int lastFormatIndex = Integer.MIN_VALUE; >+ private static String lastFormatString = null; >+ private static boolean isADateFormatCache = false; >+ >+ public static synchronized boolean isADateFormat(int formatIndex, String formatString) { >+ >+ if (formatIndex == lastFormatIndex && formatString != null && formatString.equals(lastFormatString)) { >+ return isADateFormatCache; >+ } >+ >+ lastFormatIndex = formatIndex; >+ lastFormatString = formatString; >+ >+ // First up, is this an internal date format? >+ if (isInternalDateFormat(formatIndex)) { >+ isADateFormatCache = true; >+ return true; >+ } >+ >+ // If we didn't get a real string, it can't be >+ if(formatString == null || formatString.length() == 0) { >+ isADateFormatCache = false; >+ return false; >+ } > > String fs = formatString; > /*if (false) { >@@ -348,6 +362,7 @@ public class DateUtil { > > // short-circuit if it indicates elapsed time: [h], [m] or [s] > if(date_ptrn4.matcher(fs).matches()){ >+ isADateFormatCache = true; > return true; > } > >@@ -367,7 +382,9 @@ public class DateUtil { > // Otherwise, check it's only made up, in any case, of: > // y m d h s - \ / , . : > // optionally followed by AM/PM >- return date_ptrn3.matcher(fs).matches(); >+ boolean result = date_ptrn3.matcher(fs).matches(); >+ isADateFormatCache = result; >+ return result; > } > > /**
Created attachment 30947 [details] Patch for poi-3.9 (synchronized + null-check) Sorry for the additional long comments (Clicked wrong button)