LUCENE-6723

Date field problems using ExtractingRequestHandler and java 9 (b71)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Tracking bug to note that the (Tika-based) ExtractingRequestHandler does not work properly with JDK 9 starting with build 71 (b71).

      This first manifested itself with failures like this from the tests...

         [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=ExtractingRequestHandlerTest -Dtests.method=testArabicPDF -Dtests.seed=232D0A5404C2ADED -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=en_JM -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
         [junit4] ERROR   0.58s | ExtractingRequestHandlerTest.testArabicPDF <<<
         [junit4]    > Throwable #1: org.apache.solr.common.SolrException: Invalid Date String:'Tue Mar 09 13:44:49 GMT+07:00 2010'
      

      Workaround noted by Uwe...

      The test passes on JDK 9 b71 with:
      -Dargs="-Djava.locale.providers=JRE,SPI"

      This re-enables the old locale data. I will add this to the build parameters of Policeman Jenkins to stop this from
      failing. To me it looks like the locale data is somehow unable to correctly parse weekdays and/or timezones. I
      will check this out tomorrow and report a bug to the OpenJDK people. There is something fishy with the CLDR locale data.
      There are already some bugs open, so the work is not yet finished (e.g. sometimes it uses wrong timezone abbreviations, ...)

      1. SOLR-7770.patch
        4 kB
        Uwe Schindler

        Activity

        Hoss Man added a comment -

        Full details of an example failure...

        http://jenkins.thetaphi.de/job/Lucene-Solr-5.x-Linux/13200/
        Java: 64bit/jdk1.9.0-ea-b71 -XX:-UseCompressedOops -XX:+UseG1GC
        r1689849

           [junit4]   2> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=ExtractingRequestHandlerTest -Dtests.method=testArabicPDF -Dtests.seed=232D0A5404C2ADED -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=en_JM -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
           [junit4] ERROR   0.58s | ExtractingRequestHandlerTest.testArabicPDF <<<
           [junit4]    > Throwable #1: org.apache.solr.common.SolrException: Invalid Date String:'Tue Mar 09 13:44:49 GMT+07:00 2010'
           [junit4]    >        at __randomizedtesting.SeedInfo.seed([232D0A5404C2ADED:4DEB715B070706B8]:0)
           [junit4]    >        at org.apache.solr.schema.TrieDateField.parseMath(TrieDateField.java:150)
           [junit4]    >        at org.apache.solr.schema.TrieField.createField(TrieField.java:657)
           [junit4]    >        at org.apache.solr.schema.TrieField.createFields(TrieField.java:694)
           [junit4]    >        at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:48)
           [junit4]    >        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:123)
           [junit4]    >        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:83)
           [junit4]    >        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:237)
           [junit4]    >        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
           [junit4]    >        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
           [junit4]    >        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
           [junit4]    >        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:981)
           [junit4]    >        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
           [junit4]    >        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
           [junit4]    >        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
           [junit4]    >        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
           [junit4]    >        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
           [junit4]    >        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
           [junit4]    >        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
           [junit4]    >        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2058)
           [junit4]    >        at org.apache.solr.util.TestHarness.queryAndResponse(TestHarness.java:339)
           [junit4]    >        at org.apache.solr.handler.extraction.ExtractingRequestHandlerTest.loadLocalFromHandler(ExtractingRequestHandlerTest.java:737)
           [junit4]    >        at org.apache.solr.handler.extraction.ExtractingRequestHandlerTest.loadLocal(ExtractingRequestHandlerTest.java:744)
           [junit4]    >        at org.apache.solr.handler.extraction.ExtractingRequestHandlerTest.testArabicPDF(ExtractingRequestHandlerTest.java:526)
           [junit4]    >        at java.lang.Thread.run(Thread.java:745)
        
        Hoss Man added a comment -

        Misc comments from Uwe on the mailing list regarding this...

        I debugged the date parsing problems with a new test (TestDateUtil in solrj).

        The reason for this failing is the following 2 things, but they are related (if not even the same bug):

        • https://bugs.openjdk.java.net/browse/JDK-8129881 is triggered: Tika uses Date#toString(), which inserts a broken
          timezone abbreviation into the resulting date. This cannot be parsed anymore! This happens all the time in the
          ROOT locale (see below).
        • Solr uses Locale.ROOT to parse the date (of course, because it's language independent). This locale is missing all
          text representations of weekdays or timezones in OpenJDK's CLDR locale data, so it cannot parse the weekday or the
          time zones. If I change DateUtil to use Locale.ENGLISH, it works as expected.

        I will open a bug report at Oracle.

        ...

        I opened Report (Review ID: JI-9022158) - Change to CLDR Locale data in JDK 9 b71 causes SimpleDateFormat parsing errors

        ...

        I think the real issue here is the following (Rory can you add this to issue?):

        According to Unicode, all locales should fall back to the ROOT locale, if the specific Locale does not have data
        (e.g., http://cldr.unicode.org/development/development-process/design-proposals/generic-calendar-data). The problem
        is now that the CLDR Java implementation seems to fall back to the root locale, but the root locale does not have
        weekdays and time zone short names - our test verifies this: ROOT locale is missing all this information.

        This causes all the bugs, also the one in https://bugs.openjdk.java.net/browse/JDK-8129881. The root locale should
        have the default English weekday and timezone names (see
        http://cldr.unicode.org/development/development-process/design-proposals/generic-calendar-data).

        I think the ROOT locale and the fallback mechanism should be revisited in JDK's CLDR impl, there seems to be a bug
        with that (either missing data or the fallback to defaults does not work correctly).

        from Balchandra...

        Here is the JBS id: https://bugs.openjdk.java.net/browse/JDK-8130845
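
The claim above, that the ROOT locale may lack English weekday and month names, can be checked directly. The following is only a minimal sketch: the class name RootLocaleNames and its helper methods are hypothetical and not part of Solr or the JDK, and what it prints for Locale.ROOT depends on the JDK build and the active locale provider.

```java
import java.text.DateFormatSymbols;
import java.util.Calendar;
import java.util.Locale;

// Hypothetical helper class, used for illustration only.
final class RootLocaleNames {

  // Short weekday name (e.g. "Tue") that the given locale's data provides.
  // The array is indexed by Calendar.SUNDAY .. Calendar.SATURDAY.
  static String shortWeekday(Locale locale, int calendarDay) {
    return DateFormatSymbols.getInstance(locale).getShortWeekdays()[calendarDay];
  }

  // Short month name (e.g. "Mar") that the given locale's data provides.
  // The array is indexed by Calendar.JANUARY .. Calendar.DECEMBER.
  static String shortMonth(Locale locale, int calendarMonth) {
    return DateFormatSymbols.getInstance(locale).getShortMonths()[calendarMonth];
  }

  public static void main(String[] args) {
    for (Locale locale : new Locale[] {Locale.ROOT, Locale.ENGLISH}) {
      // The ROOT output is JDK-dependent; the ENGLISH output should be "Tue" and "Mar".
      System.out.printf(Locale.ENGLISH, "locale='%s' weekday='%s' month='%s'%n",
          locale, shortWeekday(locale, Calendar.TUESDAY), shortMonth(locale, Calendar.MARCH));
    }
  }
}
```

If the ROOT line prints something other than English names, any SimpleDateFormat built with Locale.ROOT cannot be expected to parse "Tue" or "Mar".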

        Hoss Man added a comment -

        Uwe also added some specific DateUtil tests of this that do not depend on Tika to produce the date values...

        http://svn.apache.org/r1690031
        http://svn.apache.org/r1690032

        Uwe Schindler added a comment -

        Hi, I will keep this issue open for a while. There is nothing we can do on the Solr side; this is really a bug. The only thing we could do is use Locale.ENGLISH instead of Locale.ROOT for date parsing, but that is just a workaround and not really a good one.

        Hoss Man added a comment -

        There is nothing we can do at Solr side, ...

        Correct. My main concern is having an open issue here to track the known problem and the workaround. Once there is a JDK 9 build that doesn't have this problem, we can resolve SOLR-7770 and note which JDK versions are known to work (vs. known to be broken).

        Uwe Schindler added a comment -

        Hi,
        https://bugs.openjdk.java.net/browse/JDK-8130845 gives the following:

        In fact, the parsing of weekday or month names in the root locale was a bug in earlier Java versions. According to Unicode, the root locale has month names like "M01", "M02", ... but no English month names. The same applies to weekdays.

        Using the root locale is fine for parsing ISO-formatted dates, but some of the formats are clearly "English", e.g. the "Cookie" or java.util.Date#toString() format. In Solr we should therefore change those SimpleDateFormats that rely on English names while parsing to use Locale.ENGLISH.

        In JDK 9 they fixed the problem, but we are still not 100% correct: I checked the CLDR locale data, and in fact it has no month names, only those "pseudo names". If we leave this as is, it may break again in later versions or for people using ICU SPIs for timezones or locales.

        I will provide a patch later for those date formats that use English names (I am currently on vacation, so don't hurry!). We should fix this in 5.3.
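
The change proposed above can be sketched roughly as follows. This is an illustration under assumptions, not the committed patch: the class EnglishDateParsing and its method are hypothetical, and the pattern is simply one that matches the Date#toString() layout seen in the failing test.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical helper class, used for illustration only.
final class EnglishDateParsing {

  // Matches the java.util.Date#toString() layout, which contains English
  // weekday and month names and a 24-hour clock.
  static final String DATE_TO_STRING_PATTERN = "EEE MMM d HH:mm:ss z yyyy";

  // Parses a Date#toString()-style value with an explicit English locale and
  // returns the epoch milliseconds; rejects unparseable input.
  static long parseEnglishMillis(String value) {
    try {
      return new SimpleDateFormat(DATE_TO_STRING_PATTERN, Locale.ENGLISH).parse(value).getTime();
    } catch (ParseException e) {
      throw new IllegalArgumentException("Invalid date string: '" + value + "'", e);
    }
  }

  public static void main(String[] args) {
    // The value from the failing testArabicPDF run.
    System.out.println(parseEnglishMillis("Tue Mar 09 13:44:49 GMT+07:00 2010"));
  }
}
```

ISO-style formats that contain only digits do not need this treatment and can keep using Locale.ROOT.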

        Uwe Schindler added a comment -

        Here is the patch. I will review Lucene/Solr a second time later, but this should cover all "English" date formats that should not use ROOT.

        ASF subversion and git services added a comment -

        Commit 1694276 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1694276 ]

        LUCENE-6723: Fix date parsing problems in Java 9 with date formats using English weekday/month names.

        ASF subversion and git services added a comment -

        Commit 1694277 from Uwe Schindler in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1694277 ]

        Merged revision(s) 1694276 from lucene/dev/trunk:
        LUCENE-6723: Fix date parsing problems in Java 9 with date formats using English weekday/month names.

        ASF subversion and git services added a comment -

        Commit 1694278 from Uwe Schindler in branch 'dev/branches/lucene_solr_5_3'
        [ https://svn.apache.org/r1694278 ]

        Merged revision(s) 1694277 from lucene/dev/branches/branch_5x:
        Merged revision(s) 1694276 from lucene/dev/trunk:
        LUCENE-6723: Fix date parsing problems in Java 9 with date formats using English weekday/month names.

        Uwe Schindler added a comment -

        I also committed to 5.3.

        Uwe Schindler added a comment -

        I am reopening this issue because there are still problems with Java 9 build 78 (which are bugs in the JDK). This time the timezones cannot be parsed correctly.

        Uwe Schindler added a comment -

        Hi Rory, hi Balchandra,

        I set up a quick round-trip test: it iterates over all available timezones in the JDK, sets each one as the default, creates a String from new Date().toString(), and then tries to parse it with the ENGLISH, US and ROOT locales.

        import java.text.ParseException;
        import java.text.SimpleDateFormat;
        import java.util.Date;
        import java.util.Locale;
        import java.util.TimeZone;
        
        public final class Test {
          
          private static void testParse(Locale locale, String date) {
            try {
              // Date#toString() prints a 24-hour clock, so the pattern uses HH rather than hh.
              new SimpleDateFormat("EEE MMM d HH:mm:ss z yyyy", locale).parse(date);
              System.out.println(String.format(Locale.ENGLISH, "OK parsing '%s' in locale '%s'", date, locale));
            } catch (ParseException pe) {
              System.out.println(String.format(Locale.ENGLISH, "ERROR parsing '%s' in locale '%s': %s", date, locale, pe.toString()));
            }
          }
          
          public static void main(String[] args) {
            for (String id : TimeZone.getAvailableIDs()) {
              System.out.println("Testing time zone: " + id);
              TimeZone.setDefault(TimeZone.getTimeZone(id));
              
              // some date today:
              String date1 = new Date(1440358930504L).toString();
              testParse(Locale.ENGLISH, date1);
              testParse(Locale.US, date1);
              testParse(Locale.ROOT, date1);
              // half a year back to hit DST difference:
              String date2 = new Date(1440358930504L - 86400000L * 180).toString();
              testParse(Locale.ENGLISH, date2);
              testParse(Locale.US, date2);
              testParse(Locale.ROOT, date2);
            }
          } 
           
        }
        

        With Java 8 this passes, with Java 9 build 78 it fails for several timezones. The funny thing is: SimpleDateFormat is not even able to parse "UTC" - LOL.

        Could you pass this to the issue after reopening? It’s a good test!
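
A variant of the test above that also compares the parsed value, instead of only checking for a ParseException, might look like the sketch below. It rests on several assumptions: the class RoundTripCheck is hypothetical, it formats with SimpleDateFormat in Locale.US rather than calling Date#toString(), and it avoids changing the default time zone. Ambiguous zone abbreviations may be reported as failures even on a healthy JDK, so treat the output as a hint rather than a verdict.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper class, used for illustration only.
final class RoundTripCheck {

  private static final String PATTERN = "EEE MMM d HH:mm:ss z yyyy";

  // Formats the instant in the given zone with English names, parses the text
  // back with parseLocale, and reports whether the same second is recovered.
  static boolean roundTrips(String zoneId, Locale parseLocale, long epochMillis) {
    SimpleDateFormat formatter = new SimpleDateFormat(PATTERN, Locale.US);
    formatter.setTimeZone(TimeZone.getTimeZone(zoneId));
    String text = formatter.format(new Date(epochMillis));
    try {
      Date parsed = new SimpleDateFormat(PATTERN, parseLocale).parse(text);
      // The pattern has no milliseconds, so compare at second precision.
      return parsed.getTime() / 1000 == epochMillis / 1000;
    } catch (ParseException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    long instant = 1440358930504L; // same timestamp as in the test above
    for (String id : TimeZone.getAvailableIDs()) {
      if (!roundTrips(id, Locale.ENGLISH, instant)) {
        System.out.println("Round trip failed for time zone: " + id);
      }
    }
  }
}
```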

        Uwe Schindler added a comment -

        Specifically, this time this date failed to parse: "Sat Jun 23 02:57:58 XJT 2012"

        Uwe Schindler added a comment -

        New issue to track this problem: https://bugs.openjdk.java.net/browse/JDK-8134384

        Uwe Schindler added a comment -

        The OpenJDK bug was fixed.


          People

          • Assignee: Uwe Schindler
          • Reporter: Hoss Man
          • Votes: 0
          • Watchers: 1
