SOLR-2381

The included jetty server does not support UTF-8

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 3.2, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      Some background here: http://www.lucidimagination.com/search/document/6babe83bd4a98b64/which_unicode_version_is_supported_with_lucene

      Some possible solutions:

      • wait and see if we get resolution on http://jira.codehaus.org/browse/JETTY-1340. To be honest, I am not even sure where jetty is being maintained (there is a separate jetty project at eclipse.org with another bugtracker, but the older releases are at codehaus).
      • include a patched version of jetty with correct utf-8, using that patch.
      • remove jetty and include a different container instead.
      1. SOLR-2381.patch
        3 kB
        Robert Muir
      2. jetty-6.1.26-patched-JETTY-1340.jar
        527 kB
        Robert Muir
      3. jetty-util-6.1.26-patched-JETTY-1340.jar
        173 kB
        Robert Muir
      4. SOLR-ServletOutputWriter.patch
        2 kB
        Uwe Schindler
      5. SOLR-2381_xmltest.patch
        4 kB
        Robert Muir
      6. post_utf8enhanced.sh
        1 kB
        Bernd Fehling
      7. utf8enhanced.xml
        4 kB
        Bernd Fehling
      8. jetty-6.1.26-patched-SOLR-2381.jar
        528 kB
        Bernd Fehling
      9. jetty-util-6.1.26-patched-SOLR-2381.jar
        173 kB
        Bernd Fehling
      10. SOLR-ServletOutputWriter.patch
        2 kB
        Uwe Schindler
      11. SOLR-2381_take2.patch
        6 kB
        Robert Muir
      12. jetty-6.1.26-patched-JETTY-1340.jar
        528 kB
        Robert Muir
      13. jetty-util-6.1.26-patched-JETTY-1340.jar
        173 kB
        Robert Muir
      14. SOLR-2381-3.x+3.1.patch
        14 kB
        Uwe Schindler

          Activity

          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          uygar bayar added a comment - - edited

          Hi, I use the 3.x trunk. I insert documents with PECL PHP.

          -rw-r--r-- 1 nutch nutch 540234 Mar 17 12:37 jetty-6.1.26-patched-JETTY-1340.jar
          -rw-r--r-- 1 nutch nutch 11358 Mar 17 12:37 jetty-LICENSE.txt
          -rw-r--r-- 1 nutch nutch 1621 Mar 17 12:37 jetty-NOTICE.txt
          -rw-r--r-- 1 nutch nutch 177393 Mar 17 12:37 jetty-util-6.1.26-patched-JETTY-1340.jar

          SEVERE: org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0x63 (at char #334, byte #127)
          at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
          at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
          at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
          at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
          at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
          at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
          at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
          at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
          at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
          at org.mortbay.jetty.Server.handle(Server.java:326)
          at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
          at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
          at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
          at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
          at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
          at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
          at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
          Caused by: com.ctc.wstx.exc.WstxIOException: Invalid UTF-8 middle byte 0x63 (at char #334, byte #127)
          at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)
          at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)
          at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:281)
          at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
          at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
          ... 22 more
          Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x63 (at char #334, byte #127)
          at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:313)
          at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:204)
          at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
          at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
          at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
          at com.ctc.wstx.sr.StreamScanner.loadMoreFromCurrent(StreamScanner.java:1046)
          at com.ctc.wstx.sr.StreamScanner.parseLocalName2(StreamScanner.java:1796)
          at com.ctc.wstx.sr.StreamScanner.parseLocalName(StreamScanner.java:1756)
          at com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:2981)
          at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2936)
          at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2848)
          at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
          ... 25 more

          Mar 18, 2011 3:13:42 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=/solr path=/update/ params={indent=on&wt=xml&version=2.2} status=400 QTime=0
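
The message "Invalid UTF-8 middle byte 0x63" means the XML parser read a multi-byte lead byte followed by 0x63 (ASCII 'c'), which is not a valid continuation byte. One way this can arise (an assumption; the report does not say how the documents were encoded) is a client sending ISO-8859-1 text labelled as UTF-8. A minimal sketch for checking a payload on the client side before posting it; the class and method names are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr or Jetty.
class Utf8Check {

    // Returns true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "\u00E9cran"; // "écran"
        // Latin-1 bytes: 0xE9 looks like a 3-byte UTF-8 lead byte, and the
        // following 'c' (0x63) is not a valid continuation byte.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("latin1 payload valid UTF-8? " + isValidUtf8(latin1));
        System.out.println("utf8 payload valid UTF-8?   " + isValidUtf8(utf8));
    }
}
```

If the check fails on the client, the problem lies in how the payload was encoded rather than in the servlet container.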

          Uwe Schindler added a comment -

          Committed 3.x revision: 1079954
          Committed 3.1 revision: 1079955

          Robert Muir added a comment -

          OK, I committed my patch to trunk: Committed revision 1079949.

          Uwe, can you take it from here for 3.x and 3.1?

          Uwe Schindler added a comment -

          Again revised (one optimization in the deprecated UpdateServlet). Sorry for multiple patch posts.

          It's now ready to commit (all branches).

          Uwe Schindler added a comment -

          Patch for 3.1 and 3.x, revised & cleaned up as described before

          Uwe Schindler added a comment -

          Here is the patch for 3.x and 3.1, which also fixes the other Servlets to use only byte streams when reading/writing. It also contains the rest of issue SOLR-2347, fixing the deprecated parts of the XML handling that use Readers (legacyUpdateRequest).

          Yonik Seeley added a comment -

          Awesome news guys... not using Jetty's writers did in fact result in performance improvements!
          This was a simple test that requested 500 docs per request (hitting all the caches to try and isolate writer performance). Performance improvements of almost 30% for XML!

          =============== trunk, using jetty's writers ==========
          wt=javabin
          qps: 1297
          50%   = 7938
          qps: 1317
          50%   = 8114
          qps: 1319
          50%   = 8395
          qps: 1349
          50%   = 8160
          qps: 1293
          50%   = 8922
          wt=xml
          qps: 634
          50%   = 21983
          qps: 713
          50%   = 22138
          qps: 718
          50%   = 21594
          qps: 717
          50%   = 20935
          qps: 741
          50%   = 20546
          wt=json
          qps: 945
          50%   = 15500
          qps: 938
          50%   = 16812
          qps: 921
          50%   = 15467
          qps: 930
          50%   = 15337
          qps: 932
          50%   = 15447
          wt=python
          qps: 1024
          50%   = 12975
          qps: 1046
          50%   = 12883
          qps: 996
          50%   = 14033
          qps: 988
          50%   = 14295
          qps: 1013
          50%   = 13206
          wt=ruby
          qps: 893
          50%   = 18897
          qps: 878
          50%   = 18943
          qps: 871
          50%   = 18413
          qps: 857
          50%   = 19190
          qps: 902
          50%   = 18554
          ========= trunk with SOLR-2381_take2.patch (not using jetty's writers) ===========
          wt=javabin
          qps: 1315
          50%   = 7884
          qps: 1285
          50%   = 8946
          qps: 1280
          50%   = 8083
          qps: 1340
          50%   = 7899
          qps: 1310
          50%   = 7872
          wt=xml
          qps: 773
          50%   = 16006
          qps: 938
          50%   = 14316
          qps: 946
          50%   = 15709
          qps: 956
          50%   = 14735
          qps: 950
          50%   = 14825
          wt=json
          qps: 1127
          50%   = 10168
          qps: 1104
          50%   = 11147
          qps: 1166
          50%   = 10691
          qps: 1100
          50%   = 10654
          qps: 1138
          50%   = 10437
          wt=python
          qps: 1004
          50%   = 12502
          qps: 1033
          50%   = 13525
          qps: 1007
          50%   = 13762
          qps: 1043
          50%   = 11854
          qps: 985
          50%   = 13289
          wt=ruby
          qps: 1164
          50%   = 9457
          qps: 1175
          50%   = 9994
          qps: 1212
          50%   = 9437
          qps: 1203
          50%   = 9756
          qps: 1197
          50%   = 10640
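
The "almost 30%" figure for XML can be checked against the qps numbers listed above. A small sketch (class and method names are hypothetical) that averages the five runs per configuration and prints the relative change in mean qps:

```java
import java.util.Locale;

// Hypothetical helper for checking the benchmark arithmetic.
class QpsImprovement {

    static double mean(int[] values) {
        double sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Relative change of the mean, in percent (positive means faster).
    static double improvementPercent(int[] before, int[] after) {
        return (mean(after) / mean(before) - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // wt=xml qps values copied from the two runs above.
        int[] xmlJettyWriters = {634, 713, 718, 717, 741};
        int[] xmlPatched = {773, 938, 946, 956, 950};
        System.out.println(String.format(Locale.ROOT, "wt=xml: %.1f%% more qps",
                improvementPercent(xmlJettyWriters, xmlPatched)));
    }
}
```

With these inputs the mean goes from 704.6 to 912.6 qps, which works out to roughly 29.5%.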
          
          Robert Muir added a comment -

          Here is _take2.patch:

          1. I took Bernd's update to JETTY-1340, retested and rebuilt Jetty. Things look good from this perspective.
          2. I then added my random test, and things look fine with the new Jetty.
          3. Finally I incorporated Uwe's patch as well.

          I think this is the best solution, much safer and with a lot better tests.

          Bernd Fehling added a comment -

          Jetty patch will be uploaded to http://jira.codehaus.org/browse/JETTY-1340.

          I'm also installing Uwe's patch and will try to "stay away" from XML for Java.

          Robert Muir added a comment -

          Thanks Uwe: my test passes with your patch.

          To summarize, this is what I think we should do, once we get Bernd's patch:

          1. we should commit the random test (SOLR-2381_xmltest.patch)
          2. rebuild/test jetty with Bernd's modifications, and commit that if everything is ok.
          3. we should commit Uwe's patch for extra safety and improved performance.

          Uwe Schindler added a comment -

          Robert and I discussed the Jetty OutputWriter and found out:

          • It's much more broken, as it does not even support writing half surrogates in write(char[], offset, length), which may also fail for other ResponseWriters!!!
          • Jetty's implementation is SLOOOOOOOOOOOW!

          The attached patch now uses no Writer supplied by Jetty or any other servlet container at all - it just handles HTTP as it is: a binary protocol using byte streams. As for UpdateReqHandler, it uses its own mapper inside Solr (on the input side, ContentStream is used for that).

          As most output in Solr is done using UTF-8 (the default), it uses a pre-looked-up NIO Charset for that.
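
A minimal sketch of the approach described above, not the actual patch: wrap the raw byte stream in a JDK Writer with a fixed Charset instead of asking the container for a Writer. The class name and the `writeResponse` helper are hypothetical, a `ByteArrayOutputStream` stands in for the servlet response's output stream, and `StandardCharsets` (Java 7+) is used for brevity; on older JVMs `Charset.forName("UTF-8")` serves the same purpose.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Solr's implementation.
class ByteStreamResponseDemo {

    // Looked up once and reused for every response.
    static final Charset UTF8 = StandardCharsets.UTF_8;

    // Writes text parts to a raw byte stream through the JDK's own encoder.
    static void writeResponse(OutputStream rawOut, String... parts) {
        try {
            Writer writer = new OutputStreamWriter(rawOut, UTF8);
            for (String part : parts) {
                writer.write(part);
            }
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // U+1D11E is a surrogate pair in UTF-16; here the two halves arrive
        // in separate write calls, as can happen with buffered writers.
        writeResponse(out, "<str>\uD834", "\uDD1E</str>");
        String decoded = new String(out.toByteArray(), UTF8);
        System.out.println("round trip ok: " + decoded.equals("<str>\uD834\uDD1E</str>"));
    }
}
```

The design point is that the encoding step lives in code the application controls and can test, so correctness no longer depends on the container's Writer.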

          Robert Muir added a comment -

          Bernd, I didn't test your jars, but can you update the patch on http://jira.codehaus.org/browse/JETTY-1340 with your fixes?

          As an open source project, we can't just commit the binary jars.

          I did however, test Uwe's patch. I think we should fix this bug in jetty, but I think we should also use Uwe's patch (my random test passes always with his patch).

          This Jetty writer is hardly fast. I think it makes sense to try to bypass this "optimization" in Jetty, which only causes bugs and likely makes things slower (for example lots of conditionals and state-keeping, Character.isLowSurrogate on every char, and handling silly 6-byte UTF-8 cases which do not exist).

          It's also a good safety net; I don't trust these servlet containers to do this correctly.

          Bernd Fehling added a comment - - edited

          And here it is, the fixed jetty.
          jetty-6.1.26-patched-SOLR-2381.jar
          jetty-util-6.1.26-patched-SOLR-2381.jar

          Please test it and give your feedback.
          At least my problems are gone.

          Thanks for your patience and help.

          Uwe Schindler added a comment -

          Hi Bernd,
          we know where the problem in Jetty is (they buffer 512 chars without respecting surrogates). When they then convert those buffered chars to UTF-8, it's broken at the boundaries. This bug in Jetty may also affect JSON output, but JSON is much more compact and may not easily hit this buffer issue (as it does not use Strings to feed the writer; the broken method in Jetty is the one handling Writer.write(String,...)).
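
A minimal sketch of the boundary problem described above. This is not Jetty's code: the class and method names are hypothetical, and a tiny chunk size stands in for the 512-char buffer. It encodes a string chunk by chunk, once ignoring surrogate pairs and once keeping each pair inside a single chunk.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class illustrating the failure mode, not Jetty's writer.
class SurrogateChunkDemo {

    // Naive: encode fixed-size char chunks independently. A surrogate pair
    // straddling a boundary becomes two lone surrogates, and String.getBytes
    // replaces each of them with '?'.
    static byte[] encodeNaive(String s, int chunkSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i += chunkSize) {
            int end = Math.min(s.length(), i + chunkSize);
            byte[] b = s.substring(i, end).getBytes(StandardCharsets.UTF_8);
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    // Surrogate-aware: if a chunk would end on a high surrogate, extend it
    // by one char so the pair is encoded together.
    static byte[] encodeSafe(String s, int chunkSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            int end = Math.min(s.length(), i + chunkSize);
            if (end < s.length()
                    && Character.isHighSurrogate(s.charAt(end - 1))
                    && Character.isLowSurrogate(s.charAt(end))) {
                end++;
            }
            byte[] b = s.substring(i, end).getBytes(StandardCharsets.UTF_8);
            out.write(b, 0, b.length);
            i = end;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // U+1D11E lies above the BMP, so it occupies two chars in a Java String.
        String s = "ab" + new String(Character.toChars(0x1D11E)) + "cd";
        byte[] expected = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("expected bytes: " + expected.length);
        System.out.println("naive bytes:    " + encodeNaive(s, 3).length);
        System.out.println("safe bytes:     " + encodeSafe(s, 3).length);
    }
}
```

With a chunk size of 3, the naive version splits the pair after its high surrogate and emits fewer bytes than the correct encoding, while the surrogate-aware version matches it.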

          In general we are discussing to not use Readers and Writers supplied by the Servlet Container. As HTTP is a byte-based protocol, code should only use InputStreams and OutputStreams to communicate with the client. Writers and Readers are only provided for convenience with JSP engines.

          The input part of Solr no longer uses Readers; it always passes InputStreams around. I uploaded a patch a week ago to do the same on the output side of Solr: SOLR-ServletOutputWriter.patch

          Please note: As JSP pages use Jetty's writers, analysis.jsp may still produce corrupt output.

          Can you patch your Solr with that one? Then your problems should disappear for all OutputHandler-generated content except JSP pages in Solr. We are thinking about optimizing this internally, but the above patch removes all use of Jetty's writers from Solr. The patch is against trunk as far as I know.

          Robert Muir added a comment -

          "Quickest fix would be to use the working code snippet from jetty-7.3.1 and replace the buggy jetty-6.1.26-patched-JETTY-1340."

          There's nothing quick about fixing bugs in Jetty at all: for example, the 6.1 branch's unit test suite does not even reliably pass out of the box, making it difficult to test changes.

          I'm certainly going to fix it, but it's probably going to take a day of my time to ensure that it's done safely (just like it took a day for me to fix the previous Jetty bug on this issue).

          "Unfortunately, as we are switching from FAST System to Solr, all our Interfaces are using XML.
          We never had any problems with FAST, XML and UTF-8.
          It would be a mess reworking everything to JSON just for Solr."

          I'm not really concerned at all with what FAST does or doesn't do.

          I still stand by my statement that I strongly recommend against the use of XML (in general, nothing to do with Jetty) if you need correct Unicode support and are using Java-based components. This is just my practical advice based on building applications that have to work with all of Unicode.

          You won't be reworking just for Solr; it's pretty likely that as your system grows you will run into other Unicode bugs in Java-based XML libraries, too.

          Bernd Fehling added a comment -

          I just debugged jetty-6.1.26-patched-JETTY-1340 and located the bug.
          As I already said above, it is due to the 512-byte output buffer and the surrogates.
          If the buffer is filled up to 510 bytes and the next character is a UTF-8 character above the BMP (greater than 2 bytes), then Jetty is in trouble.
          Quickest fix would be to use the working code snippet from jetty-7.3.1 and replace the buggy jetty-6.1.26-patched-JETTY-1340.

          Unfortunately, as we are switching from FAST System to Solr, all our Interfaces are using XML.
          We never had any problems with FAST, XML and UTF-8.
          It would be a mess reworking everything to JSON just for Solr.

          Robert Muir added a comment -

          Bernd, according to my tests, Solr is fine except when using XML.

          In general I recommend STRONGLY AGAINST using xml technologies if you need good unicode support. We can try to fix this bug, but I suspect you will only encounter more in your application if you decide to go with xml, it will be a long, tough, battle.

          Bernd Fehling added a comment -

          I couldn't reproduce the error with JSON, only with the XML response.
          But you never know...

          Uwe Schindler added a comment -

          "additionally it only fails with the XML response format (the default binary is fine). the test chooses different formats for each iteration."

          With binary it cannot fail, as no Writers from Jetty are in use. What other formats (text-based using Writers) are used and work? Or is e.g. JSON also failing?

          Because all text formats use a Writer supplied by Jetty.

          Bernd Fehling added a comment -

          I have created a test with a longer text from the Apache Solr web page. The text is entirely converted to UTF-8 characters above the BMP.
          It should be readable in your browser.
          To load it, copy post_utf8enhanced.sh and utf8enhanced.xml to your exampledocs dir and run post_utf8enhanced.sh.
          Hope this helps.

          Robert Muir added a comment -

          attached is a unit test. if you disable the 'case 4' so that it only uses 1, 2, and 3 byte codepoints, the test always passes.

          additionally it only fails with the XML response format (the default binary is fine). the test chooses different formats for each iteration.

          junit-sequential:
              [junit] Testsuite: org.apache.solr.client.solrj.embedded.SolrExampleJettyTest
              [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 3.829 sec
              [junit]
              [junit] ------------- Standard Error -----------------
              [junit] NOTE: reproduce with: ant test -Dtestcase=SolrExampleJettyTest -Dtestmethod=testUnicode -Dtests.seed=-8507816048970822444:1424998400651628841
              [junit] WARNING: test class left thread running: Thread[MultiThreadedHttpConnectionManager cleanup,5,main]
              [junit] RESOURCE LEAK: test class left 1 thread(s) running
              [junit] NOTE: test params are: codec=PreFlex, locale=es_GT, timezone=Asia/Hovd
              [junit] NOTE: all tests run in this JVM:
              [junit] [SolrExampleJettyTest]
              [junit] NOTE: Windows Vista 6.0 x86/Sun Microsystems Inc. 1.6.0_23 (32-bit)/cpus=4,threads=2,free=9760576,total=16252928
              [junit] ------------- ---------------- ---------------
              [junit] Testcase: testUnicode(org.apache.solr.client.solrj.embedded.SolrExampleJettyTest):  Caused an ERROR
              [junit] Error executing query
              [junit] org.apache.solr.client.solrj.SolrServerException: Error executing query
              [junit]     at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
              [junit]     at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:119)
              [junit]     at org.apache.solr.client.solrj.SolrExampleTests.testUnicode(SolrExampleTests.java:290)
              [junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1213)
              [junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1145)
              [junit] Caused by: org.apache.solr.common.SolrException: parsing error
              [junit]     at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:145)
              [junit]     at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:106)
              [junit]     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:478)
              [junit]     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:245)
              [junit]     at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
              [junit] Caused by: com.ctc.wstx.exc.WstxIOException: Invalid UTF-8 character 0xdf05(a surrogate character)  at char #2475, byte #127)
              [junit]     at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)
              [junit]     at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)
              [junit]     at org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:218)
              [junit]     at org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:244)
              [junit]     at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:130)
              [junit] Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xdf05(a surrogate character)  at char #2475, byte #127)
              [junit]     at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
              [junit]     at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:247)
              [junit]     at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
              [junit]     at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
              [junit]     at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
              [junit]     at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
              [junit]     at com.ctc.wstx.sr.StreamScanner.getNext(StreamScanner.java:763)
              [junit]     at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2721)
              [junit]     at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
              [junit]
              [junit]
          
          Robert Muir added a comment -

          reopening, as I produced a unit test that finds more problems along the lines of what Bernd reported (possibly in Jetty, possibly in Solr; I don't know yet)

          Robert Muir added a comment -

          Bernd, you can download 6.1.26 then apply the patch from http://jira.codehaus.org/browse/JETTY-1340

          If you can produce any kind of test case showing there is a bug with 6.1.26-patched, then we can reopen this issue and try to fix it (to me it's a blocker if UTF-8 is not supported).

          Without a test case it's going to be slow-going though... we really need a Solr test for any problems you are encountering.

          Bernd Fehling added a comment -

          I first tested the patched version, jetty-6.1.26-patched, and still have the bug.
          Then I used jetty-7.3.0 and got the same bug.
          Then I debugged jetty-7.3.0, located the bug, and saw that it is fixed in jetty-7.3.1.
          Now I need the sources of the patched jetty-6.1.26 to see why there is still a bug
          and fix that one as well.
          Or, if you know where to look, I'll leave it to you, no problem.

          Maybe you have contact with the Jetty developers, and they would be willing to fix this for jetty-6.1.26 as well
          and release a jetty-6.1.27?

          Robert Muir added a comment -

          If I find a svn with jetty-6.1.26 sources I will look into that one also.

          But can you test the patched version of Jetty we have here? This is more useful because it's the version we include (it's the only one we worry about).

          Bernd Fehling added a comment -

          Robert, unfortunately I wasn't able to build a reproducible test, so I decided to debug it on my server.
          The bug is in Jetty and has been fixed in jetty-7.3.1.v20110307.
          Because I started debugging over the weekend, I used the older jetty-7.3.0, which still contains the bug. I located the bug
          and realized today that it had just been fixed in the new version released yesterday.

          Nevertheless here is the description because I went through all the bits and bytes.
          In jetty-7 there is jetty-server with org.eclipse.jetty.server.HttpWriter.java.
          That is the output writer, which extends Writer and does the UTF-8 encoding.
          The buffer arrives with a size of 8192 bytes and is chunked and encoded by HttpWriter in pieces of 512 bytes.
          In Java the text is UTF-16, and each char is read as an integer. If the code point is above the BMP, it has a surrogate,
          which is read first, followed by the next integer.
          Example: 55349 (dec) and 56320 (dec) are converted to 119808 (dec), which is U+1D400.
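
          The arithmetic in this example can be checked with a few lines of Java. This is only a minimal sketch: the class name SurrogateMath and the method combine are made up for illustration, and the only library call used is the standard Character.toCodePoint.

```java
public class SurrogateMath {
    /** Combines a UTF-16 surrogate pair into one code point, spelled out by hand. */
    public static int combine(char high, char low) {
        // 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char high = (char) 55349; // 0xD835, a high surrogate
        char low = (char) 56320;  // 0xDC00, a low surrogate
        int cp = combine(high, low);
        // Character.toCodePoint is the JDK's own version of combine()
        System.out.println(cp + " = U+" + Integer.toHexString(cp).toUpperCase()
                + ", JDK agrees: " + (cp == Character.toCodePoint(high, low)));
    }
}
```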

          Remember that the buffer is 512 bytes in size. But what if the counter is at 510 and a Unicode character above
          the BMP comes up? The solution is to write the current buffer to the output, reset it, and start over with an empty
          buffer. And here is/was the bug.
          The "surrogate reminder" was cleared too early, in the wrong place, and got lost.
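
          To make the boundary case concrete, here is a hypothetical sketch of a UTF-8 encoder that works through a small fixed buffer. It is not Jetty's HttpWriter; the class ChunkedUtf8Sketch and all of its members are invented for illustration. The point it tries to show is that the remembered high surrogate is cleared only when it is consumed, never as a side effect of flushing the buffer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch, not Jetty code: UTF-8 encoding through a small fixed buffer. */
public class ChunkedUtf8Sketch {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private final byte[] buf;
    private int count = 0;
    private char pendingHigh = 0; // remembered high surrogate; 0 means none

    public ChunkedUtf8Sketch(int bufSize) {
        if (bufSize < 4) throw new IllegalArgumentException("buffer must hold one 4-byte sequence");
        buf = new byte[bufSize];
    }

    public void write(char[] chars, int off, int len) {
        for (int i = off; i < off + len; i++) {
            char c = chars[i];
            if (pendingHigh != 0) {
                char high = pendingHigh;
                pendingHigh = 0; // cleared only here, once the surrogate is consumed
                if (Character.isLowSurrogate(c)) {
                    encode(Character.toCodePoint(high, c));
                    continue;
                }
                encode('?'); // unpaired high surrogate
            }
            if (Character.isHighSurrogate(c)) {
                pendingHigh = c; // survives buffer flushes and separate write() calls
            } else if (Character.isLowSurrogate(c)) {
                encode('?'); // unpaired low surrogate
            } else {
                encode(c);
            }
        }
    }

    private void encode(int cp) {
        int needed = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (count + needed > buf.length) flush(); // e.g. count == 510, needed == 4
        if (needed == 1) {
            buf[count++] = (byte) cp;
            return;
        }
        int shift = 6 * (needed - 1);
        int marker = needed == 2 ? 0xC0 : needed == 3 ? 0xE0 : 0xF0;
        buf[count++] = (byte) (marker | (cp >> shift)); // lead byte
        for (shift -= 6; shift >= 0; shift -= 6) {
            buf[count++] = (byte) (0x80 | ((cp >> shift) & 0x3F)); // continuation bytes
        }
    }

    /** Writes the buffer out; deliberately leaves pendingHigh untouched. */
    private void flush() {
        out.write(buf, 0, count);
        count = 0;
    }

    public byte[] toByteArray() {
        if (pendingHigh != 0) { // input ended on an unpaired high surrogate
            pendingHigh = 0;
            encode('?');
        }
        flush();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 510; i++) sb.append('a');
        sb.appendCodePoint(0x1D400).append('b'); // 4-byte sequence lands at counter 510
        char[] chars = sb.toString().toCharArray();
        ChunkedUtf8Sketch w = new ChunkedUtf8Sketch(512);
        w.write(chars, 0, chars.length);
        byte[] expected = sb.toString().getBytes(StandardCharsets.UTF_8);
        System.out.println("matches JDK encoder: " + Arrays.equals(w.toByteArray(), expected));
    }
}
```

          Clearing pendingHigh inside flush() instead would lose the high surrogate whenever a flush falls between the two halves of a pair, which is the kind of loss described above.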

          If I find a svn with jetty-6.1.26 sources I will look into that one also.
          Otherwise, use jetty-7.3.1.v20110307, which is fixed.

          Maybe we should set up an XML page for testing that has more than 512 characters of UTF-8 code
          above the BMP in a row?

          Bernd Fehling added a comment -

          Yes, I'm trying to build a reproducible test.

          Robert Muir added a comment -

          Bernd, is it possible for you to produce some sort of reproducible test that demonstrates the problem?

          Then we could try to track down wherever the problem is with xml.

          Because when I use solr with unicode outside the BMP (the example test documents), I'm not seeing this issue... but there could be some boundary-related problem in jetty or solr. SOLR-1489 was an example of this.

          Bernd Fehling added a comment -

          Well, just tested with jetty7 (jetty-hightide-7.3.0.v20110203.tar.gz)
          from http://dist.codehaus.org/jetty/jetty-hightide-7.3.0/

          • same problem with jetty 7 (broken UTF-8 with XML output)
          • when a UTF-8 code gets mangled under jetty 7, it happens at the same position and produces the same
            broken byte sequence as with jetty 6
          • jetty 7 always sends the result without chunking it and always sets "Content-Length: xxxx".

          This leads me to the conclusion that either jetty 7 is also buggy or it is still a solr problem.
          What do you think?

          Uwe Schindler added a comment -

          Bernd, two things:

          • use the patched jetty from the issue and also use it in Eclipse
          • your comment explains why JSON writer works, because JSON is much more compact and so it was not chunked in your tests.

          This is all a Jetty problem, because inside Solr there is really no difference between XML and JSON output: both are written in UTF-8 using the Writer supplied by the underlying servlet container.

          Bernd Fehling added a comment -

          Hmmm, disregard my last.
          After loading real data with code points above the BMP, I get a reproducible error with destroyed UTF-8 code via XML...
          No error with wt=json.
          So is this a Jetty problem or a Solr problem?

          Another strange thing to mention: the error occurs only if the server can send all data at once.
          There is a "Content-Length: xxxx" in the server header.

          The error does not occur if the server chunks its reply (sends it in multiple parts).
          The server header then has "Transfer-Encoding: chunked" and no "Content-Length: xxxx".

          I have Solr under Eclipse, but my runjettyrun is only jetty version 6.1.6.
          I have to get the jetty 6.1.26 source.

          Uwe Schindler added a comment -

          Ok, thanks for reporting back. So there was maybe a problem in the past with XMLWriter, which is solved with Lucene trunk. Can you also check branch_3x (Lucene 3.1), because this is the next release and trunk (Lucene 4.0) is very unstable.

          Bernd Fehling added a comment -

          Sorry for the delay. I have checked the latest trunk version with wt=json, wt=xml and wt=velocity.
          It works fine now, thanks a lot.

          William Bell added a comment -

          I vote for Jetty 7. I have been using it with SOLR since v1.4.0

          Bernd Fehling added a comment -

          This already looks much better.
          First tests show that with DIH, Unicode above the BMP gets stored correctly in a string index field.
          If displayed with wt=json, it is correct Unicode.
          If displayed with wt=xml, it is invalid Unicode.

          Example (Mathematical sans-serif capital S):
          loaded unicode with DIH - U+1D5B2 (F0 9D 96 B2)
          displayed with wt=json - U+1D5B2 (F0 9D 96 B2)
          displayed with wt=xml - ??????? (ED A0 B5 ED B6 B2)

          This was logged with wireshark directly from the network.

          Open question:

          • is the xml output a jetty problem or XMLwriter from Lucene/Solr?
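
          The two byte patterns in the example can be reproduced in plain Java. This is a sketch for illustration only: the class SurrogateBytes and its methods are made up, and it does not establish which component produced the bad bytes. It only shows that ED A0 B5 ED B6 B2 is what results when each UTF-16 surrogate is encoded on its own as a three-byte sequence, instead of the pair being combined first.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateBytes {
    /** Correct UTF-8: the JDK encoder combines surrogate pairs before encoding. */
    public static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Encodes every UTF-16 char separately as a three-byte sequence.
     * Only meaningful for chars >= 0x800, which covers all surrogates.
     */
    public static String perCharHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            byte[] b = {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
            if (sb.length() > 0) sb.append(' ');
            sb.append(hex(b));
        }
        return sb.toString();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D5B2, Mathematical sans-serif capital S
        String s = new String(Character.toChars(0x1D5B2));
        System.out.println("UTF-8:         " + utf8Hex(s));
        System.out.println("per surrogate: " + perCharHex(s));
    }
}
```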
          Uwe Schindler added a comment -

          I just want to attach the patch that removes the usage of ServletResponse.getWriter() and uses our own Writer on the ServletOutputStream.

          With a fixed Jetty, that's not needed, but it may become interesting if other servlet containers are broken, too.

          Just to note: on the input side (reading POST requests) we have not used ServletRequest.getReader() for a long time...
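
          The idea can be sketched without a servlet container. The code below is not the attached patch; the class OwnWriterSketch and its methods are hypothetical, and a ByteArrayOutputStream stands in for the ServletOutputStream. In a real servlet the stream would presumably come from ServletResponse.getOutputStream(), with the charset also declared in the content type.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class OwnWriterSketch {
    /** Wraps a raw byte stream in a UTF-8 Writer, so the JDK encoder does the conversion. */
    public static Writer utf8Writer(OutputStream out) {
        return new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }

    /** Writes text through utf8Writer and returns the bytes that reached the stream. */
    public static byte[] render(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for ServletOutputStream
        try (Writer w = utf8Writer(sink)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "<str>" + new String(Character.toChars(0x1D5B2)) + "</str>";
        StringBuilder hex = new StringBuilder();
        for (byte b : render(text)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
    }
}
```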

          Robert Muir added a comment -

          Just for reference, I took a look at the jetty 7 branch (from http://download.eclipse.org/jetty/stable-7/dist/)

          It appears the code was fixed to try to cover these cases... I'm not saying it actually works; I didn't test it, and that would be a scarier, larger change.

          Robert Muir added a comment -

          Committed revision 1074726, 1074742 (branch_3x)

          Yonik Seeley added a comment -

          Seems to work fine, +1 to commit!

          I added an output test to test_utf8.sh, but the char is directly included in the script and it's probably not that robust (but it is just a test). I tried using "od" first, but that took different args in OS X and ubuntu.

          Robert Muir added a comment -

          attached is a patch file for trunk (just adding a tiny test so we stand a chance of knowing if this somehow breaks again).

          included are the patched jar files; I applied the patch to the release version of 6.1.26.

          the test_utf8.sh is now correct by default; additionally, the manual test I did before on the mailing list works.

          I tested this against branch_3x too.

          Yonik Seeley added a comment -

          include a patched version of jetty with correct utf-8, using that patch.

          +1

          There's no telling when the next point release of jetty is going to come out.

          Karl Wright added a comment -

          The resolution of this issue is of interest to me, since ManifoldCF uses the same jetty container currently used by Solr.


            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir
            • Votes:
              0
              Watchers:
              2