Solr / SOLR-214

Missing InputStreamReader support in the anonymous ContentStream class

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: None
    • Labels:
      None

      Description

      After SOLR-197 was applied, POSTed Japanese XML content ends up as garbled characters in the index.
      I can see the garbled characters through Luke. The issue was never seen before SOLR-197.
      The cause is that the anonymous ContentStream class in the SolrRequestParsers.parseParamsAndFillStreams() method lacks InputStreamReader support.

      Before SOLR-197, InputStreamReader was used in the XmlUpdateRequestHandler.handleRequestBody() method:

      // Cycle through each stream
      for( ContentStream stream : req.getContentStreams() ) {
        String charset = getCharsetFromContentType( stream.getContentType() );
        Reader reader = null;
        if( charset == null ) {
          reader = new InputStreamReader( stream.getStream() );
        }
        else {
          reader = new InputStreamReader( stream.getStream(), charset );
        }
        rsp.add( "update", this.update( reader ) );

        // Make sure its closed
        try {
          reader.close();
        }
        catch( Exception ex ){}
      }

      The patch applies the same charset handling to SolrRequestParsers.
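
For illustration only, the charset-aware pattern from the snippet above can be exercised outside Solr. This is a minimal sketch, not the attached patch: the class name CharsetAwareReader, the decode() method, and the body of getCharsetFromContentType() are assumptions, since the original helper's implementation is not shown in this issue.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetAwareReader {

    // Hypothetical stand-in for the helper named in the snippet above:
    // extracts the charset parameter from a Content-Type value such as
    // "text/xml; charset=UTF-8". Returns null when no charset is present.
    static String getCharsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Decodes a request body using the charset declared in the Content-Type,
    // falling back to the platform default when none is declared.
    static String decode(byte[] body, String contentType) {
        String charset = getCharsetFromContentType(contentType);
        try (Reader reader = (charset == null)
                ? new InputStreamReader(new ByteArrayInputStream(body))
                : new InputStreamReader(new ByteArrayInputStream(body), charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u65e5\u672c\u8a9e"; // three Japanese characters
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        String decoded = decode(body, "text/xml; charset=UTF-8");
        System.out.println(decoded.equals(text));
    }
}
```

When the Content-Type carries no charset, this sketch falls back to the platform default, which matches the single-argument InputStreamReader branch in the original loop.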

      regards,

          Activity

          Koji Sekiguchi added a comment -

          Patch attached.

          Ryan McKinley added a comment -

          Weird - the javadocs are pretty explicit that request.getReader() should take care of the character encoding:
          http://java.sun.com/javaee/5/docs/api/javax/servlet/ServletRequest.html#getReader()

          What app server are you running?

          Does this happen when you are using /update from the servlet (when /update is not mapped in solrconfig.xml)?

          SolrUpdateServlet.java has always used getReader().

          Ken Krugler added a comment -

          There's some complex interplay of the content-type in the request, the charset (if any) in the request, and the container being used. So some interesting questions are:

          1. exactly how the content is being posted (e.g. via the example script?)
          2. what request header values are being sent along with the post.
          3. what servlet container (and version) is being used.
          Toru Matsuzawa added a comment -

          This problem can be reproduced with Tomcat 5.5.23.

          It already occurred with "/update" before the SOLR-197 change.
          The reader returned by stream.getReader() is an org.apache.catalina.connector.CoyoteReader.

          CoyoteReader uses org.apache.catalina.connector.InputBuffer#realReadBytes().
          realReadBytes() reads the input as raw bytes.
          As a result, the characters in the index are garbled.
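
As a hedged illustration of how this kind of garbling can arise: if UTF-8 request bytes are decoded with a single-byte charset such as ISO-8859-1 (the servlet specification's default when the request declares no encoding), each multi-byte character turns into several unrelated characters. The sketch below does not reproduce Tomcat's CoyoteReader; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GarbledDecodeDemo {

    // Encodes the text as UTF-8 (as a client would POST it), then decodes
    // the resulting bytes with the given charset.
    static String roundTrip(String text, Charset decodeWith) {
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);
        return new String(wire, decodeWith);
    }

    public static void main(String[] args) {
        String text = "\u65e5\u672c\u8a9e"; // three Japanese characters

        String right = roundTrip(text, StandardCharsets.UTF_8);
        String wrong = roundTrip(text, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(text)); // true: same charset on both sides
        System.out.println(wrong.equals(text)); // false: bytes were misinterpreted
        System.out.println(wrong.length());     // 9: each 3-byte character became 3 chars
    }
}
```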

          Koji Sekiguchi added a comment -

          > Weird - the javadocs are pretty explicit that request.getReader() should take care of the character encoding:
          > http://java.sun.com/javaee/5/docs/api/javax/servlet/ServletRequest.html#getReader()

          Good point. I simply thought the cause of this problem was the missing InputStreamReader support after SOLR-197.
          But according to the javadoc, the servlet container should take care of encoding. We are using Tomcat 5.5.23. We should look into the servlet container. Thanks.

          Koji Sekiguchi added a comment -

          Closing as invalid. The servlet container should take care of character encoding.

          Ryan McKinley added a comment -

          Without this patch, Resin balks at UTF-8 input:

          http://www.nabble.com/UTF-8-problem-with-Resin-tf3704271.html

          If Resin and Tomcat don't handle "getReader()" correctly, maybe we should handle it explicitly.

          Koji Sekiguchi added a comment -

          For now, we are looking into using a servlet filter to work around this problem.
          But if Solr handles character encoding explicitly, we will be happy with that. We are using Tomcat 5.5.23.

          Ryan McKinley added a comment -

          Added in rev 536019.

          Hoss Man added a comment -

          This bug was modified as part of a bulk update using the criteria...

          • Marked ("Resolved" or "Closed") and "Fixed"
          • Had no "Fix Version" versions
          • Was listed in the CHANGES.txt for 1.2

          The Fix Version for all 39 issues found was set to 1.2, email notification
          was suppressed to prevent excessive email.

          For a list of all the issues modified, search jira comments for this
          (hopefully) unique string: 20080415hossman2


            People

            • Assignee: Ryan McKinley
            • Reporter: Koji Sekiguchi
            • Votes: 2
            • Watchers: 1
