SOLR-412

XsltWriter does not output UTF-8 by default

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.2
    • Fix Version/s: None
    • Component/s: Response Writers
    • Labels:
      None
    • Environment:

      Tomcat 5.5
      Linux Red Hat ES4 (2.6.9-5.ELsmp from 'uname -a')

      Description

      XsltWriter outputs XML text in ISO-8859-1 encoding by default.

      Tomcat 5.5 has URIEncoding="UTF-8" set in the <Connector> element as described in the Wiki.

      This output declaration in the stylesheet:

      <xsl:output method="xml" encoding="utf-8" />

      gives output with this header:

      HTTP/1.1 200 OK
      Server: Apache-Coyote/1.1
      Content-Type: text/xml;charset=ISO-8859-1
      Transfer-Encoding: chunked
      Date: Wed, 14 Nov 2007 17:49:11 GMT

      I had to change the <xsl:output> directive to this:

      <xsl:output media-type="text/xml; charset=UTF-8" encoding="UTF-8"/>

      This is the root cause of SOLR-233.

      1. diff-2009-10-22
        3 kB
        Age Jan Kuperus

          Activity

          Lance Norskog added a comment -

          SOLR-233 was repaired with a band-aid. This bug describes the root cause of the problem.

          Hoss Man added a comment -

          I'm confused as to what the fix here would be... what do you think Solr should do instead of the current behavior? The XSLTResponseWriter takes the media-type and uses it as the Content-Type ... Tomcat decides that since the Content-Type doesn't have a charset, it will add one (its default, which I'm assuming can be configured in the Tomcat configs).

          ...what would you suggest as an improvement?

          (I agree UTF-8 should be the Solr default as much as possible ... but the point of the XSLTResponseWriter is to give the XSLT creator total control over the content-type ... doing anything that might circumvent their intentions seems like a bad idea.)

          Lance Norskog added a comment -

          I am not an XSL expert. From what I can tell, the XSLT
          documentation says that these two forms:
          <xsl:output method="xml" encoding="utf-8" />
          <xsl:output media-type="text/xml; charset=UTF-8" />
          are equivalent. It seems like both should create XML
          encoded in UTF-8, and both should create the same
          Content-Type header line. My bug report is that the
          media-type form works, but that the method="xml" form
          does not.

          I would not be surprised to learn that the
          method="xml" form does not do what it looks like; at
          this point I have no respect for the XSLT language.
          Thank you for your time and attention to my humble
          complaint.

          Lance

          Hoss Man added a comment -

          based on my reading of: http://www.w3.org/TR/xslt20/#element-output

          the "method" attribute exists solely to instruct the transformer how to generate the output ... it appears to exist largely to support hacks for html but also to support plain text output.

          "encoding" dictates the actual character encoding used in the output stream.

          "media-type" is ... the media-type, which if unspecified defaults to "text/xml" if method="xml", or to "text/html" or "text/plain" for the corresponding methods ... but the default media-type does not ever seem to be influenced by the "encoding" attribute.

          I'm not convinced there isn't something Solr can do to handle this situation better, i just don't know what it is.
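
To make the separation between these attributes concrete, here is a minimal Java sketch that compiles a stylesheet with the standard javax.xml.transform API and reads back the three output properties discussed above. The class and method names (OutputPropertiesSketch, outputProperties, describe) are made up for illustration, and this is not a claim about how Solr's XSLTResponseWriter is implemented. The defaults reported for unset properties depend on the XSLT processor in use.

```java
import java.io.StringReader;
import java.util.Properties;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

/** Illustrative only: shows that method, encoding and media-type are separate output properties. */
final class OutputPropertiesSketch {

    /** The stylesheet form from the bug report: encoding is set, media-type is not. */
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" encoding=\"utf-8\"/>"
      + "<xsl:template match=\"/\"><out/></xsl:template>"
      + "</xsl:stylesheet>";

    private OutputPropertiesSketch() {}

    /** Compiles the stylesheet and returns its xsl:output properties (with processor defaults). */
    static Properties outputProperties(String xsl) {
        try {
            Templates templates = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xsl)));
            return templates.getOutputProperties();
        } catch (TransformerConfigurationException e) {
            // Rethrown unchecked to keep the sketch easy to call.
            throw new IllegalStateException("stylesheet did not compile", e);
        }
    }

    /** Formats the three properties under discussion for printing. */
    static String describe(Properties props) {
        return "method=" + props.getProperty(OutputKeys.METHOD)
            + ", encoding=" + props.getProperty(OutputKeys.ENCODING)
            + ", media-type=" + props.getProperty(OutputKeys.MEDIA_TYPE);
    }
}
```

Printing describe(outputProperties(STYLESHEET)) shows the encoding declared in the stylesheet next to whatever media-type the processor reports, so a caller that copies only the media-type into the Content-Type header never sees the encoding.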

          Age Jan Kuperus added a comment - - edited

          IMHO the documentation in xslt 1.0 (http://www.w3.org/TR/xslt#output) is a bit clearer on the usage of these fields:

          "The method attribute on xsl:output identifies the overall method that should be used for outputting the result tree. The value must be a QName. If the QName does not have a prefix, then it identifies a method specified in this document and must be one of xml, html or text."

          "encoding specifies the preferred character encoding that the XSLT processor should use to encode sequences of characters as sequences of bytes; the value of the attribute should be treated case-insensitively; the value must contain only characters in the range #x21 to #x7E (i.e. printable ASCII characters); the value should either be a charset registered with the Internet Assigned Numbers Authority [IANA], [RFC2278] or start with X-"

          "media-type specifies the media type (MIME content type) of the data that results from outputting the result tree; the charset parameter should not be specified explicitly; instead, when the top-level media type is text, a charset parameter should be added according to the character encoding actually used by the output method"

          If I understand this correctly, this means the correct output specification is <xsl:output method="xml" encoding="utf-8" />, and <xsl:output media-type="text/xml; charset=UTF-8"/> should never be used.

          My suggestion would be to change XSLTResponseWriter.getContentType() in such a way that (in pseudocode):
          if encoding is null
          .. encoding = "utf-8"
          end if
          if media-type is not null
          .. /* next if is for compatibility with the workaround only */
          .. if media-type contains "charset="
          .... return media-type
          .. else
          .... return media-type + "; charset=" + encoding
          .. end if
          else
          .. if method is "html" or the first element in the final output is <html>
          .... media-type = "text/html"
          .. elseif method is "text"
          .... media-type = "text/plain"
          .. else /* it must be xml */
          .... media-type = "text/xml"
          .. end if
          .. return media-type + "; charset=" + encoding
          end if
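
The pseudocode above could be sketched in Java roughly as follows. This is a standalone illustration, not the attached patch: the class name ContentTypeSketch and the method resolve are hypothetical, the three arguments stand for the xsl:output attributes (null when unset), and the "first element is <html>" check is left out because it needs access to the transformed output. The case-insensitive charset test is an assumption added here.

```java
import java.util.Locale;

/** Illustrative sketch of the proposed Content-Type resolution; not the actual Solr code. */
final class ContentTypeSketch {

    private ContentTypeSketch() {}

    /**
     * Builds a Content-Type value from xsl:output attributes.
     * Any argument may be null when the stylesheet did not set the attribute.
     */
    static String resolve(String mediaType, String encoding, String method) {
        // Default to UTF-8 when the stylesheet declares no encoding.
        String charset = (encoding == null) ? "utf-8" : encoding;

        if (mediaType != null) {
            // Compatibility with the existing workaround: an explicit charset is left alone,
            // which also avoids emitting the charset parameter twice.
            if (mediaType.toLowerCase(Locale.ROOT).contains("charset=")) {
                return mediaType;
            }
            return mediaType + "; charset=" + charset;
        }

        // No media-type given: derive it from the output method.
        String type;
        if ("html".equalsIgnoreCase(method)) {
            type = "text/html";
        } else if ("text".equalsIgnoreCase(method)) {
            type = "text/plain";
        } else {
            type = "text/xml"; // anything else is treated as xml
        }
        return type + "; charset=" + charset;
    }
}
```

With this sketch, resolve(null, "utf-8", "xml") gives "text/xml; charset=utf-8" for the stylesheet in the description, while resolve("text/xml; charset=UTF-8", "UTF-8", null) returns the media-type unchanged, so stylesheets using the workaround keep their current header.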

          Age Jan Kuperus added a comment -

          Attached a patch against the 2009-10-22 daily tgz as we implemented it, which correctly handles all legal situations we tried, including the defaults.

          This patch does not explicitly handle two corner cases (this is documented in the patch), which could lead to less expected results (I can't test that here):

          1) html documents without explicit <xsl:output method="html" .../> will be treated as xml. IMHO this situation should never occur as it is bad XSLT programming behaviour.

          2) the (IMHO incorrect) previous solution (<xsl:output media-type="...; charset=..." encoding="..."/>) will result in a double charset definition. Although that is incorrect, it is accepted without error by Firefox and possibly by all browsers (I did not test that). As stated before, it should not be done that way.

          Hoss Man added a comment -

          "IMHO the documentation in xslt 1.0 (http://www.w3.org/TR/xslt#output) is a bit clearer on the usage of these fields"

          I'm not sure if looking at an older specification proposal is really the right way to go here. Shouldn't the fact that all of that language was removed from the XSLT 2.0 spec suggest that it was changed for a reason?

          Age Jan Kuperus added a comment -

          I agree. I was pretty sure XSLT 2.0 was even stricter, but could not immediately find a formal reference.
          So I did some more research today and found the following confirmation in http://www.w3.org/TR/xslt-xquery-serialization/, which is part of XSLT 2.0:

          "media-type A string of Unicode characters specifying the media type (MIME content type) [RFC2046]; the charset parameter of the media type MUST NOT be specified explicitly in the value of the media-type parameter".

          Therefore I would like you to have a look at my patch and comment on it (or even commit it). Committing this patch would also require the patches for SOLR-233 and SOLR-514 to be undone (as their results are illegal in both XSLT 1.0 and 2.0), and possibly has documentation consequences.

          Hoss Man added a comment -

          Ok, I've become convinced that we should do something like the pseudo-code Age posted above ... not so much by the additional xslt-xquery-serialization reference, but by thinking through the practical use cases...

          • If a template specifies a charset in its media-type property, it doesn't change anything for those people
          • If people have media-types without charsets but they do declare an encoding, then we're matching their wishes as best we can, and if they don't like it they can add a charset to the media-type

          Age: I haven't looked carefully at your patch, but if we can fix the double charset problem you described (which should be easy with a simple substring test) then I'm +1 for making this change.

          Jan Høydahl added a comment -

          I believe this is fixed in 3.1 by SOLR-2391. Have looked in code but not verified by testing. Please re-open if anyone still thinks there is work left on this.


            People

            • Assignee:
              Unassigned
              Reporter:
              Lance Norskog
            • Votes:
              2
              Watchers:
              4
