SOLR-231

By default, use UTF-8 for posted content streams

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: None
    • Labels: None

      Description

      Solr should assume UTF-8 encoding unless the contentType says otherwise. To specify a different content type or encoding, set the header value, for example contentType="text/xml; charset=utf-8".

      Likewise, content passed with stream.body=xxxx will default to UTF-8 unless stream.contentType says otherwise.
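As an illustration of the rule described above, here is a minimal sketch of resolving a charset from a content type string and falling back to UTF-8. The class and method names (CharsetDefaults, charsetFromContentType, openReader) are hypothetical and are not taken from the attached patch; the parsing is deliberately simple and assumes a well-formed "type/subtype; charset=name" value.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Solr: picks the charset for a posted stream.
final class CharsetDefaults {

    /** Returns the charset named in contentType, or UTF-8 when none is declared. */
    static Charset charsetFromContentType(String contentType) {
        if (contentType != null) {
            // Parameters follow the media type, separated by ';'
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim();
                    // Allow a quoted value such as charset="utf-8"
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    if (!name.isEmpty()) {
                        // Throws an unchecked exception if the name is illegal or unsupported
                        return Charset.forName(name);
                    }
                }
            }
        }
        // No explicit charset: assume UTF-8
        return StandardCharsets.UTF_8;
    }

    /** Wraps a raw stream in a Reader using the resolved charset. */
    static Reader openReader(InputStream in, String contentType) {
        return new InputStreamReader(in, charsetFromContentType(contentType));
    }
}
```

With this sketch, a content type of "text/xml" or a missing content type yields a UTF-8 reader, while "text/xml; charset=ISO-8859-1" yields an ISO-8859-1 reader.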

      For previous discussion, see:

      http://www.nabble.com/resin-and-UTF-8-in-URLs-tf3152910.html

      http://www.nabble.com/charset-in-POST-from-browser-tf3153057.html

      http://www.nabble.com/Re%3A-svn-commit%3A-r536048----lucene-solr-trunk-src-webapp-src-org-apache-solr-servlet-SolrRequestParsers.java-tf3712816.html

      1. SOLR-231-ContentType-UTF8.patch
        5 kB
        Ryan McKinley
      2. SOLR-231-ContentType-UTF8.patch
        5 kB
        Ryan McKinley

        Activity

        Ryan McKinley added a comment -

        added in 537024

        Hoss Man added a comment -

        Yonik: agreed that the XML parsing should (eventually) use the raw InputStream instead of a Reader if no explicit charset is declared in the content type ... but that's a separate issue (SOLR-96) specific to XmlUpdateRequestHandler.

        Independent of that is the question: "what should an arbitrary request handler get if it calls ContentStream.getReader and the ContentStream doesn't explicitly know the charset of the InputStream it has?"

        The patch seems clean to me.

        Ryan McKinley added a comment -

        >> Solr should assume UTF-8 encoding unless the contentType says otherwise.
        >
        > In general yes (when Solr is asked for a Reader).
        > For XML, we should probably give the parser an InputStream.
        > http://www.nabble.com/double-curl-calls-in-post.sh--tf2287469.html#a6369448
        >

        Sounds good. This patch only affects what charset is used when you call getReader().

        Perhaps as part of SOLR-133, we should make sure the XML parser is passed the stream from the getInputStream() method.
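To illustrate why handing the XML parser an InputStream can matter, here is a minimal sketch using the JDK's standard DOM parser. It is not Solr code; the class name XmlStreamVsReader and the sample document are assumptions made for the example. When given raw bytes the parser can honor the encoding named in the XML declaration, whereas a Reader that has already decoded the bytes as UTF-8 may mangle non-UTF-8 content.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Solr.
final class XmlStreamVsReader {

    /** A small document encoded as ISO-8859-1, as stated in its XML declaration. */
    static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><doc>caf\u00e9</doc>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Parses from raw bytes, so the parser reads the declared encoding itself. */
    static String parseFromStream(byte[] bytes) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(bytes))
                          .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses from a Reader that has already decoded the bytes as UTF-8. */
    static String parseFromUtf8Reader(byte[] bytes) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            InputSource source = new InputSource(
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
            return builder.parse(source).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch, parseFromStream(sampleBytes()) recovers the accented character, while parseFromUtf8Reader(sampleBytes()) does not, because the byte for "é" is not valid UTF-8 on its own.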

        Yonik Seeley added a comment -

        > Solr should assume UTF-8 encoding unless the contentType says otherwise.

        In general yes (when Solr is asked for a Reader).
        For XML, we should probably give the parser an InputStream.
        http://www.nabble.com/double-curl-calls-in-post.sh--tf2287469.html#a6369448

        Ryan McKinley added a comment -

        oops, need to grant license

        Ryan McKinley added a comment -

        This patch also makes sure the behavior is consistent for multipart file uploads


          People

          • Assignee: Ryan McKinley
          • Reporter: Ryan McKinley