Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.4
    • Fix Version/s: 4.5, 6.0
    • Component/s: None
    • Labels: None

      Description

      Allow a user to send a query or update to Solr in a character set other than UTF-8 and inform Solr what charset to use with an "ie" parameter, for input encoding. This was discussed in SOLR-4265 and SOLR-4283.

      Changing the default charset is a bad idea because distributed search (SolrCloud) relies on UTF-8.

      1. SOLR-5082.patch
        10 kB
        Uwe Schindler
      2. SOLR-5082.patch
        9 kB
        Uwe Schindler

        Activity

        Uwe Schindler added a comment - - edited

        I already have a patch, but it is not ready to post yet: I want to improve it so we don't need to scan the query_string twice, and instead split URL %-decoding and byte->string conversion into separate steps.

        Uwe Schindler added a comment -

        Patch.

        This uses a buffering approach: it buffers all key-value pairs until it sees an ie=CHARSET pair. It then decodes all buffered tokens and, from that point on, decodes directly. This is the most memory-efficient approach I was able to find.
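For readers without the patch at hand, the buffering idea described above could be sketched roughly as follows. This is not the code from the attached patch: the class name `BufferedQueryStringParser`, its methods, and the choice to keep the `ie` pair in the result are assumptions made for illustration. It also shows the two separate steps mentioned earlier, %-decoding to bytes first and byte->string conversion second.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Solr implementation.
class BufferedQueryStringParser {
    private static final byte[] IE_KEY = {'i', 'e'};

    // Step 1: resolve %XX escapes and '+' into raw bytes; no charset is chosen yet.
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                if (i + 2 >= s.length()) {
                    throw new IllegalArgumentException("truncated % escape");
                }
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("invalid % escape");
                }
                out.write((hi << 4) | lo);
                i += 2;
            } else if (c == '+') {
                out.write(' ');
            } else {
                out.write(c); // assumes the unescaped characters are plain ASCII
            }
        }
        return out.toByteArray();
    }

    // Step 2: convert bytes to strings once the charset is known.
    private static void add(Map<String, List<String>> params, byte[] key, byte[] value, Charset cs) {
        params.computeIfAbsent(new String(key, cs), k -> new ArrayList<>())
              .add(new String(value, cs));
    }

    static Map<String, List<String>> parse(String queryString) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        List<byte[][]> buffer = new ArrayList<>(); // undecoded pairs seen before "ie"
        Charset charset = null;
        for (String pair : queryString.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            byte[] key = percentDecode(eq < 0 ? pair : pair.substring(0, eq));
            byte[] value = percentDecode(eq < 0 ? "" : pair.substring(eq + 1));
            if (charset != null) {
                add(params, key, value, charset); // charset known: decode directly
                continue;
            }
            buffer.add(new byte[][] {key, value});
            if (Arrays.equals(key, IE_KEY)) {
                // Charset names are ASCII, so the value can be read before decoding the rest.
                charset = Charset.forName(new String(value, StandardCharsets.US_ASCII));
                for (byte[][] kv : buffer) {
                    add(params, kv[0], kv[1], charset); // replay everything buffered so far
                }
                buffer.clear();
            }
        }
        // No "ie" parameter at all: fall back to the UTF-8 default.
        for (byte[][] kv : buffer) {
            add(params, kv[0], kv[1], StandardCharsets.UTF_8);
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(parse("q=m%FCller&ie=ISO-8859-1"));
    }
}
```

Each pair is scanned once; only the pairs that arrive before `ie` are held as bytes, and a request without `ie` is decoded as UTF-8 at the end.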

        Yonik Seeley added a comment -

        Given that this will almost never be used, maybe we should handle it as an exception case that doesn't slow down the normal/standard UTF8 case.
        We could do it like before, but we could check for "ie" after the fact and re-parse (and also try a slower re-parse on an exception).

        Uwe Schindler added a comment - - edited

        This one does not slow things down, as it does not reparse. I will soon post a patch that only enables this mode for the query string, not for POSTed content. For POSTed content you can supply the charset in the Content-Type header.

        Uwe Schindler added a comment -

        Patch:

        • Uses LinkedList (more memory efficient, as the buffer is freed during replay)
        • Does not allow ie= for POSTed form data. The encoding must be set via the Content-Type header in that case.
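To illustrate the second point, here is a minimal sketch of how a charset could be read from a Content-Type header value, falling back to UTF-8 when none is given. The helper `charsetFromContentType` and its class are hypothetical and are not taken from the patch; only standard JDK calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr.
class ContentTypeCharset {
    // Returns the charset named in a header value such as
    // "application/x-www-form-urlencoded; charset=ISO-8859-1", or UTF-8 if absent.
    static Charset charsetFromContentType(String contentType) {
        if (contentType != null) {
            String[] parts = contentType.split(";");
            // parts[0] is the media type; the rest are parameters.
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                int eq = param.indexOf('=');
                if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                    String name = param.substring(eq + 1).trim();
                    // Strip optional double quotes around the value.
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    return Charset.forName(name);
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(
            charsetFromContentType("application/x-www-form-urlencoded; charset=ISO-8859-1"));
    }
}
```

A client posting Latin-1 form data would therefore declare the charset in the header rather than adding `ie=` to the body.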
        Uwe Schindler added a comment -

        More strict about ie param.

        Uwe Schindler added a comment -

        Shawn Heisey: Are you fine with this code?

        From my tests here I have seen no slowdown for query-string parsing: it is as fast as before, and any slowdown is too small to measure. In any case, the current URLDecoder is much more efficient than the one embedded in Jetty (the one with broken UTF-8 handling in earlier versions). The slowest part of the whole code is MultiMapSolrParams#add, because it reallocates arrays every time a key is duplicated...
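The reallocation cost mentioned above can be pictured with a small sketch of a multi-valued parameter map backed by `String[]` values. This is an illustration of the general pattern only, not the Solr source; the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative pattern only, not copied from MultiMapSolrParams.
class ArrayGrowthExample {
    // Appends a value for a key; every duplicate key allocates a new, one-larger array.
    static void addParam(Map<String, String[]> map, String name, String value) {
        String[] old = map.get(name);
        if (old == null) {
            map.put(name, new String[] {value});
        } else {
            String[] grown = Arrays.copyOf(old, old.length + 1); // full copy on each add
            grown[old.length] = value;
            map.put(name, grown);
        }
    }

    public static void main(String[] args) {
        Map<String, String[]> params = new HashMap<>();
        addParam(params, "fq", "a");
        addParam(params, "fq", "b");
        addParam(params, "fq", "c");
        System.out.println(Arrays.toString(params.get("fq")));
    }
}
```

With n values for one key, the copies add up to roughly n²/2 element moves, which only matters when a parameter is repeated many times.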

        ASF subversion and git services added a comment -

        Commit 1508236 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1508236 ]

        SOLR-5082: The encoding of URL-encoded query parameters can be changed with the "ie" (input encoding) parameter, e.g. "select?q=m%FCller&ie=ISO-8859-1". The default is UTF-8. To change the encoding of POSTed content, use the "Content-Type" HTTP header
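As a quick illustration of the example URL in the commit message above, the sketch below shows what a client sees when it percent-encodes "müller" in ISO-8859-1 versus UTF-8, and why the server needs to be told which one was used. It uses only `java.net.URLEncoder` and `java.net.URLDecoder`; the class name and wrapper methods are made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Illustration only; the wrappers exist to avoid checked exceptions at call sites.
class InputEncodingExample {
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    static String decode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String name = "m\u00fcller"; // "müller"
        System.out.println(encode(name, "ISO-8859-1")); // m%FCller
        System.out.println(encode(name, "UTF-8"));      // m%C3%BCller
        // The single byte 0xFC is not valid UTF-8, so decoding it as UTF-8 loses the "ü".
        System.out.println(decode("m%FCller", "UTF-8").equals(name));      // false
        System.out.println(decode("m%FCller", "ISO-8859-1").equals(name)); // true
    }
}
```

A request such as `select?q=m%FCller` therefore needs `&ie=ISO-8859-1` to be read correctly, while `select?q=m%C3%BCller` works with the UTF-8 default.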

        ASF subversion and git services added a comment -

        Commit 1508237 from Uwe Schindler in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1508237 ]

        Merged revision(s) 1508236 from lucene/dev/trunk:
        SOLR-5082: The encoding of URL-encoded query parameters can be changed with the "ie" (input encoding) parameter, e.g. "select?q=m%FCller&ie=ISO-8859-1". The default is UTF-8. To change the encoding of POSTed content, use the "Content-Type" HTTP header

        David Smiley added a comment -

        Uwe, why did you give me credit with you on this in CHANGES.txt?

        By the way, I was looking through the code for this. Why does decodeBuffer() call remove() on the buffer iterator for every item? Couldn't you skip that and simply call clear() when the loop is done? If you made that change, I think ArrayList would perform better for this buffer than LinkedList.
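The two ways of draining the buffer being compared here could look roughly like this. The sketch is not the actual decodeBuffer() code: the method names are hypothetical, and the buffer holds plain byte arrays for simplicity.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical comparison, not the Solr source.
class BufferDrainExample {
    // Variant 1: remove each entry as it is decoded, so its bytes can be
    // garbage-collected during replay. Iterator.remove() is O(1) on a LinkedList.
    static List<String> drainWithRemove(LinkedList<byte[]> buffer, Charset cs) {
        List<String> decoded = new ArrayList<>();
        for (Iterator<byte[]> it = buffer.iterator(); it.hasNext();) {
            decoded.add(new String(it.next(), cs));
            it.remove();
        }
        return decoded;
    }

    // Variant 2: decode everything, then clear once. This suits an ArrayList,
    // where per-item removal from the front would shift the remaining elements.
    static List<String> drainWithClear(ArrayList<byte[]> buffer, Charset cs) {
        List<String> decoded = new ArrayList<>(buffer.size());
        for (byte[] bytes : buffer) {
            decoded.add(new String(bytes, cs));
        }
        buffer.clear();
        return decoded;
    }

    public static void main(String[] args) {
        ArrayList<byte[]> buffer = new ArrayList<>();
        buffer.add(new byte[] {'q'});
        buffer.add(new byte[] {'m', (byte) 0xFC, 'l', 'l', 'e', 'r'});
        System.out.println(drainWithClear(buffer, StandardCharsets.ISO_8859_1).size());
    }
}
```

Both variants produce the same decoded values; the trade-off is between releasing entries early (variant 1) and the lower per-element overhead of an array-backed list (variant 2).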

        ASF subversion and git services added a comment -

        Commit 1524086 from David Smiley in branch 'dev/trunk'
        [ https://svn.apache.org/r1524086 ]

        SOLR-5082: removed inadvertent credit to dsmiley

        ASF subversion and git services added a comment -

        Commit 1524090 from David Smiley in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1524090 ]

        SOLR-5082: removed inadvertent credit to dsmiley

        ASF subversion and git services added a comment -

        Commit 1524091 from David Smiley in branch 'dev/branches/lucene_solr_4_5'
        [ https://svn.apache.org/r1524091 ]

        SOLR-5082: removed inadvertent credit to dsmiley

        Uwe Schindler added a comment -

        Hi David. Sorry for adding credit to you. The credit was meant for Shawn Heisey, so maybe you can add him instead. I can also do this.

        Uwe

        ASF subversion and git services added a comment -

        Commit 1524282 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1524282 ]

        SOLR-5082: Fix credits

        ASF subversion and git services added a comment -

        Commit 1524283 from Uwe Schindler in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1524283 ]

        SOLR-5082: Fix credits

        ASF subversion and git services added a comment -

        Commit 1524284 from Uwe Schindler in branch 'dev/branches/lucene_solr_4_5'
        [ https://svn.apache.org/r1524284 ]

        Merged revision(s) 1524282 from lucene/dev/trunk:
        SOLR-5082: Fix credits

        Adrien Grand added a comment -

        4.5 release -> bulk close


          People

          • Assignee: Uwe Schindler
          • Reporter: Shawn Heisey
          • Votes: 0
          • Watchers: 6
