Solr
  1. Solr
  2. SOLR-96

Solr should support alternate charsets for XML update messages

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: update
    • Labels:
      None

      Description

      At the moment, the XML messages sent to solr to add/delete documents must be in UTF-8. The imput processing should be changed to determine the charset based on the HTTP header info, or the XML contents.

      Background and refrence material...

      http://www.nabble.com/double-curl-calls-in-post.sh--tf2287469.html#a6369448
      http://www.nabble.com/wana-use-CJKAnalyzer-tf2303256.html#a6451918
      http://www.ietf.org/rfc/rfc3023.txt
      http://www.w3.org/TR/REC-xml/#sec-guessing

      1. SOLR-96.patch
        4 kB
        Uwe Schindler
      2. SOLR-96.patch
        2 kB
        Uwe Schindler
      3. SOLR-96-3.x.patch
        8 kB
        Uwe Schindler
      4. SOLR-96-test.patch
        5 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Here is a patch to fix this.

          The whole problem everywhere in solr (even for config files) is, that XML files per spec are not intended to be handled a "text", they are binary!!! (this is why the MIME type is application/xml and text/xml was deprecated by IANA).

          The APIs provided by Java that take java.io.Reader are only convenience methods to support parsing strings or database contents that are in text contents with already detected CharSet. XML files from unknown source must always be parsed as a byte-stream. Charsets determined from HTTP headers may only be used as a hint to the parser.

          The patch changes the XmlUpdateRequestHandler to use the byte stream and pass the charset from Content-Type header as a hint to the parser.

          This patch still misses a test.

          In general we should review all XML parsing in solr and never ever use java.io.Reader!!!

          Show
          Uwe Schindler added a comment - Here is a patch to fix this. The whole problem everywhere in solr (even for config files) is, that XML files per spec are not intended to be handled a "text", they are binary!!! (this is why the MIME type is application/xml and text/xml was deprecated by IANA). The APIs provided by Java that take java.io.Reader are only convenience methods to support parsing strings or database contents that are in text contents with already detected CharSet. XML files from unknown source must always be parsed as a byte-stream. Charsets determined from HTTP headers may only be used as a hint to the parser. The patch changes the XmlUpdateRequestHandler to use the byte stream and pass the charset from Content-Type header as a hint to the parser. This patch still misses a test. In general we should review all XML parsing in solr and never ever use java.io.Reader!!!
          Hide
          Uwe Schindler added a comment - - edited

          Just a comment for Robert:

          RFC-3023 say, text/xml without charset should be handled as US-ASCII by the transport layer (HTTP). So text/xml mime-type without charset only affects currently proxies or other components inbetween that may transform the charset during the HTTP processing.

          XML parsers by definition only take a byte stream and a charset hint and use the XML 1.0 spec to determince the charset: http://www.w3.org/TR/REC-xml/#charencoding and http://www.w3.org/TR/REC-xml/#sec-guessing

          Show
          Uwe Schindler added a comment - - edited Just a comment for Robert: RFC-3023 say, text/xml without charset should be handled as US-ASCII by the transport layer (HTTP). So text/xml mime-type without charset only affects currently proxies or other components inbetween that may transform the charset during the HTTP processing. XML parsers by definition only take a byte stream and a charset hint and use the XML 1.0 spec to determince the charset: http://www.w3.org/TR/REC-xml/#charencoding and http://www.w3.org/TR/REC-xml/#sec-guessing
          Hide
          Robert Muir added a comment -

          Thanks Uwe, I can only trust the XML policeman here!

          Show
          Robert Muir added a comment - Thanks Uwe, I can only trust the XML policeman here!
          Hide
          Yonik Seeley added a comment -

          Thanks Uwe, patch looks good!

          Show
          Yonik Seeley added a comment - Thanks Uwe, patch looks good!
          Hide
          Uwe Schindler added a comment -

          DocumentAnalysisReqHandler had the same problem, tis patch fixes this. It also fixes a missing IOUtils.closeQuietly in the log.trace-branch.

          Will commit this soon. Is there already another issue open to fix XML parsing in general to not use java.io.Reader? (config/schema)

          Show
          Uwe Schindler added a comment - DocumentAnalysisReqHandler had the same problem, tis patch fixes this. It also fixes a missing IOUtils.closeQuietly in the log.trace-branch. Will commit this soon. Is there already another issue open to fix XML parsing in general to not use java.io.Reader? (config/schema)
          Hide
          Uwe Schindler added a comment -

          Committed trunk revision: 1066819
          Committed 3.x revision: 1066823

          Show
          Uwe Schindler added a comment - Committed trunk revision: 1066819 Committed 3.x revision: 1066823
          Hide
          Uwe Schindler added a comment -

          Here the patch for 3.x as reference (AnalysisRequestHandler was fixed, too).

          Show
          Uwe Schindler added a comment - Here the patch for 3.x as reference (AnalysisRequestHandler was fixed, too).
          Hide
          Uwe Schindler added a comment -

          Here is the missing test with real byte streams and charset inside document and outside document.

          Show
          Uwe Schindler added a comment - Here is the missing test with real byte streams and charset inside document and outside document.
          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1.0 release

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Hoss Man
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development