SOLR-3375

Charset problem using HttpSolrServer instead of CommonsHttpSolrServer

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6
    • Fix Version/s: 3.6.1
    • Component/s: clients - java
    • Labels:
      None

      Description

      I've written an application which sends PDF files to Solr for indexing, but I also need to index some meta-data which isn't contained inside the PDF.
      I recently upgraded to 3.6.0, and when recompiling my app I got some deprecation warnings, mainly telling me to switch from CommonsHttpSolrServer to HttpSolrServer.

      The problem I've noticed since doing this is that all extra fields which I add are sent to the Solr server as ASCII only, i.e. whether I use UTF-8 or ISO-8859-1 doesn't matter: anything above char 127 is sent as '?'. This was not the behaviour of CommonsHttpSolrServer.

      I've tracked it down to a line (271 in 3.6.0) in HttpSolrServer.java which is:
      entity.addPart(name, new StringBody(value));

      The problem is that StringBody(String text) maps to
      StringBody(text, "text/plain", null);
      and in
      StringBody(String text, String mimeType, Charset charset)
      we have this piece of code:
      if (charset == null) {
          charset = Charset.forName("US-ASCII");
      }
      this.content = text.getBytes(charset.name());
      this.charset = charset;
      So unless charset is set everything is converted to US-ASCII.
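The effect of that US-ASCII default can be reproduced with the JDK alone. The following is a minimal sketch, not HttpClient code: the class and method names are made up for illustration, and it uses `String.getBytes(Charset)`, which is documented to replace unmappable characters with the charset's replacement byte ('?' for US-ASCII).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only mimics what happens to the text
// when it is encoded with US-ASCII, as the StringBody default does.
class AsciiFallbackDemo {
    // Encode as US-ASCII and decode again so the loss becomes visible.
    static String roundTripAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String value = "\u00e5\u00e4\u00f6"; // the three letters a-ring, a-umlaut, o-umlaut
        System.out.println(roundTripAscii(value)); // prints ???
    }
}
```

Plain ASCII input passes through unchanged, which is why the problem only shows up for characters above 127.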

      On the other hand, in CommonsHttpSolrServer.java (line 310 in 3.6.0) there is this line
      parts.add(new StringPart(p, v, "UTF-8"));
      which adds everything as UTF-8.

      The simple solution would be to change the faulty line in HttpSolrServer.java to
      entity.addPart(name, new StringBody(value,Charset.forName("UTF-8")));

      However, this doesn't work either, since my tests have shown that neither Jetty nor Tomcat recognizes the strings as UTF-8; both interpret them as 8-bit (8859-1, I guess).

      So changing HttpSolrServer.java to
      entity.addPart(name, new StringBody(value,Charset.forName("ISO-8859-1")));
      actually gives me the same result as using CommonsHttpSolrServer.
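What the server appears to do with undeclared bytes can be illustrated without Solr at all. This is a sketch under the assumption stated above, namely that the container decodes an unlabelled part as ISO-8859-1; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: encode with the charset the client uses, decode with
// the charset the server is assumed to fall back to.
class CharsetMismatchDemo {
    static String transcode(String text, Charset sent, Charset assumed) {
        return new String(text.getBytes(sent), assumed);
    }

    public static void main(String[] args) {
        String value = "\u00e5\u00e4\u00f6";
        // UTF-8 bytes read as ISO-8859-1: each letter turns into two characters.
        System.out.println(transcode(value, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).length());
        // ISO-8859-1 on both sides: the text survives.
        System.out.println(transcode(value, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1).equals(value));
    }
}
```

With UTF-8 on the client side the three letters arrive as the six characters "Ã¥Ã¤Ã¶", while ISO-8859-1 on both sides round-trips cleanly, which matches the observation that the ISO-8859-1 variant "works".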

      But my investigations have shown that there is a difference in how Commons-HttpClient and HttpClient-4.x work.
      Commons-HttpClient sends all parameters as regular, URL-encoded POST parameters (/update/extract?param1=value&param2=value2), while
      HttpClient-4.x sends them as multipart/form-data parts, and I think the problem is that each part should carry its own charset parameter.

      I.e., HttpClient-4.x sends
      -----------------------------------------------------------------------------------
      --jNljZ3jE1sHG529HrzSjZWYEad-6Wu
      Content-Disposition: form-data; name="literal.string_txt"

      åäö
      -----------------------------------------------------------------------------------

      But it should probably send something like this:

      -----------------------------------------------------------------------------------
      --jNljZ3jE1sHG529HrzSjZWYEad-6Wu
      Content-Disposition: form-data; name="literal.string_txt"
      Content-Type: text/plain; charset=utf-8

      åäö
      -----------------------------------------------------------------------------------
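To make the desired wire format concrete, here is a small JDK-only sketch that assembles one text part with its own Content-Type header. It is not how HttpClient builds parts; the class name, method name and the short boundary used in `main` are made up for illustration. Note that `Charset.name()` yields "UTF-8" in upper case, which is equivalent to the lower-case spelling in the dump because charset names are case-insensitive.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that writes a single multipart/form-data text part
// which declares the charset of its own body.
class MultipartPartSketch {
    static byte[] textPart(String boundary, String name, String value, Charset charset) {
        String headers = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + name + "\"\r\n"
                + "Content-Type: text/plain; charset=" + charset.name() + "\r\n"
                + "\r\n";
        byte[] head = headers.getBytes(StandardCharsets.US_ASCII); // headers are plain ASCII here
        byte[] body = value.getBytes(charset);                     // body uses the declared charset
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(body, 0, body.length);
        out.write('\r');
        out.write('\n');
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] part = textPart("demo-boundary", "literal.string_txt",
                "\u00e5\u00e4\u00f6", StandardCharsets.UTF_8);
        System.out.println(new String(part, StandardCharsets.UTF_8));
    }
}
```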

      1. SolrTest.java
        1 kB
        Roger Håkansson
      2. httpsolrserver-dump.txt
        15 kB
        Roger Håkansson
      3. commonshttpsolrserver-dump.txt
        0.9 kB
        Roger Håkansson

        Activity

        Roger Håkansson added a comment -

        Test program to show the problem.
        Pass a URL to a Solr server as first arg and a PDF file as second.

        Then search for id 1234567890 and 1234567891 and see the difference in string_txt/string2_txt between the documents

        Roger Håkansson added a comment -

        After going back and forth through a ton of code, I've come to this conclusion.

        First, the reason for the initial problem is that CommonsHttpSolrServer makes the client send a ContentStreamUpdateRequest as a POST with all parameters in the URL plus the file data. HttpSolrServer, on the other hand, sends everything as separate parts in a multipart POST, one part for each parameter.
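The "parameters in the URL" style can be sketched with `java.net.URLEncoder`. This is only an illustration, assuming the parameters are percent-encoded as UTF-8; the class and method names are hypothetical and this is not the SolrJ implementation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that appends percent-encoded parameters to a path.
class QueryStringSketch {
    static String buildUrl(String path, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(path);
        char sep = '?';
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                sb.append(sep)
                  .append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
                sep = '&';
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.string_txt", "\u00e5\u00e4\u00f6");
        // prints /update/extract?literal.string_txt=%C3%A5%C3%A4%C3%B6
        System.out.println(buildUrl("/update/extract", params));
    }
}
```

Because every non-ASCII byte is escaped, the request line stays pure ASCII and the charset question moves to how the server decodes the escapes, rather than to a per-part header.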

        Regarding fixing HttpSolrServer, I've tested the two solutions I previously described, and both seem to work but might have totally different implications.

        First solution is to change HttpSolrServer.java so

         
        entity.addPart(name, new StringBody(value));
        

        is changed to

         
        entity.addPart(name, new StringBody(value, "text/plain", Charset.forName("ISO-8859-1")));
        

        I'm not sure what implications this might have; it might be wrong according to some standard to assume 8859-1, and it doesn't solve this problem universally. But both the dist-Jetty and my Tomcat (7.0.22) work with this fix.

        Second solution is a more generic fix.
        This involves the same change as the previous, except the charset is "UTF-8".

         
        entity.addPart(name, new StringBody(value, "text/plain", Charset.forName("UTF-8")));
        

        But it also involves getting the guys developing HttpClient to make a change.
        Currently their code looks like this

        HttpMultiPart.java
          String filename = part.getBody().getFilename();
          if (filename != null) {
            MinimalField ct = part.getHeader().getField(MIME.CONTENT_TYPE);
            writeField(ct, this.charset, out);
          }
        

        If they changed their code to always add Content-Type, not only when there is a filename, then together with the fix in HttpSolrServer.java that would make sure that UTF-8 encoded strings would always be sent to the server.
        But this requires them to make a change...
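The difference between the current behaviour and the proposed one can be modelled in a few lines. This is a simplified stand-in, not HttpClient code: the class, method and the `alwaysWriteContentType` flag are hypothetical and exist only to contrast "Content-Type only when there is a filename" with "Content-Type always".

```java
// Hypothetical model of how the headers of one body part are written.
class PartHeaderSketch {
    static String headers(String name, String filename, String contentType,
                          boolean alwaysWriteContentType) {
        StringBuilder sb = new StringBuilder();
        sb.append("Content-Disposition: form-data; name=\"").append(name).append('"');
        if (filename != null) {
            sb.append("; filename=\"").append(filename).append('"');
        }
        sb.append("\r\n");
        // Behaviour described in the snippet above: only file parts get a
        // Content-Type. The flag models the proposed "always" variant.
        if (filename != null || alwaysWriteContentType) {
            sb.append("Content-Type: ").append(contentType).append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ct = "text/plain; charset=UTF-8";
        System.out.print(headers("literal.string_txt", null, ct, false)); // no Content-Type line
        System.out.print(headers("literal.string_txt", null, ct, true));  // Content-Type line present
    }
}
```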

        A third option would be to get HttpClient to post just like Commons-HttpClient did, i.e. no multipart posting, but I have no idea how much work that might require.

        Roger Håkansson added a comment -

        Uploaded network dumps which show the difference between CommonsHttpSolrServer and HttpSolrServer.

        Sami Siren added a comment -

        Thanks Roger for the detailed report. I already fixed some bugs in trunk that I introduced in SOLR-2020 and I believe that this problem should be fixed there (r1327635).

        I will leave this issue open so that the fix gets backported if there is a 3.6.1 release. In the meantime, the only workaround on 3.x is to use CommonsHttpSolrServer.

        Roger Håkansson added a comment -

        I've downloaded HttpSolrServer.java from trunk and recompiled the 3.6 tree and verified that the fix solves the problem.

        Oleg Kalnichevski added a comment -

        @Roger

        > But it also involves getting the guys developing HttpClient to make a change

        HttpClient supports two modes for multipart MIME messages: strict and browser compatible. The code snippet you have pasted above is executed in the compatibility mode only. Common browsers include a Content-Type field in body parts that represent a file upload.

        Oleg

        Sami Siren added a comment -

        The fix is now committed to the 3.6 branch.

        Uwe Schindler added a comment -

        Bulk close for 3.6.1


          People

          • Assignee:
            Sami Siren
            Reporter:
            Roger Håkansson
          • Votes:
            0
            Watchers:
            3
