Solr / SOLR-3375

Charset problem using HttpSolrServer instead of CommonsHttpSolrServer


    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6
    • Fix Version/s: 3.6.1
    • Component/s: clients - java
    • Labels:


      I've written an application which sends PDF files to Solr for indexing, but I also need to index some meta-data which isn't contained inside the PDF.
      I recently upgraded to 3.6.0, and when recompiling my app I got some deprecation warnings, mainly telling me to switch from CommonsHttpSolrServer to HttpSolrServer.

      The problem I've noticed since doing this is that all extra fields which I add are sent to the Solr server as ASCII only, i.e. it doesn't matter whether the text is UTF-8 or ISO-8859-1; anything above char 127 is sent as '?'. This was not the behaviour of CommonsHttpSolrServer.

      I've tracked it down to this line (271 in 3.6.0):
      entity.addPart(name, new StringBody(value));

      The problem is that StringBody(String text) maps to
      StringBody(text, "text/plain", null);
      and in
      StringBody(String text, String mimeType, Charset charset)
      we have this piece of code:
      if (charset == null) {
          charset = Charset.forName("US-ASCII");
      }
      this.content = text.getBytes(charset.name());
      this.charset = charset;
      So unless charset is set everything is converted to US-ASCII.
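The effect of that US-ASCII fallback can be reproduced with the JDK alone. The following is a minimal sketch, not the HttpClient code itself; the class and the helper `roundTripAscii` are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

public class AsciiFallbackDemo {
    // Hypothetical helper: encode the text as US-ASCII and decode it again,
    // mimicking what happens to a field value when no charset is supplied.
    static String roundTripAscii(String text) {
        // getBytes(Charset) replaces unmappable characters with the
        // charset's replacement byte, which is '?' for US-ASCII.
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00e5 is "å", which lies above char 127.
        System.out.println(roundTripAscii("H\u00e5kansson")); // prints "H?kansson"
    }
}
```

Once the character has been replaced by '?', the original value cannot be recovered on the server side, whatever charset the server assumes.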

      On the other hand, there is this line (310 in 3.6.0):
      parts.add(new StringPart(p, v, "UTF-8"));
      which adds everything as UTF-8.

      The simple solution would be to change the faulty line to
      entity.addPart(name, new StringBody(value,Charset.forName("UTF-8")));

      However, this doesn't work either, since my tests have shown that neither Jetty nor Tomcat recognizes the strings as UTF-8; both interpret them as 8-bit (8859-1, I guess).

      So changing to
      entity.addPart(name, new StringBody(value,Charset.forName("ISO-8859-1")));
      actually gives me the same result as using CommonsHttpSolrServer.
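The mismatch described above can be illustrated without a servlet container. This sketch assumes, as guessed in the report, that the receiving side decodes the part as ISO-8859-1; the class and the helper `decodeAsLatin1` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Hypothetical helper: encode with the charset the client used, then
    // decode as ISO-8859-1, which is what the container is assumed to do.
    static String decodeAsLatin1(String text, Charset sentAs) {
        return new String(text.getBytes(sentAs), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String value = "H\u00e5kansson"; // "Håkansson"

        // Sent as UTF-8: "å" becomes the two bytes C3 A5, read back as "Ã¥".
        System.out.println(decodeAsLatin1(value, StandardCharsets.UTF_8));

        // Sent as ISO-8859-1: the single byte E5 is read back as "å".
        System.out.println(decodeAsLatin1(value, StandardCharsets.ISO_8859_1));
    }
}
```

Sending ISO-8859-1 only appears to work because sender and receiver happen to agree; characters outside Latin-1, such as "\u20ac", would still be lost.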

      But my investigations have shown that there is a difference in how Commons-HttpClient and HttpClient-4.x work.
      Commons-HttpClient sends all parameters as regular POST parameters, but URL-encoded (/update/extract?param1=value&param2=value2), while
      HttpClient-4.x sends them as multipart/form-data messages, and I think the problem is that each part of the multipart message should have its own charset parameter.
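To show why the URL-encoded form is less sensitive to this, here is a small sketch of percent-encoding a parameter with the JDK. It assumes UTF-8 as the encoding; the class and the helper `encodeParam` are hypothetical and are not taken from either Solr client.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UrlEncodedParamDemo {
    // Hypothetical helper: build one name=value pair, percent-encoding
    // both sides as UTF-8 (an assumption made for this sketch).
    static String encodeParam(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints "literal.string_txt=H%C3%A5kansson"
        System.out.println(encodeParam("literal.string_txt", "H\u00e5kansson"));
    }
}
```

The encoded pair contains only ASCII characters on the wire, so no per-part charset declaration is involved; the server only has to know which charset to apply when decoding the percent escapes.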

      I.e., HttpClient-4.x sends
      Content-Disposition: form-data; name="literal.string_txt"


      But it should probably send something like this

      Content-Disposition: form-data; name="literal.string_txt"
      Content-Type: text/plain; charset=utf-8
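The proposed part layout can be sketched as plain bytes. This is an illustration of the suggested wire format only, not how HttpClient builds its parts; the class and the helper `formDataPart` are hypothetical, and the multipart boundary lines are left out for brevity.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MultipartPartSketch {
    // Hypothetical helper: build the headers and body of one form-data part,
    // declaring the charset that was actually used to encode the body.
    static byte[] formDataPart(String name, String value, Charset charset) {
        String headers = "Content-Disposition: form-data; name=\"" + name + "\"\r\n"
                + "Content-Type: text/plain; charset=" + charset.name() + "\r\n"
                + "\r\n"; // an empty line separates headers from the body
        byte[] head = headers.getBytes(StandardCharsets.US_ASCII);
        byte[] body = value.getBytes(charset);

        // Concatenate header bytes and body bytes into one part.
        byte[] part = new byte[head.length + body.length];
        System.arraycopy(head, 0, part, 0, head.length);
        System.arraycopy(body, 0, part, head.length, body.length);
        return part;
    }

    public static void main(String[] args) {
        byte[] part = formDataPart("literal.string_txt", "H\u00e5kansson", StandardCharsets.UTF_8);
        System.out.println(new String(part, StandardCharsets.UTF_8));
    }
}
```

With the charset declared next to the body, a receiver that honours the part's Content-Type no longer has to guess how the bytes were encoded.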


      1. commonshttpsolrserver-dump.txt
        0.9 kB
        Roger Håkansson
      2. httpsolrserver-dump.txt
        15 kB
        Roger Håkansson
        1 kB
        Roger Håkansson


        Roger Håkansson created issue -
        Roger Håkansson made changes -
        Field Original Value New Value
        Attachment [ 12523238 ]
        Roger Håkansson made changes -
        Attachment commonshttpsolrserver-dump.txt [ 12523275 ]
        Attachment httpsolrserver-dump.txt [ 12523276 ]
        Sami Siren made changes -
        Fix Version/s 3.6.1 [ 12320754 ]
        Affects Version/s 4.0 [ 12314992 ]
        Affects Version/s 3.6.1 [ 12320754 ]
        Sami Siren made changes -
        Assignee Sami Siren [ siren ]
        Sami Siren made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]


          • Assignee:
            Sami Siren
          • Reporter:
            Roger Håkansson
          • Votes:
            0
          • Watchers:
            3


            • Created: