[SOLR-3375] Charset problem using HttpSolrServer instead of CommonsHttpSolrServer - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.6
Fix Version/s: 3.6.1
Component/s: clients - java
Labels: None

Description

I've written an application which sends PDF files to Solr for indexing, but I also need to index some metadata which isn't contained in the PDF.
I recently upgraded to 3.6.0, and when recompiling my app I got some deprecation warnings, mainly telling me to switch from CommonsHttpSolrServer to HttpSolrServer.

The problem I've noticed since doing this is that all the extra fields I add are sent to the Solr server as ASCII only. Whether I use UTF-8 or ISO-8859-1 doesn't matter: any character above 127 is sent as '?'. This was not the behaviour of CommonsHttpSolrServer.

I've tracked it down to a line (271 in 3.6.0) in HttpSolrServer.java which is:
entity.addPart(name, new StringBody(value));

The problem is that StringBody(String text) maps to
StringBody(text, "text/plain", null);
and in
StringBody(String text, String mimeType, Charset charset)
we have this piece of code:
if (charset == null) {
    charset = Charset.forName("US-ASCII");
}
this.content = text.getBytes(charset.name());
this.charset = charset;
So unless charset is set everything is converted to US-ASCII.
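
The effect of that fallback can be reproduced with the JDK alone. The following is a minimal sketch, not the HttpMime source: the class `AsciiFallbackDemo` and its `encode` helper are hypothetical names that only mirror the null-charset branch quoted above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AsciiFallbackDemo {
    // Hypothetical helper mirroring the quoted logic: a null charset becomes US-ASCII.
    static byte[] encode(String text, Charset charset) {
        if (charset == null) {
            charset = Charset.forName("US-ASCII");
        }
        // Characters that US-ASCII cannot represent are replaced with '?'.
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // Unicode escapes for "åäö", so the result does not depend on the source file encoding.
        String value = "\u00e5\u00e4\u00f6";

        byte[] ascii = encode(value, null);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // prints "???"

        byte[] utf8 = encode(value, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // prints 6 (two bytes per character)
    }
}
```

With no charset given, the original characters are already lost before anything reaches the wire, which matches the '?' characters seen on the Solr side.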

On the other hand, in CommonsHttpSolrServer.java (line 310 in 3.6.0) there is this line
parts.add(new StringPart(p, v, "UTF-8"));
which adds everything as UTF-8.

The simple solution would be to change the faulty line in HttpSolrServer.java to
entity.addPart(name, new StringBody(value,Charset.forName("UTF-8")));

However, this doesn't work either: my tests show that neither Jetty nor Tomcat recognizes the strings as UTF-8; both interpret them as 8-bit (ISO-8859-1, I guess).

So changing HttpSolrServer.java to
entity.addPart(name, new StringBody(value,Charset.forName("ISO-8859-1")));
actually gives me the same result as using CommonsHttpSolrServer.
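
The mismatch described above can be illustrated without a servlet container. This is a sketch under the assumption that the receiving side decodes each part as ISO-8859-1, as the tests suggest; `CharsetMismatchDemo` and `decodeAsLatin1` are hypothetical names, not Jetty or Tomcat code.

```java
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Decodes bytes the way a receiver assuming ISO-8859-1 would (an assumption in this sketch).
    static String decodeAsLatin1(byte[] wire) {
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String value = "\u00e5\u00e4\u00f6"; // "åäö"

        // Sender uses UTF-8, receiver assumes ISO-8859-1: each character turns into two.
        String garbled = decodeAsLatin1(value.getBytes(StandardCharsets.UTF_8));
        System.out.println(garbled.length()); // prints 6

        // Sender and receiver both use ISO-8859-1: the value survives the round trip.
        String intact = decodeAsLatin1(value.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(intact.equals(value)); // prints true
    }
}
```

This only explains why the ISO-8859-1 change appears to work; characters outside ISO-8859-1 would still be lost with that approach.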

But my investigations have shown that Commons-HttpClient and HttpClient-4.x work differently.
Commons-HttpClient sends all parameters as regular, URL-encoded POST parameters (/update/extract?param1=value&param2=value2), while
HttpClient-4.x sends them as multipart/form-data parts, and I think the problem is that each part should carry its own charset parameter.
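
For comparison, here is a rough sketch of what a URL-encoded parameter looks like when the value is encoded as UTF-8. It uses only `java.net.URLEncoder`; the class `UrlEncodedParamDemo` and its `param` helper are hypothetical, and the field name is taken from the dump in this report.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncodedParamDemo {
    // Hypothetical helper: builds one name=value pair with percent-encoded UTF-8 bytes.
    static String param(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "åäö" written with Unicode escapes to stay independent of the source file encoding.
        System.out.println(param("literal.string_txt", "\u00e5\u00e4\u00f6"));
        // prints "literal.string_txt=%C3%A5%C3%A4%C3%B6"
    }
}
```

In this form every non-ASCII byte travels as plain ASCII, so the transport itself cannot damage it; how the server decodes the percent-escapes is a separate question that this sketch does not cover.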

I.e., HttpClient-4.x sends
-----------------------------------------------------------------------------------
--jNljZ3jE1sHG529HrzSjZWYEad-6Wu
Content-Disposition: form-data; name="literal.string_txt"

Ã¥Ã¤Ã¶
-----------------------------------------------------------------------------------

But it should probably send something like this:

-----------------------------------------------------------------------------------
--jNljZ3jE1sHG529HrzSjZWYEad-6Wu
Content-Disposition: form-data; name="literal.string_txt"
Content-Type: text/plain; charset=utf-8

Ã¥Ã¤Ã¶
-----------------------------------------------------------------------------------
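
To make the proposed wire format concrete, the sketch below assembles one such part by hand using only the JDK. It is an illustration of the suggested layout, not the HttpClient-4.x implementation; `MultipartPartSketch` and `textPart` are hypothetical names, and the boundary and field name are copied from the dump above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MultipartPartSketch {
    // Hypothetical helper: serializes one text part with an explicit per-part charset.
    static byte[] textPart(String boundary, String name, String value, Charset charset) {
        String headers = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + name + "\"\r\n"
                + "Content-Type: text/plain; charset=" + charset.name() + "\r\n"
                + "\r\n";
        byte[] head = headers.getBytes(StandardCharsets.US_ASCII);
        // The body is encoded with the same charset that the header declares.
        byte[] body = value.getBytes(charset);
        byte[] crlf = "\r\n".getBytes(StandardCharsets.US_ASCII);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(body, 0, body.length);
        out.write(crlf, 0, crlf.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] part = textPart("jNljZ3jE1sHG529HrzSjZWYEad-6Wu", "literal.string_txt",
                "\u00e5\u00e4\u00f6", StandardCharsets.UTF_8);
        // Viewed as ISO-8859-1, each UTF-8 encoded character shows up as two characters.
        System.out.print(new String(part, StandardCharsets.ISO_8859_1));
    }
}
```

The point of the extra header is that the receiver no longer has to guess: the bytes of the body and the declared charset are produced from the same `Charset` value.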

Attachments

SolrTest.java (1 kB), 18/Apr/12 19:02, Roger Håkansson
httpsolrserver-dump.txt (15 kB), 18/Apr/12 23:27, Roger Håkansson
commonshttpsolrserver-dump.txt (0.9 kB), 18/Apr/12 23:27, Roger Håkansson

People

Assignee: Sami Siren
Reporter: Roger Håkansson
Votes: 0
Watchers: 3

Dates

Created: 18/Apr/12 18:58
Updated: 22/Jul/12 16:05
Resolved: 27/Apr/12 08:56