Affects Version/s: 1.1
Fix Version/s: None
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) Server VM (build 20.1-b02, mixed mode)
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b05)
Java HotSpot(TM) Client VM (build 20.6-b01, mixed mode, sharing)
When the response from the /tika servlet contains non-ASCII characters, Tika doesn't declare which encoding it is using, and the encoding differs depending on which OS the server is running on.
This is a server running on Tomcat on Linux:
And this is a server running on Tomcat on Windows:
As you can see, the data (last few bytes) is encoded differently. The Linux server encodes it as UTF-8, while the Windows server uses something else, probably Windows-1252, in which 0x92 is a curly (right single) quote and 0x95 is a bullet point.
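To illustrate the difference, here is a minimal sketch (not taken from Tika) that encodes a curly quote and a bullet in both charsets and prints the resulting bytes. The class name `EncodingDemo` and the helper `hex` are made up for this example.

```java
import java.nio.charset.Charset;

public class EncodingDemo {
    // Hypothetical helper: formats bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+2019 RIGHT SINGLE QUOTATION MARK followed by U+2022 BULLET
        String text = "\u2019\u2022";

        // UTF-8 needs three bytes for each of these characters.
        System.out.println(hex(text.getBytes(Charset.forName("UTF-8"))));
        // prints: e2 80 99 e2 80 a2

        // Windows-1252 maps each of them to a single byte.
        System.out.println(hex(text.getBytes(Charset.forName("windows-1252"))));
        // prints: 92 95
    }
}
```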
A client can't know what encoding the server used, because the Content-Type is just text/plain with no charset parameter.
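From the client's side, the problem looks roughly like the sketch below. The class `ContentTypeCharset` and the method `declaredCharset` are hypothetical names for illustration, and the parsing is deliberately simplified: it only looks for a `charset=` parameter.

```java
import java.nio.charset.Charset;

public class ContentTypeCharset {
    // Returns the charset named in a Content-Type value,
    // or null when the header declares none.
    static Charset declaredCharset(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the parameter name.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                return Charset.forName(name);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // What the server sends today: nothing to go on, the client has to guess.
        System.out.println(declaredCharset("text/plain"));                // prints: null
        // What the client would need in order to decode reliably.
        System.out.println(declaredCharset("text/plain; charset=UTF-8")); // prints: UTF-8
    }
}
```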
Ideally it would always use UTF-8, so that the client doesn't have to do extra work to decode the response. The attached patch does that, and declares the encoding in the Content-Type.
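The general idea, independent of the attached patch, can be sketched in plain Java as follows. This is not the patch itself, and it assumes the OS-dependent behaviour comes from writing text with the platform default charset. The class `Utf8ResponseSketch` and the method `encodeBody` are hypothetical. In a real servlet the equivalent would be something like calling `response.setContentType("text/plain; charset=UTF-8")` before obtaining the writer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class Utf8ResponseSketch {
    static final Charset UTF_8 = Charset.forName("UTF-8");

    // Header value that tells the client how the body is encoded.
    static final String CONTENT_TYPE = "text/plain; charset=" + UTF_8.name();

    // Encodes the body with an explicit charset instead of relying on
    // the platform default, which differs between Linux and Windows.
    static byte[] encodeBody(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, UTF_8);
        try {
            writer.write(text);
        } finally {
            writer.close(); // flushes the encoder into the byte stream
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = encodeBody("\u2019\u2022");
        System.out.println(CONTENT_TYPE);  // prints: text/plain; charset=UTF-8
        System.out.println(body.length);   // prints: 6
    }
}
```

With both pieces in place, the bytes on the wire are the same on every OS and the client can decode them from the header alone.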