Affects Version/s: 0.7.1, 0.8.0, 0.8.1, 0.9.0
Fix Version/s: None
Windows / RHEL5 with LANG = en_US.CP1252
Tags:UTF8, Globalization, encoding, startup, system, configuration
This is a bug in Hive that is exacerbated by replatforming it to Windows without Cygwin. Hive assumes that the JVM default encoding (file.encoding) is UTF-8. Roughly 6-7 getBytes() and write() calls don't specify an encoding, so they fall back to the JVM default. The rest specify UTF-8 explicitly, which blocks auto-detection of UTF-16 data in files with a BOM present. Because of this mix of explicit encodings and default-encoding assumptions, Hive only behaves correctly in a JVM whose default encoding is UTF-8.
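To illustrate the failure mode, here is a minimal sketch (not Hive code) of what happens when the writer and the reader disagree on the encoding. The class and method names are hypothetical, and it assumes the JVM ships the windows-1252 charset, which common JDKs do but which is not one of the charsets every JVM is required to provide.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of Hive.
class DefaultEncodingDemo {
    static final Charset UTF8 = Charset.forName("UTF-8");
    // Assumption: windows-1252 is available in this JVM.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encode with one charset and decode with another. This mimics a
    // getBytes() call that uses the platform default while the reader
    // assumes a different encoding.
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        byte[] bytes = text.getBytes(writeAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        String cjk = "\u65e5\u672c\u8a9e"; // three CJK characters
        String accented = "\u00e9";        // e with acute accent

        // Consistent UTF-8 on both sides preserves the text.
        System.out.println(roundTrip(cjk, UTF8, UTF8).equals(cjk));

        // CP1252 cannot represent CJK, so each character becomes '?'.
        System.out.println(roundTrip(cjk, CP1252, UTF8).equals("???"));

        // UTF-8 bytes read as CP1252 turn one character into two.
        System.out.println(roundTrip(accented, UTF8, CP1252).equals("\u00c3\u00a9"));
    }
}
```

The second case is the one that matters most here: the data is lost at write time, so no amount of correct decoding later can recover it.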
When the JVM starts up, it derives the default encoding from the C runtime setlocale() call. On Linux/Unix, this uses the LANG environment variable (which is almost always <locale>.UTF8 on machines handling internationalized data, but is not guaranteed to be). On Windows, it is derived from the user's language settings, which currently cannot yield a UTF-8 encoding. So no Windows environment setting reliably causes the JVM to choose UTF-8 as its default encoding at startup without additional options.
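A quick way to see what a given environment produces is to print the values the JVM settled on at startup. This is a small diagnostic sketch with a hypothetical class name. It uses only the standard Charset.defaultCharset() call and the file.encoding system property.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class, not part of Hive.
class ShowDefaultCharset {
    // True when the JVM default charset is UTF-8.
    static boolean defaultIsUtf8() {
        return Charset.defaultCharset().equals(Charset.forName("UTF-8"));
    }

    public static void main(String[] args) {
        // The property as seen by the JVM (derived from the environment
        // or set with -Dfile.encoding=... on the command line).
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        // The charset that no-argument getBytes() and default readers
        // and writers will use.
        System.out.println("defaultCharset = " + Charset.defaultCharset());
        System.out.println("default is UTF-8: " + defaultIsUtf8());
    }
}
```

Running this once under the service account's normal environment and once with -Dfile.encoding=UTF-8 on the java command line should show whether the override takes effect on a given JVM.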
However, there are 2 feasible options:
1.) The JVM has a startup option, -Dfile.encoding=UTF-8, which should explicitly override the JVM's default encoding detection and make it always UTF-8 regardless of the environment configuration. This would make all deployments on all OS/environment configurations behave consistently. I don't know where Hive sets the JVM options used when it starts the service.
2.) We could add "UTF8" explicitly to all the remaining getBytes() calls that need it, and make all the string I/O explicitly UTF-8 encoded. This is probably being changed right now as part of Hive-1505, so we would duplicate effort and might make that change harder. It seems easier to trick the JVM into behaving as if it were on a well-configured machine with respect to default encoding than to set explicit encodings everywhere.
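For reference, option 2 would amount to changes along the following lines. This is only a sketch under assumptions: the helper class and its method names are hypothetical and do not correspond to existing Hive code, and the actual call sites would need to be found individually.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper, not part of Hive.
class ExplicitUtf8 {
    static final Charset UTF8 = Charset.forName("UTF-8");

    // Replaces s.getBytes(), which would use the JVM default charset.
    static byte[] toBytes(String s) {
        return s.getBytes(UTF8);
    }

    // Replaces new String(bytes), which would use the JVM default charset.
    static String fromBytes(byte[] bytes) {
        return new String(bytes, UTF8);
    }

    // Replaces a Writer created without a charset argument.
    static byte[] writeText(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            Writer w = new OutputStreamWriter(out, UTF8);
            try {
                w.write(s);
            } finally {
                w.close();
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked
            // to keep the sketch simple.
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }
}
```

With calls like these, the output no longer depends on file.encoding, so the result is the same whether or not option 1 is also applied.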
- Pretty much any globalized strings other than Western European ones will be corrupted by the current Hive service on Windows while this bug is present, because there is no way for the JVM to read the environment and determine that UTF-8 should be the default encoding.
- Anyone can repro this on Linux fairly easily: add "export LANG=en_US.CP1252" to /etc/profile to set the global LANG default encoding to CP1252 explicitly, then restart the service and run a query over internationalized UTF-8 data.
- We shouldn't rely on JVM default codepage selection if we want to support UTF-8 consistently and reliably as the default encoding.
- The estimate could vary widely, but adding an explicit default encoding at startup should, in theory, take little time for someone who knows where to do it.
- I don't know where to update the JVM start arguments used when the service is started; I'm getting into the code for the first time with this bug investigation.