Affects Version/s: 0.13.0
Fix Version/s: None
Marked critical because this causes data loss when using only built-in functionality. I suspect the issue is in concat_ws, though I suppose it could be the VIEW as well.
Hive is losing the distinction between non-ASCII characters, folding distinct values into the same value. Here are steps to reproduce, and I've attached a small sample containing 3 distinct lines from the larger input file.
Grab the sample data and confirm that the total number of records matches the number of unique combinations of the first two columns.
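For anyone who wants to repeat this check without relying on locale-sensitive tools, here is a minimal sketch in Python that works on raw bytes. It assumes the input is tab-delimited, which is not stated above, and the helper name `count_records_and_pairs` and the inline sample rows are hypothetical stand-ins for the attached file.

```python
def count_records_and_pairs(lines, delimiter=b"\t"):
    """Return (total_records, distinct_pairs) for an iterable of byte lines.

    Fields are compared as raw bytes, so no character decoding can
    merge two different values.
    """
    total = 0
    pairs = set()
    for line in lines:
        line = line.rstrip(b"\r\n")
        if not line:
            continue  # ignore blank lines
        total += 1
        # Key on the first two columns only (assumed tab-delimited).
        pairs.add(tuple(line.split(delimiter)[:2]))
    return total, len(pairs)


# Hypothetical stand-in for the attached sample: three rows whose first
# two columns differ only in non-ASCII bytes.
sample = [
    b"a\t\xc3\xa9\tx\n",
    b"a\t\xc3\xa8\ty\n",
    b"b\t\xc3\xa9\tz\n",
]
total, distinct = count_records_and_pairs(sample)
print(total, distinct)
```

To run it against a real file, pass an object opened in binary mode, for example `open(path, "rb")`, in place of `sample`.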
Create a Hive table over the input data.
Confirm the number of unique combinations of the first two columns.
Create a view over the raw data, concatenating the first two columns. The distinct count does not match.
Perform the same "view" operation from the shell. The distinct count is retained.
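The sketch below illustrates one way a concatenation step could fold distinct values together while a byte-wise concatenation keeps them apart. It is only a hypothesis: nothing in this report establishes that Hive's concat_ws does a lossy decode. The functions `concat_ws_bytes` and `concat_ws_lossy`, the `|` separator, and the sample rows are all made up for illustration.

```python
def concat_ws_bytes(sep, *parts):
    """Join byte strings without interpreting them as characters."""
    return sep.join(parts)


def concat_ws_lossy(sep, *parts):
    """Join after a lossy UTF-8 decode (assumed failure mode, not Hive's
    confirmed behavior): undecodable bytes become U+FFFD."""
    return sep.join(p.decode("utf-8", errors="replace") for p in parts)


# Hypothetical rows: the second column holds single bytes that are not
# valid UTF-8 on their own, so each decodes to the same replacement char.
rows = [(b"k", b"\xe9"), (b"k", b"\xe8"), (b"k", b"\xea")]

bytewise = {concat_ws_bytes(b"|", a, b) for a, b in rows}
lossy = {concat_ws_lossy("|", a, b) for a, b in rows}

print(len(bytewise), len(lossy))  # 3 1
```

If the real data shows the same pattern (distinct bytes, identical decoded text), that would point toward character decoding rather than the view definition itself.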
Look at some data.
Choose the 2nd line of output to inspect on the shell. My locale isn't able to find a character for the codepoints, but sort | uniq identifies them as different.
Print them as C-escape codes. They are indeed distinct.
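A locale-independent way to do the same inspection is to render each value's raw bytes as C-style octal escapes. This is a minimal sketch; the helper `c_escape` and the two example values are hypothetical, not the actual strings from the sample file.

```python
def c_escape(data: bytes) -> str:
    """Render bytes with printable ASCII kept as-is and everything else
    (including the backslash itself) as a three-digit octal escape."""
    out = []
    for byte in data:
        if 0x20 <= byte < 0x7F and byte != 0x5C:
            out.append(chr(byte))
        else:
            out.append("\\%03o" % byte)
    return "".join(out)


# Two hypothetical values that may render identically in a terminal
# whose locale cannot display them, yet differ at the byte level.
first = b"a\xc3\xa9"
second = b"a\xc3\xa8"

print(c_escape(first))   # a\303\251
print(c_escape(second))  # a\303\250
```

Comparing the escaped forms makes it clear whether two values that look the same on screen are in fact different byte sequences.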