[HIVE-6428] concat_ws mangles non-ASCII characters - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 0.13.0
Fix Version/s: None
Component/s: UDF
Labels:
None

Description

Marked critical because this results in data loss from using built-in functionality. I think the issue is concat_ws, though I suppose it could be the VIEW as well.

Hive is losing the distinction between non-ASCII characters, folding distinct values into the same value. Here are steps to reproduce, and I've attached a small sample containing 3 distinct lines from the larger input file.

Grab sample data, confirm the number of total records and the number of unique combinations of the first two columns match.

$ mkdir /tmp/pagecounts
$ cd /tmp/pagecounts
$ wget http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-000000.gz
$ gzcat pagecounts-20081001-000000.gz | cut -d\  -f1,2 | wc -l
 4268675
$ gzcat pagecounts-20081001-000000.gz | cut -d\  -f1,2 | sort | uniq | wc -l
 4268675

Create hive table over input data.

CREATE EXTERNAL TABLE pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
ROW FORMAT
  DELIMITED FIELDS TERMINATED BY ' '
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/tmp/pagecounts/';

confirm number of unique combinations of the first two columns

SELECT count(DISTINCT projectcode, pagename) FROM pagecounts;
=> 4268675

Create a view over the raw data, concatenating first two columns. Distinct count does not match.

CREATE VIEW pgc_simple (rowkey, pageviews, bytes) AS
SELECT concat_ws('/', projectcode, pagename), pageviews, bytes
FROM pagecounts;

SELECT count(DISTINCT rowkey) FROM pgc_simple;
=> 4268561

Perform same "view" from shell. distinct count is retained.

$ gzcat pagecounts-20081001-000000.gz | awk '{print $1 "/" $2}' | wc -l
 4268675
$ gzcat pagecounts-20081001-000000.gz | awk '{print $1 "/" $2}' | sort | uniq | wc -l                                                                                                       
 4268675

Look at some data.

$ hive -e "SELECT i.* FROM (SELECT rowkey, count(*) AS nbr FROM pgc_simple GROUP BY rowkey) i WHERE i.nbr > 1;" | tail
OK
Time taken: 40.493 seconds, Fetched: 37 row(s)
ru/?�N  2
ru/�    3
zh/?��  2
zh/Category:��  2
zh/Wikipedia:�� 5
zh/�    2
zh/��   11
zh/��˹��        2
zh/���� 17
zh/������       15

Choose 2nd line of output to inspect on the shell. My locale isn't able to find a character for the codepoints, but sort | uniq identify them as different.

$ gzcat pagecounts-20081001-000000.gz | egrep '^ru \?. ' | cut -d\  -f1,2 | sort | uniq
ru ??
ru ??
ru ??

Print them as C-escape codes. They are indeed distinct.

$ cat pagecounts-20081001-000000 | egrep '^ru \?. ' | cut -d\  -f1,2 | od -c
0000000    r   u       ? 240  \n   r   u       ? 244  \n   r   u       ?
0000020  247  \n
0000022

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

ru.txt
13/Feb/14 23:05
0.0 kB
Nick Dimiduk

Activity

People

Assignee:: Unassigned

Reporter:: Nick Dimiduk

Votes:: 1 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 13/Feb/14 23:05

Updated:: 13/Feb/14 23:05