Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-6428

concat_ws mangles non-ASCII characters

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • 0.13.0
    • None
    • UDF
    • None

    Description

      Marked critical because this results in data loss from using built-in functionality. I think the issue is concat_ws, though I suppose it could be the VIEW as well.

      Hive is losing the distinction between non-ASCII characters, folding distinct values into the same value. Here are steps to reproduce, and I've attached a small sample containing 3 distinct lines from the larger input file.

      Grab sample data, confirm the number of total records and the number of unique combinations of the first two columns match.

      $ mkdir /tmp/pagecounts
      $ cd /tmp/pagecounts
      $ wget http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-000000.gz
      $ gzcat pagecounts-20081001-000000.gz | cut -d\  -f1,2 | wc -l
       4268675
      $ gzcat pagecounts-20081001-000000.gz | cut -d\  -f1,2 | sort | uniq | wc -l
       4268675
      

      Create hive table over input data.

      CREATE EXTERNAL TABLE pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
      ROW FORMAT
        DELIMITED FIELDS TERMINATED BY ' '
        LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '/tmp/pagecounts/';
      

      confirm number of unique combinations of the first two columns

      SELECT count(DISTINCT projectcode, pagename) FROM pagecounts;
      => 4268675
      

      Create a view over the raw data, concatenating first two columns. Distinct count does not match.

      CREATE VIEW pgc_simple (rowkey, pageviews, bytes) AS
      SELECT concat_ws('/', projectcode, pagename), pageviews, bytes
      FROM pagecounts;
      
      SELECT count(DISTINCT rowkey) FROM pgc_simple;
      => 4268561
      

      Perform same "view" from shell. distinct count is retained.

      $ gzcat pagecounts-20081001-000000.gz | awk '{print $1 "/" $2}' | wc -l
       4268675
      $ gzcat pagecounts-20081001-000000.gz | awk '{print $1 "/" $2}' | sort | uniq | wc -l                                                                                                       
       4268675
      

      Look at some data.

      $ hive -e "SELECT i.* FROM (SELECT rowkey, count(*) AS nbr FROM pgc_simple GROUP BY rowkey) i WHERE i.nbr > 1;" | tail
      OK
      Time taken: 40.493 seconds, Fetched: 37 row(s)
      ru/?�N  2
      ru/�    3
      zh/?��  2
      zh/Category:��  2
      zh/Wikipedia:�� 5
      zh/�    2
      zh/��   11
      zh/��˹��        2
      zh/���� 17
      zh/������       15
      

      Choose 2nd line of output to inspect on the shell. My locale isn't able to find a character for the codepoints, but sort | uniq identify them as different.

      $ gzcat pagecounts-20081001-000000.gz | egrep '^ru \?. ' | cut -d\  -f1,2 | sort | uniq
      ru ??
      ru ??
      ru ??
      

      Print them as C-escape codes. They are indeed distinct.

      $ cat pagecounts-20081001-000000 | egrep '^ru \?. ' | cut -d\  -f1,2 | od -c
      0000000    r   u       ? 240  \n   r   u       ? 244  \n   r   u       ?
      0000020  247  \n
      0000022
      

      Attachments

        1. ru.txt
          0.0 kB
          Nick Dimiduk

        Activity

          People

            Unassigned Unassigned
            ndimiduk Nick Dimiduk
            Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: