Hello all, I think I have found a bug in the Hive split function:
Summary: When calling split on a string of data, it returns all array items only if the last array item has a value. For example, if I have a tab-delimited string with 7 columns, and the first four are filled but the last three are blank, split returns only a 4-element array. If any number of "middle" columns are empty but the last item still has a value, it returns the proper number of columns. This was tested in Hive 0.9 and Hive 0.11.
(Note: \t represents a tab character, \x09. The line endings should be \n (UNIX style); I'm not sure what email will do to them.) Basically, my data is 7 lines, each holding the first 7 letters separated by tabs. On some lines I've left out certain letters, but kept the number of tabs exactly the same.
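For anyone who wants to reproduce this without the original file, here is a minimal Python sketch that builds lines of the shape described above. The letter case, the helper name `make_line`, and the particular blank patterns are assumptions for illustration, not the contents of the original input.txt.

```python
import sys

# Hypothetical stand-in for the test data: seven letters, tab-separated.
# Which fields are blank on which line is an assumption, not the original file.
LETTERS = ["a", "b", "c", "d", "e", "f", "g"]


def make_line(keep):
    """Return a 7-field tab-delimited line, blanking every field not in `keep`."""
    return "\t".join(
        letter if i in keep else "" for i, letter in enumerate(LETTERS)
    )


sample_lines = [
    make_line({0, 1, 2, 3, 4, 5, 6}),  # all seven fields filled
    make_line({0, 1, 2, 3}),           # last three fields blank
    make_line({0, 3, 6}),              # middle fields blank, last one filled
]

if __name__ == "__main__":
    # Write UNIX-style line endings regardless of platform defaults.
    for line in sample_lines:
        sys.stdout.write(line + "\n")
```

Every generated line contains exactly six tabs, so a split that keeps empty fields should always report seven columns.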
I then created a table with one column from that data:
DROP TABLE tmp_jo_tab_test;
CREATE table tmp_jo_tab_test (message_line STRING)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/input.txt'
OVERWRITE INTO TABLE tmp_jo_tab_test;
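OK, just to validate, I created a Python counting script:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    line = line[0:-1]  # strip the trailing newline
    out = line.split("\t")
    print(len(out))  # number of tab-separated fields on this line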
The output there is:
$ cat input.txt |./cnt_tabs.py
Based on that information, split on tab should return me 7 for each line as well:
hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;"
However, it does not. The line where only the first four letters are filled in (and blanks are passed in for the last three) returns only 4 splits, where there should technically be 7: four for the letters included and three blanks.
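One plausible explanation, which this report does not confirm, is that split is backed by a routine that discards trailing empty strings, as Java's String.split(regex) does with its default limit. The Python sketch below contrasts the two behaviours; the function names are hypothetical and the second function only emulates that rule, it is not Hive's code.

```python
def split_keep_trailing(line, sep="\t"):
    """Python's str.split keeps trailing empty fields (the expected result)."""
    return line.split(sep)


def split_drop_trailing(line, sep="\t"):
    """Emulate a split that discards trailing empty strings.

    This mirrors the documented behaviour of Java's String.split(regex)
    for a literal separator; it is an illustration, not Hive's implementation.
    """
    if sep not in line:
        # With no separator present, the whole input comes back as one element.
        return [line]
    parts = line.split(sep)
    while parts and parts[-1] == "":
        parts.pop()
    return parts


if __name__ == "__main__":
    trailing_blanks = "a\tb\tc\td\t\t\t"  # last three fields blank
    middle_blanks = "a\t\t\td\t\t\tg"     # blanks in the middle only

    print(len(split_keep_trailing(trailing_blanks)))  # 7
    print(len(split_drop_trailing(trailing_blanks)))  # 4
    print(len(split_keep_trailing(middle_blanks)))    # 7
    print(len(split_drop_trailing(middle_blanks)))    # 7
```

The emulated version reproduces the symptom described here: the count drops to 4 only when the blanks are at the end of the line, while blanks in the middle still yield 7.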
|Field||Original Value||New Value|
|Assignee|| ||Vikram Dixit K [ vikram.dixit ]|
|Attachment|| ||HIVE-5506.1.patch [ 12608603 ]|
|Attachment|| ||HIVE-5506.2.patch [ 12608808 ]|
|Status||Patch Available [ 10002 ]||Resolved [ 5 ]|
|Fix Version/s|| ||0.13.0 [ 12324986 ]|
|Resolution|| ||Fixed [ 1 ]|