[HIVE-12718] skip.footer.line.count misbehaves on larger text files - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.1.0
Fix Version/s: None
Component/s: None
Labels: None
Environment:

The bug was discovered and reproduced on a Cloudera Hadoop 5.4 distribution running on CentOS 6.4.

Description

We noticed that when a table is backed by a text file large enough to require splitting, the table's skip.footer.line.count property is not honored: the footer line is not skipped.

To reproduce, follow these steps:

1) Create a large file: for i in $(seq 1 100); do cat /usr/share/dict/words; done > large.txt
2) Upload it to HDFS (e.g., as /tmp/words)
3) Create an external table with skip.footer.line.count set:

CREATE EXTERNAL TABLE ext_words (word STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/tmp/words'
TBLPROPERTIES ("skip.header.line.count"="1", "skip.footer.line.count"="1");

4) Count how many times the last line of the file (assumed here to be ZZZ) appears: SELECT COUNT(*) FROM ext_words WHERE word = 'ZZZ';
5) Observe that the query returns 100 instead of the expected 99.
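
For reference, the expected count can be checked with a small model of the intended header/footer semantics. This is only a sketch, not Hive's implementation: the four-word list is a hypothetical stand-in for /usr/share/dict/words (assuming its last line is ZZZ and that ZZZ occurs once per copy), and visible_rows is an illustrative helper, not a Hive function.

```python
def visible_rows(lines, header=1, footer=1):
    """Rows a reader should expose after dropping header and footer lines."""
    end = len(lines) - footer
    return lines[header:end]


# Hypothetical stand-in for /usr/share/dict/words; the last line is 'ZZZ'.
words = ["A", "aardvark", "zebra", "ZZZ"]

# Same effect as concatenating the file 100 times in step 1.
large = words * 100

rows = visible_rows(large, header=1, footer=1)

expected = rows.count("ZZZ")  # footer skipped: the final 'ZZZ' is dropped
raw = large.count("ZZZ")      # footer not skipped: every 'ZZZ' is counted

print("expected with footer skipped:", expected)
print("count with footer kept:", raw)
```

With the footer dropped, the model yields 99; without it, 100, which matches the value the query returns in step 5.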

Investigation showed that this happens when more than one mapper is used for the job. When we increased the split size to force a single mapper, the problem did not occur.

There may be other related issues as well, such as the wrong line being skipped, but we have not reproduced those yet.

People

Assignee: Unassigned

Reporter: Gergely Nagy

Votes: 1

Watchers: 4

Dates

Created: 21/Dec/15 10:46

Updated: 09/Feb/17 07:09