Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Impala 1.0.1, Impala 1.1
Description
Impala seems to have trouble with Snappy-compressed RCFiles containing null values. I'm using Hive 0.11.0; it's possible that the problem is on Hive's side, but Hive can read its own output just fine.
Impala reports the following error:
Decompressor: block size is too big. Data is likely corrupt. Size: 0
Example reproduction instructions (with long outputs left out as [...]):
$ echo $'\x1'foo > data.txt
$ cat > script.hql <<EOF
CREATE TABLE text(a STRING, b STRING) STORED AS TEXTFILE;
CREATE TABLE rc(a STRING, b STRING) STORED AS RCFILE;
LOAD DATA LOCAL INPATH "data.txt" INTO TABLE text;
SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT INTO TABLE rc SELECT * FROM text;
EOF
$ hive -f script.hql
[...]
$ impala-shell -ri $insertImpalaNodeHere -q 'SELECT * FROM text'
[...]
Returned 1 row(s) in 0.28s
$ impala-shell -ri $insertImpalaNodeHere -q 'SELECT * FROM rc'
[...]
Returned 0 row(s) in 0.27s
I expect one row with a = NULL and b = "foo", but get 0 rows instead.
impalad's debug output for the query is attached.
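For anyone reproducing this, it may help to confirm that the input file really has an empty first field before involving Hive or Impala. The sketch below is not part of the original report. It assumes a POSIX shell with od, awk and mktemp available, writes to a temporary file rather than data.txt, and uses printf instead of the bash-only $'\x1' form. It only checks the raw bytes; whether the empty field surfaces as NULL or as an empty string depends on the table's SerDe settings.

```shell
# Hypothetical sanity check, not taken from the original report.
# Write the same single line as the reproduction: Ctrl-A (0x01), "foo", newline.
datafile=$(mktemp)
printf '\001foo\n' > "$datafile"

# Dump the raw bytes so the leading 0x01 delimiter is visible.
od -c "$datafile"

# Split on Ctrl-A and print: field count, length of field 1, length of field 2.
fields=$(awk -F "$(printf '\001')" '{ print NF ":" length($1) ":" length($2) }' "$datafile")
echo "$fields"
```

If the file is as intended, the last line printed is 2:0:3, meaning two fields, the first empty and the second three characters long.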