Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
I ran the following command to generate TPC-H tables in ORC format using SNAPPY compression:
bin/load-data.py -w tpch -e core --table_formats=orc/snap/block
After it succeeded, I found that the generated files are still compressed with ZLIB:
$ hive --service orcfiledump hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0
Processing data file hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0 [length: 149783256]
Structure for hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0
File Version: 0.12 with ORC_135
Rows: 6001215
Compression: ZLIB <-------- not SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
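The check above can be automated. The following is a minimal sketch, assuming the dump text has already been captured (for example from the orcfiledump command shown above); the helper name orc_compression is hypothetical and is not part of load-data.py.

```python
import re

def orc_compression(dump_text):
    """Return the codec named on the 'Compression:' line of orcfiledump output, or None."""
    # The colon directly after 'Compression' keeps 'Compression size:' from matching.
    match = re.search(r"\bCompression:\s*(\w+)", dump_text)
    return match.group(1).upper() if match else None

# Sample text copied from the dump in this report.
sample = """File Version: 0.12 with ORC_135
Rows: 6001215
Compression: ZLIB
Compression size: 262144
"""
print(orc_compression(sample))  # prints ZLIB
```

A data-loading test could compare the returned value with the codec requested in --table_formats and fail on a mismatch.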
The Hive statements we use to generate the data are:
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.dynamic.partitions.pernode=10000;
set hive.auto.convert.join=true;
SET mapred.max.split.size=256000000;
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
INSERT OVERWRITE TABLE tpch_orc_snap.lineitem SELECT * FROM tpch.lineitem;
Setting mapred.output.compression.codec has no effect on tables stored as ORC, because the ORC writer does not read that setting. Instead, we need to set the table property "orc.compress" to "SNAPPY".
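One possible shape for the change is sketched below. It assumes the table already exists and that the property is applied with ALTER TABLE before the data is rewritten; the property could equally be set in TBLPROPERTIES when the table is created. The helper name orc_load_statements is hypothetical, and this is not necessarily how the actual fix was implemented.

```python
def orc_load_statements(target, source, codec="SNAPPY"):
    """Build Hive statements that set orc.compress and then repopulate an ORC table."""
    return [
        # ORC reads its codec from the table property,
        # not from mapred.output.compression.codec.
        f"ALTER TABLE {target} SET TBLPROPERTIES ('orc.compress'='{codec}');",
        f"INSERT OVERWRITE TABLE {target} SELECT * FROM {source};",
    ]

# Table names taken from the INSERT statement in this report.
for stmt in orc_load_statements("tpch_orc_snap.lineitem", "tpch.lineitem"):
    print(stmt)
```

After reloading, rerunning orcfiledump on the new files should show whether the Compression line now reports SNAPPY.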