This feature saves the HDFS space used to store Pig's intermediate data and can improve query execution speed. In general, the more intermediate data a query generates, the greater the storage savings and the speedup.
There are no backward compatibility issues as a result of this feature.
Two Java properties control the behavior:
pig.tmpfilecompression, which defaults to false, determines whether the temporary files are compressed. If it is true, then
pig.tmpfilecompression.codec specifies which compression codec to use. Currently, Pig accepts only "gz" and "lzo" as values. Because LZO is licensed under the GPL, Hadoop may need to be configured separately to use the LZO codec. Refer to
http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.
An example is the following "test.pig" script:
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';
which is launched as follows:
java -cp /path/to/pig/pig.jar -Djava.library.path=/path/to/lzo2/lib/native/Linux-i386-32 -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main ./test.pig
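The same launch can be adapted to the "gz" codec. The following is only a sketch: the helper name pig_tmpcompress_cmd is hypothetical and not part of Pig, the paths are the placeholders from the command above, and it assumes that the -Djava.library.path option is needed only to locate the LZO native library. The helper prints the command line rather than running it, so the output can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pig): prints the java command line that
# launches a Pig script with temporary-file compression enabled.
# Usage: pig_tmpcompress_cmd <codec> <script>
pig_tmpcompress_cmd() {
    codec="$1"
    script="$2"
    case "$codec" in
        gz)
            # Assumption: gz does not need the LZO native library path.
            native_opt=""
            ;;
        lzo)
            # Placeholder path taken from the example above.
            native_opt="-Djava.library.path=/path/to/lzo2/lib/native/Linux-i386-32 "
            ;;
        *)
            # Only "gz" and "lzo" are accepted codec values.
            echo "unsupported codec: $codec (expected gz or lzo)" >&2
            return 1
            ;;
    esac
    echo "java -cp /path/to/pig/pig.jar ${native_opt}-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=$codec org.apache.pig.Main $script"
}

# Print the command for running test.pig with gz-compressed temporary files.
pig_tmpcompress_cmd gz ./test.pig
```

Passing lzo instead of gz reproduces the command shown above; any other codec name is rejected with an error message on standard error.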