  1. Pig
  2. PIG-1501

need to investigate the impact of compression on pig performance


Details

    • Type: Test
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
      This feature saves the HDFS space used to store Pig's intermediate data and can improve query execution speed. In general, the more intermediate data a query generates, the greater the storage savings and the potential speedup.

      This feature introduces no backward compatibility issues.

      Two Java properties control the behavior:

      pig.tmpfilecompression (default: false) specifies whether temporary files are compressed.

      pig.tmpfilecompression.codec specifies which compression codec to use when pig.tmpfilecompression is true. Currently, Pig accepts only "gz" and "lzo" as values. Because LZO is licensed under the GPL, Hadoop may need to be configured separately to use the LZO codec. Refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.


      An example is the following "test.pig" script:

      register pigperf.jar;
      A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
      as (user, action, timespent:long, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
      B1 = filter A by timespent == 4;
      B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
      C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
      D = distinct C parallel 300;
      store D into 'output.lzo';

      which is launched as follows:

      java -cp /path/to/pig/pig.jar -Djava.library.path=/path/to/lzo2/lib/native/Linux-i386-32 -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main ./test.pig
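
      To switch between the two accepted codecs without retyping the whole command, the launch line above can be wrapped in a small helper. The following is only a sketch: the function name pig_compress_cmd and the variables PIG_JAR and NATIVE_LIB are hypothetical, the paths are the placeholders from the example above, and it assumes that the "gz" codec does not need the LZO native library path. The helper only prints the command so it can be inspected before running.

```shell
#!/bin/sh
# Sketch: build the Pig launch command for a given temp-file codec.
# PIG_JAR and NATIVE_LIB are placeholder paths taken from the example above.
PIG_JAR=/path/to/pig/pig.jar
NATIVE_LIB=/path/to/lzo2/lib/native/Linux-i386-32

# Prints the command line for the given codec and script.
# Returns 1 for any codec other than the two values the release note lists.
pig_compress_cmd() {
    codec=$1
    script=$2
    case "$codec" in
        gz)
            # Assumption: gz needs no extra native library path.
            libopt=""
            ;;
        lzo)
            libopt="-Djava.library.path=$NATIVE_LIB "
            ;;
        *)
            echo "unsupported codec: $codec (expected gz or lzo)" >&2
            return 1
            ;;
    esac
    echo "java -cp $PIG_JAR ${libopt}-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=$codec org.apache.pig.Main $script"
}

# Print both variants for the example script.
pig_compress_cmd lzo ./test.pig
pig_compress_cmd gz ./test.pig
```

      Rejecting unknown codec names in the wrapper catches a mistyped value before a job is submitted; to actually launch the job, the printed line can be run as-is.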

    Description

      We would like to understand how compressing map results as well as reducer output in a chain of MR jobs impacts performance. We can use the PigMix queries for this investigation.

      Attachments

        1. PIG-1501.patch
          57 kB
          Yan Zhou
        2. PIG-1501.patch
          47 kB
          Yan Zhou
        3. PIG-1501.patch
          45 kB
          Yan Zhou
        4. compress_perf_data_2.txt
          1 kB
          Yan Zhou
        5. compress_perf_data.txt
          1 kB
          Yan Zhou

          People

            yanz Yan Zhou
            olgan Olga Natkovich