Pig / PIG-1501

Need to investigate the impact of compression on Pig performance

    Details

    • Type: Test
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      This feature saves the HDFS space used to store Pig's intermediate data and can potentially improve query execution speed. In general, the more intermediate data a query generates, the greater the storage and speedup benefits.

      There are no backward compatibility issues as a result of this feature.

      Two Java properties control the behavior:

      pig.tmpfilecompression, which defaults to false, determines whether the temporary files are compressed. If it is true, then

      pig.tmpfilecompression.codec specifies which compression codec to use. Currently, Pig accepts only "gz" and "lzo" as values. Since LZO is under the GPL license, Hadoop may need to be configured to use the LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.


      An example is the following "test.pig" script:

      register pigperf.jar;
      A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
      as (user, action, timespent:long, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
      B1 = filter A by timespent == 4;
      B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
      C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
      D = distinct C parallel 300;
      store D into 'output.lzo';

      which is launched as follows:

      java -cp /path/to/pig/pig.jar -Djava.library.path=/path/to/lzo2/lib/native/Linux-i386-32 -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main ./test.pig
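
      When comparing runs with different compression settings, it may help to generate the two -D flags from a single codec name. The following is only a sketch: pig_tmp_compression_flags is a hypothetical helper, not part of Pig, and the "none" keyword is an assumption made here to represent the default (uncompressed) case.

```shell
#!/bin/sh
# Sketch: build the JVM flags that control Pig temp-file compression.
# pig_tmp_compression_flags is a hypothetical helper, not part of Pig.
pig_tmp_compression_flags() {
    codec="$1"
    case "$codec" in
        none)
            # Compression is off by default; state it explicitly.
            printf '%s\n' "-Dpig.tmpfilecompression=false"
            ;;
        gz|lzo)
            # The release note lists only these two codec values.
            printf '%s\n' "-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=$codec"
            ;;
        *)
            echo "unsupported codec: $codec (expected none, gz, or lzo)" >&2
            return 1
            ;;
    esac
}

# Example launch, left commented out because the paths are placeholders:
# java -cp /path/to/pig/pig.jar $(pig_tmp_compression_flags gz) org.apache.pig.Main ./test.pig
```

      With "lzo", the -Djava.library.path setting from the launch command above would presumably still be needed so that the native LZO library can be found.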

      Description

      We would like to understand how compressing map results as well as reducer output in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation.

      1. compress_perf_data.txt
        1 kB
        Yan Zhou
      2. compress_perf_data_2.txt
        1 kB
        Yan Zhou
      3. PIG-1501.patch
        45 kB
        Yan Zhou
      4. PIG-1501.patch
        47 kB
        Yan Zhou
      5. PIG-1501.patch
        57 kB
        Yan Zhou

        Activity

        Olga Natkovich made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]
        Yan Zhou made changes -
        Release Note: edited three times. The text matches the Release Note under Details above; earlier revisions used a site-specific launch command:

        java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar -Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main ./test.pig

        Yan Zhou added a comment - 26/Aug/10 11:14 AM
        Same text as the Release Note, without the paragraph describing the two properties.
        Thejas M Nair made changes -
        Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
        Hadoop Flags: [Reviewed]
        Resolution: Fixed [ 1 ]
        Yan Zhou made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Yan Zhou made changes -
        Attachment: PIG-1501.patch [ 12453048 ]
        Yan Zhou made changes -
        Attachment: PIG-1501.patch [ 12452690 ]
        Yan Zhou made changes -
        Attachment: PIG-1501.patch [ 12451722 ]
        Yan Zhou made changes -
        Attachment: compress_perf_data_2.txt [ 12451602 ]
        Yan Zhou made changes -
        Attachment: compress_perf_data.txt [ 12450849 ]
        Olga Natkovich created issue -

          People

          • Assignee:
            Yan Zhou
          • Reporter:
            Olga Natkovich
          • Votes:
            0
          • Watchers:
            1

            Dates

            • Created:
              Updated:
              Resolved:
