Pig / PIG-2319

Pig should support snappy as a value for pig.tmpfilecompression.codec

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.1, 0.9.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Utils.tmpFileCompressionCodec() hard-codes support for only "gz" and "lzo" compression. Since support for snappy was added in HADOOP-7206, it would be nice to allow this codec as well.

      A future-proof solution to this problem might let the user provide a full classname (like in the hadoop settings) or the short-hand, in case the short-hand doesn't exist for a given codec.
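As a rough illustration of the lookup described above, the sketch below accepts either a short-hand or a fully qualified class name. It is not Pig's actual code: the class `TmpCodecResolver`, the method `resolve`, and the short-hand table are hypothetical, and the class names in the table are assumptions about which codec each short-hand would map to.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper; not part of Pig or Hadoop. */
public class TmpCodecResolver {
    // Short-hand -> codec class name. "gz" and "lzo" are the two values the issue
    // says are hard-coded today; "snappy" is the proposed addition. The class
    // names on the right are assumptions made for this sketch.
    private static final Map<String, String> SHORT_NAMES = new HashMap<>();
    static {
        SHORT_NAMES.put("gz", "org.apache.hadoop.io.compress.GzipCodec");
        SHORT_NAMES.put("lzo", "com.hadoop.compression.lzo.LzoCodec");
        SHORT_NAMES.put("snappy", "org.apache.hadoop.io.compress.SnappyCodec");
    }

    /** Returns the codec class name for a short-hand or a fully qualified class name. */
    public static String resolve(String value) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("no codec configured");
        }
        String trimmed = value.trim();
        String known = SHORT_NAMES.get(trimmed.toLowerCase(Locale.ROOT));
        if (known != null) {
            return known; // recognized short-hand
        }
        if (trimmed.indexOf('.') > 0) {
            return trimmed; // looks like a class name, pass it through unchanged
        }
        throw new IllegalArgumentException("unknown codec short-hand: " + trimmed);
    }
}
```

With this shape, `resolve("snappy")` and `resolve("org.apache.hadoop.io.compress.SnappyCodec")` give the same result, and a codec without a short-hand can still be configured by class name.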

          Activity

          andrewlook Andrew Look added a comment -

          I'm working on a patch for HADOOP-7990 - any thoughts on which version of hadoop-common this should be applied to?

          prkommireddi Prashant Kommireddi added a comment -

          Rakesh, this needs to be done on the Hadoop project. There is a JIRA for this: https://issues.apache.org/jira/browse/HADOOP-7990

          rkothari Rakesh Kothari added a comment -

          Any updates on this?

          prkommireddi Prashant Kommireddi added a comment -

          Hi Dmitriy, I tested reading a Snappy-compressed file with PigStorage and it works just fine.

          grunt> set output.compression.enabled true;
          grunt> set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
          grunt> A = load 'input';
          grunt> rmf out;
          grunt> STORE A INTO 'out';

          Pig generates a Snappy-compressed file at location "out".

          grunt> C = load 'out';
          grunt> D = LIMIT C 10;
          grunt> DUMP D;

          The above successfully reads the Snappy-compressed file, as PigStorage uses the Hadoop TextInputFormat in this case.

          However, this is not the case for temporary files that Pig creates between multiple MR jobs, because those are written with the TFile writer, which supports only LZO and GZ. Do you see a workaround that would let us support Snappy in this case?

          prkommireddi Prashant Kommireddi added a comment -

          TFileStorage is used for writing temporary/intermediate files, and it looks only for gz and lzo. Code here will need to be modified so that temporary files can use other compression codecs.

          dvryaboy Dmitriy V. Ryaboy added a comment -

          Ok, so it looks like we'll automatically do the right thing for storage if we have output.compression.enabled and output.compression.codec set. We don't do the same for reading.
          PIG-2143 had a sketch of how to make the whole thing a more flexible implementation (see the first comment). Should be straightforward enough to allow specifying "-compression=$foo" and have the codec for $foo looked up dynamically.

          dvryaboy Dmitriy V. Ryaboy added a comment -

          Yeah I even had a patch to do that somewhere (we cleaned up PigStorage a bunch in the fall..)

          prkommireddi Prashant Kommireddi added a comment -

          Seems like a good addition. I tested Snappy with Pig and it works just fine on compressing map output (uses the underlying Hadoop framework). We should extend the default Load/Store func to be able to use Snappy.

          Currently LZO and GZ are hardcoded, but it would be nice to pick up the available codecs from "io.compression.codecs" and make them available. Thoughts?
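One way the idea above could look, as a sketch only: take the comma-separated class names from "io.compression.codecs" and derive a short-hand for each. The class `CodecShortNames`, the method `fromProperty`, and the naming rule (simple class name, minus a trailing "Codec", lowercased) are all hypothetical choices for illustration; the property value is passed in as a plain string, so nothing here depends on Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper; not part of Pig or Hadoop. */
public class CodecShortNames {
    /**
     * Builds a short-hand -> class name map from a comma-separated list of
     * codec class names, such as the value of "io.compression.codecs".
     */
    public static Map<String, String> fromProperty(String codecsProperty) {
        Map<String, String> byShortName = new LinkedHashMap<>();
        if (codecsProperty == null) {
            return byShortName;
        }
        for (String entry : codecsProperty.split(",")) {
            String className = entry.trim();
            if (className.isEmpty()) {
                continue; // tolerate stray commas
            }
            // Simple class name: everything after the last dot.
            String simple = className.substring(className.lastIndexOf('.') + 1);
            // Assumed naming rule: drop a trailing "Codec", then lowercase.
            if (simple.endsWith("Codec")) {
                simple = simple.substring(0, simple.length() - "Codec".length());
            }
            byShortName.put(simple.toLowerCase(Locale.ROOT), className);
        }
        return byShortName;
    }
}
```

Under this naming rule, GzipCodec would be registered as "gzip" rather than "gz", so the existing "gz" value would need an explicit alias to keep working.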


            People

            • Assignee: Unassigned
            • Reporter: joecrobak Joe Crobak
            • Votes: 2
            • Watchers: 9
