Hive / HIVE-1138

Hive using LZO compression returns unexpected results.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Not A Problem
    • Affects Version/s: 0.6.0
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels: None
    • Environment:

      hadoop 0.20.1, hive trunk 2010-02-03

      Description

      I have a tab-separated file which I loaded with "load data inpath"; then I do:

      SET hive.exec.compress.output=true;
      SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
      SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
      select distinct login_cldr_id as cldr_id from chatsessions_load;

      Ended Job = job_201001151039_1641
      OK
      NULL
      NULL
      NULL
      Time taken: 49.06 seconds

      However, if I run it without the set commands I get this:
      Ended Job = job_201001151039_1642
      OK
      2283
      Time taken: 45.308 seconds

      Which is the correct result.

      When I do an "insert overwrite" into an RCFile table, it actually compresses the data correctly.
      When I disable compression and query this new table the result is correct.
      When I enable compression it's wrong again.
      I see no errors in the logs.

      1. test.csv (0.0 kB, Bennie Schut)

        Activity

        Bennie Schut added a comment -

        Doesn't seem to happen when I set compression to Gzip.

        set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
        set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

        Bennie Schut added a comment -

        On the filesystem I do find a job file like this: attempt_201001151039_1841_r_000001_0.lzo_deflate
        which contains the correct value in a compressed format:
        @@^@E@@@ ^V2283
        Q@^@
        Perhaps Hive reads this as a non-compressed file?

        He Yongqiang added a comment -

        Hi Bennie

        Are you seeing this on an RCFile table or a table with another file format?

        I have a tab-separated file which I loaded with "load data inpath"; then I do:
        SET hive.exec.compress.output=true;
        SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
        SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
        select distinct login_cldr_id as cldr_id from chatsessions_load;

        Ended Job = job_201001151039_1641
        OK
        NULL
        NULL
        NULL
        Time taken: 49.06 seconds

        however if I start it without the set commands I get this:
        Ended Job = job_201001151039_1642
        OK
        2283
        Time taken: 45.308 seconds

        What is the file format here?

        Bennie Schut added a comment -

        This example is on a text table:
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

        but when I do this:
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS RCFILE;
        I get the same result.

        He Yongqiang added a comment -

        1) Can you check whether LzoCodec is loaded/installed correctly? (Was LzoCodec removed from the Hadoop version you are using? If so, how did you install and use it in Hive?)
        2) If yes for 1), can you upload a small piece of data? No need for real data; I think a test table with some data will be OK. Just make sure I can reproduce the problem.

        Bennie Schut added a comment -

        On 0.20.1 LZO is removed. I installed the "hadoop-gpl-compression-read-only" code from googlecode.com and it seems to work correctly on Hadoop.

        On the reduce step I see things like this in the logs:

        2010-02-08 22:06:36,554 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
        2010-02-08 22:06:36,555 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library
        2010-02-08 22:06:36,556 INFO org.apache.hadoop.hive.ql.io.CodecPool: Got brand-new compressor
        2010-02-08 22:06:36,558 INFO org.apache.hadoop.hive.ql.io.CodecPool: Got brand-new compressor

        2) I'll add some data + example code tomorrow morning.

        Thanks for looking at this.

        Zheng Shao added a comment -

        Bennie, I think the problem is that when Hive is reading the data and printing it to the screen, TextInputFormat didn't pick up the codec for the output file: attempt_201001151039_1841_r_000001_0.lzo_deflate

        I remember TextInputFormat takes the extension of the file name and then decides what codec to use. I think LzoCodec is not correctly configured to handle *.lzo_deflate files.
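The extension-based lookup described above can be sketched in a few lines. This is a hypothetical, simplified model, not the actual Hadoop TextInputFormat/CompressionCodecFactory source; the codec class names and the .lzo_deflate extension come from this thread, while the registry and lookup logic are assumptions for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch: only codecs registered (via io.compression.codecs) map
// an extension to a decompressor. A file whose extension matches no entry is
// read as plain text, which for LZO output yields garbage/NULL rows.
public class CodecLookupSketch {

    // extension -> codec class name; LzoCodec is missing here, mirroring a
    // core-site.xml that omits it from io.compression.codecs
    static final Map<String, String> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        REGISTRY.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
    }

    /** Returns the codec class for a file name, or null if none matches. */
    static String codecFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? null : REGISTRY.get(fileName.substring(dot));
    }

    public static void main(String[] args) {
        // Gzip output resolves to a codec, so it decompresses fine...
        System.out.println(codecFor("attempt_0_r_000001_0.gz"));
        // ...but the LZO reduce output matches nothing and is read raw
        System.out.println(codecFor("attempt_201001151039_1841_r_000001_0.lzo_deflate"));
    }
}
```

Under this model, the .gz result works while the .lzo_deflate lookup returns null, matching the Gzip-works/LZO-fails behavior reported above.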

        Bennie Schut added a comment -

        I was looking a little bit in that direction, but found this in the com.hadoop.compression.lzo.LzoCodec file:

          /**
           * Get the default filename extension for this kind of compression.
           * @return the extension including the '.'
           */
          public String getDefaultExtension() {
            return ".lzo_deflate";
          }

        This looks the same as what the Gzip codec does:

          public String getDefaultExtension() {
            return ".gz";
          }
        Bennie Schut added a comment -

        How to reproduce the problem:

        CREATE TABLE test_load (
          id     int
        , code   string
        ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
        
        LOAD DATA INPATH '/user/dwh/test.csv' INTO TABLE test_load;
        
        -- this one correctly returns 5 rows.
        select distinct id from test_load;
        
        SET hive.exec.compress.output=true;
        SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
        SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
        
        -- this one returns incorrect results.
        select distinct id from test_load;
        
        Bennie Schut added a comment -

        Ah, a clear case of RTFM.
        The codec needs to be in the list of codecs, like this:

        <property>
         <name>io.compression.codecs</name>
         <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
        </property>
        

        So this is a configuration mistake and not a bug in Hive.
        I just wouldn't have expected this behavior, since it partially seemed to work.
        Hopefully someone else can learn from my mistake.

        Thanks Zheng and He for the support on this.
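The effect of the fix can be modeled with a small sketch: once a codec class is named in the comma-separated io.compression.codecs value, its default extension becomes resolvable. This is a hypothetical illustration, not Hadoop's real CompressionCodecFactory; the class names and extensions come from this thread and each codec's getDefaultExtension(), while the parsing logic is an assumption:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: build an extension -> codec registry from the
// io.compression.codecs property value, as in the core-site.xml fix above.
public class CodecListSketch {

    // default extensions reported by each codec's getDefaultExtension()
    static final Map<String, String> DEFAULT_EXT = Map.of(
        "org.apache.hadoop.io.compress.GzipCodec", ".gz",
        "org.apache.hadoop.io.compress.DefaultCodec", ".deflate",
        "com.hadoop.compression.lzo.LzoCodec", ".lzo_deflate",
        "com.hadoop.compression.lzo.LzopCodec", ".lzo",
        "org.apache.hadoop.io.compress.BZip2Codec", ".bz2");

    /** Parses the comma-separated codec list into extension -> class. */
    static Map<String, String> registryFrom(String codecs) {
        Map<String, String> reg = new HashMap<>();
        for (String cls : codecs.split(",")) {
            String ext = DEFAULT_EXT.get(cls.trim());
            if (ext != null) reg.put(ext, cls.trim());
        }
        return reg;
    }

    public static void main(String[] args) {
        String fixed = "org.apache.hadoop.io.compress.GzipCodec,"
                + "com.hadoop.compression.lzo.LzoCodec";
        // with LzoCodec listed, .lzo_deflate now resolves to a codec
        System.out.println(registryFrom(fixed).get(".lzo_deflate"));
    }
}
```

In this model, leaving LzoCodec out of the list leaves .lzo_deflate unmapped, which is exactly the state the configuration fix above corrects.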


          People

          • Assignee: Bennie Schut
          • Reporter: Bennie Schut
          • Votes: 0
          • Watchers: 2