HBASE-6040: Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.94.0, 0.95.2
    • Fix Version/s: 0.94.1
    • Component/s: mapreduce
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Adds a new config param, "hbase.mapreduce.hfileoutputformat.datablock.encoding", which specifies the encoding scheme to use on disk; data is written into HFiles using this encoding scheme during bulk load. Valid values are NONE, PREFIX, DIFF, and FAST_DIFF, the DataBlockEncoding types supported at present. [When new types are added later, the corresponding names will also become valid.]
      The checksum type and the number of bytes per checksum can be configured using the config params hbase.hstore.checksum.algorithm and hbase.hstore.bytes.per.checksum respectively.
    • Tags:
      bulkload
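
      As a sketch, the three parameters named in the release note could be set in hbase-site.xml or on the job's Configuration. The values below are illustrative examples chosen for this sketch, not documented defaults; only the parameter names come from the release note:

      ```xml
      <!-- Illustrative values; only the property names are from the release note. -->
      <property>
        <name>hbase.mapreduce.hfileoutputformat.datablock.encoding</name>
        <value>FAST_DIFF</value>
      </property>
      <property>
        <name>hbase.hstore.checksum.algorithm</name>
        <value>CRC32</value>
      </property>
      <property>
        <name>hbase.hstore.bytes.per.checksum</name>
        <value>16384</value>
      </property>
      ```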

      Description

      When data is bulk loaded using HFileOutputFormat, the block encoding and the HBase-handled checksum features are not used. When the writer is created for making the HFile, no such info is passed to the WriterBuilder.
      In HFileOutputFormat.getNewWriter(byte[] family, Configuration conf), we don't have this info and do not pass it to the writer, so those HFiles will not have these optimizations.

      Later, in LoadIncrementalHFiles.copyHFileHalf(), where we physically divide one HFile (created by the MR job) if it cannot belong to just one region, I can see we pass the data block encoding details and checksum details to the new HFile writer. But this step won't happen normally, I think.
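
      The shape of the fix is for the output format to resolve the configured encoding name, falling back to NONE when the param is unset. A minimal self-contained sketch of that resolution step; the enum values come from the release note, while the class and method names here are hypothetical, not HBase's actual API:

      ```java
      /** Sketch of resolving the configured data block encoding name. */
      public class EncodingConfigSketch {
          // Mirrors the DataBlockEncoding values listed in the release note.
          public enum Encoding { NONE, PREFIX, DIFF, FAST_DIFF }

          /** Resolve a configured value, defaulting to NONE when unset. */
          public static Encoding resolve(String configured) {
              if (configured == null || configured.isEmpty()) {
                  return Encoding.NONE; // unset param means no encoding
              }
              return Encoding.valueOf(configured.toUpperCase());
          }

          public static void main(String[] args) {
              System.out.println(resolve("FAST_DIFF")); // FAST_DIFF
              System.out.println(resolve(null));        // NONE
          }
      }
      ```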

      1. HBASE-6040_94.patch
        3 kB
        Anoop Sam John
      2. HBASE-6040_Trunk.patch
        3 kB
        Anoop Sam John


          Activity

          Anoop Sam John added a comment -

          Thanks Stack.
          Created new bug HBASE-6164

          stack added a comment -

          Make a new one at this stage I'd suggest Anoop; this one has been closed for > a day or so. Thanks.

          I feel it is better to be moved to close() in HFile.Writer

          Ok.

          StoreFile is a wrapper around hfile to add 'hbase' stuff (we have tried to keep hfile 'pure', unpolluted by hbase-isms... I don't think we succeeded but that was the idea).

          Anoop Sam John added a comment -

          HFileDataBlockEncoder interface usage is mainly at the HFile.Writer level. Why are we making this saveMetadata() call from StoreFile.Writer?
          I feel it is better moved to close() in HFile.Writer.
          Anyway, the signature change would be needed here also.

          Note: Handling of bloom is done fully at the StoreFile level.

          Anoop Sam John added a comment -

          HFileDataBlockEncoder is a private interface. Can we change the signature?

          HFileDataBlockEncoder#saveMetadata(StoreFile.Writer storeFileWriter)

          Anoop Sam John added a comment -

          Oh sorry, I missed that part.
          It is regarding HFileDataBlockEncoder#saveMetadata(StoreFile.Writer storeFileWriter).
          In bulk load we deal with HFileWriter directly, not through StoreFileWriter.
          The above saveMetadata() call happens from StoreFile.Writer#close() only.
          This saveMetadata() call only writes the encoder type into fileinfo.

          We might need to explicitly write this fileInfo from HFileOutputFormat.
          As for HFileDataBlockEncoder#saveMetadata(StoreFile.Writer storeFileWriter), not sure why this takes StoreFile.Writer rather than HFile.Writer; the other methods in this interface deal with HFile or HFile blocks.

          @Stack I will reopen this JIRA?

          Thanks Gopi for noticing this and raising it. Regarding the point about usage of bloom, we will track it through another ticket (if ok).

          ramkrishna.s.vasudevan added a comment -

          HBASE-3776 is still open for supporting bloom filter on bulkload.

          Gopinathan A added a comment -

          Some more things need to be taken care of for block encoding in the case of bulk load.

          Getting the following exception while scanning the table:

          2012-06-05 15:39:24,771 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed openScanner
          java.lang.AssertionError: Expected on-disk data block encoding NONE, got PREFIX
          	at org.apache.hadoop.hbase.io.hfile.HFileDataBlockEncoderImpl.diskToCacheFormat(HFileDataBlockEncoderImpl.java:151)
          	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:329)
          	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:951)
          	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:229)
          	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:145)
          	at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:130)
          	at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:2044)
          	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3307)
          	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1630)
          	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1622)
          	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1598)
          	at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:2317)
          	at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          

          Also better to support BloomFilter in bulkload.
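
          [Editor's note] The AssertionError above is the reader finding PREFIX-encoded blocks where it expected NONE. For context, the idea behind PREFIX encoding is that each key stores only the length of the prefix it shares with the previous key plus the differing suffix. A toy illustration of that idea, not HBase's actual on-disk format:

          ```java
          import java.util.ArrayList;
          import java.util.List;

          /** Toy illustration of shared-prefix elision (not HBase's real format). */
          public class PrefixEncodingSketch {
              /** Encode sorted keys as "commonPrefixLength:suffix" entries. */
              public static List<String> encode(List<String> sortedKeys) {
                  List<String> out = new ArrayList<>();
                  String prev = "";
                  for (String key : sortedKeys) {
                      int common = 0;
                      int max = Math.min(prev.length(), key.length());
                      while (common < max && prev.charAt(common) == key.charAt(common)) {
                          common++;
                      }
                      out.add(common + ":" + key.substring(common));
                      prev = key;
                  }
                  return out;
              }

              public static void main(String[] args) {
                  List<String> keys = List.of("row0001", "row0002", "row0010");
                  // Long shared row-key prefixes collapse to a count plus a short suffix.
                  System.out.println(encode(keys)); // [0:row0001, 6:2, 5:10]
              }
          }
          ```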

          Hudson added a comment -

          Integrated in HBase-0.94-security #33 (See https://builds.apache.org/job/HBase-0.94-security/33/)
          HBASE-6040 Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat (Revision 1344561)

          Result = FAILURE
          stack :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java
          Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #34 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/34/)
          HBASE-6040 Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat (Revision 1344560)

          Result = FAILURE
          stack :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java
          Hudson added a comment -

          Integrated in HBase-TRUNK #2962 (See https://builds.apache.org/job/HBase-TRUNK/2962/)
          HBASE-6040 Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat (Revision 1344560)

          Result = FAILURE
          stack :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java
          Hudson added a comment -

          Integrated in HBase-0.94 #239 (See https://builds.apache.org/job/HBase-0.94/239/)
          HBASE-6040 Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat (Revision 1344561)

          Result = FAILURE
          stack :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java
          stack added a comment -

          Applied to 0.94 branch and to trunk. Thanks for the patch Anoop.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12530320/HBASE-6040_Trunk.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2069//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2069//console

          This message is automatically generated.

          Anoop Sam John added a comment -

          Patch for trunk

          stack added a comment -

          +1

          Make a trunk patch Anoop and submit it to hadoopqa?

          Ted Yu added a comment -

          TestHFileOutputFormat passed with the patch.
          +1 from me.

          Running through test suite is desirable.

          Anoop Sam John added a comment -

          Patch prepared for 0.94

          Anoop Sam John added a comment -

          Will upload a patch tomorrow. Need to test in cluster.


            People

            • Assignee: Anoop Sam John
            • Reporter: Anoop Sam John
            • Votes: 0
            • Watchers: 12
