Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.20.1, 0.90.0
    • Component/s: io
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Tags:
      HFile

      Description

      HFile is a close mimic of Google's SSTable file format, and we want HFile to become a common file format for Hadoop in the near future.
      We will review the HFile code, record the comments here, and then provide a fix patch after the review.

      1. HFile-v3.patch
        4 kB
        Schubert Zhang
      2. HFile-v4.patch
        2 kB
        Schubert Zhang
      3. HFile-v5.patch
        2 kB
        Schubert Zhang
      4. binary-block-seek.patch
        3 kB
        Matteo Bertozzi

        Activity

        Schubert Zhang added a comment -

        Code review comments:

        1. HFile.Writer

        (1) private int totalBytes = 0; ==> private long totalBytes = 0;
        An int (max 2 GB) is too small to hold the total size of the uncompressed data.

        (2) private void finishBlock()
        Use an int for the variable "size".

        (3) private void checkKey(final byte [] key, final int offset, final int length)
        to reject appending a duplicated key.

        if (this.comparator.compare(this.lastKeyBuffer, this.lastKeyOffset,
        this.lastKeyLength, key, offset, length) > 0)

        should be:

        if (this.comparator.compare(this.lastKeyBuffer, this.lastKeyOffset,
        this.lastKeyLength, key, offset, length) >= 0)

        (4) private long writeFileInfo(FSDataOutputStream o)

        int avgValueLen = this.entryCount == 0? 0:
        (int)(this.keylength/this.entryCount);

        should be:

        int avgValueLen = this.entryCount == 0? 0:
        (int)(this.valuelength/this.entryCount);

        2. HFile.Reader

        (1) public ByteBuffer getMetaBlock(String metaBlockName)
        it's better to use:

        if (metaIndex.count <= 0) {
          return null; // there are no meta blocks
        }

        instead of:

        if (trailer.metaIndexCount == 0) {
          return null; // there are no meta blocks
        }

        and add:

        if (buf == null)
          return null;

        since decompress() may return null.

        Remove the following to avoid a block copy and improve performance. But pay attention: the current position of buf is then METABLOCKMAGIC.length (not 0):

        // Toss the header. May have to remove later due to performance.
        buf.compact();
        buf.limit(buf.limit() - METABLOCKMAGIC.length);
        buf.rewind();

        (2) ByteBuffer readBlock(int block, boolean cacheBlock)
        Boundary bug:

        if (block < 0 || block > blockIndex.count) {
          throw new IOException("Requested block is out of range: " + block + ", max: " + blockIndex.count);
        }

        should be:

        if (block < 0 || block >= blockIndex.count) {
          throw new IOException("Requested block is out of range: " + block + ", max: " + blockIndex.count);
        }

        add:

        if (buf == null) {
          throw new IOException("Decompress block failure " + block);
        }

        since decompress() may return null.

        Remove the following to avoid a block copy and improve performance. But pay attention: the current position of buf is then METABLOCKMAGIC.length (not 0), and the cached block also has this position.

        // Toss the header. May have to remove later due to performance.
        buf.compact();
        buf.limit(buf.limit() - METABLOCKMAGIC.length);
        buf.rewind();

        (3) public boolean seekTo()

        if (block != null && currBlock == 0) {
          block.rewind();
          block.position(DATABLOCKMAGIC.length);
          currKeyLen = block.getInt();
          currValueLen = block.getInt();
          return true;
        }

        The code above adds block.position(DATABLOCKMAGIC.length); since we removed buf.compact() earlier.
        It also adds return true; I guess the original code misses it.

        (4) private void loadBlock(int bloc) throws IOException {
          if (block == null) {
            block = reader.readBlock(bloc, cacheBlocks);
            currBlock = bloc;
            blockFetches++;
          } else {
            if (bloc != currBlock) {
              block = reader.readBlock(bloc, cacheBlocks);
              currBlock = bloc;
              blockFetches++;
            } else {
              // we are already in the same block, just rewind to seek again.
              block.rewind();
              block.position(DATABLOCKMAGIC.length); // added since we removed buf.compact() earlier.
            }
          }
        }

        Schubert Zhang added a comment -

        The previous comments are not so readable.
        Please try the patch.

        Jonathan Gray added a comment -

        Moving to 0.20.1

        stack added a comment -

        This looks good Schubert.

        Is removing the below safe?

        -      if (trailer.metaIndexCount == 0) {
        -        return null; // there are no meta blocks
        -      }
        

        Or are you removing code that is redundant – if metaIndex == null, the above is null?

        Do all tests pass?

        ryan rawson added a comment -

        stack, I am nearly 100% sure I added that code block to solve an NPE.

        Schubert Zhang added a comment -

        @stack,

        Thanks, I removed my modification at this point, and a new patch is attached.

        I have done a simple verification test using the HBase PerformanceEvaluation tool. It seems the removal of buf.compact() does not improve performance.

        @Ryan,

        Could you please share your added code?

        ryan rawson added a comment -

        As it turns out, buf.compact() isn't actually copying the buffer; it's just adjusting some of the internal variables and pointers. Let's leave it as is.

        Schubert Zhang added a comment -

        I am puzzled...

        In the class HeapByteBuffer:

        public ByteBuffer compact() {
          System.arraycopy(hb, ix(position()), hb, ix(0), remaining());
          position(remaining());
          limit(capacity());
          return this;
        }

        We can see there is an arraycopy here, but I am puzzled why there is no performance improvement.

        @ryan,

        "I am nearly 100% sure that code block I added to solve a NPE."

        Could you please tell me about your added code block to solve the NPE? I am interested in it.

        Schubert Zhang added a comment -

        public ByteBuffer compact() {
          System.arraycopy(hb, ix(position()), hb, ix(0), remaining());
          position(remaining());
          limit(capacity());
          return this;
        }
        ryan rawson added a comment -

        You were advocating removing:

        - if (trailer.metaIndexCount == 0) {
        -   return null; // there are no meta blocks
        - }

        I seem to recall adding that to solve some NPE or something.

        stack added a comment -

        @Schubert Is it possible that, in the scheme of things, the compact arraycopy is too minuscule to show in overall PE numbers?

        The v2 patch looks good. It does not remove the section flagged above (that Ryan thinks might protect against a NPE).

        I'm moving this issue out of 0.20.1 and will apply it to TRUNK. It's an improvement. Let's try to do bug fixes only in the branch.

        stack added a comment -

        Committed to TRUNK. All tests pass. Thanks for the patch Schubert.

        stack added a comment -

        Reopening. Tests fail, the TestHFile in particular.

        ryan rawson added a comment -

        Here are some comments from the svn commit:

          if (this.comparator.compare(this.lastKeyBuffer, this.lastKeyOffset,
        -     this.lastKeyLength, key, offset, length) > 0) {
        +     this.lastKeyLength, key, offset, length) >= 0) {
            throw new IOException("Added a key not lexically larger than" +
              " previous key=" + Bytes.toString(key, offset, length) +
              ", lastkey=" + Bytes.toString(this.lastKeyBuffer, this.lastKeyOffset,

        This change has the exact opposite effect from the intent. The intent (as I construe it) is to allow duplicate keys in an HFile, yet by adding >= you are explicitly requiring the next key to be larger than, and NOT equal to, the previous key. This has to stay as is.

        + if (buf == null)
        + return null;

        This is pointless; we'll NPE a few lines later anyway. Don't add checking code that doesn't have any benefits. The caller may not interpret a null return as 'error' in this case; did you check that?

        - // Toss the header. May have to remove later due to performance.
        - buf.compact();
        - buf.limit(buf.limit() - METABLOCKMAGIC.length);
        - buf.rewind();

        I am not super happy with this change... it doesn't affect performance, but it does reduce code simplicity, leaves room for errors further on, and requires everyone to realize that the buf has this hidden header which isn't part of the public API of HFile.

        Schubert Zhang added a comment -

        @ryan
        1. Regarding duplicate keys in an HFile, I have the following concern.
        If we allow duplicate keys, consider the following scenario:
        A key="abcd" is appended as block1's last key/value pair.
        Then the same key="abcd" is appended as block2's first key/value pair, so in the block index the key="abcd" will point to block2.
        Then, when we want to scan from key="abcd", the first key="abcd" (block1's last) will be missed.
        Can you confirm this scenario is acceptable or required?

        2. + if (buf == null)
        + return null;

        This check is only added in getMetaBlock(...). In this method, there are three points that return null:

        (1) if (trailer.metaIndexCount == 0)
              return null; // there are no meta blocks
        (2) if (block == -1)
              return null;
        (3) if (buf == null) // newly added by me
              return null;

        If we do not check it, the following buf.get(..) may NPE, because the decompress() method will not throw an exception. Do you mean an NPE is better than a "return null", the same as in the two cases above?
        In fact, it is difficult for me to make the trade-off above; maybe I am thinking in the C++ way.

        3. Regarding buf.compact().
        Yes, you may be right. After more performance testing, my patch does not improve performance (I don't know whether it can in some other environments). I agree to remove this modification from my patch to keep the returned block buffer neat (position at 0).

        @stack and ryan
        Thanks for your tests. I will change the patch according to your comments, to include only bug fixes.
        If the tests fail, please just revert to the old version.
        Please give me comments about my questions above, then I can act immediately. Thanks.

        Schubert Zhang added a comment -

        I uploaded the v3 patch, which only includes bug fixes.

        stack added a comment -

        I buy the argument for preventing duplicate keys.

        Patch looks good to me. +1 on committing (TestHFile passes).

        @Ryan, what do you think?

        ryan rawson added a comment -

        So if decompress() returns null (which it should not), some kind of error occurred, such as DFS being out to lunch, etc. Returning 'null', which is interpreted as 'no such meta block', really ruins the API contract here. Just let it NPE. That is better than converting an IO error into a logical error, as your code would do.

        As for the duplicate key, the block index is based on the last key in the block. So if we have two duplicate keys that straddle a block boundary, the scenario is like so:

        • last key of block X is key 'A' and that is entered in the index
        • first key of block X+1 is also key 'A', but first key is not part of the index

        Scanning will start at block X, and we will see both.

        The only scenario where we could potentially have an issue is if we have multiple duplicate keys that span more than one block, thus causing two index entries to have the same key. The binary search algorithm would need to choose the first entry to maintain correct behaviour on scan, and I am not 100% sure that is what will happen. But even so, this is very rare: since in HBase keys are distinguished by timestamps, it would require either many keys with the same TS or a few really large ones with the same TS. Something to consider testing for.

        More importantly, if we disallow duplicate keys in HFile, that will cause huge problems. Right now there is no other mechanism to prevent duplicate KeyValues from being inserted; it's your own lunch if you put multiple values at the same timestamp. But this change would cause compactions to throw and prevent them from completing, a far worse scenario.

        stack added a comment -

        I'm wondering how we'd ever write duplicate keys to an hfile? Wouldn't the most recent overwrite any older keys in memstore, a TreeMap?

        stack added a comment -

        Ryan held my hand and explained that while not at memstore flush time, it could happen on a compaction – makes sense.

        Schubert Zhang added a comment -

        @ryan
        1. Your description of "return null" is good, i.e. null means this meta block does not exist.
        I accept your comment.
        Maybe we can throw another Exception here? If you think it is not necessary, I will remove this added code.

        2. In HFile.java, I found the block index is based on the first key in the block (not the last key).
        We are not only treating HFile as a part of HBase; in fact, we want HFile to be a common file format that can be used by other applications. And I would like to support duplicate keys in an HFile, since my application uses HFile directly to store data. But when I checked the code, I found the risk in adding duplicate keys, e.g.:

        • block 1: firstKey=A, lastKey=B, indexKey=A
        • block 2: firstKey=B, lastKey=C, indexKey=B
          When we seek key=B, we go into block 2 and miss the lastKey=B in block 1.

        Yes, you are right that if the data block index used the last key instead of the first key, it would seem fine:

        • block 1: firstKey=A, lastKey=B, indexKey=B
        • block 2: firstKey=B, lastKey=C, indexKey=C
          When we seek key=B, we go into block 1. Scanner.seekTo() will find key=B in block 1, starting from block 1's firstKey, and Scanner.next() will not miss the key=B in block 2.

        But I double-checked the HFile code, and the block index really is the firstKey now (a sketch of such a lookup follows).
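
        For illustration only, a minimal sketch of a first-key block index lookup (this is not the actual HFile BlockIndex code; the class name, method name, byte[][] index layout, and comparator are all hypothetical):

        import java.util.Arrays;
        import java.util.Comparator;

        final class FirstKeyIndexLookup {
          /**
           * Returns the index of the block whose first key is the greatest key that is
           * <= the target, or -1 if the target sorts before every block's first key.
           */
          static int blockContainingKey(byte[][] firstKeys, byte[] target,
                                        Comparator<byte[]> comparator) {
            int pos = Arrays.binarySearch(firstKeys, target, comparator);
            if (pos >= 0) {
              // Exact hit on a block's first key: the seek goes straight to that block,
              // so a duplicate of the same key stored at the END of the previous block
              // is never visited.
              return pos;
            }
            return -pos - 2; // insertion point minus one, i.e. the preceding block
          }
        }

        With such a lookup, an exact hit on block 2's firstKey=B lands in block 2, which is why the key=B stored as block 1's last entry is skipped.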

        Schubert Zhang added a comment -

        @ Ryan and Stack,
        Could you please make time to give me a reply about the above two questions? Then I will go ahead.

        stack added a comment -

        On hfile as a general format, do you think that will work, Schubert? It has stuff like KeyValue imports. Is that OK?

        Ryan, what do you think of the index being the first rather than the last key in the block?

        On the exception in decompress, https://issues.apache.org/jira/browse/HBASE-1809 is an example. Somehow we're using the HFile after a close. Maybe another exception would be good here... at least for debugging. It could emit the state of the HFile.Reader.

        Schubert Zhang added a comment -

        @stack
        Thanks for your reply. So I will make the following decisions to close this issue.

        1. Do not check the buf returned by decompress(), and let the NPE be thrown.
        I had thought of throwing an IOException, but it is not a RuntimeException.

        2. Leave HFile able to accept duplicated keys. But I think that leaves the Scanner scenario undefined.
        The block index uses the firstKey of the block.

        So, finally, the patch only includes some minor fixes.

        stack added a comment -

        Ok on 1 and 2 (I opened a new issue, hbase-1841, to account for the case you discovered in your review where it's possible we miss a key if there are many instances and they span a block).

        What about this change:

        @@ -860,7 +859,7 @@
               if (trailer.metaIndexCount == 0) {
                 return null; // there are no meta blocks
               }
        -      if (metaIndex == null) {
        +      if ((metaIndex == null) || (metaIndex.count == 0)) {
                 throw new IOException("Meta index not loaded");
               }
               byte [] mbname = Bytes.toBytes(metaBlockName);
        
        

        Is it possible that metaIndex is loaded but empty?

        Thanks Schubert

        ryan rawson added a comment -

        In reference to the block index: yes, the scenario with duplicate keys that span a block boundary exists.

        It's possible that we could fix these holes with a different write strategy which doesn't create invalid hfiles like the one you theorized above. Another scenario is when you have duplicate key entries in the index, which could cause problems with the binary search algorithm.

        There are two potential fixes here:

        • fix the binary search algorithm to actually find the lower bound in the face of duplicates (see the sketch after this list).
        • prevent hfiles like the one indicated above from being created, in this case by extending block 1 beyond the default sizing until we get a different key.

        There might be other solutions too.
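
        For the first option, a minimal sketch of a lower-bound binary search, assuming a hypothetical flat byte[][] of index keys and a byte[] comparator (this is not the existing HFile index code):

        import java.util.Comparator;

        final class LowerBoundIndexSearch {
          /**
           * Lower-bound binary search: returns the index of the FIRST entry that is
           * >= the target (or indexKeys.length if every entry is smaller). With
           * duplicate index keys this always picks the earliest block, so a scan
           * starting there cannot skip matching entries at the end of an earlier block.
           */
          static int lowerBound(byte[][] indexKeys, byte[] target,
                                Comparator<byte[]> comparator) {
            int lo = 0;
            int hi = indexKeys.length;
            while (lo < hi) {
              int mid = (lo + hi) >>> 1;                 // unsigned shift avoids int overflow
              if (comparator.compare(indexKeys[mid], target) < 0) {
                lo = mid + 1;                            // entry too small, search the right half
              } else {
                hi = mid;                                // candidate, keep searching the left half
              }
            }
            return lo;
          }
        }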

        Schubert Zhang added a comment -

        Thanks, stack, for creating a new issue (hbase-1841).

        Regarding the two potential fixes:

        • fix the binary search algorithm to actually find the lower bound in the face of duplicates.
          I think maybe we need to change to using the lastKey as the block index?
        • prevent hfiles like the one indicated above from being created, in this case by extending block 1 beyond the default sizing until we get a different key.
          In fact, we used this approach in one of our old products, i.e. only start a new block/index entry at the boundary between different keys. In that case, we should ensure the number of duplicated keys is not too large (which would lead to big blocks).
        stack added a comment -

        Applied to branch and trunk (v5 is bug fixes and a small optimization).

        Thanks for the patch Schubert.

        Matteo Bertozzi added a comment -

        I've rewritten Scanner.blockSeek() to perform a binary search.
        The new method has the same behaviour as the previous one, even with duplicate keys.
        Unfortunately I've used an ArrayList to keep track of the keys' offsets; the ArrayList could be replaced by an array if the number of keys in a block were known. A rough sketch of the idea follows.
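
        This is not Matteo's actual patch, just a rough sketch of the approach, assuming each entry in the block buffer is laid out as [int keyLength][int valueLength][key][value] and the buffer's position is already at the first entry; the class name, method name, and comparator are illustrative, and HFile's exact seek semantics are omitted:

        import java.nio.ByteBuffer;
        import java.util.ArrayList;
        import java.util.Comparator;
        import java.util.List;

        final class BinaryBlockSeekSketch {
          /**
           * First pass: record the offset of every entry in the block.
           * Second pass: binary-search the offsets, moving left on ties so that the
           * FIRST entry whose key is >= the target is found. Returns that entry's
           * offset, or -1 if every key in the block is smaller than the target.
           */
          static int binaryBlockSeek(ByteBuffer block, byte[] target,
                                     Comparator<byte[]> comparator) {
            List<Integer> offsets = new ArrayList<>();
            ByteBuffer b = block.duplicate();             // leave the caller's position alone
            while (b.remaining() > 0) {
              offsets.add(b.position());
              int keyLen = b.getInt();
              int valLen = b.getInt();
              b.position(b.position() + keyLen + valLen); // jump to the next entry
            }
            int lo = 0, hi = offsets.size(), found = -1;
            while (lo < hi) {
              int mid = (lo + hi) >>> 1;
              ByteBuffer e = block.duplicate();
              e.position(offsets.get(mid));
              int keyLen = e.getInt();
              e.getInt();                                 // skip the value length
              byte[] key = new byte[keyLen];
              e.get(key);
              if (comparator.compare(key, target) < 0) {
                lo = mid + 1;                             // key too small, go right
              } else {
                found = offsets.get(mid);
                hi = mid;                                 // keep looking left for duplicates
              }
            }
            return found;
          }
        }

        The ArrayList of offsets is the per-block extra memory Matteo mentions; with a known entry count it could be a plain int[].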

        ryan rawson added a comment -

        According to my own profiling of HBase, blockSeek is not a hot method. Could you provide more justification for the extra memory use?

        Matteo Bertozzi added a comment -

        @Ryan
        Yep, blockSeek is not a hot method, but with larger keys and blocks it can speed up key lookups.
        The extra memory is just 4 KB for 1024 keys, and to me it would not be bad to have it on disk, at the end of the block.


          People

          • Assignee:
            Schubert Zhang
          • Reporter:
            Schubert Zhang
          • Votes:
            0
          • Watchers:
            4

            Dates

            • Created:
              Updated:
              Resolved:
