Hive / HIVE-16671

LLAP IO: BufferUnderflowException may happen in very rare(?) cases due to ORC end-of-CB estimation

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4.0, 3.0.0
    • Component/s: None
    • Labels: None

    Attachments

      1. HIVE-16671.patch
        0.9 kB
        Sergey Shelukhin
      2. HIVE-16671.01.patch
        4 kB
        Sergey Shelukhin
      3. HIVE-16671.02.patch
        9 kB
        Sergey Shelukhin

      Activity

        vgarg Vineet Garg added a comment -

        Hive 3.0.0 has been released so closing this jira.


        sershe Sergey Shelukhin added a comment -

        Thanks, committed to branches.

        hiveqa Hive QA added a comment -

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12868825/HIVE-16671.02.patch

        SUCCESS: +1 due to 1 test(s) being added or modified.

        ERROR: -1 due to 5 failed/errored test(s), 10732 tests executed
        Failed tests:

        org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[materialized_view_create_rewrite] (batchId=236)
        org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[smb_mapjoin_7] (batchId=236)
        org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_3] (batchId=97)
        org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3] (batchId=97)
        org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query24] (batchId=231)
        

        Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5334/testReport
        Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5334/console
        Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5334/

        Messages:

        Executing org.apache.hive.ptest.execution.TestCheckPhase
        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        Tests exited with: TestsFailedException: 5 tests failed
        

        This message is automatically generated.

        ATTACHMENT ID: 12868825 - PreCommit-HIVE-Build


        sershe Sergey Shelukhin added a comment -

        Updated the method, added the test.

        sershe Sergey Shelukhin added a comment -

        "next" in this case is the current element that we are reading. Will refactor to make it clearer.

        prasanth_j Prasanth Jayachandran added a comment -

        if (ix == 3) break; // Done, we have 3 bytes.

        After this, why are we moving on to the next DiskRangeList? A DiskRangeList will not be 1 or 2 bytes long, right? If we can't read the complete header in the current DiskRangeList, in the worst case it will be in the next contiguous DiskRangeList. Am I missing something here?

        It would be good to make this a separate method with tests (fake InStreams) if possible.

        sershe Sergey Shelukhin added a comment - prasanth_j ping?

        sershe Sergey Shelukhin added a comment -

        No, ix is counted across multiple buffers. So, if we have 3 buffers of one byte each, we'd increment ix by 1 in each. Unless I'm missing something.

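        The point about ix being carried across buffers can be illustrated with a minimal sketch. This is not the code from the patch: the class name, the readHeader method, and the use of plain ByteBuffer objects in place of DiskRangeList entries are assumptions made for illustration. Only the readLengthBytes name and the "ix == 3" check come from the snippet quoted in this thread, and the signature used here is a guess.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

class HeaderAcrossBuffers {
    // Hypothetical stand-in for readLengthBytes: copies bytes from buf into
    // header starting at index ix, stops once 3 bytes are collected or the
    // buffer is exhausted, and returns the updated ix.
    static int readLengthBytes(ByteBuffer buf, int[] header, int ix) {
        while (ix < 3 && buf.hasRemaining()) {
            header[ix++] = buf.get() & 0xff;
        }
        return ix;
    }

    // Walks the buffers in order. ix is NOT reset per buffer, so a header
    // split as 1+1+1 or 2+1 bytes is still assembled correctly.
    // Returns null if the buffers run out before 3 bytes are read.
    static int[] readHeader(List<ByteBuffer> buffers) {
        int[] header = new int[3];
        int ix = 0;
        for (ByteBuffer buf : buffers) {
            ix = readLengthBytes(buf, header, ix);
            if (ix == 3) {
                return header; // Done, we have 3 bytes.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Three one-byte buffers: ix advances by one in each of them.
        List<ByteBuffer> split = Arrays.asList(
            ByteBuffer.wrap(new byte[] {0x40}),
            ByteBuffer.wrap(new byte[] {0x0d}),
            ByteBuffer.wrap(new byte[] {0x03}));
        System.out.println(Arrays.toString(readHeader(split)));
    }
}
```

        Resetting ix to 0 for each buffer in this sketch would discard the bytes already collected from earlier buffers, which is why the index is kept outside the loop.
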
        prasanth_j Prasanth Jayachandran added a comment - - edited
        ix = readLengthBytes(compressed, bytes, ix);
        if (ix == 3) break; // Done, we have 3 bytes.
        

        Should ix be reset to 0 inside the while loop? ix could have a non-zero value outside the while loop.

        hiveqa Hive QA added a comment -

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12868200/HIVE-16671.01.patch

        ERROR: -1 due to no test(s) being added or modified.

        ERROR: -1 due to 3 failed/errored test(s), 10717 tests executed
        Failed tests:

        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[index_auto_mult_tables] (batchId=81)
        org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[table_nonprintable] (batchId=140)
        org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr] (batchId=144)
        

        Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5273/testReport
        Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5273/console
        Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5273/

        Messages:

        Executing org.apache.hive.ptest.execution.TestCheckPhase
        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        Tests exited with: TestsFailedException: 3 tests failed
        

        This message is automatically generated.

        ATTACHMENT ID: 12868200 - PreCommit-HIVE-Build


        sershe Sergey Shelukhin added a comment -

        prasanth_j btw, the first patch fixed the original issue. For that case, the new patch would behave like the original. The corner case of the corner case, where the 3 bytes of length are spread over multiple buffers, is handled in the 02 patch. Unfortunately we don't have a repro for that and it might be impossible (perhaps if the length falls on a ZCR boundary with ZCR enabled?)

        sershe Sergey Shelukhin added a comment -

        A patch to handle all corner cases.

        sershe Sergey Shelukhin added a comment -

        The 3-byte check is related to the 3-byte CB header. I double-checked: we hit BufferUnderflow (no next byte) on the 3rd get.
        I don't think it's related to the file header; that would have happened much earlier. I may have more details tomorrow about the failure.
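
        For readers unfamiliar with the 3-byte CB header, here is a small sketch of how such a header decodes and why a buffer that ends 1-2 bytes into it fails on a later get. The class and method names are hypothetical and this is not the Hive reader code. The layout assumed is the one in the ORC specification: a little-endian 3-byte value holding (compressedLength * 2 + isOriginal).

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

class CbHeaderSketch {
    static final int HEADER_SIZE = 3;

    // Decodes the chunk length from the three header bytes (little-endian,
    // with the lowest bit of the first byte reserved for the "original" flag).
    static int chunkLength(int b0, int b1, int b2) {
        return (b2 << 15) | (b1 << 7) | (b0 >> 1);
    }

    // True when the chunk is stored uncompressed.
    static boolean isOriginal(int b0) {
        return (b0 & 1) == 1;
    }

    // Guarded read: returns -1 instead of throwing when fewer than 3 bytes
    // remain, so the caller can treat it like a partial CB and fetch more data.
    static int tryReadLength(ByteBuffer buf) {
        if (buf.remaining() < HEADER_SIZE) {
            return -1;
        }
        int b0 = buf.get() & 0xff;
        int b1 = buf.get() & 0xff;
        int b2 = buf.get() & 0xff;
        return chunkLength(b0, b1, b2);
    }

    public static void main(String[] args) {
        // Only 2 of the 3 header bytes are present: the 3rd get() underflows.
        ByteBuffer truncated = ByteBuffer.wrap(new byte[] {0x40, 0x0d});
        try {
            truncated.get();
            truncated.get();
            truncated.get();
        } catch (BufferUnderflowException e) {
            System.out.println("underflow on 3rd get");
        }
        // The guarded variant reports the short read instead of throwing.
        System.out.println(tryReadLength(ByteBuffer.wrap(new byte[] {0x40, 0x0d})));
        System.out.println(tryReadLength(ByteBuffer.wrap(new byte[] {0x40, 0x0d, 0x03})));
    }
}
```

        The bytes 0x40, 0x0d, 0x03 are the ORC specification's example header for a chunk that compressed to 100,000 bytes.
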

        prasanth_j Prasanth Jayachandran added a comment -

        The 3-byte check makes me wonder whether this is somehow related to splits starting at 0 vs 3.
        When the BI split strategy is chosen, the entire file/block could become a split. Say a file is 1000 bytes: the split offset will be 0 and the length will be 1000.
        Whereas for the same file, if the ETL split strategy is chosen, the split offset will be 3 and the length will be 997. The first 3 bytes are ignored as they are part of the ORC magic header.

        Do you have a repro for this issue? If so, could you check the split boundaries to see whether this is the case?

        sershe Sergey Shelukhin added a comment -

        Actually, I guess this patch should be more involved, to cover the case where there's a contiguous chunk with more bytes, however unlikely that is.

        sershe Sergey Shelukhin added a comment - - edited

        prasanth_j can you take a look? and also perhaps comment if you have input on ORC estimates. This is basically the same logic as just below where we find out that BufferChunk-s from disk contain a cut off CB, after finding the CB length. What we see is a stable repro of BufferUnderflow in one of the 3 get-s. So I'm assuming that ORC estimate can overshoot a CB boundary by exactly 1-2 bytes, causing the failure to read length; that needs to be handled similarly to a partial CB. Does this make sense?


        People

          Assignee: sershe Sergey Shelukhin
          Reporter: rmutyala Ravi Mutyala
          Votes: 0
          Watchers: 5

          Dates

            Created:
            Updated:
            Resolved: