HIVE-24266

Committed rows in hflush'd ACID files may be missing from query result

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0.0
    • Component/s: None

      Description

      In an HDFS environment, if a writer uses hflush to write ORC ACID files during a transaction commit, the committed rows may appear to be missing when the table is read before the file is completely persisted to disk (that is, synced).

      This is because hflush does not persist the new buffers to disk; it only ensures that new readers can see the new content. As a result, the block information on which BISplitStrategy relies is incomplete. Although the side file (_flush_length) tracks the proper end of the file being written, this information is neglected in favour of the block information, and we may end up generating a very short split instead of one covering the larger, available length.
      When ETLSplitStrategy is used, there is no attempt at all to consult the ACID side file when calculating the file length, so that needs to be fixed too.
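
      As an illustration of the length calculation described above, here is a minimal sketch that prefers the side file over the block-reported length. It is not Hive's actual code: the class SideFileLength and its methods are hypothetical, it reads from the local filesystem instead of HDFS, and it assumes the side file is a sequence of 8-byte big-endian longs whose last complete entry is the most recently flushed length.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: a local-filesystem stand-in for the side-file lookup. */
final class SideFileLength {

    /**
     * Returns the last complete 8-byte long stored in the side file,
     * or -1 if the side file is missing or holds no complete entry.
     * Assumption: entries are appended as big-endian longs.
     */
    static long lastFlushLength(Path sideFile) throws IOException {
        if (!Files.exists(sideFile)) {
            return -1L;
        }
        long last = -1L;
        try (DataInputStream in = new DataInputStream(Files.newInputStream(sideFile))) {
            while (true) {
                // readLong throws EOFException when fewer than 8 bytes remain,
                // so a partially written trailing entry is ignored.
                last = in.readLong();
            }
        } catch (EOFException endOfFile) {
            // Normal termination: every complete entry has been read.
        }
        return last;
    }

    /**
     * Chooses the length to use when building a split: the side-file value
     * when one is available, otherwise the block-reported length.
     */
    static long effectiveLength(Path sideFile, long blockReportedLength) throws IOException {
        long flushed = lastFlushLength(sideFile);
        return flushed >= 0 ? flushed : blockReportedLength;
    }
}
```

      With a side file containing the entries 1024 and 4096 and a block-reported length of 100, effectiveLength yields the side-file value, so the split would cover the flushed data and not stop at the stale block length.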

      Moreover, newly committed rows might not appear because of OrcTail caching in ETLSplitStrategy. For now I'm just going to recommend turning that cache off for anyone who wants real-time row updates to be read:

      set hive.orc.cache.stripe.details.mem.size=0;  

      ..as tweaking that code would probably open a can of worms..

              People

              • Assignee: Ádám Szita (szita)
              • Reporter: Ádám Szita (szita)
              • Votes: 0
              • Watchers: 1

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 10m