HBase / HBASE-9797

Multi row transactions are not atomic for scanners

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix

    Description

      Multi-row atomic puts, as implemented by the coprocessor API, are atomic for gets and multi-gets, but not for scanners.

      The mvcc read point, as of today, is kept only in RS memory. When a client starts a scan, we create a new scanner object on the RS and save the scan's mvcc read point there. Since the scan API is row-based, scan results are made visible to clients row by row, and the client scanner keeps track of the last row seen.

      So, for a multi-row atomic update, the scanner might get an mvcc read number that is less than the commit point of the multi-row update, in which case it skips the rows written by that update (it will not see them). However, in case of RS failover, a new scanner is created with an mvcc read number larger than the commit number of the multi-row update, so the scanner sees the remaining rows of the transaction.

      Example:

      multi put : { {row1, c1, v1}, {row100, c1, v100} } mvcc write number = 2
      scan : scan from row1 to row100  mvcc read number = 1
      

      The scanner will not see row1. If the RS fails before the scanner reaches row100, the new scanner will get an mvcc read number > 2, so it will see row100. The client therefore observes only part of the transaction.
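
The anomaly can be reproduced with a toy in-memory model. This is a minimal sketch, not HBase code: `Cell`, `open_scanner`, the zero-padded row keys, and the pre-existing `row050` are all assumptions made for illustration.

```python
from collections import namedtuple

# A stored cell: row key, value, and the mvcc write number of its commit.
Cell = namedtuple("Cell", ["row", "value", "mvcc"])


def open_scanner(cells, after_row, read_point):
    """Yield cells in row order, strictly after after_row, visible at read_point."""
    for cell in sorted(cells, key=lambda c: c.row):
        if cell.row > after_row and cell.mvcc <= read_point:
            yield cell


# row001 and row100 come from one multi-row put (write number 2).
# row050 is older data (hypothetical) that the scan passes on the way.
store = [
    Cell("row001", "v1", 2),
    Cell("row050", "old", 0),
    Cell("row100", "v100", 2),
]

seen = []

# The scan opens with read point 1, so the multi-row put is invisible.
scanner = open_scanner(store, after_row="", read_point=1)
seen.append(next(scanner).row)  # row001 is hidden; the first result is row050

# Simulated RS failover: the client reopens after the last row it saw,
# and the new server-side scanner picks a fresh read point (> 2).
scanner = open_scanner(store, after_row=seen[-1], read_point=3)
seen.extend(cell.row for cell in scanner)

print(seen)  # ['row050', 'row100']
```

The result contains row100 but not row001, even though both were written by the same transaction.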

      There might be a couple of ways to fix this. The first approach (as suggested by Sergey) is to wrap the Scanner in an atomic scanner implementation that restarts the scan in case of a socket timeout, server failure, etc. It would buffer the results so that no rows are visible to the caller until the scan completes. For small scans (like meta) this might be viable.
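
A rough sketch of the first approach, under the assumption that the wrapper is given a callable that opens a fresh scanner. `atomic_scan`, `ScannerFailure`, and `flaky_open` are hypothetical names, not part of the HBase client API.

```python
class ScannerFailure(Exception):
    """Stand-in for a socket timeout or region server failure."""


def atomic_scan(open_scanner, max_attempts=3):
    """Return all rows from one uninterrupted scan, or raise.

    open_scanner() must return a fresh iterator. Rows are buffered and
    handed to the caller only when a single attempt runs to completion,
    so results from two different read points are never mixed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        buffered = []
        try:
            for row in open_scanner():
                buffered.append(row)
            return buffered
        except ScannerFailure as error:
            last_error = error  # discard the partial buffer and restart
    raise last_error


attempts = []


def flaky_open():
    """First attempt fails mid-scan; the retry sees the committed put."""
    attempts.append(1)
    first_try = len(attempts) == 1

    def rows():
        if first_try:
            yield "row050"  # old read point: the multi-row put is hidden
            raise ScannerFailure("simulated region server failure")
        yield from ["row001", "row050", "row100"]

    return rows()


result = atomic_scan(flaky_open)
print(result)  # ['row001', 'row050', 'row100']
```

The cost is that the whole result set sits in client memory and a late failure repeats all the work, which is why this only looks reasonable for small scans.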

      The second way, which fixes this properly, is to first finish the patch at HBASE-8763, then change the scanner to obtain an mvcc number from the RS at scanner open and save that number on the client side. Upon failure, the scanner continues the scan where it left off, using the saved read point. We would have to track the low watermark (the smallest mvcc read number among the currently open scanners) differently: that number is already tracked today, but not across RS failover. I think we can use timeouts to manage the low watermark.
      This approach also enables us to implement a cell-based streaming scan instead of the row-based approach we have today.
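
One way the timeout-managed low watermark could look, as a sketch only. `ReadPointTracker` and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic. Each open scanner holds a lease on its read point, and expired leases stop holding the watermark back.

```python
class ReadPointTracker:
    """Tracks read points of open scanners using renewable leases."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self._leases = {}  # scanner id -> (read_point, expires_at)

    def register(self, scanner_id, read_point, now):
        """Record or renew a scanner's lease on its read point."""
        self._leases[scanner_id] = (read_point, now + self.lease_seconds)

    def close(self, scanner_id):
        """Release a scanner's read point when the scan finishes."""
        self._leases.pop(scanner_id, None)

    def low_watermark(self, now, latest_read_point):
        """Smallest read point still under lease, else the latest one."""
        # Drop leases whose timeout has passed.
        self._leases = {
            sid: (rp, exp)
            for sid, (rp, exp) in self._leases.items()
            if exp > now
        }
        live = [rp for rp, _ in self._leases.values()]
        return min(live, default=latest_read_point)


tracker = ReadPointTracker(lease_seconds=60)
tracker.register("scan-a", read_point=1, now=0)
tracker.register("scan-b", read_point=5, now=10)

print(tracker.low_watermark(now=30, latest_read_point=7))   # 1
print(tracker.low_watermark(now=65, latest_read_point=7))   # 5
print(tracker.low_watermark(now=100, latest_read_point=7))  # 7
```

A scanner that resumes after failover would re-register its saved read point with the new server. If its lease has already expired, versions older than that read point may be gone, and the scan would presumably have to fail or restart.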

      Opened the issue, so that it is tracked. Feel free to pick it up if you like.

            People

              Assignee: Unassigned
              Reporter: Enis Soztutar
              Votes: 0
              Watchers: 9