Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-26465

MemStoreLAB may be released early when its SegmentScanner is scanning

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 2.5.0, 3.0.0-alpha-2
    • 2.5.0, 3.0.0-alpha-2, 2.4.9
    • regionserver
    • None
    • Reviewed

    Description

      HBASE-26144 moved MemStore.clearSnapshot out of write lock of HStore#lock, but because MemStore.clearSnapshot closed DefaultMemStore#snapshot which may be used by DefaultMemStore#getScanners, when DefaultMemStore#getScanners and MemStore flushing execute conurrently, MemStoreLAB used by DefaultMemStore#snapshot may be released early when its SegmentScanner is scanning.

      Considering follow thread sequences:
      1.The flush thread starts DefaultMemStore flushing after some cells have be added to DefaultMemStore.

      2.The flush thread stopping before DefaultMemStore#clearSnapshot in following line 1238 in
      HStore#updateStorefiles after completed flushing memStore to hfile.

      1221    private boolean updateStorefiles(List<HStoreFile> sfs, long snapshotId) throws IOException {
      1222    this.lock.writeLock().lock();
      1223   try {
      1224      this.storeEngine.getStoreFileManager().insertNewFiles(sfs);
      1225    } finally {
      1226      // We need the lock, as long as we are updating the storeFiles
      1227      // or changing the memstore. Let us release it before calling
      1228      // notifyChangeReadersObservers. See HBASE-4485 for a possible
      1229      // deadlock scenario that could have happened if continue to hold
      1230      // the lock.
      1231     this.lock.writeLock().unlock();
      1232   }
      1233    // We do not need to call clearSnapshot method inside the write lock.
      1234    // The clearSnapshot itself is thread safe, which can be called at the same time with other
      1235    // memstore operations expect snapshot and clearSnapshot. And for these two methods, in HRegion
      1236    // we can guarantee that there is only one onging flush, so they will be no race.
      1237    if (snapshotId > 0) {
      1238      this.memstore.clearSnapshot(snapshotId);
      1239    }
      ......
      

      3.The scan thread starts and stopping after DefaultMemStore.snapshot.getAllSegments in following line 141 in
      DefaultMemStore#getScanners,here the scan thread gets the DefaultMemStore#snapshot which is created by the flush thread.

      138 public List<KeyValueScanner> getScanners(long readPt) throws IOException {
      139    List<KeyValueScanner> list = new ArrayList<>();
      140    addToScanners(getActive(), readPt, list);
      141    addToScanners(snapshot.getAllSegments(), readPt, list);
      142    return list;
      143 }
      

      4.The flush thread continues DefaultMemStore#clearSnapshot and close DefaultMemStore#snapshot in following line 253,because the reference count of the corresponding MemStoreLABImpl is decreased to 0, the Chunks in corresponding MemStoreLABImpl are recycled.

      240  public void clearSnapshot(long id) throws UnexpectedStateException {
          ......
      248    Segment oldSnapshot = this.snapshot;
      249    if (!this.snapshot.isEmpty()) {
      250      this.snapshot = SegmentFactory.instance().createImmutableSegment(this.comparator);
      251    }
      252    this.snapshotId = NO_SNAPSHOT_ID;
      253    oldSnapshot.close();
      254  }
      

      5.The scan thread continues DefaultMemStore#getScanners, and create a SegmentScanner for this DefaultMemStore#snapshot, and invokes MemStoreLABImpl.incScannerCount in line 281 to increase the reference count , but here MemStoreLABImpl.incScannerCount does not check that the reference count is already 0 and Chunks are already recycled by step 4, so these Chunks may be overwritten by other write threads when the SegmentScanner is scanning, which may cause serious problem.

      280   public void incScannerCount() {
      281       this.openScannerCount.incrementAndGet();
      282 }
      

      I simulate above case in my PR, and my Fix is as following:
      1.Moving MemStore.clearSnapshot back under write lock of HStore#lock, because DefaultMemStore#getScanners is protected under the read lock of HStore#lock, so DefaultMemStore#getScanners and
      DefaultMemStore#clearSnapshot could not execute concurrently.
      2. Introducing RefCnt into MemStoreLABImpl to replace its flawed reference count implementation, so checking and increasing or decreasing is done in atomicity and if there is illegal state in reference count, it would throw exception rather than using corrupt data, but this modification is not included in PR now because I find that there is another bug
      HBASE-26488 would disrupt the reference count in MemStoreLABImpl , I would Introduce RefCnt after fixing this another bug in HBASE-26494.

      Attachments

        Issue Links

          Activity

            People

              comnetwork chenglei
              comnetwork chenglei
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: