HBase
  1. HBase
  2. HBASE-1569

rare race condition can take down a regionserver.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.20.0
    • Component/s: None
    • Labels:
      None

      Description

      this happened after > 24 hours of heavy import load on my cluster. Luckily the shutdown seemed to be clean:

      java.lang.IllegalAccessError: Call open first
      at org.apache.hadoop.hbase.regionserver.StoreFile.getReader(StoreFile.java:356)
      at org.apache.hadoop.hbase.regionserver.Store.getStorefilesIndexSize(Store.java:1378)
      at org.apache.hadoop.hbase.regionserver.HRegionServer.doMetrics(HRegionServer.java:1075)
      at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:454)
      at java.lang.Thread.run(Thread.java:619)

      1. sf.patch
        5 kB
        stack
      2. 1569-v2.patch
        11 kB
        stack

        Activity

        Hide
        ryan rawson added a comment -

        this is a race condition, here is how it happens:

        doMetrics() calls getStorefilesIndexSize() which gets a view of the storefiles ConcurrentSkipListMap at some point in time. Working on this snapshot it calls each store file in turn asking for the index size.

        In another thread, the compaction completion code finishes, first thing it does is:

        • remove store files from the storefiles list.
        • do some stuff
        • close the aforementioned store files, which causes the this.reader to become null.

        Back in thread #1, we run into the this.reader == null, and we throw the exception.

        So we need to do either of:

        • sync on this map, use a synced versin of the map
        • allow the ability to check this metrics without causing a RS abort when we hit an exception. Either catch it, or prevent it from happening.
        Show
        ryan rawson added a comment - this is a race condition, here is how it happens: doMetrics() calls getStorefilesIndexSize() which gets a view of the storefiles ConcurrentSkipListMap at some point in time. Working on this snapshot it calls each store file in turn asking for the index size. In another thread, the compaction completion code finishes, first thing it does is: remove store files from the storefiles list. do some stuff close the aforementioned store files, which causes the this.reader to become null. Back in thread #1, we run into the this.reader == null, and we throw the exception. So we need to do either of: sync on this map, use a synced versin of the map allow the ability to check this metrics without causing a RS abort when we hit an exception. Either catch it, or prevent it from happening.
        Hide
        stack added a comment -

        At first I thought that use of ConcurrentSkipListSet the problem but thinking on it more, rather, we need to make code tolerate fact that a file has been moved or removed. Alternative is syncing around file operations till they complete which is too much to ask.

        A good while a go, an issue in metrics got HRS stuck in an infinite loop.

        Let me try hack up a patch.

        Show
        stack added a comment - At first I thought that use of ConcurrentSkipListSet the problem but thinking on it more, rather, we need to make code tolerate fact that a file has been moved or removed. Alternative is syncing around file operations till they complete which is too much to ask. A good while a go, an issue in metrics got HRS stuck in an infinite loop. Let me try hack up a patch.
        Hide
        stack added a comment -

        Unfinished first attempt.

        Show
        stack added a comment - Unfinished first attempt.
        Hide
        stack added a comment -

        Patch does following:

        + Wraps metrics in a try/catch that catches any exception, logs it and then moves on rather than let it out and kill HRS
        + Changed getReader so it doesn't do IllegalArgumentException with "must open first" but instead lets out the null Reader.
        + Then, changed whereever we get a Reader to check for null. If null log it so we can see extent of phenomeon but in general just keep going.

        Show
        stack added a comment - Patch does following: + Wraps metrics in a try/catch that catches any exception, logs it and then moves on rather than let it out and kill HRS + Changed getReader so it doesn't do IllegalArgumentException with "must open first" but instead lets out the null Reader. + Then, changed whereever we get a Reader to check for null. If null log it so we can see extent of phenomeon but in general just keep going.
        Hide
        Lars George added a comment -

        I ran into the same issue, killed the HRS that hosted the ROOT partition. The "Catalog Tables" tbale in the UI was empty afterwards and the running MR job was failing fast. I synced to trunk and applied this patch and restarted the cluster. With the above patch two pieces are failing to update but they were the removed bloomfilter variables and it looks like it was removed in trunk already so no harm done.

        Will report if I run into the new log outputs or if anything else happens.

        Show
        Lars George added a comment - I ran into the same issue, killed the HRS that hosted the ROOT partition. The "Catalog Tables" tbale in the UI was empty afterwards and the running MR job was failing fast. I synced to trunk and applied this patch and restarted the cluster. With the above patch two pieces are failing to update but they were the removed bloomfilter variables and it looks like it was removed in trunk already so no harm done. Will report if I run into the new log outputs or if anything else happens.
        Hide
        ryan rawson added a comment -

        +1 lgtm

        Show
        ryan rawson added a comment - +1 lgtm
        Hide
        stack added a comment -

        Committed

        Show
        stack added a comment - Committed

          People

          • Assignee:
            stack
            Reporter:
            ryan rawson
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development