HBase
  1. HBase
  2. HBASE-10047

postScannerFilterRow consumes a lot of CPU in tall table scans

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Continuing my profiling quest, I find that in scanning tall table (and filtering everything on the server) a quarter of the time is now spent in the postScannerFilterRow coprocessor hook.

      1. postScannerFilterRow.png
        79 kB
        Lars Hofhansl
      2. 10047-0.94-sample-v2.txt
        32 kB
        Lars Hofhansl
      3. 10047-0.94-sample.txt
        2 kB
        Lars Hofhansl
      4. 10047-0.94-poc-v4.txt
        27 kB
        Lars Hofhansl
      5. 10047-0.94-poc-v3.txt
        3 kB
        Lars Hofhansl

        Activity

        Hide
        Lars Hofhansl added a comment -

        Can't measure any meaningful perf enhancement in a real world setup. I might to come back to this one later.
        Might need to look at calling nextRow in RegionScannerImpl anyway.

        Show
        Lars Hofhansl added a comment - Can't measure any meaningful perf enhancement in a real world setup. I might to come back to this one later. Might need to look at calling nextRow in RegionScannerImpl anyway.
        Hide
        Lars Hofhansl added a comment -

        Based on your hint Andrew Purtell. Simply removes the endpoints (everything that does not implement RegionObserver).
        Looked into refactoring loadSystemCoprocessors into RegionCoprocessorHost, but would have led to a lot of cut'n'paste.

        What do you think. Will also performance test this, but this is generally cleaner I think.

        Show
        Lars Hofhansl added a comment - Based on your hint Andrew Purtell . Simply removes the endpoints (everything that does not implement RegionObserver). Looked into refactoring loadSystemCoprocessors into RegionCoprocessorHost, but would have led to a lot of cut'n'paste. What do you think. Will also performance test this, but this is generally cleaner I think.
        Hide
        Lars Hofhansl added a comment -

        You're saying we catch ClassCastException and then remove the coprocessor?
        Looking at the code a bit more; in the RegionCoprocessorHost case we add Endpoint to the list of coprocessors, which does not seem to be necessary. Unfortunately only the tableCoprocessors are loaded in RegionCoprocessorHost, the systemCoprocessors are loaded in CoprocessorHost. Could refactor that, or simply just remove the Endpoint from the set of coprocessors, it seems they need (and should?) not to in there.

        Show
        Lars Hofhansl added a comment - You're saying we catch ClassCastException and then remove the coprocessor? Looking at the code a bit more; in the RegionCoprocessorHost case we add Endpoint to the list of coprocessors, which does not seem to be necessary. Unfortunately only the tableCoprocessors are loaded in RegionCoprocessorHost, the systemCoprocessors are loaded in CoprocessorHost. Could refactor that, or simply just remove the Endpoint from the set of coprocessors, it seems they need (and should?) not to in there.
        Hide
        Andrew Purtell added a comment -

        What if we only care about our (as framework) worst case here? I think that is we might encounter a cast exception and error out without limiting the damage to the bad actor and unloading it. So handle that. But, sure, this is getting pretty far from where this issue started.

        Show
        Andrew Purtell added a comment - What if we only care about our (as framework) worst case here? I think that is we might encounter a cast exception and error out without limiting the damage to the bad actor and unloading it. So handle that. But, sure, this is getting pretty far from where this issue started.
        Hide
        Lars Hofhansl added a comment -

        New patch. 100% untested. Checks:

        1. the interface was implemented directly
        2. the any class up the chain either has the hook directly, or BaseRegionCoprocessor is reached.

        This is getting a bit out of hand.
        Also not sure whether I can do the == comparison between the classes, they might come from the different class loaders
        (I think... It might also be the case that BaseRegionObserver is always loaded via the system class loader).
        Then it would need tests to make sure it does the right thing in all situations, etc, etc.

        I think I'm just going to close this.

        Show
        Lars Hofhansl added a comment - New patch. 100% untested. Checks: the interface was implemented directly the any class up the chain either has the hook directly, or BaseRegionCoprocessor is reached. This is getting a bit out of hand. Also not sure whether I can do the == comparison between the classes, they might come from the different class loaders (I think... It might also be the case that BaseRegionObserver is always loaded via the system class loader). Then it would need tests to make sure it does the right thing in all situations, etc, etc. I think I'm just going to close this.
        Hide
        Lars Hofhansl added a comment -

        Just occurred to me, even the first patch is not accurate if somebody has Coprocessor derived from a custom base class, where the base class implements the hook.

        Show
        Lars Hofhansl added a comment - Just occurred to me, even the first patch is not accurate if somebody has Coprocessor derived from a custom base class, where the base class implements the hook.
        Hide
        Lars Hofhansl added a comment -

        Interesting, didn't realize that can happen. After the region is loaded? Ohh, when we detect an error we remove the coprocessor.
        SortedCopyOnWriteSet should have been a hint too

        The first patch is still valid, since we're only removing after the region was loaded.
        I didn't measure any perf improvement with v2 anyway, it seems instanceof is not the issue.

        Show
        Lars Hofhansl added a comment - Interesting, didn't realize that can happen. After the region is loaded? Ohh, when we detect an error we remove the coprocessor. SortedCopyOnWriteSet should have been a hint too The first patch is still valid, since we're only removing after the region was loaded. I didn't measure any perf improvement with v2 anyway, it seems instanceof is not the issue.
        Hide
        Andrew Purtell added a comment -

        The set of installed coprocessors can change at runtime concurrent with iteration of the list.

        Show
        Andrew Purtell added a comment - The set of installed coprocessors can change at runtime concurrent with iteration of the list.
        Hide
        Lars Hofhansl added a comment -

        Something like this. Moves all instances of instanceof into the constructor, at runtime no reflection is necessary.

        Left whitespace untouched so that it is easier to see the diff.

        Show
        Lars Hofhansl added a comment - Something like this. Moves all instances of instanceof into the constructor, at runtime no reflection is necessary. Left whitespace untouched so that it is easier to see the diff.
        Hide
        Lars Hofhansl added a comment -

        Thanks Andrew Purtell. Oops, yes, missed the break there.
        I'll do a few more tests with more columns. If there still is measurable gain, we can consider this.

        Maybe even better - we test all coprocessors at load time and only keep implementors of RegionObserver. That we all casting at "runtime" could be removed.

        Show
        Lars Hofhansl added a comment - Thanks Andrew Purtell . Oops, yes, missed the break there. I'll do a few more tests with more columns. If there still is measurable gain, we can consider this. Maybe even better - we test all coprocessors at load time and only keep implementors of RegionObserver. That we all casting at "runtime" could be removed.
        Hide
        Andrew Purtell added a comment -

        I think this is a reasonable strategy, if the CP framework itself is getting in the way, skip the upcalls if there are no users. We can do it case by case.

        We only get the additional reflection overhead at load time. You can break out of this loop if you find one for a tiny bit of savings:

        +    boolean hasCustomPostScannerFilterRow = false;
        +    for (RegionEnvironment env: coprocessors) {
        +      if (env.getInstance() instanceof RegionObserver) {
        +        try {
        +          env.getInstance().getClass().getDeclaredMethod("postScannerFilterRow", ObserverContext.class,
        +              InternalScanner.class, byte[].class, boolean.class);
        +          hasCustomPostScannerFilterRow = true;
        +        } catch (NoSuchMethodException ignore) {
        +        }
        +      }
        +    }
        +    this.hasCustomPostScannerFilterRow = hasCustomPostScannerFilterRow;
        
        Show
        Andrew Purtell added a comment - I think this is a reasonable strategy, if the CP framework itself is getting in the way, skip the upcalls if there are no users. We can do it case by case. We only get the additional reflection overhead at load time. You can break out of this loop if you find one for a tiny bit of savings: + boolean hasCustomPostScannerFilterRow = false ; + for (RegionEnvironment env: coprocessors) { + if (env.getInstance() instanceof RegionObserver) { + try { + env.getInstance().getClass().getDeclaredMethod( "postScannerFilterRow" , ObserverContext.class, + InternalScanner.class, byte [].class, boolean .class); + hasCustomPostScannerFilterRow = true ; + } catch (NoSuchMethodException ignore) { + } + } + } + this .hasCustomPostScannerFilterRow = hasCustomPostScannerFilterRow;
        Hide
        Lars Hofhansl added a comment -

        Trying with Phoenix I do see a 5-10% improvement when doing a count(1) on a tall table (1 col).
        Scanning the same (1 col) table with the HBase client using a Filter to filter all data at the server I see 25% scan improvement with patch.

        The question is whether the extra complexity in the code is worth it. Andrew Purtell, any opinion?

        Show
        Lars Hofhansl added a comment - Trying with Phoenix I do see a 5-10% improvement when doing a count(1) on a tall table (1 col). Scanning the same (1 col) table with the HBase client using a Filter to filter all data at the server I see 25% scan improvement with patch. The question is whether the extra complexity in the code is worth it. Andrew Purtell , any opinion?
        Hide
        Lars Hofhansl added a comment -

        Sample patch. Checks whether any coprocessor actually has its own postScannerFilterRow method. If not the loop can just be short circuited.

        As I said above, not sure it's even worth it. Just parking the patch here for now.

        Show
        Lars Hofhansl added a comment - Sample patch. Checks whether any coprocessor actually has its own postScannerFilterRow method. If not the loop can just be short circuited. As I said above, not sure it's even worth it. Just parking the patch here for now.
        Hide
        Lars Hofhansl added a comment -

        I think the costly part is the instanceof. So if a coprocessor could make its type and the methods it overloads (as you said) explicit that could save some cycles.
        However, coprocessors are not on the superhot code paths. The one mentioned here being an exception as it could called a lot when we filter a lot of rows at the server. So I am not sure it is actually worth it.

        Show
        Lars Hofhansl added a comment - I think the costly part is the instanceof. So if a coprocessor could make its type and the methods it overloads (as you said) explicit that could save some cycles. However, coprocessors are not on the superhot code paths. The one mentioned here being an exception as it could called a lot when we filter a lot of rows at the server. So I am not sure it is actually worth it.
        Hide
        Nicolas Liochon added a comment -

        I was thinking about

        • allowing the coprocessor to explicitly states the functions it overloaded
        • and/or: by reflexivity, check if the method is implemented in the client coprocessor class or comes from the BaseRegionObserver.

        In both cases, we would construct a list of coprocessorWithPostScannerFilterRow when we load the coprocessor, and we would use this list in the host.

        Show
        Nicolas Liochon added a comment - I was thinking about allowing the coprocessor to explicitly states the functions it overloaded and/or: by reflexivity, check if the method is implemented in the client coprocessor class or comes from the BaseRegionObserver. In both cases, we would construct a list of coprocessorWithPostScannerFilterRow when we load the coprocessor, and we would use this list in the host.
        Hide
        Lars Hofhansl added a comment -

        For all but extreme scenarios the improvement from this is within the noise. I might just close this, especially because I do not see a good way to fix this without loss of functionality.

        Show
        Lars Hofhansl added a comment - For all but extreme scenarios the improvement from this is within the noise. I might just close this, especially because I do not see a good way to fix this without loss of functionality.
        Hide
        Lars Hofhansl added a comment -

        Fengdong Yu the simplest option is to use jvisualvm (ships with JDK), using the sampler.

        Show
        Lars Hofhansl added a comment - Fengdong Yu the simplest option is to use jvisualvm (ships with JDK), using the sampler.
        Hide
        Fengdong Yu added a comment -

        Lars Hofhansl
        Can you share which tool you used for your profiling? I am also has an REMOTE Application consumed a lot of CPU, I want to profile this App remotely.

        Show
        Fengdong Yu added a comment - Lars Hofhansl Can you share which tool you used for your profiling? I am also has an REMOTE Application consumed a lot of CPU, I want to profile this App remotely.
        Hide
        Lars Hofhansl added a comment -

        I would not see this if I did not actually have any coprocessors. (in this case this is the Phoenix coprocessor loaded)

        Show
        Lars Hofhansl added a comment - I would not see this if I did not actually have any coprocessors. (in this case this is the Phoenix coprocessor loaded)
        Hide
        Lars Hofhansl added a comment -

        Better image

        Show
        Lars Hofhansl added a comment - Better image
        Hide
        Lars Hofhansl added a comment -

        Imagine of the profiling session.
        postScannerFilterRow uses more CPU than the ScanQueryMatcher.

        This is from a sampler, so I do not suspect any profiling anomalies.
        Removing that part of the code as a test does indeed yield a 25% improvement in the scan time.

        Now, this is an extreme case: Tall table, one small column, 1 version. So maybe it's not worth spending a lot of time on this.

        Show
        Lars Hofhansl added a comment - Imagine of the profiling session. postScannerFilterRow uses more CPU than the ScanQueryMatcher. This is from a sampler, so I do not suspect any profiling anomalies. Removing that part of the code as a test does indeed yield a 25% improvement in the scan time. Now, this is an extreme case: Tall table, one small column, 1 version. So maybe it's not worth spending a lot of time on this.

          People

          • Assignee:
            Unassigned
            Reporter:
            Lars Hofhansl
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development