ACCUMULO-3067

scan performance degrades after compaction

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: tserver
    • Labels: None
    • Environment:
      MacBook Pro 2.6 GHz Intel Core i7, 16 GB RAM, SSD, OS X 10.9.4, single tablet server process, single client process

      Description

      I've been running some scan performance tests on 1.6.0, and I'm running into an interesting situation in which query performance starts at a certain level and then degrades by ~15% after an event. The test follows roughly this scenario:

      1. Single tabletserver instance
      2. Load 100M small (~10-byte) key/values into a tablet and let it finish major compacting
      3. Disable the garbage collector (this makes the time to the event longer)
      4. Restart the tabletserver
      5. Repeatedly scan from the beginning to the end of the table in a loop
      6. Something happens on the tablet server, like one of {idle compaction of metadata table, forced flush of metadata table, forced compaction of metadata table, forced flush of trace table}
      7. Observe that scan rates dropped by 15-20%
      8. Observe that restarting the scan will not bring performance back to the original level. Performance only recovers upon restarting the tablet server.
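
      (For reference, a minimal sketch of this kind of load-then-scan test against the 1.6 client API; the instance name, credentials, and table name below are placeholders, and the reporter's actual harness is in the attached accumulo_query_perf_test.tar.gz.)

        import java.util.Map.Entry;
        import org.apache.accumulo.core.client.*;
        import org.apache.accumulo.core.client.security.tokens.PasswordToken;
        import org.apache.accumulo.core.data.Key;
        import org.apache.accumulo.core.data.Mutation;
        import org.apache.accumulo.core.data.Value;
        import org.apache.accumulo.core.security.Authorizations;
        import org.apache.hadoop.io.Text;

        public class ScanLoop {
          public static void main(String[] args) throws Exception {
            Connector conn = new ZooKeeperInstance("accumulo", "localhost:2181")
                .getConnector("root", new PasswordToken("secret"));

            // Steps 1-2: load 100M small key/values (assumes table "perftest" exists),
            // then force a major compaction and wait for it to finish.
            BatchWriter bw = conn.createBatchWriter("perftest", new BatchWriterConfig());
            for (long i = 0; i < 100_000_000L; i++) {
              Mutation m = new Mutation(new Text(String.format("%012d", i)));
              m.put(new Text("cf"), new Text("cq"), new Value("v".getBytes()));
              bw.addMutation(m);
            }
            bw.close();
            conn.tableOperations().compact("perftest", null, null, true, true);

            // Step 5: scan the full table in a loop, reporting entries/sec per pass.
            Scanner scanner = conn.createScanner("perftest", Authorizations.EMPTY);
            scanner.setBatchSize(60_000); // large batch size, per the reporter's comment below
            while (true) {
              long count = 0, start = System.currentTimeMillis();
              for (Entry<Key,Value> e : scanner)
                count++;
              System.out.println(count * 1000.0 / (System.currentTimeMillis() - start) + " e/s");
            }
          }
        }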

      I've been able to make this not happen by removing iterators from the iterator tree. It doesn't seem to matter which iterators; removing a certain number both improves performance (significantly) and eliminates the degradation problem. The default iterator tree includes:

      • SourceSwitchingIterator
        • VersioningIterator
          • SynchronizedIterator
            • VisibilityFilter
              • ColumnQualifierFilter
                • ColumnFamilySkippingIterator
                  • DeletingIterator
                    • StatsIterator
                      • MultiIterator
                        • MemoryIterator
                        • ProblemReportingIterator
                          • HeapIterator
                            • RFile.LocalityGroupReader

      We can eliminate the weird condition by narrowing the set of iterators to:

      • SourceSwitchingIterator
        • VisibilityFilter
          • ColumnFamilySkippingIterator
            • DeletingIterator
              • StatsIterator
                • MultiIterator
                  • MemoryIterator
                  • ProblemReportingIterator
                    • HeapIterator
                      • RFile.LocalityGroupReader

      There are other combinations that also perform much better than the default. I haven't been able to isolate this problem to a single iterator, despite removing each iterator one at a time.
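
      (Every level in these trees is a SortedKeyValueIterator wrapping its source, so each entry returned traverses one virtual call per level. A simplified sketch of the shape; this is not the exact Accumulo interface, which also carries options, ranges, and column families:)

        // Simplified stand-in for org.apache.accumulo.core.iterators.SortedKeyValueIterator.
        interface SKVI {
          boolean hasTop();
          void next();        // advance to the next key/value
          String getTopKey(); // the real API returns Key and Value objects
        }

        // Each wrapping level adds one more virtual dispatch per entry returned,
        // which is exactly the kind of call site HotSpot profiles and tries to inline.
        class PassThrough implements SKVI {
          private final SKVI source;
          PassThrough(SKVI source) { this.source = source; }
          public boolean hasTop() { return source.hasTop(); }
          public void next() { source.next(); }
          public String getTopKey() { return source.getTopKey(); }
        }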

      Anybody know what might be happening here? Best theory so far: the JVM learns that iterators can be used in a different way after a compaction, and some JVM optimization like JIT compilation, branch prediction, or automatic inlining stops happening.

      1. jit_log_during_compaction.txt
        5 kB
        Adam Fuchs
      2. accumulo_query_perf_test.tar.gz
        1 kB
        Adam Fuchs
      3. Screen Shot 2014-08-19 at 4.19.37 PM.png
        166 kB
        Adam Fuchs

        Activity

        Mike Drob added a comment -

        JIT call optimization on the iterators shouldn't be happening, since you already have lots of targets for the methods. It typically happens when you only have a single target, e.g. your code only uses one instance of a Runnable, so the JIT knows that every call to run() will resolve to the same object. You have lots (more than 2) of {{SKVI}}s, so that call resolution isn't happening to begin with.

        I haven't looked at your code, but in your screen shot it looks like you have nearly filled DFS. That usually leads to bad things.

        Something happening on a different table suggests that maybe a cache is getting filled with data that is not yours, and you're getting more cache misses than you were before. Try dropping the OS buffer cache and see if that makes a difference?
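
        (For reference, dropping the OS buffer cache on the reporter's platform, OS X, and the Linux equivalent; both are standard OS tools:)

          sudo purge                                           # OS X
          sync && echo 3 | sudo tee /proc/sys/vm/drop_caches   # Linux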

        Josh Elser added a comment -

        in your screen shot it looks like you have nearly filled DFS. That usually leads to bad things.

        Of the HDFS space being used, Accumulo is using 99.9% of it. This is different from HDFS being full.

        Adam Fuchs added a comment -

        The code is pretty simple: it writes a bunch of keys sequentially in a loader program, then a scan program scans in a loop. The only non-standard thing is that I use a very large batch size (60,000) to improve scan throughput and reduce the number of reseeks.

        Josh is correct that HDFS is nowhere near full (is it time to get rid of that incomprehensible stat on the monitor page yet?).

        Data cache hits are nearly zero before and after the event. The uncompressed data does not fit in the tserver's block cache, and I'm never revisiting a block before it gets ejected. The compressed size of the table is about 80MB (0.8 bytes per K/V), so that fits cleanly into the OS's cache. There is no significant disk I/O before or after the event. Index caches are pegged at about 100% before and after the event.

        Adam Fuchs added a comment -

        Another interesting tidbit: CPU utilization doesn't change before and after the event. The tserver runs at about 125% (more than 100% due to multithreaded pipelining) whether it's returning 875,000 e/s before the event or 750,000 e/s after it. A thorough sampling of the tserver execution using jvisualvm doesn't show any one place where the extra CPU time is spent.

        Adam Fuchs added a comment -

        Attached is a sample from tserver_localhost.out when -XX:+PrintCompilation is enabled. There are many compilation notifications early on, then a quiet period. At around 192 seconds into the execution of the tserver I kick off a compaction. This log includes all of the compilation events around that time. It shows several methods being deoptimized ("made not entrant") and then some being optimized again. See https://gist.github.com/chrisvest/2932907 for a pretty good explanation of how to read this log.

        Adam Fuchs added a comment -

        I may be narrowing in on an explanation for this one. When running the tserver with -XX:+PrintCompilation I can watch methods get optimized and deoptimized at runtime. The deoptimization/reoptimization is definitely happening during the event. I suspect that when only scans are happening, the runtime gathers stats about how the iterators are used and optimizes accordingly. When the methods are reoptimized during the compaction they get different usage stats, and are therefore optimized differently.

        Is this a plausible hypothesis? If so, what can/should be done about it? Anybody want to write a native system iterator stack? Should we come up with a way of tricking or seeding the compiler with better stats?
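
        (The profile-driven deoptimization described here can be reproduced in isolation. The toy program below, illustrative and not Accumulo code, warms up a hot call site with one receiver type and then introduces a second; running it with -XX:+PrintCompilation shows "made not entrant" lines right after the new type appears:)

          // DeoptDemo.java -- run with: java -XX:+PrintCompilation DeoptDemo
          interface Iter { int next(); }

          final class ScanIter implements Iter {      // the only type seen during warmup
            private int i;
            public int next() { return i++; }
          }

          final class CompactIter implements Iter {   // introduced later, like a compaction stack
            private int i;
            public int next() { return i += 2; }
          }

          public class DeoptDemo {
            static long drive(Iter it, long calls) {
              long sum = 0;
              for (long c = 0; c < calls; c++)
                sum += it.next();                     // hot, initially monomorphic call site
              return sum;
            }

            public static void main(String[] args) {
              System.out.println(drive(new ScanIter(), 50_000_000L));    // JIT inlines ScanIter.next()
              System.out.println(drive(new CompactIter(), 50_000_000L)); // new type: deopt, recompile
              System.out.println(drive(new ScanIter(), 50_000_000L));    // call site is now bimorphic
            }
          }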

        Keith Turner added a comment -

        I wonder if loading the iterators using different class loaders would make a difference. Using different class loaders yields different instances of the class. The question is, would these different class instances be optimized independently? Also, are the iterators all coming from the same class loader now? I think the system iterators would come from the same class loader.
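
        (For what it's worth, distinct class loaders do give HotSpot distinct classes, and compiled code and profiles are per-class; a quick sketch, with a hypothetical jar path and class name:)

          import java.net.URL;
          import java.net.URLClassLoader;

          public class TwoLoaders {
            public static void main(String[] args) throws Exception {
              URL[] cp = { new URL("file:/path/to/iterators.jar") }; // hypothetical jar
              ClassLoader a = new URLClassLoader(cp, null);          // parent=null: no delegation
              ClassLoader b = new URLClassLoader(cp, null);
              Class<?> ca = a.loadClass("org.example.MyIterator");   // hypothetical class name
              Class<?> cb = b.loadClass("org.example.MyIterator");
              System.out.println(ca == cb); // false: two classes, two independent JIT profiles
            }
          }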

        Adam Fuchs added a comment -

        These methods are all coming from the primary application class loader right now. The stack for a scan of a default-configured table is mostly made up of a mix of system iterators and abstract iterators (like Filter, SkippingIterator, and WrappingIterator).

        If you were going to split up instances of the iterator by class loader, how would you decide which classloader to use? I would think that the problem here is not so much "scan" vs. "compaction", but more "long scan" vs. "short compaction". I guess we could use a different classloader per session, per table, per operation, or some combination of those. Anybody know of any precedents for this?

        Josh Elser added a comment -

        Did a little bit of digging (by no means do I understand this fully) and found http://blog.joda.org/2011/08/printcompilation-jvm-flag.html, specifically this quote from the post:

        Zombie methods are methods whose code has been made invalid by class loading. Generally the server compiler makes aggressive inlining decisions of non-final methods. As long as the inlined method is never overridden the code is correct. When a subclass is loaded and the method overridden, the compiled code is broken for all future calls to it. The code gets declared "not entrant" (no future callers to the broken code), but sometimes existing callers can keep using the code. In the case of inlining, that's not good enough; existing callers' stack frames are "deoptimized" when they return to the code from nested calls (or just if they are running in the code). When no more stack frames hold PC's into the broken code it's declared a "zombie" - ready for removal once the GC gets around to it.

        This makes it seem like loading some extra iterators near the "top" of the stack overrides methods of earlier classes and invalidates the code we had previously compiled. Not sure if I'm interpreting this entirely correctly, but it seems relevant to what you're seeing.

        Adam Fuchs added a comment -

        I saw that post, and I don't think it really covers all the reasons for code becoming "not entrant". https://gist.github.com/chrisvest/2932907 contains the following:

          // What condition caused the deoptimization?
          enum DeoptReason {
            Reason_many = -1,             // indicates presence of several reasons
            Reason_none = 0,              // indicates absence of a relevant deopt.
            Reason_null_check,            // saw unexpected null or zero divisor (@bci)
            Reason_null_assert,           // saw unexpected non-null or non-zero (@bci)
            Reason_range_check,           // saw unexpected array index (@bci)
            Reason_class_check,           // saw unexpected object class (@bci)
            Reason_array_check,           // saw unexpected array class (aastore @bci)
            Reason_intrinsic,             // saw unexpected operand to intrinsic (@bci)
            Reason_unloaded,              // unloaded class or constant pool entry
            Reason_uninitialized,         // bad class state (uninitialized)
            Reason_unreached,             // code is not reached, compiler
            Reason_unhandled,             // arbitrary compiler limitation
            Reason_constraint,            // arbitrary runtime constraint violated
            Reason_div0_check,            // a null_check due to division by zero
            Reason_age,                   // nmethod too old; tier threshold reached
            Reason_LIMIT,
            // Note:  Keep this enum in sync. with _trap_reason_name.
            Reason_RECORDED_LIMIT = Reason_unloaded   // some are not recorded per bc
            // Note:  Reason_RECORDED_LIMIT should be < 8 to fit into 3 bits of
            // DataLayout::trap_bits.  This dependency is enforced indirectly
            // via asserts, to avoid excessive direct header-to-header dependencies.
            // See Deoptimization::trap_state_reason and class DataLayout.
          };
        

        This also might not be a complete list of reasons to deoptimize, but I think it might be more likely we're hitting some kind of a null check or range check invariant violation.

        Josh Elser added a comment -

        I think it might be more likely we're hitting some kind of a null check or range check invariant violation.

        So, how do we prove the reason? Is there a way that we can get even more info from HotSpot?
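
        (One way to get more detail: HotSpot's compilation log records each deoptimization as an uncommon trap tagged with the DeoptReason names from the enum above. Enabling it looks roughly like the following; the flags are standard diagnostic options and the log file name is arbitrary:)

          java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation \
               -XX:LogFile=hotspot.log <tserver launch command>

        (The resulting XML contains <uncommon_trap> elements whose reason attribute, e.g. class_check or range_check, names the failing check.)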


          People

          • Assignee: Unassigned
          • Reporter: Adam Fuchs
          • Votes: 0
          • Watchers: 3