Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 2.2.0 beta 1
    • Component/s: Tools
    • Labels:

      Description

      I think it would be nice to log repair stats and results the way query tracing stores traces in the system keyspace. With that, you wouldn't have to dig through each node's log file to see the status of a repair you invoked and how it performed. Instead, you could query the repair log with the session ID to see the state and stats of all nodes involved in that repair session.
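
      As a rough illustration (not part of the patch itself): if repair traces land in the same system_traces tables that query tracing already uses, the per-session events could be read back with an ordinary CQL query. The sketch below uses the DataStax Java driver; the contact point and the way the session UUID is obtained are assumptions.

      import com.datastax.driver.core.Cluster;
      import com.datastax.driver.core.ResultSet;
      import com.datastax.driver.core.Row;
      import com.datastax.driver.core.Session;
      import java.util.UUID;

      // Minimal sketch: read the trace events for one repair session out of system_traces,
      // the same keyspace that query tracing writes to. Assumes a local node and that the
      // session id was reported when the repair was invoked.
      public class RepairTraceReader
      {
          public static void main(String[] args)
          {
              UUID sessionId = UUID.fromString(args[0]);
              try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                   Session session = cluster.connect())
              {
                  ResultSet rs = session.execute(
                      "SELECT source, activity FROM system_traces.events WHERE session_id = ?", sessionId);
                  for (Row row : rs)
                      System.out.printf("%s  %s%n", row.getInet("source"), row.getString("activity"));
              }
          }
      }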

      1. trunk@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt
        7 kB
        Ben Chan
      2. trunk@8ebeee1-5483-v01-002-simple-repair-tracing.txt
        8 kB
        Ben Chan
      3. test-5483-system_traces-events.txt
        8 kB
        Ben Chan
      4. trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch
        7 kB
        Ben Chan
      5. trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch
        8 kB
        Ben Chan
      6. v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch
        22 kB
        Ben Chan
      7. v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch
        26 kB
        Ben Chan
      8. v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch
        24 kB
        Ben Chan
      9. 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch
        21 kB
        Ben Chan
      10. 5483-v06-05-Add-a-command-column-to-system_traces.events.patch
        5 kB
        Ben Chan
      11. 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch
        2 kB
        Ben Chan
      12. ccm-repair-test
        3 kB
        Ben Chan
      13. 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch
        3 kB
        Ben Chan
      14. 5483-v07-08-Fix-brace-style.patch
        2 kB
        Ben Chan
      15. 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch
        12 kB
        Ben Chan
      16. 5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch
        4 kB
        Ben Chan
      17. 5483-full-trunk.txt
        55 kB
        Lyuben Todorov
      18. cqlsh-left-justify-text-columns.patch
        2 kB
        Ben Chan
      19. 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch
        10 kB
        Ben Chan
      20. 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch
        8 kB
        Ben Chan
      21. 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch
        9 kB
        Ben Chan
      22. 5483-v08-14-Poll-system_traces.events.patch
        12 kB
        Ben Chan
      23. 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch
        12 kB
        Ben Chan
      24. prerepair-vs-postbuggedrepair.diff
        4 kB
        Lyuben Todorov
      25. 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch
        1 kB
        Ben Chan
      26. 5483-v10-17-minor-bugfixes-and-changes.patch
        5 kB
        Ben Chan
      27. 5483-v10-rebased-and-squashed-471f5cc.patch
        80 kB
        Ben Chan
      28. 5483-v11-01-squashed.patch
        80 kB
        Ben Chan
      29. 5483-v11-squashed-nits.patch
        63 kB
        Lyuben Todorov
      30. 5483-v12-02-cassandra-yaml-ttl-doc.patch
        0.6 kB
        Ben Chan
      31. 5483-v13-608fb03-May-14-trace-formatting-changes.patch
        9 kB
        Ben Chan
      32. 5483-v14-01-squashed.patch
        84 kB
        Ben Chan
      33. 5483-v15.patch
        67 kB
        Benedict
      34. 5483-v15-05-Move-command-column-to-system_traces.sessions.patch
        6 kB
        Ben Chan
      35. 5483-v15-04-Re-add-old-StorageService-JMX-signatures.patch
        7 kB
        Ben Chan
      36. 5483-v15-03-Exact-doubling-for-exponential-backoff.patch
        1 kB
        Ben Chan
      37. 5483-v15-02-Hook-up-exponential-backoff-functionality.patch
        1 kB
        Ben Chan
      38. 5483-v17.patch
        55 kB
        Ben Chan
      39. 5483-v17-00.patch
        9 kB
        Ben Chan
      40. 5483-v17-01.patch
        2 kB
        Ben Chan

        Activity

        Ben Chan added a comment -

        Thanks for helping get this through the final set of speed bumps. As fast as this codebase moves, I'm surprised that last rebase went so smoothly. (NRN)

        Joshua McKenzie added a comment -

        Committed. Thanks for all the hard work on this Ben!

        Ben Chan added a comment -

        Sorry; limited computer access last week, zero access to dev box (network setup issues), turkey.


        Branch 5483_squashed (currently at commit 78686c61e38e, merges cleanly with trunk at 06f626acd27b) looks fine to me; it appears only mechanical changes were needed to resolve the merge conflicts.

        Ran my standard simplistic test and saw no obvious problems (both at 78686c61e38e and again when merged with 06f626acd27b).

        Any problem is likely to be something I did, or some unforeseen interaction with code in the updated trunk. Hopefully neither of those possibilities is true. I'll keep watching this thread/issue just in case.

        Joshua McKenzie added a comment -

        Sorry for the delay - was at a conference last week. I went ahead and rebased to current trunk as of today, since there were some more changes around repair parallelism; your most recent changes look good and I'm not too worried about us having some public final strings in TraceKeyspace.

        If I could get a once-over reverse-review from you to make sure I didn't miss anything in that final rebase, I'll get this committed.

        Rebased and squashed branch available Here

        Ben Chan added a comment -

        Updated https://github.com/usrbincc/cassandra/tree/5483-review (currently at commit 85d974753dc7f).

        So a few changes to tracing got merged in on Nov 12, making a cut-and-dried manual rebase impossible.

        • Use a default ttl on the system_traces.* CFs. Use that to clean up the code in Tracing, since you no longer have to specify ttls.
        • Move a lot of row-insertion code out of Tracing into a new TraceKeyspace class.
        • Since there is no more need to specify ttl, use CFRowAdder (which doesn't do ttl) for convenience.

        Since the repair tracing patch requires a user-configurable ttl (or at the very least, a different ttl for repair tracing and query tracing), I needed to re-add ttl specification. It wasn't much more code, so I decided to use explicitly-expiring cells only when the ttl doesn't match the default ttl.
        From a quick skim of the source code, this appears to save two native ints per column. I'm not sure that's really enough to care about (especially since repair tracing, which is likely to insert more rows, uses a ttl different from the default), but it was a simple enough change.
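
        A minimal sketch of that choice (illustrative only; the helper name is hypothetical, not an actual Cassandra API):

        // Illustrative only: pick the TTL to write for a trace cell. Returning 0 means
        // "no explicit TTL" - the table's default_time_to_live applies, so the cell does
        // not need to carry its own ttl + localDeletionTime fields.
        static int effectiveTtl(int requestedTtlSeconds, int tableDefaultTtlSeconds)
        {
            return requestedTtlSeconds == tableDefaultTtlSeconds ? 0 : requestedTtlSeconds;
        }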


        Merge conflicts are a real pain point with git. I attempted to make reviewing the merge conflict resolution changes easier by inserting an intermediate commit that includes the conflict markers from git, unmodified. Feel free to hide all the messiness with a git merge --squash.

        • Fixed a race condition with repair trace sessions sometimes not being properly stopped (see commit 801b6fbf56771).
        • Possible code no-no to note: I made two private fields (SESSIONS_TABLE and EVENTS_TABLE) in TraceKeyspace public. I only use EVENTS_TABLE, but made them both public for symmetry (good? bad? don't know). They're final fields, if that's any consolation.
        Joshua McKenzie added a comment -

        Linked branch looks good, renames are solid, tests out fine. 300MB of repairs at RF=3 generated around 300k of event trace data so I'm comfortable with the current level; we can revisit paring that down in the future as appropriate.

        Patch doesn't apply/rebase to current trunk cleanly - if I could get you to do a quick rebase we'll be ready to wrap this up. Thanks for all the hard work on this!

        Ben Chan added a comment -

        Sorry; SMART errors and reallocated sectors, which only really account for about 2-3 days of the delay. But just to say that it wasn't merely a case of hangover from too much Halloween candy.


        Updated https://github.com/usrbincc/cassandra/tree/5483-review (currently at commit 202a2e2e5e602).

        Merges cleanly with trunk (at commit d286ac7d072fe), and builds and tests cleanly.

        I decided to be a little opinionated and did some refactoring along the lines of my Oct 23 message.

        • Used a TraceState#waitActivity function instead of TraceState#isDone (waitActivity gets closer to doing only what it says on the tin; makes it less hairy to comment).
        • Moved all exponential backoff timeout code to StorageService#createQueryThread

        In addition, I renamed TraceState#enableNotifications to TraceState#enableActivityNotification to attempt (naming is hard) to avoid confusion with TraceState#setNotificationHandle, which is entirely unrelated.

        Note: beyond having made this opinionated edit, I'm not planning to be particularly opinionated about advocating for it. All of that code should eventually go away once there is some way to get notified about table updates instead of having to do all that messy polling.

        Extra note: Cassandra triggers seem to be very close to what is needed, if only they could be specified to run on a given node (i.e. the node that is being repaired). The last time I checked on this, this wasn't possible.


        Unfiltered traces:

        • The extra traces are generic message send-receive traces that existed prior to this patch. They were originally there for query tracing, which benefits from more detailed tracing.
        • These extra traces were filtered out for repair up until v16 of this patch. This means that any discussions of trace messages prior to that point are referring to the filtered traces.

        But I can't say that they're doing any real harm. I mean, it's only 3x the traces (estimated), and not an order of magnitude or more.

        It's probably fine as it is. I certainly can't unequivocally state that there's no use for those extra traces. Besides, extra information can always be filtered out at a higher level (assuming it's tagged appropriately).

        Joshua McKenzie added a comment -

        Think that was just me not refreshing the tab I had open in Chrome w/this ticket in it - missed the update concerning local build problems.

        Re: isDone: miscommunication in ticket. Changing from bit-magic to named variables w/flow changed isDone into a self-commenting method and addressed my concerns with it. I appreciate the javadoc @ the top giving a general overview as well. The method is simple enough that switching to BlockingQueue isn't really necessary at this time imo.

        Re: unfiltered traces: are these extraneous messages or just more verbosity surrounding the activity of the repair? It does generate a lot of output but this is for tracing and I'd be comfortable with us filtering down as a future effort if we decide it's necessary.

        Last nit: isDone should be renamed since it does more than just check if things are done. We could either split out the isDone check from waitAndDouble() or just rename to waitWithDoubleTimeout() or something like that?

        Ben Chan added a comment -

        Communication seems to be off by a phase. My Oct 23 message may not have made it across. Either that or I wasn't writing clearly enough (But in my defense, natural language has a fiddly REPL and almost-useless error messages. And there's no source code.).

        Summary:

        • Forget about the removal of the addCallback + sameThreadExecutor code before. It turns out I didn't need to do that. The problem (I think) was some corruption in my build.
        • Commenting isDone results in a convoluted comment. I was considering whether or not to do a refactor in order to make the comment require less complication.

        For now I've gone ahead and commented isDone as best I can and fixed the naming that resulted in wait(wait);, along with various other fixes.

        See https://github.com/usrbincc/cassandra/tree/5483-review (currently at commit 15d8ec0a9fbbf).


        Closing comments:

        • For future reference: If you don't care about handling multiple consumers (as the current isDone does not), a large chunk of isDone (and its supporting code) could easily be replaced with a BlockingQueue and its offer and poll methods; see the sketch after this list.
        • Can I get a confirmation that the unfiltered repair traces (see my Oct 22 message) in the current patch is acceptable?
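
        A minimal sketch of that BlockingQueue alternative (illustrative only; the class and method names are hypothetical, and a single consumer is assumed):

        import java.util.concurrent.BlockingQueue;
        import java.util.concurrent.LinkedBlockingQueue;
        import java.util.concurrent.TimeUnit;

        // Hypothetical replacement for the hand-rolled wait/notify in isDone:
        // producers offer events, the single polling thread polls with a timeout.
        final class ActivitySignal
        {
            enum Event { ACTIVITY, STOPPED }

            private final BlockingQueue<Event> events = new LinkedBlockingQueue<>();

            void notifyActivity() { events.offer(Event.ACTIVITY); }
            void notifyStopped()  { events.offer(Event.STOPPED); }

            // Returns null if the wait timed out with no activity.
            Event await(long timeoutMillis) throws InterruptedException
            {
                return events.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            }
        }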
        Joshua McKenzie added a comment -

        DebuggableThreadPoolExecutor:

        Just to confirm (though I'm mostly sure this is what you mean), Do you mean that the function name should be changed to something like createCachedThreadpoolWithMaximumPoolSize?

        Yep - just denote that it's max pool size

        If an older server version ever initiates a trace, then traceType could be null. And if it is null, that means that it's a QUERY traceType (since that's the only kind of trace available in previous versions). If a newer server version ever sends a null traceType, that should probably be an error, though nothing like that is being done yet.

        Totally legit. I'm fine with that.

        I touched on this code change earlier in the paragraph that begins "Note that I switched some code".

        You did, and that's my bad. There's a lot of history on this ticket that I needed to re-read and that was one of those things. I believe the changed code will behave identically to the original implementation with the caveats you listed, and so long as we can hand off that swallowed exception to the JVMStabilityInspector and document that we intend to swallow if things explode, I have no complaints.

        Futures#addCallback uses Uninterruptibles#getUninterruptibly instead of a plain Future#get, the main difference being that it will keep retrying Future#get even in the face of InterruptedException.

        Not sure whether this would be a problem or not, but replicating previous behavior would be the safest route.

        In the interests of not gratuitously rearranging code, I'll merely see about adding better comments and making the code less magical.

        Less magical is almost always a good thing.

         wait(wait); 

        Reads oddly - might want to tweak naming there. Other than that, isDone on the review branch comments itself.

        nit: a more helpful description for trace_repair would be good. Currently reads "Use -tr to trace repair"

        Ben Chan added a comment -

        Review branch up at:

        https://github.com/usrbincc/cassandra/tree/5483-review


        StorageService#createRepairTask:

        So it turns out that my mistaken assumption was this: "Tracing.instance.stopSession() needs to run in the main repair thread." (This was not one of the assumptions that I listed explicitly above).

        Why did I believe this? Because repair kept on hanging at the end until I made this change as a sanity check.

        Why does it not hang currently? I don't know, but I suspect there was something odd in the build that got cleaned out when I did an "ant clean". Or when I manually cleaned out the build/lib/jars directory (there were multiple versions of the same library in there; "ant clean" was not taking care of this).

        So I put back the original addCallback code structure.


        Commenting isDone:

        I started trying to do this, but it quickly devolved into just badly parroting the code. It would simplify commenting if I separated this function into multiple parts, maybe something like this:

        // TraceState
        public enum Status { IDLE, ACTIVE, STOPPED; }
        private Status status;
        /* //// tentative doc comment ////
         * Wait (with timeout) until status is ACTIVE or STOPPED. Reading an ACTIVE
         * status resets it to IDLE. Due to the status reset logic, this function only
         * works properly with a single consumer.
         * @param timeout timeout in milliseconds
         * @return activity status
         */
        public synchronized Status getActivity(long timeout)
        {
            // Note: still just the general gist, with the timeout bookkeeping filled in.
            long deadline = System.currentTimeMillis() + timeout;
            try
            {
                for (long remaining = timeout; status == Status.IDLE && remaining > 0;
                     remaining = deadline - System.currentTimeMillis())
                    wait(remaining);
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            }
            Status retStatus = status;
            status = status == Status.ACTIVE ? Status.IDLE : status;
            return retStatus;
        }
        
        // createQueryThread's returned Runnable#run
        while ((r = getActivity(timeout)) != TraceState.Status.STOPPED)
        {
           if (r == TraceState.Status.IDLE) // double timeout if second time through with this timeout
           { ... }
           else // reset timeout to minWait
           { ... }
           // ...
        }
        

        Note that this would end up being structured like a cleaner version of 5483-v14-01-squashed.patch, with further simplifications due to not trying to handle multiple consumers.

        Ben Chan added a comment -

        I'm going to disappoint you and say that a lot of the reasons for how things are are historical, or due to code transformations intended to precisely maintain some aspect of the old behavior, without trying to question the logic behind that particular behavior.

        In other words, I've just been trying to add this one particular behavior (repair tracing) without otherwise changing program behavior. This is probably most obvious around the trace calls, where (hopefully) the log messages are unchanged, even when they're different from the trace messages. I really need to go back through that with a fine-tooth comb, because I think there were some places where I may have changed capitalization. Or worse.


        Most of your comments were straightforward. Just need some clarification on these:

        DebuggableThreadPoolExecutor:

        Do we intend to create w/Integer.MAX_VALUE as 'size' on this method?

        Yes, in order to preserve the behavior of the previous code (see my 2014-03-08 comment and the discussion surrounding it).

        If so, name should reflect it like other methods in file.

        Just to confirm (though I'm mostly sure this is what you mean), Do you mean that the function name should be changed to something like createCachedThreadpoolWithMaximumPoolSize?

        OutboundTcpConnection:

        Why do we assume that if TraceType is null the type is QUERY?

        If an older server version ever initiates a trace, then traceType could be null. And if it is null, that means that it's a QUERY traceType (since that's the only kind of trace available in previous versions). If a newer server version ever sends a null traceType, that should probably be an error, though nothing like that is being done yet.

        StorageService:

        Swallowing Throwable on line 2828 w/out any logging or comment as to why is unclear. Using exception handling as control flow is a step back from the sameThreadExecutor error handling approach taken prior as we lose naming and intention information. Does this gain us something that I'm not aware of?

        Before saying anything else: yes, I probably should have commented the exception-swallowing.

        I touched on this code change earlier in the paragraph that begins "Note that I switched some code". In a nutshell: preserving behavior of previous code.

        Whether I succeeded or not is a different issue, but at least let me run through my thought process so that you can help me find the holes in it. Here is a redacted version of the code:

        // Previous version:
        Futures.addCallback(allSessions, new FutureCallback<Object>()
        {
            public void onSuccess(@Nullable Object result)
            {
                // onSuccess stuff ...
                repairComplete();
            }
        
            public void onFailure(Throwable t)
            {
                // onFailure stuff (empty, except for repairComplete) ...
                repairComplete();
            }
        
            private void repairComplete()
            {
                // repairComplete stuff ...
            }
        }, MoreExecutors.sameThreadExecutor());
        

        Here are my assumptions:

        • // onSuccess stuff ... does not throw any exceptions out of its code block.
        • // repairComplete stuff ... works the same no matter which thread it runs in (assuming it's called at the proper time, of course).
        • The guava library used corresponds to the source code found at build/lib/sources/guava-16.0.1-sources.jar.

        // Review version:
        try
        {
            allSessions.get();
            // onSuccess stuff ...
        }
        catch (Throwable t)
        {
            // onFailure stuff (empty, except for repairComplete) ...
        }
        // repairComplete stuff ...
        

        Since allSessions.get() is the only thing that can throw to the catch block, then its success or failure should (almost, see below) be the only way to trigger the onSuccess and onFailure "branches" that correspond to the original code.

        Intended difference:

        • // repairComplete stuff ... should now run in the main repair thread, instead of in the thread of the last exiting subtask in allSessions.

        Unintended differences:

        • Futures#addCallback uses Uninterruptibles#getUninterruptibly instead of a plain Future#get, the main difference being that it will keep retrying Future#get even in the face of InterruptedException. This may be important, so I could do this also.
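
        For reference, the retry behaviour of getUninterruptibly amounts to roughly the following (a simplified sketch of the Guava semantics, not the actual library source):

        // Simplified sketch: keep retrying get() across InterruptedException, then
        // restore the thread's interrupt status before returning to the caller.
        static <T> T getUninterruptibly(Future<T> future) throws ExecutionException
        {
            boolean interrupted = false;
            try
            {
                while (true)
                {
                    try
                    {
                        return future.get();
                    }
                    catch (InterruptedException e)
                    {
                        interrupted = true;
                    }
                }
            }
            finally
            {
                if (interrupted)
                    Thread.currentThread().interrupt();
            }
        }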

        TraceState

        Right now isDone is only called by the query thread but there's nothing in the method signature or comments of that method documenting the doubling-every-second-run algorithm it's using.

        The doubling behavior used to be contained in the query thread; "isDone" used to be a (more or less) simple wait with timeout.

        Agree on needing comments. In the interests of not gratuitously rearranging code, I'll merely see about adding better comments and making the code less magical.


        I can't get to any code-editing at the moment, but probably sometime tomorrow.

        The patch in its current form should still be fine if you're reviewing the functionality, though.

        Joshua McKenzie added a comment -

        Initial observations / thoughts:

        90 days on default repair trace TTL seems long to me, especially w/incremental being new default.

        DebuggableThreadPoolExecutor:

        • 'createWithCachedTheadpool' is misspelled -> needs an r in Thread
        • Do we intend to create w/Integer.MAX_VALUE as 'size' on this method? If so, name should reflect it like other methods in file.

        MessageOut:

        • traceParameters isn't used elsewhere so logic can be nested inside constructor where it's called and remove unnecessary method call indirection.

        OutboundTcpConnection:

        • Why do we assume that if TraceType is null the type is QUERY? We should probably be explicitly setting that for all tracing types going forward if we plan on expanding the use of the enum

        StorageService:

        • nit: declaring queryThread in createRepairTask but aren't doing anything with it -> we should just call the method w/out declaration if we don't intend to use it
        • usage of bitwise ops to toggle si for indexing into seen array is unnecessarily complex. A distinct 0/1 and toggle is more clear - there's no need for inline index flipping w/bitwise ops as it's not gaining us any performance or code clarity.
        • nit: If not intending to use Object result from allSessions.get(), just remove the declaration and call the method
        • Swallowing Throwable on line 2828 w/out any logging or comment as to why is unclear. Using exception handling as control flow is a step back from the sameThreadExecutor error handling approach taken prior as we lose naming and intention information. Does this gain us something that I'm not aware of?

        TraceState

        • Right now isDone is only called by the query thread but there's nothing in the method signature or comments of that method documenting the doubling-every-second-run algorithm it's using.
        • the implementation of isDone looks like a premature optimization. I would prefer a more clear implementation of 'double every second run' than:
                      else if (haveWaited)
                      {
                          wait = Math.min((wait ^ 1) << (wait & 1), maxWaitMillis);
                          return false;
                      }
                      else
                      {
                          haveWaited = true;
                          try
                          {
                              wait(wait & ~1);
                          }
          

        Now that I'm more familiar with the changes here I'm going to need to re-read the notes on this ticket and test out / inspect output. The above is just from my first read-through of the code yesterday.
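
        To make the "double every second run" preference above concrete, one named-state version could look like the following (an illustrative sketch only, not code from the patch; the class and field names are made up):

        // Illustrative sketch: exponential backoff that doubles the timeout only after
        // two consecutive idle passes, written with named state instead of bit tricks.
        final class Backoff
        {
            private final long minWaitMillis;
            private final long maxWaitMillis;
            private long waitMillis;
            private boolean waitedOnceAtThisTimeout;

            Backoff(long minWaitMillis, long maxWaitMillis)
            {
                this.minWaitMillis = minWaitMillis;
                this.maxWaitMillis = maxWaitMillis;
                this.waitMillis = minWaitMillis;
            }

            // Called after an idle wait; returns the timeout to use for the next wait.
            long nextTimeout()
            {
                if (waitedOnceAtThisTimeout)
                {
                    waitMillis = Math.min(waitMillis * 2, maxWaitMillis); // double every second run
                    waitedOnceAtThisTimeout = false;
                }
                else
                {
                    waitedOnceAtThisTimeout = true;
                }
                return waitMillis;
            }

            // Called when activity was observed; drop back to the minimum.
            void reset()
            {
                waitMillis = minWaitMillis;
                waitedOnceAtThisTimeout = false;
            }
        }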

        Ben Chan added a comment -

        Hm. Does that mean you've already got some feedback ready?

        I'll be here on and off for a little while. Maybe we can get in one full iteration (or n?) today.

        Joshua McKenzie added a comment -

        I actually started looking at this yesterday, hence my question.

        Ben Chan added a comment - edited

        I'm not planning on making any more changes pre-review. Feel free to go ahead.

        (Though it would have been fine even if you had started prior to 5483-v17-00 and 5483-v17-01; the differences are not substantial. Edit: even if they had been substantial, you were well within your rights to start reviewing as soon as I posted up the initial patch.)

        Joshua McKenzie added a comment -

        I've put together a feature branch w/v17 and the 2 patches here. Are we stable and ready for review at this point Ben Chan?

        Also, for subsequent revisions please clone this feature branch, make modifications, and upload to github and link the branch here in the comments. This ticket is already far too cluttered w/attachments and it'll help keep things more clear.

        Ben Chan added a comment -

        Thanks; "node repair -full" (in conjunction with "ccm clear") does indeed force streaming repair.

        5483-v17-00.patch fixes a few trace calls that got lost/misplaced in the rebase.
        5483-v17-01.patch reorganizes some code; functionally identical.

        # This bash code should work out-of-the-box with a stock setup.
        JMXGET='/jmx_port/{p=$2;} /binary/{split($2,a,/\047/);h=a[2];}
          END{printf("bin/nodetool -h %s -p %s\n",h,p,cmd);}'
        ccm_nodetool() { local N=$1; shift; $(ccm $N show | awk -F= "$JMXGET") "$@"; }
        dl_apply_maybe() { for url; do { [ -e $(basename $url) ] || curl -sO $url; } &&
          ! [ ${url%.patch} = $url ] && git apply $(basename $url); done; }
        
        NEW_BRANCH=$(date +5483-17--%Y%m%d-%H%M%S)
        W=https://issues.apache.org/jira/secure/attachment
        
        git checkout -b $NEW_BRANCH 49833b9 &&
        dl_apply_maybe \
          $W/12633156/ccm-repair-test \
          $W/12675963/5483-v17.patch \
          $W/12675963/5483-v17-00.patch \
          $W/12676340/5483-v17-01.patch &&
        ant clean && ant &&
        chmod +x ./ccm-repair-test && ./ccm-repair-test -kR &&
        ccm node1 stop && ccm node1 clear && ccm node1 start &&
        ccm_nodetool node1 repair -tr -full &&
        ccm node1 showlog | grep "Performing streaming repair"
        

        Note that I switched some code to a different thread in order to facilitate trace handling – see the diff hunk near the end of StorageService#createRepairTask. I needed certain calls to happen in the parent repair thread, and this seemed to be the simplest way.

        As far as I can tell, there shouldn't be any differences in functionality or concurrency level (i.e. number of tasks that are executing concurrently), but someone should examine that section just to make sure.


        The issue with unfiltered traces (see my previous message) still remains, but if push comes to shove, you can consider this as just a special case of having a trace call where it's not needed, or missing where it is needed.

        In other words, a cosmetic problem. Fixing this should not require any changes or additions to the over-wire protocol or the JMX interface.

        Marcus Eriksson added a comment -

        you probably need to do "nodetool repair -full", incremental repairs are default in trunk

        Philip Thompson added a comment -

        No, I know of no changes to that command. I was not even aware of its existence.

        Michael Shuler added a comment -

        My old method for forcing a streaming repair (yukim's "ccm clear" suggestion) no longer seems to work.

        I peeked at the ccm git repo and I don't see any recent changes around "clear". I did not go through your whole process, but a quick test of clear works out for me with the latest ccm HEAD:

        mshuler@hana:~$ ccm create test -v git:cassandra-2.1.0 -n1 -s
        Fetching Cassandra updates...
        Current cluster is now: test
        mshuler@hana:~$ ls -la ~/.ccm/test/node1/{commitlogs,data/system}/
        /home/mshuler/.ccm/test/node1/commitlogs/:
        total 108
        drwxr-xr-x 2 mshuler mshuler     4096 Oct 20 20:34 .
        drwxr-xr-x 8 mshuler mshuler     4096 Oct 20 20:34 ..
        -rw-r--r-- 1 mshuler mshuler 33554432 Oct 20 20:35 CommitLog-4-1413855299422.log
        -rw-r--r-- 1 mshuler mshuler 33554432 Oct 20 20:34 CommitLog-4-1413855299423.log
        
        /home/mshuler/.ccm/test/node1/data/system/:
        total 72
        drwxr-xr-x 18 mshuler mshuler 4096 Oct 20 20:34 .
        drwxr-xr-x  4 mshuler mshuler 4096 Oct 20 20:35 ..
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:34 IndexInfo-9f5c6374d48532299a0a5094af9ad1e3
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:34 batchlog-0290003c977e397cac3efdfdc01d626b
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:34 compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca
        drwxr-xr-x  3 mshuler mshuler 4096 Oct 20 20:34 compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:34 hints-2666e20573ef38b390fefecf96e8f0c7
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:35 local-7ad54392bcdd35a684174e047860b377
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:34 paxos-b7b7f0c2fd0a34108c053ef614bb7c2d
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:34 peer_events-59dfeaea8db2334191ef109974d81484
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:34 peers-37f71aca7dc2383ba70672528af04d4f
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:34 range_xfers-55d764384e553f8b9f6e676d4af3976d
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:35 schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:35 schema_columns-296e9c049bec3085827dc17d3df2122a
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:35 schema_keyspaces-b0f2235744583cdb9631c43e59ce3676
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:34 schema_triggers-0359bc7171233ee19a4ab9dfb11fc125
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:34 schema_usertypes-3aa752254f82350b8d5c430fa221fa0a
        drwxr-xr-x  2 mshuler mshuler 4096 Oct 20 20:34 sstable_activity-5a1ff267ace03f128563cfae6103c65e
        mshuler@hana:~$ ccm node1 clear
        mshuler@hana:~$ ls -la ~/.ccm/test/node1/{commitlogs,data/system}/
        ls: cannot access /home/mshuler/.ccm/test/node1/data/system/: No such file or directory
        /home/mshuler/.ccm/test/node1/commitlogs/:
        total 8
        drwxr-xr-x 2 mshuler mshuler 4096 Oct 20 20:35 .
        drwxr-xr-x 8 mshuler mshuler 4096 Oct 20 20:35 ..
        

        Philip Thompson do you know of any recent changes around "clear"?

        Ben Chan added a comment -

        Some nice changes to the JMX interface in the last 3 months; I no longer have to add lots of boilerplate function overloads.

        read -r JMXGET <<E
        /jmx_port/{p=\$2;} /binary/{split(\$2,a,/\047/);h=a[2];} \
        END{printf("bin/nodetool -h %s -p %s\n",h,p,cmd);}
        E
        
        ccm_nodetool() { local N=$1; shift; $(ccm $N show | awk -F= "$JMXGET") "$@"; }
        
        # git checkout 49833b9
        W=https://issues.apache.org/jira/secure/attachment
        for url in $W/12675963/5483-v17.patch; do
          { [ -e $(basename $url) ] || curl -sO $url; } && git apply $(basename $url)
        done &&
        ant clean && ant &&
        ./ccm-repair-test -kR &&
        ccm node1 stop &&
        ccm node1 clear &&
        ccm node1 start &&
        ccm_nodetool node1 repair -tr
        

        Issues encountered during testing:

        • My old method for forcing a streaming repair (yukim's "ccm clear" suggestion) no longer seems to work, which means I can no longer test tracing of streaming. Is there something else I have to do now to force a streaming repair?
        • 5483-16 and 5483-17 no longer filter traces based on traceType. Using my standard simple test, I count 182 traces, versus 61 from a test run from 2014-06, about 200% more traces (from the query-tracing code paths), though I don't know how this percentage will change with longer repairs. Some representative extraneous traces:
        Sending message to /127.0.0.2
        Sending message to /127.0.0.3
        Message received from /127.0.0.3
        Processing response from /127.0.0.3
        Message received from /127.0.0.2
        Processing response from /127.0.0.2
        
        Parsing SELECT * FROM system.sstable_activity WHERE keyspace_name=? and columnfamily_name=? and generation=?
        Preparing statement
        Executing single-partition query on sstable_activity
        Acquiring sstable references
        Merging memtable tombstones
        Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones
        
        Jonathan Ellis added a comment -

        That should be it!

        Ben Chan added a comment -

        Okay, that covers all the questions that I had.

        TODO list:

        • Rebase to trunk (hopefully the last 3 months haven't made this too painful).
        • Miscellaneous remaining cruft from the 5483-16 cleanup. Possibly some code simplification (low-hanging fruit only).
        • Guard out-of-bounds-array access in TraceState#deserialize. The simplest thing I can think of is to have a "NONE" tracetype and return that for any unknown tracetype (see the sketch after this list).
        • Fix format string "Sending completed merkle tree to %s for %s.%s" being traced verbatim.
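
        For the TraceState#deserialize item, roughly the guard I have in mind, as a minimal sketch (the enum constants and method shape here are illustrative only, not the actual code):

        // illustrative sketch; the real trace-type handling may end up looking different
        public enum TraceType
        {
            NONE, QUERY, REPAIR;

            private static final TraceType[] ALL_VALUES = values();

            // Return NONE for any ordinal we don't recognize instead of indexing out of
            // bounds, e.g. when a newer node sends a trace type an older node doesn't know.
            public static TraceType deserialize(byte b)
            {
                if (b < 0 || b >= ALL_VALUES.length)
                    return NONE;
                return ALL_VALUES[b];
            }
        }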

        Barring anything unforeseen, is that about it before it's good enough for a merge?

        It may be a few days before I actually get around to this. I had some computer trouble recently, so I'm going to have to set up my build environment from scratch.

        Jonathan Ellis added a comment -

        If remote nodes have different config values for tracetype_repair_ttl, then remote traces will end up with a different ttl

        I'm fine with the limitation that weird things happen if you configure it wrong.

        If server versions are ever mixed, then an unknown trace type could be sent over the wire, causing an out-of-bounds array access during deserialization.

        We definitely need to support mixed-version repairs, so it sounds like this needs to be fixed.

        Activity notification for every local repair trace starts a new exponential backoff polling cycle more often than necessary... One thing I was thinking forward to with bitmasks was flexible subsetting of traces and notifications.

        I'm a big fan of not oversolving the problem. Let's start with a simple solution and evolve from there.

        Ben Chan added a comment -

        A few things to note about 5483-16:

        Just to emphasize: these are corner-cases and deoptimizations that may not ever matter in practice. Even if the decision is to not fix them, I wanted to note their existence.

        (1)
        If remote nodes have different config values for tracetype_repair_ttl, then remote traces will end up with a different ttl. Having the trace for a given session partially disappear may be surprising to the user.

        The approach 5483-15 took was to use the ttl configured on the node being repaired.

        (2)
        I don't know whether this will ever arise in practice, but I thought I'd note: If server versions are ever mixed, then an unknown trace type could be sent over the wire, causing an out-of-bounds array access during deserialization.

        The approach 5483-15 took was almost a no-op on the remote end (the tracing session was created, but nothing was traced). This is still not quite right. Better would be to log an error and not create the session at all.

        It might still possibly make sense to create the session anyway in the case of A->B->C tracestate propagation, but that might be stretching things.

        (3)
        Activity notification for every local repair trace starts a new exponential backoff polling cycle more often than necessary.

        Of course, I can't say whether this extra resource usage will actually matter in practice (only practice can tell).


        Okay, so what you meant was to change tracetype to be an actual Enum type (I can be a little dense sometimes, and code leaves far less room for ambiguity).

        One thing I was thinking forward to with bitmasks was flexible subsetting of traces and notifications. Tracing may not ever need trace levels or flexible subsetting, but I thought I'd note the idea for future reference.

        // pseudocode; please excuse the bad naming sense.
        REPAIR_0 = 0b001 << MIN_SHIFT;
        REPAIR_1 = 0b010 << MIN_SHIFT;
        REPAIR_2 = 0b100 << MIN_SHIFT;
        
        // read "zero and up", etc
        REPAIR_0U = REPAIR_0 | REPAIR_1 | REPAIR_2;
        REPAIR_1U =            REPAIR_1 | REPAIR_2;
        
        NOTIFY_0 = 0b001;
        NOTIFY_1 = 0b010;
        NOTIFY_2 = 0b100;
        NOTIFY_0U = ...;
        NOTIFY_1U = ...;
        
        val = config.get_val();
        
        trace(REPAIR_0, "only traced for REPAIR_0; superseded in the more verbose traces");
        trace(REPAIR_0U, "always traced for repair");
        trace(REPAIR_1U, "traced at REPAIR_1 and up");
        trace(REPAIR_2 | val, "only at REPAIR_2, or during the tracetypes specified in val");
        trace(REPAIR_1U & ~val, "only at REPAIR_1 and up, excluding the tracetypes specified in val");
        trace(REPAIR_2 | REPAIR_0, "for some reason, we don't want to trace this at REPAIR_1");
        
        // you could even have semi-orthogonal subsetting and levels
        trace(REPAIR_0U | NOTIFY_1, "notify at NOTIFY_1 only");
        trace(REPAIR_1U | NOTIFY_0U, "notify at all NOTIFY levels, but only when tracing REPAIR_1 and up");
        
        awaitNotification(NOTIFY_1U); // notify me for NOTIFY_1 and up.
        

        I believe log4j does something similar with what they call markers.

        Of course, you can do this with Enum and EnumSet too (though more verbosely).
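
        For example, a self-contained sketch of the same subsetting idea with EnumSet (all names here are made up for illustration):

        import java.util.Collections;
        import java.util.EnumSet;

        // hypothetical sketch: the bitmask idea above, expressed with an enum and EnumSet
        public class EnumSetTraceSketch
        {
            enum RepairLevel { REPAIR_0, REPAIR_1, REPAIR_2 }

            // "zero and up" / "one and up", the EnumSet analogues of REPAIR_0U / REPAIR_1U
            static final EnumSet<RepairLevel> REPAIR_0U = EnumSet.allOf(RepairLevel.class);
            static final EnumSet<RepairLevel> REPAIR_1U = EnumSet.of(RepairLevel.REPAIR_1, RepairLevel.REPAIR_2);

            // whatever subset this node is configured to trace
            static final EnumSet<RepairLevel> configured = EnumSet.of(RepairLevel.REPAIR_1);

            static void trace(EnumSet<RepairLevel> appliesTo, String message)
            {
                // trace the message only if it applies to at least one configured level
                if (!Collections.disjoint(appliesTo, configured))
                    System.out.println(message);
            }

            public static void main(String[] args)
            {
                trace(REPAIR_0U, "always traced for repair");
                trace(REPAIR_1U, "traced at REPAIR_1 and up");
                trace(EnumSet.of(RepairLevel.REPAIR_2), "only at REPAIR_2");
            }
        }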


        Aside from lyubent's nit (and possibly the issues in section 1), are there any other issues that prevent this patch from being minimally useful?

        Or even better, is there at least an alpha period where trunk is still allowed to make breaking changes to the API, protocol, etc.? It's hard to foresee everything.

        Lyuben Todorov added a comment -

        +1 but with a nit:
        Sending completed merkle tree to %s for %s.%s appears in the logs because of:
        Tracing.traceRepair("Sending completed merkle tree to %s for %s.%s", initiator, desc.keyspace, desc.columnFamily); in Validator#run.

        A switch can be made from %s to {} instead of using String.format for the sake of minimalism.
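
        In other words (assuming traceRepair formats {} placeholders the way the loggers do), something along the lines of:

        // before: the %s specifiers end up in the trace verbatim unless something calls String.format
        Tracing.traceRepair("Sending completed merkle tree to %s for %s.%s", initiator, desc.keyspace, desc.columnFamily);

        // after: logger-style placeholders, formatted by the tracing call itself
        Tracing.traceRepair("Sending completed merkle tree to {} for {}.{}", initiator, desc.keyspace, desc.columnFamily);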

        Jonathan Ellis added a comment -

        Posted v16 to https://github.com/jbellis/cassandra/commits/5483-16.

        Simplifies TraceState as follows:

        • TraceType is an enum
        • always use the tracetype the state was created for
        • don't try to overgeneralize the notification handle
        • encapsulate notification fields
        • ttl is uniquely determined by trace type
        Ben Chan added a comment -

        the exponential backoff is maintained, it's only defined more simply (as far as I could read you doubled the wait time every second timeout, which is what happens now also).

        Sorry, I was speed-reading the code; I saw wait(); (i.e. with no timeout), so I immediately knew there couldn't be a timeout, exponential or otherwise.

        I mostly ignored what wouldn't affect execution (stupid mental code optimizer).

        If you have wait(wait); and reset wait (the long) to minWaitMillis in if (hasNotifications) then that should be about right. I've attached a patch that does this (5483-v15-02-Hook-up-exponential-backoff-functionality.patch).

        I apologize for reading-code-while-distracted, thus causing me to go on about the exponential backoff. I'll try to learn from this experience.

        I liked your exponential backoff calculation code; I took the liberty of tweaking it to double exactly in 5483-v15-03-Exact-doubling-for-exponential-backoff.patch.
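
        For reference, the behaviour those two patches aim for, as a self-contained sketch (class and method names here are made up, not the actual TraceState/StorageService code):

        import java.util.function.BooleanSupplier;

        // illustrative sketch only: timed wait with exponential backoff, reset on activity
        public class BackoffPollerSketch
        {
            private static final long MIN_WAIT_MILLIS = 100;
            private static final long MAX_WAIT_MILLIS = 1000;

            private boolean hasNotifications;

            public synchronized void notifyActivity()
            {
                hasNotifications = true;
                notifyAll();
            }

            public void pollLoop(Runnable pollOnce, BooleanSupplier done) throws InterruptedException
            {
                long wait = MIN_WAIT_MILLIS;
                while (!done.getAsBoolean())
                {
                    boolean sawActivity;
                    synchronized (this)
                    {
                        if (!hasNotifications)
                            wait(wait);                                // timed wait, not wait()
                        sawActivity = hasNotifications;
                        hasNotifications = false;
                    }
                    wait = sawActivity ? MIN_WAIT_MILLIS                       // activity: reset backoff
                                       : Math.min(wait * 2, MAX_WAIT_MILLIS);  // quiet: double, up to a cap
                    pollOnce.run();                                    // e.g. check system_traces.events for new rows
                }
            }
        }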

        As the code is defined now we explicitly create a repair session and spawn a specific thread to process it, so we have a guarantee there's only one thread. It doesn't matter if the consumer is in its method when notifyAll is called; if it isn't it will receive the notification as soon as it next enters the method.

        I agree with you.

        When I said "only ever" before, I was talking about future code, not the code in its current state. Had I considered the alternate interpretation of "edge case in the current code", I would have clarified better, but there we are.

        Just to reiterate, I'm reasonably cognizant that for the code in its current state, only a single thread is accessing the notification for a given TraceState, and how this allows for a simpler implementation.

        All I was trying to say in my previous comment was that notification upon activity is a general capability, and in future code, should there ever be a use case for multiple threads accessing the same activity notification, then the implementation will have to be changed. (This also holds, should there ever be a use case that requires something other than exponential backoff.)

        Our differences turn out to be philosophical only, and I'm not all that attached to mine.

        I've already accepted 5483-v15.patch and have built my fixes for (some of) the Jun 10 comments on top of it. I need some more clarification before I can be sure what is required in order to resolve the remaining issues.

        Test code, modified and simplified because of the updated system_traces.* schemas (I may upload an updated ccm-repair-test if we have many more patch iterations).

        read -r JMXGET <<E
        /jmx_port/{p=\$2;} \
        /binary/{split(\$2,a,/\047/);h=a[2];} \
        END{printf("bin/nodetool -h %s -p %s\n",h,p,cmd);}
        E
        
        ccm_nodetool() {
          local N=$1
          shift
          $(ccm $N show | awk -F= "$JMXGET") "$@"
        }
        
        # git checkout 85956ae
        W=https://issues.apache.org/jira/secure/attachment
        for url in \
          $W/12649785/5483-v15.patch \
          $W/12650175/5483-v15-02-Hook-up-exponential-backoff-functionality.patch \
          $W/12650174/5483-v15-03-Exact-doubling-for-exponential-backoff.patch \
          $W/12650173/5483-v15-04-Re-add-old-StorageService-JMX-signatures.patch \
          $W/12650172/5483-v15-05-Move-command-column-to-system_traces.sessions.patch
        do
          { [ -e $(basename $url) ] || curl -sO $url; } && git apply $(basename $url)
        done &&
        ant clean && ant &&
        ./ccm-repair-test -kR &&
        ccm node1 stop &&
        ccm node1 clear &&
        ccm node1 start &&
        ccm_nodetool node1 repair -tr
        
        Benedict added a comment -

        Like I said, I only replicated functionality - the exponential backoff is maintained, it's only defined more simply (as far as I could read you doubled the wait time every second timeout, which is what happens now also). As the code is defined now we explicitly create a repair session and spawn a specific thread to process it, so we have a guarantee there's only one thread. It doesn't matter if the consumer is in its method when notifyAll is called; if it isn't it will receive the notification as soon as it next enters the method.

        When I have some time I'll take a look and see if there's an alternative approach.

        Ben Chan added a comment - - edited

        I'm sorry I can't think this through very deeply at the moment, so please allow a little slack if I say something incorrect. I'm writing this during a small window of time (where I can't do anything else) in a crisis I'm dealing with.

        Why is command part of events instead of sessions?

        It was a somewhat arbitrary design decision. I can move it over to sessions. The only real reason was historical (see the Mar 3 comment); it was a proof-of-concept that never got followed up on until just now.

        Also: should use an enum internally. Logging as string representation is fine.

        Just to be clear, you mean Java code should work with an enum, and the actual cql table column is fine as a string?

        The code actually does use an enum (of sorts; not an Enum proper), the traceType. The traceType is passed to Tracing#getCommandName to look up the String for command name.

        It makes people grumpy when we break JMX signatures. Can we add a new overload instead, preserving the old? This should cut down on some of the code churn in StorageService as well.

        I will admit that I didn't really consider the entire ecosystem of tools that use JMX to control Cassandra (though note that I did mention the JMX api change in a Mar 8 comment "... the server isn't going to work with old versions of nodetool. And a newer nodetool still won't work with older server versions.").

        It's a minor thing to get hung up on, but I'm not wild about all the work needed to propagate TTLs around. Is it really super important to persist repair traces much longer than queries? If so, what if we used a separate table and just allowed users to modify the default ttl? (The original trace code predates default TTLs, I think, or we would have made use of it there.)

        I guess the question is how many different types of things (bootstrap, repair, query, decommission, ... anything else?) might eventually end up being traced. If n is small, then having n tables may be fine.

        The logic was this (see Mar 1-3 discussion): Repair can take a long time, so 24 hours may be too short of a ttl.

        I recall reading about problematic repairs taking several days, which wouldn't mix well with a 24 hour ttl.

        Also have a nagging feeling that the notify/wait logic in StorageService + TraceState is more complex than necessary.

        If there is guaranteed to only ever be one consumer of notifications at a time, then the updated v15 logic seems fine. But if there are ever two threads polling the same TraceState, then there could be dropped notifications. The problem (I believe) is that all consumers may not be in a wait state at the moment when notifyAll is signalled. This means a notification could be missed, right? I'm not experienced in Java concurrency, and this isn't the best time for me to slowly think things through, so it's quite possible I'm wrong here.

        But it does seem reasonable there will only ever be one polling thread for any given tracing session, so the v15 code should work fine in that respect, as far as my cursory inspection goes.

        Note, however, that the polling in this case is a heuristic only: a notification means it's likely that an external trace happened somewhere around that time, plus or minus (as far as I know, there is no way in Cassandra to be notified of cf updates). This means the actual trace may only arrive after the notification, so for two notifications ~maxWait seconds apart, your logging of events might be maxWait seconds late:

        time  action                result
        ----  ------                ------
           0  receive notification  no events found
          10  event A
        1000  receive notification  sendNotification(event A)
        

        This is why I had an exponential backoff. Because I wanted to poll with high frequency for a likely event, polling less and less often as the notification recedes into the past. There are, of course, endless tweaks possible to this basic algorithm. It depends upon what you can assume about the likely time distribution of the events hinted at by the notification.

        time  action                result
        ----  ------                ------
           0  receive notification  no events found
           2  poll                  no events found
           4  poll                  no events found
           8  poll                  no events found
          10  event A
          16  poll                  sendNotification(event A)
          32  poll                  no events found
        1000  receive notification  no events found
        

        I should note I've made no attempt to corroborate this behaviour is sensible; I've only simplified it.

        Any feedback would be welcome. As I've said before, heuristics are messy. I talked about the reasoning behind my design decisions, and a possibility for an alternate implementation (with attendant tradeoffs) in a Mar 17 comment. I honestly thought I'd get more comments on it at the time, but it's possible the patch had already gotten into TL; DR territory even then.

        Okay my short break from reality is over. Time to panic.

        Edit: some more dead time (hurry up and wait). Thought some more about trace notifications.

        Benedict added a comment - - edited

        Agreed. Uploaded a patch that has identical behaviour, but is a bit easier to understand.

        I should note I've made no attempt to corroborate this behaviour is sensible; I've only simplified it

        Jonathan Ellis added a comment -

        Also have a nagging feeling that the notify/wait logic in StorageService + TraceState is more complex than necessary. WDYT Benedict?

        Jonathan Ellis added a comment -

        Some comments having applied the patch:

        1. Why is command part of events instead of sessions? Also: should use an enum internally. Logging as string representation is fine.
        2. It makes people grumpy when we break JMX signatures. Can we add a new overload instead, preserving the old? This should cut down on some of the code churn in StorageService as well.
        3. It's a minor thing to get hung up on, but I'm not wild about all the work needed to propagate TTLs around. Is it really super important to persist repair traces much longer than queries? If so, what if we used a separate table and just allowed users to modify the default ttl? (The original trace code predates default TTLs, I think, or we would have made use of it there.)
        Ben Chan added a comment -
        # git checkout a61ef51
        W=https://issues.apache.org/jira/secure/attachment
        for url in \
          $W/12648746/5483-v14-01-squashed.patch
        do
          { [ -e $(basename $url) ] || curl -sO $url; } && git apply $(basename $url)
        done &&
        ant clean && ant &&
        ./ccm-repair-test -kR &&
        ccm node1 stop &&
        ccm node1 clear &&
        ccm node1 start &&
        ./ccm-repair-test -rt
        

        No changes, just a (mostly) straightforward rebase onto a61ef51 (recent trunk). Some issues only show up under actual use, so I fully expect this will need some followup.

        An idea for the timely tracing of exceptions: you might be able to kludge most of it by modifying WrappedRunnable to trace exceptions. I haven't looked too closely at it, though.

        Jonathan Ellis added a comment -

        All right, I think we're at a point where this is certainly more useful than what we have now. The perfect is the enemy of the good and all that.

        Ben, can you post a branch or a squashed patch?

        Ben Chan added a comment -

        Is that because the traces are asynchronous? Because I think session 2 only starts after session 1 finishes.

        Here's my high level understanding.

        • Repair command #1 and #2, etc are serial.
        • Each repair session "Syncing range ..." is "technically" concurrent, since each is submitted to a ThreadPoolExecutor.
          • However, differencing is serialized, so if there is no streaming going on, you won't see very much overlap between the sessions, except at the beginning and end (which is exactly what we see with these simple tests).
          • Conversely, this means you will see much more interleaving when heavy streaming is going on.

        So at the very least, it might be good to eventually disambiguate the streaming portion.

        The easiest thing would be to make them non-redundant. Can we make the tracing "extra detail" on top of the normal ones instead of competing with them?

        I think it may be a conceptual block on my part. I tend to think of traces as a kind of profiling mechanism.

        • Most of the sendNotification calls in StorageService#createRepairTask consist of reporting any errors from the results of RepairFuture objects. So the timing on those is not really that useful for profiling. They're not really what I'd usually think of as a trace.
        • Some are request validation reporting before the repair proper even starts.
        • The rest are informational sendNotification messages which are redundant when tracing is active (this is the easy case).

        In pseudocode:

        if (some error #1 in repair request)
          sendNotification("NO #1!");
        if (some error #2 in repair request)
          sendNotification("NO #2!");
        for (r : ranges)
        {
          f = something.submitRepairSession(new RepairSession(r));
          futures.add(f);
          try
          {
            // this serializes the differencing part.
            f.waitForDifferencing()
          }
          catch (SomeException)
          {
            // handle, sendNotification
          }
        }
        
        try
        {
          for (f : futures)
          {
            r = f.get();
            sendNotification("done: %s", r);
          }
        }
        catch (ExecutionException ee)
        {
          // handle, sendNotification
        }
        catch (Exception e)
        {
          // handle, sendNotification
        }
        

        The main point being that I can't be sure that every single interesting exception is caught and traced in the thread where it's thrown, then rethrown. Most likely, this is not the case, and some exceptions are only reported at the StorageService#createRepairTask level. I believe most (?) cases are already caught and traced, though.

        So after going through all that, I'm thinking that the easiest thing is to just accept the possibility of redundancy and delayed reporting, and just trace all sendNotification in StorageService#createRepairTask (unless it's demonstrably redundant, or already being traced through some other mechanism).

        Jonathan Ellis added a comment -

        To make the point clearer, I've marked each sub-task graphically in the nodetool trace output below

        You're right that it's a bit tough to keep the sessions straight. Is that because the traces are asynchronous? Because I think session 2 only starts after session 1 finishes. Either way I think we can live with it.

        not all nodetool messages are rendered redundant by trace output. Since we can't just suppress all non-trace sendNotification, how can we unambiguously tell nodetool trace output from normal sendNotification messages?

        The easiest thing would be to make them non-redundant. Can we make the tracing "extra detail" on top of the normal ones instead of competing with them?

        Ben Chan added a comment - - edited

        May 14 formatting changes in 5483-v13-608fb03-May-14-trace-formatting-changes.patch (based off of commit 608fb03).

        I think the session log messages are still confusing, especially since we use the same term for "repairing a subrange" and "streaming data."

        Currently the "session" terminology is baked into the source code, in StreamSession.java and RepairSession.java. If the messages are changed to reflect different terminology, hopefully the source code can eventually be changed to match (fewer special cases to remember). Perhaps the best thing is to always qualify them, e.g. "stream session" and "repair session"?

        I don't actually see the session uuid being used in the logs except at start/finish.

        Sorry, that was another inadvertent mixing of nodetool messages and trace output. [2014-05-13 23:49:52,283] Repair session cd6aad80-db1a-11e3-b0e7-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished is not a trace, but a separate (pre-patch) sendNotification in StorageService.java. This message (and some of the error messages, I think) is redundant when combined with trace output. It should have been either one or the other, not both. In the trace proper, the session UUID only shows up at the start.

        But note: not all nodetool messages are rendered redundant by trace output. Since we can't just suppress all non-trace sendNotification, how can we unambiguously tell nodetool trace output from normal sendNotification messages? I'm currently leaning towards just marking all sendNotification trace output with a TRACE: tag.

        The repair session UUIDs used to be prepended to everything, but were removed in 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch. Without them, things are less verbose, but it's sometimes hard to unambiguously follow traces for concurrent repair sessions. To make the point clearer, I've marked each sub-task graphically in the nodetool trace output below (I've cross-checked this with the logs, which do retain the UUIDs). If you cover up the left side, it's harder to figure out which trace goes with which sub-task. Real-world repair traces will probably be even more confusing.

        Note: indentation here does not denote nesting; the column roughly indicates task identity, though I reuse columns when it's not ambiguous.

        1         [2014-05-15 11:31:37,839] Starting repair command #1, repairing 3 ranges for s1.users (seq=true, full=true)
          x       [2014-05-15 11:31:37,922] Syncing range (-3074457345618258603,3074457345618258602]
          x       [2014-05-15 11:31:38,108] Requesting merkle trees for users from [/127.0.0.2, /127.0.0.3, /127.0.0.1]
          x       [2014-05-15 11:31:38,833] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1.users
          x       [2014-05-15 11:31:39,953] Received merkle tree for users from /127.0.0.2
          x       [2014-05-15 11:31:40,939] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1.users
          x       [2014-05-15 11:31:41,279] Received merkle tree for users from /127.0.0.3
          x       [2014-05-15 11:31:42,632] Received merkle tree for users from /127.0.0.1
            x     [2014-05-15 11:31:42,671] Syncing range (-9223372036854775808,-3074457345618258603]
            x     [2014-05-15 11:31:42,766] Requesting merkle trees for users from [/127.0.0.2, /127.0.0.3, /127.0.0.1]
          x       [2014-05-15 11:31:42,905] Endpoint /127.0.0.2 is consistent with /127.0.0.3 for users
          x       [2014-05-15 11:31:43,044] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for users
          x       [2014-05-15 11:31:43,047] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for users
          x       [2014-05-15 11:31:43,084] Completed sync of range (-3074457345618258603,3074457345618258602]
            x     [2014-05-15 11:31:43,251] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1.users
            x     [2014-05-15 11:31:43,422] Received merkle tree for users from /127.0.0.2
            x     [2014-05-15 11:31:44,495] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1.users
            x     [2014-05-15 11:31:44,637] Received merkle tree for users from /127.0.0.3
            x     [2014-05-15 11:31:45,474] Received merkle tree for users from /127.0.0.1
              x   [2014-05-15 11:31:45,494] Syncing range (3074457345618258602,-9223372036854775808]
            x     [2014-05-15 11:31:45,499] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for users
            x     [2014-05-15 11:31:45,520] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for users
            x     [2014-05-15 11:31:45,544] Endpoint /127.0.0.2 is consistent with /127.0.0.3 for users
            x     [2014-05-15 11:31:45,564] Completed sync of range (-9223372036854775808,-3074457345618258603]
              x   [2014-05-15 11:31:45,581] Requesting merkle trees for users from [/127.0.0.2, /127.0.0.3, /127.0.0.1]
              x   [2014-05-15 11:31:46,214] Received merkle tree for users from /127.0.0.2
              x   [2014-05-15 11:31:46,233] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1.users
              x   [2014-05-15 11:31:46,459] Received merkle tree for users from /127.0.0.3
              x   [2014-05-15 11:31:46,472] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1.users
              x   [2014-05-15 11:31:46,545] Received merkle tree for users from /127.0.0.1
              x   [2014-05-15 11:31:46,574] Endpoint /127.0.0.3 has 1 range(s) out of sync with /127.0.0.1 for users
              x   [2014-05-15 11:31:46,575] Endpoint /127.0.0.2 is consistent with /127.0.0.3 for users
              x   [2014-05-15 11:31:46,581] Endpoint /127.0.0.2 has 1 range(s) out of sync with /127.0.0.1 for users
              x   [2014-05-15 11:31:46,601] Performing streaming repair of 1 ranges with /127.0.0.2
              x   [2014-05-15 11:31:46,601] Performing streaming repair of 1 ranges with /127.0.0.3
              x   [2014-05-15 11:31:46,910] Streaming session with /127.0.0.2 prepared.
              x   [2014-05-15 11:31:46,949] Streaming session with /127.0.0.3 prepared.
              x   [2014-05-15 11:31:46,991] 87/87 bytes (100%) received from idx:0/127.0.0.3
              x   [2014-05-15 11:31:47,004] 87/87 bytes (100%) received from idx:0/127.0.0.2
              x   [2014-05-15 11:31:47,053] Streaming session with /127.0.0.3 completed successfully.
              x   [2014-05-15 11:31:47,112] Streaming session with /127.0.0.2 completed successfully.
              x   [2014-05-15 11:31:47,131] Completed sync of range (3074457345618258602,-9223372036854775808]
        1         [2014-05-15 11:31:47,133] Repair command #1 finished
        2         [2014-05-15 11:31:47,180] Starting repair command #2, repairing 2 ranges for system_traces.sessions, system_traces.events (seq=true, full=true)
          x       [2014-05-15 11:31:47,189] Syncing range (-3074457345618258603,3074457345618258602]
          x       [2014-05-15 11:31:47,260] Requesting merkle trees for sessions from [/127.0.0.3, /127.0.0.1]
          x       [2014-05-15 11:31:47,344] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for system_traces.sessions
          x       [2014-05-15 11:31:47,400] Received merkle tree for sessions from /127.0.0.3
          x       [2014-05-15 11:31:47,577] Received merkle tree for sessions from /127.0.0.1
          x       [2014-05-15 11:31:47,617] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for sessions
          x       [2014-05-15 11:31:47,743] Requesting merkle trees for events from [/127.0.0.3, /127.0.0.1]
          x       [2014-05-15 11:31:48,010] Received merkle tree for events from /127.0.0.3
          x       [2014-05-15 11:31:48,027] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for system_traces.events
          x       [2014-05-15 11:31:48,245] Received merkle tree for events from /127.0.0.1
            x     [2014-05-15 11:31:48,253] Syncing range (3074457345618258602,-9223372036854775808]
            x     [2014-05-15 11:31:48,294] Requesting merkle trees for sessions from [/127.0.0.2, /127.0.0.1]
          x       [2014-05-15 11:31:48,313] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for events
          x       [2014-05-15 11:31:48,322] Completed sync of range (-3074457345618258603,3074457345618258602]
            x     [2014-05-15 11:31:48,372] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for system_traces.sessions
            x     [2014-05-15 11:31:48,424] Received merkle tree for sessions from /127.0.0.2
            x     [2014-05-15 11:31:48,582] Received merkle tree for sessions from /127.0.0.1
            x     [2014-05-15 11:31:48,613] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for sessions
            x     [2014-05-15 11:31:48,641] Requesting merkle trees for events from [/127.0.0.2, /127.0.0.1]
            x     [2014-05-15 11:31:48,765] Received merkle tree for events from /127.0.0.2
            x     [2014-05-15 11:31:48,792] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for system_traces.events
            x     [2014-05-15 11:31:48,942] Received merkle tree for events from /127.0.0.1
            x     [2014-05-15 11:31:48,973] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for events
            x     [2014-05-15 11:31:48,985] Completed sync of range (3074457345618258602,-9223372036854775808]
        2         [2014-05-15 11:31:48,990] Repair command #2 finished
        

        Edit: fixed patch link. Also note: I forgot that I rebased lyubent's fa4089e onto 33939ca, making its hash 608fb03. Obviously, you won't find that hash in any public Cassandra repository. This shouldn't affect patch application, though. Sorry for any confusion this may have caused.

        Jonathan Ellis added a comment -

        I think the session log messages are still confusing, especially since we use the same term for "repairing a subrange" and "streaming data."

        I don't actually see the session uuid being used in the logs except at start/finish. Can we simplify from

        [2014-05-14 17:58:53,768] New session 442daf50-db78-11e3-bad5-c3afa2948904 will sync range (-3074457345618258603,3074457345618258602]
        
        [2014-05-14 17:58:56,309] Session completed successfully
        [2014-05-14 17:58:56,235] Repair session 442daf50-db78-11e3-bad5-c3afa2948904 for range (-3074457345618258603,3074457345618258602] finished
        

        to

        [2014-05-14 17:58:53,768] Syncing range (-3074457345618258603,3074457345618258602]
        ...
        [2014-05-14 17:58:56,235] Completed sync of range (-3074457345618258603,3074457345618258602]
        

        Bikeshedding this just a little more:

        Endpoints /127.0.0.2 and /127.0.0.1 are consistent for events -> Endpoint /127.0.0.2 is consistent with /127.0.0.1 for events

        Jonathan Ellis added a comment -

        Pretty sure we always do this comparison pairwise? So there will always only be two IPs in it.

        Ah, right. Let's leave that alone then.

        Lyuben Todorov added a comment -

        Why do we log Repair command #1 finished with no merkle trees requested for db.tweet? Is it because all sstables are already repaired? If so we should log that.

        Ah, I used a keyspace with RF 1. Here's output from a 3-node cluster with the db keyspace having RF = 3, after node 1 was nuked and brought back. The output has a few changes:

        1. The system ks is no longer displayed, as it's not part of the repair.
        2. Removed "keyspace" from "Starting repair command #1, repairing 1 ranges for keyspace db.{tweet} (seq=true, full=true)"
        3. Reformatted "Requesting merkle trees for sessions (to [/127.0.0.3, /127.0.0.1])"

        Changes for the above are on 43eaaae..fa4089e

        Jeremiah Jordan added a comment -

        How does Endpoints /127.0.0.2 and /127.0.0.1 are consistent for events scale up to more replicas? Should we switch to using [..] notation instead?

        Pretty sure we always do this comparison pairwise? So there will always only be two IPs in it.

        Ben Chan added a comment -

        I seem to be having problems with JIRA email notifications. May 12 arrived fine, May 7 never arrived, May 8 arrived on May 13, but was dated May 10. Moving on...

        Just skip system keyspace entirely and save the logspam (use Keyspace.nonSystem instead of Keyspace.all)

        This patch has grown enough parts to be a little unwieldy. Just to be clear, the output from https://gist.github.com/lyubent/bfc133fe92ef1afb9dd4 is the verbose output from nodetool, which means there is some extra output aside from the traces themselves. (TODO to self: make the nodetool verbose output optional.) That particular message comes from src/java/org/apache/cassandra/tools/NodeProbe.java, in a part of the code untouched by this patch. I can go ahead and nuke that particular message for the system keyspace.

        How does Endpoints /127.0.0.2 and /127.0.0.1 are consistent for events scale up to more replicas? Should we switch to using [..] notation instead?

        There are n * (n - 1) / 2 pairwise comparisons for n replicas, so up to that many "are consistent" messages (e.g. 3 for the 3-replica traces below). I haven't dug deep enough into the code to be certain, but on the face of it, it seems like there should be some (possibly not-simple) way to reduce this to O(n * log(n)). Enough speculation, though.

        One edge case for the proposed notation would be a consistency partition:

        A == B == C
        A != D
        D == E == F
        
        =>
        
        # We need a separate message for each partition.
        Endpoints [A, B, C] are consistent for events
        Endpoints [D, E, F] are consistent for events
        

        Even with the edge case, it seems messy, but doable. You do lose trace timing information on the calculation of individual differences (the consistent ones, at least). On the other hand, comparing matching merkle trees should be a consistently fast operation, so you're probably not missing out on too much information.
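
        To make the grouping concrete, here is a rough, purely illustrative Java sketch (not code from any of the attached patches; the class and method names are invented) that collapses the pairwise "is consistent with" results into partitions with a small union-find, so each partition could be reported once using the proposed [..] notation:

        import java.util.*;

        // Illustrative sketch only: group endpoints whose merkle trees matched pairwise
        // into consistency partitions, so each partition can be reported once.
        public class ConsistencyPartitions
        {
            private final Map<String, String> parent = new HashMap<>();

            private String find(String x)
            {
                parent.putIfAbsent(x, x);
                String root = parent.get(x);
                if (!root.equals(x))
                {
                    root = find(root);
                    parent.put(x, root); // path compression
                }
                return root;
            }

            public void markConsistent(String a, String b)
            {
                parent.put(find(a), find(b));
            }

            public Collection<Set<String>> partitions()
            {
                Map<String, Set<String>> groups = new HashMap<>();
                for (String endpoint : parent.keySet())
                    groups.computeIfAbsent(find(endpoint), k -> new TreeSet<>()).add(endpoint);
                return groups.values();
            }

            public static void main(String[] args)
            {
                ConsistencyPartitions p = new ConsistencyPartitions();
                p.markConsistent("A", "B");
                p.markConsistent("B", "C");
                p.markConsistent("D", "E");
                p.markConsistent("E", "F");
                // Prints one line per partition: [A, B, C] and [D, E, F]
                for (Set<String> group : p.partitions())
                    System.out.println("Endpoints " + group + " are consistent for events");
            }
        }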

        I'm a little lost in the commands and sessions, e.g. does [2014-05-08 23:27:45,368] Session completed successfully refer to session 3617e3f0-d6ef-11e3-a493-7d438369d7fc or 36a49390-d6ef-11e3-a493-7d438369d7fc? Is there exactly one session per command? If so let's merge the "starting repair command" + "new session" output, and the "completed" + "finished".

        Each repair command seems to consist of multiple repair sessions (one per range). The sessions go semi-sequentially; there's a semi-random overlap between the end of one session and the start of another, like so (using small integers instead of UUIDs, and some labels on the left for readability):

        [command #1           ] Starting repair command #1
        [command #1, session 1] New session 1 will sync range ...
        [command #1, session 1] Requesting merkle tree for ...
        [command #1, session 1] Received merkle tree for ...
        [command #1, session 2] New session 2 will sync range ...
        [command #1, session 2] Requesting merkle tree for ...
        [command #1, session 1] Endpoints ... consistent.
        [command #1, session 1] Session 1 completed successfully
        [command #1, session 2] Received merkle tree for ...
        [command #1, session 2] Endpoints ... consistent.
        [command #1, session 3] New session 3 will sync range ...
        [command #1, session 2] Session 2 completed successfully
        [command #1, session 3] Requesting merkle tree for ...
        [command #1, session 3] Received merkle tree for ...
        [command #1, session 3] Endpoints ... consistent.
        [command #1, session 3] Session 3 completed successfully
        [command #1           ] Repair command #1 finished
        

        Most of the time it's obvious from context, but during that overlap, having the repair session UUID helps to disambiguate. I suspect the overlap is even greater (and more confusing) when you have heavy streaming.

        Why do we log Repair command #1 finished with no merkle trees requested for db.tweet? Is it because all sstables are already repaired? If so we should log that.

        I've never encountered a trace like that in my testing. I always seem to get merkle trees exchanged (see the log below), even if no streaming is needed. I'm hoping lyubent can provide enough information for me to be able to recreate this situation locally.

        Does this actually show any streaming? If so I'm missing it.

        lyubent's sample run didn't need streaming, so there was no streaming to trace. Here's how I usually test streaming (using ccm-repair-test and yukim's method):

        ./ccm-repair-test -kR &&
        ccm node1 stop &&
        ccm node1 clear &&
        ccm node1 start &&
        ./ccm-repair-test -rt
        

        Note that this sample run uses the codebase as of 5483-v12-02-cassandra-yaml-ttl-doc.patch; I haven't got around to doing the May 12 changes yet.

        I should also warn you (if recent feedback is any guide) that the trace messages for streaming are completely unpolished, though I think that if I just remove the superlong filenames entirely, they might be close to acceptable.

        [2014-05-13 23:49:42,561] Nothing to repair for keyspace 'system'
        [2014-05-13 23:49:42,624] Starting repair command #1, repairing 3 ranges for keyspace s1.{users} (seq=true, full=true)
        [2014-05-13 23:49:42,730] New session c84ae130-db1a-11e3-b0e7-f94811c7b860 will sync range (-3074457345618258603,3074457345618258602]
        [2014-05-13 23:49:42,973] Requesting merkle trees for users (to [/127.0.0.2, /127.0.0.3, /127.0.0.1])
        [2014-05-13 23:49:44,058] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1/users
        [2014-05-13 23:49:44,616] Received merkle tree for users from /127.0.0.2
        [2014-05-13 23:49:45,565] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1/users
        [2014-05-13 23:49:45,706] Received merkle tree for users from /127.0.0.3
        [2014-05-13 23:49:46,378] Received merkle tree for users from /127.0.0.1
        [2014-05-13 23:49:46,412] New session ca7d5190-db1a-11e3-b0e7-f94811c7b860 will sync range (-9223372036854775808,-3074457345618258603]
        [2014-05-13 23:49:46,483] Requesting merkle trees for users (to [/127.0.0.2, /127.0.0.3, /127.0.0.1])
        [2014-05-13 23:49:46,674] Endpoints /127.0.0.2 and /127.0.0.3 are consistent for users
        [2014-05-13 23:49:46,774] Endpoints /127.0.0.3 and /127.0.0.1 are consistent for users
        [2014-05-13 23:49:46,781] Endpoints /127.0.0.2 and /127.0.0.1 are consistent for users
        [2014-05-13 23:49:46,817] Session completed successfully
        [2014-05-13 23:49:47,046] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1/users
        [2014-05-13 23:49:47,116] Received merkle tree for users from /127.0.0.2
        [2014-05-13 23:49:47,528] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1/users
        [2014-05-13 23:49:47,668] Received merkle tree for users from /127.0.0.3
        [2014-05-13 23:49:48,149] Received merkle tree for users from /127.0.0.1
        [2014-05-13 23:49:48,180] Endpoints /127.0.0.2 and /127.0.0.1 are consistent for users
        [2014-05-13 23:49:48,181] New session cb8ac9f0-db1a-11e3-b0e7-f94811c7b860 will sync range (3074457345618258602,-9223372036854775808]
        [2014-05-13 23:49:48,184] Endpoints /127.0.0.2 and /127.0.0.3 are consistent for users
        [2014-05-13 23:49:48,185] Endpoints /127.0.0.3 and /127.0.0.1 are consistent for users
        [2014-05-13 23:49:48,209] Session completed successfully
        [2014-05-13 23:49:48,258] Requesting merkle trees for users (to [/127.0.0.2, /127.0.0.3, /127.0.0.1])
        [2014-05-13 23:49:48,525] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1/users
        [2014-05-13 23:49:48,531] Received merkle tree for users from /127.0.0.2
        [2014-05-13 23:49:48,610] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1/users
        [2014-05-13 23:49:48,653] Received merkle tree for users from /127.0.0.3
        [2014-05-13 23:49:48,750] Received merkle tree for users from /127.0.0.1
        [2014-05-13 23:49:48,776] Repair session c84ae130-db1a-11e3-b0e7-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
        [2014-05-13 23:49:48,777] Repair session ca7d5190-db1a-11e3-b0e7-f94811c7b860 for range (-9223372036854775808,-3074457345618258603] finished
        [2014-05-13 23:49:48,786] Endpoints /127.0.0.2 and /127.0.0.3 are consistent for users
        [2014-05-13 23:49:48,794] Endpoints /127.0.0.2 and /127.0.0.1 have 1 range(s) out of sync for users
        [2014-05-13 23:49:48,798] Endpoints /127.0.0.3 and /127.0.0.1 have 1 range(s) out of sync for users
        [2014-05-13 23:49:48,832] Performing streaming repair of 1 ranges with /127.0.0.3
        [2014-05-13 23:49:48,850] Performing streaming repair of 1 ranges with /127.0.0.2
        [2014-05-13 23:49:49,216] Streaming session with /127.0.0.2 prepared.
        [2014-05-13 23:49:49,221] Streaming session with /127.0.0.3 prepared.
        [2014-05-13 23:49:49,292] /home/usrbincc/.ccm/test-5483-7Fz/node1/data/s1/users-b78c3a60db1a11e384aff94811c7b860/s1-users-tmp-ka-2-Data.db 87/87 bytes(100%) received from idx:0/127.0.0.2
        [2014-05-13 23:49:49,306] /home/usrbincc/.ccm/test-5483-7Fz/node1/data/s1/users-b78c3a60db1a11e384aff94811c7b860/s1-users-tmp-ka-1-Data.db 87/87 bytes(100%) received from idx:0/127.0.0.3
        [2014-05-13 23:49:49,355] Streaming session with /127.0.0.2 completed successfully.
        [2014-05-13 23:49:49,404] Streaming session with /127.0.0.3 completed successfully.
        [2014-05-13 23:49:49,443] Session completed successfully
        [2014-05-13 23:49:49,445] Repair session cb8ac9f0-db1a-11e3-b0e7-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
        [2014-05-13 23:49:49,445] Repair command #1 finished
        [2014-05-13 23:49:49,531] Starting repair command #2, repairing 2 ranges for keyspace system_traces.{events, sessions} (seq=true, full=true)
        [2014-05-13 23:49:49,541] New session cc5aeb30-db1a-11e3-b0e7-f94811c7b860 will sync range (-3074457345618258603,3074457345618258602]
        [2014-05-13 23:49:49,650] Requesting merkle trees for events (to [/127.0.0.3, /127.0.0.1])
        [2014-05-13 23:49:49,920] Received merkle tree for events from /127.0.0.3
        [2014-05-13 23:49:49,933] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for system_traces/events
        [2014-05-13 23:49:50,190] Received merkle tree for events from /127.0.0.1
        [2014-05-13 23:49:50,253] Endpoints /127.0.0.3 and /127.0.0.1 have 1 range(s) out of sync for events
        [2014-05-13 23:49:50,267] Performing streaming repair of 1 ranges with /127.0.0.3
        [2014-05-13 23:49:50,311] Requesting merkle trees for sessions (to [/127.0.0.3, /127.0.0.1])
        [2014-05-13 23:49:50,515] Streaming session with /127.0.0.3 prepared.
        [2014-05-13 23:49:50,575] /home/usrbincc/.ccm/test-5483-7Fz/node1/data/system_traces/events-8826e8e9e16a372887533bc1fc713c25/system_traces-events-tmp-ka-3-Data.db 1054/1054 bytes(100%) received from idx:0/127.0.0.3
        [2014-05-13 23:49:50,599] /home/usrbincc/.ccm/test-5483-7Fz/node1/flush/system_traces/events-8826e8e9e16a372887533bc1fc713c25/system_traces-events-ka-2-Data.db 815/815 bytes(100%) sent to idx:0/127.0.0.3
        [2014-05-13 23:49:50,652] /home/usrbincc/.ccm/test-5483-7Fz/node1/data/system_traces/events-8826e8e9e16a372887533bc1fc713c25/system_traces-events-tmp-ka-4-Data.db 4254/4254 bytes(100%) received from idx:0/127.0.0.3
        [2014-05-13 23:49:50,669] /home/usrbincc/.ccm/test-5483-7Fz/node1/flush/system_traces/events-8826e8e9e16a372887533bc1fc713c25/system_traces-events-ka-1-Data.db 4335/4335 bytes(100%) sent to idx:0/127.0.0.3
        [2014-05-13 23:49:50,776] Received merkle tree for sessions from /127.0.0.3
        [2014-05-13 23:49:50,876] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for system_traces/sessions
        [2014-05-13 23:49:50,920] Streaming session with /127.0.0.3 completed successfully.
        [2014-05-13 23:49:51,312] Received merkle tree for sessions from /127.0.0.1
        [2014-05-13 23:49:51,321] New session cd6aad80-db1a-11e3-b0e7-f94811c7b860 will sync range (3074457345618258602,-9223372036854775808]
        [2014-05-13 23:49:51,339] Endpoints /127.0.0.3 and /127.0.0.1 are consistent for sessions
        [2014-05-13 23:49:51,425] Session completed successfully
        [2014-05-13 23:49:51,442] Requesting merkle trees for events (to [/127.0.0.2, /127.0.0.1])
        [2014-05-13 23:49:51,613] Received merkle tree for events from /127.0.0.2
        [2014-05-13 23:49:51,627] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for system_traces/events
        [2014-05-13 23:49:51,734] Received merkle tree for events from /127.0.0.1
        [2014-05-13 23:49:51,779] Requesting merkle trees for sessions (to [/127.0.0.2, /127.0.0.1])
        [2014-05-13 23:49:51,803] Endpoints /127.0.0.2 and /127.0.0.1 are consistent for events
        [2014-05-13 23:49:51,935] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for system_traces/sessions
        [2014-05-13 23:49:51,972] Received merkle tree for sessions from /127.0.0.2
        [2014-05-13 23:49:52,243] Received merkle tree for sessions from /127.0.0.1
        [2014-05-13 23:49:52,258] Repair session cc5aeb30-db1a-11e3-b0e7-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
        [2014-05-13 23:49:52,268] Endpoints /127.0.0.2 and /127.0.0.1 are consistent for sessions
        [2014-05-13 23:49:52,279] Session completed successfully
        [2014-05-13 23:49:52,283] Repair session cd6aad80-db1a-11e3-b0e7-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
        [2014-05-13 23:49:52,283] Repair command #2 finished
        
        Jonathan Ellis added a comment -

        Suggestions based on the rest:

        • Just skip system keyspace entirely and save the logspam (use Keyspace.nonSystem instead of Keyspace.all)
        • Be consistent about using ks.cf notation (instead of ks/cf)
        • Word "keyspace" is redundant in repair command log: {{Starting repair command #1, repairing 1 ranges for keyspace db. {tweet}

          (seq=true, full=true)}}

        • Reformat Requesting merkle trees for sessions (to [/127.0.0.3, /127.0.0.1]) to Requesting merkle trees for sessions from [/127.0.0.3, /127.0.0.1]
        • How does Endpoints /127.0.0.2 and /127.0.0.1 are consistent for events scale up to more replicas? Should we switch to using [..] notation instead?
        • I'm a little lost in the commands and sessions, e.g. does [2014-05-08 23:27:45,368] Session completed successfully refer to session 3617e3f0-d6ef-11e3-a493-7d438369d7fc or 36a49390-d6ef-11e3-a493-7d438369d7fc? Is there exactly one session per command? If so let's merge the "starting repair command" + "new session" output, and the "completed" + "finished".
        • Why do we log Repair command #1 finished with no merkle trees requested for db.tweet? Is it because all sstables are already repaired? If so we should log that.
        Jonathan Ellis added a comment -

        Does this actually show any streaming? If so I'm missing it.

        Lyuben Todorov added a comment -

        Jonathan Ellis: Output available here, with vnode output at the top and non-vnode output at the bottom.

        Jonathan Ellis added a comment -

        Can you post an example trace session w/ the new patch?

        Ben Chan added a comment -

        5483-v06-04-Allow-tracing-ttl-to-be-configured.patch implemented a (very simple) configurable TTL, but omitted the cassandra.yaml documentation. Fixed (with possibly cryptic wording) in 5483-v12-02-cassandra-yaml-ttl-doc.patch.

        # sanity check for working ttl configuration.
        ccm create $(mktemp -u '5483-XXX') &&
        ccm populate -n 1 &&
        ccm updateconf 'tracetype_query_ttl: 12345678' &&
        ccm updateconf 'tracetype_repair_ttl: 87654321' &&
        ccm start &&
        ccm node1 showlog | grep --color=auto '[a-z_]*_ttl\>' &&
        ccm stop
        
        Lyuben Todorov added a comment -

        I'm attaching v11-nits with these nits fixed:

        • Removed unused imports in o.a.c.net.MessageOut, o.a.c.repair.RepairSession, o.a.c.repair.StreamingRepairTask, o.a.c.service.SS, o.a.c.tracing.TraceState.
        • Removed "redundant" null initialisation in SS#createRepairTask

        The patch LGTM; the last thing worth thinking about is whether the hardcoded tracing TTLs should be made flexible by adding an option in cassandra.yaml. If not, then +1.

        Ben Chan added a comment -

        Patch updated to work with current trunk (08d5f26).

        # git checkout 08d5f26
        W=https://issues.apache.org/jira/secure/attachment
        for url in \
          $W/12643077/5483-v11-01-squashed.patch
        do
          { [ -e $(basename $url) ] || curl -sO $url; } && git apply $(basename $url)
        done &&
        ant clean && ant &&
        ./ccm-repair-test -kR &&
        ccm node1 stop &&
        ccm node1 clear &&
        ccm node1 start &&
        ./ccm-repair-test -rt
        
        Ben Chan added a comment - - edited

        I made some additional changes. Everything is included in 5483-v10-rebased-and-squashed-471f5cc.patch, but I attached 5483-v10-17-minor-bugfixes-and-changes.patch to make it more convenient to review.

        Repair fails without 7000-2.1-v2.txt so I've included that patch in the test code.

        Overview:

        • Limit exponential backoff.
        • Handle the case where traceType is negative.
        • Reimplement log2 in BitUtil style (a rough sketch of that style follows this list).
        • Forgot to add a trace parameter.
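
        (For reference, this is roughly the shape "BitUtil style" refers to; illustration only, not the exact code from the patch, and the class name is invented:)

        // Illustration only (not the exact code from the patch): floor(log2(x)) for a
        // positive int, computed with shift-and-mask steps in the BitUtil style.
        public class Log2Sketch
        {
            static int log2(int x)
            {
                int r = 0;
                if ((x & 0xFFFF0000) != 0) { x >>>= 16; r += 16; }
                if ((x & 0x0000FF00) != 0) { x >>>= 8;  r += 8; }
                if ((x & 0x000000F0) != 0) { x >>>= 4;  r += 4; }
                if ((x & 0x0000000C) != 0) { x >>>= 2;  r += 2; }
                if ((x & 0x00000002) != 0) { r += 1; }
                return r;
            }

            public static void main(String[] args)
            {
                System.out.println(log2(1));    // 0
                System.out.println(log2(1024)); // 10
            }
        }
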
        # rebased against trunk @ 471f5cc, tested against cbb3c8f
        # git checkout cbb3c8f
        W=https://issues.apache.org/jira/secure/attachment
        for url in \
          $W/12639821/5483-v10-rebased-and-squashed-471f5cc.patch
        do
          { [ -e $(basename $url) ] || curl -sO $url; } && git apply $(basename $url)
        done &&
        ant clean && ant &&
        ./ccm-repair-test -kR &&
        ccm node1 stop &&
        ccm node1 clear &&
        ccm node1 start &&
        ./ccm-repair-test -rt
        

        Edit: 7000 landed.

        Lyuben Todorov added a comment -

        Ben Chan Can you post the squashed rebase? Would really appreciate it!

        Ben Chan added a comment -

        Rebased, but can't test properly, because the trunk as of 64bc45849fd2 (my git rebase base) is throwing an error:

        java.lang.ExceptionInInitializerError: null
                at org.apache.cassandra.config.KSMetaData.systemKeyspace(KSMetaData.java:92) ~[main/:na]
                at org.apache.cassandra.config.DatabaseDescriptor.applyConfig(DatabaseDescriptor.java:545) ~[main/:na]
                at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:128) ~[main/:na]
                at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:109) [main/:na]
                at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:454) [main/:na]
                at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:543) [main/:na]
        Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.SyntaxException: line 1:127 mismatched character 'T' expecting set null
                at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:518) ~[main/:na]
                at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:501) ~[main/:na]
                at org.apache.cassandra.config.CFMetaData.<clinit>(CFMetaData.java:99) ~[main/:na]
                ... 6 common frames omitted
        Caused by: org.apache.cassandra.exceptions.SyntaxException: line 1:127 mismatched character 'T' expecting set null
                at org.apache.cassandra.cql3.CqlLexer.throwLastRecognitionError(CqlLexer.java:201) ~[main/:na]
                at org.apache.cassandra.cql3.QueryProcessor.parseStatement(QueryProcessor.java:348) ~[main/:na]
                at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:509) ~[main/:na]
                ... 8 common frames omitted
        

        Things are currently hectic, so I don't know when I'll have the chance to take a proper look at it. In the meantime I'm doing a git bisect run, so maybe something obvious will turn up.

        Jonathan Ellis added a comment - - edited

        Can you both rebase and squash, Ben Chan?

        Lyuben Todorov added a comment -

        v9 has indeed fixed the high CPU usage. I pushed v8 and v9 to d9abbfc62319935ef42d4685037f80b292e03e0f with only the minor change of giving a name to the thread that polls for nodetool output. The thread name is currently RepairTracePolling, but I'm open to suggestions.

        Ben Chan, I think we should rebase to the latest trunk, but the patch is looking good.

        Ben Chan added a comment - - edited

        Ouch. After running a before-and-after test, I'm 99% sure this was the problem. There was some obviously wrong code in waitActivity (an older version used 0 instead of -1 to signify "done"; I apparently forgot to update everything when I changed this).
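
        (Purely to illustrate the kind of sentinel mismatch described; this is hypothetical code, not the actual waitActivity implementation:)

        // Hypothetical sketch: the "done" marker changed from 0 to -1, but one check
        // still compared against the old value, so completion was never detected.
        public class WaitActivitySketch
        {
            static final int DONE = -1;        // older revisions used 0 here

            static boolean finished(int activityCode)
            {
                return activityCode == 0;      // bug: should be activityCode == DONE
            }

            public static void main(String[] args)
            {
                System.out.println(finished(DONE)); // prints false -- the hang in a nutshell
            }
        }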

        Sorry about removing the previous patch. It didn't have the correct git diff -p parameters.

        For convenience:

        W=https://issues.apache.org/jira/secure/attachment
        for url in \
          $W/12637720/5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch
        do [ -e $(basename $url) ] || curl -sO $url; done &&
        git apply 5483-v09-*.patch &&
        ant clean && ant
        

        Here's what I used to test with; I get slower and slower repairs, with a hang on the 5th repair with the "before" code, and consistent 10-second repairs with the "after" code.

        cat > ccm-nodetool <<"EE"
        #!/bin/sh
        
        # ccm doesn't let us call nodetool with options, but we still need to get the
        # host and port config from it.
        read -r JMXGET <<E
        /jmx_port/{p=\$2;} \
        /binary/{split(\$2,a,/\047/);h=a[2];} \
        END{printf("bin/nodetool -h %s -p %s\n",h,p);}
        E
        
        NODETOOL=$(ccm $1 show | awk -F= "$JMXGET")
        shift
        $NODETOOL "$@"
        EE
        
        chmod +x ccm-nodetool
        
        for x in $(seq 3); do 
          for y in $(seq 2); do
            echo repair node$x \#$y
            ./ccm-nodetool node$x repair -tr
          done
        done
        

        Edit: minor awk code cleanup, properly nest heredocs.

        Lyuben Todorov added a comment -

        The new set of patches (v8.11 to v8.15) causes high CPU usage which doesn't dissipate. This happens on the node on which the repair was issued. To reproduce, just apply the patches to the branch and carry out a repair using -tr on all 3 nodes. Sometimes this happens on the 1st repair; sometimes it takes repairing all 3 nodes multiple times (the most it's ever taken me is 6 repairs, 2 on each node in a 3-node ccm cluster).

        Looking at the threading, there are a lot of ReadStage threads running after this happens (approx. 32 waiting and 1 running), so as far as I can tell the polling doesn't stop even though the repair completed; there is also one TracingStage that is waiting to complete. I'm attaching prerepair-vs-postrepair.diff, which shows the extra threads once this problem occurs.

        Ben Chan added a comment -
        # tested with branch 5483 @ bce0c2c555a3; should also work following successful
        #git apply 5483-full-trunk.txt
        W=https://issues.apache.org/jira/secure/attachment
        for url in \
          $W/12635094/5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch          \
          $W/12635095/5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch \
          $W/12635096/5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch  \
          $W/12635097/5483-v08-14-Poll-system_traces.events.patch                          \
          $W/12635098/5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch
        do [ -e $(basename $url) ] || curl -sO $url; done &&
        git apply 5483-v08-*.patch &&
        ant clean && ant
        
        ./ccm-repair-test -kR &&
        ccm node1 stop &&
        ccm node1 clear &&
        ccm node1 start &&
        ./ccm-repair-test -rt
        
        • v08-11 There was an error in one of the log formats in Differencer, which made my grep for "out of sync" in the logs fruitless.
        • v08-12 I ended up using the handleStreamEvent of StreamingRepairTask instead of implementing and registering my own StreamEventHandler. The new trace messages may need adjusting, especially for ProgressEvent, which is essentially just a toString currently.
        • v08-13 This works by adding a guarded sendNotification to TraceState#trace.
        • v08-14 This works by starting a thread to poll system_traces.events, and by adding notify functionality to TraceState. There is some jitter in the ordering between local and remote traces. An easy fix would be to have the query thread handle all sendNotification of traces. You have to accept latency in sendNotification of local traces in order to get better ordering. It might be necessary to delay all trace sendNotification by a few seconds to make it more likely that remote traces have arrived.
        • v08-15 Even more TraceState functionality, all to try to reduce the amount of polling without hurting latency too much. There are only a few local traces that you would expect to be followed by a remote trace, so only wake up for those. Poll with an exponential backoff after each notification (a rough sketch of the backoff loop follows this list).
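
        A minimal, self-contained sketch of the backoff idea (not the attached patch; the query against system_traces.events is faked out, and every name below is made up):

        // Hypothetical backoff poller; the real patch polls system_traces.events.
        final class BackoffPollerSketch
        {
            // Stand-in for "query events newer than the last event_id seen".
            static boolean pollOnce()
            {
                return false; // pretend nothing new arrived
            }

            public static void main(String[] args) throws InterruptedException
            {
                long intervalMillis = 100;
                final long maxIntervalMillis = 4000;
                for (int i = 0; i < 10; i++) // the real loop runs until the repair session ends
                {
                    if (pollOnce())
                        intervalMillis = 100; // activity seen: poll aggressively again
                    else
                        intervalMillis = Math.min(intervalMillis * 2, maxIntervalMillis); // quiet: back off
                    Thread.sleep(intervalMillis);
                }
            }
        }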

        Heuristics are messy, and I expect plenty of opinions on v08-14 and v08-15. I'm not especially proud of that code, but I can't think of anything better at the moment, given the (self-imposed?) constraints.

        I may have reinvented the wheel with synchronization primitives. I checked java.util.concurrent.* and SimpleCondition, but not much beyond that. I could have missed something; I don't fully understand some of the classes. What I wanted was to be woken up (with a timeout) if anything has changed since the last time I checked. Theoretically, it should work for multiple consumers (As long as no one waits for longer than Integer.MAX_VALUE updates), though that's not really necessary here, if that would simplify the code.
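
        Roughly, what I have in mind is the shape below (a toy sketch, not the patch; all names are made up): an update counter plus wait/notifyAll, so a consumer sleeps until the count moves past what it last saw or a timeout expires.

        // Toy "wake me if anything changed since I last looked" primitive.
        final class ChangeSignalSketch
        {
            private int updates; // wraps after Integer.MAX_VALUE updates, hence the caveat above

            synchronized void signal()
            {
                updates++;
                notifyAll();
            }

            // Blocks until the count differs from lastSeen or the timeout expires, then returns it.
            // (A single wait() keeps the sketch short; real code would loop and track the deadline.)
            synchronized int awaitChange(int lastSeen, long timeoutMillis) throws InterruptedException
            {
                if (updates == lastSeen)
                    wait(timeoutMillis);
                return updates;
            }

            public static void main(String[] args) throws InterruptedException
            {
                ChangeSignalSketch signal = new ChangeSignalSketch();
                signal.signal();
                System.out.println(signal.awaitChange(0, 1000)); // prints 1 without blocking
            }
        }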

        The code seems to work reasonably well for small-scale tests. I can convince myself that it won't blow up for long repairs, but haven't done a full test yet.

        Ben Chan added a comment -

        About the system_traces feedback loop during repair: I don't have a way to consistently reproduce it. Normally these test repairs take no longer than 1-2 minutes, but occasionally one gets into the 10-20 minute range, with lots of merkle trees for system_traces getting sent (system_traces.events at that point is in the tens of thousands of rows), at which point I usually just stop the repair and start fresh with a new cluster. I'll pay closer attention next time to see if I can get a good repro case.

        Jonathan Ellis added a comment -

        Maybe exclude system_traces from repair if a repair trace is going on.

        I agree that something fishy is going on, but I'd like to understand the cause before adding a workaround. Can you shed any light Yuki Morishita?

        Maybe add a placeholder row with a null duration for ongoing repair sessions

        This is what query tracing does (Tracing.begin, then stopSession writes the duration).

        Verbose option; dump all traces to nodetool.

        I'd be okay pushing this to another ticket, FWIW.

        Simplify trace messages

        The main principle you want to be careful to observe is to 1) log everything the user needs to know, but 2) without logging extra noise. Redundant information is noise. So logging the repair session once in the "New session" line is adequate; you don't need to repeat it with every message afterwards. Similarly, the "Requesting merkle tree" line shows what peers are involved, so we don't need to repeat that elsewhere either.

        Yuki Morishita added a comment - - edited

        Is there a simple way to create a situation where a repair requires streaming?

        The easiest way is to nuke one node after writing data to all nodes.
        So, for example, you create a 3-node cluster, create a keyspace with RF=3, fill in data, clear one node (ccm node1 clear), bring it up again, then repair.

        Somehow the streaming repair is getting done somewhere other than Differencer#run.

        Differencer creates StreamingRepairTask at the end which creates StreamPlan and executes.
        In order to log streaming, you can add StreamEventHandler to catch streaming events.

        For more detail, see http://wiki.apache.org/cassandra/Streaming2.
        For example, BulkLoader/SSTableLoader uses StreamEvent to track progress.
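
        Roughly, the shape is something like the sketch below (purely illustrative; it does not use the real StreamEventHandler interface or any Cassandra classes, and every name is made up). In the real code you would register the handler for the streaming session and record a trace instead of printing:

        // Hypothetical handler shape; the real code registers a StreamEventHandler
        // for the stream session and traces the events it receives.
        final class StreamTraceSketch
        {
            // Stand-in for a streaming event; the real type lives in org.apache.cassandra.streaming.
            static final class FakeStreamEvent
            {
                final String description;

                FakeStreamEvent(String description)
                {
                    this.description = description;
                }
            }

            static void handleStreamEvent(FakeStreamEvent event)
            {
                // In the patch this would become a trace message rather than a println.
                System.out.println("stream event: " + event.description);
            }

            public static void main(String[] args)
            {
                handleStreamEvent(new FakeStreamEvent("session prepared"));
                handleStreamEvent(new FakeStreamEvent("file sent to /127.0.0.1"));
            }
        }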

        Ben Chan added a comment -

        Public TODO list; please comment if any should be not-TODO:

        • Trace streaming and/or lack thereof (I think hooking Differencer#run and related threads should be enough).
        • Maybe exclude system_traces from repair if a repair trace is going on. There seems to be a feedback loop triggering multiple repair commands otherwise.
        • Maybe add a placeholder row with a null duration for ongoing repair sessions. Makes it easier to find the session_id for queries. Update with the final duration at the end.
        • Populate started_at, request, etc in system_traces.sessions.
        • Send the session_id back to nodetool.
        • Shorten/simplify trace messages.
        • Verbose option; dump all traces to nodetool.

        Implementation thoughts follow; please warn of potential problems.

        Verbose option:

        To send local traces back to nodetool, adding a parallel sendNotification is easy enough. Getting the remote traces seems like it would involve monitoring updates to system_traces.events.

        At first I thought of triggers, but the docs say that triggers run on the coordinator node, which is not necessarily the node you're repairing. So that leaves polling the table, with heuristics that are hopefully good enough to reduce the amount of extra work.

        Simplify trace messages:

        Skipping to the point of difference:

        It looks like each sub-RepairSession has a unique session id (a timeuuid but different from either session_id or event_id). Here is a section of the select above aligned and simplified to increase SNR. The redacted parts are identical.

        [repair #fedc3790-...] Received merkle tree for events from /127.0.0.1
        [repair #fef40550-...] new session: will sync /127.0.0.1, /127.0.0.2 on range (3074457345618258602,-9223372036854775808] for system_traces.[sessions, events]
        [repair #fef40550-...] requesting merkle trees for sessions (to [/127.0.0.2, /127.0.0.1])
        [repair #fedc3790-...] session completed successfully
        [repair #fef40550-...] Sending completed merkle tree to /127.0.0.1 for system_traces/sessions
        

        In the example above, you can see some overlap in the repair session traces, so the sub-session_id (so to speak) has some use in distinguishing these. Since this sub-session_id only has to be unique for a particular repair session, maybe it would be worth it to map each one to a small integer?
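
        Something like the following is what I mean (purely illustrative; none of these names come from the patch):

        import java.util.UUID;
        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.ConcurrentMap;
        import java.util.concurrent.atomic.AtomicInteger;

        // Assigns a small per-repair integer label to each sub-session timeuuid.
        final class SubSessionLabelSketch
        {
            private final ConcurrentMap<UUID, Integer> labels = new ConcurrentHashMap<>();
            private final AtomicInteger next = new AtomicInteger(1);

            int labelFor(UUID subSessionId)
            {
                Integer existing = labels.get(subSessionId);
                if (existing != null)
                    return existing;
                Integer fresh = next.getAndIncrement();           // may skip a number if two threads race
                Integer raced = labels.putIfAbsent(subSessionId, fresh);
                return raced != null ? raced : fresh;
            }

            public static void main(String[] args)
            {
                SubSessionLabelSketch sketch = new SubSessionLabelSketch();
                UUID a = UUID.randomUUID();
                UUID b = UUID.randomUUID();
                System.out.println(sketch.labelFor(a)); // 1
                System.out.println(sketch.labelFor(b)); // 2
                System.out.println(sketch.labelFor(a)); // 1 again
            }
        }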

        For convenience, I attached a small, not-very-pretty patch that left-justifies columns of type text in cqlsh (makes it easier to read the traces).

        Trace streaming:

        Is there a simple way to create a situation where a repair requires streaming? Here is what I'm currently doing, but it doesn't work.

        #!/bin/sh
        ccm create $(mktemp -u 5483-XXX) &&
        ccm populate -n 3 &&
        ccm updateconf --no-hinted-handoff &&
        ccm start &&
        ccm node1 cqlsh <<"E"
        CREATE SCHEMA s1
        WITH replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 };
        
        CREATE TABLE s1.users (
          user_id varchar PRIMARY KEY,
          first varchar,
          last varchar,
          age int)
        WITH read_repair_chance = 0.0;
        
        INSERT INTO s1.users (user_id, first, last, age)
          VALUES ('jsmith', 'John', 'Smith', 42);
        E
        
        ccm node1 stop &&
        python - <<"E" | ccm node2 cqlsh
        import random as r
        fs=["John","Art","Skip","Doug","Koala"]
        ls=["Jackson","Jacobs","Jefferson","Smythe"]
        for (f, l) in [(f,l) for f in fs for l in ls]:
          print (
            "insert into s1.users (user_id, age, first, last) "
            "values('%s', %d, '%s', '%s');"
          ) % ((f[0]+l).lower(), r.randint(10,100), f, l)
        E
        ccm node2 cqlsh <<"E"
        select count(*) from s1.users;
        E
        ccm node1 start
        ccm node1 cqlsh <<"E"
        select count(*) from s1.users;
        E
        nodetool -p $(ccm node1 show | awk -F= '/jmx_port/{print $2}') repair -tr s1
        

        The problem is that despite disabling hinted handoff and setting read_repair_chance to 0, the endpoints are still reported as consistent in Differencer#run. Yet node1 is clearly missing some rows prior to the repair, and has them at the end. Somehow the streaming repair is getting done somewhere other than Differencer#run. Is some sort of handoff still being done somewhere? I'm sure there is something simple, but I'm missing it.

        Jonathan Ellis added a comment - - edited

        Thanks. I'm going to bikeshed the messages a bit:

        • Starting repair command #3, repairing 2 ranges for keyspace system_traces (seq=true, full=true) -> Repairing 2 ranges for system_traces.{sessions, events} (seq=true, full=true)
        • repair #fedc3790-a9fa-11e3-ab61-2387bbb38302 new session: will sync /127.0.0.1, /127.0.0.3 on range (-3074457345618258603,3074457345618258602] for system_traces.[sessions, events] -> New session fedc3790-a9fa-11e3-ab61-2387bbb38302 will sync range (-3074457345618258603,3074457345618258602]
        • All the rest: omit the [repair] session id since that's already traced, and capitalize the first word
        • Add a message if there's nothing to stream

        What does it log when there is something to stream?

        Lyuben Todorov added a comment -

        It's a good idea to add a verbose option to show tracing as it happens. Currently, to enable logging to the system_traces.events table, the -tr flag has to be supplied (./nodetool repair -tr), and the traces can then be viewed with SELECT * FROM system_traces.events:

        cqlsh> SELECT * FROM system_traces.events ;
        
         session_id                           | event_id                             | activity                                                                                                                                                                              | command | source    | source_elapsed | thread
        --------------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-----------+----------------+-----------------------
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | fedbe971-a9fa-11e3-ab61-2387bbb38302 |                                                                                       Starting repair command #3, repairing 2 ranges for keyspace system_traces (seq=true, full=true) |  REPAIR | 127.0.0.1 |             11 |             Thread-16
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | fedc5ea0-a9fa-11e3-ab61-2387bbb38302 | [repair #fedc3790-a9fa-11e3-ab61-2387bbb38302] new session: will sync /127.0.0.1, /127.0.0.3 on range (-3074457345618258603,3074457345618258602] for system_traces.[sessions, events] |  REPAIR | 127.0.0.1 |           2493 | AntiEntropySessions:7
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | fedef6b0-a9fa-11e3-ab61-2387bbb38302 |                                                                     [repair #fedc3790-a9fa-11e3-ab61-2387bbb38302] requesting merkle trees for sessions (to [/127.0.0.3, /127.0.0.1]) |  REPAIR | 127.0.0.1 |          19687 |       RepairJobTask:2
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | fee167b0-a9fa-11e3-b120-2387bbb38302 |                                                                 [repair #fedc3790-a9fa-11e3-ab61-2387bbb38302] Sending completed merkle tree to /127.0.0.1 for system_traces/sessions |  REPAIR | 127.0.0.3 |          31937 |    AntiEntropyStage:1
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | fee3b1a0-a9fa-11e3-ab61-2387bbb38302 |                                                                                      [repair #fedc3790-a9fa-11e3-ab61-2387bbb38302] Received merkle tree for sessions from /127.0.0.3 |  REPAIR | 127.0.0.1 |          50996 |    AntiEntropyStage:1
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | fee7d050-a9fa-11e3-ab61-2387bbb38302 |                                                                                      [repair #fedc3790-a9fa-11e3-ab61-2387bbb38302] Received merkle tree for sessions from /127.0.0.1 |  REPAIR | 127.0.0.1 |          77907 |    AntiEntropyStage:1
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | feea1a40-a9fa-11e3-ab61-2387bbb38302 |                                                                       [repair #fedc3790-a9fa-11e3-ab61-2387bbb38302] requesting merkle trees for events (to [/127.0.0.3, /127.0.0.1]) |  REPAIR | 127.0.0.1 |          92358 |       RepairJobTask:2
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | feecd960-a9fa-11e3-b120-2387bbb38302 |                                                                   [repair #fedc3790-a9fa-11e3-ab61-2387bbb38302] Sending completed merkle tree to /127.0.0.1 for system_traces/events |  REPAIR | 127.0.0.3 |         107062 |    AntiEntropyStage:1
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | feef2350-a9fa-11e3-ab61-2387bbb38302 |                                                                                        [repair #fedc3790-a9fa-11e3-ab61-2387bbb38302] Received merkle tree for events from /127.0.0.3 |  REPAIR | 127.0.0.1 |         126230 |    AntiEntropyStage:1
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | fef3de40-a9fa-11e3-ab61-2387bbb38302 |                                                                                        [repair #fedc3790-a9fa-11e3-ab61-2387bbb38302] Received merkle tree for events from /127.0.0.1 |  REPAIR | 127.0.0.1 |         157175 |    AntiEntropyStage:1
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | fef40551-a9fa-11e3-ab61-2387bbb38302 | [repair #fef40550-a9fa-11e3-ab61-2387bbb38302] new session: will sync /127.0.0.1, /127.0.0.2 on range (3074457345618258602,-9223372036854775808] for system_traces.[sessions, events] |  REPAIR | 127.0.0.1 |         158225 | AntiEntropySessions:8
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | fef4c8a0-a9fa-11e3-ab61-2387bbb38302 |                                                                     [repair #fef40550-a9fa-11e3-ab61-2387bbb38302] requesting merkle trees for sessions (to [/127.0.0.2, /127.0.0.1]) |  REPAIR | 127.0.0.1 |         162702 |       RepairJobTask:2
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | fef4efb0-a9fa-11e3-ab61-2387bbb38302 |                                                                                                         [repair #fedc3790-a9fa-11e3-ab61-2387bbb38302] session completed successfully |  REPAIR | 127.0.0.1 |         163376 | AntiEntropySessions:7
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | fef60120-a9fa-11e3-b701-2387bbb38302 |                                                                 [repair #fef40550-a9fa-11e3-ab61-2387bbb38302] Sending completed merkle tree to /127.0.0.1 for system_traces/sessions |  REPAIR | 127.0.0.2 |          11744 |    AntiEntropyStage:1
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | fefae320-a9fa-11e3-ab61-2387bbb38302 |                                                                                      [repair #fef40550-a9fa-11e3-ab61-2387bbb38302] Received merkle tree for sessions from /127.0.0.2 |  REPAIR | 127.0.0.1 |         203106 |    AntiEntropyStage:1
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | feff9e10-a9fa-11e3-ab61-2387bbb38302 |                                                                                      [repair #fef40550-a9fa-11e3-ab61-2387bbb38302] Received merkle tree for sessions from /127.0.0.1 |  REPAIR | 127.0.0.1 |         233383 |    AntiEntropyStage:1
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | ff0199e0-a9fa-11e3-ab61-2387bbb38302 |                                                                       [repair #fef40550-a9fa-11e3-ab61-2387bbb38302] requesting merkle trees for events (to [/127.0.0.2, /127.0.0.1]) |  REPAIR | 127.0.0.1 |         247143 |       RepairJobTask:2
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | ff036ea0-a9fa-11e3-b701-2387bbb38302 |                                                                   [repair #fef40550-a9fa-11e3-ab61-2387bbb38302] Sending completed merkle tree to /127.0.0.1 for system_traces/events |  REPAIR | 127.0.0.2 |          99303 |    AntiEntropyStage:1
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | ff054360-a9fa-11e3-ab61-2387bbb38302 |                                                                                        [repair #fef40550-a9fa-11e3-ab61-2387bbb38302] Received merkle tree for events from /127.0.0.2 |  REPAIR | 127.0.0.1 |         270729 |    AntiEntropyStage:1
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | ff089ec0-a9fa-11e3-ab61-2387bbb38302 |                                                                                        [repair #fef40550-a9fa-11e3-ab61-2387bbb38302] Received merkle tree for events from /127.0.0.1 |  REPAIR | 127.0.0.1 |         292795 |    AntiEntropyStage:1
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | ff096210-a9fa-11e3-ab61-2387bbb38302 |                                                                                                         [repair #fef40550-a9fa-11e3-ab61-2387bbb38302] session completed successfully |  REPAIR | 127.0.0.1 |         297948 | AntiEntropySessions:8
         fedbe970-a9fa-11e3-ab61-2387bbb38302 | ff098920-a9fa-11e3-ab61-2387bbb38302 |                                                                                                                                                            Repair command #3 finished |  REPAIR | 127.0.0.1 |         298314 |             Thread-16
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | feb74a71-a9fa-11e3-ab61-2387bbb38302 |                                                                                                Starting repair command #2, repairing 2 ranges for keyspace test (seq=true, full=true) |  REPAIR | 127.0.0.1 |              8 |             Thread-15
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | feb79890-a9fa-11e3-ab61-2387bbb38302 |                        [repair #feb77180-a9fa-11e3-ab61-2387bbb38302] new session: will sync /127.0.0.1, /127.0.0.3 on range (-3074457345618258603,3074457345618258602] for test.[jj] |  REPAIR | 127.0.0.1 |           1909 | AntiEntropySessions:5
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | feb834d0-a9fa-11e3-ab61-2387bbb38302 |                                                                           [repair #feb77180-a9fa-11e3-ab61-2387bbb38302] requesting merkle trees for jj (to [/127.0.0.3, /127.0.0.1]) |  REPAIR | 127.0.0.1 |           6344 |       RepairJobTask:2
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | febcefc0-a9fa-11e3-b120-2387bbb38302 |                                                                                [repair #feb77180-a9fa-11e3-ab61-2387bbb38302] Sending completed merkle tree to /127.0.0.1 for test/jj |  REPAIR | 127.0.0.3 |          33924 |    AntiEntropyStage:1
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | febf87d0-a9fa-11e3-ab61-2387bbb38302 |                                                                                            [repair #feb77180-a9fa-11e3-ab61-2387bbb38302] Received merkle tree for jj from /127.0.0.3 |  REPAIR | 127.0.0.1 |          54472 |    AntiEntropyStage:1
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | fec728f0-a9fa-11e3-ab61-2387bbb38302 |                                                                                            [repair #feb77180-a9fa-11e3-ab61-2387bbb38302] Received merkle tree for jj from /127.0.0.1 |  REPAIR | 127.0.0.1 |         104179 |    AntiEntropyStage:1
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | fec75000-a9fa-11e3-ab61-2387bbb38302 |                        [repair #fec728f1-a9fa-11e3-ab61-2387bbb38302] new session: will sync /127.0.0.1, /127.0.0.2 on range (3074457345618258602,-9223372036854775808] for test.[jj] |  REPAIR | 127.0.0.1 |         105545 | AntiEntropySessions:6
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | fec81350-a9fa-11e3-ab61-2387bbb38302 |                                                                                                         [repair #feb77180-a9fa-11e3-ab61-2387bbb38302] session completed successfully |  REPAIR | 127.0.0.1 |         110357 | AntiEntropySessions:5
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | fec86170-a9fa-11e3-ab61-2387bbb38302 |                                                                           [repair #fec728f1-a9fa-11e3-ab61-2387bbb38302] requesting merkle trees for jj (to [/127.0.0.2, /127.0.0.1]) |  REPAIR | 127.0.0.1 |         111725 |       RepairJobTask:2
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | fed30fd0-a9fa-11e3-b701-2387bbb38302 |                                                                                [repair #fec728f1-a9fa-11e3-ab61-2387bbb38302] Sending completed merkle tree to /127.0.0.1 for test/jj |  REPAIR | 127.0.0.2 |          74919 |    AntiEntropyStage:1
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | fed5a7e0-a9fa-11e3-ab61-2387bbb38302 |                                                                                            [repair #fec728f1-a9fa-11e3-ab61-2387bbb38302] Received merkle tree for jj from /127.0.0.2 |  REPAIR | 127.0.0.1 |         199165 |    AntiEntropyStage:1
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | fed9c690-a9fa-11e3-ab61-2387bbb38302 |                                                                                            [repair #fec728f1-a9fa-11e3-ab61-2387bbb38302] Received merkle tree for jj from /127.0.0.1 |  REPAIR | 127.0.0.1 |         225798 |    AntiEntropyStage:1
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | feda89e0-a9fa-11e3-ab61-2387bbb38302 |                                                                                                         [repair #fec728f1-a9fa-11e3-ab61-2387bbb38302] session completed successfully |  REPAIR | 127.0.0.1 |         230662 | AntiEntropySessions:6
         feb74a70-a9fa-11e3-ab61-2387bbb38302 | feda89e1-a9fa-11e3-ab61-2387bbb38302 |                                                                                                                                                            Repair command #2 finished |  REPAIR | 127.0.0.1 |         231159 |             Thread-15
        
        (36 rows)
        
        
        Jonathan Ellis added a comment - - edited

        Can you post an example trace session?

        Does this add a "verbose" flag to nodetool repair to show the trace as-it-happens or do I have to manually check the trace tables?

        Lyuben Todorov added a comment -

        I'm attaching a patch that is basically all the patches that have been committed to the 5483 feature branch.

        The only change between your last set of patches and this patch is that I added DebuggableThreadPoolExecutor#createWithCachedTheadpool, which is the new constructor that allows for a cached pool (just for consistency).

        +1 from me.

        Ben Chan added a comment -

        Okay, final set of changes (I think).

        repro code:

        #git checkout 5483
        W=https://issues.apache.org/jira/secure/attachment
        for url in \
          $W/12633989/5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch \
          $W/12633990/5483-v07-08-Fix-brace-style.patch \
          $W/12633991/5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch \
          $W/12633992/5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch
        do [ -e $(basename $url) ] || curl -sO $url; done &&
        git apply 5483-v07-*.patch &&
        ant clean && ant
        
        ./ccm-repair-test
        

        Comments:

        • v07-07 I get multiple RepairJobTask:... ids in the trace, so hopefully this resolves the issue. This is as much as I can say with any assurance, since I don't normally work with threading.
        • v07-09 Without having to worry about backwards compatibility, I decided to trace-enable all repair functions I could find and conveniently convert. No real change functionally, since nothing in the source tree calls those functions and/or overloads (otherwise I would have had to change much more in order to get things to compile).
        • v07-10 Again, no functional change. I noticed the difference in parameter name when doing the conversion.

        You can take or leave the last two; they don't do anything, really.

        Lyuben Todorov added a comment -

        Ben Chan Can you please attach the last patch so we can complete this ticket? I'll apply it to my git branch.

        Lyuben Todorov added a comment -

        Move to a github workflow, or some sort of hybrid github/JIRA?

        Ben Chan Whichever you prefer; attaching a patch will be simpler, and I can merge it into the existing github branch and post the update, but if you want, feel free to create a branch yourself.

        the server isn't going to work with old versions of nodetool

        I think that is OK, since if someone upgrades the server it would be expected that they upgrade to the new nodetool as well.

        there is a lot of repetitious parallel code with the tracing and logging together

        Maybe worth refactoring on a separate ticket, but let's not cause a regression by trying to minimise code.

        Ben Chan added a comment -

        Thanks for the catch. Hopefully the fix is as easy as:

        // using OpenJDK's source code for Executors.newCachedThreadPool() as a reference
        new DebuggableThreadPoolExecutor(0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
                                         new SynchronousQueue<Runnable>(),
                                         new NamedThreadFactory("RepairJobTask"));
        

        Also note that v02-01 added CompactionExecutor extends DebuggableThreadPoolExecutor, so that is something else to watch out for.

        Re: {
        Oops (I'm normally K&R).

        Re: github
        Move to a github workflow, or some sort of hybrid github/JIRA?

        I guess the last caveat here is that most (I may have left in a few) of the old non-trace function overloads are gone, so the server isn't going to work with old versions of nodetool.

        And a newer nodetool still won't work with older server versions. There may be some tricks you can do with JMX, but I don't have deep knowledge there. Maybe adding function overloads from the future (I noticed a similar trick in MessagingService.java).

        And there is a lot of repetitious parallel code with the tracing and logging together. But that shouldn't affect the external functionality.

        Yuki Morishita added a comment -

        switching from Executors#newCachedThreadPool to DebuggableThreadPoolExecutor

        DebuggableThreadPoolExecutor with that constructor keeps only one thread for tasks submitted.
        I used newCachedThreadPool there to parallelize execution of short running tasks as much as possible.
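
        To make the difference concrete, here is a comparison using plain java.util.concurrent (nothing Cassandra-specific; I'm assuming the constructor in question amounts to a single core thread in front of an unbounded queue):

        import java.util.concurrent.Executors;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.LinkedBlockingQueue;
        import java.util.concurrent.SynchronousQueue;
        import java.util.concurrent.ThreadPoolExecutor;
        import java.util.concurrent.TimeUnit;

        // Plain ThreadPoolExecutor comparison (no Cassandra classes involved).
        final class PoolComparisonSketch
        {
            public static void main(String[] args)
            {
                // Single core thread + unbounded queue: the pool never grows past one
                // thread, because an unbounded LinkedBlockingQueue never rejects a task.
                ThreadPoolExecutor single =
                    new ThreadPoolExecutor(1, Integer.MAX_VALUE, 60, TimeUnit.SECONDS,
                                           new LinkedBlockingQueue<Runnable>());

                // What Executors.newCachedThreadPool() effectively builds: a hand-off
                // queue, so a new thread is created whenever no idle thread is available.
                ThreadPoolExecutor cached =
                    new ThreadPoolExecutor(0, Integer.MAX_VALUE, 60, TimeUnit.SECONDS,
                                           new SynchronousQueue<Runnable>());

                ExecutorService reference = Executors.newCachedThreadPool();

                single.shutdown();
                cached.shutdown();
                reference.shutdown();
            }
        }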

        Lyuben Todorov added a comment -

        Patch is almost there, I'm mostly worried about the performance hit we would be taking by switching from Executors#newCachedThreadPool to DebuggableThreadPoolExecutor. /cc Yuki Morishita

        Also there are a few codestyle nits:

        1. Validator#run has a statement with { on the same line.
        2. Tracing#getCommandName has its { on the same line

        Lastly, I created a feature branch to simplify applying patches.

        Ben Chan added a comment -

        It was more involved than I thought, partly because of heisenbugs and the trace state mysteriously not propagating (see v06-05).

        Note: changing JMX can cause mysterious errors if you don't ant clean && ant. I ran into the same kinds of stack traces as you did. It's not consistent. Sometimes I can make a JMX change and ant with no problem.

        To make patches simpler, I'm posting full repro code. I also tried to simplify the naming. Unfortunately, all the previous patches are in jumbled order due to a naming convention that doesn't sort. Fortunately, JIRA seems to have an easter egg where you can choose the attachment name by changing the url.

        # Uncomment to exactly reproduce state.
        #git checkout -b 5483-e30d6dc e30d6dc
        
        # Download all needed patches with consistent names, apply patches, build.
        W=https://issues.apache.org/jira/secure/attachment
        for url in \
          $W/12630490/5483-v02-01-Trace-filtering-and-tracestate-propagation.patch \
          $W/12630491/5483-v02-02-Put-a-few-traces-parallel-to-the-repair-logging.patch \
          $W/12631967/5483-v03-03-Make-repair-tracing-controllable-via-nodetool.patch \
          $W/12633153/5483-v06-04-Allow-tracing-ttl-to-be-configured.patch \
          $W/12633154/5483-v06-05-Add-a-command-column-to-system_traces.events.patch \
          $W/12633155/5483-v06-06-Fix-interruption-in-tracestate-propagation.patch \
          $W/12633156/ccm-repair-test
        do [ -e $(basename $url) ] || curl -sO $url; done &&
        git apply 5483-v0[236]-*.patch &&
        ant clean && ant
        
        # put on a separate line because you should at least minimally inspect
        # arbitrary code before running.
        chmod +x ./ccm-repair-test && ./ccm-repair-test
        

        ccm-repair-test has some options for convenience:

        -k keep (don't delete) the created cluster after successful exit.
        -r repair only
        -R don't repair
        -t do traced repair only
        -T don't do traced repair (if neither, then do both traced and untraced repair)
        

        The output of a test run:

        Current cluster is now: test-5483-QiR
        [2014-03-06 10:46:13,617] Nothing to repair for keyspace 'system'
        [2014-03-06 10:46:13,646] Starting repair command #1, repairing 2 ranges for keyspace s1 (seq=true, full=true)
        [2014-03-06 10:46:16,999] Repair session 72648190-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
        [2014-03-06 10:46:17,465] Repair session 73ee2ed0-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
        [2014-03-06 10:46:17,465] Repair command #1 finished
        [2014-03-06 10:46:17,485] Starting repair command #2, repairing 2 ranges for keyspace system_traces (seq=true, full=true)
        [2014-03-06 10:46:18,782] Repair session 74aaef20-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
        [2014-03-06 10:46:18,816] Repair session 74ff0290-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
        [2014-03-06 10:46:18,816] Repair command #2 finished
        0 rows exported in 0.015 seconds.
        test-5483-QiR-system_traces-events.txt
        ok
        [2014-03-06 10:46:24,128] Nothing to repair for keyspace 'system'
        [2014-03-06 10:46:24,166] Starting repair command #3, repairing 2 ranges for keyspace s1 (seq=true, full=true)
        [2014-03-06 10:46:25,366] Repair session 78a6d4e0-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
        [2014-03-06 10:46:25,415] Repair session 79263e10-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
        [2014-03-06 10:46:25,415] Repair command #3 finished
        [2014-03-06 10:46:25,485] Starting repair command #4, repairing 2 ranges for keyspace system_traces (seq=true, full=true)
        [2014-03-06 10:46:27,077] Repair session 796f7c10-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
        [2014-03-06 10:46:27,120] Repair session 79f240a0-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
        [2014-03-06 10:46:27,120] Repair command #4 finished
        48 rows exported in 0.104 seconds.
        test-5483-QiR-system_traces-events-tr.txt
        found source: 127.0.0.1
        found thread: Thread-15
        found thread: AntiEntropySessions:1
        found thread: RepairJobTask:1
        found source: 127.0.0.2
        found thread: AntiEntropyStage:1
        found source: 127.0.0.3
        found thread: AntiEntropySessions:2
        found thread: Thread-16
        found thread: AntiEntropySessions:3
        found thread: AntiEntropySessions:4
        unique sources traced: 3
        unique threads traced: 8
        All thread categories accounted for
        ok
        

        Patch comments:

        • v06-04 I did something similar to v03-03, with (almost) no refactoring. The implementation is a little messy architecturally.
        • v06-05 This is the suggestion you had to add a "command" column. I don't know how to make it the last column; at least on my box, it's column 5 of 7 despite my putting it last in the CQL. Note that ccm-repair-test's checking code will break if the column order changes.
        • v06-06 You need to submit Runnables, etc. using DebuggableThreadPoolExecutor if you want them to inherit tracestate. Tracestate propagation is very easy to break under concurrency, so this is probably the first thing to check if it ever happens again (see the sketch after this list).
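
        (Editorial illustration, plain JDK only, not from the patches: the sketch below shows why tracestate propagation breaks when tasks are handed to an ordinary executor, and how capturing the thread-local state at submit time fixes it. The ThreadLocal here merely stands in for Cassandra's tracing state.)

        import java.util.concurrent.*;

        // Sketch: a plain executor runs the task on a worker thread that never saw the
        // caller's ThreadLocal, so the "trace state" comes back null; wrapping the
        // Runnable to capture/restore the value at submit time preserves it.
        public final class TraceStatePropagationSketch
        {
            static final ThreadLocal<String> TRACE_STATE = new ThreadLocal<>();

            static Runnable propagating(Runnable task)
            {
                final String captured = TRACE_STATE.get();   // captured on the submitting thread
                return () -> {
                    TRACE_STATE.set(captured);               // restored on the worker thread
                    try { task.run(); }
                    finally { TRACE_STATE.remove(); }
                };
            }

            public static void main(String[] args) throws Exception
            {
                ExecutorService pool = Executors.newCachedThreadPool();
                TRACE_STATE.set("repair-session-1234");

                Runnable report = () ->
                    System.out.println(Thread.currentThread().getName() + " sees " + TRACE_STATE.get());
                pool.submit(report).get();               // prints "... sees null"
                pool.submit(propagating(report)).get();  // prints "... sees repair-session-1234"
                pool.shutdown();
            }
        }
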
        Lyuben Todorov added a comment - edited

        Are the latest 3 patches supposed to be incrementally added onto trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch and trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch? As in

        1 - apply trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch
        2 - apply trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch
        3 - apply one of the three latest patches (v3, v4 or v5)
        

        v5 does a lot of refactoring that I think is outside the scope of this ticket (though it might be worth its own ticket, as the idea is good), so my vote is for v3. However, I'm getting a NoSuchMethodException; can you post a branch with all the patches applied onto trunk (for v3)?

        The exception:

        java.lang.NoSuchMethodException: forceRepairAsync(java.lang.String, boolean, java.util.Collection, java.util.Collection, boolean, boolean, boolean, [Ljava.lang.String;)
        	at com.sun.jmx.mbeanserver.PerInterface.noSuchMethod(PerInterface.java:168)
        	at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:135)
        	at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
        	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
        	at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
        	at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
        	at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
        	at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
        	at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1420)
        	at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        	at java.lang.reflect.Method.invoke(Method.java:606)
        	at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
        	at sun.rmi.transport.Transport$1.run(Transport.java:177)
        	at sun.rmi.transport.Transport$1.run(Transport.java:174)
        	at java.security.AccessController.doPrivileged(Native Method)
        	at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
        	at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:556)
        	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:811)
        	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:670)
        	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        	at java.lang.Thread.run(Thread.java:744)
        

        I am thinking of calling the new table something generic like system_traces.trace_logs. I also assume, that like system_traces.events [...]

        I'd say events is already pretty generic; the new table's name should show that its traces aren't query-related like those in events. If we are going to add new tables to the trace CF, it's worth thinking about refactoring events into something more specific and adding new tables with names that carry meaning. Another possible solution is to add a "command" column to system_traces.events, which would let users retrieve data about specific kinds of events, e.g. the example below. Jonathan Ellis WDYT?

        SELECT * FROM system_traces.events;
         session_id                           | ... | thread    | command
        --------------------------------------+ ... +-----------+---------
         09d48eb0-a2f1-11e3-9f04-7d9e3709bf93 | ... | Thrift:1  | REPAIR
         29084f90-a2f3-11e3-9f04-7d9e3709bf93 | ... | Thrift:1  | QUERY
        
        (2 rows)
        
        SELECT * FROM system_traces.events WHERE command='REPAIR';
        
         session_id                           | ... | thread    | command
        --------------------------------------+ ... +-----------+---------
         09d48eb0-a2f1-11e3-9f04-7d9e3709bf93 | ... | Thrift:1  | REPAIR
        
        (1 rows)
        

        the rows in this table should expire, though perhaps not as fast as 24 hours.

        +1. Repairs can take a very long time, so this TTL should be configurable, with the default perhaps being around 90 days. With incremental repairs (in 2.1) this will end up logging a lot of data, but that is still a better choice than users who run regular repairs missing out on information.
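
        (Editorial sketch, not from the patches: one way a configurable TTL could be applied to repair trace rows. The repair_trace_ttl setting is an assumption used for illustration; the column list matches system_traces.events.)

        import java.util.concurrent.TimeUnit;

        // Sketch: build the trace-event INSERT with a configurable TTL instead of the
        // fixed 24-hour TTL used for query traces. The setting name is hypothetical.
        public final class TraceTtlSketch
        {
            // e.g. read from cassandra.yaml; 90 days as the suggested default
            static final int REPAIR_TRACE_TTL_SECONDS = (int) TimeUnit.DAYS.toSeconds(90);

            static String insertEventCql(int ttlSeconds)
            {
                return "INSERT INTO system_traces.events"
                     + " (session_id, event_id, activity, source, source_elapsed, thread)"
                     + " VALUES (?, ?, ?, ?, ?, ?) USING TTL " + ttlSeconds;
            }

            public static void main(String[] args)
            {
                System.out.println(insertEventCql(REPAIR_TRACE_TTL_SECONDS));
            }
        }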

        One last thing I wanted to ask is about the possibility of trace log levels. What is the minimum amount of trace log information you would find useful, the next amount, and so on? Should it just follow the loglevel?

        Trace is supposed to give as much info as possible and tends to be used for debugging problems, e.g. slow queries or, in this case, repairs taking too long, so it's important to include useful information but not spam the logs with every detail. Different log levels might be useful, but in this ticket the aim is to track the progress of repairs, so logging each repair command's completion should be sufficient.

        Ben Chan added a comment -

        These next few patches are different variations on adding nodetool control of tracing (use nodetool repair -tr), and are all based on the v02-0002 patch, hence the naming and the 0003 numbering. An overview:

        • v03 simply adds a trace boolean to all of the repair functions.
        • v04 does not actually work, due to complications with sending an EnumSet parameter via JMX.
        • v05 has the same structure as v04, except it uses a long and traditional bit flags.

        The idea for v04 and v05 was to consolidate all of the boolean options into a single parameter. That way, adding or removing boolean options doesn't require you to modify the entire call chain. It also makes binary compatibility easier going forward (no need to maintain an ever-growing list of function overloads). I'm personally leaning towards v05.
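
        (Editorial sketch, not from the patches: roughly what the v05 bitflag approach looks like. The flag names here are illustrative, not the actual option set.)

        // Sketch: pack boolean repair options into a single long so the JMX method
        // signature stays stable when options are added or removed.
        public final class RepairOptionFlags
        {
            public static final long SEQUENTIAL    = 1L << 0;
            public static final long LOCAL_ONLY    = 1L << 1;
            public static final long PRIMARY_RANGE = 1L << 2;
            public static final long TRACE         = 1L << 3;

            public static long encode(boolean sequential, boolean localOnly, boolean primaryRange, boolean trace)
            {
                long flags = 0;
                if (sequential)   flags |= SEQUENTIAL;
                if (localOnly)    flags |= LOCAL_ONLY;
                if (primaryRange) flags |= PRIMARY_RANGE;
                if (trace)        flags |= TRACE;
                return flags;
            }

            public static boolean isSet(long flags, long flag)
            {
                return (flags & flag) != 0;
            }
        }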

        About tracing to a separate table: an earlier comment mentioned wanting to trace bootstrap and decommission. I wonder if these would go into that same table. If so, I am thinking of calling the new table something generic like system_traces.trace_logs. I also assume that, like system_traces.events, the rows in this table should expire, though perhaps not as fast as 24 hours. Thoughts on the naming and the use of the system_traces schema?

        One last thing I wanted to ask is about the possibility of trace log levels. What is the minimum amount of trace log information you would find useful, the next amount, and so on? Should it just follow the loglevel? (One possible problem with that is that you can't change the loglevel without a server restart.)

        I don't run Cassandra in any sort of production environment, so I'm not as familiar with the use cases as I would like.

        Lyuben Todorov added a comment -

        Ben Chan Join the club. The patch makes it so that repair is traced regardless of whether tracing is on or not. Is this on purpose? Also, as you said, the table does fill up rather quickly, so this tracing should be moved to a separate table as Oscar Bonilla suggested.

        Ben Chan added a comment -

        Sorry; I forgot to check email for a few days. Attached a new patch version rebased onto trunk commit 4620823. The previous minimal test still works (despite gratuitous use of cat).

        Lyuben Todorov added a comment -

        Ben Chan Can you please rebase to trunk? Patches don't apply cleanly anymore.

        Jonathan Ellis added a comment -

        Can you review, Lyuben?

        Ben Chan added a comment -

        Thus far, I've been going under the assumption that it's mostly meant to be used as a combination of performance profiling and some sort of globally accessible error log (for that particular repair session), though there will probably be enough information there to extract some stats as well.

        I could use some suggestions on what things (and what level of detail) would be good to trace. In addition to what I already have, I can only think of a few places in Differencer and StreamingRepairTask, along with more complete coverage of the repair-related error logs.
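
        (Editorial sketch, not from the patches and not Cassandra's actual classes: the rough granularity of trace points in the tree-comparison/streaming phase. The trace callback stands in for whatever tracing hook the patch uses.)

        import java.util.function.Consumer;

        // Sketch of trace-point granularity for the tree-comparison phase of a repair;
        // "trace" is a stand-in callback, and the class only mimics the shape of the work.
        final class DifferencerTraceSketch
        {
            private final Consumer<String> trace;

            DifferencerTraceSketch(Consumer<String> trace)
            {
                this.trace = trace;
            }

            void run(String endpoint1, String endpoint2, int differingRanges)
            {
                trace.accept(String.format("Comparing merkle trees for %s and %s", endpoint1, endpoint2));
                if (differingRanges == 0)
                {
                    trace.accept(String.format("Endpoints %s and %s are consistent", endpoint1, endpoint2));
                    return;
                }
                trace.accept(String.format("Endpoints %s and %s differ in %d ranges; scheduling streaming repair",
                                           endpoint1, endpoint2, differingRanges));
            }
        }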

        Oscar Bonilla added a comment -

        I was looking at getting some stats from repair and was pointed here. Looking at the code makes me think that repair is more like compaction than query tracing, i.e. it would make more sense to keep a system.repair_history that mimics system.compaction_history.
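
        (Editorial sketch of that suggestion, loosely modeled on system.compaction_history; the table and column names here are assumptions, not anything proposed in the attached patches.)

        // Sketch: a possible schema for a repair-history system table, expressed as the
        // kind of CQL string Cassandra uses when declaring its system tables.
        public final class RepairHistorySchemaSketch
        {
            public static final String CREATE_REPAIR_HISTORY =
                "CREATE TABLE IF NOT EXISTS system.repair_history ("
                + " id timeuuid PRIMARY KEY,"
                + " keyspace_name text,"
                + " columnfamily_names set<text>,"
                + " ranges_repaired int,"
                + " started_at timestamp,"
                + " finished_at timestamp,"
                + " status text)";

            public static void main(String[] args)
            {
                System.out.println(CREATE_REPAIR_HISTORY);
            }
        }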

        Ben Chan added a comment -

        This patch contains only the minimum necessary to get a proof-of-concept working, but I thought I'd get some feedback before going any further.

        Currently it reproduces some of the repair logs into system_traces.events. For the simple test below, it's not too bad, but the traces seem to grow large quickly as the amount of data grows. The same technique used for repair seems to work for bootstrap and decommission, with the caveat that if you try to trace too far into the decommission process, the trace state doesn't get propagated to the other nodes.

        For convenience:

        # optional; patch should apply with any recent trunk
        #git checkout -b 5483-repair-filtering 8ebeee1
        git apply trunk@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt
        git apply trunk@8ebeee1-5483-v01-002-simple-repair-tracing.txt
        ant
        
        # simple test
        ccm create test-5483
        ccm populate -n 3
        ccm start
        
        cat <<"EOT" | ccm node1 cqlsh
        CREATE SCHEMA s1
        WITH replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 };
        
        USE s1;
        
        CREATE TABLE users (
          user_id varchar PRIMARY KEY,
          first varchar,
          last varchar,
          age int
        );
        
        INSERT INTO users (user_id, first, last, age)
          VALUES ('jsmith', 'John', 'Smith', 42);
        EOT
        
        ccm node1 repair
        
        cat <<"EOT" | ccm node1 cqlsh
        copy system_traces.events to 'test-5483-system_traces-events.txt';
        EOT
        
        Sylvain Lebresne added a comment -

        We probably shouldn't even stop at repair (but include bootstrap/decommission, etc.), though I'm not suggesting by that that we should do it all in one shot.

        Jonathan Ellis added a comment -

        Good idea!


          People

            • Assignee: Ben Chan
            • Reporter: Yuki Morishita
            • Reviewer: Joshua McKenzie
            • Votes: 2
            • Watchers: 27
