Cassandra
  1. Cassandra
  2. CASSANDRA-2698

Instrument repair to be able to assess it's efficiency (precision)

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Fix Version/s: 2.0.1
    • Component/s: None
    • Labels:

      Description

      Some reports indicate that repair sometime transfer huge amounts of data. One hypothesis is that the merkle tree precision may deteriorate too much at some data size. To check this hypothesis, it would be reasonably to gather statistic during the merkle tree building of how many rows each merkle tree range account for (and the size that this represent). It is probably an interesting statistic to have anyway.

      1. trunk-2698.txt
        34 kB
        Benedict
      2. patch.taketwo.forreview.diff
        42 kB
        Benedict
      3. patch.taketwo.alpha.diff
        26 kB
        Benedict
      4. patch-rebased.diff
        35 kB
        Benedict
      5. patch.diff
        34 kB
        Benedict
      6. patch_2698_v1.txt
        7 kB
        Ivan Sobolev
      7. nodetool_repair_and_cfhistogram.tar.gz
        6 kB
        Jason Wee

        Issue Links

          Activity

          Hide
          Yuki Morishita added a comment -

          Thanks for the patch.

          I committed the patch to be released in 2.0.1, with nit style fix and type change(int to long in CountingDigest.count).

          We can change the log output in the future if needed.

          Show
          Yuki Morishita added a comment - Thanks for the patch. I committed the patch to be released in 2.0.1, with nit style fix and type change(int to long in CountingDigest.count). We can change the log output in the future if needed.
          Hide
          Benedict added a comment -

          Hi Yuki,

          Sorry for the slight delay. I wanted to setup my development environment on my laptop after a hard disk failure before completing the changes, so I could configure ccm, intellij, etc.

          I opted in the end for the format [lb..ub] for range printing, as this seems to more neatly solve the problem of indicating the absolute min. There are some ugly complications with the other method around indicating a LB of 0 given the way EstimatedHistogram is defined ("defaults" to a LB of 1). It would probably result in -1s being needed again.

          Show
          Benedict added a comment - Hi Yuki, Sorry for the slight delay. I wanted to setup my development environment on my laptop after a hard disk failure before completing the changes, so I could configure ccm, intellij, etc. I opted in the end for the format [lb..ub] for range printing, as this seems to more neatly solve the problem of indicating the absolute min. There are some ugly complications with the other method around indicating a LB of 0 given the way EstimatedHistogram is defined ("defaults" to a LB of 1). It would probably result in -1s being needed again.
          Hide
          Benedict added a comment -

          Of course the other option is to always emit the range [0..lb] even if its not populated to demarcate the lb - which is your preference?

          Show
          Benedict added a comment - Of course the other option is to always emit the range [0..lb] even if its not populated to demarcate the lb - which is your preference?
          Hide
          Benedict added a comment -

          Good points.

          As to the changes to differenceHelper(), the row count permits not breaking up contiguous ranges of differences that happen to be separated by unpopulated leaves (using just the hash to determine if the data was populated I realised was dangerous, as you cannot disambiguate between no rows and a non-zero number of empty rows), which in my previous patch was generating a lot of ugly log messages. After sending my patch last night I must admit I began to doubt the sense of keeping the changes in, and was probably the hangover of wanting to retain what I could from the previous patch. I think "kill your babies" is the mantra to apply here, as it doesn't serve any purpose at the moment, and if we don't intend to send counts over the wire would be actively dangerous.

          I'll strip out those changes, modify the messages and and fire over another patch.

          That said, I'd prefer to emit the lower bound as well so we know the starting point; "~100: xxx" doesn't tell you if the distribution is 0-100, or 99-100, which might be useful information. This is only helpful for the first item, so could emit only for that, but for neatness I'd probably retain it for all; since we're dealing with integers there's an easy fix of just bumping both start/end by 1 and swapping the brackets.

          Show
          Benedict added a comment - Good points. As to the changes to differenceHelper(), the row count permits not breaking up contiguous ranges of differences that happen to be separated by unpopulated leaves (using just the hash to determine if the data was populated I realised was dangerous, as you cannot disambiguate between no rows and a non-zero number of empty rows), which in my previous patch was generating a lot of ugly log messages. After sending my patch last night I must admit I began to doubt the sense of keeping the changes in, and was probably the hangover of wanting to retain what I could from the previous patch. I think "kill your babies" is the mantra to apply here, as it doesn't serve any purpose at the moment, and if we don't intend to send counts over the wire would be actively dangerous. I'll strip out those changes, modify the messages and and fire over another patch. That said, I'd prefer to emit the lower bound as well so we know the starting point; "~100: xxx" doesn't tell you if the distribution is 0-100, or 99-100, which might be useful information. This is only helpful for the first item, so could emit only for that, but for neatness I'd probably retain it for all; since we're dealing with integers there's an easy fix of just bumping both start/end by 1 and swapping the brackets.
          Hide
          Yuki Morishita added a comment -

          Benedict,

          Thanks for the update.

          1) I'm not sure what you mean by not "serializing those"

          I meant you don't have to send back to the coordinator.
          Changing serialized format means we have to bump up messaging version defined in MessagingService.
          2.0 got feature freeze, so I think it's better to wait until next minor version for message change.

          Also, I looked at the change made to MerkleTree#differenceHelper, and I'm still not sure how row count helps improve logic.
          What is the difference from just using hash value?

          5) One thing we might want to consider changing is the format of the EstimatedHistogram ranges in the log messages

          Yeah, especially -1 in (-1, 0] feels weird. How about omitting lower bound from the label and output like:

           ~0: xxx
          ~10: xxx
          ~20: xxx
          

          nit: you should surround whole output logic in Validator#compele with `logger.isDebugEnabled` check

          Show
          Yuki Morishita added a comment - Benedict, Thanks for the update. 1) I'm not sure what you mean by not "serializing those" I meant you don't have to send back to the coordinator. Changing serialized format means we have to bump up messaging version defined in MessagingService. 2.0 got feature freeze, so I think it's better to wait until next minor version for message change. Also, I looked at the change made to MerkleTree#differenceHelper, and I'm still not sure how row count helps improve logic. What is the difference from just using hash value? 5) One thing we might want to consider changing is the format of the EstimatedHistogram ranges in the log messages Yeah, especially -1 in (-1, 0] feels weird. How about omitting lower bound from the label and output like: ~0: xxx ~10: xxx ~20: xxx nit: you should surround whole output logic in Validator#compele with `logger.isDebugEnabled` check
          Hide
          Benedict added a comment - - edited

          Oh, and patch is against main trunk, as before

          Show
          Benedict added a comment - - edited Oh, and patch is against main trunk, as before
          Hide
          Benedict added a comment -

          Hi Yuki,

          Had some fun rebasing, but think everything looks good now. A few things to note:

          1) I'm not sure what you mean by not "serializing those" - for correctness I serialize all of the data in a node. Do you want me to change the serialization methods to not send these values? I don't log them the other end, but I would prefer they were sent to ensure no surprises for users of the data, and also because of some optimisations to difference() that rely on knowing the number of rows for each sub-tree. It's not a tremendous amount of data after all.

          2) I've modified DifferencerTest, and created two versions of the testDifference() method - one that tests differences on an empty tree, and one which tests a tree that has been populated with rows. Previously only the former was tested. This is because the changes I made to difference() for my previous patch, which I have retained and which ensures contiguous ranges are emitted where possible, treats the entire empty tree as one contiguous difference range (since the only non-empty sub-range in the tree is different), which was breaking the previous test. This test now works with the fully populated tree, and the previous test now confirms that the whole tree is considered different when it is empty. It's possible you may want to not deploy these improvements in this patch, but it seems a good idea to me whilst it's being modified, and given that I'd made the change already. Since we're not logging the ranges themselves at this time it won't have any direct impact, but it will be useful if that ever changes, and might help with future debugging.

          3) I've updated the MerkleTreeTest methods to test the serialization and difference changes, and introduced a new HistogramBuilderTest

          4) The histogram is built differently from my first patch, and is described in HistogramBuilder. Basically rather than creating neat linear ranges, I calculate the mean and create ranges that are multiples of the standard deviation either side of the mean, up to min/max (or, in this case, 3 stdevs, plus one range to min/max)

          5) One thing we might want to consider changing is the format of the EstimatedHistogram ranges in the log messages. I've reproduced faithfully the boundary conventions of the EstimatedHistogram, but this is not a user friendly convention - it has an exclusive lower bound and inclusive upper bound, as opposed to the typical opposite convention. As such we get ranges like (-1, 0] to represent the range containing only 0, as opposed to [0, 1)

          Think that's everything. Should respond quickly to queries at the moment, so drop me a line if you have any questions.

          Show
          Benedict added a comment - Hi Yuki, Had some fun rebasing, but think everything looks good now. A few things to note: 1) I'm not sure what you mean by not "serializing those" - for correctness I serialize all of the data in a node. Do you want me to change the serialization methods to not send these values? I don't log them the other end, but I would prefer they were sent to ensure no surprises for users of the data, and also because of some optimisations to difference() that rely on knowing the number of rows for each sub-tree. It's not a tremendous amount of data after all. 2) I've modified DifferencerTest, and created two versions of the testDifference() method - one that tests differences on an empty tree, and one which tests a tree that has been populated with rows. Previously only the former was tested. This is because the changes I made to difference() for my previous patch, which I have retained and which ensures contiguous ranges are emitted where possible, treats the entire empty tree as one contiguous difference range (since the only non-empty sub-range in the tree is different), which was breaking the previous test. This test now works with the fully populated tree, and the previous test now confirms that the whole tree is considered different when it is empty. It's possible you may want to not deploy these improvements in this patch, but it seems a good idea to me whilst it's being modified, and given that I'd made the change already. Since we're not logging the ranges themselves at this time it won't have any direct impact, but it will be useful if that ever changes, and might help with future debugging. 3) I've updated the MerkleTreeTest methods to test the serialization and difference changes, and introduced a new HistogramBuilderTest 4) The histogram is built differently from my first patch, and is described in HistogramBuilder. Basically rather than creating neat linear ranges, I calculate the mean and create ranges that are multiples of the standard deviation either side of the mean, up to min/max (or, in this case, 3 stdevs, plus one range to min/max) 5) One thing we might want to consider changing is the format of the EstimatedHistogram ranges in the log messages. I've reproduced faithfully the boundary conventions of the EstimatedHistogram, but this is not a user friendly convention - it has an exclusive lower bound and inclusive upper bound, as opposed to the typical opposite convention. As such we get ranges like (-1, 0] to represent the range containing only 0, as opposed to [0, 1) Think that's everything. Should respond quickly to queries at the moment, so drop me a line if you have any questions.
          Hide
          Benedict added a comment -

          Thanks Yuki, sounds like my changes are fine assuming they test okay. I have a bit of a furlow between my various trips now (which I had rather optimistically expected to find time to test this on) so should be able to get a patch over in the next couple of days or so.

          Show
          Benedict added a comment - Thanks Yuki, sounds like my changes are fine assuming they test okay. I have a bit of a furlow between my various trips now (which I had rather optimistically expected to find time to test this on) so should be able to get a patch over in the next couple of days or so.
          Hide
          Yuki Morishita added a comment -

          Hi Benedict,

          Sorry for late reply.
          I think calculating number of rows and those size per range part is fine.
          One thing to point out is that we don't need to serialize those and return to the initiator, just log locally like you do is enough for now.

          p.s. ActiveRepairService is broken up to o.a.c.repair package, so be careful when rebasing.

          Show
          Yuki Morishita added a comment - Hi Benedict, Sorry for late reply. I think calculating number of rows and those size per range part is fine. One thing to point out is that we don't need to serialize those and return to the initiator, just log locally like you do is enough for now. p.s. ActiveRepairService is broken up to o.a.c.repair package, so be careful when rebasing.
          Hide
          Benedict added a comment -

          Just noticed that patch doesn't include the HistogramBuilder class - after I posted I noticed I needed to pull in the latest remote changes, which unfortunately use 1.7 syntax, and since I'm on an old eclipse my problems window exploded. It's late here so I didn't/don't want to faff around too much, but if you have trouble let me know and I'll upload another patch after upgrading eclipse. The patch uploaded should be easy to quickly scan with my description to let me know if I'm still barking up the wrong tree, or if there are any changes you disagree with in principle.

          Show
          Benedict added a comment - Just noticed that patch doesn't include the HistogramBuilder class - after I posted I noticed I needed to pull in the latest remote changes, which unfortunately use 1.7 syntax, and since I'm on an old eclipse my problems window exploded. It's late here so I didn't/don't want to faff around too much, but if you have trouble let me know and I'll upload another patch after upgrading eclipse. The patch uploaded should be easy to quickly scan with my description to let me know if I'm still barking up the wrong tree, or if there are any changes you disagree with in principle.
          Hide
          Benedict added a comment -

          Hi Yuki,

          how many rows each merkle tree range account for (and the size that this represent)

          I take it two histograms, one of row distribution, the other of size (of row) distribution, is what you're looking for?

          I've attached a new patch which is not complete, in that I have not tested it and may want to change a few of the final details (such as, possibly, where and what is logged with the histogram), but before I iron out those kinks I wanted to run past the main crux of the changes to see if it's what you're looking for. Simply put, the merkle tree ranges now retain both a sizeOfRange (ie size of rows added) and rowsInRange (ie number of rows added). Merkle tree now exposes two histogramXXX() methods which use these, and which as of now are logged in Validator.complete(). As it stands, I serialise both these new values over the wire with any merkle tree, to ensure no unexpected behaviour for future users of the class, and as such I also retained my TreeDifference changes to the merkle tree, which reports the size and row count of each side of a difference. These latter two changes may be slightly controversial, so want to run them past you, as well as confirm the basic information I'm printing is what you're looking for.

          Cheers!

          Show
          Benedict added a comment - Hi Yuki, how many rows each merkle tree range account for (and the size that this represent) I take it two histograms, one of row distribution, the other of size (of row) distribution, is what you're looking for? I've attached a new patch which is not complete, in that I have not tested it and may want to change a few of the final details (such as, possibly, where and what is logged with the histogram), but before I iron out those kinks I wanted to run past the main crux of the changes to see if it's what you're looking for. Simply put, the merkle tree ranges now retain both a sizeOfRange (ie size of rows added) and rowsInRange (ie number of rows added). Merkle tree now exposes two histogramXXX() methods which use these, and which as of now are logged in Validator.complete(). As it stands, I serialise both these new values over the wire with any merkle tree, to ensure no unexpected behaviour for future users of the class, and as such I also retained my TreeDifference changes to the merkle tree, which reports the size and row count of each side of a difference. These latter two changes may be slightly controversial, so want to run them past you, as well as confirm the basic information I'm printing is what you're looking for. Cheers!
          Hide
          Benedict added a comment -

          Hi Yuki,

          Sorry for the glacial response.

          If you do that, you should create your own class with labels and array since you're not using default offsets nor other histogram related methods. It confused me at first why you are doing addToIndex to EstimatedHistogram.

          Agreed. I was a little reticent to introduce a different histogram class, but it is a little ugly as it stands.

          But looking at this from the begining again, what we want to see is if we have Merkle tree of evenly distributed keys(or rows) in each hash. You can use EstimatedHistogram or your own to show that. For now, just use logger to log that distribution at the end of Merkle Tree creation with corresponding repair session Id is fine, instead of sending stats back to the coordinator.

          It sounds like all you want logged is the number of rows per merkle leaf, to see if the tree is roughly balanced? In which case I misinterpreted the goal entirely, though it makes sense now I understand more how the repair works. Is it worth leaving in the streaming of the leaf sizes with the merkle tree so that the deltas can be logged in future, should that be desired?

          I'll strip out the logging of the size of the ranges being streamed for now as well.

          Show
          Benedict added a comment - Hi Yuki, Sorry for the glacial response. If you do that, you should create your own class with labels and array since you're not using default offsets nor other histogram related methods. It confused me at first why you are doing addToIndex to EstimatedHistogram. Agreed. I was a little reticent to introduce a different histogram class, but it is a little ugly as it stands. But looking at this from the begining again, what we want to see is if we have Merkle tree of evenly distributed keys(or rows) in each hash. You can use EstimatedHistogram or your own to show that. For now, just use logger to log that distribution at the end of Merkle Tree creation with corresponding repair session Id is fine, instead of sending stats back to the coordinator. It sounds like all you want logged is the number of rows per merkle leaf, to see if the tree is roughly balanced? In which case I misinterpreted the goal entirely, though it makes sense now I understand more how the repair works. Is it worth leaving in the streaming of the leaf sizes with the merkle tree so that the deltas can be logged in future, should that be desired? I'll strip out the logging of the size of the ranges being streamed for now as well.
          Hide
          Jonathan Ellis added a comment -

          Jake has an Eclipse formatter at https://github.com/tjake/cassandra-style-eclipse. Not sure if it's 100% accurate but Jake's patches look okay to me so it's probably close.

          Show
          Jonathan Ellis added a comment - Jake has an Eclipse formatter at https://github.com/tjake/cassandra-style-eclipse . Not sure if it's 100% accurate but Jake's patches look okay to me so it's probably close.
          Hide
          Yuki Morishita added a comment -

          Benedict,

          Since I need to reinsert all the records once I've decided this anyway, I need to retain them all, which I chose to do in EstimatedHistogram as they do, ...

          If you do that, you should create your own class with labels and array since you're not using default offsets nor other histogram related methods. It confused me at first why you are doing addToIndex to EstimatedHistogram.

          But looking at this from the begining again, what we want to see is if we have Merkle tree of evenly distributed keys(or rows) in each hash. You can use EstimatedHistogram or your own to show that. For now, just use logger to log that distribution at the end of Merkle Tree creation with corresponding repair session Id is fine, instead of sending stats back to the coordinator.

          For the streaming part, it is hard to distinguish which stream session belongs to certain repair on current code base(we can only see if it is repair related or not by looking at OperationType). So we need to improve that, and I'm working on as part of repair and streaming redesign(CASSANDRA-5426, CASSANDRA-5286). So, let's focus on the former, validation part.

          Do you have an Eclipse formatter profile I could use for your coding convention?

          Sorry, I use intellij, but I think someone on #cassandra-dev on irc has one.

          Show
          Yuki Morishita added a comment - Benedict, Since I need to reinsert all the records once I've decided this anyway, I need to retain them all, which I chose to do in EstimatedHistogram as they do, ... If you do that, you should create your own class with labels and array since you're not using default offsets nor other histogram related methods. It confused me at first why you are doing addToIndex to EstimatedHistogram. But looking at this from the begining again, what we want to see is if we have Merkle tree of evenly distributed keys(or rows) in each hash. You can use EstimatedHistogram or your own to show that. For now, just use logger to log that distribution at the end of Merkle Tree creation with corresponding repair session Id is fine, instead of sending stats back to the coordinator. For the streaming part, it is hard to distinguish which stream session belongs to certain repair on current code base(we can only see if it is repair related or not by looking at OperationType). So we need to improve that, and I'm working on as part of repair and streaming redesign( CASSANDRA-5426 , CASSANDRA-5286 ). So, let's focus on the former, validation part. Do you have an Eclipse formatter profile I could use for your coding convention? Sorry, I use intellij, but I think someone on #cassandra-dev on irc has one.
          Hide
          Benedict added a comment -

          Hi Yuki,

          Without in some way collecting (or at least sampling) the size of the differences, I don't know what bucket sizes to use. Since I need to reinsert all the records once I've decided this anyway, I need to retain them all, which I chose to do in EstimatedHistogram as they do, in effect, constitute a histogram. I also sample the largest records which I figure could be useful for debugging purposes (though that was just a guess). I don't see why 1000s of items is a major issue?

          I agree that logging is suboptimal for this data. Presumably similar data for other tasks may be optionally logged in future, and so I would guess this should form part of a discussion about metric logging?

          fix coding style (especially whitespace) to match other code.

          Do you have an Eclipse formatter profile I could use for your coding convention? I did my best to keep it correct manually, but it is difficult to spot differences in an unfamiliar convention. Whitespace should be comparatively easy though.

          EstimatedHistogram#testGroupBy is failing.

          Noted - will fix and resubmit

          comparator in Arrays#sort in EstimatedHistogram#logSummary has the same conditions in both if and else if.

          Thanks, good spot. I'm surprised Eclipse didn't warn me.

          Show
          Benedict added a comment - Hi Yuki, Without in some way collecting (or at least sampling) the size of the differences, I don't know what bucket sizes to use. Since I need to reinsert all the records once I've decided this anyway, I need to retain them all, which I chose to do in EstimatedHistogram as they do, in effect, constitute a histogram. I also sample the largest records which I figure could be useful for debugging purposes (though that was just a guess). I don't see why 1000s of items is a major issue? I agree that logging is suboptimal for this data. Presumably similar data for other tasks may be optionally logged in future, and so I would guess this should form part of a discussion about metric logging? fix coding style (especially whitespace) to match other code. Do you have an Eclipse formatter profile I could use for your coding convention? I did my best to keep it correct manually, but it is difficult to spot differences in an unfamiliar convention. Whitespace should be comparatively easy though. EstimatedHistogram#testGroupBy is failing. Noted - will fix and resubmit comparator in Arrays#sort in EstimatedHistogram#logSummary has the same conditions in both if and else if. Thanks, good spot. I'm surprised Eclipse didn't warn me.
          Hide
          Yuki Morishita added a comment -

          Benedict

          hmm, it is not clear to me why you create EstimatedHistogram of size of differences. Sometimes I see more than 1000 of differences for large clusters. You should just create it with reasonable bucket count. You don't have to keep every size for every range.

          Using logging for outputting statistic is fine at this point, but I think we should come up with other way so that it is easy to see all the related logs and statistics about certain repair session. I don't have specific idea yet though. (Maybe another system cf similar to Tracing?)

          nits:

          • fix coding style (especially whitespace) to match other code.
          • EstimatedHistogram#testGroupBy is failing.
          • comparator in Arrays#sort in EstimatedHistogram#logSummary has the same conditions in both if and else if.
          Show
          Yuki Morishita added a comment - Benedict hmm, it is not clear to me why you create EstimatedHistogram of size of differences. Sometimes I see more than 1000 of differences for large clusters. You should just create it with reasonable bucket count. You don't have to keep every size for every range. Using logging for outputting statistic is fine at this point, but I think we should come up with other way so that it is easy to see all the related logs and statistics about certain repair session. I don't have specific idea yet though. (Maybe another system cf similar to Tracing?) nits: fix coding style (especially whitespace) to match other code. EstimatedHistogram#testGroupBy is failing. comparator in Arrays#sort in EstimatedHistogram#logSummary has the same conditions in both if and else if.
          Hide
          Benedict added a comment - - edited

          Hi Yuki,

          The patch was created some time ago, and there were some minor renames/changes to MerkleTree and AntiEntropyService in the meantime. I've pulled the latest changes, merged, and regenerated the patch. This is against the main trunk / HEAD branch.

          Show
          Benedict added a comment - - edited Hi Yuki, The patch was created some time ago, and there were some minor renames/changes to MerkleTree and AntiEntropyService in the meantime. I've pulled the latest changes, merged, and regenerated the patch. This is against the main trunk / HEAD branch.
          Hide
          Yuki Morishita added a comment -

          Hi, Benedict,
          Patch is not applicable to any branch. Can you rebase and re-upload?
          When you do, please tell us which branch the patch is for.

          Show
          Yuki Morishita added a comment - Hi, Benedict, Patch is not applicable to any branch. Can you rebase and re-upload? When you do, please tell us which branch the patch is for.
          Hide
          Benedict added a comment -

          Hi,

          I've uploaded a patch for this issue (patch.diff - apologies for the potentially future-clashing name). Logging is performed in two places, both on the source (not requesting) node of any comparison:

          1) On the requesting node in AntiEntropyService.Difference.run(), after the MerkleTree difference is calculated and before the StreamingRepairTask is created
          2) On the source node, on which StreamingRepairTask is run, in StreamOut.createPendingFiles()

          In both cases we log, at debug level, a sample of the largest ranges followed by a histogram of the range size distribution. The first is achieved by inserting each range directly into an EstimatedHistogram, on which we call the new logSummary() method; the second by calling the new groupByFrequency() method on that same histogram, to yield a histogram based on the frequency of sizes present in the original (on which we simply call log()).

          In case 1, we construct the MerkleTree to include a size taken from the AbstractCompactedRow we compute the hash from, and use this in MerkleTree.difference to estimate the size of mismatching ranges. This tends to underestimate, versus that reported by StreamOut, by around 15%. One design decision of note here: instead of modifying AbstractCompactedRow to return a size (which would be invasive and in some cases incur an unnecessary penalty) we use a custom implementation of MessageDigest that counts the number of bytes provided to it.

          Case 2 is much simpler, as we already have the ranges and their sizes available to us.

          There are some other changes, particularly in MerkleTree, with some refactoring/renames/new subclasses as part of updating MerkleTree.difference(). In particular, TreeDifference is returned instead of TreeRange (to accommodate the extra size information), and it is used generally in place of it within this method tree where applicable; hash() and hashHelper() have also been renamed to find() and findHelper(), with a new hash() implementation depending on find(). I'm sure there are other minutiae, but hopefully nothing too opaque. If you need any clarification, feel free to ask.

          Show
          Benedict added a comment - Hi, I've uploaded a patch for this issue (patch.diff - apologies for the potentially future-clashing name). Logging is performed in two places, both on the source (not requesting) node of any comparison: 1) On the requesting node in AntiEntropyService.Difference.run(), after the MerkleTree difference is calculated and before the StreamingRepairTask is created 2) On the source node, on which StreamingRepairTask is run, in StreamOut.createPendingFiles() In both cases we log, at debug level, a sample of the largest ranges followed by a histogram of the range size distribution. The first is achieved by inserting each range directly into an EstimatedHistogram, on which we call the new logSummary() method; the second by calling the new groupByFrequency() method on that same histogram, to yield a histogram based on the frequency of sizes present in the original (on which we simply call log()). In case 1, we construct the MerkleTree to include a size taken from the AbstractCompactedRow we compute the hash from, and use this in MerkleTree.difference to estimate the size of mismatching ranges. This tends to underestimate, versus that reported by StreamOut, by around 15%. One design decision of note here: instead of modifying AbstractCompactedRow to return a size (which would be invasive and in some cases incur an unnecessary penalty) we use a custom implementation of MessageDigest that counts the number of bytes provided to it. Case 2 is much simpler, as we already have the ranges and their sizes available to us. There are some other changes, particularly in MerkleTree, with some refactoring/renames/new subclasses as part of updating MerkleTree.difference(). In particular, TreeDifference is returned instead of TreeRange (to accommodate the extra size information), and it is used generally in place of it within this method tree where applicable; hash() and hashHelper() have also been renamed to find() and findHelper(), with a new hash() implementation depending on find(). I'm sure there are other minutiae, but hopefully nothing too opaque. If you need any clarification, feel free to ask.
          Hide
          Ivan Sobolev added a comment -

          Would be glad if someone could have a look.

          Executed AntiEntropyServiceStandardTest to check the thing looks good. Wondering which of the test base classes would be the best fit to test the whole thing against:

          • SSTables inside/outside of repair range
          • Other data/cases/fragile places you may think about
          Show
          Ivan Sobolev added a comment - Would be glad if someone could have a look. Executed AntiEntropyServiceStandardTest to check the thing looks good. Wondering which of the test base classes would be the best fit to test the whole thing against: SSTables inside/outside of repair range Other data/cases/fragile places you may think about
          Hide
          Ivan Sobolev added a comment -

          Hi, guys, it appears, the thing would be useful for me too. I'm going to give it a try. I'll follow the plan summarized by Jonathan Ellis.
          If you feel that I should know something you know - please shout

          Show
          Ivan Sobolev added a comment - Hi, guys, it appears, the thing would be useful for me too. I'm going to give it a try. I'll follow the plan summarized by Jonathan Ellis . If you feel that I should know something you know - please shout
          Hide
          Jonathan Ellis added a comment -

          Following the comments at the top, you want two things here:

          1. A histogram of TreeRange row counts
          2. for each pair of merkle tree, the number of ranges that differs and the corresponding streamed size of the data

          1. is easy: add an EstimatedHistogram to the MerkleTree class, and when the ranges are finished computing, you'd iterate over each and add its row count to the histogram
          2. is a bit more involved: you want to extend the logging done by Differencer to include the given information, which is going to involve poking into the guts of (probably) MerkleTree.difference and performStreamingRepair.

          I agree though that repair is an intimidating part of the code base. If you want to start with something simpler, that's fine too.

          Show
          Jonathan Ellis added a comment - Following the comments at the top, you want two things here: A histogram of TreeRange row counts for each pair of merkle tree, the number of ranges that differs and the corresponding streamed size of the data 1. is easy: add an EstimatedHistogram to the MerkleTree class, and when the ranges are finished computing, you'd iterate over each and add its row count to the histogram 2. is a bit more involved: you want to extend the logging done by Differencer to include the given information, which is going to involve poking into the guts of (probably) MerkleTree.difference and performStreamingRepair. I agree though that repair is an intimidating part of the code base. If you want to start with something simpler, that's fine too.
          Hide
          Jason Wee added a comment -

          Hello Jonathan, I read through this ticket and trying to understand the context. Unfortunately, at this stage, I am clueless where to start. However, attach is the output of nodetool repair and nodetool cfhistogram which may help in this ticket context imo. Please let me know if the attachment helps or I should probably go to trivial bug first so that I can have more time to understand the code as well as the development culture and procedures here.

          Show
          Jason Wee added a comment - Hello Jonathan, I read through this ticket and trying to understand the context. Unfortunately, at this stage, I am clueless where to start. However, attach is the output of nodetool repair and nodetool cfhistogram which may help in this ticket context imo. Please let me know if the attachment helps or I should probably go to trivial bug first so that I can have more time to understand the code as well as the development culture and procedures here.
          Hide
          Jason Wee added a comment -

          output of nodetool repair on one of the node.

          output of nodetool cfhistogram

          Show
          Jason Wee added a comment - output of nodetool repair on one of the node. output of nodetool cfhistogram
          Hide
          Jonathan Ellis added a comment -

          Hi Jason, you've come to the right place. Fire away.

          Show
          Jonathan Ellis added a comment - Hi Jason, you've come to the right place. Fire away.
          Hide
          Jason Wee added a comment - - edited

          Hello, I've been using cassandra and develop client (hector) application to interfacing (crud) to the cassandra. It has been a great software and I would like to contribute back to cassandra and I've read
          http://wiki.apache.org/cassandra/HowToContribute
          which link me to
          https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+12310865+AND+labels+%3D+lhf+AND+status+!%3D+resolved . Since this is a fresh for me, I'm not sure if this is a right place to start contributing and I hope you can response where and how should I able to contribute.
          Thank you.

          Show
          Jason Wee added a comment - - edited Hello, I've been using cassandra and develop client (hector) application to interfacing (crud) to the cassandra. It has been a great software and I would like to contribute back to cassandra and I've read http://wiki.apache.org/cassandra/HowToContribute which link me to https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+12310865+AND+labels+%3D+lhf+AND+status+!%3D+resolved . Since this is a fresh for me, I'm not sure if this is a right place to start contributing and I hope you can response where and how should I able to contribute. Thank you.
          Hide
          Sylvain Lebresne added a comment -

          An EstimatedHistogram would be just fine. That plus for each pair of merkle tree, the number of ranges that differs and the corresponding streamed size of the data would be a very good start imho.

          I think the only thing we need to figure out for this patch is where it makes the most sense to record that data. What I mean here is that the merkle trees are computed on each node participating in a repair (and thus that is where the EstimatedHistogram can be computed), while the computing of the differences is only done on the coordinator. But on an ideal world, it would seem more useful to store those information together (for a given repair) because they are related.

          Show
          Sylvain Lebresne added a comment - An EstimatedHistogram would be just fine. That plus for each pair of merkle tree, the number of ranges that differs and the corresponding streamed size of the data would be a very good start imho. I think the only thing we need to figure out for this patch is where it makes the most sense to record that data. What I mean here is that the merkle trees are computed on each node participating in a repair (and thus that is where the EstimatedHistogram can be computed), while the computing of the differences is only done on the coordinator. But on an ideal world, it would seem more useful to store those information together (for a given repair) because they are related.
          Hide
          Jonathan Ellis added a comment -

          Would an EstimatedHistogram for the entire tree be good enough or do you want to actually log which ranges had which key counts?

          Show
          Jonathan Ellis added a comment - Would an EstimatedHistogram for the entire tree be good enough or do you want to actually log which ranges had which key counts?

            People

            • Assignee:
              Benedict
              Reporter:
              Sylvain Lebresne
              Reviewer:
              Yuki Morishita
            • Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development