Cassandra / CASSANDRA-2405

Should expose 'time since last successful repair' for easier AES monitoring

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: Core
    • Labels:
      None

      Description

      The practical implementation issues of actually ensuring repair runs are a somewhat undocumented/untreated problem.

      One hopefully low-hanging fruit would be to at least expose the time since last successful repair for a particular column family, to make it easier to write a correct script that monitors for lack of repair in a non-buggy fashion.

      Attachments

      1. CASSANDRA-2405.patch
        5 kB
        Pavel Yaskevich
      2. CASSANDRA-2405-v2.patch
        22 kB
        Pavel Yaskevich
      3. CASSANDRA-2405-v3.patch
        22 kB
        Pavel Yaskevich
      4. CASSANDRA-2405-v4.patch
        27 kB
        Pavel Yaskevich

        Issue Links

          Activity

          Jonathan Ellis added a comment -

          Good idea, let's create an AntiEntropyServiceMBean (under o.a.c.db) and expose it there.

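           For illustration, a minimal sketch of what such an MBean could look like (the interface name comes from the suggestion above; the method name and the choice of package are assumptions, not necessarily what the attached patches implement):

               package org.apache.cassandra.service; // or o.a.c.db, as debated below

               public interface AntiEntropyServiceMBean
               {
                   /**
                    * Milliseconds elapsed since the last successful repair of the given
                    * column family, so a monitoring script can alarm on staleness.
                    */
                   long getTimeSinceLastSuccessfulRepair(String keyspace, String columnFamily);
               }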
          Pavel Yaskevich added a comment -

           I will put it under o.a.c.service instead...

          Jonathan Ellis added a comment -

          I thought we got rid of .service for 0.7

          Pavel Yaskevich added a comment -

          It's still around, StorageService, StorageProxy, AntiEntropyService and others live there...

          Pavel Yaskevich added a comment -

          Should we also share this using nodetool?

          Peter Schuller added a comment -

          IMO yes, since the aim was to make it easy to write a script. Poking JMX in a shell script != easy unless you happen to have crossed that threshold already; nodetool is one way to do that.

           Also: from brief inspection it seems it's storing the time of the completion of the last successful repair. It needs to store the time of the commencement of the last successful repair in order to be useful for monitoring with respect to gc grace seconds, since that is what determines at what point in the timeline the node is "guaranteed" to have seen writes.

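           To make the "poking JMX != easy" point concrete, here is a sketch of what a standalone checker would involve, assuming the hypothetical MBean sketched earlier and the default JMX port; a nodetool subcommand would reduce this to a one-line shell invocation:

               import javax.management.JMX;
               import javax.management.MBeanServerConnection;
               import javax.management.ObjectName;
               import javax.management.remote.JMXConnector;
               import javax.management.remote.JMXConnectorFactory;
               import javax.management.remote.JMXServiceURL;
               // AntiEntropyServiceMBean is the hypothetical interface sketched earlier.
               import org.apache.cassandra.service.AntiEntropyServiceMBean;

               // Usage: RepairAgeCheck <keyspace> <columnfamily> <max-age-millis>
               public class RepairAgeCheck
               {
                   public static void main(String[] args) throws Exception
                   {
                       JMXServiceURL url = new JMXServiceURL(
                           "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
                       JMXConnector jmxc = JMXConnectorFactory.connect(url);
                       try
                       {
                           MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                           // Hypothetical object name; it depends on where the MBean
                           // ends up being registered.
                           ObjectName name = new ObjectName(
                               "org.apache.cassandra.service:type=AntiEntropyService");
                           AntiEntropyServiceMBean aes =
                               JMX.newMBeanProxy(mbs, name, AntiEntropyServiceMBean.class);
                           long age = aes.getTimeSinceLastSuccessfulRepair(args[0], args[1]);
                           // Exit non-zero (alarm) if the last successful repair is older
                           // than the given threshold, e.g. gc_grace_seconds minus a margin.
                           System.exit(age > Long.parseLong(args[2]) ? 1 : 0);
                       }
                       finally
                       {
                           jmxc.close();
                       }
                   }
               }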
          Pavel Yaskevich added a comment -

           I will add a "time [Keyspace] [ColumnFamily, ...]" command to nodetool...

           By commencement do you mean the moment in Differencer.run() where it decides whether we need a repair or not for each CF (line 492 of AES, right after logger.info())?

          Peter Schuller added a comment -

           If I'm reading the code correctly, then no, I mean earlier. Recall that the reason AES is important w.r.t. GC grace seconds is that in order for it to be safe to remove a tombstone for some piece of data, said piece of data must be guaranteed to have become consistent across the cluster up to the point of gc grace period start (at the moment of tombstone removal).

          That essentially boils down to any write that happened prior to the start of the gc grace period must have been propagated, whether it be in the form of a hinted hand-off or by 'nodetool repair'. Since hinted hand-off is only ever an optimization, only nodetool repair is relevant to maintaining the invariant.

          An AES session will only be guaranteed to "see" things that existed in the form of sstables at the point where it started. This presumably means that AES implies that a memtable flush happens (if not, it would be broken I think).

          So that in turn means that the time to record as 'last successful repair' needs to be before the flushing of memtables.

           It should be noted that, of course, for monitoring purposes this isn't about a few milliseconds here and there. So maybe that's enough to fudge the memtable flushing (although I'm not personally comfortable with that either); but the time it takes to do the validating compaction must definitely be counted after the millisecond timestamp, since that can clearly take a lot of time (even days for large CFs).

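           A minimal sketch of the ordering argued for above, with hypothetical helper names standing in for the actual AES internals:

               class RepairTimestampSketch
               {
                   void repairAndRecord(String keyspace, String cf)
                   {
                       long startedAt = System.currentTimeMillis(); // commencement, not completion
                       flushMemtables(keyspace, cf);    // repair only "sees" data already in sstables
                       validateAndRepair(keyspace, cf); // validation compaction can take hours or days
                       // Persist only on success, and persist the *start* time, so the recorded
                       // point is one up to which all earlier writes are guaranteed propagated.
                       storeLastSuccessfulRepairTime(keyspace, cf, startedAt);
                   }

                   void flushMemtables(String ks, String cf) { /* hypothetical stand-in */ }
                   void validateAndRepair(String ks, String cf) { /* hypothetical stand-in */ }
                   void storeLastSuccessfulRepairTime(String ks, String cf, long millis) { /* hypothetical stand-in */ }
               }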
          Peter Schuller added a comment -

          A further complication: Since the intent here is to enable people to set up alarms to trigger whenever the time-since-last is not within an acceptable range, it raises the issue of whether to keep this information persistent in system tables or just in-memory. Keeping in mind that:

           (1) For large amounts of data, doing another round of AES "just in case" after a node restart is a significant cost.
           (2) If the alarm were to trigger on the information not being available, that would instantly lead to false positive alarms when nodes are restarted, rendering alarms useless to operations.
           (3) If the alarm were to ignore the case where the information is not yet available, that is a very dangerous silent failure and effectively means the alarm is not functioning properly.

          ... I get the feeling one wants this information persistent.

          I guess this all makes the ticket non-trivial, but I think the need for an "easy" way for operators to ensure sufficient AES frequency is important.

          (I'm actually kind of surprised issues with this do not crop up more often on the mailing lists... am I missing something that mitigates the impact here, or are people just using sufficiently long grace periods relative to repair frequency that they're not hitting these things in practice?)

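           To make the dichotomy concrete, a sketch of the monitoring side (the "unknown" marker is a hypothetical illustration): with in-memory-only state, every node restart forces the check into one of the two bad branches:

               class RepairAlarmSketch
               {
                   static final long UNKNOWN = -1; // hypothetical "no data available" marker

                   boolean shouldAlarm(long ageMillis, long thresholdMillis)
                   {
                       if (ageMillis == UNKNOWN)
                       {
                           // Alarming here gives a false positive after every restart (case 2);
                           // returning false here is a silent failure if repair never runs (case 3).
                           // Persisting the timestamp makes this branch unreachable in normal operation.
                           return true;
                       }
                       return ageMillis > thresholdMillis;
                   }
               }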
          Pavel Yaskevich added a comment -

           Thanks for your comment, Peter. I agree on both points - the timestamp should be stored right before the validation compaction is started (AES line 618), which flushes data btw, and this information should be persisted into system tables. Another question is open for consideration - what should we return if we don't have any information for a given KS/CF: -1, or throw an exception?

          Peter Schuller added a comment -

          The best solution I can think of is to populate the information on CF creation with the timestamp that represents the time the CF was created on the node. If the node was bootstrapped as usual, that would have happened after the local CF creation. If it was not (e.g. forcefully inserted into the ring), then some operator has explicitly made the choice of entering it into the ring "inconsistently" anyway so it doesn't matter.

          If this is easy to do, I think it would make for a really clean solution from the point of view of the user. The nodetool command would always return valid data except if something is truly broken; not even a single edge case to deal with. Simplicity rocks for this type of thing (for writing a monitoring script to trigger an alarm).

           If that's overkill/non-easy, I dunno - slight preference for throwing an exception, just because I really dislike silent failures, and returning an out-of-band integer seems more likely to go unnoticed if somehow it never changes because repair is never run, for example. I.e., either your monitoring script treats -1 as an error anyway (so it's no worse in terms of triggering the alarm unnecessarily than an exception), or it doesn't - in which case you have a silent failure mode in the case of perpetual lack of repair running.

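           A sketch of the resulting read path under the "seed at CF creation" scheme (the accessor is a hypothetical stand-in): because a baseline timestamp is always written when the CF is created, a missing value can only mean genuine breakage, which argues for throwing:

               class RepairAgeSketch
               {
                   long timeSinceLastSuccessfulRepair(String ks, String cf)
                   {
                       // The row is seeded with the CF creation time, so null here
                       // indicates real breakage rather than "not repaired yet".
                       Long last = getLastSuccessfulRepairTime(ks, cf);
                       if (last == null)
                           throw new IllegalStateException("no repair baseline recorded for " + ks + "/" + cf);
                       return System.currentTimeMillis() - last.longValue();
                   }

                   Long getLastSuccessfulRepairTime(String ks, String cf) { return null; /* hypothetical */ }
               }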
          Pavel Yaskevich added a comment -

           SGTM, I will try to stick with the first scenario, populating data on CF creation.

          Pavel Yaskevich added a comment -

          working branch - trunk, latest commit 1adcd1c52e87a2f59d9006a29fb9476174968a60

          Sylvain Lebresne added a comment -

           This needs rebasing. First, a few small remarks:

           • It seems we store the time in microseconds but then, when computing the time since last repair, we use System.currentTimeMillis() - stored_time.
           • I would be in favor of calling the system table REPAIR_INFO, because the truth is I think it would make sense to record a number of other statistics on repair and it doesn't hurt to make the system table less specific. That also means we should probably not force any type for the value (though that can be easily changed later, so it's not a big deal for this patch).
           • I think we usually put the code to query the system table in SystemTable, so I would move it from AntiEntropy to there.

           Then more generally, a given repair involves multiple states and multiple nodes, so I don't think keeping only one timestamp is enough. Right now, we save the time of the last scheduled validation compaction on each node. With only that we're missing information, so people can't make any reasonably informed decision:

           • First, this does not correspond to the last repair session started on that node, since the validation can be a request from another node. People may be interested in that information.
           • Second, given that repair concerns a given range, keeping only one general number is wrong (it would suggest the node has been repaired recently even when only one range out of 3 or 5 has actually been repaired).
           • Third, though recording the start of the validation compaction is important, this says nothing about the success of the repair (and we all know failures during repair do happen, if only because it's a fairly long operation during which a node can die). So we need to record some info on the success of the operation if we don't want to return misleading information. It turns out this is easy to record on the node coordinating the repair, maybe not so much on the other nodes participating in the repair.

           Truth is, I'm not so sure what the simplest way to handle this is. Maybe one option could be to only register the start and end time of a repair session on the coordinator of the repair (adding the info of which range was repaired).

           Also, what do people think of keeping a history (instead of just keeping the last number)? I'm thinking a little bit ahead here, but what about storing one supercolumn per repair, where the super column name would be the repair session id (a TimeUUID really) and the columns info on that repair? For this patch we would only record the range for that session, the start time and the end time (or maybe one end time for each node). But we would populate this a little bit further with stuff like CASSANDRA-2698. I think having such a history would be fairly interesting.

          Jonathan Ellis added a comment -

           what do people think of keeping a history

          I like that. No reason to throw away such small pieces of potentially useful information.

          Pavel Yaskevich added a comment -

          It seems we store the time in microseconds but then, when computing the time since last repair we use System.currentTimeMillis() - stored_time.

           Please take a look at the storeLastSuccessfulRepairTime method - it stores System.currentTimeMillis(), and timeSinceLastSuccessfulRepair also uses System.currentTimeMillis(), so the units match.

           I would be in favor of calling the system table REPAIR_INFO, because the truth is I think it would make sense to record a number of other statistics on repair and it doesn't hurt to make the system table less specific. That also means we should probably not force any type for the value (though that can be easily changed later, so it's not a big deal for this patch).

          Will do

          I think we usually put the code to query the system table in SystemTable, so I would move it from AntiEntropy to there.

          Will do

           About history: I like the idea of keeping a history for each of the successful repairs. I will check whether it is possible to track the end time of each of the nodes easily, but in any case will do start/end time tracking for the coordinator.

          Jonathan Ellis added a comment -

           I like the idea of keeping a history for each of the successful repairs. I will check whether it is possible to track the end time of each of the nodes easily

          I think each node should only worry about itself here.

          One possible data model: row key is Range repaired, columns are times it completed successfully. We could store extra metadata (start time, ?) as subcolumns. (And I'd just call it system.repairs.)

          Pavel Yaskevich added a comment -

          I concur. I was also thinking about doing it that way - system.repairs[<range>][<successful_completion_time_in_millis>][started_at] = <start_time_in_millis>;

           Pavel Yaskevich added a comment - edited

           I think I will go with this model: key - "{KS}/{CF}" (string), columns are times repair completed successfully, subcolumns are range (string) and started_at (millis). That will make it easy to query time since last successful repair + will give us a chance to store valuable information.

          Pavel Yaskevich added a comment -

           working branch: cassandra-0.8, latest commit 53cf64fc3a191b65b03bd957da0528a08ac9b865

          Sylvain Lebresne added a comment -

           The problem with using the completion time as the (Super)Column name is that you have to wait for the end of the repair to store anything. First, this will not capture started-but-failed sessions (which, while not mandatory, could be nice, especially as soon as we start keeping a bit more info; this could help troubleshooting). Second, it will be a pain to have to keep some of the information until the end (the processingStartedAt is a first sign of this). And third, we may want to keep some info on, say, merkle tree creation on all replicas participating in the repair, even though we only store the completion time on the node initiating the repair.

           So I would propose something like:
           row key: KS/CF
           super column name: repair session name (a TimeUUID)
           columns: the info on the session (range, start and end time, number of ranges repaired, bytes transferred, ...)

           That is roughly the same thing as you propose, but with the super column name being the repair session name.

           Now, because the repair session names are TimeUUIDs (well, right now it is a string including a UUID; we can change it to a simple TimeUUID easily), the sessions will be ordered by creation time. So getting the last successful repair is probably not too hard: just grab the last 1000 created sessions and find the last successful one.
           And if we want, we can even use another specific "index" row that associates 'completion time' -> 'session UUID' (and thanks to the new DynamicCompositeType we can have some rows ordered by TimeUUIDType and some others ordered by LongType without the need for multiple system tables).

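           A sketch of how "time since last successful repair" could be answered under that layout, assuming hypothetical accessors over the proposed REPAIR_INFO rows:

               interface RepairSessionInfo
               {
                   boolean succeeded();
                   long startedAtMillis();
               }

               class RepairHistorySketch
               {
                   long timeSinceLastSuccessfulRepair(String ks, String cf)
                   {
                       // Session ids are TimeUUIDs, so a reversed slice of the row yields
                       // the most recent sessions first; scan until a successful one is found.
                       for (RepairSessionInfo s : lastSessions(ks + "/" + cf, 1000))
                           if (s.succeeded())
                               return System.currentTimeMillis() - s.startedAtMillis();
                       throw new IllegalStateException("no successful repair session recorded");
                   }

                   Iterable<RepairSessionInfo> lastSessions(String rowKey, int count)
                   {
                       return java.util.Collections.<RepairSessionInfo>emptyList(); // hypothetical stand-in
                   }
               }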
          Pavel Yaskevich added a comment -

           We need to decide - are we going with the description of this task, or trying to build logging for repair?

          Sylvain Lebresne added a comment -

           I'm keen on adding persisted stats for repair for CASSANDRA-2698. Recording the start and end time of repair also amounts to persisting stats on repair. Given that, I don't care too much about what the description of this says, but I'm pretty much opposed to doing anything here that would make CASSANDRA-2698 much harder than it needs to be unless there is a good reason, and I don't see one. I'm happy with making this a duplicate or dependency of CASSANDRA-2698 though.

          Pavel Yaskevich added a comment -

          Understood, will keep this in mind.

          Pavel Yaskevich added a comment -

           I'm thinking of using range as the supercolumn and storing session_id and start/completion times as subcolumns; that will give us the possibility to log each of the repairs made in the session individually.

          Sylvain Lebresne added a comment -

           Hum, the thing is that there will be many repair sessions for a given set of KS/CF and range. So you need one of the keys (either the row key or the supercolumn name) to be the session_id (or anything that is unique to a session). If you use a row for each KS/CF pair and one super column for each session, you will have one column for each repair made in a session (or kind of - you will indeed have multiple merkle trees, for instance, one for each replica, but we can easily prefix the column with the replica name if need be).

          Pavel Yaskevich added a comment -

           I suggest we create 2 CFs: one as we have right now, so users can easily query time since last successful repair, and a CF to hold detailed information about each repair session with the following structure: sessionId (key), range (super column), subcolumns: started_at (long), completed_at (long), failed (0 or 1).

          Sylvain Lebresne added a comment -

           Nothing against that, though if we're going to have only a handful of rows in each, it could be more efficient/cleaner to use the DynamicCompositeType instead of creating two different CFs. Though if you absolutely prefer 2 CFs I won't fight against it.

          Pavel Yaskevich added a comment -

          Made two CFs: SuccessfulRepairs (to use with `./bin/nodetool time` command) and RepairDetailedInfo - with the structure from my previous comment.

          Jonathan Ellis added a comment -

          Nick and Tyler pointed out that we want similar information for basically everything that runs on CompactionManager. Maybe we could make the DetailedInfo cf take an arbitrary map from CompactionInfo? (Rename to BackgroundIODetails?)

          Pavel Yaskevich added a comment -

          Separate task?

          Jonathan Ellis added a comment -

          I'd rather not create a CF for this task if we know it's going to be obsolete in a few days.

          Pavel Yaskevich added a comment -

          v3 does not have a detailed info CF so it can be committed...

          Sylvain Lebresne added a comment -

           I'm sorry, but I still think we are returning the wrong number to the user. To be clear, this is nothing against the code of the patch itself; I just think that given the way repair works, it is not so simple to have a "time since last successful repair".

           The "unit" of a repair is a given keyspace, column family and range. Because of that, I don't think we can return a single "time since last successful repair" for a given keyspace and column family. It has to include the range somehow. Granted, so far a nodetool repair repairs all the ranges of the node you launch it on, but I don't think this should remain the case (CASSANDRA-2610). Moreover, even now, one of the ranges can fail without the others. So returning only one number for all ranges is wrong.

           The other problem is: I'm not convinced that recording the information only on the node coordinating the repair is necessarily super helpful. When you start a repair on a node, you will also repair its neighbors (only for the ranges they share), so recording the time only on the initial node to which the nodetool command was connected is arbitrary, and will convey the idea that repair should be started for every range on every node (while I strongly think that the short term goal should be to make it easy to NOT do that - CASSANDRA-2610 again).

           Imho, we should hold back on this issue for now and at least wait for CASSANDRA-2610, CASSANDRA-2606 and CASSANDRA-2816 before committing to anything. I agree that having information to help people plan repair is nice, but it is at most a very minor improvement, and exposing a misleading number is more harmful than no number.

          Pavel Yaskevich added a comment -

          Sounds reasonable, maybe we should close this with "Won't fix" and create a more general one?

          Jonathan Ellis added a comment -

          Let's just mark this as blocked-by 2610 and 2606

          Pavel Yaskevich added a comment -

          Sure.

          Jonathan Ellis added a comment -

          Bumping priority since, as noted in CASSANDRA-3620, we can purge tombstones "early" after a successful repair, which would be a big improvement for delete-heavy workloads. (As a separate ticket.)

          Jonathan Ellis added a comment -

          Created CASSANDRA-5839 to pick this up again. I think there's enough baggage here (discussion of supercolumns, DCT) that it's worth starting fresh.


            People

             • Assignee: Unassigned
             • Reporter: Peter Schuller
             • Reviewer: Sylvain Lebresne
             • Votes: 2
             • Watchers: 8
