# GSoC 2010: Failure Detector Model

## Details

• Type: Wish
• Status: Open
• Priority: Major
• Resolution: Unresolved
• Affects Version/s: None
• Fix Version/s:
• Component/s: None
• Labels:

## Description

Failure Detector Module
Possible Mentor
Henry Robinson (henry at apache dot org)

Requirements
Java, some distributed systems knowledge, comfort implementing distributed systems protocols

Description
ZooKeeper servers detects the failure of other servers and clients by counting the number of 'ticks' for which it doesn't get a heartbeat from other machines. This is the 'timeout' method of failure detection and works very well; however it is possible that it is too aggressive and not easily tuned for some more unusual ZooKeeper installations (such as in a wide-area network, or even in a mobile ad-hoc network).

This project would abstract the notion of failure detection to a dedicated Java module, and implement several failure detectors to compare and contrast their appropriateness for ZooKeeper. For example, Apache Cassandra uses a phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf) which is much more tunable and has some very interesting properties. This is a great project if you are interested in distributed algorithms, or want to help re-factor some of ZooKeeper's internal code.

## Attachments

1. bertier-pseudo.txt
2 kB
Abmar Barros
2. bertier-pseudo.txt
0.7 kB
Abmar Barros
3. chen-pseudo.txt
1 kB
Abmar Barros
4. chen-pseudo.txt
0.4 kB
Abmar Barros
5. phiaccrual-pseudo.txt
2 kB
Abmar Barros
6. phiaccrual-pseudo.txt
0.7 kB
Abmar Barros
7. ZOOKEEPER-702.patch
225 kB
Abmar Barros
8. ZOOKEEPER-702.patch
253 kB
Abmar Barros
9. ZOOKEEPER-702.patch
224 kB
Abmar Barros
10. ZOOKEEPER-702.patch
221 kB
Abmar Barros
11. ZOOKEEPER-702.patch
202 kB
Abmar Barros
12. ZOOKEEPER-702.patch
202 kB
Abmar Barros
13. ZOOKEEPER-702.patch
202 kB
Abmar Barros
14. ZOOKEEPER-702.patch
202 kB
Abmar Barros
15. ZOOKEEPER-702.patch
201 kB
Abmar Barros
16. ZOOKEEPER-702.patch
199 kB
Abmar Barros
17. ZOOKEEPER-702.patch
199 kB
Abmar Barros
18. ZOOKEEPER-702.patch
206 kB
Abmar Barros
19. ZOOKEEPER-702.patch
199 kB
Abmar Barros
20. ZOOKEEPER-702.patch
199 kB
Abmar Barros
21. ZOOKEEPER-702.patch
197 kB
Abmar Barros
22. ZOOKEEPER-702.patch
167 kB
Abmar Barros
23. ZOOKEEPER-702.patch
163 kB
Abmar Barros
24. ZOOKEEPER-702.patch
171 kB
Abmar Barros
25. ZOOKEEPER-702.patch
173 kB
Abmar Barros
26. ZOOKEEPER-702.patch
155 kB
Abmar Barros
27. ZOOKEEPER-702.patch
145 kB
Abmar Barros
28. ZOOKEEPER-702.patch
146 kB
Abmar Barros
29. ZOOKEEPER-702.patch
142 kB
Abmar Barros
30. ZOOKEEPER-702.patch
127 kB
Abmar Barros
31. ZOOKEEPER-702.patch
88 kB
Abmar Barros
32. ZOOKEEPER-702.patch
53 kB
Abmar Barros
33. ZOOKEEPER-702.patch
16 kB
Abmar Barros
34. ZOOKEEPER-702-code.patch
155 kB
Abmar Barros
35. ZOOKEEPER-702-doc.patch
16 kB
Abmar Barros

 1 Failure Detector Model: Refactor server to server monitoring Open 2 Failure Detector Model: Write Forrest docs Open 3 Failure Detector Model: Evaluate QoS metrics Open

## Activity

Hide
Sergey Doroshenko added a comment -

Submitted GSoC application for this. Looking forward to feedback.
Btw, link to HDY04 is broken, found this work only with google.

Show
Sergey Doroshenko added a comment - Submitted GSoC application for this. Looking forward to feedback. Btw, link to HDY04 is broken, found this work only with google.
Hide
Abmar Barros added a comment -

Also submitted GSoC project proposal for this.

Show
Abmar Barros added a comment - Also submitted GSoC project proposal for this.
Hide
Patrick Hunt added a comment -

Henry, I'm not seeing the applications on the gsoc site. Can you add me as appropriate? So that I can see the applications I mean.

Show
Patrick Hunt added a comment - Henry, I'm not seeing the applications on the gsoc site. Can you add me as appropriate? So that I can see the applications I mean.
Hide
Abmar Barros added a comment -

I have updated the wiki page for this (http://wiki.apache.org/hadoop/ZooKeeper/GSoCFailureDetector), including a discussion on whether there should be another thread for failure detection.
I have previously discussed it with Flavio in order to raise benefits and drawbacks of including another thread in the application and we are expecting feedback to make this decision.

Show
Abmar Barros added a comment - I have updated the wiki page for this ( http://wiki.apache.org/hadoop/ZooKeeper/GSoCFailureDetector ), including a discussion on whether there should be another thread for failure detection. I have previously discussed it with Flavio in order to raise benefits and drawbacks of including another thread in the application and we are expecting feedback to make this decision.
Hide
Abmar Barros added a comment -

This patch proposes a first version of the FailureDetector interface. Its implementations are intended to run in the same thread of the application. Also, I have implemented a simple failure detector based on the ClientCnxn class and adapted this class to use this FD implementation.
This is a preliminary patch, just for analysis and feedback, it is not intended to be committed.

Show
Abmar Barros added a comment - This patch proposes a first version of the FailureDetector interface. Its implementations are intended to run in the same thread of the application. Also, I have implemented a simple failure detector based on the ClientCnxn class and adapted this class to use this FD implementation. This is a preliminary patch, just for analysis and feedback, it is not intended to be committed.
Hide
Flavio Junqueira added a comment -

Hi Abmar, I was wondering about FailureDetectorStatus. It might be cleaner to separate getStatus into two separate calls, each call for each of the data structures you're currently returning on FailureDetectorStatus.

I was also thinking that a shorter package name might be better, and it might be also better to have it under zookeeper. I was thinking about something along the lines of: "org.apache.zookeeper.fd" or "org.apache.zookeeper.fdetector".

Show
Flavio Junqueira added a comment - Hi Abmar, I was wondering about FailureDetectorStatus. It might be cleaner to separate getStatus into two separate calls, each call for each of the data structures you're currently returning on FailureDetectorStatus. I was also thinking that a shorter package name might be better, and it might be also better to have it under zookeeper. I was thinking about something along the lines of: "org.apache.zookeeper.fd" or "org.apache.zookeeper.fdetector".
Hide
Abmar Barros added a comment -

Attached the pseudo-codes for the failure detectors I have proposed. It seems I need to include the heartbeat sequence number in the receiveHeartbeat method of the FailureDetector interface in order to implement some of these FDs.

Show
Abmar Barros added a comment - Attached the pseudo-codes for the failure detectors I have proposed. It seems I need to include the heartbeat sequence number in the receiveHeartbeat method of the FailureDetector interface in order to implement some of these FDs.
Hide
Abmar Barros added a comment -

Once ZooKeeper guarantees delivery order, I can ignore the seq field in heartbeat and pings messages.
Now I am implementing the methods on the interface I proposed and two issues came to mind:

• ZooKeeper uses application messages as hearbeats, so actual pings and heartbeats are used only at idle states. The proposed methods must be adapted to consider this scenario.
• Regarding that a single thread is being used for the app and the FD, there is a delay between the time a monitored must be pinged and the time it is actually pinged. The impact of this delay in the FD QoS must be analyzed.
Show
Abmar Barros added a comment - Once ZooKeeper guarantees delivery order, I can ignore the seq field in heartbeat and pings messages. Now I am implementing the methods on the interface I proposed and two issues came to mind: ZooKeeper uses application messages as hearbeats, so actual pings and heartbeats are used only at idle states. The proposed methods must be adapted to consider this scenario. Regarding that a single thread is being used for the app and the FD, there is a delay between the time a monitored must be pinged and the time it is actually pinged. The impact of this delay in the FD QoS must be analyzed.
Hide
Flavio Junqueira added a comment -

Abmar, It would be nice if you could add some comments to the pseudo-code of each FD, especially for the variables and parameters, and also provide the corresponding references.

On your first point above, it would be nice if we could keep using zookeeper messages as heartbeats.

On your second point, I believe that the problem you're alluding to will exist in any implementation, since scheduling a message to be sent doesn't guarantee that it will be sent right away. Is there any different issue you're seeing here that I'm overlooking?

Show
Flavio Junqueira added a comment - Abmar, It would be nice if you could add some comments to the pseudo-code of each FD, especially for the variables and parameters, and also provide the corresponding references. On your first point above, it would be nice if we could keep using zookeeper messages as heartbeats. On your second point, I believe that the problem you're alluding to will exist in any implementation, since scheduling a message to be sent doesn't guarantee that it will be sent right away. Is there any different issue you're seeing here that I'm overlooking?
Hide
Abmar Barros added a comment -

Agreed with what Flavio said about the second point. The application scheduling interval is, indeed, much lower than the FD pinging interval.

Attached the implementations of all initially suggested FDs and their unit tests. Also have included the suggestions Flavio gave concerning package naming and method scope.

Once the Phi Accrual implementation needs to compute the Normal Distribution cdf, I have used the math-commons API, however, it still has to be added as an Ivy dependency.

So, before experimenting, there are a few steps that need to be accomplished:

• Create a receiveAppHeartbeat() method on the interface so that we can keep using zookeeper messages as heartbeats and analyze its impact and adapt it to each FD implementation
• Adapt server side code to use the proposed FD interface
• Expand unit tests
• Enhance code documentation
• Add math-commons dependency to Ivy
Show
Hide
Flavio Junqueira added a comment -

3- I was wondering what the current plan is to configure which failure detector to use and the parameters of the failure detector. In the current patch, it looks like you're setting it directly in the code.

Show
Flavio Junqueira added a comment - Three very quick comments, Abmar: 1- Please don't forget to add license headers to each one of the new files; 2- Please add javadocs; 3- I was wondering what the current plan is to configure which failure detector to use and the parameters of the failure detector. In the current patch, it looks like you're setting it directly in the code.
Hide
Abmar Barros added a comment -

Sure Flavio, so it will take another step:

• Make failure detector method and parameters configurable by the user.
Show
Abmar Barros added a comment - Sure Flavio, so it will take another step: Make failure detector method and parameters configurable by the user.
Hide
Abmar Barros added a comment -

As discussed with Flavio, I am thinking of adding a new parameter called -fd to the client, which will determine the name of the failure detector to be used. In this sense, the parameters of the failure detector will have its name as prefix, for instance:

-fd chen -fd.chen.alpha 10000

In the server side, these parameters are intended to be defined in the ZooKeeper configuration file.
What do you think?

Show
Abmar Barros added a comment - As discussed with Flavio, I am thinking of adding a new parameter called -fd to the client, which will determine the name of the failure detector to be used. In this sense, the parameters of the failure detector will have its name as prefix, for instance: -fd chen -fd.chen.alpha 10000 In the server side, these parameters are intended to be defined in the ZooKeeper configuration file. What do you think?
Hide
Abmar Barros added a comment -

Show
Hide
Abmar Barros added a comment -

Patch changes are:

• Created appMessageReceived() and appMessageSent() methods. These methods allow the Failure Detector to use application messages as heartbeats, which reflects the ZooKeeper case.
• Added command line options to configure failure detector method and its parameters.
• Unit tests expanded to comply with new methods.
• Enhanced javadocs for each Failure Detector implementation
• Added commons-math dependency to ivy.xml

Next things to work on are:

• Adapt server side code to use the proposed FD interface
• Start experimenting
Show
Abmar Barros added a comment - Patch changes are: Created appMessageReceived() and appMessageSent() methods. These methods allow the Failure Detector to use application messages as heartbeats, which reflects the ZooKeeper case. Added command line options to configure failure detector method and its parameters. Unit tests expanded to comply with new methods. Enhanced javadocs for each Failure Detector implementation Added commons-math dependency to ivy.xml Next things to work on are: Adapt server side code to use the proposed FD interface Start experimenting
Hide
Abmar Barros added a comment -

While refactoring the server side of the server-client monitoring, more specifically the SessionTracking related classes, I realized that, once the servers do not ping the clients, and they also use app messages received from the clients as heartbeats, it is not possible to do an adaptive monitoring of the clients. Servers do not know the time a client sends a ping and pings are not sent frequently – only at idle intervals.

I am finishing refactoring the server side code and I will force the use of a fixed heartbeat strategy for now.

Show
Abmar Barros added a comment - While refactoring the server side of the server-client monitoring, more specifically the SessionTracking related classes, I realized that, once the servers do not ping the clients, and they also use app messages received from the clients as heartbeats, it is not possible to do an adaptive monitoring of the clients. Servers do not know the time a client sends a ping and pings are not sent frequently – only at idle intervals. I am finishing refactoring the server side code and I will force the use of a fixed heartbeat strategy for now. Comments and suggestions are welcome
Hide
Flavio Junqueira added a comment -

Abmar, I'm not sure I understand what you're saying. What exactly prevents you from performing an adaptive monitoring of the clients? If you can use regular intervals, then I would assume that you can also use variable intervals. Is this a problem with our current design or the one we are proposing for this jira?

Show
Flavio Junqueira added a comment - Abmar, I'm not sure I understand what you're saying. What exactly prevents you from performing an adaptive monitoring of the clients? If you can use regular intervals, then I would assume that you can also use variable intervals. Is this a problem with our current design or the one we are proposing for this jira?
Hide
Abmar Barros added a comment -

Hi Flavio. What I was trying to say is that the algorithms (Chen, Bertier) assumes that heartbeats are sent in regular intervals. Once this is not the ZooKeeper case, I tried to adapt these algorithms to estimate the time between a ping is sent and a corresponding heartbeat is received. This worked well for the client side, but it is not appropriate for the server side.

After thinking about this, I decided to adopt an estimation of the inter arrival times between heartbeats. I guess it fits both server and client sides of the solution. What do you think?

Show
Abmar Barros added a comment - Hi Flavio. What I was trying to say is that the algorithms (Chen, Bertier) assumes that heartbeats are sent in regular intervals. Once this is not the ZooKeeper case, I tried to adapt these algorithms to estimate the time between a ping is sent and a corresponding heartbeat is received. This worked well for the client side, but it is not appropriate for the server side. After thinking about this, I decided to adopt an estimation of the inter arrival times between heartbeats. I guess it fits both server and client sides of the solution. What do you think?
Hide
Ashwin Jayaprakash added a comment -

The Phi accrual detector code is also available as a simple library - http://github.com/arosien/failure. Fairly simple if you use Commons Math.

Show
Ashwin Jayaprakash added a comment - The Phi accrual detector code is also available as a simple library - http://github.com/arosien/failure . Fairly simple if you use Commons Math.
Hide
Abmar Barros added a comment -

Ashwin, I have already taken a look at this library, as suggested by Adam, and my implementation for the Phi Accrual is pretty similar, it also uses the Commons Math. However it uses the Normal distribution, as stated in the original paper, instead of the Exponential one. What do you think about making the samples distribution a parameter of the Failure Detector?

Show
Abmar Barros added a comment - Ashwin, I have already taken a look at this library, as suggested by Adam, and my implementation for the Phi Accrual is pretty similar, it also uses the Commons Math. However it uses the Normal distribution, as stated in the original paper, instead of the Exponential one. What do you think about making the samples distribution a parameter of the Failure Detector?
Hide
Flavio Junqueira added a comment -

Here are a couple of comments:

1. It sounds like we will have to sample the heartbeats instead of considering all of them. If we use all messages as heartbeats, then we may end up in a situation in which we have bursts of traffic interleaved with periods of silence. In such cases, the failure detector might get confused for some time when it transitions from burst to silence.Using the terminology of Chen et al., I was thinking that we could take \eta and only consider one heartbeat for every period determined by \eta. In ZooKeeper today, we have such \eta for both client-side detection and server-side detection. Alternatively, we could simply work with the phi accrual detector on the server side for now, make sure it works, and then revisit the other two.
2. The choice of an exponential distribution in the math commons implementation is curious. The original paper does assume a normal distribution, so I wonder why it uses exponential instead.
Show
Flavio Junqueira added a comment - Here are a couple of comments: It sounds like we will have to sample the heartbeats instead of considering all of them. If we use all messages as heartbeats, then we may end up in a situation in which we have bursts of traffic interleaved with periods of silence. In such cases, the failure detector might get confused for some time when it transitions from burst to silence.Using the terminology of Chen et al., I was thinking that we could take \eta and only consider one heartbeat for every period determined by \eta. In ZooKeeper today, we have such \eta for both client-side detection and server-side detection. Alternatively, we could simply work with the phi accrual detector on the server side for now, make sure it works, and then revisit the other two. The choice of an exponential distribution in the math commons implementation is curious. The original paper does assume a normal distribution, so I wonder why it uses exponential instead.
Hide
Flavio Junqueira added a comment -

Show
Hide
Abmar Barros added a comment -

Hi Flavio, actually I was talking about the client monitoring on the server-side (SessionTracking).

Show
Abmar Barros added a comment - Hi Flavio, actually I was talking about the client monitoring on the server-side (SessionTracking).
Hide
Abmar Barros added a comment -

In this patch:

• Adapted failure detectors to work on both client and server sides of the client-server monitoring.
• Refactored server-side code to use the FailureDetector interface.
• Created a new FailureDetector, which groups monitoreds by their tick time, similar to what the SessionTracking does.
• Made the failure detector (and its parameters) options of the ZooKeeper configuration file.
• Corrected indentation (spaces instead of tabs).

Next steps are:

• As suggested by Flavio, create sub tasks in this Jira for:
• Experimentation.
• Refactor server-server monitoring to use FailureDetector interface and make it configurable.
• Forrest documentation.
Show
Abmar Barros added a comment - In this patch: Adapted failure detectors to work on both client and server sides of the client-server monitoring. Refactored server-side code to use the FailureDetector interface. Created a new FailureDetector, which groups monitoreds by their tick time, similar to what the SessionTracking does. Made the failure detector (and its parameters) options of the ZooKeeper configuration file. Corrected indentation (spaces instead of tabs). Next steps are: As suggested by Flavio, create sub tasks in this Jira for: Experimentation. Refactor server-server monitoring to use FailureDetector interface and make it configurable. Forrest documentation.
Hide
Abmar Barros added a comment -

As I was refactoring server-to-server monitoring, I noticed that Leaders cannot process the client session heartbeats received by Learners, once they only track the last heartbeat (or app message) received from the clients, in order to report to the Leader (LearnerSessionTracker).

I am thinking of tracking the heartbeats (not app messages) received from the clients in the LearnerSessionTracker, so the Learners can send them to the Leader, and the Leader itself can do an adaptive monitoring on these clients too.

What do you think?

Show
Abmar Barros added a comment - As I was refactoring server-to-server monitoring, I noticed that Leaders cannot process the client session heartbeats received by Learners, once they only track the last heartbeat (or app message) received from the clients, in order to report to the Leader (LearnerSessionTracker). I am thinking of tracking the heartbeats (not app messages) received from the clients in the LearnerSessionTracker, so the Learners can send them to the Leader, and the Leader itself can do an adaptive monitoring on these clients too. What do you think?
Hide

Hi!
There is paper by Satzger et al. where it is described an easy method to use non-periodic application messages as periodic heartbeats. The paper is called "A Lazy Monitoring Approach for Heartbeat-Style Failure Detectors".

I hope that is still relevant.

Cheers, Diogo

Show
Diogo added a comment - Hi! There is paper by Satzger et al. where it is described an easy method to use non-periodic application messages as periodic heartbeats. The paper is called "A Lazy Monitoring Approach for Heartbeat-Style Failure Detectors". I hope that is still relevant. Cheers, Diogo
Hide
Abmar Barros added a comment -

Hi Diogo! Thank you for indicating this paper, I haven't found such failure detection type so far, it is very interesting.

It proposes a simple way of estimating heartbeats arrival times based on application messages. However it does require the attachment of sending times to all application messages (or at least the ones it will use to do the estimation), which is an overhead to message size. Anyway, with the separate failure detector module, it would be easy to implement a new Failure Detector that uses such data.

So far, I have adapted the proposed failure detectors in order to compute the estimated next arrival time only when a heartbeat is received.

Show
Abmar Barros added a comment - Hi Diogo! Thank you for indicating this paper, I haven't found such failure detection type so far, it is very interesting. It proposes a simple way of estimating heartbeats arrival times based on application messages. However it does require the attachment of sending times to all application messages (or at least the ones it will use to do the estimation), which is an overhead to message size. Anyway, with the separate failure detector module, it would be easy to implement a new Failure Detector that uses such data. So far, I have adapted the proposed failure detectors in order to compute the estimated next arrival time only when a heartbeat is received.
Hide
Abmar Barros added a comment -

In this patch:

• Learners report clients heartbeat sample information (mean and standard deviation) to Leaders.
Show
Abmar Barros added a comment - In this patch: Learners report clients heartbeat sample information (mean and standard deviation) to Leaders.
Hide
Benjamin Reed added a comment -

hey abmar this looks pretty cool. i think it will make the code much more clearer when we get this in.

i'm wondering about the asymmetrical nature of the current zk fd that you've already mentioned: one side is generating traffic and the other is responding. shouldn't that be reflected in the interface? it is a bit confusing when parts of the interface are used on one side and parts on the other.

on a separate note, i'm wondering if our use of TCP affects the failure detector. since TCP automatically handles lost packets, we aren't going to lose packets like we do with UDP based protocols.

Show
Benjamin Reed added a comment - hey abmar this looks pretty cool. i think it will make the code much more clearer when we get this in. i'm wondering about the asymmetrical nature of the current zk fd that you've already mentioned: one side is generating traffic and the other is responding. shouldn't that be reflected in the interface? it is a bit confusing when parts of the interface are used on one side and parts on the other. on a separate note, i'm wondering if our use of TCP affects the failure detector. since TCP automatically handles lost packets, we aren't going to lose packets like we do with UDP based protocols.
Hide
Abmar Barros added a comment -

Hi Benjamin

Regarding your first question, a failure detector is intended to determine when monitored objects are to be pinged and when they are supposed to have failed. So, you can have two different failure detectors running in two sides of the application, or you can have the failure detector in just one side, and the other side just replies the pings messages.
Summarizing, a failure detector implementation is absolutely responsible for one side of the monitoring.

Concerning your note, the failure detectors I have implemented rely on the assumption that zk (TCP) guarantees massage delivery and in order. To see the effects of the new failure detection methods, we are planning to do some experiments in the next week. We can also check if there is any effect of TCP in the adaptive fds behavior.

Show
Abmar Barros added a comment - Hi Benjamin Thanks for your feedback! Regarding your first question, a failure detector is intended to determine when monitored objects are to be pinged and when they are supposed to have failed. So, you can have two different failure detectors running in two sides of the application, or you can have the failure detector in just one side, and the other side just replies the pings messages. Summarizing, a failure detector implementation is absolutely responsible for one side of the monitoring. Concerning your note, the failure detectors I have implemented rely on the assumption that zk (TCP) guarantees massage delivery and in order. To see the effects of the new failure detection methods, we are planning to do some experiments in the next week. We can also check if there is any effect of TCP in the adaptive fds behavior.
Hide
Abmar Barros added a comment -

In this patch:

• Refactored SessionTracker and Heartbeat tracking on learners.
• Fixed an issue concerning server recovery.

Patches of the subtasks use this patch as a baseline.

Show
Abmar Barros added a comment - In this patch: Refactored SessionTracker and Heartbeat tracking on learners. Fixed an issue concerning server recovery. Patches of the subtasks use this patch as a baseline.
Hide
Flavio Junqueira added a comment -

My understanding is that the failure detector is supposed to behave as an oracle that simply says dead or alive for each monitored process. The application on top, in this case ZooKeeper, decides what to do with that information. Followers and leaders will most likely use it differently.

Your point about TCP is excellent, Ben.The way I see it the lost packets won't be observed for the failure detector, and they will end up translated into jitter due to the retransmissions, assuming they are retransmitted. Is this right, Abmar?

Show
Flavio Junqueira added a comment - My understanding is that the failure detector is supposed to behave as an oracle that simply says dead or alive for each monitored process. The application on top, in this case ZooKeeper, decides what to do with that information. Followers and leaders will most likely use it differently. Your point about TCP is excellent, Ben.The way I see it the lost packets won't be observed for the failure detector, and they will end up translated into jitter due to the retransmissions, assuming they are retransmitted. Is this right, Abmar?
Hide
Abmar Barros added a comment -

Sure Flavio, you are right.

When I said that failure detectors rely on the assumption that TCP guarantees massage delivery and in order, I meant that the fds are not concerned about these issues. I have not implemented any ordering (or loss) treatment from the original algorithms, which do not rely on any protocol.

Again, thanks for the feedback Benjamin.
Doubts and suggestions are always welcome.

Show
Abmar Barros added a comment - Sure Flavio, you are right. When I said that failure detectors rely on the assumption that TCP guarantees massage delivery and in order, I meant that the fds are not concerned about these issues. I have not implemented any ordering (or loss) treatment from the original algorithms, which do not rely on any protocol. Again, thanks for the feedback Benjamin. Doubts and suggestions are always welcome.
Hide
Abmar Barros added a comment -

In this patch I removed some tests modifications that were not intended to be committed.

Show
Abmar Barros added a comment - In this patch I removed some tests modifications that were not intended to be committed.
Hide
Abmar Barros added a comment -
• Synchronized patch with updates from trunk
• Put all changes from subtasks together in the this patch
Show
Abmar Barros added a comment - Synchronized patch with updates from trunk Put all changes from subtasks together in the this patch
Hide
Abmar Barros added a comment -

I was thinking of adding a new parameter to the adaptive methods, that is the minimum window size to begin performing timeout adaptation. As we noticed, adaptive methods can get benefit of a large window size, and there can be premature false suspicion if we do not know the behaviour of the network. In this case, the methods will start using a static timeout until the window reaches a certain size. What do you think?

Show
Abmar Barros added a comment - I was thinking of adding a new parameter to the adaptive methods, that is the minimum window size to begin performing timeout adaptation. As we noticed, adaptive methods can get benefit of a large window size, and there can be premature false suspicion if we do not know the behaviour of the network. In this case, the methods will start using a static timeout until the window reaches a certain size. What do you think?
Hide
Abmar Barros added a comment -

In this patch:

• Modified some initialization parameters of the algorithms
Show
Abmar Barros added a comment - In this patch: Added forrest docs Modified some initialization parameters of the algorithms
Hide
Flavio Junqueira added a comment -

1. Unit tests are not running properly for me. It is running very slowly and I'm not sure what's causing it. It looks like you've been running on Windows. It would be good to also test on a Unix platform. I'm on MacOS 10.5.6, Java(TM) SE Runtime Environment (build 1.6.0_20-b02-279-9M3165);
2. We don't have changes for the C client, right? We need make sure that the documentation says that the C client must currently use the default detector and open a jira to fix it;
3. Please consider splitting the patch into multiple files to make it easier on other folks to review your work.
Show
Flavio Junqueira added a comment - Abmar, A few comments: Unit tests are not running properly for me. It is running very slowly and I'm not sure what's causing it. It looks like you've been running on Windows. It would be good to also test on a Unix platform. I'm on MacOS 10.5.6, Java(TM) SE Runtime Environment (build 1.6.0_20-b02-279-9M3165); We don't have changes for the C client, right? We need make sure that the documentation says that the C client must currently use the default detector and open a jira to fix it; Please consider splitting the patch into multiple files to make it easier on other folks to review your work.
Hide
Abmar Barros added a comment -

In this patch:

• Fixed a delay in client heartbeat sending, that was causing unit tests to run slowly.
• Added a new parameter to phi accrual FD - the minimum window size to begin timeout adaptation.
• Added a moderation step to Bertier FD, to comply with second level task described in his paper.
• Made Chen's alpha parameter configurable, and not a quarter of the timeout.
• Removed ^M characters from the patch lines.

I also have splitted the patch in two parts: one containing the failure detector module source code (ZOOKEEPER-702-code.patch) and another containing the documentation (ZOOKEEPER-702-doc.patch).

Show
Abmar Barros added a comment - In this patch: Fixed a delay in client heartbeat sending, that was causing unit tests to run slowly. Added a new parameter to phi accrual FD - the minimum window size to begin timeout adaptation. Added a moderation step to Bertier FD, to comply with second level task described in his paper. Made Chen's alpha parameter configurable, and not a quarter of the timeout. Removed ^M characters from the patch lines. I also have splitted the patch in two parts: one containing the failure detector module source code ( ZOOKEEPER-702 -code.patch) and another containing the documentation ( ZOOKEEPER-702 -doc.patch).
Hide
Abmar Barros added a comment -

I have updated the wiki page with the results of the experiments:

Show
Abmar Barros added a comment - I have updated the wiki page with the results of the experiments: http://wiki.apache.org/hadoop/ZooKeeper/GSoCFailureDetector
Hide
Ivan Kelly added a comment -

I've just attempted to apply ZOOKEEPER-702.patch against trunk and it doesn't works. The patch seems to be a delta against a previous patch. Could you generate a patch of everything against current trunk?

Show
Ivan Kelly added a comment - I've just attempted to apply ZOOKEEPER-702 .patch against trunk and it doesn't works. The patch seems to be a delta against a previous patch. Could you generate a patch of everything against current trunk?
Hide
Ivan Kelly added a comment -

Show
Hide
Flavio Junqueira added a comment -

Great job so far, Abmar. I'm thoroughly reviewing the patch, and I'm not done yet, but I have I few comments that you could consider addressing:

1. Please revise javadocs. Many of them have typos and incorrect information, e.g., the phi accrual javadocs refer to the Bertier fd;
2. In ZooKeeperMain.java (-259,9 +279,21), you should perhaps have a more informative message when throwing an IllegalArgumentException. You could, for example, list the current fd options available;
3. The tests included do not exercise the failure detectors with a running instance of the system. Instead they only check a few invariants of the failure detectors. We need to test that we can run an ensemble of servers with each of the failure detector implementations. You could use (or extend if necessary) QuorumBase or QuorumUtil to run an ensemble with each of the failure detectors. You could then kill the leader while executing operations and verify that eventually a new leader emerges and regular operation is resumed.
Show
Hide
Ivan Kelly added a comment -

Hi Abmar,

The code looks good. Excellent job! I have a few comments about the API.

API:
Ping and Heartbeat are used interchangably.

There are heartbeatReceived & appMessageReceived and pingSent & appMessageSent. In the actual implementations these do the same or very similar things. It would make sense to compress them into one api, and have the type of message received/sent specified by a parameters. Whats more, if a new type of message comes along, the failure detectors will already handle it without each needing to have their code updated.

void messageReceived(String id, long now, MessageType type);
void messageSent(String id, long now, MessageType type);

The getFailedObjects and getObjectsToPing are used to return lists, which in all but one case have contains called on them once and are then discarded. I think these would be better as:

bool isFailed(String id, long now)
bool shouldPing(String id, long now)

So that:
a) You don't construct a list which is going to be discarded very soon.
b) The internals of the FailureDetector are better hidden from the user of the API.

In fact, shouldPing could be discarded and getTimeToNextPing(String id, long now) == 0 used.

release should be releaseMonitored to make it symmetrical with registerMonitored.

General:
The failure detection factory could use newInstance to create arbitary failure detectors. So after if (fdName.equals("slicedhb")) fails, you could check if fdName is a fully qualified class name, and if so find the class and create a newInstance. This way its possible to add a failure detector without having to modify the zookeeper code.

There's a lot of duplication in the javadoc for the Zookeeper class. A lot of this could be eliminated by using the @see directive.

FailureDetector.java L131, typo, pint instead of ping.

Show
Hide
Abmar Barros added a comment -

Hi Flavio and Ivan.

Thanks for reviewing the patch and posting feedback.
Ivan, totally agreed with your suggestions, I have already been thinking about some of the issues you have raised also.
I will address the issues you have mentioned and will update the patch asap.

By the way, I would like to thank the ZooKeeper community for such opportunity, it was a very enriching experience for me, not only as a developer but also as a researcher. And I plan to keep on contributing to ZK community – as I have talked to Flavio previously, next steps include some more experimentation on the failure detection methods to better characterize them, and writing some guidelines regarding their usage and their appropriateness for certain scenarios/deployments.

Show
Abmar Barros added a comment - Hi Flavio and Ivan. Thanks for reviewing the patch and posting feedback. Ivan, totally agreed with your suggestions, I have already been thinking about some of the issues you have raised also. I will address the issues you have mentioned and will update the patch asap. By the way, I would like to thank the ZooKeeper community for such opportunity, it was a very enriching experience for me, not only as a developer but also as a researcher. And I plan to keep on contributing to ZK community – as I have talked to Flavio previously, next steps include some more experimentation on the failure detection methods to better characterize them, and writing some guidelines regarding their usage and their appropriateness for certain scenarios/deployments.
Hide
Flavio Junqueira added a comment -

Thanks for the comments, Ivan. I particularly agree with your point on building a list every time. I was actually wondering if there is any other important performance penalty we might be introducing. I can't think of any at the moment, but it be good to keep it in mind.

The use of hbReceived and appMsgReceived does not bother me because they make more explicit what they do, but I agree it is more flexible the way you propose.

Abmar, I went back to the papers and contrasted your implementation against the proposed algorithms. The only issue I could find is that the Bertier approach computes EA differently when j < n. I suggest we follow their description. The others seem fine to me.

Show
Flavio Junqueira added a comment - Thanks for the comments, Ivan. I particularly agree with your point on building a list every time. I was actually wondering if there is any other important performance penalty we might be introducing. I can't think of any at the moment, but it be good to keep it in mind. The use of hbReceived and appMsgReceived does not bother me because they make more explicit what they do, but I agree it is more flexible the way you propose. Abmar, I went back to the papers and contrasted your implementation against the proposed algorithms. The only issue I could find is that the Bertier approach computes EA differently when j < n. I suggest we follow their description. The others seem fine to me.
Hide
Abmar Barros added a comment -

Hi Flavio,

As we are using inter arrival times and a flexible sampling window, it can work the same way of Chen's estimation in both cases (j >= n or j < n).

Show
Abmar Barros added a comment - Hi Flavio, As we are using inter arrival times and a flexible sampling window, it can work the same way of Chen's estimation in both cases (j >= n or j < n).
Hide
Abmar Barros added a comment -

I this patch I have addressed all the issues Ivan and Flavio raised, but the reflection code on the FaiilureDetectorFactory, in which I am working.

What do you think is the most user-friendly way of configuring the failure detector? (using reflection on the FailureDetectorFactory).

Should we require the fully classified class name in the fd.classname option and the parameters should be like fd.param1, fd.param2...? In this sense, should we force the failure detectors classes to have methods like setParam1(String param1), setParam2(String param2)...?

Show
Abmar Barros added a comment - I this patch I have addressed all the issues Ivan and Flavio raised, but the reflection code on the FaiilureDetectorFactory, in which I am working. What do you think is the most user-friendly way of configuring the failure detector? (using reflection on the FailureDetectorFactory). Should we require the fully classified class name in the fd.classname option and the parameters should be like fd.param1, fd.param2...? In this sense, should we force the failure detectors classes to have methods like setParam1(String param1), setParam2(String param2)...?
Hide
Ivan Kelly added a comment -

I think putting it under the fd.classname would be most accessible for users. For example,

fd.myfailuredetector = com.softwaresoft.zk.myFailureDetector
fd.myfailuredetector.blahblah = foobar
fd.myfailuredetector.myoption = barfoo
etc...

Then when the factory tries to instantiate a myfailuredetector and it sees that myfailuredetector isn't a known type it should look up "fd." + "myfailuredetector" in properties, and instantiate the class with fdOptions as a parameter. Let the implementation take care of which options it wants to read from fdOptions.

Show
Ivan Kelly added a comment - I think putting it under the fd.classname would be most accessible for users. For example, fd.myfailuredetector = com.softwaresoft.zk.myFailureDetector fd.myfailuredetector.blahblah = foobar fd.myfailuredetector.myoption = barfoo etc... Then when the factory tries to instantiate a myfailuredetector and it sees that myfailuredetector isn't a known type it should look up "fd." + "myfailuredetector" in properties, and instantiate the class with fdOptions as a parameter. Let the implementation take care of which options it wants to read from fdOptions.
Hide
Ivan Kelly added a comment -

One more comment on the API. Ping and heartbeat seem to be used interchangeably though it's my understanding that these are fundamentally the same thing.

Show
Ivan Kelly added a comment - One more comment on the API. Ping and heartbeat seem to be used interchangeably though it's my understanding that these are fundamentally the same thing.
Hide
Flavio Junqueira added a comment -

+1 to changing the API to use either ping or heartbeat only. I prefer ping because it is shorter.

Show
Flavio Junqueira added a comment - +1 to changing the API to use either ping or heartbeat only. I prefer ping because it is shorter.
Hide
Flavio Junqueira added a comment -

Show
Hide
Thomas Koch added a comment -

ZooKeeper-702 and ZOOKEEPER-823 both patch ClientCnxn

Show
Thomas Koch added a comment - ZooKeeper-702 and ZOOKEEPER-823 both patch ClientCnxn
Hide
Abmar Barros added a comment -

In this patch:

• Added support for declaring new failure detectors, as Ivan suggested.
• Updated documentation to reflect this.
• Replaced all 'heartbeat' mentions in the API and tests for 'ping'.
Show
Abmar Barros added a comment - In this patch: Added support for declaring new failure detectors, as Ivan suggested. Updated documentation to reflect this. Replaced all 'heartbeat' mentions in the API and tests for 'ping'.
Hide
Ivan Kelly added a comment -

Code looks good Ambar, good job

I think the factory code could be improved slightly though. Currently if you want a custom FD you pass in a FDname and an FDClass.
I would merge these two, so that you only pass in FDname to the factory, and the factory can decide whether this is a builtin or a custom. So as an example, FDname can be fixed, chen, phiaccrual or com.blah.foobar.MyFailureDetector. When the factory sees this it goes...

if (fdName.equals("fixed"))

{ fdClass = FixedPingFailureDetector.class; }

else if (fdName.equals("chen"))

{ fdClass = ChenFailureDetector.class; }

else if (fdName.equals("bertier"))

{ fdClass = BertierFailureDetector.class; }

else if (fdName.equals("phiaccrual"))

{ fdClass = PhiAccrualFailureDetector.class; }

else if (fdName.equals("sliced"))

{ fdClass = SlicedPingFailureDetector.class; }

else

{ fdClass = Class.forName(fdName); }
Show
Ivan Kelly added a comment - Code looks good Ambar, good job I think the factory code could be improved slightly though. Currently if you want a custom FD you pass in a FDname and an FDClass. I would merge these two, so that you only pass in FDname to the factory, and the factory can decide whether this is a builtin or a custom. So as an example, FDname can be fixed, chen, phiaccrual or com.blah.foobar.MyFailureDetector. When the factory sees this it goes... if (fdName.equals("fixed")) { fdClass = FixedPingFailureDetector.class; } else if (fdName.equals("chen")) { fdClass = ChenFailureDetector.class; } else if (fdName.equals("bertier")) { fdClass = BertierFailureDetector.class; } else if (fdName.equals("phiaccrual")) { fdClass = PhiAccrualFailureDetector.class; } else if (fdName.equals("sliced")) { fdClass = SlicedPingFailureDetector.class; } else { fdClass = Class.forName(fdName); }
Hide
Abmar Barros added a comment -

Hi Ivan, thanks for the feedback.

I thought the idea was to declare an alias to a custom failure detector, so setting its options would be more user friendly. e.g.:

sessionsFD = myfd
sessionsFD.myfd = com.softwaresoft.zk.myFailureDetector
sessionsFD.myfd.blahblah = foobar
sessionsFD.myfd.myoption = barfoo

In this case, I need both className and fdName parameters in the factory code (and this is how it is implemented in the patch).
What do you think?

Show
Abmar Barros added a comment - Hi Ivan, thanks for the feedback. I thought the idea was to declare an alias to a custom failure detector, so setting its options would be more user friendly. e.g.: sessionsFD = myfd sessionsFD.myfd = com.softwaresoft.zk.myFailureDetector sessionsFD.myfd.blahblah = foobar sessionsFD.myfd.myoption = barfoo In this case, I need both className and fdName parameters in the factory code (and this is how it is implemented in the patch). What do you think?
Hide
Flavio Junqueira added a comment -

Hi Abmar, I believe my comments from Aug 18 have not been addressed. In particular, I still see the same javadoc issues and no test exercising the failure detectors with a running ensemble. I believe the default failure detector is naturally exercised in various tests, but it would be good to have tests that also exercise the other failure detectors in a running ensemble. Isn't it right?

There are still several references in javadocs and in the documentation to "heartbeat". Should we replace them with "ping" for consistency?

I'm also seeing some tests failing, like ObserverTest and NioNettySuiteHammerTest, but I'm not sure if this is related to this patch. I will explore a little further.

Show
Flavio Junqueira added a comment - Hi Abmar, I believe my comments from Aug 18 have not been addressed. In particular, I still see the same javadoc issues and no test exercising the failure detectors with a running ensemble. I believe the default failure detector is naturally exercised in various tests, but it would be good to have tests that also exercise the other failure detectors in a running ensemble. Isn't it right? There are still several references in javadocs and in the documentation to "heartbeat". Should we replace them with "ping" for consistency? I'm also seeing some tests failing, like ObserverTest and NioNettySuiteHammerTest, but I'm not sure if this is related to this patch. I will explore a little further.
Hide
Abmar Barros added a comment -

In this patch:

• Replaced 'heartbeat' for 'ping' in the forrest documentation
• Expanded the hammer (client and quorum) and recovery tests to exercise all failure detectors implementations.

Flavio, I have run the tests you mentioned and they seem to fail in the trunk also.

Show
Abmar Barros added a comment - In this patch: Replaced 'heartbeat' for 'ping' in the forrest documentation Revised javadocs Expanded the hammer (client and quorum) and recovery tests to exercise all failure detectors implementations. Flavio, I have run the tests you mentioned and they seem to fail in the trunk also.
Hide
Ivan Kelly added a comment -

I thought I had replied to your comment on the 13th, but it seems not. Looking at the code again, the way you did the factory is good. I misunderstood how zookeeper handled configuration.

Show
Ivan Kelly added a comment - I thought I had replied to your comment on the 13th, but it seems not. Looking at the code again, the way you did the factory is good. I misunderstood how zookeeper handled configuration.
Hide
Ivan Kelly added a comment -

+1

Show
Ivan Kelly added a comment - +1
Hide
Flavio Junqueira added a comment -

Thanks for the updated patch, Abmar. The new tests, however, are working as expected. More specifically, the methods in QuorumBase (createLearnersFD and createSessionsFD) are not being overridden as expected, which affects all new hammer tests. I haven't checked the other tests, but I suspect they suffer from the same problem.

I'm canceling the patch for now.

Show
Flavio Junqueira added a comment - Thanks for the updated patch, Abmar. The new tests, however, are working as expected. More specifically, the methods in QuorumBase (createLearnersFD and createSessionsFD) are not being overridden as expected, which affects all new hammer tests. I haven't checked the other tests, but I suspect they suffer from the same problem. I'm canceling the patch for now.
Hide
Flavio Junqueira added a comment -

In the previous comment, hopefully it was clear that I meant to say that the new tests are NOT working as expected. Apologies for the typo.

Show
Flavio Junqueira added a comment - In the previous comment, hopefully it was clear that I meant to say that the new tests are NOT working as expected. Apologies for the typo.
Hide
Abmar Barros added a comment -

Thanks for the feedback Flavio, I have fixed the issue you mentioned. Only the new quorum hammer tests were not working propertly, they were creating the default failure detector in all the test cases. This patch addresses this issue.

Show
Abmar Barros added a comment - Thanks for the feedback Flavio, I have fixed the issue you mentioned. Only the new quorum hammer tests were not working propertly, they were creating the default failure detector in all the test cases. This patch addresses this issue.
Hide
Flavio Junqueira added a comment -

+1, I'm pretty happy with the patch.

Show
Flavio Junqueira added a comment - +1, I'm pretty happy with the patch.
Hide
Flavio Junqueira added a comment -

I forgot to mention that the patch does not apply cleanly. I had to delete the first two lines (generated by eclipse), but once I did it applied cleanly. Abmar, could you upload a new patch? My +1 still holds...

Show
Flavio Junqueira added a comment - I forgot to mention that the patch does not apply cleanly. I had to delete the first two lines (generated by eclipse), but once I did it applied cleanly. Abmar, could you upload a new patch? My +1 still holds...
Hide
Abmar Barros added a comment -

Fixed the issue on the patch generation that Flavio mentioned.

Show
Abmar Barros added a comment - Fixed the issue on the patch generation that Flavio mentioned.
Hide
Patrick Hunt added a comment -

Patch applied cleanly for me. I ran the tests and they all passed.

I did notice that java test time for me moved from ~18 minutes to ~28 minutes. I don't think we should hold up this patch for this, however going fwd we should think about how to enable a quicker "commit test" (run by devs before submitting/committing) as well as longer running hudson & integration tests. Where the commit test runs more quickly (hopefully < 10minutes). We will probably need to use something like junit test category support.

Nice job, thanks.

Show
Patrick Hunt added a comment - Patch applied cleanly for me. I ran the tests and they all passed. I did notice that java test time for me moved from ~18 minutes to ~28 minutes. I don't think we should hold up this patch for this, however going fwd we should think about how to enable a quicker "commit test" (run by devs before submitting/committing) as well as longer running hudson & integration tests. Where the commit test runs more quickly (hopefully < 10minutes). We will probably need to use something like junit test category support. Nice job, thanks.
Hide
Abmar Barros added a comment -

After making some more experiments with the Phi Accrual, I have noticed that the exponential distribution fits the ping inter-arrival sampling window better.
Then, I have added a new option for the PhiAccrual called 'dist', that is the distribution used to model the inter-arrivals.
Two possible values for this parameter are 'norm' and 'exp', and the default is 'exp'. When we set the PhiAccrual to use the exponential distribution, it will work similar to the Cassandra's failure detector.

Show
Abmar Barros added a comment - After making some more experiments with the Phi Accrual, I have noticed that the exponential distribution fits the ping inter-arrival sampling window better. Then, I have added a new option for the PhiAccrual called 'dist', that is the distribution used to model the inter-arrivals. Two possible values for this parameter are 'norm' and 'exp', and the default is 'exp'. When we set the PhiAccrual to use the exponential distribution, it will work similar to the Cassandra's failure detector.
Hide
Flavio Junqueira added a comment -

Hi Abmar, Thanks for the addition to the patch. I was wondering if it is really a good idea to have both options, normal and exponential, implemented. Since your experiments have shown that exponential performs better, why don't use it only? Also, I was wondering if you have posted expertimental numbers showing that exponential performs better.

In the case we go with exponential only, then we don't need the modification to ivy.xml, right?

And last comment, it doesn't look like the classes implementing PhiTimeoutEvaluator need to be public. Is this right?

Show
Flavio Junqueira added a comment - Hi Abmar, Thanks for the addition to the patch. I was wondering if it is really a good idea to have both options, normal and exponential, implemented. Since your experiments have shown that exponential performs better, why don't use it only? Also, I was wondering if you have posted expertimental numbers showing that exponential performs better. In the case we go with exponential only, then we don't need the modification to ivy.xml, right? And last comment, it doesn't look like the classes implementing PhiTimeoutEvaluator need to be public. Is this right?
Hide
Abmar Barros added a comment -

Hi Flavio, Thanks for the comments. In this patch I have removed the normal distribution option for the PhiAccrual, and also the math-commons dependency in the ivy.xml file. With this modification, the PhiTimeoutEvaluator (and its implementations) are no longer needed.

I also have refactored some of the common code regarding the ping sampling window. This made the code of the adaptive FDs clearer.

Regarding my last experiments, I will post the results on the wiki asap.

Show
Abmar Barros added a comment - Hi Flavio, Thanks for the comments. In this patch I have removed the normal distribution option for the PhiAccrual, and also the math-commons dependency in the ivy.xml file. With this modification, the PhiTimeoutEvaluator (and its implementations) are no longer needed. I also have refactored some of the common code regarding the ping sampling window. This made the code of the adaptive FDs clearer. Regarding my last experiments, I will post the results on the wiki asap.
Hide
Abmar Barros added a comment -
• Updated with trunk.
• Fixed the standard deviation calculus in the InterArrivalSamplingWindow.
Show
Abmar Barros added a comment - Updated with trunk. Fixed the standard deviation calculus in the InterArrivalSamplingWindow.
Hide
Flavio Junqueira added a comment -

Thanks, Abmar. It looks good to me. I have one quick comment, though. Is there any configuration value that could be causing tests to run slower? I have the impression that tests are running slightly slower with your patch. One in particular that called my attention was QuorumZxidSyncTest:

Trunk: [junit] Running org.apache.zookeeper.test.QuorumZxidSyncTest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 94.55 sec

702: [junit] Running org.apache.zookeeper.test.QuorumZxidSyncTest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 139.985 sec


and this seems to be pretty consistent.

Show
Flavio Junqueira added a comment - Thanks, Abmar. It looks good to me. I have one quick comment, though. Is there any configuration value that could be causing tests to run slower? I have the impression that tests are running slightly slower with your patch. One in particular that called my attention was QuorumZxidSyncTest: Trunk: [junit] Running org.apache.zookeeper.test.QuorumZxidSyncTest [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 94.55 sec 702: [junit] Running org.apache.zookeeper.test.QuorumZxidSyncTest [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 139.985 sec and this seems to be pretty consistent.
Hide
Abmar Barros added a comment -

Thanks for the feedback Flavio. After some investigation, I have noticed that, with the patch applied, the server threads take longer to be released in the tests. I have measured the time that the QuorumPeer.join() invocations take on the QuorumBase.shutdown() method, and it seems that most of the time difference relies on these invocations.
However, I am still trying to find out what is causing these threads to be delayed. I need to investigate it further.

Show
Abmar Barros added a comment - Thanks for the feedback Flavio. After some investigation, I have noticed that, with the patch applied, the server threads take longer to be released in the tests. I have measured the time that the QuorumPeer.join() invocations take on the QuorumBase.shutdown() method, and it seems that most of the time difference relies on these invocations. However, I am still trying to find out what is causing these threads to be delayed. I need to investigate it further.
Hide
Abmar Barros added a comment -

In this patch:

• Updated with trunk
• Fixed the timing issue - the timeout and the pinging interval were not properly set on the LearnerHandler class, so the tests took longer to be executed.
Show
Abmar Barros added a comment - In this patch: Updated with trunk Fixed the timing issue - the timeout and the pinging interval were not properly set on the LearnerHandler class, so the tests took longer to be executed.
Hide

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12460193/ZOOKEEPER-702.patch
against trunk revision 1036967.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 72 new or modified tests.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 14 new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

This message is automatically generated.

Show
Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12460193/ZOOKEEPER-702.patch against trunk revision 1036967. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 72 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 14 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/42//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/42//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/42//console This message is automatically generated.
Hide

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12460193/ZOOKEEPER-702.patch
against trunk revision 1036967.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 72 new or modified tests.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 14 new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

This message is automatically generated.

Show
Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12460193/ZOOKEEPER-702.patch against trunk revision 1036967. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 72 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 14 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/48//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/48//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/48//console This message is automatically generated.
Hide
Abmar Barros added a comment -

This patch:

• Fixes the findbugs warnings
• Fixes the testSessionTimeout() failure on the SessionTest
Show
Abmar Barros added a comment - This patch: Fixes the findbugs warnings Fixes the testSessionTimeout() failure on the SessionTest
Hide

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12460294/ZOOKEEPER-702.patch
against trunk revision 1036967.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 72 new or modified tests.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

This message is automatically generated.

Show
Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12460294/ZOOKEEPER-702.patch against trunk revision 1036967. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 72 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/49//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/49//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/49//console This message is automatically generated.
Hide
Flavio Junqueira added a comment -

It sounds like there is still one findbugs warning to fix.

Show
Flavio Junqueira added a comment - It sounds like there is still one findbugs warning to fix.
Hide
Abmar Barros added a comment -

Hi Flavio, thanks for the feedback.
I am afraid these warnings are not related to the patch itself.
Should we create an issue to remove the dead store to the variable 'size' in QuorumCnxManager$RecvWorker.run()? Show Abmar Barros added a comment - Hi Flavio, thanks for the feedback. I am afraid these warnings are not related to the patch itself. Should we create an issue to remove the dead store to the variable 'size' in QuorumCnxManager$RecvWorker.run()?
Hide
Flavio Junqueira added a comment -

You're right, I didn't check what the warning was about and assumed it was due to your patch. Resubmitting it.

And yes, I'll create a jira for the findbugs warning.

Show
Flavio Junqueira added a comment - You're right, I didn't check what the warning was about and assumed it was due to your patch. Resubmitting it. And yes, I'll create a jira for the findbugs warning.
Hide
Gustavo Niemeyer added a comment -

This is very interesting work, and I'm looking forward to playing with the Phi failure
detector. Thanks Abmar.

One thing I'm noticing is that the patch is working only on the Java side so far, which
means the C client and all the bindings which go against it won't currently benefit
from it.

Are there any plans for introducing support in the C libraries as well?

Show
Gustavo Niemeyer added a comment - This is very interesting work, and I'm looking forward to playing with the Phi failure detector. Thanks Abmar. One thing I'm noticing is that the patch is working only on the Java side so far, which means the C client and all the bindings which go against it won't currently benefit from it. Are there any plans for introducing support in the C libraries as well?
Hide

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12460294/ZOOKEEPER-702.patch
against trunk revision 1055924.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 72 new or modified tests.

-1 patch. The patch command could not apply the patch.

This message is automatically generated.

Show
Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12460294/ZOOKEEPER-702.patch against trunk revision 1055924. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 72 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/98//console This message is automatically generated.
Hide
Abmar Barros added a comment -

Updated patch with trunk.

Gustavo, I am currently working on the FD module for the C client - Jira issue #848 addresses this feature.

Show
Abmar Barros added a comment - Updated patch with trunk. Gustavo, I am currently working on the FD module for the C client - Jira issue #848 addresses this feature.
Hide

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12469882/ZOOKEEPER-702.patch
against trunk revision 1062244.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 72 new or modified tests.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

This message is automatically generated.

Show
Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12469882/ZOOKEEPER-702.patch against trunk revision 1062244. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 72 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/107//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/107//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/107//console This message is automatically generated.
Hide
Flavio Junqueira added a comment -

Trying hudson again.

Show
Flavio Junqueira added a comment - Trying hudson again.
Hide

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12469882/ZOOKEEPER-702.patch
against trunk revision 1072085.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 72 new or modified tests.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

This message is automatically generated.

Show
Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12469882/ZOOKEEPER-702.patch against trunk revision 1072085. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 72 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/153//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/153//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/153//console This message is automatically generated.
Hide

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12469882/ZOOKEEPER-702.patch
against trunk revision 1072085.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 72 new or modified tests.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

This message is automatically generated.

Show
Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12469882/ZOOKEEPER-702.patch against trunk revision 1072085. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 72 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/172//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/172//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/172//console This message is automatically generated.
Hide
Abmar Barros added a comment -
• Updated patch with trunk
• Added the stddev as a metric for the timeout estimation (which was removed in the last patch).
• Remove ping interval update when timeout was updated in the adaptive mechanisms.
Show
Abmar Barros added a comment - Updated patch with trunk Added the stddev as a metric for the timeout estimation (which was removed in the last patch). Remove ping interval update when timeout was updated in the adaptive mechanisms.
Hide

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12471908/ZOOKEEPER-702.patch
against trunk revision 1072085.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 72 new or modified tests.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

This message is automatically generated.

Show
Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12471908/ZOOKEEPER-702.patch against trunk revision 1072085. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 72 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/173//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/173//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/173//console This message is automatically generated.
Hide
Abmar Barros added a comment -
• Fixes findbug warning
Show
Abmar Barros added a comment - Fixes findbug warning
Hide

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12471939/ZOOKEEPER-702.patch
against trunk revision 1074995.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 72 new or modified tests.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

This message is automatically generated.

Show
Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12471939/ZOOKEEPER-702.patch against trunk revision 1074995. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 72 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/174//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/174//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/174//console This message is automatically generated.
Hide

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12471939/ZOOKEEPER-702.patch
against trunk revision 1074995.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 72 new or modified tests.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

This message is automatically generated.

Show
Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12471939/ZOOKEEPER-702.patch against trunk revision 1074995. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 72 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/177//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/177//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/177//console This message is automatically generated.
Hide
Camille Fournier added a comment -

Created ReviewBoard for code review:
https://reviews.apache.org/r/483/

Show
Camille Fournier added a comment - Created ReviewBoard for code review: https://reviews.apache.org/r/483/
Hide
Abmar Barros added a comment -
• Addressed Camille's suggestions on the review board.
Show
Abmar Barros added a comment - Addressed Camille's suggestions on the review board.
Hide

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12475676/ZOOKEEPER-702.patch
against trunk revision 1089617.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 75 new or modified tests.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

This message is automatically generated.

Show
Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12475676/ZOOKEEPER-702.patch against trunk revision 1089617. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 75 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/221//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/221//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/221//console This message is automatically generated.
Hide
Abmar Barros added a comment -

This patch addresses the last suggestions given by Camille and Henry at the review board.

Show
Abmar Barros added a comment - This patch addresses the last suggestions given by Camille and Henry at the review board.
Hide

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12477371/ZOOKEEPER-702.patch
against trunk revision 1091841.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 75 new or modified tests.

-1 javadoc. The javadoc tool appears to have generated 1 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

This message is automatically generated.

Show
Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12477371/ZOOKEEPER-702.patch against trunk revision 1091841. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 75 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/233//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/233//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/233//console This message is automatically generated.
Hide
Camille Fournier added a comment -

What changes will need to be made in the c client for this to work there?

Show
Camille Fournier added a comment - What changes will need to be made in the c client for this to work there?
Hide
Abmar Barros added a comment -

Hi Camille

Jira issue #848 addresses this feature.
The C client must be updated in a similar way as the Java client was.
We must define the points in the code where the failure detection module should be called.
I have put some effort in rewriting the detectors in C. The code is hosted at https://github.com/abmargb/zk-fdmodule-c.

Show
Abmar Barros added a comment - Hi Camille Jira issue #848 addresses this feature. The C client must be updated in a similar way as the Java client was. We must define the points in the code where the failure detection module should be called. I have put some effort in rewriting the detectors in C. The code is hosted at https://github.com/abmargb/zk-fdmodule-c .
Hide
Camille Fournier added a comment -

Hi Abmar,

Sorry for the delay in looking at this. Something is broken with xdocs and I can't get a clean CR out of the latest patch against trunk. Can you try again? I'll start looking at the rest of it in the meantime.

Thanks.

Show
Camille Fournier added a comment - Hi Abmar, Sorry for the delay in looking at this. Something is broken with xdocs and I can't get a clean CR out of the latest patch against trunk. Can you try again? I'll start looking at the rest of it in the meantime. Thanks.
Hide
Abmar Barros added a comment -

Hi Camille,

This patch is updated to trunk.
Do you know whether the RB issues were solved?

Show
Abmar Barros added a comment - Hi Camille, This patch is updated to trunk. Do you know whether the RB issues were solved?
Hide

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12480552/ZOOKEEPER-702.patch
against trunk revision 1125581.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 78 new or modified tests.

-1 patch. The patch command could not apply the patch.

This message is automatically generated.

Show
Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12480552/ZOOKEEPER-702.patch against trunk revision 1125581. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 78 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/293//console This message is automatically generated.
Hide
Patrick Hunt added a comment -

Something still seems amiss in the patch (see the console output)

Show
Patrick Hunt added a comment - Something still seems amiss in the patch (see the console output)
Hide
Abmar Barros added a comment -

Trying again.

Show
Abmar Barros added a comment - Trying again.
Hide

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12480577/ZOOKEEPER-702.patch
against trunk revision 1125581.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 78 new or modified tests.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

This message is automatically generated.

Show
Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12480577/ZOOKEEPER-702.patch against trunk revision 1125581. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 78 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/294//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/294//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/294//console This message is automatically generated.
Hide
Camille Fournier added a comment -

Looking at this in my own rep, created https://issues.apache.org/jira/browse/INFRA-3660 to try to get some help for why this won't work in the ZK reviewboard.

Show
Camille Fournier added a comment - Looking at this in my own rep, created https://issues.apache.org/jira/browse/INFRA-3660 to try to get some help for why this won't work in the ZK reviewboard.
Hide

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/802/
-----------------------------------------------------------

Review request for zookeeper.

Summary
-------

testing upload of patch attached to ZOOKEEPER-702

https://issues.apache.org/jira/browse/ZOOKEEPER-702

Diffs

/src/docs/src/documentation/content/xdocs/index.xml 1065709
/src/docs/src/documentation/content/xdocs/zookeeperFailureDetector.xml PRE-CREATION
/src/java/main/org/apache/zookeeper/ClientCnxn.java 1127985
/src/java/main/org/apache/zookeeper/ClientCnxnSocket.java 1127985
/src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java 1127985
/src/java/main/org/apache/zookeeper/ZooKeeper.java 1127985
/src/java/main/org/apache/zookeeper/ZooKeeperMain.java 1127985
/src/java/main/org/apache/zookeeper/common/fd/AbstractFailureDetector.java PRE-CREATION
/src/java/main/org/apache/zookeeper/common/fd/BertierFailureDetector.java PRE-CREATION
/src/java/main/org/apache/zookeeper/common/fd/ChenFailureDetector.java PRE-CREATION
/src/java/main/org/apache/zookeeper/common/fd/FailureDetector.java PRE-CREATION
/src/java/main/org/apache/zookeeper/common/fd/FailureDetectorFactory.java PRE-CREATION
/src/java/main/org/apache/zookeeper/common/fd/FailureDetectorOptParser.java PRE-CREATION
/src/java/main/org/apache/zookeeper/common/fd/FixedPingFailureDetector.java PRE-CREATION
/src/java/main/org/apache/zookeeper/common/fd/InterArrivalSamplingWindow.java PRE-CREATION
/src/java/main/org/apache/zookeeper/common/fd/MessageType.java PRE-CREATION
/src/java/main/org/apache/zookeeper/common/fd/Monitored.java PRE-CREATION
/src/java/main/org/apache/zookeeper/common/fd/PhiAccrualFailureDetector.java PRE-CREATION
/src/java/main/org/apache/zookeeper/common/fd/SlicedPingFailureDetector.java PRE-CREATION
/src/java/main/org/apache/zookeeper/server/ServerConfig.java 1065709
/src/java/main/org/apache/zookeeper/server/SessionTracker.java 1065709
/src/java/main/org/apache/zookeeper/server/SessionTrackerImpl.java 1095174
/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java 1127985
/src/java/main/org/apache/zookeeper/server/ZooKeeperServerMain.java 1095174
/src/java/main/org/apache/zookeeper/server/quorum/FollowerZooKeeperServer.java 1095174
/src/java/main/org/apache/zookeeper/server/quorum/Learner.java 1095174
/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java 1095174
/src/java/main/org/apache/zookeeper/server/quorum/LearnerSessionTracker.java 1065709
/src/java/main/org/apache/zookeeper/server/quorum/LearnerZooKeeperServer.java 1065709
/src/java/main/org/apache/zookeeper/server/quorum/ObserverZooKeeperServer.java 1095174
/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java 1127985
/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java 1095174
/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerMain.java 1095174
/src/java/main/org/apache/zookeeper/server/quorum/QuorumZooKeeperServer.java 1127985
/src/java/test/org/apache/zookeeper/TestableZooKeeper.java 1065709
/src/java/test/org/apache/zookeeper/test/ClientBase.java 1127985
/src/java/test/org/apache/zookeeper/test/DisconnectableZooKeeper.java 1091841
/src/java/test/org/apache/zookeeper/test/QuorumBase.java 1127985
/src/java/test/org/apache/zookeeper/test/QuorumFDHammerTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/RecoveryTest.java 1091841
/src/java/test/org/apache/zookeeper/test/SessionTest.java 1091841
/src/java/test/org/apache/zookeeper/test/fd/BertierClientHammerTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/BertierFDTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/BertierQuorumHammerTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/BertierRecoveryTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/BertierSessionTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/ChenClientHammerTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/ChenFDTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/ChenQuorumHammerTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/ChenRecoveryTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/ChenSessionTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/FixedPingFDTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/InterArrivalSamplingWindowTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/PhiAccrualClientHammerTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/PhiAccrualFDTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/PhiAccrualQuorumHammerTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/PhiAccrualRecoveryTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/PhiAccrualSessionTest.java PRE-CREATION
/src/java/test/org/apache/zookeeper/test/fd/SlicedPingFDTest.java PRE-CREATION

Testing
-------

no testing, just testing INFRA-3660

Thanks,

Show
jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/802/ ----------------------------------------------------------- Review request for zookeeper. Summary ------- testing upload of patch attached to ZOOKEEPER-702 This addresses bug ZOOKEEPER-702 . https://issues.apache.org/jira/browse/ZOOKEEPER-702 Diffs /src/docs/src/documentation/content/xdocs/index.xml 1065709 /src/docs/src/documentation/content/xdocs/zookeeperFailureDetector.xml PRE-CREATION /src/java/main/org/apache/zookeeper/ClientCnxn.java 1127985 /src/java/main/org/apache/zookeeper/ClientCnxnSocket.java 1127985 /src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java 1127985 /src/java/main/org/apache/zookeeper/ZooKeeper.java 1127985 /src/java/main/org/apache/zookeeper/ZooKeeperMain.java 1127985 /src/java/main/org/apache/zookeeper/common/fd/AbstractFailureDetector.java PRE-CREATION /src/java/main/org/apache/zookeeper/common/fd/BertierFailureDetector.java PRE-CREATION /src/java/main/org/apache/zookeeper/common/fd/ChenFailureDetector.java PRE-CREATION /src/java/main/org/apache/zookeeper/common/fd/FailureDetector.java PRE-CREATION /src/java/main/org/apache/zookeeper/common/fd/FailureDetectorFactory.java PRE-CREATION /src/java/main/org/apache/zookeeper/common/fd/FailureDetectorOptParser.java PRE-CREATION /src/java/main/org/apache/zookeeper/common/fd/FixedPingFailureDetector.java PRE-CREATION /src/java/main/org/apache/zookeeper/common/fd/InterArrivalSamplingWindow.java PRE-CREATION /src/java/main/org/apache/zookeeper/common/fd/MessageType.java PRE-CREATION /src/java/main/org/apache/zookeeper/common/fd/Monitored.java PRE-CREATION /src/java/main/org/apache/zookeeper/common/fd/PhiAccrualFailureDetector.java PRE-CREATION /src/java/main/org/apache/zookeeper/common/fd/SlicedPingFailureDetector.java PRE-CREATION /src/java/main/org/apache/zookeeper/server/ServerConfig.java 1065709 /src/java/main/org/apache/zookeeper/server/SessionTracker.java 1065709 /src/java/main/org/apache/zookeeper/server/SessionTrackerImpl.java 1095174 /src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java 1127985 /src/java/main/org/apache/zookeeper/server/ZooKeeperServerMain.java 1095174 /src/java/main/org/apache/zookeeper/server/quorum/FollowerZooKeeperServer.java 1095174 /src/java/main/org/apache/zookeeper/server/quorum/Leader.java 1127985 /src/java/main/org/apache/zookeeper/server/quorum/LeaderZooKeeperServer.java 1065709 /src/java/main/org/apache/zookeeper/server/quorum/Learner.java 1095174 /src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java 1095174 /src/java/main/org/apache/zookeeper/server/quorum/LearnerSessionTracker.java 1065709 /src/java/main/org/apache/zookeeper/server/quorum/LearnerZooKeeperServer.java 1065709 /src/java/main/org/apache/zookeeper/server/quorum/ObserverZooKeeperServer.java 1095174 /src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java 1127985 /src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java 1095174 /src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerMain.java 1095174 /src/java/main/org/apache/zookeeper/server/quorum/QuorumZooKeeperServer.java 1127985 /src/java/main/org/apache/zookeeper/server/quorum/ReadOnlyZooKeeperServer.java 1125581 /src/java/test/org/apache/zookeeper/TestableZooKeeper.java 1065709 /src/java/test/org/apache/zookeeper/test/ClientBase.java 1127985 /src/java/test/org/apache/zookeeper/test/DisconnectableZooKeeper.java 1091841 /src/java/test/org/apache/zookeeper/test/QuorumBase.java 1127985 /src/java/test/org/apache/zookeeper/test/QuorumFDHammerTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/ReadOnlyModeTest.java 1125581 /src/java/test/org/apache/zookeeper/test/RecoveryTest.java 1091841 /src/java/test/org/apache/zookeeper/test/SessionTest.java 1091841 /src/java/test/org/apache/zookeeper/test/fd/BertierClientHammerTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/BertierFDTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/BertierQuorumHammerTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/BertierRecoveryTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/BertierSessionTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/ChenClientHammerTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/ChenFDTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/ChenQuorumHammerTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/ChenRecoveryTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/ChenSessionTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/FixedPingFDTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/InterArrivalSamplingWindowTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/PhiAccrualClientHammerTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/PhiAccrualFDTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/PhiAccrualQuorumHammerTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/PhiAccrualRecoveryTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/PhiAccrualSessionTest.java PRE-CREATION /src/java/test/org/apache/zookeeper/test/fd/SlicedPingFDTest.java PRE-CREATION Diff: https://reviews.apache.org/r/802/diff Testing ------- no testing, just testing INFRA-3660 Thanks, admin
Hide
Flavio Junqueira added a comment -

Should we keep this issue marked for 3.4? Camille, Abmar, could you give an update, please?

Show
Flavio Junqueira added a comment - Should we keep this issue marked for 3.4? Camille, Abmar, could you give an update, please?
Hide
Camille Fournier added a comment -

Well, I finally got the rb updated with the latest diff. I am willing to be one of two reviewers for this, but I am definitely not comfortable signing off on such a large change by myself (and note that I am also rather blind to whitespace issues despite my best efforts). If one of you is willing to also do the final review, I will deal with checking it in, doc generation, etc.

https://reviews.apache.org/r/483/

Show
Camille Fournier added a comment - Well, I finally got the rb updated with the latest diff. I am willing to be one of two reviewers for this, but I am definitely not comfortable signing off on such a large change by myself (and note that I am also rather blind to whitespace issues despite my best efforts). If one of you is willing to also do the final review, I will deal with checking it in, doc generation, etc. https://reviews.apache.org/r/483/
Hide

camille/flavio,
I am also a little hesitant to add this to 3.4. Can we do this: Ill cut out a branch this weekend and then we immediately check it in into trunk to make sure we get this into 3.5? 3.4 seems to have quite a bit of features and we'll need some effort to stabilize 3.4. Any more features will just increase our effort to stabilization of the release. What do you guys think?

Show
Mahadev konar added a comment - camille/flavio, I am also a little hesitant to add this to 3.4. Can we do this: Ill cut out a branch this weekend and then we immediately check it in into trunk to make sure we get this into 3.5? 3.4 seems to have quite a bit of features and we'll need some effort to stabilize 3.4. Any more features will just increase our effort to stabilization of the release. What do you guys think?
Hide
Flavio Junqueira added a comment -

I have reviewed this patch some time back, and I'm happy to review it again if necessary.

Show
Flavio Junqueira added a comment - Your plan sounds reasonable to me, Mahadev. I have reviewed this patch some time back, and I'm happy to review it again if necessary.
Hide

I think we should move this out to 3.5. Am ok with committing this (after review ofcourse) to trunk as soon as 3.4 branches out.

Show
Mahadev konar added a comment - I think we should move this out to 3.5. Am ok with committing this (after review ofcourse) to trunk as soon as 3.4 branches out.
Hide

The patch no longer applies. Abmar, can you take a look?

Show
Mahadev konar added a comment - The patch no longer applies. Abmar, can you take a look?
Hide
Nicolas Liochon added a comment -

Is there any news of this proposal? It was close to be committed 2 years ago...

Show
Nicolas Liochon added a comment - Is there any news of this proposal? It was close to be committed 2 years ago...
Hide
Flavio Junqueira added a comment -

I know, it is a shame that we haven't been able to get this one in. It was a difficult one because there were too many changes in a single patch. It reminds me of ZOOKEEPER-107...

Show
Flavio Junqueira added a comment - I know, it is a shame that we haven't been able to get this one in. It was a difficult one because there were too many changes in a single patch. It reminds me of ZOOKEEPER-107 ...
Hide
Nicolas Liochon added a comment -

The good news in this comparison is that ZOOKEEPER-107 finally made it

Show
Nicolas Liochon added a comment - The good news in this comparison is that ZOOKEEPER-107 finally made it
Hide
Patrick Hunt added a comment -

Best way to address big patches is to split it up into smaller ones. Seriously, do a set of smaller changes that are easier to review/commit.

Show
Patrick Hunt added a comment - Best way to address big patches is to split it up into smaller ones. Seriously, do a set of smaller changes that are easier to review/commit.

## People

• Assignee:
Abmar Barros
Reporter:
Henry Robinson