Description
If a rebalance is performed after the OS clock has been turned back, the Kafka server enters a loop and the rebalance cannot complete until the system clock catches up with the date/time it showed before the change.
Steps to Reproduce:
- Start a consumer for TOPIC_NAME with group id GROUP_NAME. It will own all the partitions.
- Turn the system (OS) clock back, for instance by 1 hour.
- Start a new consumer for TOPIC_NAME using the same group id. This forces a rebalance.
After these actions the Kafka server log continuously displays the messages below, and after a while both consumers stop receiving messages. This condition lasts at least as long as the amount of time the clock was turned back (1 hour in this example), after which Kafka resumes normal operation.
[2016-08-08 11:30:23,023] INFO [GroupCoordinator 0]: Preparing to restabilize group GROUP_NAME with old generation 2 (kafka.coordinator.GroupCoordinator)
[2016-08-08 11:30:23,025] INFO [GroupCoordinator 0]: Stabilized group GROUP_NAME generation 3 (kafka.coordinator.GroupCoordinator)
[2016-08-08 11:30:23,027] INFO [GroupCoordinator 0]: Preparing to restabilize group GROUP_NAME with old generation 3 (kafka.coordinator.GroupCoordinator)
[2016-08-08 11:30:23,029] INFO [GroupCoordinator 0]: Group GROUP_NAME generation 3 is dead and removed (kafka.coordinator.GroupCoordinator)
[2016-08-08 11:30:23,032] INFO [GroupCoordinator 0]: Preparing to restabilize group GROUP_NAME with old generation 0 (kafka.coordinator.GroupCoordinator)
[2016-08-08 11:30:23,032] INFO [GroupCoordinator 0]: Stabilized group GROUP_NAME generation 1 (kafka.coordinator.GroupCoordinator)
[2016-08-08 11:30:23,033] INFO [GroupCoordinator 0]: Preparing to restabilize group GROUP_NAME with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-08-08 11:30:23,034] INFO [GroupCoordinator 0]: Group GROUP_NAME generation 1 is dead and removed (kafka.coordinator.GroupCoordinator)
[2016-08-08 11:30:23,043] INFO [GroupCoordinator 0]: Preparing to restabilize group GROUP_NAME with old generation 0 (kafka.coordinator.GroupCoordinator)
[2016-08-08 11:30:23,044] INFO [GroupCoordinator 0]: Stabilized group GROUP_NAME generation 1 (kafka.coordinator.GroupCoordinator)
[2016-08-08 11:30:23,044] INFO [GroupCoordinator 0]: Preparing to restabilize group GROUP_NAME with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-08-08 11:30:23,045] INFO [GroupCoordinator 0]: Group GROUP_NAME generation 1 is dead and removed (kafka.coordinator.GroupCoordinator)
Because some systems have NTP enabled, or allow an administrator to change the system clock (date/time), it is important that the clock can be changed safely. Currently the only safe way to do it is to follow these steps:
1- Stop the Kafka server.
2- Change the date/time.
3- Start the Kafka server again.
However, this approach is only possible when the change is made by an administrator, not by NTP. Also, in many systems shutting down the Kafka server might cause information to be lost.
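Since a clock change made by NTP cannot be coordinated with a restart, it may at least help to notice when one has happened. The following is a minimal sketch, not part of Kafka, that compares elapsed wall-clock time with elapsed monotonic time between two checks. The class name ClockStepDetector, its method, and the tolerance parameter are hypothetical and exist only for this illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical helper (not a Kafka class): reports backward steps of the wall clock.
public class ClockStepDetector {
    private final LongSupplier wallClockMs;     // e.g. System::currentTimeMillis
    private final LongSupplier monotonicNanos;  // e.g. System::nanoTime
    private final long toleranceMs;             // assumed threshold for normal jitter
    private long lastWallMs;
    private long lastMonoNanos;

    ClockStepDetector(LongSupplier wallClockMs, LongSupplier monotonicNanos, long toleranceMs) {
        this.wallClockMs = wallClockMs;
        this.monotonicNanos = monotonicNanos;
        this.toleranceMs = toleranceMs;
        this.lastWallMs = wallClockMs.getAsLong();
        this.lastMonoNanos = monotonicNanos.getAsLong();
    }

    /**
     * Returns how many milliseconds the wall clock fell behind the monotonic
     * clock since the previous check, or 0 if the difference is within tolerance.
     */
    long checkBackwardStepMs() {
        long wallNow = wallClockMs.getAsLong();
        long monoNow = monotonicNanos.getAsLong();
        long wallElapsedMs = wallNow - lastWallMs;
        long monoElapsedMs = TimeUnit.NANOSECONDS.toMillis(monoNow - lastMonoNanos);
        lastWallMs = wallNow;
        lastMonoNanos = monoNow;
        long behindMs = monoElapsedMs - wallElapsedMs; // positive when the wall clock went back
        return behindMs > toleranceMs ? behindMs : 0L;
    }

    public static void main(String[] args) throws InterruptedException {
        ClockStepDetector detector =
            new ClockStepDetector(System::currentTimeMillis, System::nanoTime, 1_000L);
        Thread.sleep(200);
        System.out.println("backward step (ms): " + detector.checkBackwardStepMs());
    }
}
```

The clock sources are injected so that the behaviour can be exercised without touching the real OS clock. A check like this only detects the step; it does not prevent the rebalance loop described above.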
As discussed in the thread, Kafka uses System.currentTimeMillis() in a number of places, which means that changing the system clock backwards is bound to cause issues. I wouldn't be surprised if there are other problems in addition to the one mentioned here.
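To illustrate why wall-clock timestamps are fragile here, the sketch below contrasts a timeout computed from System.currentTimeMillis() with one computed from System.nanoTime(), which the Java documentation describes as suitable only for measuring elapsed time and unrelated to wall-clock time. This is not Kafka's code: the classes WallClockDeadline and MonotonicDeadline are hypothetical, and whether the group coordinator loop is caused by exactly this pattern is an assumption.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical illustration (not Kafka code) of two ways to track a timeout.
public class DeadlineSketch {

    /** Timeout tracked against the wall clock; sensitive to clock adjustments. */
    static final class WallClockDeadline {
        private final LongSupplier wallClockMs;
        private final long expiresAtMs;

        WallClockDeadline(LongSupplier wallClockMs, long timeoutMs) {
            this.wallClockMs = wallClockMs;
            this.expiresAtMs = wallClockMs.getAsLong() + timeoutMs;
        }

        boolean isExpired() {
            // If the clock is turned back, this stays false until it catches up.
            return wallClockMs.getAsLong() >= expiresAtMs;
        }
    }

    /** Timeout tracked against a monotonic source such as System.nanoTime(). */
    static final class MonotonicDeadline {
        private final LongSupplier monotonicNanos;
        private final long startNanos;
        private final long timeoutNanos;

        MonotonicDeadline(LongSupplier monotonicNanos, long timeoutMs) {
            this.monotonicNanos = monotonicNanos;
            this.startNanos = monotonicNanos.getAsLong();
            this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        }

        boolean isExpired() {
            // Compare differences only: nanoTime values have no absolute meaning.
            return monotonicNanos.getAsLong() - startNanos >= timeoutNanos;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        WallClockDeadline wall = new WallClockDeadline(System::currentTimeMillis, 50);
        MonotonicDeadline mono = new MonotonicDeadline(System::nanoTime, 50);
        Thread.sleep(100);
        System.out.println("wall-clock deadline expired: " + wall.isExpired());
        System.out.println("monotonic deadline expired:  " + mono.isExpired());
    }
}
```

With the real clocks both deadlines normally expire together. They diverge only when the wall clock is adjusted between creation and the check, which is the situation in this report.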