Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Batch message handling currently has the following bugs:
1. A batch message triggers many duplicated state transition messages from the controller, which results in "state does not match" errors on the participant side. This in turn creates many ERROR znodes in ZK, which adds to both the read and write workload on the participant and the controller.
2. We also see many concurrent update exceptions:
[2018-03-30 18:59:55,025] [ERROR] [pool-1-thread-1917] [org.apache.helix.messaging.handling.HelixTask:113] - Exception while executing a message. java.util.ConcurrentModificationException msgId: fbdc37d4-ec95-47cb-950c-f9d3d224bbb3 type: STATE_TRANSITION
java.util.ConcurrentModificationException
    at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115)
    at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169)
    at org.apache.helix.ZNRecord.merge(ZNRecord.java:497)
    at org.apache.helix.GroupCommit.commit(GroupCommit.java:121)
    at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:182)
    at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:170)
    at org.apache.helix.messaging.handling.BatchMessageHandler.postHandleMessage(BatchMessageHandler.java:118)
    at org.apache.helix.messaging.handling.BatchMessageHandler.handleMessage(BatchMessageHandler.java:203)
    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:96)
Both errors stem from the fact that in HelixTaskExecutor, all HelixTask objects created from the same batch of messages share the same changeContext object. For a batch message, HelixTask creates a current state update map to record current state updates, so sharing the context results in a race condition when recording current state. Because of this bug, it is quite common that a resource's current state changes on the participant side but is not updated in ZK; after the message is removed, the controller still thinks the state transition has not finished and sends a duplicated state transition message.
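To illustrate why a shared update map can produce the exception in the stack trace, here is a minimal sketch in plain Java. It does not use any Helix classes: the class name, method name, partition keys, and state values are all hypothetical. It reproduces the failure deterministically in one thread by adding an entry to a TreeMap while its keys are being iterated, which is what a second task would do to a merge in progress.

```java
import java.util.ConcurrentModificationException;
import java.util.Map;
import java.util.TreeMap;

public class SharedMapRaceSketch {
    // Iterates the keys of a shared map while adding entries to it.
    // The map is a hypothetical stand-in for a current state update map
    // shared by several tasks; it is not the Helix data structure.
    static boolean mergeWhileModified(Map<String, String> shared) {
        try {
            for (String key : shared.keySet()) {
                // Simulates another task recording its update mid-iteration.
                shared.put(key + "_other", "ONLINE");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            // TreeMap iterators are fail-fast: a structural change during
            // iteration is detected on the following next() call.
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> updates = new TreeMap<>();
        updates.put("partition_0", "ONLINE");
        updates.put("partition_1", "OFFLINE");
        System.out.println("CME thrown: " + mergeWhileModified(updates));
    }
}
```

With real threads the same failure is timing dependent, which is consistent with it only showing up under load.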
The error is only triggered under high load, so it is not covered by our unit / e2e tests.
To fix the issue, we should create a deep copy of the NotificationContext object for each HelixTask in HelixTaskExecutor. I tried this fix with large data sets, and it worked.
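The idea of the fix can be sketched as follows. This is not the actual patch: Context is a hypothetical stand-in for NotificationContext, and runTasks is a hypothetical stand-in for the task scheduling in HelixTaskExecutor. The point it shows is that each task receives its own copy of the context before it is submitted, so no two tasks write to the same map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerTaskContextSketch {
    // Hypothetical stand-in for NotificationContext; not the Helix class.
    static final class Context {
        final Map<String, String> updates = new TreeMap<>();

        Context() {}

        // Copy constructor: the new context owns a separate map instance.
        // Keys and values are immutable Strings, so this is a deep copy.
        Context(Context other) {
            updates.putAll(other.updates);
        }
    }

    // Runs one task per partition, each with a private copy of the base context.
    static List<Context> runTasks(Context base, List<String> partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Context>> futures = new ArrayList<>();
            for (String partition : partitions) {
                Context own = new Context(base); // copy before handing to the task
                futures.add(pool.submit(() -> {
                    own.updates.put(partition, "ONLINE"); // touches only this task's map
                    return own;
                }));
            }
            List<Context> results = new ArrayList<>();
            for (Future<Context> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Context base = new Context();
        base.updates.put("batch", "msg");
        List<Context> done = runTasks(base, List.of("partition_0", "partition_1"));
        System.out.println("tasks finished: " + done.size());
        System.out.println("base entries: " + base.updates.size());
    }
}
```

The copy is made on the submitting thread, before the task starts, so the base context is only read by one thread and is never modified by any task.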