While tracking down this issue, we pursued two directions:
Weijin investigated why the event is "8", since no event in the event system should have the value "8". In other core dumps we are sure the event is not a real event either; it shows up as random data. That leaves two interesting possibilities: (1) the old data (which may or may not be the same event) was freed, but the event was not cancelled; (2) someone overwrote the data in this event. Weijin followed this track and found that the action-cancel code can cause problems in certain situations. He made a patch in our tree and we applied it to half of our servers; it has run without any crashes for weeks.
At the same time, Koutai worked on making the vector writes and reads safer, even in some very unusual situations. That patch was applied to half of our servers, and it has also run without any crashes.
After careful discussion, we concluded that Weijin's patch is the one we need to keep, and here is the patch.
Looking back at TS-857, there is a strange event in its backtrace too, we have only , is that the same issue here? Where is the action cancelled without mutex protection? If we consider TS-1114 a good fix, then we should treat TS-857 as the same crash.
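To make the "cancelled without mutex protection" question concrete, here is a minimal sketch of the pattern in question. It is not Traffic Server code and not the TS-1114 patch: `SketchEvent`, `cancel_event`, `dispatch_event` and `demo` are hypothetical names, and `std::mutex` stands in for whatever lock the real event carries. The assumption is only that the canceller and the dispatcher must take the same lock, so the dispatcher never delivers an event whose owner has already given it up.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a scheduled event (NOT the real ATS Event class).
struct SketchEvent {
  std::mutex mutex;        // assumed lock shared by the owner and the dispatcher
  bool cancelled = false;  // set by the owner before it releases its data
  int callback_event = 0;  // event code the handler would receive
};

// Cancel while holding the event's mutex, so the dispatcher cannot be
// halfway through delivering the event when the flag changes.
void cancel_event(SketchEvent &e) {
  std::lock_guard<std::mutex> lock(e.mutex);
  e.cancelled = true;
}

// Dispatcher side: take the same mutex, then check the flag before delivery.
// Returns true and fills delivered_code only if the event is still live.
bool dispatch_event(SketchEvent &e, int &delivered_code) {
  std::lock_guard<std::mutex> lock(e.mutex);
  if (e.cancelled) {
    return false;  // owner gave the event up; do not touch its data
  }
  delivered_code = e.callback_event;
  return true;
}

// Small single-threaded walk-through of the two paths.
bool demo() {
  SketchEvent e;
  e.callback_event = 2;
  int code = -1;
  bool first = dispatch_event(e, code);   // delivered, code becomes 2
  assert(first && code == 2);
  cancel_event(e);
  code = -1;
  bool second = dispatch_event(e, code);  // skipped, code stays -1
  return first && !second && code == -1;
}
```

If the cancel path wrote the flag without the lock, or the owner freed its data without cancelling at all, the dispatcher could still deliver into memory that has been reused, which would look like the random event values seen in the core dumps.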
So far, I am not sure how many crashes remain after patching with TS-1114; I just have not seen many new backtraces for this issue. TS-1114 may have covered many of the strange crashes, since the bug it fixes can make the system behave very strangely.