Description
After applying the fixes from DISPATCH-1124 and DISPATCH-1129, receivers in long-running multicast presettled tests still fail with corrupted data sequences. There is no single symptom but several:
- Receivers use all system memory and cache and get hit by the OOM killer
- underrun
- illegal value for field
Investigation shows that the function qdr_forward_drop_presettled_CT_LH is routinely dropping presettled deliveries that have already started transmitting bytes to the wire. Once that happens, there is a race between the message being transmitted successfully and the message being torn down in the middle of transmission.
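The hazard can be illustrated with a toy model. This is only a sketch in Python, not the router's C code: the `Delivery` class, the `drop_presettled` function, the `skip_in_flight` flag, and the head-of-queue drop order are all hypothetical and chosen for illustration. It shows how a drop policy that ignores transmit progress can pick a delivery that is already partly on the wire, and how a guard on transmit progress would avoid that in this model.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Delivery:
    """Hypothetical stand-in for a router delivery (not the real struct)."""
    tag: int
    presettled: bool
    bytes_total: int
    bytes_sent: int = 0  # bytes already written to the wire

    @property
    def in_flight(self) -> bool:
        # Partly transmitted: dropping it now would truncate the message.
        return 0 < self.bytes_sent < self.bytes_total


def drop_presettled(queue, limit, skip_in_flight):
    """Drop presettled deliveries, oldest first, until `limit` remain.

    Assumption: with skip_in_flight=True, deliveries that have already
    started transmitting are never selected for dropping.
    Returns the list of dropped deliveries; `queue` is updated in place.
    """
    excess = len(queue) - limit
    dropped, kept = [], deque()
    while queue:
        d = queue.popleft()
        droppable = d.presettled and not (skip_in_flight and d.in_flight)
        if excess > 0 and droppable:
            dropped.append(d)
            excess -= 1
        else:
            kept.append(d)
    queue.extend(kept)
    return dropped


def make_queue():
    """Four presettled deliveries; the oldest is 40% transmitted."""
    return deque([
        Delivery(0, True, 100, bytes_sent=40),
        Delivery(1, True, 100),
        Delivery(2, True, 100),
        Delivery(3, True, 100),
    ])
```

In this model, calling `drop_presettled(make_queue(), 2, skip_in_flight=False)` selects delivery 0 even though 40 of its 100 bytes are already sent, which corresponds to the mid-transmission teardown described above. With `skip_in_flight=True`, delivery 0 is left alone and deliveries 1 and 2 are dropped instead.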
To reproduce this error, the sender must supply messages significantly faster than the receiving router can forward them to the next router. This triggers the presettled drops. My test setup does this by running the sender and the receiving router on the same laptop and connecting the next router over a relatively slow WiFi link.