Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
3.5.0
None
Description
If a ReattachExecute is sent very quickly after ExecutePlan, the following could happen:
- ExecutePlan has not yet reached executeHolder.runGrpcResponseSender(responseSender) in SparkConnectExecutePlanHandler.
- ReattachExecute races ahead and reaches executeHolder.runGrpcResponseSender(responseSender) in SparkConnectReattachExecuteHandler first.
- When ExecutePlan then reaches executeHolder.runGrpcResponseSender(responseSender) and executionObserver.attachConsumer(this) is called in the ExecuteGrpcResponseSender of ExecutePlan, it kicks out the ExecuteGrpcResponseSender of ReattachExecute.
So even though ReattachExecute came later, it gets interrupted by the earlier ExecutePlan and finishes with a SparkSQLException(errorClass = "INVALID_CURSOR.DISCONNECTED", Map.empty) (which was assumed to indicate a situation where a stale hanging RPC is replaced by a reconnection).
That would be very unlikely to happen in practice, because ExecutePlan shouldn't be abandoned so quickly, but because of https://issues.apache.org/jira/browse/SPARK-44833 it is slightly more likely (though there is also a 50ms sleep before the retry, which again makes it unlikely).
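
The ordering problem described above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: Sender, Observer, and RaceSketch are hypothetical stand-ins rather than the actual Spark Connect classes, and the only behavior modeled is that the most recent attachConsumer call kicks out whichever sender was attached before it.

```scala
// Simplified model of the race; every name here is hypothetical and is NOT
// the real Spark Connect implementation.
final class Sender(val name: String) {
  // Set to true when another sender takes over the observer.
  var interrupted: Boolean = false
}

final class Observer {
  private var current: Option[Sender] = None

  // Assumed rule: the most recent attach wins and the previous sender is kicked out.
  def attachConsumer(sender: Sender): Unit = {
    current.foreach(s => s.interrupted = true)
    current = Some(sender)
  }

  def attached: Option[String] = current.map(s => s.name)
}

object RaceSketch {
  // Replays attach calls in the given order and returns
  // (name of the sender left attached, names of the senders that were kicked out).
  def replay(order: Seq[String]): (Option[String], Seq[String]) = {
    val observer = new Observer
    val senders = order.map(n => new Sender(n))
    senders.foreach(s => observer.attachConsumer(s))
    (observer.attached, senders.filter(s => s.interrupted).map(s => s.name))
  }

  def main(args: Array[String]): Unit = {
    // Intended order: ExecutePlan attaches first, the later ReattachExecute replaces it.
    println(replay(Seq("ExecutePlan", "ReattachExecute")))
    // Racy order from the description: ReattachExecute attaches first and is then
    // kicked out by the ExecutePlan that arrives at the attach step late.
    println(replay(Seq("ReattachExecute", "ExecutePlan")))
  }
}
```

In the racy order the model leaves ExecutePlan attached and marks ReattachExecute as interrupted, which corresponds to the newer RPC being the one that fails. The model deliberately ignores threads, gRPC, and the actual error class.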