Started looking more into the time-out issues and discovered that I'm not setting --batchTimeout, so it is defaulting to Long.MAX_VALUE. That means the client will wait indefinitely, and the tablet server will never be able to safely abort the operation. At least, I hope that I'm reading these properties correctly.
The longest hold time that I see in the logs is 170s, and my general.rpc.timeout is 120s, so the HoldTimeoutException makes sense as a sanity check. Doubling the wait time on the tserver would likely solve the issue here, yes, but I'm worried that's not a stable approach.
The remaining question is how best to keep ingest clients from dying. There may be different short-term and long-term solutions that are appropriate.
Short term: Let ingest retry the failed mutations. If the tablet server rejects a mutation for whatever reason, and the client is aware of it, then the rejection wouldn't constitute data loss, since the client can resubmit it.
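A rough sketch of what the short-term retry could look like on the ingest side. This is only an illustration: RetryingIngest, writeWithRetry, and the tryWrite predicate are hypothetical names, not part of the Accumulo API, and tryWrite stands in for whatever call submits a mutation and tells the client it was rejected. The attempt limit and backoff values are assumptions as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from the Accumulo API.
final class RetryingIngest {

    /**
     * Makes up to maxAttempts passes over the records. On each pass, every
     * pending record is handed to tryWrite, which returns false when the
     * write was rejected. Rejected records are carried into the next pass
     * after a sleep that doubles each time. Returns whatever is still
     * rejected after the final pass, so the caller can log or persist it.
     */
    static <T> List<T> writeWithRetry(List<T> records, Predicate<T> tryWrite,
                                      int maxAttempts, long initialBackoffMillis) {
        List<T> pending = new ArrayList<>(records);
        long backoff = initialBackoffMillis;
        for (int attempt = 1; attempt <= maxAttempts && !pending.isEmpty(); attempt++) {
            List<T> rejected = new ArrayList<>();
            for (T record : pending) {
                if (!tryWrite.test(record)) {
                    rejected.add(record);
                }
            }
            pending = rejected;
            if (!pending.isEmpty() && attempt < maxAttempts) {
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException e) {
                    // Preserve the interrupt and stop retrying.
                    Thread.currentThread().interrupt();
                    break;
                }
                backoff *= 2;
            }
        }
        return pending;
    }
}
```

The point of returning the leftovers instead of throwing is that the ingest client stays alive and can decide what to do with mutations that never made it in.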
Long term: Provide an API for clients to specify their wait time to the servers. Then, tablet servers can more intelligently decide when to abort scans and when not to. We would need to protect the cluster from several poorly configured clients asking for very long hold times, and there may be other availability issues as well.
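One way the server-side protection might work is to cap whatever the client asks for. The sketch below is purely hypothetical: HoldTimePolicy, effectiveHoldMillis, and the default and maximum parameters do not exist in Accumulo today, and the rule of treating a non-positive request as "no preference" is my assumption.

```java
// Hypothetical sketch of a server-side cap on client-requested hold times.
final class HoldTimePolicy {

    /**
     * Returns the hold time the server would actually honor.
     * A non-positive request is treated as "client did not specify" and
     * falls back to the server default; anything else is capped at the
     * server maximum so one misconfigured client cannot hold indefinitely.
     */
    static long effectiveHoldMillis(long requestedMillis,
                                    long serverDefaultMillis,
                                    long serverMaxMillis) {
        if (requestedMillis <= 0) {
            return Math.min(serverDefaultMillis, serverMaxMillis);
        }
        return Math.min(requestedMillis, serverMaxMillis);
    }
}
```

With a 120s default and, say, a 300s cap, a client that asks for Long.MAX_VALUE would be held for at most 300s, while a client asking for 170s would get its 170s.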