We are using Hadoop delegation token authentication functionality in Apache Solr. As part of the integration testing, I found following issue with the delegation token cancelation functionality.
Consider a setup with 2 Solr servers (S1 and S2) which are configured to use delegation token functionality backed by Zookeeper. Now invoke following steps,
[Step 1] Send a request to S1 to create a delegation token.
(Delegation token DT is created successfully)
[Step 2] Send a request to cancel DT to S2
(DT is canceled successfully. client receives HTTP 200 response)
[Step 3] Send a request to cancel DT to S2 again
(DT cancelation fails. client receives HTTP 404 response)
[Step 4] Send a request to cancel DT to S1
At this point we get two different responses.
- DT cancelation fails. client receives HTTP 404 response
- DT cancelation succeeds. client receives HTTP 200 response
Also as per the current implementation, each server maintains an in_memory cache of current tokens which is updated using the ZK watch mechanism. e.g. the ZK watch on S1 will ensure that the in_memory cache is synchronized after step 2.
After investigation, I found the root cause for this behavior is due to the race condition between step 4 and the firing of ZK watch on S1. Whenever the watch fires before the step 4 - we get HTTP 404 response (as expected). When that is not the case - we get HTTP 200 response along with following ERROR message in the log,
From client perspective, the server should return HTTP 404 error when the cancel request is sent out for an invalid token.
Ref: Here is the relevant Solr unit test for reference,