Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
Description
From the solr-user mailing list (Gael Jourdan-Weil):
https://www.mail-archive.com/solr-user@lucene.apache.org/msg151867.html
I think I've come down to the root cause of this mess in our case.
Everything is confirming that the TLOG state is "BUFFERING" rather than "ACTIVE".
1/ This can be seen with the metrics API as well where we observe:
"TLOG.replay.remaining.bytes":48997506,
"TLOG.replay.remaining.logs":1,
"TLOG.state":1,
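For reference, the numeric `TLOG.state` metric is an ordinal of Solr's `UpdateLog.State` enum. A minimal sketch of decoding it; the enum order below is my reading of the Solr source and should be verified against the version you run:

```python
# Decode the numeric "TLOG.state" metric into a human-readable name.
# The order below mirrors org.apache.solr.update.UpdateLog.State
# (REPLAYING, BUFFERING, APPLYING_BUFFERED, ACTIVE) -- verify against
# your Solr version before relying on it.
TLOG_STATES = ["REPLAYING", "BUFFERING", "APPLYING_BUFFERED", "ACTIVE"]

def tlog_state_name(ordinal: int) -> str:
    """Return the UpdateLog state name for a metrics-API ordinal."""
    return TLOG_STATES[ordinal]

metrics = {
    "TLOG.replay.remaining.bytes": 48997506,
    "TLOG.replay.remaining.logs": 1,
    "TLOG.state": 1,
}
print(tlog_state_name(metrics["TLOG.state"]))  # BUFFERING, not ACTIVE
```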
2/ When a hard commit occurs, we can see it in the logs and the index files are updated; but we can also see that the postCommit and preCommit UpdateLog methods are called but exit immediately, which, looking at the code, indicates the state is "BUFFERING".
So, why is this TLOG still in "BUFFERING" state?
From the code, the only place where state is set to "BUFFERING" seems to be UpdateLog.bufferUpdates.
From the logs, in our case it comes from recovery process. We see the message "Begin buffering updates. core=[col_blue_shard1]".
Just after we can see "Publishing state of core [col_blue_shard1] as recovering, leader is http://srv2/solr/col_blue_shard1/ and I am http://srv1/solr/col_blue_shard1/".
Up to this point everything is as expected, I guess, but why is the TLOG state not set to "ACTIVE" a bit later?
Well, "Begin buffering updates" occurred, and 500ms later we can see:
- "Updated live nodes from ZooKeeper... (2) -> (1)" (I think at this time we shut down srv2, this is our main cause of problem)
- "I am going to be the leader srv1"
- "Stopping recovery for core=[col_blue_shard1] coreNodeName=[core_node1]"
And 2s later:
- "Attempting to PeerSync from http://srv2/solr/es_blue_shard1/ - recoveringAfterStartup=[true]"
- "Error while trying to recover. core=es_blue_shard1:org.apache.solr.common.SolrException: Failed to get fingerprint from leader"
- "Finished recovery process, successful=[false]"
At this point, I think the root cause on our side is a rolling update that we did too quickly: we stopped node2 while node1 was still recovering from it.
It's still not clear how everything went back to the "active" state after such a failed recovery, with the TLOG still in "BUFFERING".
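A toy model of the state transitions makes the failure mode easier to see: buffering is entered when recovery begins, only a successful replay of the buffered updates returns the log to ACTIVE, and commits no-op while buffering. The method names below echo UpdateLog, but this is a simplified hypothetical sketch, not Solr's actual implementation:

```python
# Simplified model of the UpdateLog buffering life cycle (hypothetical
# sketch, not Solr code): a failed recovery never reaches the "apply
# buffered updates" step, so the log is stranded in BUFFERING and
# commit hooks exit immediately.
class ToyUpdateLog:
    def __init__(self):
        self.state = "ACTIVE"

    def buffer_updates(self):          # recovery starts: begin buffering
        self.state = "BUFFERING"

    def apply_buffered_updates(self):  # recovery succeeds: replay, reactivate
        self.state = "ACTIVE"

    def pre_commit(self):
        # Mirrors the observed behavior: commit hooks do nothing
        # while the log is buffering.
        if self.state != "ACTIVE":
            return "skipped (buffering)"
        return "committed"

log = ToyUpdateLog()
log.buffer_updates()          # "Begin buffering updates..."
# Recovery fails (PeerSync error) -> apply_buffered_updates never runs.
print(log.pre_commit())       # skipped (buffering)
```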
We shouldn't have been in recovery in the first place, and I think we know why; this is the first thing that we have addressed.
Then we need to add some pauses in our rolling update strategy.
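Rather than fixed pauses, one option is to gate each restart on the cluster actually being healthy. A hedged sketch that polls the Collections API CLUSTERSTATUS response before moving on to the next node; the base URL, collection name, and timeout are placeholders:

```python
import json
import time
import urllib.request

def all_replicas_active(cluster_status: dict, collection: str) -> bool:
    """True if every replica of `collection` reports state=active."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    return all(
        replica["state"] == "active"
        for shard in shards.values()
        for replica in shard["replicas"].values()
    )

def wait_until_healthy(base_url: str, collection: str, timeout_s: int = 300):
    """Poll CLUSTERSTATUS until all replicas are active (or time out)."""
    deadline = time.time() + timeout_s
    url = f"{base_url}/admin/collections?action=CLUSTERSTATUS&wt=json"
    while time.time() < deadline:
        with urllib.request.urlopen(url) as resp:
            status = json.load(resp)
        if all_replicas_active(status, collection):
            return True
        time.sleep(5)
    raise TimeoutError(f"{collection} not fully active after {timeout_s}s")

# Usage (placeholders): wait_until_healthy("http://srv1/solr", "col_blue")
```

Calling this between node restarts would have blocked the node2 shutdown until node1 finished recovering.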
Does it make sense? Can you think of anything else to check/improve?
Best Regards,
Gaël
Attachments
Issue Links
- relates to
-
SOLR-13486 race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestTlogReplayVsRecovery)
- Open
- links to