SOLR-14679

TLOGs grow forever, never get out of BUFFERING state


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version/s: 9.1, main (10.0), 8.11.3

    Description

      From the solr-user mailing list (Gael Jourdan-Weil):

      https://www.mail-archive.com/solr-user@lucene.apache.org/msg151867.html

      I think I've gotten down to the root cause of this mess in our case.

      Everything is confirming that the TLOG state is "BUFFERING" rather than "ACTIVE".
      1/ This can be seen with the metrics API as well where we observe:
      "TLOG.replay.remaining.bytes":48997506,
      "TLOG.replay.remaining.logs":1,
      "TLOG.state":1,
      2/ When a hard commit occurs, we can see it in the logs and the index files are updated; but we can also see that the UpdateLog preCommit and postCommit methods are called and exit immediately, which, looking at the code, indicates the state is "BUFFERING".
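
      For reference, the numeric TLOG.state value appears to be the ordinal of the
      UpdateLog.State enum, so 1 is BUFFERING (as observed above) and, under the
      assumed order below, 3 would be ACTIVE. A minimal sketch of that mapping
      (the enum order is assumed from the UpdateLog source, not reported here):

          // Illustration only: the metrics API reports the UpdateLog state as a number,
          // so "TLOG.state":1 is BUFFERING (assumed declaration order below).
          class TlogStateDecoder {
              enum TlogState { REPLAYING, BUFFERING, APPLYING_BUFFERED, ACTIVE }

              static TlogState decode(int metricValue) {
                  return TlogState.values()[metricValue]; // decode(1) -> BUFFERING
              }
          }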

      So, why is this TLOG still in "BUFFERING" state?

      From the code, the only place where state is set to "BUFFERING" seems to be UpdateLog.bufferUpdates.
      From the logs, in our case it comes from the recovery process. We see the message "Begin buffering updates. core=[col_blue_shard1]".
      Just after we can see "Publishing state of core [col_blue_shard1] as recovering, leader is http://srv2/solr/col_blue_shard1/ and I am http://srv1/solr/col_blue_shard1/".

      Up to this point everything is as expected, I guess, but why is the TLOG state not set to "ACTIVE" a bit later?
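
      To make that flow concrete, here is a rough sketch of the recovery sequence
      described above, assuming the usual UpdateLog calls (bufferUpdates,
      applyBufferedUpdates, dropBufferedUpdates); the wrapper method and
      replicateIndexFromLeader() are purely illustrative, not the actual
      RecoveryStrategy code:

          // Simplified sketch, not real Solr code: how recovery is expected to
          // move the tlog out of BUFFERING, and where it can get stuck.
          void recoverSketch(UpdateLog ulog) throws Exception {
              ulog.bufferUpdates();              // state -> BUFFERING; incoming updates pile up in the tlog
              try {
                  replicateIndexFromLeader();    // hypothetical step: copy the index while buffering
                  ulog.applyBufferedUpdates();   // replay the buffer; state should end up ACTIVE
              } catch (Exception e) {
                  // If recovery is stopped or fails and this cleanup never runs,
                  // the state stays BUFFERING and the tlog keeps growing.
                  ulog.dropBufferedUpdates();    // discard the buffer; state back to ACTIVE
              }
          }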

      Well, the "Begin buffering updates" occured and 500ms later we can see:

      • "Updated live nodes from ZooKeeper... (2) -> (1)" (I think at this time we shut down srv2, this is our main cause of problem)
      • "I am going to be the leader srv1"
      • "Stopping recovery for core=[col_blue_shard1] coreNodeName=[core_node1]"
        And 2s later:
      • "Attempting to PeerSync from http://srv2/solr/es_blue_shard1/ - recoveringAfterStartup=[true]"
      • "Error while trying to recover. core=es_blue_shard1:org.apache.solr.common.SolrException: Failed to get fingerprint from leader"
      • "Finished recovery process, successful=[false]"

      At this point, I think the root cause on our side is a rolling update that we did too quickly: we stopped node2 while node1 was still recovering from it.

      It's still not clear how everything went back to the "active" state after such a failed recovery, with the TLOG still in "BUFFERING".

      We shouldn't have been in recovery in the first place, and I think we know why; this is the first thing we have addressed.
      Then we need to add some pauses in our rolling update strategy.
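
      As for the pauses, one way to do it (a sketch only; the host, port, and the
      crude string check are assumptions, a real script should parse the JSON) is
      to poll the metrics API before restarting the next node and wait until no
      core on it still reports its tlog as BUFFERING:

          // Hypothetical rolling-update guard: block until no core on srv1
          // reports TLOG.state == 1 (BUFFERING) via the metrics API.
          import java.net.URI;
          import java.net.http.HttpClient;
          import java.net.http.HttpRequest;
          import java.net.http.HttpResponse;

          class WaitForActiveTlogs {
              public static void main(String[] args) throws Exception {
                  HttpClient http = HttpClient.newHttpClient();
                  HttpRequest req = HttpRequest.newBuilder(URI.create(
                          "http://srv1:8983/solr/admin/metrics?group=core&prefix=TLOG.state"))
                          .build();
                  while (true) {
                      String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
                      // Crude check for illustration; a proper script would parse the JSON.
                      if (!body.contains("\"TLOG.state\":1")) break;
                      Thread.sleep(5000); // wait and re-check before restarting the next node
                  }
              }
          }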

      Does it make sense? Can you think of anything else to check/improve?

      Best Regards,
      Gaël


            People

              Assignee: Houston Putman
              Reporter: Erick Erickson
              Votes: 3
              Watchers: 11


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m