Kudu / KUDU-3017

master crashes on attempt to replay orphaned ops in WAL, not reporting the root cause of the problem

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
    • Fix Version/s: 1.12.0
    • Component/s: master
    • Labels: None

    Description

      This bug is about misreporting the root cause of a problem: the error message is hard to correlate with the actual problem and with the phase of the process lifecycle in which it occurred. After analysis, it turned out to be just another manifestation/consequence of KUDU-3016.

      I saw master crashing with the following error reported in the log:

      F1206 01:32:15.488359 1324967 tablet_replica.cc:138] Check failed: state_ == SHUTDOWN || state_ == FAILED TabletReplica not fully shut down. State: BOOTSTRAPPING
      

      It's not easy to tell at what point of the master lifecycle it happened, but after looking around in the log and into the generated core file, it became clear that the problem was just a consequence of the conditions that triggered KUDU-3016 in the first place:

      Extra info from the log:

      I1206 01:32:15.419330 1324967 tablet_bootstrap.cc:439] T 00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8: Bootstrap complete.
      I1206 01:32:15.471163 1324967 raft_consensus.cc:340] T 00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8 [term 164 FOLLOWER]: Replica starting. Triggering 11 pending transactions. Active config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "77360e3dee9f4a748e75f830554326a8" member_type: VOTER last_known_addr { host: "master0" port: 7051 } } peers { permanent_uuid: "2a23cf2aee7549fbb63e6f8bcfb08cc3" member_type: VOTER last_known_addr { host: "master1" port: 7051 } } peers { permanent_uuid: "97326d428af84cf88d95eefe32eca0bd" member_type: VOTER last_known_addr { host: "master2" port: 7051 } }
      W1206 01:32:15.488217 1324967 transaction_tracker.cc:122] transaction on tablet 00000000000000000000000000000000 rejected due to memory pressure: the memory usage of this transaction (91215642) plus the current consumption (0) exceeds the transaction memory limit (67108864) or the limit of an ancestral memory tracker.
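
To put the numbers in the warning above into perspective, here is a small Python sketch that extracts them from the message and compares them. The regular expression and the helper name `parse_rejection` are illustrative assumptions and are not part of Kudu.

```python
import re

# Warning text copied from the log excerpt above (glog prefix omitted).
MSG = ("the memory usage of this transaction (91215642) plus the current "
       "consumption (0) exceeds the transaction memory limit (67108864)")

# Hypothetical helper: pull out the three byte counts reported in the message.
def parse_rejection(msg):
    pattern = (r"transaction \((\d+)\) plus the current consumption \((\d+)\) "
               r"exceeds the transaction memory limit \((\d+)\)")
    m = re.search(pattern, msg)
    if m is None:
        raise ValueError("unrecognized message format")
    usage, consumed, limit = (int(g) for g in m.groups())
    return usage, consumed, limit

usage, consumed, limit = parse_rejection(MSG)
MIB = 1024 * 1024
print(f"transaction: {usage / MIB:.1f} MiB")
print(f"limit:       {limit / MIB:.1f} MiB")
print(f"over by:     {(usage + consumed - limit) / MIB:.1f} MiB")
```

The reported limit is exactly 64 MiB, while the single transaction is roughly 87 MiB, so it is rejected even though the current consumption is 0.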
      

      See the attached file for the stack trace captured in the core file.
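
The manual correlation described above (finding what the crashing thread logged just before the fatal check) can be sketched in Python. This is only an illustration: the helper names `parse` and `suspects` are hypothetical, the regular expression assumes the glog line prefix seen in the excerpts (severity letter, mmdd, time, thread id, file:line), and the warning message in the sample is shortened.

```python
import re

# Assumed glog prefix: severity, mmdd, hh:mm:ss.uuuuuu, thread id, file:line.
GLOG = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+(\d+) ([^:\s]+):(\d+)\] (.*)$")

def parse(line):
    """Split one log line into its fields, or return None if it does not match."""
    m = GLOG.match(line.strip())
    if m is None:
        return None
    sev, date, time, tid, src, lineno, msg = m.groups()
    return {"sev": sev, "date": date, "time": time, "tid": tid,
            "src": src, "line": int(lineno), "msg": msg}

def suspects(lines):
    """Return W/E records written by the fatal thread before the first F record."""
    records = [r for r in map(parse, lines) if r is not None]
    fatal = next((r for r in records if r["sev"] == "F"), None)
    if fatal is None:
        return []
    found = []
    for r in records:
        if r is fatal:
            break
        if r["tid"] == fatal["tid"] and r["sev"] in ("W", "E"):
            found.append(r)
    return found

# Lines taken from this report, in chronological order (warning text shortened).
LOG = [
    "I1206 01:32:15.419330 1324967 tablet_bootstrap.cc:439] T 00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8: Bootstrap complete.",
    "W1206 01:32:15.488217 1324967 transaction_tracker.cc:122] transaction on tablet 00000000000000000000000000000000 rejected due to memory pressure",
    "F1206 01:32:15.488359 1324967 tablet_replica.cc:138] Check failed: state_ == SHUTDOWN || state_ == FAILED TabletReplica not fully shut down. State: BOOTSTRAPPING",
]

for r in suspects(LOG):
    print(r["time"], f'{r["src"]}:{r["line"]}', r["msg"])
```

On the lines from this report, the only candidate is the memory-pressure warning from transaction_tracker.cc, logged by the same thread immediately before the fatal check.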

      Attachments

        1. core.stack.xz
          1.0 kB
          Alexey Serbin

            People

              Assignee: Unassigned
              Reporter: Alexey Serbin (aserbin)
              Votes: 0
              Watchers: 4
