Uploaded image for project: 'Kudu'
  1. Kudu
  2. KUDU-1670

Crash when bootstrapping recovery wal

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Duplicate
    • 1.0.0
    • n/a
    • tablet
    • None

    Description

      alter_table-randomized-itest turned up an interesting crash in the tablet bootstrap / wal recovery codepath. Relevant log lines (full log attached):

      I0930 09:20:45.003783 17173 ts_tablet_manager.cc:632] T fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Bootstrapping tablet
      I0930 09:20:45.008419 17173 tablet_bootstrap.cc:380] T fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Bootstrap starting.
      I0930 09:20:45.025218 17173 tablet_bootstrap.cc:542] T fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Time spent opening tablet: real 0.000s	user 0.001s	sys 0.000s
      I0930 09:20:45.025753 17173 tablet_bootstrap.cc:598] T fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Will attempt to recover log segment /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246/wal-000000001 to /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
      I0930 09:20:45.026008 17173 tablet_bootstrap.cc:606] T fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Moving log directory /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246 to recovery directory /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery in preparation for log replay
      W0930 09:20:45.026758 17173 log_util.cc:475] Log segment file /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001 has 12 initial NULL bytes instead of magic and header length: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 and will be treated as a blank segment.
      I0930 09:20:45.027081 17173 log_util.cc:312] Log segment /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001 has no footer. This segment was likely being written when the server previously shut down.
      I0930 09:20:45.027308 17173 log_reader.cc:151] Log segment /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.
      I0930 09:20:45.027604 17173 log_util.cc:575] Scanning /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001 for valid entry headers following offset 12...
      I0930 09:20:45.842016 17173 log_util.cc:612] Found no log entry headers
      I0930 09:20:45.873320 17173 log_util.cc:215] Ignoring log segment corruption in /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001 because there are no log entries following the corrupted one. The server probably crashed in the middle of writing an entry to the write-ahead log or downloaded an active log via tablet copy. Error detail: Corruption: CRC mismatch in log entry header: Log file corruption detected. Failed trying to read batch #0 at offset 0 for log segment /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001: Prior entries:
      I0930 09:20:45.873550 17173 log_util.cc:365] Successfully rebuilt footer for segment: /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001 (valid entries through byte offset 0)
      F0930 09:20:45.873761 17173 log_util.h:208] Check failed: header_.IsInitialized() 
      *** Check failure stack trace: ***
          @     0x7fa678822e0d  google::LogMessage::SendToLog() at ??:0
          @     0x7fa6788237e9  google::LogMessage::Flush() at ??:0
          @     0x7fa67882722b  google::LogMessageFatal::~LogMessageFatal() at ??:0
          @     0x7fa67f074331  kudu::log::ReadableLogSegment::header() at ??:0
          @     0x7fa67d623e37  kudu::log::LogReader::Init() at ??:0
          @     0x7fa67d624488  kudu::log::LogReader::OpenFromRecoveryDir() at ??:0
          @     0x7fa67dc38dcc  kudu::tablet::TabletBootstrap::OpenLogReaderInRecoveryDir() at ??:0
          @     0x7fa67dc371fa  kudu::tablet::TabletBootstrap::Bootstrap() at ??:0
          @     0x7fa67dc36c6d  kudu::tablet::BootstrapTablet() at ??:0
          @     0x7fa67f0a1d98  kudu::tserver::TSTabletManager::OpenTablet() at ??:0
          @     0x7fa67f0abeb8  boost::_mfi::mf2<>::operator()() at ??:0
          @     0x7fa67f0abe04  boost::_bi::list3<>::operator()<>() at ??:0
          @     0x7fa67f0abd74  boost::_bi::bind_t<>::operator()() at ??:0
          @     0x7fa67f0abb12  boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
          @     0x7fa67ca1f432  boost::function0<>::operator()() at ??:0
          @     0x7fa6790744fe  kudu::FunctionRunnable::Run() at ??:0
          @     0x7fa6790732de  kudu::ThreadPool::DispatchThread() at ??:0
          @     0x7fa679075b9f  boost::_mfi::mf1<>::operator()() at ??:0
          @     0x7fa679075afc  boost::_bi::list2<>::operator()<>() at ??:0
          @     0x7fa679075a74  boost::_bi::bind_t<>::operator()() at ??:0
          @     0x7fa679075872  boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
          @     0x7fa67ca1f432  boost::function0<>::operator()() at ??:0
          @     0x7fa67906cc55  kudu::Thread::SuperviseThread() at ??:0
          @           0x42506c  __tsan_thread_start_func at ??:0
          @       0x34186079d1  (unknown) at ??:0
          @       0x34182e88fd  (unknown) at ??:0
          @              (nil)  (unknown)
      

      This is my first time spelunking through this code, but it seems we have some broken invariants in this case. We have a ReadableLogSegment with an uninitialized header, but with its is_initialized_ flag set to true. The following comment from log_util.cc:388 leads me to believe this is not expected:

          // If a log file has been pre-allocated but not initialized, then
          // 'header_size' will be 0 even the file size is > 0; in this
          // case, 'is_initialized_' remains set to false and return
          // Status::OK() early. LogReader ignores segments where
          // IsInitialized() returns false.
      

      You can see from the warning log lines above that we are indeed in the case where the magic bytes are all 0. By reading through the code it's clear to me how this happened (is_initalized_ gets set at log_util.cc:321), but I'm not quite familiar with the code enough to figure out how to fix it, or what the relevant tests should be. The stack at the point where this invariant is broken is (growing down):

      ReadableLogSegment::Init() // No arg version, eventually sets is_initialized_=true since none of the initialization steps fail
      ReadableLogSegment::ReadHeader // Returns Status::OK back to Init, because header_size==0
      ReadableLogSegment::ReadHeaderMagicAndHeaderLength // Returns Status::OK
      ReadableLogSegment::ParseHeaderMagicAndHeaderLength // Returns Status::OK, because magic bytes are all 0, sets parsed_len=0 (parsed_len is the same as header_size in the ReadHeader frame)
      

      Attachments

        Issue Links

          Activity

            People

              tlipcon Todd Lipcon
              danburkert Dan Burkert
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: