
Master crashes in TableInfo::GetTabletsInRange


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.8.0
    • Component/s: master
    • Labels: None
    • Target Version/s:

      Description

      Also on the YCSB cluster, the master crashed within a minute of three tservers crashing. The stack trace:

      (gdb) bt
      #0  NoBarrier_AtomicIncrement (this=0x1) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/gutil/atomicops-internals-x86.h:122
      #1  RefCountIncN (this=0x1) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/gutil/atomic_refcount.h:59
      #2  RefCountInc (this=0x1) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/gutil/atomic_refcount.h:78
      #3  kudu::subtle::RefCountedThreadSafeBase::AddRef (this=0x1) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/gutil/ref_counted.cc:76
      #4  0x0000000000782743 in AddRef (this=0x4bfd980, req=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/gutil/ref_counted.h:137
      #5  scoped_refptr (this=0x4bfd980, req=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/gutil/ref_counted.h:234
      #6  make_scoped_refptr<kudu::master::TabletInfo> (this=0x4bfd980, req=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/gutil/ref_counted.h:335
      #7  kudu::master::TableInfo::GetTabletsInRange (this=0x4bfd980, req=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/master/catalog_manager.cc:3257
      #8  0x00000000007912c7 in kudu::master::CatalogManager::GetTableLocations (this=0x3ce8140, req=0x4a5a780, resp=0x47b2c80)
          at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/master/catalog_manager.cc:3047
      #9  0x00000000007594fb in kudu::master::MasterServiceImpl::GetTableLocations (this=0x3cea600, req=0x4a5a780, resp=0x47b2c80, rpc=0x409a960)
          at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/master/master_service.cc:317
      #10 0x00000000016b3cac in kudu::master::MasterServiceIf::Handle (this=0x3cea600, call=0x4656900)
          at /usr/src/debug/kudu-0.7.0-kudu0.7.0/build/release/src/kudu/master/master.service.cc:236
      #11 0x00000000016c8998 in kudu::rpc::ServicePool::RunThread (this=0x3ce8dc0) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/rpc/service_pool.cc:174
      #12 0x00000000017c51da in operator() (arg=0x3c96f70) at /opt/toolchain/boost-pic-1.55.0/include/boost/function/function_template.hpp:767
      #13 kudu::Thread::SuperviseThread (arg=0x3c96f70) at /usr/src/debug/kudu-0.7.0-kudu0.7.0/src/kudu/util/thread.cc:580
      #14 0x00007fa169e3f9d1 in start_thread () from /lib64/libpthread.so.0
      #15 0x00007fa1694e68fd in clone () from /lib64/libc.so.6
      

      It appears that the TabletInfo pointer retrieved from the map iterator has a value of 0x1, which suggests heap corruption at some earlier point. At the time, the master was busy trying (and failing) to create replicas:

      W0212 16:23:13.523007 57339 catalog_manager.cc:2561] Tablet 807cfe95b1024921bdacfc26e2451ab2 (table ycsb-1455322959 [id=441e8abd9feb4366939e9d0b3dd03eb1]) was not created within the allowed timeout. Replacing with a new tablet 8152b1bc19604cbbbfcd8b2200534de6
      W0212 16:23:13.523193 57339 catalog_manager.cc:2561] Tablet 6345d1a97c704a64842c84d68771828c (table ycsb-1455322959 [id=441e8abd9feb4366939e9d0b3dd03eb1]) was not created within the allowed timeout. Replacing with a new tablet 10ab5e4c146b4348b5bbe64659322cf7
      W0212 16:23:13.523385 57339 catalog_manager.cc:2561] Tablet 759cb779aa0b4ddbac460a9226ee47e7 (table ycsb-1455322959 [id=441e8abd9feb4366939e9d0b3dd03eb1]) was not created within the allowed timeout. Replacing with a new tablet d0b93ffed8ec4429b25fef6b46b371cc
      W0212 16:23:13.523569 57339 catalog_manager.cc:2561] Tablet b0c8b5e4368b488897660c44fa6ffd5b (table ycsb-1455322959 [id=441e8abd9feb4366939e9d0b3dd03eb1]) was not created within the allowed timeout. Replacing with a new tablet dbb35c940ecd4fcbae9885e5b49f3c7a
      W0212 16:23:13.523756 57339 catalog_manager.cc:2561] Tablet 8c078b47342840adbacb048a2cdab552 (table ycsb-1455322959 [id=441e8abd9feb4366939e9d0b3dd03eb1]) was not created within the allowed timeout. Replacing with a new tablet 5c8dcefa3d0e455eaefdc7ae6b1456d9
      W0212 16:23:13.524001 57339 catalog_manager.cc:2737] Aborting the current task due to error: Invalid argument: An error occured while selecting replicas for tablet ac37cad546b8480dbf23d1c8d6af49b6: Invalid argument: Not enough tablet servers are online for table 'ycsb-1455322959'. Need at least 3 replicas, but only 2 tablet servers are available: Not enough tablet servers are online for table 'ycsb-1455322959'. Need at least 3 replicas, but only 2 tablet servers are available
      E0212 16:23:13.524375 57339 catalog_manager.cc:370] Error processing pending assignments, aborting the current task: Invalid argument: An error occured while selecting replicas for tablet ac37cad546b8480dbf23d1c8d6af49b6: Invalid argument: Not enough tablet servers are online for table 'ycsb-1455322959'. Need at least 3 replicas, but only 2 tablet servers are available: Not enough tablet servers are online for table 'ycsb-1455322959'. Need at least 3 replicas, but only 2 tablet servers are available
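
Frames #3 through #7 of the backtrace suggest the following shape for the crashing path: a range lookup walks a map of tablets and takes a reference on each raw pointer it finds, so a map slot holding a bogus value such as 0x1 would fault inside AddRef. The C++ sketch below is only an illustration under assumptions. FakeTabletInfo, TabletMap, and GetTabletsInRangeSketch are hypothetical names, the map keyed by partition start key is assumed, and none of this is Kudu's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a ref-counted tablet object (not Kudu's class).
class FakeTabletInfo {
 public:
  explicit FakeTabletInfo(std::string id) : id_(std::move(id)) {}
  // Touches memory through `this`; if `this` were garbage such as 0x1,
  // this is where a crash like the one in frame #3 would surface.
  void AddRef() { ++refs_; }
  void Release() { --refs_; }
  int refs() const { return refs_; }
  const std::string& id() const { return id_; }

 private:
  std::string id_;
  int refs_ = 0;
};

// Assumed layout: partition start key -> raw tablet pointer.
using TabletMap = std::map<std::string, FakeTabletInfo*>;

// Returns up to max_tablets tablets, starting with the one whose range
// contains start_key, and takes a reference on each of them.
std::vector<FakeTabletInfo*> GetTabletsInRangeSketch(
    const TabletMap& tablets, const std::string& start_key,
    std::size_t max_tablets) {
  std::vector<FakeTabletInfo*> out;
  auto it = tablets.upper_bound(start_key);
  if (it != tablets.begin()) {
    --it;  // step back to the tablet that contains start_key
  }
  for (; it != tablets.end() && out.size() < max_tablets; ++it) {
    FakeTabletInfo* tablet = it->second;
    tablet->AddRef();  // a corrupted map value would be dereferenced here
    out.push_back(tablet);
  }
  return out;
}
```

In this sketch the lookup itself never validates the stored pointers, which is consistent with a crash that only shows up when the reference count is incremented rather than when the map was damaged.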
      

      Todd and I spent some time looking at the code but couldn't figure out what would cause such a crash. At first we suspected the downgrade of the TableMetadataLock from WRITE to READ in commit a80b83f, but we found no evidence that it is a problem. Todd also suspects that a change in the number of live tservers between table creation and replica selection could be responsible, since that particular scenario lacks test coverage. We concluded that the next step is to reproduce the crash in an integration test.
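
The untested scenario, where the set of live tservers shrinks between table creation and replica selection, can be pictured with a small C++ sketch. Everything here is hypothetical: SelectResult and SelectReplicasSketch are made-up names, and the only thing taken from the issue is the wording of the error in the master log above. It shows the failure mode only and does not attempt to reproduce the crash.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct SelectResult {
  bool ok;
  std::string error;
  std::vector<std::string> replicas;
};

// Picks the first num_replicas live tservers, or fails if too few are live.
SelectResult SelectReplicasSketch(const std::vector<std::string>& live_tservers,
                                  std::size_t num_replicas) {
  if (live_tservers.size() < num_replicas) {
    return {false,
            "Need at least " + std::to_string(num_replicas) +
                " replicas, but only " + std::to_string(live_tservers.size()) +
                " tablet servers are available",
            {}};
  }
  return {true, "",
          std::vector<std::string>(live_tservers.begin(),
                                   live_tservers.begin() + num_replicas)};
}
```

A test for this scenario would call the selection once with three live tservers (as at table creation) and again with two (as after a tserver crash), and check that the master handles the second failure cleanly.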

      I'm filing this against 0.8.0: without a repro case or a fix in hand, it's hard to see how this could be resolved in time for 0.7.0.


            People

            • Assignee: Adar Dembo (adar)
            • Reporter: Adar Dembo (adar)