KUDU-950: Possible race in Master lifecycle

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Public beta
    • Fix Version/s: 1.0.0
    • Component/s: master
    • Labels: None

    Description

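      There appears to be a startup race in the master: the RPC server starts accepting requests before the CatalogManager has initialized sys_catalog_, so a consensus request from a peer master that arrives during that window trips the CHECK in CatalogManager::GetTabletPeer() shown in the log below. We should likely implement the same bind-then-listen startup logic on the master that the tablet server (TS) uses to protect against this.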
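
      A minimal sketch of the bind-then-listen idea, assuming plain POSIX TCP sockets rather than Kudu's actual RPC classes. The names ToyCatalog, BindOnly, StartListening, and TryConnect are hypothetical and exist only for illustration. The point it tries to show: bind() reserves the address early, but peers cannot connect until listen() is called, so listen() can be deferred until the catalog state is ready.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <cstdint>

// Hypothetical stand-in for the master's catalog state (sys_catalog_ in the log).
struct ToyCatalog {
  bool initialized = false;
};

// Step 1: reserve the RPC address. After bind() the port belongs to this
// process, but connections are refused because listen() has not been called.
// Returns the socket fd, or -1 on error; *port receives the kernel-chosen port.
int BindOnly(uint16_t* port) {
  assert(port != nullptr);
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0;  // let the kernel pick a free port
  socklen_t len = sizeof(addr);
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
      getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
    close(fd);
    return -1;
  }
  *port = ntohs(addr.sin_port);
  return fd;
}

// Step 2 (not shown): initialize everything request handlers depend on.

// Step 3: start accepting connections only once the catalog is initialized.
// Returns false without listening if initialization has not completed.
bool StartListening(int fd, const ToyCatalog& catalog) {
  if (!catalog.initialized) return false;
  return listen(fd, 16) == 0;
}

// Client-side probe: returns 0 if a TCP connection to 127.0.0.1:port
// succeeds, otherwise the errno reported by socket() or connect().
int TryConnect(uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return errno;
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(port);
  int rc = 0;
  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    rc = errno;  // capture before close() can overwrite it
  }
  close(fd);
  return rc;
}
```

      In this sketch, a peer that calls TryConnect() between BindOnly() and StartListening() is refused at the TCP level and can retry later, instead of reaching a handler that touches uninitialized state.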

      I saw this failure in external_mini_cluster-test on gerrit @ http://sandbox.jenkins.cloudera.com/job/kudu-gerrit/9359/BUILD_TYPE=RELEASE,label=kudu-gerrit-slaves/:

      I0807 02:53:37.758509 32373 external_mini_cluster.cc:508] Running /data1/jenkins-workspace/kudu-gerrit/BUILD_TYPE/RELEASE/label/kudu-gerrit-slaves/build/release/kudu-master
      kudu-master
      --master_wal_dir=/data1/test-tmp/external_mini_cluster-test.EMCTest.TestBasicOperation.1438941209613987-32373/minicluster-data/master-0
      --master_data_dirs=/data1/test-tmp/external_mini_cluster-test.EMCTest.TestBasicOperation.1438941209613987-32373/minicluster-data/master-0
      --master_rpc_bind_addresses=127.0.0.1:11010
      --webserver_interface=localhost
      --master_web_port=40946
      --metrics_log_interval_ms=1000
      --log_dir=/data1/test-tmp/external_mini_cluster-test.EMCTest.TestBasicOperation.1438941209613987-32373/minicluster-data/master-0
      --master_addresses=127.0.0.1:11010,127.0.0.1:11011,127.0.0.1:11012
      --enable_leader_failure_detection=true
      --server_dump_info_path=/data1/test-tmp/external_mini_cluster-test.EMCTest.TestBasicOperation.1438941209613987-32373/minicluster-data/master-0/info.pb
      --server_dump_info_format=pb
      --logtostderr
      --logbuflevel=-1
      I0807 02:53:37.772274   453 mem_tracker.cc:98] MemTracker: hard memory limit is 23.515860 GB
      I0807 02:53:37.772506   453 mem_tracker.cc:100] MemTracker: soft memory limit is 14.109516 GB
      I0807 02:53:37.774209   453 master_main.cc:27] Initializing master server...
      I0807 02:53:37.776566   453 fs_manager.cc:200] Opened local filesystem: /data1/test-tmp/external_mini_cluster-test.EMCTest.TestBasicOperation.1438941209613987-32373/minicluster-data/master-0
      uuid: "5add707ecfe54a4d8cd9bde9768ebe8f"
      format_stamp: "Formatted at 2015-08-07 09:53:29 on boost-static-burst-slave-0b55.vpc.cloudera.com"
      I0807 02:53:37.779443   453 hybrid_clock.cc:122] HybridClock initialized. Resolution in nanos?: 1 Wait times tolerance adjustment: 1.0005 Current error: 478950
      I0807 02:53:37.779597   453 master_main.cc:30] Starting Master server...
      I0807 02:53:37.782593   453 rpc_server.cc:125] RPC server started. Bound to: 127.0.0.1:11010
      I0807 02:53:37.782708   453 webserver.cc:121] Starting webserver on localhost:40946
      I0807 02:53:37.782762   453 webserver.cc:130] Document root disabled
      I0807 02:53:37.783236   453 webserver.cc:213] Webserver started. Bound to: http://127.0.0.1:40946/
      F0807 02:53:37.799149   472 catalog_manager.cc:1513] Check failed: sys_catalog_.get() != NULL sys_catalog_ must be initialized!
      *** Check failure stack trace: ***
          @           0x72776d  google::LogMessage::Fail() at /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:234
          @           0x72bc4d  google::LogMessage::SendToLog() at /usr/include/boost/random/mersenne_twister.hpp:251
          @           0x729abb  google::LogMessage::Flush() at /data1/jenkins-workspace/kudu-gerrit/BUILD_TYPE/RELEASE/label/kudu-gerrit-slaves/thirdparty/installed/include/boost/uuid/seed_rng.hpp:81
          @           0x729de1  google::LogMessageFatal::~LogMessageFatal() at /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_ios.h:452
          @           0x702bdb  kudu::master::CatalogManager::GetTabletPeer() at /data1/jenkins-workspace/kudu-gerrit/BUILD_TYPE/RELEASE/label/kudu-gerrit-slaves/src/kudu/gutil/ref_counted.h:277
          @           0x76d784  kudu::tserver::ConsensusServiceImpl::UpdateConsensus() at /data1/jenkins-workspace/kudu-gerrit/BUILD_TYPE/RELEASE/label/kudu-gerrit-slaves/thirdparty/gperftools-2.2.1/src/heap-profiler.cc:322
          @          0x1164edf  kudu::consensus::ConsensusServiceIf::Handle() at ??:0
          @          0x11ce978  kudu::rpc::ServicePool::RunThread() at ??:0
          @          0x128b90f  kudu::Thread::SuperviseThread() at ??:0
          @     0x7f675d918851  start_thread at ??:0
          @     0x7f675cb8a94d  clone at ??:0
          @              (nil)  (unknown)
      W0807 02:53:38.121330 32475 consensus_peers.cc:247] T 00000000000000000000000000000000 P 82f86f33a42a4b14a8b7f1ce307e0976 -> Peer 5add707ecfe54a4d8cd9bde9768ebe8f (127.0.0.1:11010): Couldn't send request to peer 5add707ecfe54a4d8cd9bde9768ebe8f for tablet 00000000000000000000000000000000 Status: Network error: Recv() got EOF from remote (error 108). Retrying in the next heartbeat period. Already tried 1 times.
      /data1/jenkins-workspace/kudu-gerrit/BUILD_TYPE/RELEASE/label/kudu-gerrit-slaves/src/kudu/integration-tests/external_mini_cluster-test.cc:92: Failure
      Failed
      Bad status: Runtime error: Process exited with rc=134: /data1/jenkins-workspace/kudu-gerrit/BUILD_TYPE/RELEASE/label/kudu-gerrit-slaves/build/release/kudu-master
      I0807 02:53:38.127594 32373 external_mini_cluster.cc:597] Killing /data1/jenkins-workspace/kudu-gerrit/BUILD_TYPE/RELEASE/label/kudu-gerrit-slaves/build/release/kudu-master with pid 32422
      I0807 02:53:38.130091 32373 external_mini_cluster.cc:597] Killing /data1/jenkins-workspace/kudu-gerrit/BUILD_TYPE/RELEASE/label/kudu-gerrit-slaves/build/release/kudu-master with pid 32467
      I0807 02:53:38.135949 32373 external_mini_cluster.cc:597] Killing /data1/jenkins-workspace/kudu-gerrit/BUILD_TYPE/RELEASE/label/kudu-gerrit-slaves/build/release/kudu-tablet_server with pid 32512
      I0807 02:53:38.139133 32373 external_mini_cluster.cc:597] Killing /data1/jenkins-workspace/kudu-gerrit/BUILD_TYPE/RELEASE/label/kudu-gerrit-slaves/build/release/kudu-tablet_server with pid 32660
      I0807 02:53:38.141916 32373 external_mini_cluster.cc:597] Killing /data1/jenkins-workspace/kudu-gerrit/BUILD_TYPE/RELEASE/label/kudu-gerrit-slaves/build/release/kudu-tablet_server with pid 321
      I0807 02:53:38.145797 32373 test_util.cc:56] -----------------------------------------------
      I0807 02:53:38.145820 32373 test_util.cc:57] Had fatal failures, leaving test files at /data1/test-tmp/external_mini_cluster-test.EMCTest.TestBasicOperation.1438941209613987-32373
      [  FAILED  ] EMCTest.TestBasicOperation (8531 ms)
      

          People

            Assignee: Mike Percy (mpercy)
            Reporter: Mike Percy (mpercy)