Kudu / KUDU-1933

OpId index 32-bit overflow (was: Master crashes after too many TS re-registrations)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.1, 1.4.0
    • Component/s: consensus, master, tserver
    • Labels: None

      Description

      I had a cluster running mismatched versions within the 1.3 release line (something no one using released versions would see), and the tablet servers ended up constantly retrying their registration with the master. After a few days of this, the master crashed as follows:

      I0308 00:25:47.038650  7619 ts_descriptor.cc:125] Processing retry of TS registration from permanent_uuid: "d8009e07d82b4e66a7ab50f85e60bc30" instance_seqno: 1487888450146835
      I0308 00:25:47.038702  7619 ts_manager.cc:84] Re-registered known tserver with Master: d8009e07d82b4e66a7ab50f85e60bc30 (ve0136.halxg.cloudera.com:7050)
      I0308 00:25:47.043874  7616 ts_descriptor.cc:125] Processing retry of TS registration from permanent_uuid: "335d132897de4bdb9b87443f2c487a42" instance_seqno: 1487888474889244
      I0308 00:25:47.043912  7616 ts_manager.cc:84] Re-registered known tserver with Master: 335d132897de4bdb9b87443f2c487a42 (ve0126.halxg.cloudera.com:7050)
      I0308 00:25:47.108677  7617 ts_descriptor.cc:125] Processing retry of TS registration from permanent_uuid: "7425c65d80f54f2da0a85494a5eb3e68" instance_seqno: 1487888491433564
      I0308 00:25:47.108719  7617 ts_manager.cc:84] Re-registered known tserver with Master: 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050)
      I0308 00:25:47.111563  7611 ts_descriptor.cc:125] Processing retry of TS registration from permanent_uuid: "c108a85a68504c2bb9f49e4ee683d981" instance_seqno: 1487888392795318
      I0308 00:25:47.111604  7611 ts_manager.cc:84] Re-registered known tserver with Master: c108a85a68504c2bb9f49e4ee683d981 (ve0128.halxg.cloudera.com:7050)
      F0308 00:25:53.568773  7655 log_index.cc:171] Check failed: log_index > 0 (-2147483648 vs. 0) 
      

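      Ideally the master shouldn't crash, but it also looks like we're not handling log_index overflow: the failed check reports -2147483648, which is INT32_MIN, the value a signed 32-bit index wraps to once it passes 2147483647.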
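
      The -2147483648 in the failed check is INT32_MIN. As a minimal sketch of one way such a value can arise (not Kudu's actual code; NarrowUnchecked and NarrowChecked are hypothetical helpers made up for illustration), narrowing a 64-bit index to 32 bits without a range check wraps as soon as the index passes 2147483647:

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper, not taken from Kudu: narrows a 64-bit operation index
// to 32 bits with no range check. The unsigned-to-signed conversion is modular
// in C++20 and wraps the same way on mainstream compilers before that.
int32_t NarrowUnchecked(int64_t index) {
  return static_cast<int32_t>(static_cast<uint32_t>(index));
}

// Hypothetical checked alternative: refuses any index that is not a positive
// value representable in 32 bits, instead of silently wrapping.
bool NarrowChecked(int64_t index, int32_t* out) {
  if (index <= 0 || index > std::numeric_limits<int32_t>::max()) {
    return false;
  }
  *out = static_cast<int32_t>(index);
  return true;
}
```

      With these helpers, NarrowUnchecked(2147483648) yields -2147483648, the value reported by the failed check, while NarrowChecked(2147483648, &v) returns false and leaves the decision to the caller.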

              People

              • Assignee: Mike Percy (mpercy)
              • Reporter: Jean-Daniel Cryans (jdcryans)
