Kudu / KUDU-3521

Kudu servers sometimes crash when host clock is synchronized by PTPd

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.16.1, 1.18.0, 1.17.1
    • Component/s: None
    • Labels: None

    Description

      This issue was reported on the #kudu-general Slack channel. A Kudu server running version 1.16.0 (it is unclear whether it was kudu-master or kudu-tserver, but that does not matter here) crashed with the following error:

      F1024 22:32:06.866636 3323203 hybrid_clock.cc:452] Check failed: _s.ok() unable to get current timestamp with error bound: Service unavailable: clock error estimate (18446744073709551615us) too high (clock considered synchronized by the kernel)
      

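      From an analysis of the code in hybrid_clock.cc, the only way this can happen is if t.maxerror was negative (e.g., -1) in this code. The reported error estimate, 18446744073709551615, is 2^64 - 1, which is exactly what -1 becomes when converted to an unsigned 64-bit integer.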

      Negative values of the timex::maxerror field have never been observed when using ntpd or chronyd for clock synchronization, but the code needs to be updated to handle them: PTPd can set the maxerror field of the timex structure to a negative value and then call adjtimex(), as is evident from PTPd's code. The essence of the issue is that the Kudu code uses unsigned integers for the clock error, whereas timex.maxerror is a signed number, and at least PTPd sets it to a negative value when calling adjtimex(). Also, the documentation for adjtimex() nowhere states that the value of the maxerror field must be non-negative.
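
      The signed-to-unsigned conversion described above can be illustrated with a minimal sketch. NaiveErrorUs and CheckedErrorUs are hypothetical helpers written for this example only; they are not Kudu functions, and the actual fix in Kudu may look different. The sketch assumes only that maxerror is a signed long, as declared in struct timex.

```cpp
#include <cstdint>

// Hypothetical helper (not Kudu code): stores a signed maxerror value
// in an unsigned 64-bit integer without checking its sign, which is
// the pattern the description identifies as the essence of the issue.
inline uint64_t NaiveErrorUs(long maxerror) {
  // Conversion to an unsigned type is modulo 2^64, so -1 wraps around
  // to 18446744073709551615.
  return static_cast<uint64_t>(maxerror);
}

// Hypothetical helper (not Kudu code): rejects a negative maxerror
// instead of letting it wrap around. Returns false and leaves *error_us
// untouched when the value is negative.
inline bool CheckedErrorUs(long maxerror, uint64_t* error_us) {
  if (maxerror < 0) {
    return false;
  }
  *error_us = static_cast<uint64_t>(maxerror);
  return true;
}
```

      With these helpers, NaiveErrorUs(-1) yields 18446744073709551615, the value seen in the crash message, while CheckedErrorUs(-1, &v) returns false so the caller can decide how to treat a negative reading.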

      As a side note, there was a prior attempt to address this issue, but not enough evidence was presented for the RCA.

          People

            Assignee: Alexey Serbin (aserbin)
            Reporter: Alexey Serbin (aserbin)