Apache HAWQ / HAWQ-1326

Cancel the query earlier if one of the segments for the query crashes


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0.0-incubating
    • Component/s: None
    • Labels:
      None

      Description

      The QD thread can hang in its poll() loop when a segment crashes, for two reasons:

      1) The surviving segments may wait at the interconnect for the dead segment until the interconnect timeout expires (1 hour by default).
      2) In the QD thread, poll() does not detect that the node is down until the kernel's TCP keepalive probing kicks in. The keepalive timeout is long (2 hours by default on RHEL 6.x), and it can be configured only via procfs.

      A proper solution would be to use the RM heartbeat mechanism:

      The RM maintains a global list of node IDs (stable across node additions and removals) and keeps each node's health state up to date via a userspace heartbeat mechanism. We could therefore maintain a bitmap in shared memory that holds the latest node health information and consult it in the QD code, i.e. cancel the query if a segment node that handles part of the query is found to be down.
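
The bitmap idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that was committed for this issue: the names `NodeHealthBitmap`, `node_health_set`, `first_down_node`, and the `MAX_NODES` bound are all hypothetical, and a real implementation would place the structure in shared memory and protect updates with the server's own locking or atomic primitives.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical upper bound on the RM's global node IDs. */
#define MAX_NODES    1024
#define WORD_BITS    (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS ((MAX_NODES + WORD_BITS - 1) / WORD_BITS)

/* Hypothetical structure: one bit per node ID, set when the node is down.
 * In a real server this would live in shared memory and be written by the
 * RM whenever its heartbeat check changes a node's state. */
typedef struct NodeHealthBitmap
{
    unsigned long down[BITMAP_WORDS];
} NodeHealthBitmap;

/* Mark every node as healthy. */
static void node_health_init(NodeHealthBitmap *bm)
{
    memset(bm, 0, sizeof(*bm));
}

/* RM side: record the latest heartbeat result for one node. */
static void node_health_set(NodeHealthBitmap *bm, int node_id, bool is_down)
{
    assert(node_id >= 0 && node_id < MAX_NODES);
    size_t        word = (size_t) node_id / WORD_BITS;
    unsigned long mask = 1UL << ((size_t) node_id % WORD_BITS);

    if (is_down)
        bm->down[word] |= mask;
    else
        bm->down[word] &= ~mask;
}

/* QD side: true if the given node is currently marked down. */
static bool node_is_down(const NodeHealthBitmap *bm, int node_id)
{
    assert(node_id >= 0 && node_id < MAX_NODES);
    size_t        word = (size_t) node_id / WORD_BITS;
    unsigned long mask = 1UL << ((size_t) node_id % WORD_BITS);

    return (bm->down[word] & mask) != 0;
}

/* QD side: scan the nodes that run part of the query. Returns the ID of
 * the first node marked down, or -1 if all of them are healthy. */
static int first_down_node(const NodeHealthBitmap *bm,
                           const int *query_node_ids, size_t count)
{
    for (size_t i = 0; i < count; i++)
    {
        if (node_is_down(bm, query_node_ids[i]))
            return query_node_ids[i];
    }
    return -1;
}
```

Under these assumptions, the QD would call poll() with a finite timeout instead of blocking indefinitely, call `first_down_node()` each time poll() returns without activity, and cancel the query as soon as the result is not -1. The check costs a few word reads per node, so it can run on every wake-up without waiting for the interconnect or TCP keepalive timeouts.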

              People

              • Assignee:
                Paul Guo
                Reporter:
                Paul Guo
              • Votes:
                0
                Watchers:
                1
