Hive / HIVE-22687

Query hangs indefinitely if LLAP daemon registers after the query is submitted

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 4.0.0-alpha-1
    • Component/s: llap
    • Labels: None

    Description

      If a query is submitted while no LLAP daemon is running, it waits for 1 minute and then times out with the error SERVICE_UNAVAILABLE.
      If a new LLAP daemon starts during that wait, the timeout is cancelled but the tasks are still not scheduled. As a result, the query hangs indefinitely.
      This is caused by a race condition: the LLAP daemon first registers the LLAP instance at .../workers/worker-0000 and only afterwards registers .../workers/slot-0000. In the gap between the two registrations, the Tez AM is notified of the worker ZooKeeper node, and while processing it checks whether the slot ZooKeeper node is present. If it is not, the Tez AM rejects the LLAP daemon. The error in the Tez AM is:

      [INFO] [LlapScheduler] |impl.LlapZookeeperRegistryImpl|: Unknown slot for 8ebfdc45-0382-4757-9416-52898885af90
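
The ordering problem described above can be reproduced with a small, self-contained simulation. This is only an illustrative sketch, not the code from the attached patches: the `SlotRaceSketch` class, its `Listener`, and the `deferUntilSlotAppears` flag are hypothetical names, a plain in-memory set stands in for the ZooKeeper namespace, and deferring the worker until its slot node appears is just one possible way to close the gap, assumed here for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SlotRaceSketch {
    // Node names from the description; the parent path is elided there and is not modelled.
    static final String WORKER = "worker-0000";
    static final String SLOT = "slot-0000";

    /** Hypothetical scheduler-side listener; it is not the real LlapZookeeperRegistryImpl. */
    static class Listener {
        final boolean deferUntilSlotAppears;
        final List<String> accepted = new ArrayList<>();
        // Workers seen before their slot node, keyed by the slot they are waiting for.
        final Map<String, String> pendingBySlot = new HashMap<>();

        Listener(boolean deferUntilSlotAppears) {
            this.deferUntilSlotAppears = deferUntilSlotAppears;
        }

        void onWorkerAdded(Set<String> nodes, String worker, String slot) {
            if (nodes.contains(slot)) {
                accepted.add(worker);
            } else if (deferUntilSlotAppears) {
                pendingBySlot.put(slot, worker);
            }
            // Otherwise the worker is dropped, which corresponds to the "Unknown slot" case.
        }

        void onSlotAdded(String slot) {
            String worker = pendingBySlot.remove(slot);
            if (worker != null) {
                accepted.add(worker);
            }
        }
    }

    /** Replays the registration order from the description and returns the accepted workers. */
    static List<String> run(boolean deferUntilSlotAppears) {
        Set<String> nodes = new HashSet<>();
        Listener listener = new Listener(deferUntilSlotAppears);

        nodes.add(WORKER);                            // daemon registers the worker node first
        listener.onWorkerAdded(nodes, WORKER, SLOT);  // notification arrives inside the gap
        nodes.add(SLOT);                              // slot node is registered afterwards
        listener.onSlotAdded(SLOT);

        return listener.accepted;
    }

    public static void main(String[] args) {
        System.out.println("reject on missing slot:   " + run(false));
        System.out.println("defer until slot appears: " + run(true));
    }
}
```

With `run(false)` the worker notification is processed before the slot node exists, so the daemon is never accepted and nothing is left to schedule on. With `run(true)` the same notification is parked and completed once the slot node shows up.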

      Attachments

        1. HIVE-22687.01.patch
          2 kB
          Himanshu Mishra
        2. HIVE-22687.02.patch
          2 kB
          Himanshu Mishra

          People

            Assignee: Himanshu Mishra (himanshum)
            Reporter: Himanshu Mishra (himanshum)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: