Hive / HIVE-7926 (long-lived daemons for query fragment execution, I/O and caching) / HIVE-10648

LLAP: registry; Tez attempted to schedule to daemon that didn't exist


    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: llap
    • Labels: None

      Description

      I can post the logs externally; for now, the app IDs on the test cluster are application_1429683757595_0784 and application_1429683757595_0783. I also have the logs copied over.
      The AM found the node (the logs look the same for the other nodes):

      2015-05-07 12:13:28,074 INFO [ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerEventHandler] impl.LlapYarnRegistryImpl: Adding new worker 342f4992-2608-43ab-a119-b50882e35f75 which mapped to DynamicServiceInstance [alive=true, host=cn059-10.l42scl.hortonworks.com:15001 with resources=<memory:20480, vCores:6>]
      ....
      2015-05-07 12:13:28,082 INFO [Dispatcher thread: Central] node.AMNodeTracker: Num cluster nodes = 19
      

      The trouble is that this node never actually existed; the cluster only had 15 nodes.
      As the job progressed, the AM repeatedly tried to schedule to this node and failed. No other LLAP cluster was running at the same time.
      In fact, given that I always start a 15-node cluster, I am not sure where the 19-node data could conceivably come from.
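
One way to narrow this down is to list every host the AM registry added and compare that list against the hosts that were actually started. The following is a minimal Python sketch and is not part of Hive or Tez. The regular expression assumes the "Adding new worker" log line format shown above; `registered_hosts`, `unexpected_hosts`, and the expected host list are hypothetical names introduced for illustration.

```python
import re

# Assumption: matches the "Adding new worker" AM log line format quoted above.
WORKER_RE = re.compile(
    r"Adding new worker (?P<worker>\S+) which mapped to "
    r"DynamicServiceInstance \[alive=(?P<alive>\w+), "
    r"host=(?P<host>[^\s:]+):(?P<port>\d+)"
)


def registered_hosts(log_lines):
    """Return {host: set of worker ids} for every worker the registry added."""
    hosts = {}
    for line in log_lines:
        match = WORKER_RE.search(line)
        if match:
            hosts.setdefault(match.group("host"), set()).add(match.group("worker"))
    return hosts


def unexpected_hosts(log_lines, expected_hosts):
    """Return the sorted hosts seen in the log that are not in expected_hosts."""
    return sorted(set(registered_hosts(log_lines)) - set(expected_hosts))
```

Feeding the AM log through `unexpected_hosts` together with the list of the 15 hosts that were really started would show which registry entries have no matching daemon; more than one worker ID under a single host would also be worth a look.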


              People

              • Assignee:
                gopalv Gopal Vijayaraghavan
                Reporter:
                sershe Sergey Shelukhin
              • Votes:
                0
                Watchers:
                2
