  1. Hive
  2. HIVE-7926 long-lived daemons for query fragment execution, I/O and caching
  3. HIVE-10648

LLAP: registry; Tez attempted to schedule to daemon that didn't exist


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • None
    • None
    • llap
    • None

    Description

      I can post logs externally; for now, the app IDs on the test cluster are application_1429683757595_0784 and application_1429683757595_0783. I also have the logs copied over.
      The AM found the node (the same log lines appear for other nodes):

      2015-05-07 12:13:28,074 INFO [ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerEventHandler] impl.LlapYarnRegistryImpl: Adding new worker 342f4992-2608-43ab-a119-b50882e35f75 which mapped to DynamicServiceInstance [alive=true, host=cn059-10.l42scl.hortonworks.com:15001 with resources=<memory:20480, vCores:6>]
      ....
      2015-05-07 12:13:28,082 INFO [Dispatcher thread: Central] node.AMNodeTracker: Num cluster nodes = 19
      

      The trouble is that this node never actually existed; the cluster only had 15 nodes.
      As the job progressed, the AM repeatedly tried to schedule to this node and failed. No other LLAP cluster was running at the same time.
      In fact, given that I always start a 15-node cluster, I am not sure where the 19-node data could conceivably come from...
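
      One way to narrow down where the extra nodes come from is to list every host the AM registered and compare it against the daemons that were actually started. The following is a minimal sketch, not part of Hive or Tez: the regular expression is modeled only on the "Adding new worker" line quoted above (other versions may log it differently), and the function names and the placeholder host in `expected` are hypothetical.

```python
import re

# Modeled on the "Adding new worker" AM log line quoted in this issue.
# Assumption: other LLAP/Tez versions may format this line differently.
WORKER_RE = re.compile(
    r"Adding new worker (?P<worker>\S+) which mapped to "
    r"DynamicServiceInstance \[alive=(?P<alive>\w+), "
    r"host=(?P<host>[^:\s]+):(?P<port>\d+)"
)


def registered_hosts(log_lines):
    """Return {host: worker_id} for every worker the AM logged as added."""
    hosts = {}
    for line in log_lines:
        match = WORKER_RE.search(line)
        if match:
            hosts[match.group("host")] = match.group("worker")
    return hosts


def unexpected_hosts(log_lines, expected_hosts):
    """Return registered hosts that are not in the known daemon list."""
    return sorted(set(registered_hosts(log_lines)) - set(expected_hosts))


sample = [
    "2015-05-07 12:13:28,074 INFO "
    "[ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerEventHandler] "
    "impl.LlapYarnRegistryImpl: Adding new worker "
    "342f4992-2608-43ab-a119-b50882e35f75 which mapped to "
    "DynamicServiceInstance [alive=true, "
    "host=cn059-10.l42scl.hortonworks.com:15001 with "
    "resources=<memory:20480, vCores:6>]",
]

# Hypothetical placeholder: replace with the hosts of the daemons
# that were actually started.
expected = ["placeholder-host.example.com"]

print(unexpected_hosts(sample, expected))
```

      Running this over the full AM log, with `expected` set to the 15 real hosts, would list the registrations that have no matching daemon, which could then be checked against whatever the registry still holds.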

            People

              gopalv Gopal Vijayaraghavan
              sershe Sergey Shelukhin
