  1. Apache Tez
  2. TEZ-2485

Reduce the Resource Load on the Timeline Server

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      The disk, network, and memory resources needed by the timeline server are many times higher than for the equivalent MapReduce job.

      Based on the storage improvements in YARN-3448, the timeline server may
      support up to 30,000 jobs / 10,000,000 tasks a day.

      While I understand there is community effort on timeline server v2, it
      will be good if Tez can reduce its pressure on the timeline server by
      auditing both the number of events and size of events.

      Here are some observations based on my understanding of the design of
      timeline stores:

      Each timeline entity pushed explodes into many records in the database:
      1 marker record
      1 domain record
      1 record per event
      2 records per related entity
      2 records per primary filter (this is the cost in
      RollingLevelDBTimelineStore; the plain leveldb store instead rewrites
      the entire entity record for each primary filter)
      1 record per other info entry

      For example

      Task Attempt Start
      1 marker
      1 domain
      1 task attempt start event
      1 related entity X 2
      7 other info entries
      4 primary filters X 2

      20 records written in the database for task attempt start

      Task Attempt Finish
      1 marker
      1 domain
      1 task attempt finish event
      1 related entity X 2
      5 other info entries
      5 primary filters X 2

      20 records written in the database for task attempt finish
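
The per-entity record counts listed above can be checked with a small calculation. This is only a sketch: `EntityShape` and `records_written` are hypothetical names introduced here, not Tez or YARN APIs, and the cost of 2 records per primary filter assumes RollingLevelDBTimelineStore as described above.

```python
from dataclasses import dataclass


@dataclass
class EntityShape:
    """Field counts of one timeline entity put (hypothetical helper type)."""
    events: int
    related_entities: int
    primary_filters: int
    other_info: int


def records_written(shape: EntityShape) -> int:
    """Apply the per-field record costs from the description."""
    return (
        1                              # marker record
        + 1                            # domain record
        + shape.events                 # 1 record per event
        + 2 * shape.related_entities   # 2 records per related entity
        + 2 * shape.primary_filters    # 2 records per primary filter (rolling store)
        + shape.other_info             # 1 record per other info entry
    )


# Field counts taken from the two examples in the description.
task_attempt_start = EntityShape(events=1, related_entities=1,
                                 primary_filters=4, other_info=7)
task_attempt_finish = EntityShape(events=1, related_entities=1,
                                  primary_filters=5, other_info=5)

print(records_written(task_attempt_start))   # 20
print(records_written(task_attempt_finish))  # 20
```

Under these assumptions, dropping one primary filter saves 2 records per entity, while dropping one other info entry saves 1.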

      =====================================================
      QUESTION:
      =====================================================

      Is there any data we are publishing to the timeline server that is not
      in the UI?

      Do we use all the entities (TEZ_CONTAINER_ID, for example)?
      Do we use all the primary filters?
      Do we use all the related entities specified?
      Are there any fields we don't use?
      Are there other approaches to consider to reduce entity count/size?
      Is there a way to store the same information in less space?

      ===================
      Key Value Breakdown

      Count    Key Size   Value Size
      5642512  533690380  745454867

      Entity Type Breakdown

      Type                     Count    Key Size   Value Size
      TEZ_CONTAINER_ID         843850   86244392   5654341
      applicationAttemptId     544      53248      6174
      applicationId            544      44412      6174
      TEZ_TASK_ATTEMPT_ID      2471393  239523553  373637209
      TEZ_APPLICATION          1048     84312      13057630
      containerId              362443   37013813   4135845
      TEZ_VERTEX_ID            99239    10387114   1559948
      TEZ_DAG_ID               5402     387705     2910830
      TEZ_TASK_ID              1762211  146210017  344478400
      TEZ_APPLICATION_ATTEMPT  95838    13741814   8316
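
To help answer whether an entity type such as TEZ_CONTAINER_ID is worth keeping, the sketch below computes what fraction of the store one entity type accounts for, using the rows of the Entity Type Breakdown. `share_of_store` is a hypothetical helper, and the result is only a rough estimate: it ignores any related-entity records that other entity types write because of the type in question.

```python
# (count, key_size, value_size) rows copied from the Entity Type Breakdown.
ENTITY_TYPES = {
    "TEZ_CONTAINER_ID": (843850, 86244392, 5654341),
    "applicationAttemptId": (544, 53248, 6174),
    "applicationId": (544, 44412, 6174),
    "TEZ_TASK_ATTEMPT_ID": (2471393, 239523553, 373637209),
    "TEZ_APPLICATION": (1048, 84312, 13057630),
    "containerId": (362443, 37013813, 4135845),
    "TEZ_VERTEX_ID": (99239, 10387114, 1559948),
    "TEZ_DAG_ID": (5402, 387705, 2910830),
    "TEZ_TASK_ID": (1762211, 146210017, 344478400),
    "TEZ_APPLICATION_ATTEMPT": (95838, 13741814, 8316),
}

TOTAL_COUNT = sum(count for count, _, _ in ENTITY_TYPES.values())
TOTAL_SIZE = sum(key + value for _, key, value in ENTITY_TYPES.values())


def share_of_store(entity_type):
    """Return (fraction of records, fraction of key+value size) for one type."""
    count, key_size, value_size = ENTITY_TYPES[entity_type]
    return count / TOTAL_COUNT, (key_size + value_size) / TOTAL_SIZE


record_share, size_share = share_of_store("TEZ_CONTAINER_ID")
print(f"TEZ_CONTAINER_ID: {record_share:.1%} of records, {size_share:.1%} of size")
```

With these numbers, TEZ_CONTAINER_ID accounts for about 15% of the records but only about 7% of the combined key and value size.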

      Column Breakdown

      Column          Count    Key Size   Value Size
      primarykeys     1092413  118768299  0
      marker          373515   25740507   2988120
      events          578196   55148482   1156392
      domain          373515   26114022   15314115
      reverserelated  587815   73721347   0
      otherinfo       2143751  170983893  725996240
      related         493307   63213830   0
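
The Column Breakdown can be read two ways: which columns hold the value data, and which columns are key-only (value size 0) and so cost only key space. A minimal sketch of both views, using the rows above; the variable names are illustrative only.

```python
# (count, key_size, value_size) rows copied from the Column Breakdown.
COLUMNS = {
    "primarykeys": (1092413, 118768299, 0),
    "marker": (373515, 25740507, 2988120),
    "events": (578196, 55148482, 1156392),
    "domain": (373515, 26114022, 15314115),
    "reverserelated": (587815, 73721347, 0),
    "otherinfo": (2143751, 170983893, 725996240),
    "related": (493307, 63213830, 0),
}

total_key = sum(key for _, key, _ in COLUMNS.values())
total_value = sum(value for _, _, value in COLUMNS.values())

# Fraction of all value data stored under each column.
value_share = {name: value / total_value
               for name, (_, _, value) in COLUMNS.items()}

# Columns that store no value at all: their whole cost is the key.
key_only = sorted(name for name, (_, _, value) in COLUMNS.items() if value == 0)
key_only_key_share = sum(COLUMNS[name][1] for name in key_only) / total_key

print(f"otherinfo holds {value_share['otherinfo']:.1%} of value data")
print(f"{key_only} hold {key_only_key_share:.1%} of key data")
```

In this sample, otherinfo holds nearly all of the value data, while the key-only columns (primary filters and related entities) make up close to half of the key data.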

      Other Info Key Breakdown

      Key                    Count   Key Size  Value Size
      appSubmitTime          126     11466     1638
      vertexName             349     23732     3081
      stats                  349     21987     142938
      applicationId          163     10106     5705
      exitStatus             84337   7337319   84559
      endTime                288538  22354866  3750994
      counters               204201  15474759  646685059
      startTime              204201  15678960  2654613
      nodeId                 106761  8540880   3950157
      initTime               512     32325     6656
      numKilledTasks         512     35397     517
      timeTaken              204201  15678960  1061085
      inProgressLogsURL      106761  9715251   11741572
      config                 126     8820      13037092
      scheduledTime          96928   7172672   1260064
      dagPlan                163     9128      2074899
      completedLogsURL       106761  9608490   22703699
      taskAttemptErrorEnum   15808   1485952   331784
      initRequestedTime      349     26175     4537
      startRequestedTime     349     26524     4537
      numFailedTasks         512     35397     512
      vertexNameIdMapping    163     11084     16157
      numSucceededTasks      512     36933     1054
      numKilledTaskAttempts  512     38981     521
      status                 204201  15066357  2198349
      processorClassName     349     26524     18690
      numFailedTaskAttempts  512     38981     512
      tezVersion             126     9324      14364
      numTasks               349     23034     665
      successfulAttemptId    96785   7742800   4355325
      nodeHttpAddress        106761  9501729   3950157
      numCompletedTasks      512     36933     1056
      diagnostics            204201  16087362  915925
      containerId            106761  9074685   5017767
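
To see where a size reduction would pay off most, the sketch below ranks the other info keys by value size and reports the average value size per entry for the largest one. The data is copied from the Other Info Key Breakdown; `top_keys` is a hypothetical helper name.

```python
# key -> (count, value_size), copied from the Other Info Key Breakdown.
OTHER_INFO = {
    "appSubmitTime": (126, 1638), "vertexName": (349, 3081),
    "stats": (349, 142938), "applicationId": (163, 5705),
    "exitStatus": (84337, 84559), "endTime": (288538, 3750994),
    "counters": (204201, 646685059), "startTime": (204201, 2654613),
    "nodeId": (106761, 3950157), "initTime": (512, 6656),
    "numKilledTasks": (512, 517), "timeTaken": (204201, 1061085),
    "inProgressLogsURL": (106761, 11741572), "config": (126, 13037092),
    "scheduledTime": (96928, 1260064), "dagPlan": (163, 2074899),
    "completedLogsURL": (106761, 22703699),
    "taskAttemptErrorEnum": (15808, 331784),
    "initRequestedTime": (349, 4537), "startRequestedTime": (349, 4537),
    "numFailedTasks": (512, 512), "vertexNameIdMapping": (163, 16157),
    "numSucceededTasks": (512, 1054), "numKilledTaskAttempts": (512, 521),
    "status": (204201, 2198349), "processorClassName": (349, 18690),
    "numFailedTaskAttempts": (512, 512), "tezVersion": (126, 14364),
    "numTasks": (349, 665), "successfulAttemptId": (96785, 4355325),
    "nodeHttpAddress": (106761, 3950157), "numCompletedTasks": (512, 1056),
    "diagnostics": (204201, 915925), "containerId": (106761, 5017767),
}

total_value_size = sum(size for _, size in OTHER_INFO.values())


def top_keys(n):
    """Return the n other info keys with the largest total value size."""
    ranked = sorted(OTHER_INFO, key=lambda k: OTHER_INFO[k][1], reverse=True)
    return ranked[:n]


for key in top_keys(3):
    count, size = OTHER_INFO[key]
    print(f"{key}: {size / total_value_size:.1%} of other info value size, "
          f"{size / count:.0f} per entry on average")
```

In this sample, counters alone accounts for roughly 89% of the other info value size, followed by completedLogsURL and config, so trimming counter payloads looks like the largest single lever.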

      Attachments

        1. ats-omit-dup-display-names-and-zero-counters_v2.patch
          4 kB
          Jason Darrell Lowe
        2. ats-omit-dup-display-names-and-zero-counters.patch
          4 kB
          Jason Darrell Lowe
        3. TEZ-2485.REMOVE_TEZ_CONTAINER_ID.1.patch
          1 kB
          Jonathan Turner Eagles
        4. TEZ-2485.SHORTER_ENTITIES.1.patch
          15 kB
          Jonathan Turner Eagles

          People

            Assignee: Unassigned
            Reporter: Jonathan Turner Eagles (jeagles)
            Votes: 1
            Watchers: 7
