Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
The disk, network, and memory resources needed by the timeline server are many times higher than those needed for the equivalent MapReduce job.
Based on the storage improvements in YARN-3448, the timeline server may support up to 30,000 jobs / 10,000,000 tasks a
day.
While I understand there is a community effort on timeline server v2, it
would be good if Tez could reduce its pressure on the timeline server by
auditing both the number of events and the size of events.
Here are some observations based on my understanding of the design of
timeline stores:
Each timeline entity pushed explodes into many records in the database:
1 marker record
1 domain record
1 record per event
2 records per related entity
2 records per primary filter (this is the count in
RollingLevelDBTimelineStore; the plain leveldb store instead rewrites the
entire entity record for each primary filter)
1 record per other info
For example
Task Attempt Start
1 marker
1 domain
1 task attempt start event
1 related entity X 2
7 other info entries
4 primary filters X 2
20 records written in the database for task attempt start
Task Attempt Finish
1 marker
1 domain
1 task attempt finish event
1 related entity X 2
5 other info entries
5 primary filters X 2
20 records written in the database for task attempt finish
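The two tallies above can be reproduced with a small sketch. The function name and parameters are hypothetical, and it assumes the RollingLevelDBTimelineStore behaviour described above (2 records per related entity and 2 per primary filter):

```python
def records_per_entity(events, related_entities, primary_filters, other_info):
    """Estimate database records written for one timeline entity put.

    Assumes the RollingLevelDBTimelineStore layout described in this issue:
    1 marker, 1 domain, 1 per event, 2 per related entity,
    2 per primary filter, 1 per other info entry.
    """
    marker = 1
    domain = 1
    return (marker + domain + events
            + 2 * related_entities
            + 2 * primary_filters
            + other_info)


# Task attempt start: 1 event, 1 related entity, 4 primary filters, 7 other info
start = records_per_entity(events=1, related_entities=1,
                           primary_filters=4, other_info=7)

# Task attempt finish: 1 event, 1 related entity, 5 primary filters, 5 other info
finish = records_per_entity(events=1, related_entities=1,
                            primary_filters=5, other_info=5)

print(start, finish)
```

Under these assumptions, dropping a single primary filter saves 2 records per event, while dropping one other info entry saves 1.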
=====================================================
QUESTION:
=====================================================
Is there any data we are publishing to the timeline server that is not
in the UI?
Do we use all the entities (TEZ_CONTAINER_ID, for example)?
Do we use all the primary filters?
Do we use all the related entities specified?
Are there any fields we don't use?
Are there other approaches to consider to reduce entity count/size?
Is there a way to store the same information in less space?
===================
Key Value Breakdown
Count | Key Size | Value Size |
---|---|---|
5642512 | 533690380 | 745454867 |
Entity Type Breakdown
Type | Count | Key Size | Value Size |
---|---|---|---|
TEZ_CONTAINER_ID | 843850 | 86244392 | 5654341 |
applicationAttemptId | 544 | 53248 | 6174 |
applicationId | 544 | 44412 | 6174 |
TEZ_TASK_ATTEMPT_ID | 2471393 | 239523553 | 373637209 |
TEZ_APPLICATION | 1048 | 84312 | 13057630 |
containerId | 362443 | 37013813 | 4135845 |
TEZ_VERTEX_ID | 99239 | 10387114 | 1559948 |
TEZ_DAG_ID | 5402 | 387705 | 2910830 |
TEZ_TASK_ID | 1762211 | 146210017 | 344478400 |
TEZ_APPLICATION_ATTEMPT | 95838 | 13741814 | 8316 |
Column Breakdown
Column | Count | Key Size | Value Size |
---|---|---|---|
primarykeys | 1092413 | 118768299 | 0 |
marker | 373515 | 25740507 | 2988120 |
events | 578196 | 55148482 | 1156392 |
domain | 373515 | 26114022 | 15314115 |
reverserelated | 587815 | 73721347 | 0 |
otherinfo | 2143751 | 170983893 | 725996240 |
related | 493307 | 63213830 | 0 |
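To see where the bytes go, here is a rough sketch that recomputes the totals from the Column Breakdown rows and ranks each column by its share of stored bytes. The numbers are copied from the table; the variable names are only illustrative:

```python
# (column, record count, key bytes, value bytes) from the Column Breakdown table
COLUMNS = [
    ("primarykeys", 1092413, 118768299, 0),
    ("marker", 373515, 25740507, 2988120),
    ("events", 578196, 55148482, 1156392),
    ("domain", 373515, 26114022, 15314115),
    ("reverserelated", 587815, 73721347, 0),
    ("otherinfo", 2143751, 170983893, 725996240),
    ("related", 493307, 63213830, 0),
]

total_count = sum(count for _, count, _, _ in COLUMNS)
total_key = sum(key for _, _, key, _ in COLUMNS)
total_value = sum(value for _, _, _, value in COLUMNS)

# Share of all stored bytes (key + value) attributable to each column
byte_share = {
    name: (key + value) / (total_key + total_value)
    for name, _, key, value in COLUMNS
}

# Share of value bytes only
value_share = {name: value / total_value for name, _, _, value in COLUMNS}

for name, share in sorted(byte_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {share:6.1%} of bytes, {value_share[name]:6.1%} of value bytes")
```

The recomputed totals match the Key Value Breakdown row, and otherinfo accounts for about 70% of all bytes and about 97% of value bytes, so that column looks like the first place to audit.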
Other Info Key Breakdown
Key | Count | Key Size | Value Size |
---|---|---|---|
appSubmitTime | 126 | 11466 | 1638 |
vertexName | 349 | 23732 | 3081 |
stats | 349 | 21987 | 142938 |
applicationId | 163 | 10106 | 5705 |
exitStatus | 84337 | 7337319 | 84559 |
endTime | 288538 | 22354866 | 3750994 |
counters | 204201 | 15474759 | 646685059 |
startTime | 204201 | 15678960 | 2654613 |
nodeId | 106761 | 8540880 | 3950157 |
initTime | 512 | 32325 | 6656 |
numKilledTasks | 512 | 35397 | 517 |
timeTaken | 204201 | 15678960 | 1061085 |
inProgressLogsURL | 106761 | 9715251 | 11741572 |
config | 126 | 8820 | 13037092 |
scheduledTime | 96928 | 7172672 | 1260064 |
dagPlan | 163 | 9128 | 2074899 |
completedLogsURL | 106761 | 9608490 | 22703699 |
taskAttemptErrorEnum | 15808 | 1485952 | 331784 |
initRequestedTime | 349 | 26175 | 4537 |
startRequestedTime | 349 | 26524 | 4537 |
numFailedTasks | 512 | 35397 | 512 |
vertexNameIdMapping | 163 | 11084 | 16157 |
numSucceededTasks | 512 | 36933 | 1054 |
numKilledTaskAttempts | 512 | 38981 | 521 |
status | 204201 | 15066357 | 2198349 |
processorClassName | 349 | 26524 | 18690 |
numFailedTaskAttempts | 512 | 38981 | 512 |
tezVersion | 126 | 9324 | 14364 |
numTasks | 349 | 23034 | 665 |
successfulAttemptId | 96785 | 7742800 | 4355325 |
nodeHttpAddress | 106761 | 9501729 | 3950157 |
numCompletedTasks | 512 | 36933 | 1056 |
diagnostics | 204201 | 16087362 | 915925 |
containerId | 106761 | 9074685 | 5017767 |
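As a rough way to pick audit targets, the sketch below ranks a subset of the other info keys by total value bytes and reports the average value size per entry. Only the largest rows from the table above are included, and the names are illustrative:

```python
# (other info key, count, value bytes) for the largest rows of the
# Other Info Key Breakdown table; the remaining rows are omitted for brevity
OTHER_INFO = [
    ("counters", 204201, 646685059),
    ("completedLogsURL", 106761, 22703699),
    ("config", 126, 13037092),
    ("inProgressLogsURL", 106761, 11741572),
    ("containerId", 106761, 5017767),
    ("successfulAttemptId", 96785, 4355325),
]

# Average value bytes per stored entry for each key
avg_value_bytes = {name: value / count for name, count, value in OTHER_INFO}

# Rank keys by the total value bytes they contribute
ranked = sorted(OTHER_INFO, key=lambda row: row[2], reverse=True)

for name, count, value in ranked:
    print(f"{name:20s} total={value:>11d}  avg={avg_value_bytes[name]:10.1f} bytes/entry")
```

By total bytes, counters dominates at a little over 3 KB per entry, while config is the largest per entry but is written only 126 times.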