Use Case: A user launches an application on a secured cluster that runs for some time and then fails within the AM (perhaps due to OOM in the AM), leaving no history in the job history server. The user doesn't notice that the job has failed until after the application has dropped out of the RM's application store. At this point, if no information was stored in the Generic Application History Service, the user must rely on a privileged system administrator to access the AM logs for them.
It is desirable to activate the Generic Application History Service (GAHS) within the timeline server so that users can access their application's information even after the RM has forgotten about their application. This app information should be kept in the GAHS for 1 week, as is done, for example, for logs in the job history server.
One way that the GAHS stores metadata about an application is in an Entity levelDB. This includes information about each container for each application. Based on my analysis, the levelDB size grows by at least 2500 bytes per container (uncompressed). This is a conservative estimate, as the size could be much bigger depending on the amount of diagnostic information associated with failed containers.
On very large and busy clusters, the space needed on the timeline server's local disk for one week of retention would be between 0.6 TB and 1.0 TB (uncompressed). Even if we assume 90% compression, that's still between 60 GB and 100 GB that will be needed on the local disk. In addition to this, between 80 GB and 143 GB of metadata (uncompressed) will need to be cleaned up every day from the levelDB, which will delay other processing in the timeline server.
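As a rough cross-check of the container-level numbers, here is a minimal back-of-the-envelope sketch. It assumes decimal units (1 GB = 10^9 bytes), a 7-day retention window, and an even daily load; the function names are hypothetical and the container counts are only what the 2500 bytes/container figure implies, not measured values.

```python
# Back-of-the-envelope sizing for container-level metadata in the levelDB.
# Assumptions (not from any Hadoop API): decimal units, 7-day retention,
# load spread evenly across days.
GB = 10**9
TB = 10**12
BYTES_PER_CONTAINER = 2500  # lower bound quoted in this JIRA
RETENTION_DAYS = 7


def implied_containers_per_week(weekly_bytes):
    """Containers per week implied by a given uncompressed weekly footprint."""
    return weekly_bytes // BYTES_PER_CONTAINER


def compressed_bytes(raw_bytes, compression_percent=90):
    """On-disk size after removing compression_percent of the raw size."""
    return raw_bytes * (100 - compression_percent) // 100


def daily_cleanup_bytes(weekly_bytes, days=RETENTION_DAYS):
    """Uncompressed metadata that ages out each day under steady load."""
    return weekly_bytes / days


for weekly in (600 * GB, 1 * TB):
    print(
        f"weekly={weekly / GB:.0f} GB "
        f"containers/week={implied_containers_per_week(weekly):,} "
        f"compressed={compressed_bytes(weekly) / GB:.0f} GB "
        f"daily_cleanup={daily_cleanup_bytes(weekly) / GB:.1f} GB"
    )
```

Under these assumptions the weekly range corresponds to roughly 240 to 400 million containers per week, and dividing the weekly footprint by seven lands close to the daily cleanup range quoted above.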
The proposal of this JIRA is to add a configuration property that controls whether or not the GAHS stores container information in the levelDB. With container information disabled, I estimate that the local disk usage would be about 5700 bytes per job, or about 10 GB (uncompressed) per week. Additionally, the daily cleanup load would only be about 1.5 GB per day.
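To put the job-level estimate next to the container-level one, here is a similar sketch under the same assumptions (decimal units, 7-day retention, even daily load). The helper names are hypothetical, and the job count is only what the 5700 bytes/job figure implies.

```python
# Compare the job-only footprint against the container-level footprint.
# Assumptions (illustrative only): decimal units, 7-day retention.
GB = 10**9
TB = 10**12
BYTES_PER_JOB = 5700  # per-job estimate quoted in this JIRA
RETENTION_DAYS = 7


def implied_jobs_per_week(weekly_bytes):
    """Jobs per week implied by a given uncompressed weekly footprint."""
    return weekly_bytes // BYTES_PER_JOB


def reduction_factor(before_bytes, after_bytes):
    """How many times smaller the footprint is without container records."""
    return before_bytes / after_bytes


job_only_weekly = 10 * GB
print(f"jobs/week ~= {implied_jobs_per_week(job_only_weekly):,}")
print(f"daily cleanup ~= {job_only_weekly / RETENTION_DAYS / GB:.2f} GB")
for with_containers in (600 * GB, 1 * TB):
    factor = reduction_factor(with_containers, job_only_weekly)
    print(f"{with_containers / GB:.0f} GB -> 10 GB is a {factor:.0f}x reduction")
```

By this arithmetic, dropping container records shrinks the uncompressed levelDB footprint by roughly 60x to 100x, which is the main argument for making container storage optional.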