To summarize the design of the current system, which is already implemented (but disabled) in YARN NodeManagers, and the gaps we need to fill:
- NM uploads the logs of all the containers of an App into a single file on HDFS named by node-id in a per-app directory. So for an app, there are a maximum of N log files, N being the number of nodes in the system.
- NM starts streaming a container's logs to the file once a container finishes.
- On app-finish, flushes all containers' logs and closes the per-app, per-node file.
- Removes the local container-logs on app-finish and once the aggregated file is closed.
- The log format is a TFile. Keys are container-ids. Each value is a list of entries, each pairing a file-type (syslog/stdout/stderr) with the actual contents of that container log-file.
- TODO: As of today, NM silently ignores any failures during the log upload. It could increment a counter for these failures, or maintain a per-app list of the containers whose logs it failed to upload.
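The layout described above can be sketched roughly as follows. This is a minimal in-memory model, not the NodeManager code: `aggregated_path`, `AggregatedLogFile`, and the `/logs` root used in the usage note are hypothetical names, and a plain dict stands in for the TFile. It also shows one possible shape for the failure tracking proposed in the TODO.

```python
from collections import defaultdict
from dataclasses import dataclass, field


def aggregated_path(root: str, app_id: str, node_id: str) -> str:
    """One file per node, named by node-id, inside a per-app directory.

    The exact directory layout is an assumption made for illustration.
    """
    return f"{root}/{app_id}/{node_id}"


@dataclass
class AggregatedLogFile:
    """Stand-in for the per-app, per-node file.

    Maps container-id -> list of (file-type, contents) entries.
    """
    entries: dict = field(default_factory=lambda: defaultdict(list))
    failed_containers: list = field(default_factory=list)
    closed: bool = False

    def append(self, container_id: str, file_type: str, contents: str) -> None:
        # Called once a container finishes and its logs are streamed.
        if self.closed:
            raise ValueError("aggregated file already closed")
        self.entries[container_id].append((file_type, contents))

    def record_failure(self, container_id: str) -> None:
        # Hypothetical per-app bookkeeping instead of silently ignoring errors.
        self.failed_containers.append(container_id)

    def close(self) -> None:
        # Called on app-finish, after all containers' logs are flushed.
        self.closed = True
```

For example, `aggregated_path("/logs", "application_1304487270789_0001", "127.0.0.1_45454")` yields one path per node for that app, so an app never has more files than there are nodes.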
Coverage of logs: In most cases, we don't need to upload the logs of all the containers.
- Options include:
- Only AM logs are uploaded to HDFS for any app.
- AM logs + failed containers' logs
- AM logs + failed containers' logs + x% of successful containers
- All logs
- The above retention policy is already implemented by LogAggregationService, but this needs to be user-configurable: TODO.
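To make the four options concrete, here is a minimal sketch of a per-container upload decision. It is not the LogAggregationService implementation: the `Retention` enum, `should_upload`, and its parameters are hypothetical, and picking the x% sample by hashing the container-id is an assumption chosen only so that the decision is repeatable.

```python
import zlib
from enum import Enum


class Retention(Enum):
    AM_ONLY = 1               # only AM logs
    AM_AND_FAILED = 2         # AM logs + failed containers' logs
    AM_FAILED_AND_SAMPLE = 3  # AM + failed + x% of successful containers
    ALL = 4                   # all logs


def should_upload(policy, container_id, is_am, failed, sample_percent=0):
    """Return True if this container's logs should be uploaded."""
    # AM logs are uploaded under every option; ALL uploads everything.
    if policy is Retention.ALL or is_am:
        return True
    if policy is Retention.AM_ONLY:
        return False
    # Both remaining options keep failed containers' logs.
    if failed:
        return True
    if policy is Retention.AM_FAILED_AND_SAMPLE:
        # Assumed sampling scheme: stable hash of the container-id into 0..99.
        bucket = zlib.crc32(container_id.encode("utf-8")) % 100
        return bucket < sample_percent
    return False
```

A user-configurable version would only need to expose the policy and the sample percentage as settings.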
- NM serves the log files of a container till the App finishes. NM doesn't have any indices; it simply prints the logs, treating them as files, one after another, possibly with a header for each log-type.
- TODO: After the upload finishes, NM will point the user to a configured log-server location. This is mostly the same as the JobHistory server.
- TODO: For MapReduce: After the App finishes, when users visit their job-history, there will be servlets which parse the aggregated file and present the logs per container.
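The index-free serving described above (logs printed one after another, with a header per log-type) amounts to something like the following sketch. The function name and the header format are hypothetical; the same rendering could plausibly be reused by a servlet presenting one container's entries from the aggregated file.

```python
def render_container_logs(logs):
    """Concatenate a container's logs, one after another, with a header each.

    `logs` is a list of (file_type, contents) pairs, e.g. ("stdout", "...").
    The "==== name ====" header format is an assumption for illustration.
    """
    parts = []
    for file_type, contents in logs:
        parts.append(f"==== {file_type} ====\n")
        # Make sure each log ends with a newline so headers start on a new line.
        parts.append(contents if contents.endswith("\n") else contents + "\n")
    return "".join(parts)
```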
Command-line user interface
- A dumper is already included for the clients.
- The command line for all container-logs of a single app looks like this:
./yarn/bin/yarn logs -applicationId application_1304487270789_0001
- The command line for a single container's logs looks like this:
./yarn/bin/yarn logs -applicationId application_1304487270789_0001 -containerId container_1304487270789_0001_000002 -nodeAddress 127.0.0.1_45454
- TODO: We need a MapReduce-specific command line that takes in a TaskAttemptID and returns logs.
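For scripting, the two invocations above could be assembled with a small helper such as the sketch below. The helper `yarn_logs_command` is hypothetical; it uses only the flags shown in the examples and assumes, as in the second example, that `-containerId` is always given together with `-nodeAddress`.

```python
def yarn_logs_command(application_id, container_id=None, node_address=None,
                      yarn_bin="./yarn/bin/yarn"):
    """Build the argument list for the log dumper shown above."""
    cmd = [yarn_bin, "logs", "-applicationId", application_id]
    if container_id is not None:
        # Assumption: a single container's logs are addressed by container-id
        # plus the node address, as in the example invocation.
        if node_address is None:
            raise ValueError("node_address is required with container_id")
        cmd += ["-containerId", container_id, "-nodeAddress", node_address]
    return cmd
```

The resulting list can be passed to `subprocess.run` directly, which avoids shell quoting issues.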
Life on HDFS
- The log file per-app per-node goes into a system log-dir and is written with user's credentials.
- The log-dir is per-user and has quotas specified by admins. The quotas are, for now, the same for all users and set to a reasonable value.
- Tooling like dfs -cat or dfs -text can let users print out their logs, depending on the log format above.
- Admins can have scripts to garbage-collect/HAR logs in the user-dir that have aged beyond a certain time period (e.g. 15 days).
- TODO: What is the behaviour when user-quotas are hit? Fail aggregation and skip container-logs? How does the user come to know?
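For the garbage-collection bullet above, the selection step of such an admin script might look like this sketch. The function name and its inputs are hypothetical: it takes (path, modification-time) pairs however the script obtained them and only decides which files have aged out. Deleting or HAR-ing them is left to the caller.

```python
from datetime import datetime, timedelta


def expired_logs(files, now, max_age_days=15):
    """Return the paths whose modification time is older than the cutoff.

    `files` is an iterable of (path, mtime) pairs with datetime mtimes.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [path for path, mtime in files if mtime < cutoff]
```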