[HDFS-9184] Logging HDFS operation's caller context into audit logs - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.8.0, 3.0.0-alpha1
Component/s: None
Labels: None
Target Version/s: 2.8.0
Release Note:
The feature needs to be enabled by setting "hadoop.caller.context.enabled" to true. When the feature is used, additional fields are written into namenode audit log records.

Description

For a given HDFS operation (e.g. deleting a file), it is very helpful to track which upper-level job issued it. The upper-level callers may be specific Oozie tasks, MR jobs, or Hive queries. In one scenario, the namenode (NN) is being abused or spammed, and the operator wants to know immediately which MR job is to blame so that she can kill it. To this end, the caller context contains at least the application-dependent "tracking id".

There are several existing techniques that may be related to this problem.
1. Currently the HDFS audit log tracks the user of the operation, which is not enough. It is common for the same user to issue multiple jobs at the same time. Even for a single top-level task, tracking back to a specific caller in a chain of operations across the whole workflow (e.g. Oozie -> Hive -> Yarn) is hard, if not impossible.
2. HDFS has integrated htrace support for providing tracing information across multiple layers. Spans are created in many places and interconnected in a tree structure, which relies on offline analysis across RPC boundaries. For this use case, htrace would have to be enabled at a 100% sampling rate, which introduces significant overhead. Moreover, passing additional information (via annotations) other than the span id from the root of the tree to a leaf requires significant additional work.
3. In HDFS-4680, there is some related discussion on this topic. The final patch implemented the tracking id as a part of the delegation token. This protects the tracking information from being changed or impersonated. However, Kerberos-authenticated connections and insecure connections don't have tokens. HADOOP-8779 proposes to use tokens in all scenarios, but that might mean changes to several upstream projects and is a major change in their security implementation.

We propose another approach to address this problem. We also treat the HDFS audit log as a good place for after-the-fact root cause analysis. We propose to put the caller id (e.g. Hive query id) in threadlocals. Specifically, on the client side the threadlocal object is passed to the NN as an optional part of the RPC header, while on the server side the NN retrieves it from the header and puts it into the Handler's threadlocals. Finally, in FSNamesystem, the HDFS audit logger records the caller context for each operation. In this way, the existing code is not affected.
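
The flow described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code from the attached patches: the class names (CallerContextSketch, Context, RpcHeader), the method names, and the audit line layout are all hypothetical and chosen only to show how a thread-local value travels from the client, through an optional header field, into the handler thread and finally into an audit record.

```java
// Hypothetical sketch: none of these names come from the HDFS-9184 patches.
class CallerContextSketch {

    /** Stand-in for the caller context: an id plus an optional signature. */
    static final class Context {
        final String id;
        final byte[] signature; // may be null
        Context(String id, byte[] signature) {
            this.id = id;
            this.signature = signature;
        }
    }

    /** Stand-in for an RPC header whose caller context field is optional. */
    static final class RpcHeader {
        final String user;
        final String callerContextId; // null when the client set no context
        RpcHeader(String user, String callerContextId) {
            this.user = user;
            this.callerContextId = callerContextId;
        }
    }

    // Per-thread slot holding the current caller context.
    private static final ThreadLocal<Context> CURRENT = new ThreadLocal<>();

    static void setCurrent(Context c) { CURRENT.set(c); }
    static Context getCurrent() { return CURRENT.get(); }

    /** Client side: copy the thread-local context into the outgoing header. */
    static RpcHeader buildHeader(String user) {
        Context c = getCurrent();
        return new RpcHeader(user, c == null ? null : c.id);
    }

    /** Server side: install the context on the handler thread, then audit. */
    static String handle(RpcHeader header, String cmd, String src) {
        if (header.callerContextId != null) {
            setCurrent(new Context(header.callerContextId, null));
        }
        try {
            return auditLine(header.user, cmd, src);
        } finally {
            CURRENT.remove(); // do not leak the context to the next call
        }
    }

    /** Audit logger: append the caller context only when one is present. */
    static String auditLine(String user, String cmd, String src) {
        StringBuilder sb = new StringBuilder();
        sb.append("ugi=").append(user)
          .append("\tcmd=").append(cmd)
          .append("\tsrc=").append(src);
        Context c = getCurrent();
        if (c != null) {
            sb.append("\tcallerContext=").append(c.id);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Client thread: an upper-level job tags itself with its query id.
        setCurrent(new Context("hive_query_42", null));
        RpcHeader header = buildHeader("alice");

        // A separate thread plays the role of the NN handler.
        String[] out = new String[1];
        Thread handler = new Thread(() -> out[0] = handle(header, "delete", "/tmp/x"));
        handler.start();
        handler.join();
        System.out.println(out[0]);
    }
}
```

Because the audit logger reads the context from the handler thread's own slot, the operation code between the RPC layer and the logger needs no extra parameters, which is the sense in which the existing code is left unaffected.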

It is still challenging to keep a "lying" client from abusing the caller context. Our proposal is to add a signature field to the caller context. The client may choose to provide its signature along with the caller id. The operator may need to validate the signature at the time of offline analysis. The NN is not responsible for validating the signature online.
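
One way the offline check could work is sketched below. The issue does not specify a signature scheme, so the use of HMAC-SHA256 with a secret shared between the client and the operator is purely an assumption for illustration, as are the class and method names (CallerSignatureSketch, sign, verify).

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: HMAC-SHA256 is an assumed scheme, not part of the proposal.
class CallerSignatureSketch {

    /** Client side: compute a signature over the caller id with a secret key. */
    static byte[] sign(String callerId, byte[] key) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(callerId.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA256 unavailable", e);
        }
    }

    /** Operator side, offline: recompute the signature and compare. */
    static boolean verify(String callerId, byte[] signature, byte[] key) {
        // MessageDigest.isEqual performs a time-constant comparison.
        return MessageDigest.isEqual(sign(callerId, key), signature);
    }

    public static void main(String[] args) {
        byte[] key = "demo-shared-secret".getBytes(StandardCharsets.UTF_8);
        String callerId = "hive_query_42";

        // The client sends the id and signature; the NN only records them.
        byte[] signature = sign(callerId, key);
        System.out.println(callerId + ":" + Base64.getEncoder().encodeToString(signature));

        // Later, the operator validates entries taken from the audit log.
        System.out.println("genuine id valid: " + verify(callerId, signature, key));
        System.out.println("forged id valid:  " + verify("hive_query_43", signature, key));
    }
}
```

In this sketch the NN never touches the key, which matches the proposal that validation happens during offline analysis rather than on the request path.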

Attachments

HDFS-9184.000.patch	01/Oct/15 01:18	21 kB	Mingliang Liu
HDFS-9184.001.patch	08/Oct/15 19:09	24 kB	Mingliang Liu
HDFS-9184.002.patch	10/Oct/15 01:09	23 kB	Mingliang Liu
HDFS-9184.003.patch	12/Oct/15 22:07	23 kB	Mingliang Liu
HDFS-9184.004.patch	12/Oct/15 23:31	23 kB	Mingliang Liu
HDFS-9184.005.patch	13/Oct/15 01:02	25 kB	Mingliang Liu
HDFS-9184.006.patch	15/Oct/15 01:00	24 kB	Mingliang Liu
HDFS-9184.007.patch	15/Oct/15 20:32	26 kB	Mingliang Liu
HDFS-9184.008.patch	17/Oct/15 19:14	29 kB	Mingliang Liu
HDFS-9184.009.patch	23/Oct/15 04:55	29 kB	Mingliang Liu

Issue Links

blocks

HIVE-12249 Improve logging with tez	Closed
TEZ-2910 Set caller context for tracing (integrate with HDFS-9184)	Closed
SPARK-15857 Add Caller Context in Spark	Resolved
YARN-4349 Support CallerContext in YARN	Resolved
PIG-4714 Improve logging across multiple components with callerId	Closed

breaks

HDFS-9362 TestAuditLogger#testAuditLoggerWithCallContext assumes Unix line endings, fails on Windows.	Resolved
HDFS-10793 Fix HdfsAuditLogger binary incompatibility introduced by HDFS-9184	Resolved

causes

HDFS-14685 DefaultAuditLogger doesn't print CallerContext	Resolved

is related to

HDFS-4680 Audit logging of delegation tokens for MR tracing	Closed
FLINK-16809 Support setting CallerContext on YARN deployments	Open
HADOOP-13527 Add Spark to CallerContext LimitedPrivate scope	Resolved

Sub-Tasks

1.	Empty caller context considered invalid	Resolved	Mingliang Liu
2.	Document the caller context config keys	Resolved	Mingliang Liu
3.	TestAuditLogger#testAuditLoggerWithCallContext assumes Unix line endings, fails on Windows.	Resolved	Chris Nauroth
4.	Caller context should always be constructed by a builder	Resolved	Mingliang Liu