Right now, for a command such as "show databases", Sentry has to perform an authorization check on each database. When the system contains many databases (for example, 12000), the authorization checks for a single command can be very slow. Two main factors slow down authorization checks in Sentry even when caching is enabled:
1) The cache returns the list of privileges as Strings. As a result, every authorization check has to convert each privilege string to a privilege object.
2) When the cache is enabled, it returns all privileges of a given user regardless of which resource is being checked.
2.1) For example, a user has 2000 privileges assigned and the resource to check is "server=server1, database=db_1, table=table_1". The cache returns all 2000 privileges, including unrelated privileges such as "server=server1->database=db_2->action=ALL".
2.2) Returning unrelated privileges has two side effects:
2.2.1) The overhead of converting privileges from String to object is proportional to the number of privileges returned from the cache. Converting unrelated privileges costs time and brings no benefit.
2.2.2) The authorization check goes through each privilege, so its overhead is also proportional to the number of privileges returned from the cache. Checking unrelated privileges costs time and brings no benefit.
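The cost described above can be illustrated with a minimal sketch. It is not Sentry code: the class StringCacheCheckSketch, its parseCount counter, and the simplified matching rule are all hypothetical and exist only to show that a String-based cache forces one conversion per returned privilege on every check.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; these are not Sentry classes.
final class StringCacheCheckSketch {
    // Counts String-to-object conversions so the cost is visible.
    static int parseCount = 0;

    // Splits "server=server1->database=db_2->action=ALL" into key/value parts.
    static Map<String, String> parse(String privilege) {
        parseCount++;
        Map<String, String> parts = new LinkedHashMap<>();
        for (String kv : privilege.split("->")) {
            String[] pair = kv.split("=", 2);
            if (pair.length == 2) {
                parts.put(pair[0].trim().toLowerCase(Locale.ROOT), pair[1].trim());
            }
        }
        return parts;
    }

    // Simplified check (an assumption): a privilege matches when its server
    // and database equal the requested ones. Real matching rules are richer.
    static boolean canAccessDatabase(List<String> cached, String server, String database) {
        for (String privilege : cached) {
            Map<String, String> parts = parse(privilege); // repeated on every check
            if (server.equals(parts.get("server")) && database.equals(parts.get("database"))) {
                return true;
            }
        }
        return false;
    }
}
```

With 2000 cached privilege strings, a check that is denied parses all 2000 of them, and the same parsing is repeated on the next check.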
To address these two problems, this change does the following:
1) Add a new function listPrivilegeObjects that lets the authorization provider get privilege objects when checking authorization. This avoids the conversion overhead. All the interfaces from the policy engine (PolicyEngine) to the cache (PrivilegeCache) have to be changed to add this new function.
2) Implement a new cache, TreePrivilegeCache. It converts each privilege from String format to a Privilege object once, at the beginning, and returns the privilege objects directly from listPrivilegeObjects during an authorization check. This avoids the overhead of conversion at each authorization check.
3) TreePrivilegeCache organizes the privileges by the resource hierarchy, like a tree. Therefore, it can return only the privileges related to the resource being checked. This reduces the authorization check overhead.
3.1) For example, a user has 2000 privileges assigned, and the resource to check is "server=server1, database=db_1, table=table_1". TreePrivilegeCache returns only the related privileges and excludes unrelated privileges such as "server=server1->database=db_2->action=ALL".
SENTRY-1291 addressed problem 2) but not problem 1). In addition, its implementation, SimplePrivilegeCache, is neither memory efficient (each map key contains the whole resource hierarchy, so many keys share a large portion of their content) nor operationally efficient (for each authorization check, SimplePrivilegeCache.listPrivileges() has to construct a large number of keys in order to find all related privileges in the map).
4) Use TreePrivilegeCache instead of SimplePrivilegeCache for caching. Note that this solution is built on top of SENTRY-1291 and uses the changes SENTRY-1291 made, such as providing the resource hierarchy when getting privileges for an authorization check.
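The tree organization can be sketched as follows. This is an illustrative assumption, not the actual TreePrivilegeCache: the class PrivilegeTreeSketch and its methods add and listRelated are hypothetical names, the type parameter P stands in for an already converted privilege object, and wildcard resources are ignored for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a privilege tree keyed by resource hierarchy.
// P is the already-converted privilege object, so lookups never parse Strings.
final class PrivilegeTreeSketch<P> {
    // One node per resource name (server, database, table, ...).
    private static final class Node<P> {
        final List<P> privileges = new ArrayList<>();
        final Map<String, Node<P>> children = new HashMap<>();
    }

    private final Node<P> root = new Node<>();

    // Stores a privilege under its resource path, e.g. ["server1", "db_1"].
    void add(List<String> resourcePath, P privilege) {
        Node<P> node = root;
        for (String name : resourcePath) {
            node = node.children.computeIfAbsent(name, k -> new Node<P>());
        }
        node.privileges.add(privilege);
    }

    // Returns only the privileges stored on the path from the root down to
    // the requested resource; sibling subtrees (e.g. db_2) are never visited.
    List<P> listRelated(List<String> resourcePath) {
        List<P> related = new ArrayList<>(root.privileges);
        Node<P> node = root;
        for (String name : resourcePath) {
            node = node.children.get(name);
            if (node == null) {
                break; // nothing is stored below this level
            }
            related.addAll(node.privileges);
        }
        return related;
    }
}
```

In this sketch a lookup for ["server1", "db_1", "table_1"] visits at most three nodes, no matter how many databases have privileges. Each resource name is also stored once per node rather than repeated in every map key.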
Major Behavior Change
1) Create a new interface FilteredPrivilegeCache, which extends PrivilegeCache.
2) Move the function added by SENTRY-1291 in PrivilegeCache to FilteredPrivilegeCache, and add the additional functions from this solution to FilteredPrivilegeCache. This way, PrivilegeCache is unchanged and remains backward compatible with implementations written before SENTRY-1291.
3) Move all changes made by SENTRY-1291 in SimplePrivilegeCache (which implements PrivilegeCache) to a new class SimpleFilteredPrivilegeCache, which implements FilteredPrivilegeCache.
4) Instead of hard-coding the privilege cache class, use the configuration AuthzConfVars.AUTHZ_PRIVILEGE_CACHE ("sentry.hive.privilege.cache") to specify the privilege cache class name. The default value is "org.apache.sentry.provider.cache.TreePrivilegeCache". Users can switch to another cache implementation in the sentry-site.xml of a service (such as Hive server or HMS). The options are:
4.1) org.apache.sentry.provider.cache.SimplePrivilegeCache (the original cache implementation before
4.2) org.apache.sentry.provider.cache.SimpleFilteredPrivilegeCache (the cache implemented in
4.3) org.apache.sentry.provider.cache.TreePrivilegeCache (the cache implemented in this Jira)
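The interface split and the configurable cache class could look roughly like the sketch below. All type names ending in Sketch, the simplified method signatures, the no-argument constructors, and the use of java.util.Properties in place of the service configuration are assumptions made for illustration. Only the key "sentry.hive.privilege.cache" comes from the description above.

```java
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import java.util.Set;

// Hypothetical stand-in for PrivilegeCache: left unchanged for old implementations.
interface PrivilegeCacheSketch {
    Set<String> listPrivileges(Set<String> groups);
}

// Hypothetical stand-in for FilteredPrivilegeCache: resource-aware lookups live here.
interface FilteredPrivilegeCacheSketch extends PrivilegeCacheSketch {
    Set<String> listPrivileges(Set<String> groups, List<String> resourceHierarchy);
}

final class SimpleCacheSketch implements PrivilegeCacheSketch {
    @Override
    public Set<String> listPrivileges(Set<String> groups) {
        return Collections.emptySet();
    }
}

final class TreeCacheSketch implements FilteredPrivilegeCacheSketch {
    @Override
    public Set<String> listPrivileges(Set<String> groups) {
        return Collections.emptySet();
    }

    @Override
    public Set<String> listPrivileges(Set<String> groups, List<String> resourceHierarchy) {
        return Collections.emptySet();
    }
}

final class CacheLoaderSketch {
    static final String CACHE_KEY = "sentry.hive.privilege.cache";

    // Instantiates the configured cache class by name instead of hard-coding it.
    // Assumes the class has a no-argument constructor.
    static PrivilegeCacheSketch load(Properties conf) {
        String className = conf.getProperty(CACHE_KEY, TreeCacheSketch.class.getName());
        try {
            Object cache = Class.forName(className).getDeclaredConstructor().newInstance();
            return (PrivilegeCacheSketch) cache;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create privilege cache " + className, e);
        }
    }
}
```

A caller can then test whether the loaded cache implements the filtered interface and fall back to the unfiltered listPrivileges call when it does not. This is one way to stay compatible with older cache implementations.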