Description
We obtained a heap dump taken from the JVM running the Hive Metastore at the time a Sentry HDFS sync operation was performed. I analyzed this dump with jxray (www.jxray.com) and found that a significant percentage of memory is wasted on duplicate strings:
7. DUPLICATE STRINGS
Total strings: 29,986,017  Unique strings: 9,640,413  Duplicate values: 4,897,743  Overhead: 2,570,746K (9.4%)
Of these, more than a third come from Sentry:
917,331K (3.3%), 10,517,636 dup strings (498,477 unique), 10,517,636 dup backing arrays:
  <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement
  <-- {j.u.HashMap}.values
  <-- org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap
  <-- org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump
  <-- Java Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
The duplicate strings in memory have been eliminated by SENTRY-1811. However, when these strings are serialized into the TPathsDump thrift message, they are duplicated again. That is, if there are 3 different TPathEntry objects with the same pathElement="foo", then (even though there is only one interned copy of the "foo" string in memory) a separate copy of "foo" will be written to the serialized message for each of these 3 TPathEntries. This is one reason why serialized TPathsDump messages can grow very large, consume a lot of memory, and take a long time to send over the network.
To address this problem, we may use some form of custom compression: instead of writing multiple copies of duplicate strings, substitute them with shorter "string ids".
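A minimal sketch of the idea, assuming a dictionary-encoding scheme (the class name `PathDictionaryEncoder` and its methods are hypothetical, not part of Sentry): each distinct pathElement string is written to a shared dictionary exactly once, and every TPathEntry that uses it serializes only a compact integer id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical dictionary encoder: distinct strings are stored once,
// and repeated occurrences are replaced with small integer ids.
public class PathDictionaryEncoder {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> dictionary = new ArrayList<>();

    // Returns the id for a path element, assigning a new one on first sight.
    public int encode(String pathElement) {
        Integer id = ids.get(pathElement);
        if (id == null) {
            id = dictionary.size();
            ids.put(pathElement, id);
            dictionary.add(pathElement);
        }
        return id;
    }

    // Maps an id back to its string on the receiving side.
    public String decode(int id) {
        return dictionary.get(id);
    }

    // The dictionary itself would be serialized once alongside the ids.
    public List<String> getDictionary() {
        return dictionary;
    }

    public static void main(String[] args) {
        PathDictionaryEncoder enc = new PathDictionaryEncoder();
        // Three entries sharing pathElement "foo": "foo" is stored once
        // in the dictionary, and each entry carries only its id.
        System.out.println(enc.encode("foo"));  // 0
        System.out.println(enc.encode("bar"));  // 1
        System.out.println(enc.encode("foo"));  // 0
        System.out.println(enc.getDictionary().size());  // 2
    }
}
```

With this scheme, the serialized message would carry the dictionary once plus one small integer per TPathEntry, so the cost of a repeated string is paid only on its first occurrence.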
Attachments
Issue Links
- is related to SENTRY-1915: Sentry is doing a lot of work to convert list of paths to HMSPaths structure (Resolved)
- relates to SENTRY-1927: PathImageRetriever should minimize size of the serialized message when creating path dumps (Resolved)