Description
We obtained a heap dump taken from the JVM running the Hive Metastore at the time a Sentry HDFS sync operation was performed. I analyzed this dump with jxray (www.jxray.com) and found that a significant percentage of memory is wasted on duplicate strings:
7. DUPLICATE STRINGS
Total strings: 29,986,017  Unique strings: 9,640,413  Duplicate values: 4,897,743  Overhead: 2,570,746K (9.4%)
Of these, more than a third come from Sentry:
917,331K (3.3%), 10,517,636 dup strings (498,477 unique), 10,517,636 dup backing arrays:
  <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement
  <-- {j.u.HashMap}.values
  <-- org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap
  <-- org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump
  <-- Java Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
The duplicate strings in memory have been eliminated by SENTRY-1811. However, when these strings are serialized into the TPathsDump thrift message, they are duplicated again. That is, if there are 3 different TPathEntry objects with the same pathElement="foo", then (even though there is only one interned copy of the "foo" string in memory) a separate copy of "foo" will be written to the serialized message for each of these 3 TPathEntries. This is one reason why serialized TPathsDump messages can grow very large, consume a lot of memory, and take a long time to send over the network.
To address this problem, we may use some form of custom compression: instead of writing multiple copies of duplicate strings, substitute them with shorter "string ids".
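A minimal sketch of the idea, assuming a dictionary-encoding scheme (the class name `PathDictionaryEncoder` and its methods are hypothetical, not part of Sentry): each distinct pathElement string is written to a shared dictionary exactly once, and every TPathEntry that uses it serializes only a compact integer id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical dictionary encoder: distinct strings are stored once,
// and repeated occurrences are replaced with small integer ids.
public class PathDictionaryEncoder {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> dictionary = new ArrayList<>();

    // Returns the id for a path element, assigning a new one on first sight.
    public int encode(String pathElement) {
        Integer id = ids.get(pathElement);
        if (id == null) {
            id = dictionary.size();
            ids.put(pathElement, id);
            dictionary.add(pathElement);
        }
        return id;
    }

    // Maps an id back to its string on the receiving side.
    public String decode(int id) {
        return dictionary.get(id);
    }

    // The dictionary itself would be serialized once alongside the ids.
    public List<String> getDictionary() {
        return dictionary;
    }

    public static void main(String[] args) {
        PathDictionaryEncoder enc = new PathDictionaryEncoder();
        // Three entries sharing pathElement "foo": "foo" is stored once
        // in the dictionary, and each entry carries only its id.
        System.out.println(enc.encode("foo"));  // 0
        System.out.println(enc.encode("bar"));  // 1
        System.out.println(enc.encode("foo"));  // 0
        System.out.println(enc.getDictionary().size());  // 2
    }
}
```

With this scheme, the serialized message would carry the dictionary once plus one small integer per TPathEntry, so the cost of a repeated string is paid only on its first occurrence.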
Attachments
Issue Links
- is related to SENTRY-1915: Sentry is doing a lot of work to convert list of paths to HMSPaths structure (Resolved)
- relates to SENTRY-1927: PathImageRetriever should minimize size of the serialized message when creating path dumps (Resolved)