We obtained a heap dump from the JVM running Hive Metastore, taken while a Sentry HDFS sync operation was in progress. I analyzed this dump with jxray (www.jxray.com) and found that a significant percentage of memory is wasted on duplicate strings:
More than 1/3 of these duplicates come from Sentry:
The duplicate strings in memory have been eliminated by SENTRY-1811. However, when these strings are serialized into the TPathsDump thrift message, they are duplicated again. That is, if there are 3 different TPathEntry objects with the same pathElement="foo", then (even if there is only one interned copy of the "foo" string in memory) a separate copy of "foo" is written to the serialized message for each of these 3 TPathEntries. This is one reason why serialized TPathsDump messages may get very big, consume a lot of memory, and take a long time to send over the network.
To address this problem, we could use some form of custom compression: instead of writing multiple copies of a duplicate string, we would write each distinct string once and substitute the repeats with shorter "string ids".
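
The "string ids" idea can be illustrated with a minimal dictionary-encoding sketch. This is only an illustration under assumptions, not the actual Sentry code or the TPathsDump thrift schema: the class StringDictionarySketch, the Encoded holder, and the encode/decode methods are all hypothetical names. Each distinct path element is stored once in a dictionary, and every occurrence is replaced by its integer index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not part of Sentry and not the real TPathsDump layout.
final class StringDictionarySketch {

    /** Encoded form: each distinct string once, plus one id per original element. */
    static final class Encoded {
        public final List<String> dictionary; // distinct strings, indexed by id
        public final int[] ids;               // ids[i] refers to dictionary.get(ids[i])

        Encoded(List<String> dictionary, int[] ids) {
            this.dictionary = dictionary;
            this.ids = ids;
        }
    }

    /** Replaces every string with the index of its first occurrence in the dictionary. */
    static Encoded encode(List<String> pathElements) {
        Map<String, Integer> idByString = new HashMap<>();
        List<String> dictionary = new ArrayList<>();
        int[] ids = new int[pathElements.size()];
        for (int i = 0; i < ids.length; i++) {
            String s = pathElements.get(i);
            Integer id = idByString.get(s);
            if (id == null) {
                // First time we see this string: assign the next free id.
                id = dictionary.size();
                dictionary.add(s);
                idByString.put(s, id);
            }
            ids[i] = id;
        }
        return new Encoded(dictionary, ids);
    }

    /** Restores the original sequence of strings from the dictionary and ids. */
    static List<String> decode(Encoded encoded) {
        List<String> out = new ArrayList<>(encoded.ids.length);
        for (int id : encoded.ids) {
            out.add(encoded.dictionary.get(id));
        }
        return out;
    }

    public static void main(String[] args) {
        // Three entries share pathElement "foo", as in the example above.
        List<String> elements = Arrays.asList("foo", "bar", "foo", "foo");
        Encoded e = encode(elements);
        System.out.println(e.dictionary + " " + Arrays.toString(e.ids));
        System.out.println(decode(e).equals(elements));
    }
}
```

With this layout, "foo" is written once no matter how many entries reference it, and each additional reference costs one integer. Whether this actually shrinks the message depends on how the ids are encoded on the wire and on how often path elements repeat, so the saving would need to be measured on real dumps.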