SENTRY-1827: Minimize TPathsDump thrift message used in HDFS sync

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.0, 2.0.0
    • Fix Version/s: 1.8.0, 2.0.0
    • Component/s: None
    • Labels: None

      Description

      We obtained a heap dump from the JVM running Hive Metastore while a Sentry HDFS sync operation was in progress. I analyzed this dump with jxray (www.jxray.com) and found that a significant percentage of memory is wasted on duplicate strings:

      7. DUPLICATE STRINGS
      
      Total strings: 29,986,017  Unique strings: 9,640,413  Duplicate values: 4,897,743  Overhead: 2,570,746K (9.4%)
      

      More than a third of this overhead comes from Sentry:

        917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing arrays:
           <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--  {j.u.HashMap}.values <-- org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
      

      The duplicate strings in memory have been eliminated by SENTRY-1811. However, when these strings are serialized into the TPathsDump thrift message, they are duplicated again. That is, if there are 3 different TPathEntry objects with the same pathElement="foo", then a separate copy of "foo" is written to the serialized message for each of these 3 TPathEntries, even if there is only one interned copy of the "foo" string in memory. This is one reason why serialized TPathsDump messages may get very big, consume a lot of memory, and take a long time to send over the network.

      To address this problem, we may use some form of custom compression in which duplicate strings are not written multiple times but are instead replaced with shorter "string ids".
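
      As a rough illustration of the "string ids" idea, here is a minimal sketch of dictionary encoding. It is not taken from the attached patches and may differ from what they do: the class name StringDictionary and all of its methods are hypothetical, and the Thrift structs are left out so the sketch stands alone. Each distinct path element gets a small integer id, the entries carry ids, and the table of unique strings is written once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Sentry): maps each distinct string to a small integer id. */
final class StringDictionary {
    private final Map<String, Integer> idsByValue = new HashMap<>();
    private final List<String> valuesById = new ArrayList<>();

    /** Returns the id for the given string, assigning the next free id the first time it is seen. */
    int idFor(String value) {
        Integer id = idsByValue.get(value);
        if (id == null) {
            id = valuesById.size();
            idsByValue.put(value, id);
            valuesById.add(value);
        }
        return id;
    }

    /** Copy of the table of unique strings; this is what would be written once per message. */
    List<String> table() {
        return new ArrayList<>(valuesById);
    }

    /** Replaces every path element with its id, so repeated strings cost one int each. */
    int[] encode(List<String> pathElements) {
        int[] ids = new int[pathElements.size()];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = idFor(pathElements.get(i));
        }
        return ids;
    }

    /** Receiving side: rebuilds the original path elements from the ids and the received table. */
    static List<String> decode(int[] ids, List<String> table) {
        List<String> out = new ArrayList<>(ids.length);
        for (int id : ids) {
            out.add(table.get(id));
        }
        return out;
    }
}
```

      With this scheme, three entries that share pathElement="foo" would contribute one copy of "foo" to the table plus three integer ids, instead of three copies of the string. The saving grows with the ratio of duplicate to unique strings, which in the heap dump above is roughly 20 to 1 for the Sentry path elements.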

        Attachments

        1. SENTRY-1827.04-sentry-ha-redesign.patch
          40 kB
          Arjun Mishra
        2. SENTRY-1827.04.patch
          38 kB
          Misha Dmitriev
        3. SENTRY-1827.03-sentry-ha-redesign.patch
          41 kB
          Arjun Mishra
        4. SENTRY-1827.03.patch
          34 kB
          Misha Dmitriev
        5. SENTRY-1827.02-sentry-ha-redesign.patch
          40 kB
          Arjun Mishra
        6. SENTRY-1827.02.patch
          29 kB
          Misha Dmitriev
        7. SENTRY-1827.01-sentry-ha-redesign.patch
          39 kB
          Alexander Kolbasov
        8. SENTRY-1827.01.patch
          27 kB
          Misha Dmitriev

              People

              • Assignee: Misha Dmitriev (misha@cloudera.com)
              • Reporter: Misha Dmitriev (misha@cloudera.com)