  1. Sentry (Retired)
  2. SENTRY-1827

Minimize TPathsDump thrift message used in HDFS sync

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.0, 2.0.0
    • Fix Version/s: 1.8.0, 2.0.0
    • Component/s: None
    • Labels: None

    Description

      We obtained a heap dump from the JVM running Hive Metastore while a Sentry HDFS sync operation was in progress. I analyzed this dump with jxray (www.jxray.com) and found that a significant percentage of memory is wasted on duplicate strings:

      7. DUPLICATE STRINGS
      
      Total strings: 29,986,017  Unique strings: 9,640,413  Duplicate values: 4,897,743  Overhead: 2,570,746K (9.4%)
      

      More than 1/3 of that overhead comes from Sentry:

        917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing arrays:
           <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--  {j.u.HashMap}.values <-- org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
      

      SENTRY-1811 eliminated the duplicate strings in memory. However, when these strings are serialized into the TPathsDump Thrift message, they are duplicated again. That is, if 3 different TPathEntry objects share the same pathElement="foo", a separate copy of "foo" is written to the serialized message for each of them, even though only one interned copy of "foo" exists in memory. This is one reason why serialized TPathsDump messages can get very big, consume a lot of memory, and take a long time to send over the network.

      To address this problem, we could use some form of custom compression: instead of writing multiple copies of duplicate strings, write each distinct string once and substitute the repeats with shorter "string ids".
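
      The "string ids" idea can be illustrated with a minimal dictionary-encoding sketch. It is only an illustration, not the code from the attached patches: the StringDictionary class, its Encoded holder, and the encode/decode methods are hypothetical names, and a plain list of path elements stands in for the pathElement values held by the TPathEntry objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of dictionary encoding for repeated path elements.
 * Each distinct string is stored once; every occurrence becomes an int id.
 * This is NOT the Sentry implementation, only an illustration of the idea.
 */
final class StringDictionary {

    /** Encoded form: a table of distinct strings plus one id per input element. */
    static final class Encoded {
        final List<String> table; // id -> string, in first-seen order
        final int[] ids;          // ids[i] is the table index of input element i

        Encoded(List<String> table, int[] ids) {
            this.table = table;
            this.ids = ids;
        }
    }

    /** Replaces each string with the index of its first occurrence in the table. */
    static Encoded encode(List<String> pathElements) {
        Map<String, Integer> idByString = new HashMap<>();
        List<String> table = new ArrayList<>();
        int[] ids = new int[pathElements.size()];
        for (int i = 0; i < ids.length; i++) {
            String s = pathElements.get(i);
            Integer id = idByString.get(s);
            if (id == null) {
                // First time we see this string: assign the next free id.
                id = table.size();
                idByString.put(s, id);
                table.add(s);
            }
            ids[i] = id;
        }
        return new Encoded(table, ids);
    }

    /** Rebuilds the original list by looking each id up in the table. */
    static List<String> decode(Encoded encoded) {
        List<String> out = new ArrayList<>(encoded.ids.length);
        for (int id : encoded.ids) {
            out.add(encoded.table.get(id));
        }
        return out;
    }
}
```

      Under these assumptions, encoding the elements "foo", "bar", "foo", "foo" yields the table [foo, bar] and the ids [0, 1, 0, 0], so "foo" is written once instead of three times. Whether the saving is worthwhile depends on how often path elements repeat and on how the ids themselves are serialized.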

      Attachments

        1. SENTRY-1827.04-sentry-ha-redesign.patch
          40 kB
          Arjun Mishra
        2. SENTRY-1827.03-sentry-ha-redesign.patch
          41 kB
          Arjun Mishra
        3. SENTRY-1827.02-sentry-ha-redesign.patch
          40 kB
          Arjun Mishra
        4. SENTRY-1827.01-sentry-ha-redesign.patch
          39 kB
          Alex Kolbasov
        5. SENTRY-1827.04.patch
          38 kB
          Misha Dmitriev
        6. SENTRY-1827.03.patch
          34 kB
          Misha Dmitriev
        7. SENTRY-1827.02.patch
          29 kB
          Misha Dmitriev
        8. SENTRY-1827.01.patch
          27 kB
          Misha Dmitriev

            People

              Assignee: misha@cloudera.com Misha Dmitriev
              Reporter: misha@cloudera.com Misha Dmitriev