SENTRY-1892

Reduce memory consumption of HMSPaths$Entry and TPathEntry

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: Hdfs Plugin
    • Labels:
      None

      Description

      We recently used JXRay (www.jxray.com) to analyze some heap dumps from a NameNode running in a large HDFS installation with Sentry enabled. One dump is particularly interesting because it was taken while a full Sentry update was in progress. As a result, used heap was at its maximum: both the old HMSPaths$Entry tree of objects and the data for the new one, held in TPathEntry objects, were in memory at the same time.

      The old and new Sentry-related data take up a large portion of the heap, 7.9% and 12.9% respectively:

       ---- Object tree for GC root(s) Java Local@7f9c9a0b7808 (org.apache.sentry.hdfs.SentryAuthorizationInfo) ----
      
        2,302,963K (7.9%) (1 of org.apache.sentry.hdfs.SentryAuthorizationInfo)
           <-- Java Local@7f9c9a0b7808 (org.apache.sentry.hdfs.SentryAuthorizationInfo)
      ....
      
       ---- Object tree for GC root(s) Java Local@7f9c2b9138c8 (org.apache.sentry.hdfs.service.thrift.TPathsDump) ----
      
        3,760,229K (12.9%) (1 of org.apache.sentry.hdfs.service.thrift.TPathsDump)
           <-- Java Local@7f9c2b9138c8 (org.apache.sentry.hdfs.service.thrift.TPathsDump)
      ...
      

      Together these account for about 20.8% of the heap. Furthermore, the second portion (the data in TPathsDump) is mostly temporary: it creates a large memory spike, causes many extra GC pauses, and in the worst case may cause a crash due to OOM. It is therefore very desirable to reduce the memory used by these data structures.

      It appears that some of the data structures used here are suboptimal in terms of memory. Here is the list of things that can be fixed:

      1. TPathEntry.children and TPathEntry.authzObjs are both defined as sets in sentry_hdfs_service.thrift. In the Java code, they become HashSets. However, no real set operations (checking for an element, adding an element, and so on) are performed on them. Rather, they are used as simple lists from which the respective data structures in HMSPaths$Entry are initialized. HashSets are very uneconomical in terms of memory: HashSet is implemented on top of HashMap, so one HashMap$Entry object, taking 32-48 bytes, is created for each element. According to the class histogram in the dump, HashSets take 5.8% of the heap. Thus, if we replace sets with lists in TPathEntry, we can reduce heap usage substantially.
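      To illustrate why a list is enough here, the following is a minimal sketch of a consumer that only iterates over the children and authzObjs of each entry. DumpEntry, ListBackedDumpSketch and collectPaths are hypothetical stand-ins invented for this example; they are not the Thrift-generated TPathEntry or the actual Sentry code. The sketch assumes the IDL fields are changed from set<...> to list<...>, which the Thrift Java generator maps to java.util.List.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ListBackedDumpSketch {

    /** Hypothetical stand-in for a serialized path entry (not the real TPathEntry). */
    static final class DumpEntry {
        final String pathElement;
        final List<Integer> children;  // ids of child entries, kept as a plain list
        final List<String> authzObjs;  // authz object names, kept as a plain list

        DumpEntry(String pathElement, List<Integer> children, List<String> authzObjs) {
            this.pathElement = pathElement;
            this.children = children;
            this.authzObjs = authzObjs;
        }
    }

    /** Walks the dump from rootId and records "path -> authzObj" for every mapping found. */
    static List<String> collectPaths(Map<Integer, DumpEntry> nodes, int rootId) {
        List<String> out = new ArrayList<>();
        walk(nodes, rootId, "", out);
        return out;
    }

    private static void walk(Map<Integer, DumpEntry> nodes, int id, String prefix, List<String> out) {
        DumpEntry e = nodes.get(id);
        String path = prefix + "/" + e.pathElement;
        // Only iteration is needed; no contains() or deduplicating add() is ever called.
        for (String authzObj : e.authzObjs) {
            out.add(path + " -> " + authzObj);
        }
        for (int childId : e.children) {
            walk(nodes, childId, path, out);
        }
    }
}
```

      Because the consumer never performs a lookup, an ArrayList avoids the per-element entry object that a HashSet would allocate.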

      2. JXRay analysis for suboptimal collections shows the following:

      9. BAD COLLECTIONS
      
      Total collections: 40,324,452  Bad collections: 26,076,002  Overhead: 3,361,873K (11.6%)
      
      Top bad collections:
          Ovhd           Problem           Num objs      Type
      -------------------------------------------------------
      922,908K (3.2%)     1-elem      5133339 (54%)     j.u.HashSet
      646,707K (2.2%)     1-elem      3941834 (98%)     j.u.TreeSet
      459,824K (1.6%)     1-elem      1731283 (10%)     j.u.HashMap
      339,906K (1.2%)      empty      3625374 (38%)     j.u.HashSet
      282,265K (1.0%)      empty      3985194 (25%)     j.u.HashMap
      276,279K (1.0%)     1-elem      3926377 (55%)     j.u.ArrayList
      163,534K (0.6%)      small        572788 (3%)     j.u.HashMap
      138,729K (0.5%)      small        574613 (6%)     j.u.HashSet
      116,041K (0.4%)      small      2472638 (35%)     j.u.ArrayList
      
      ===================================================
      
      10. REFERENCE CHAINS FOR BAD COLLECTIONS
      
      Expensive data fields:
      
        901,846K (3.1%): j.u.HashMap: 1727607 / 27% of 1-elem 458,895K (1.6%), 3984640 / 62% of empty 280,170K (1.0%), 570069 / 8% of small 162,780K (0.6%)
           <-- org.apache.sentry.hdfs.HMSPaths$Entry.children
        656,117K (2.3%): j.u.TreeSet: 3941248 / 98% of 1-elem 646,611K (2.2%)
           <-- org.apache.sentry.hdfs.HMSPaths$Entry.authzObjs
      ...
      

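      That is, in the permanent Sentry data structures, HMSPaths$Entry.children maps and HMSPaths$Entry.authzObjs sets that hold a single element cause noticeable overhead (1.6% and 2.2% of the heap respectively), and empty children maps add another 1.0%. We can optimize these fields by declaring each one as a plain Object that is null when empty, references the single element directly when there is exactly one, and references a real collection only when there are two or more elements: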

      // Before:
        private List<Foo> fooList = new ArrayList<>();

        void addFoo(Foo foo) {
          fooList.add(foo);
        }

      // After, with an optimization for sizes 0 and 1:

        // null when empty, a Foo when there is one element, a List<Foo> otherwise
        private Object fooObjOrList;

        @SuppressWarnings("unchecked")
        void addFoo(Foo foo) {
          if (fooObjOrList == null) {
            // First element: store it directly, no collection allocated
            fooObjOrList = foo;
          } else if (fooObjOrList instanceof Foo) {
            // Second element: switch to a real list
            List<Foo> fooList = new ArrayList<>();
            fooList.add((Foo) fooObjOrList);
            fooList.add(foo);
            fooObjOrList = fooList;
          } else {
            ((List<Foo>) fooObjOrList).add(foo);
          }
        }
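      The snippet above covers only insertion. As a rough sketch of how the read side could look, here is a self-contained variant of the same idea for a set of strings such as authzObjs. The class name CompactStringSet and all of its methods are hypothetical and chosen for illustration; this is not the code from the attached patches. TreeSet is used for the multi-element case only because the heap dump shows authzObjs as a TreeSet.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical holder illustrating the null / single object / collection trick. */
final class CompactStringSet {
    // null = empty, String = exactly one element, Set<String> = two or more elements
    private Object objOrSet;

    /** Adds s; returns false if it was already present. Null is rejected because null encodes "empty". */
    boolean add(String s) {
        Objects.requireNonNull(s, "element must not be null");
        if (objOrSet == null) {
            objOrSet = s;                      // no collection allocated for one element
            return true;
        }
        if (objOrSet instanceof String) {
            if (objOrSet.equals(s)) {
                return false;                  // duplicate of the single element
            }
            Set<String> set = new TreeSet<>(); // upgrade to a real set on the second element
            set.add((String) objOrSet);
            set.add(s);
            objOrSet = set;
            return true;
        }
        return asSet().add(s);
    }

    boolean contains(String s) {
        if (objOrSet == null) {
            return false;
        }
        if (objOrSet instanceof String) {
            return objOrSet.equals(s);
        }
        return asSet().contains(s);
    }

    int size() {
        if (objOrSet == null) {
            return 0;
        }
        return (objOrSet instanceof String) ? 1 : asSet().size();
    }

    /** Read-only view of the elements, so callers can iterate without caring about the encoding. */
    Collection<String> values() {
        if (objOrSet == null) {
            return Collections.emptyList();
        }
        if (objOrSet instanceof String) {
            return Collections.singletonList((String) objOrSet);
        }
        return Collections.unmodifiableSet(asSet());
    }

    @SuppressWarnings("unchecked")
    private Set<String> asSet() {
        return (Set<String>) objOrSet;
    }
}
```

      Keeping the instanceof checks inside one small class means the rest of the code sees an ordinary collection-like API, at the cost of one extra type check per access.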
      

        Attachments

        1. SENTRY-1892.02.patch
          27 kB
          Misha Dmitriev
        2. SENTRY-1892.01.patch
          25 kB
          Misha Dmitriev

            People

            • Assignee:
              misha@cloudera.com Misha Dmitriev
              Reporter:
              misha@cloudera.com Misha Dmitriev