Sentry / SENTRY-872

Uber jira for HMS HA + Sentry HA redesign

    Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0, 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Hdfs Plugin
    • Labels:
      None
    Attachments
    1. Sentry-872_design_v2_1.pdf
      657 kB
      Alexander Kolbasov
    2. SENTRY-872_design_v2.pdf
      178 kB
      Colin P. McCabe
    3. SENTRY-872_design.pdf
      170 kB
      Colin P. McCabe
    4. SENTRY-872_dessign-v2.1.1.pdf
      448 kB
      Alexander Kolbasov
    5. SENTRY-872.0.patch
      118 kB
      Sravya Tirukkovalur
    6. SENTRY-872.pdf
      168 kB
      Sravya Tirukkovalur

      Issue Links

      1.
      Add metrics for isActive and isHA Sub-task Resolved Rahul Sharma  
       
      2.
      Implement Sentry leadership election Sub-task Resolved Colin P. McCabe  
       
      3.
      Store HMSPaths in Sentry DB to allow fast failover Sub-task Resolved Hao Hao  
       
      4.
      Upgrading SQL script for HMSPaths persistence Sub-task Resolved Hao Hao  
       
      5.
      Implement fencing required for active/standby Sub-task Resolved Colin P. McCabe  
       
      6.
      Add sentry specific test cases to use NotificationLog Sub-task Resolved Sravya Tirukkovalur  
       
      7.
      Adapt SentryMetaStorePostEventListener to write HMS notification logs Sub-task Resolved Sravya Tirukkovalur  
       
      8.
      Fix TestLeaderStatus#testRacingClients Sub-task Resolved Colin P. McCabe  
       
      9.
      Move SentryHDFSServiceClient code from hdfs-common into hdfs-service Sub-task Resolved Hao Hao  
       
      10.
      Integrate Fencer with SentryStore Sub-task Resolved Colin P. McCabe  
       
      11.
      The sentry client should retry RPCs if it gets a SentryStandbyException (SentryPolicyServiceClient - pool version) Sub-task Resolved Hao Hao  
       
      12.
      Move the HDFS code which lives inside the sentry daemon into sentry-provider Sub-task Resolved Hao Hao  
       
      13.
      Implement HMSFollower in Sentry service which reads the NotificationLog entries Sub-task Resolved Sravya Tirukkovalur  
       
      14.
      Handle updating HMSState for HDFS plugin in HMSFollower Sub-task Resolved Unassigned  
       
      15.
      Rework fetching Hive Paths state Sub-task Resolved Hao Hao  
       
      16.
      Add metrics for Sentry HA Sub-task Resolved Alexander Kolbasov  
       
      17.
      Make HiveConf and Hive client jars available to Sentry daemon Sub-task Resolved Vamsee Yarlagadda  
       
      18.
      Notify Sentry about HMS new notifications if low delay is desired Sub-task Resolved Unassigned  
       
      19.
      Changes to get the Fencer working with Oracle and MySQL Sub-task Resolved Rahul Sharma  
       
      20.
      Evict datanucleus second-level cache during activation Sub-task Resolved Alexander Kolbasov  
       
      21.
      [Test infra] Ability to start active and standby sentry service using InternalSentrySrv Sub-task Resolved Unassigned  
       
      22.
      [Test hook] Provide a hook to stop the active sentry service Sub-task Resolved Rahul Sharma  
       
      23.
      Support initializing state in Sentry service Sub-task Resolved Unassigned  
       
      24.
      JDO deadlocks while processing grant while a background thread processes Notificationlogs Sub-task Resolved Colin Ma  
       
      25.
      May want to disallow reads on Sentry passive Sub-task Resolved Alexander Kolbasov  
       
      26.
      Do not start up HMSFollower if hive is not using Sentry Sub-task Resolved Unassigned  
       
      27.
      Test TGT renewals in HMSFollower Sub-task Resolved Vamsee Yarlagadda  
       
      28.
      Only leader should follow HMS updates Sub-task Resolved Alexander Kolbasov  
       
      29.
      Test Sentry HA Tasks Sub-task Resolved Vamsee Yarlagadda  
       
      30.
      GenericServiceClient should support connection pools Sub-task Resolved Unassigned  
       
      31.
      Sentry should not serve requests until the full update processing is finished Sub-task Resolved Unassigned  
       
      32.
      Error during fencing table rename can disable master Sub-task Resolved Alexander Kolbasov  
       
      33.
      Fencing implementation in sentry-ha can create two fencing tables Sub-task Resolved Alexander Kolbasov  
       
      34.
      Store processed notification sequence ID in database Sub-task Resolved Hao Hao  
       
      35.
      Rebase sentry-ha-redesign branch on master Sub-task Resolved Unassigned  
       
      36.
      Ensure HMS point-in-time snapshot consistency Sub-task Resolved Hao Hao  
       
      37.
      Sentry clients should retry with another server when they get connection errors Sub-task Resolved Li Li  
       
      38.
      Disable fencing in Sentry store for Active/Active Sub-task Resolved Li Li  
       
      39.
      HMS plugin should wait until Sentry handles the update before continuing. Sub-task Resolved Unassigned  
       
      40.
      Renaming SQL script for HMSPaths persistence Sub-task Resolved kalyan kumar kalvagadda  
       
      41.
      Add feature flag for using NotificationLog Sub-task Resolved Hao Hao  
       
      42.
      Add option to use non pool model for sentry client Sub-task Resolved Li Li  
       
      43.
      HDFS Sync change for handling persisted Sentry delta or full updates Sub-task Resolved Hao Hao  
       
      44.
      HMS Follower thread should terminate when Sentry receives ^C Sub-task Resolved Alexander Kolbasov  
       
      45.
      HMS Follower should update HDFS plugin paths Sub-task Resolved Hao Hao  
       
      46.
      Refactor SentryStore transaction management to allow for extra TransactionBlocks for a single permission update Sub-task Resolved Hao Hao  
       
      47.
      Create schema for storing HMS path change and Sentry permission change. Sub-task Resolved Hao Hao  
       
      48.
      HMS Follower should store arriving HMS notifications Sub-task Resolved Hao Hao  
       
      49.
      Remove fencing support Sub-task Resolved Alexander Kolbasov  
       
      50.
      Add feature flag to allow stand-alone configuration without ZK Sub-task Resolved Unassigned  
       
      51.
      Make full Perm/Path snapshot available for NN plugin Sub-task Resolved Hao Hao  
       
      52.
      Refactor propagating logic for Perm/Path delta to NN plugin Sub-task Resolved Lei (Eddy) Xu  
       
      53.
      Upgrading SQL scripts for persist Perm/Path change Sub-task Resolved kalyan kumar kalvagadda  
       
      54.
      Support secure ZK configuration for leader election Sub-task Resolved Alexander Kolbasov  
       
      55.
      Provide pooled client connection model with HA Sub-task Resolved Alexander Kolbasov  
       
      56.
      Refactor ZK/Curator code Sub-task Resolved Alexander Kolbasov  
       
      57.
      Refactor SentryStore transaction to persist a single path transaction bundled with corresponding delta path change Sub-task Resolved Hao Hao  
       
      58.
      Backport SENTRY-1404 to Sentry-ha-redesign branch Sub-task Resolved Alexander Kolbasov  
       
      59.
      Implement NN client failover for Sentry HA Sub-task Resolved kalyan kumar kalvagadda  
       
      60.
      Implement client failover for Generic and NN clients Sub-task Resolved kalyan kumar kalvagadda  
       
      61.
      Hive tests failing for sentry-ha-redesign branch Sub-task Resolved Hao Hao  
       
      62.
      Define Thrift API for HMS to Sentry notification barrier Sub-task Resolved Alexander Kolbasov  
       
      63.
      Implement HMS Notification barrier on the server side Sub-task Resolved Alexander Kolbasov  
       
      64.
      Converting Sentry to a stateless service Sub-task Resolved Hao Hao  
       
      65.
      Purge MSentryPerm/PathChange tables Sub-task Resolved Lei (Eddy) Xu  
       
      66.
      HMSFollower should persist full HMS snapshot into SentryDB if there is not one. Sub-task Resolved Hao Hao  
       
      67.
      Add propagating logic for Perm/Path updates in Sentry service Sub-task Resolved Hao Hao  
       
      68.
      Fetch Hive Paths point-in-time full snapshot during Sentry startup Sub-task Resolved Hao Hao  
       
      69.
      Fix the secure HMS connection code in HMSFollower Sub-task Resolved Vamsee Yarlagadda  
       
      70.
      Incorrect usage of AuthzConfVars.AUTHZ_SERVER_NAME may prevent HS2 HA from working Sub-task Resolved Unassigned  
       
      71.
      HMSFollower to retry connecting to HMS upon connection loss Sub-task Resolved Vamsee Yarlagadda  
       
      72.
      Typo for notification log feature flag Sub-task Resolved Hao Hao  
       
      73.
      In HMSFollower, failing to catch an error causes the executor to halt Sub-task Resolved Hao Hao  
       
      74.
      Current MAuthzPathsMapping table definition may cause error 'Duplicate entry XX for key PRIMARY' Sub-task Resolved kalyan kumar kalvagadda  
       
      75.
      out of sequence error in HMSFollower Sub-task Resolved Alexander Kolbasov  
       
      76.
      Make HMSFollower initialDelay and run period configurable Sub-task Resolved Hao Hao  
       
      77.
      HMSFollower should not check isLoadMetastoreConfig when trying to connect to HMS Sub-task Resolved Vamsee Yarlagadda  
       
      78.
      Limit HMS connections only to the leader of the sentry servers Sub-task Resolved Vamsee Yarlagadda  
       
      79.
      Update SQL script of MSentryPathChange table to add a column for notification ID Sub-task Resolved kalyan kumar kalvagadda  
       
      80.
      Periodically purge Delta change tables. Sub-task Resolved Lei (Eddy) Xu  
       
      81.
      Cleanup creation of SentryStore and HMSFollower Sub-task Resolved Lei (Eddy) Xu  
       
      82.
      AutoIncrement ChangeID of MSentryPermChange/MSentryPathChange may be error-prone Sub-task Resolved Lei (Eddy) Xu  
       
      83.
      Initialize HMSFollower when sentry server actually starts Sub-task Resolved Na Li  
       
      84.
      Port SENTRY-1360 to sentry-ha-redesign Sub-task Resolved Alexander Kolbasov  
       
      85.
      TestHDFSIntegrationAdvanced timeouts on sentry-ha-redesign branch Sub-task Resolved Unassigned  
       
      86.
      HMSFollower should read current processed notification ID from database every time it runs Sub-task Resolved kalyan kumar kalvagadda  
       
      87.
      Expose current HMS notification ID as a Sentry gauge (metric) Sub-task Resolved Alexander Kolbasov  
       
      88.
      Provide HMSFollower healthcheck (metric) Sub-task Resolved Unassigned  
       
      89.
      Expose HMS data via Sentry web UI Sub-task Resolved Unassigned  
       
      90.
      Improve error reporting from FullUpdateInitializer Sub-task Resolved Alexander Kolbasov  
       
      91.
      HMSFollower shouldn't print the same value of notification ID multiple times Sub-task Resolved Na Li  
       
      92.
      FullUpdateInitializer#createInitialUpdate should not throw RuntimeException Sub-task Resolved Alexander Kolbasov  
       
      93.
      sentry-hdfs-dist should include sentry-core-common after refactor SentryHDFSServiceClientDefaultImpl Sub-task Resolved kalyan kumar kalvagadda  
       
      94.
      Add metrics to measure how much time to get Delta Path/Perm Updates Sub-task Resolved Alexander Kolbasov  
       
      95.
      MetastoreCacheInitializer is no longer used and should be removed Sub-task Resolved Jan Hentschel  
       
      96.
      Investigate use of EXPORT for replication for initial HMS snapshot Sub-task Resolved Sergio Peña  
       
      97.
      FullUpdateInitializer has a race condition in handling results list Sub-task Resolved Alexander Kolbasov  
       
      98.
      Port SENTRY-1489 to sentry-ha-redesign branch Sub-task Resolved Na Li  
       
      99.
      Port SENTRY-1548 to sentry-ha-redesign branch Sub-task Resolved kalyan kumar kalvagadda  
       
      100.
      FullUpdateInitializer can be more efficient Sub-task Resolved Alexander Kolbasov  
       
      101.
      Refactor HA components based on Sentry-852 Sub-task Resolved Unassigned  
       
      102.
      SQL changes needed for AUTHZ_PATH table Sub-task Resolved kalyan kumar kalvagadda  
       
      103.
      Move thrift waiters gauge away from SentryStore Sub-task Resolved Unassigned  
       
      104.
      HMSFollower should handle adding a view with empty path. Sub-task Resolved Na Li  
       
      105.
      Deprecate feature flag for enabling notification log Sub-task Resolved Alexander Kolbasov  
       
      106.
      Waiting for HMS notifications from Thrift should be interruptible Sub-task Resolved Alexander Kolbasov  
       
      107.
      Expose time spent creating the initial snapshot as a metric Sub-task Resolved Alexander Kolbasov  
       
      108.
      FullUpdateInitializer should not use preconditions to verify HMS data Sub-task Resolved Alexander Kolbasov  
       
      109.
      Do not start HMSFollower if Hive isn't configured Sub-task Resolved Na Li  
       
      110.
      Avoid randomizing the servers at client side based on configuration. Sub-task Resolved kalyan kumar kalvagadda  
       
      111.
      HMSFollower shouldn't call processNotificationEvents() unless there are events Sub-task Resolved Alexander Kolbasov  
       
      112.
      Enable TestHDFSIntegrationEnd2End.testEnd2End Sub-task Resolved Lei (Eddy) Xu  
       
      113.
      Disable HMSFollower when HMS integration is not enabled Sub-task Resolved Unassigned  
       
      114.
      HMSFollower doesn't need to save path info when HDFS sync is disabled Sub-task Resolved Sergio Peña  
       
      115.
      Sentry should emit log messages when it is ready to serve requests. Sub-task Resolved Na Li  
       
      116.
      TestSentryStore often fails in setup() Sub-task Resolved Na Li  
       
      117.
      Implement alternative HMS/Sentry synchronization Sub-task Resolved Unassigned  
       
      118.
      Improve retry handling for FullUpdateInitializer Sub-task Resolved Unassigned  
       
      119.
      Sentry HDFS Sync should survive in presence of bad paths objects Sub-task Resolved Alexander Kolbasov  
       
      120.
      Sentry HA Test: programmatic failover in a mini cluster env; also add some test data. Sub-task Resolved kalyan kumar kalvagadda  
       
      121.
      Unit test failures in TestSentryStore due to changeId miscount Sub-task Resolved Na Li
      Original Estimate: 96h; Remaining Estimate: 96h
       
      122.
      Create HMSFollower when SentryService.Start() is called Sub-task Resolved Na Li
      Original Estimate: 4h; Remaining Estimate: 4h
       
      123.
      HDFS e2e tests should wait for HMSFollower to start Sub-task Resolved Na Li  
       
      124.
      Remove old PoolClientInvocationHandler Sub-task Resolved Jan Hentschel  
       
      125.
      Include response status in TSentrySyncIDResponse Sub-task Resolved Alexander Kolbasov  
       
      126.
      PathsUpdate.parsePath() calls FileSystem.getDefaultUri() way too often Sub-task Resolved Alexander Kolbasov  
       
      127.
      Test secure ZK connections Sub-task Resolved Unassigned  
       
      128.
      Test concurrent roles/groups/privs operations Sub-task Resolved Unassigned  
       
      129.
      Create/Alter/Drop database/table should check corresponding property before drop privileges Sub-task Resolved Alexander Kolbasov  
       
      130.
      Generic service client should support Kerberos Sub-task Resolved kalyan kumar kalvagadda  
       
      131.
      SentryTransportFactory may use incorrect kerberos principal Sub-task Resolved Alexander Kolbasov  
       
      132.
      Inefficient connection management by retrying invocation handler Sub-task Resolved Alexander Kolbasov  
       
      133.
      HMSFollower doesn't handle INSERT operation Sub-task Resolved Sergio Peña  
       
      134.
      HMSFollower shouldn't create local hive during tests Sub-task Resolved Na Li  
       
      135.
      HMSFollower gets stuck once it fails to process a notification event Sub-task Resolved Na Li  
       
      136.
      Add HMSFollower per-operation metrics Sub-task Resolved Alexander Kolbasov  
       
      137.
      sql changes to store last notification-id processed Sub-task Resolved kalyan kumar kalvagadda  
       
      138.
      Passive nodes should still follow latest notification ID Sub-task Resolved Sergio Peña  
       
      139.
      Avoid using local hive meta store with wrong configuration Sub-task Resolved Na Li  
       
      140.
      Improve Sentry memory usage by interning object names Sub-task Resolved Alexander Kolbasov  
       
      141.
      Add Sentry HA e2e test back, accommodating the redesign Sub-task Resolved Unassigned  
       
      142.
      HMSFollower should detect when a full snapshot from HMS is required Sub-task Resolved Sergio Peña  
       
      143.
      Add log message for key store file path Sub-task Resolved Na Li  
       
      144.
      Notification IDs in SENTRY_HMS_NOTIFICATION_ID should be purged periodically Sub-task Resolved kalyan kumar kalvagadda  
       
      145.
      HMSFollower should check for leader status after each event processed Sub-task Resolved kalyan kumar kalvagadda  
       
      146.
      Fix the config string for server load balancing Sub-task Resolved kalyan kumar kalvagadda  
       
      147.
      CounterWait.update should be less strict Sub-task Resolved Alexander Kolbasov  
       
      148.
      HMSFollower should not persist empty full snapshot Sub-task Resolved kalyan kumar kalvagadda  
       
      149.
      Generic model clients using kerberos can no longer connect to Sentry server Sub-task Resolved kalyan kumar kalvagadda  
       
      150.
      Multiple followers should not create full snapshot Sub-task Resolved Na Li  
       
      151.
      Avoid detaching object on transaction exit when it isn't needed Sub-task Resolved Alexander Kolbasov  
       
      152.
      Refactor HMSFollower Class Sub-task Resolved kalyan kumar kalvagadda  
       
      153.
      Avoid more detaches on commit Sub-task Resolved Alexander Kolbasov  
       
      154.
      HDFS client concurrently requests full permission update multiple times Sub-task Resolved Alexander Kolbasov  
       
      155.
      Permissions created before table creation are not reflected in HDFS ACLs Sub-task Resolved Alexander Kolbasov  
       
      156.
      Delta change cleaner should leave way more than a single entry intact Sub-task Resolved Alexander Kolbasov  
       
      157.
      HMSFollower should always depend on persisted information to decide if a full snapshot is needed Sub-task Resolved kalyan kumar kalvagadda  
       
      158.
      SentryStore should clear SENTRY_HMS_NOTIFICATION_ID while clearing store Sub-task Resolved kalyan kumar kalvagadda  
       
      159.
      Generic clients are not able to connect to sentry server with kerberos enabled. Sub-task Resolved kalyan kumar kalvagadda  
       
      160.
      FullUpdateInitializer does not kill the threads whenever getFullHMSSnapshot throws an exception Sub-task Resolved Alexander Kolbasov  
       
      161.
      Persist new HMS snapshots with a new generation ID. Sub-task Resolved Sergio Peña  
       
      162.
      Add an HMS image ID to the thrift schema definition for hdfs/sentry requests Sub-task Resolved Sergio Peña  
       
      163.
      DBUpdateForwarder returns empty update list to HDFS instead of full update Sub-task Resolved Sergio Peña  
       
      164.
      Sentry Clients should not log every connection request Sub-task Resolved Alexander Kolbasov  
       
      165.
      Sentry Clients failover not working with kerberos enabled Sub-task Resolved kalyan kumar kalvagadda  
       
      166.
      HMSFollower does not handle view update correctly Sub-task Resolved Na Li  
       
      167.
      Ensure DB sorts delta changes by CHANGE_ID Sub-task Resolved Vamsee Yarlagadda  
       
      168.
      HMSFollower not persisting last processed notifications when partition is altered Sub-task Resolved kalyan kumar kalvagadda  
       
      169.
      Reenable ignored unit tests from TestHDFSIntegrationEnd2End Sub-task Resolved Vamsee Yarlagadda  
       
      170.
      Delta tables should not have holes Sub-task Resolved Lei (Eddy) Xu
      Original Estimate: 168h; Remaining Estimate: 168h
       
      171.
      Add better debug logging for retrieving the delta changes Sub-task Resolved Vamsee Yarlagadda  
       
      172.
      Provide names for HMSFollower and cleaner threads Sub-task Resolved Alexander Kolbasov  
       
      173.
      Fix flaky HDFS END2END tests Sub-task Resolved kalyan kumar kalvagadda  
       
      174.
      Flaky testConcurrentUpdateChanges test Sub-task Resolved Alexander Kolbasov  
       
      175.
      HMSFollower should handle the case of multiple notifications with the same ID Sub-task Resolved Sergio Peña  
       
      176.
      Sentry server can be more efficient in handling full snapshot from HMS Sub-task Resolved Alexander Kolbasov  
       
      177.
      Define a DB schema for HMS generation IDs Sub-task Resolved Sergio Peña  
       
      178.
      Improve memory handling for HDFS sync Sub-task Resolved Alexander Kolbasov  
       
      179.
      NotificationProcessor may put the wrong path in the update Sub-task Resolved Alexander Kolbasov
      Original Estimate: 72h; Remaining Estimate: 72h
       
      180.
      Provide unit test for LeaderStatusMonitor Sub-task Resolved Alexander Kolbasov  
       
      181.
      Send new HMS snapshots to HDFS requesting an old generation ID Sub-task Resolved Sergio Peña  
       
      182.
      Deprecate SENTRY_HA_ENABLED and all tests that use it Sub-task Resolved Na Li
      Original Estimate: 72h; Remaining Estimate: 72h
       
      183.
      HMSFollower should be a singleton Sub-task Resolved Alexander Kolbasov
      Original Estimate: 24h; Remaining Estimate: 24h
       
      184.
      Transactions could fail to commit to the database under load Sub-task Resolved Alexander Kolbasov  
       
      185.
      SentryStore may serialize transactions that rely on unique key Sub-task Resolved Na Li
      Original Estimate: 72h; Remaining Estimate: 72h
       
      186.
      Dropping a Hive database/table doesn't cleanup the permissions associated with it Sub-task Resolved Na Li  
       
      187.
      Rename version in sentry-ha-redesign branch to 2.0.0-SNAPSHOT Sub-task Resolved kalyan kumar kalvagadda  
       
      188.
      Sentry e2e tests should enable SentrySyncHMSNotificationsPostEventListener Sub-task Resolved Na Li  
       
      189.
      Expose current set of IDs as Sentry metrics Sub-task Resolved Alexander Kolbasov  
       
      190.
      Fix build failures when hive-authz2 profile is enabled. Sub-task Resolved kalyan kumar kalvagadda  
       
      191.
      Separate legacy sentry configs from sentry ha configs for api compatibility Sub-task Resolved Na Li  
       
      192.
      HMSFollower should handle notifications even if HDFS sync is disabled. Sub-task Closed Na Li  
       
      193.
      Revert HMSFollower refactoring change Sub-task Resolved kalyan kumar kalvagadda  
       
      194.
      Persisting HMS snapshot and the notification-id to database in the same transaction Sub-task Resolved Na Li  
       
      195.
      Try to use pool with idle connections first Sub-task Resolved Alexander Kolbasov  
       
      196.
      Sentry e2e tests are trying to test without notification log Sub-task Resolved Na Li  
       
      197.
      Sentry should handle the case of multiple notifications with the same ID Sub-task Resolved Sergio Peña  
       
      198.
      Support User level privileges for Sentry HA Sub-task Open Na Li  
       

        Activity

        Sravya Tirukkovalur added a comment:

        Attaching a design doc

        Sravya Tirukkovalur added a comment:

        Moving all unresolved JIRAs with fix version 1.7.0 to 1.8.0. Please change the fix version if you intend to make it into the 1.7.0 release.

        Colin P. McCabe added a comment:

        I uploaded a new design doc; take a look! It reflects our thinking about an active/standby high-availability design.

        Sravya Tirukkovalur added a comment (edited):

        Thanks for uploading the updated design doc, Colin P. McCabe! Some comments:
        1. In the section "HIVE-7973: Hive Replication Support", some text seems to be missing at the end.
        2. In the "Future work" section, "The HDFS Plugin Should Use Update Log IDs": in the current design, we apply deltas in the NN plugin. I do not believe we need to buffer deltas in the NN, as there is no reason to. So we may want to remove this section.
        3. We might want to add a section about a "Sentry passive with hot cache", which follows the active, versus a "Sentry passive with cold cache", which warms up only when it acquires leadership? I think we are leaning towards the former, which can serve requests with minimal downtime; that is, acquiring leadership should not take too long. But it might be better to state this explicitly so that we evaluate the alternatives thoroughly?
        4. There are some slight alternatives we might want to consider for propagating HMS updates to Sentry and the NN. In the proposed design, we will need to replicate the HMS <obj,path> information as well as its delta changes (add/delete <obj,path>) in the Sentry DB for the passive to follow. The other option is for the passive to talk directly to HMS to get these deltas. If the only motivation for replicating this in the Sentry DB is bringing the passive up to speed, I think the latter approach is preferable, as there is no real need to replicate both the info and the deltas? But another factor to consider is the full update. When Sentry restarts under the latter approach, we will have to trigger a full update from HMS. Without a proper snapshot solution in HMS, this means we will have to lock HMS writes for that period, so HMS is not available for writes until the full update completes.
        5. It would be useful to have a detailed protocol description, especially around what happens when different services restart and what in-memory state each service relies on.

        Let me know what you think and we can update the doc accordingly. Thanks!
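Point 4 hinges on when a follower can get by with deltas and when it must fall back to a full update. The Java sketch below illustrates one possible decision rule, assuming the follower persists the ID of the last HMS notification it processed and that HMS notification IDs are consecutive. Real notification logs may contain duplicate or skipped IDs, so treat this as an illustration only; `NotificationEvent`, `SyncAction` and `FollowerPlanner` are hypothetical names, not Sentry or Hive APIs.

```java
import java.util.List;

/** Hypothetical stand-in for one HMS notification-log entry. */
final class NotificationEvent {
  final long id;
  final String description;

  NotificationEvent(long id, String description) {
    this.id = id;
    this.description = description;
  }
}

/** What a follower should do on its next poll. */
enum SyncAction { FULL_SNAPSHOT, APPLY_DELTAS, NOTHING }

final class FollowerPlanner {
  /**
   * Decides how to catch up, given the last notification ID persisted in the
   * Sentry DB (0 means none was ever stored) and the events HMS returned with
   * IDs greater than that value, in ascending order.
   */
  static SyncAction plan(long lastProcessedId, List<NotificationEvent> newer) {
    if (lastProcessedId == 0) {
      // No persisted position: the only safe starting point is a full snapshot.
      return SyncAction.FULL_SNAPSHOT;
    }
    if (newer.isEmpty()) {
      return SyncAction.NOTHING;
    }
    if (newer.get(0).id > lastProcessedId + 1) {
      // HMS already purged events we never saw; deltas alone cannot recover.
      return SyncAction.FULL_SNAPSHOT;
    }
    return SyncAction.APPLY_DELTAS;
  }
}
```

For example, with a stored ID of 5, events 6 and 7 yield `APPLY_DELTAS`, while a first returned event of 9 yields `FULL_SNAPSHOT` because events 6 to 8 are no longer available.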

        Colin P. McCabe added a comment:

        2. In the "Future work" section, "The HDFS Plugin Should Use Update Log IDs": in the current design, we apply deltas in the NN plugin. I do not believe we need to buffer deltas in the NN, as there is no reason to. So we may want to remove this section.

        Hmm, maybe it was unclear. This section was about avoiding buffering deltas in the Sentry daemon, not about buffering the deltas in the NN itself.
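A minimal Java sketch of the idea in this reply, assuming the deltas live in a persisted table ordered by change ID (modeled here with an in-memory `TreeMap`): each NameNode plugin remembers only the last change ID it applied, and the daemon answers from the table instead of keeping per-client buffers. If the requested deltas have already been purged, the plugin is told to fetch a full image. `ChangeLog` and `UpdateResponse` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Reply to a NameNode plugin poll: either ordered deltas or "fetch a full image". */
final class UpdateResponse {
  final boolean fullUpdateRequired;
  final List<String> deltas;

  UpdateResponse(boolean fullUpdateRequired, List<String> deltas) {
    this.fullUpdateRequired = fullUpdateRequired;
    this.deltas = deltas;
  }
}

/**
 * In-memory stand-in for a persisted delta table keyed by change ID. Since
 * every change lives in the table, the daemon keeps no per-client buffer.
 */
final class ChangeLog {
  private final TreeMap<Long, String> changes = new TreeMap<>();
  private long nextId = 1;

  synchronized long append(String change) {
    long id = nextId++;
    changes.put(id, change);
    return id;
  }

  /** Drops every change whose ID is lower than keepFromId. */
  synchronized void purgeBefore(long keepFromId) {
    changes.headMap(keepFromId).clear();
  }

  /** Answers a plugin that has applied everything up to lastAppliedId. */
  synchronized UpdateResponse since(long lastAppliedId) {
    long wanted = lastAppliedId + 1;
    if (wanted < nextId && (changes.isEmpty() || changes.firstKey() > wanted)) {
      // The needed deltas were purged; the plugin must start from a full image.
      return new UpdateResponse(true, new ArrayList<>());
    }
    return new UpdateResponse(false, new ArrayList<>(changes.tailMap(wanted, true).values()));
  }
}
```

Appending three changes and calling `since(1)` returns the second and third; after `purgeBefore(3)`, the same call reports that a full update is required.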

        3. We might want to add a section about a "Sentry passive with hot cache", which follows the active, versus a "Sentry passive with cold cache", which warms up only when it acquires leadership? I think we are leaning towards the former, which can serve requests with minimal downtime; that is, acquiring leadership should not take too long. But it might be better to state this explicitly so that we evaluate the alternatives thoroughly?

        We had a good discussion about this offline. Hao Hao suggested that we might be able to simplify the design if we were willing to load the cache after a failover. We also discussed whether the cache could be eliminated entirely. My general feeling is that eliminating the cache might be more work than it seems, but loading it on failover might be feasible. We are looking into it. This would avoid the need for the update log.
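The cold-cache alternative discussed here can be sketched as follows, assuming leader election calls `becomeActive()` and `becomeStandby()` on the node and that the loader reads the authoritative state from the shared DB. A standby rejects requests with an exception so the client can retry against another server. `StandbyException`, `FailoverServer` and the role-to-privileges map are hypothetical simplifications, not Sentry's classes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical signal telling a client to retry against another server. */
class StandbyException extends Exception {
  StandbyException(String msg) {
    super(msg);
  }
}

/**
 * Cold standby: the node keeps no cache while standby and rebuilds it from
 * the shared DB only when it wins leadership.
 */
final class FailoverServer {
  private final Supplier<Map<String, Set<String>>> loadFromDb;
  private volatile Map<String, Set<String>> cache = null;  // null means "not serving"

  FailoverServer(Supplier<Map<String, Set<String>>> loadFromDb) {
    this.loadFromDb = loadFromDb;
  }

  /** Called by leader election when this node becomes active. */
  void becomeActive() {
    // Load fully first, then publish with one volatile write, so no request
    // ever observes a half-built cache.
    cache = Collections.unmodifiableMap(new HashMap<>(loadFromDb.get()));
  }

  /** Called when leadership is lost; drop state so stale data is never served. */
  void becomeStandby() {
    cache = null;
  }

  Set<String> privilegesFor(String role) throws StandbyException {
    Map<String, Set<String>> snapshot = cache;
    if (snapshot == null) {
      throw new StandbyException("not active; retry another server");
    }
    return snapshot.getOrDefault(role, Collections.emptySet());
  }
}
```

The trade-off against a hot standby is visible in `becomeActive()`: failover time grows with the amount of state loaded from the DB, but standbys need no update log to stay current.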

        4. There are some slight alternatives we might want to consider for propagating HMS updates to Sentry and the NN. In the proposed design, we will need to replicate the HMS <obj,path> information as well as its delta changes (add/delete <obj,path>) in the Sentry DB for the passive to follow. The other option is for the passive to talk directly to HMS to get these deltas. If the only motivation for replicating this in the Sentry DB is bringing the passive up to speed, I think the latter approach is preferable, as there is no real need to replicate both the info and the deltas? But another factor to consider is the full update. When Sentry restarts under the latter approach, we will have to trigger a full update from HMS. Without a proper snapshot solution in HMS, this means we will have to lock HMS writes for that period, so HMS is not available for writes until the full update completes.

        Ultimately, the HIVE-7973 API delivers information that affects the global Sentry state, such as that a particular Hive table has been deleted or moved. It makes sense for the active Sentry daemon to reflect that state in the DB. The standby Sentry daemons don't need to use the HIVE-7973 API, since the DB is the source of truth for them. This keeps them all in sync and allows fairly rapid failovers.
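One way to read this argument in code: only the leader pulls HMS notifications and applies each one to the shared store together with its notification ID, while standbys never contact HMS and simply read the store. The Java sketch below assumes a `synchronized` method is an adequate stand-in for a database transaction; `HmsEvent`, `SharedStore`, `LeaderOnlyFollower` and the `fetchAfter` callback (standing in for reading the HMS notification log) are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;
import java.util.function.LongFunction;

/** Hypothetical HMS change: a table got a new path, or was dropped (path == null). */
final class HmsEvent {
  final long id;
  final String table;
  final String path;

  HmsEvent(long id, String table, String path) {
    this.id = id;
    this.table = table;
    this.path = path;
  }
}

/** Stand-in for the Sentry DB that every daemon, active or standby, reads. */
final class SharedStore {
  private final Map<String, String> tablePaths = new HashMap<>();
  private long lastEventId = 0;

  // Apply the event and advance the stored position together, mimicking one
  // DB transaction so the state and the position can never disagree.
  synchronized void apply(HmsEvent e) {
    if (e.path == null) {
      tablePaths.remove(e.table);
    } else {
      tablePaths.put(e.table, e.path);
    }
    lastEventId = e.id;
  }

  synchronized long lastEventId() {
    return lastEventId;
  }

  // What a standby (or any reader) sees: a copy of the current state.
  synchronized Map<String, String> snapshot() {
    return new HashMap<>(tablePaths);
  }
}

final class LeaderOnlyFollower {
  /** Runs one poll and returns how many events were applied. */
  static int poll(BooleanSupplier isLeader,
                  LongFunction<List<HmsEvent>> fetchAfter,
                  SharedStore store) {
    if (!isLeader.getAsBoolean()) {
      return 0;  // standbys never talk to HMS; the store is their source of truth
    }
    int applied = 0;
    for (HmsEvent e : fetchAfter.apply(store.lastEventId())) {
      if (!isLeader.getAsBoolean()) {
        break;  // leadership lost mid-batch: stop writing
      }
      store.apply(e);
      applied++;
    }
    return applied;
  }
}
```

Re-checking leadership before each event keeps a node that has just lost leadership from writing further changes, although without fencing a narrow race remains between the check and the write.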

        5. It would be useful to have a detailed protocol description, especially covering what happens when different services restart and what in-memory state each service relies on.

        Good point. We should add more detail here.

        Yibing Shi added a comment:

        HIVE-7973 hasn't been committed in Hive yet. Has this interface been used and tested widely? If not, I suggest not building the design on this immature/unstable interface. Instead, please consider implementing an election mechanism among Sentry clients so that only one client sends the full update to the active Sentry daemon. All Sentry clients should still be allowed to send delta changes (add/delete/alter table/partition) to the active Sentry daemon, though.

        I agree that standby Sentry daemons should not read <obj, path> information directly from HMS. A single source of truth (the DB) is a better design.

        Eliminating the cache is a good idea, as everything now comes from the DB.

        Dapeng Sun added a comment (edited):

        Thanks to Colin P. McCabe for the design.
        Here are my comments:
        To avoid the Hive Metastores sending a lot of duplicate information to the Sentry server, it would be good if the Hive side could provide replication support. Another solution would be to elect a leader among the HiveMetaListeners when the Hive Metastores start. In sync mode, the other listeners must wait for the leader to finish syncing the metadata to Sentry. Once the leader Metastore has started, the other Metastores can get back to work.

        About Sentry HA: the Sentry server service is currently stateless. It's better to keep it stateless rather than convert it to active/passive mode, since we plan to persist the Sentry update log to the database. I think Sentry servers could share state through ZooKeeper and the database.

        Sravya Tirukkovalur added a comment:

        Yibing Shi, HIVE-7973 comes in multiple parts: 1. HMS schema changes to include notification logs; 2. DbNotificationListener, which is essentially a postEventListener; 3. ReplicationTasks, which use these notification logs for replication.

        Regarding part 2: Sentry's plugin for HMS has a very similar post-event listener implementation (SentryMetastorePostEventListener). The only difference right now is that Sentry's listener sends these notifications as RPCs to the Sentry service, whereas DbNotificationListener persists the notifications in the DB.

        We think the persist-to-DB approach is better than RPC because it brings several advantages: 1. Delta changes are persisted, allowing clients to pull them (even after a Sentry HA failover). 2. It yields a global sequence of HMS updates when there is more than one writer (HMS HA). 3. It creates an opportunity to make the notification part of the transaction, so that the notification log is a true WAL with consistency guarantees.
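The second advantage (one global order across multiple HMS writers) can be sketched as follows. This is an illustration only, assuming the log assigns IDs at insert time; NotificationLog and its methods are hypothetical, and a synchronized in-memory list stands in for the notification log table that DbNotificationListener persists to.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the notification log table in the HMS database. */
final class NotificationLog {
    /** One persisted notification; its ID comes from the log, not from the writing HMS. */
    static final class Entry {
        final long eventId;
        final String writer;  // which HMS instance wrote it
        final String message; // e.g. "CREATE_TABLE db1.t1"

        Entry(long eventId, String writer, String message) {
            this.eventId = eventId;
            this.writer = writer;
            this.message = message;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private long nextId = 1;

    /** ID assignment and append happen under one lock, giving one global order across writers. */
    synchronized long append(String writer, String message) {
        long id = nextId++;
        entries.add(new Entry(id, writer, message));
        return id;
    }

    /** Consumers pull everything after the last ID they processed, e.g. after a Sentry failover. */
    synchronized List<Entry> eventsAfter(long lastSeenId) {
        List<Entry> out = new ArrayList<>();
        for (Entry e : entries) {
            if (e.eventId > lastSeenId) {
                out.add(e);
            }
        }
        return out;
    }
}

class NotificationLogDemo {
    public static void main(String[] args) throws InterruptedException {
        NotificationLog log = new NotificationLog();
        // Two HMS instances (HMS HA) writing concurrently.
        Thread hms1 = new Thread(() -> { for (int i = 0; i < 100; i++) log.append("hms1", "event " + i); });
        Thread hms2 = new Thread(() -> { for (int i = 0; i < 100; i++) log.append("hms2", "event " + i); });
        hms1.start();
        hms2.start();
        hms1.join();
        hms2.join();
        System.out.println(log.eventsAfter(0).size());   // 200, with IDs 1..200
        System.out.println(log.eventsAfter(150).size()); // 50
    }
}
```

In a real database the ID would typically come from a sequence or counter row updated in the writer's own transaction, which is also what makes advantage 3 (the log as a WAL) possible.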

        Given that, we have two options here: 1. Improve Sentry's plugin to persist to the notification log. 2. Improve the Hive plugin.

        We are leaning towards option 1, since there are also some Sentry-specific optimizations we might be able to make, such as capturing the location in the notification log. This avoids a second round trip to HMS when processing the notification log, and with it the consistency issues such a round trip would introduce.

        Also, Sentry's plugin has been running in production for a while now, whereas HIVE-7973 is fairly new, so it will be a while before it matures. We can always contribute the Sentry plugin back to Hive upstream if there is enough interest. What do you think?

        Sravya Tirukkovalur added a comment:

        Moving the JIRAs for HDFS sync implementation improvements to https://issues.apache.org/jira/browse/SENTRY-1314, so that we can continue focusing on the HA redesign here.

        Colin P. McCabe added a comment (edited):

        Thanks for checking out the design doc!

        Just to be clear, HIVE-7973 has been committed to Hive already. In fact, it is already part of Cloudera's CDH5.8 distribution. While it's true that a few subtasks remain open on the upstream JIRA, the same could be said of almost any Hadoop feature; we always have plans to improve things. We are planning to use HIVE-7973 for things besides Sentry HA; for example, it is useful for replicating the Hive database. That code will receive additional testing and attention because of these other uses. When using HIVE-7973, it doesn't matter which HMS process we talk to: both of them have access to the notification log stored in SQL. This lets us see what is going on in Hive, and in exactly what order it occurred, even when multiple HMS processes are involved, which is something we currently cannot do.

        With an active/active design, all the sentry daemons would have to request updates (or be sent updates) from the HMS. This is inefficient because it multiplies the RPC load on the HMS service. It is especially inefficient if we have 3 sentry daemons (for extra redundancy). It opens the door to divergence between sentry daemons, because some of the sentry daemons might receive updates from HMS earlier or later due to network conditions. If we are persisting the HMS updates in the Sentry SQL database, we must somehow choose which sentry daemon does the persisting. They can't all do it, because their updates would conflict. Choosing one sentry daemon to do the persistence is essentially equivalent to choosing a master.
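One common way to choose that single writer is a leader election in which every new term gets a higher epoch number, combined with a store that rejects writes carrying an older epoch (fencing), so a deposed daemon cannot keep persisting updates. The Java sketch below assumes exactly that scheme; LeaderElection and FencedStore are hypothetical in-memory stand-ins for a coordination service such as ZooKeeper and for the Sentry SQL database, not Sentry's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a coordination service (e.g. ZooKeeper) or a DB lease row. */
final class LeaderElection {
    private String leader = null;
    private long epoch = 0; // bumped for every new leader and used as a fencing token

    /** Returns the caller's epoch if it is (or just became) leader, or -1 if another daemon leads. */
    synchronized long tryAcquire(String candidate) {
        if (leader == null) {
            leader = candidate;
            return ++epoch;
        }
        return leader.equals(candidate) ? epoch : -1;
    }

    /** Models the leader's session expiring (crash, long GC pause, network partition). */
    synchronized void expire() {
        leader = null;
    }

    synchronized long currentEpoch() {
        return epoch;
    }
}

/** Stand-in for the Sentry DB: only writes tagged with the current epoch are accepted. */
final class FencedStore {
    private final LeaderElection election;
    private final List<String> rows = new ArrayList<>();

    FencedStore(LeaderElection election) {
        this.election = election;
    }

    /** The epoch check and the write are atomic here, as they would be inside one DB transaction. */
    boolean write(long epoch, String row) {
        synchronized (election) {
            if (epoch != election.currentEpoch()) {
                return false; // stale epoch: a newer leader exists, so reject the write
            }
            rows.add(row);
            return true;
        }
    }

    int size() {
        synchronized (election) {
            return rows.size();
        }
    }
}

class FencingDemo {
    public static void main(String[] args) {
        LeaderElection election = new LeaderElection();
        FencedStore store = new FencedStore(election);
        long a = election.tryAcquire("sentry-a");              // sentry-a leads, epoch 1
        System.out.println(election.tryAcquire("sentry-b"));   // -1: sentry-b stays standby
        store.write(a, "update from a");
        election.expire();                                     // sentry-a's session expires
        long b = election.tryAcquire("sentry-b");              // sentry-b takes over, epoch 2
        System.out.println(store.write(a, "late write from a")); // false
        System.out.println(store.write(b, "update from b"));     // true
    }
}
```

The epoch check is what turns "only one daemon should write" into a guarantee: even if the old leader has not yet noticed it lost leadership, its writes are refused once a successor holds a newer epoch.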

        The update log is useful for more than just implementing HA. It can be used as a generalized mechanism for synchronizing a cache. For example, the HDFS plugin can read the update log and apply its updates to keep the cache maintained in the NameNode process in sync with what is going on in Sentry. This is better than the current mechanism of buffering "deltas" in memory in the sentry daemon. The delta mechanism requires lots of heap memory, whereas the update log mechanism does not. Because the update log is stored in the SQL database, the HDFS plugin will be able to continue requesting update log entries even if the sentry service is restarted or has a failover. In contrast, the deltas buffered in memory will be lost if either of those events occurs. So in conclusion I would say that we do agree that sentry should move towards becoming stateless, and we view this design as a stepping stone towards that.
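A minimal sketch of that update-log pattern, assuming each log entry has an increasing ID and the NameNode plugin remembers the last ID it applied. DurableUpdateLog, SentryDaemon, and NameNodeCache are hypothetical names; an in-memory list stands in for the SQL table, and a Sentry restart or failover is modeled by creating a new SentryDaemon over the same log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One update-log row; path == null means the <obj,path> mapping was removed. */
final class LogEntry {
    final long id;
    final String obj;
    final String path;

    LogEntry(long id, String obj, String path) {
        this.id = id;
        this.obj = obj;
        this.path = path;
    }
}

/** Stand-in for the update log table in the Sentry SQL database; it outlives any daemon process. */
final class DurableUpdateLog {
    private final List<LogEntry> rows = new ArrayList<>();

    void append(String obj, String path) {
        rows.add(new LogEntry(rows.size() + 1, obj, path)); // IDs 1, 2, 3, ...
    }

    List<LogEntry> readAfter(long lastSeenId) {
        List<LogEntry> out = new ArrayList<>();
        for (LogEntry e : rows) {
            if (e.id > lastSeenId) {
                out.add(e);
            }
        }
        return out;
    }
}

/** Hypothetical Sentry daemon: keeps no in-memory deltas, it only serves reads from the log. */
final class SentryDaemon {
    private final DurableUpdateLog log;

    SentryDaemon(DurableUpdateLog log) {
        this.log = log;
    }

    List<LogEntry> getUpdatesAfter(long lastSeenId) {
        return log.readAfter(lastSeenId);
    }
}

/** Hypothetical NameNode-side cache: its only sync state is the last update ID it applied. */
final class NameNodeCache {
    final Map<String, String> paths = new HashMap<>();
    long lastSeenId = 0;

    void poll(SentryDaemon daemon) {
        for (LogEntry e : daemon.getUpdatesAfter(lastSeenId)) {
            if (e.path == null) {
                paths.remove(e.obj);
            } else {
                paths.put(e.obj, e.path);
            }
            lastSeenId = e.id;
        }
    }
}

class UpdateLogDemo {
    public static void main(String[] args) {
        DurableUpdateLog log = new DurableUpdateLog();
        NameNodeCache nn = new NameNodeCache();
        log.append("db1.t1", "/warehouse/db1.db/t1");
        nn.poll(new SentryDaemon(log));
        log.append("db1.t2", "/warehouse/db1.db/t2");
        // Sentry restarts or fails over: a different daemon process, same SQL-backed log.
        nn.poll(new SentryDaemon(log));
        System.out.println(nn.lastSeenId); // 2
    }
}
```

Because the daemon holds no per-client buffers, its heap use does not grow with the number of pending deltas, and nothing the plugin needs is lost when the daemon process goes away.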

        Sravya Tirukkovalur added a comment:

        Created a feature branch, sentry-ha-redesign, as per our discussion on the dev list.

        Colin Ma added a comment:

        Sravya Tirukkovalur, after the refactoring in SENTRY-1205 and SENTRY-1406, the project structure has changed a lot. I think it's necessary to rebase the sentry-ha-redesign branch on master. What do you think?

        Alexander Kolbasov added a comment (edited):

        I uploaded an updated design doc; please take a look! The major change is the shift from an active/passive to an active/active approach.

        Alexander Kolbasov added a comment:

        Colin Ma, actually, with some tweaks we can change detach-on-commit to false. I'll file a separate JIRA to investigate this.

        Alexander Kolbasov added a comment:

        Added an updated design document v2.1.1

        Alexander Kolbasov added a comment:

        All the changes for Sentry HA are done. I am going to update the design doc and resolve this JIRA.


          People

          • Assignee: Alexander Kolbasov
            Reporter: Sravya Tirukkovalur
          • Votes: 1
            Watchers: 15

              Time Tracking

              Estimated: 508h (original estimate)
              Remaining: 508h (remaining estimate)
              Logged: Not specified
