Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Sentry
    • Labels:
      None

      Description

      The current criteria for getting a full snapshot from HMS is insufficient. We need to get full snapshot every time when there is a gap between last stored notification and the smallest notification available from HMS.

      1. SENTRY-1760.5-sentry-ha-redesign.patch
        20 kB
        Sergio Peña
      2. SENTRY-1760.4-sentry-ha-redesign.patch
        16 kB
        Sergio Peña
      3. SENTRY-1760.3-sentry-ha-redesign.patch
        16 kB
        Sergio Peña
      4. SENTRY-1760.2-sentry-ha-redesign.patch
        15 kB
        Sergio Peña
      5. SENTRY-1760.1-sentry-ha-redesign.patch
        4 kB
        Sergio Peña

        Issue Links

          Activity

          Hide
          spena Sergio Peña added a comment -

          What cases should be considered as gaps between notifications? Probably:

          • Notifications that could not be processed due to errors.
          • HDFS sync is disabled.
          • Other ones?

          Btw, I see in the HMSFollower code this:

          if (!eventIDBefore.equals(eventIDAfter)) {
                    LOGGER.error("#### Fail to get a point-in-time hive full snapshot !! Current NotificationID = " +
                        eventIDAfter.toString());
                    return;
                  }
          

          What's the expected behavior? Does the user need to restart Sentry on the above situation? or should HMSFollower handle that; sleep and redo the initial snapshot automatically?

          Show
          spena Sergio Peña added a comment - What cases should be considered as gaps between notifications? Probably: Notifications that could not be processed due to errors. HDFS sync is disabled. Other ones? Btw, I see in the HMSFollower code this: if (!eventIDBefore.equals(eventIDAfter)) { LOGGER.error("#### Fail to get a point-in-time hive full snapshot !! Current NotificationID = " + eventIDAfter.toString()); return; } What's the expected behavior? Does the user need to restart Sentry on the above situation? or should HMSFollower handle that; sleep and redo the initial snapshot automatically?
          Hide
          LinaAtAustin Na Li added a comment -

          Sergio Peña the notification ID is updated in DB even when exception happens during processing the event. See Kalyan's update in https://reviews.apache.org/r/58975/diff/5#4 line 554-556

          The reason there is a gap between last stored notification in sentry store and the smallest notification available from HMS could be caused by sentry was down for a while, and HMS has removed some updates from meta store. There is no way to get those missing notifications in meta store, but missed by sentry. So sentry has to get full update.

          Show
          LinaAtAustin Na Li added a comment - Sergio Peña the notification ID is updated in DB even when exception happens during processing the event. See Kalyan's update in https://reviews.apache.org/r/58975/diff/5#4 line 554-556 The reason there is a gap between last stored notification in sentry store and the smallest notification available from HMS could be caused by sentry was down for a while, and HMS has removed some updates from meta store. There is no way to get those missing notifications in meta store, but missed by sentry. So sentry has to get full update.
          Hide
          akolb Alexander Kolbasov added a comment -

          HMSFollower is running every 1/2 a sec, so it will automatically retry next time.

          There may be multiple reasons for a gap - e.g. Sentry was down for a while.
          I am wondering whether we can also detect a case when HMS notification database was reset.

          Show
          akolb Alexander Kolbasov added a comment - HMSFollower is running every 1/2 a sec, so it will automatically retry next time. There may be multiple reasons for a gap - e.g. Sentry was down for a while. I am wondering whether we can also detect a case when HMS notification database was reset.
          Hide
          spena Sergio Peña added a comment -

          So, there are a few cases that will require a full snapshot:

          • sentry is down for a while
          • errros while processing notifications
          • HDFS sync is disabled
          • HMS DB is reset. I don't know how we would be able to detect if the HMS notification DB is reset. Perhaps the current HMS notification ID will be less than the already processed ID on Sentry, and if so a full snapshot is required? I don't trust on this, though.

          However, if there are different scenarios causing a gap, can we just force a full snapshot on every Sentry restart?

          Show
          spena Sergio Peña added a comment - So, there are a few cases that will require a full snapshot: sentry is down for a while errros while processing notifications HDFS sync is disabled HMS DB is reset. I don't know how we would be able to detect if the HMS notification DB is reset. Perhaps the current HMS notification ID will be less than the already processed ID on Sentry, and if so a full snapshot is required? I don't trust on this, though. However, if there are different scenarios causing a gap, can we just force a full snapshot on every Sentry restart?
          Hide
          akolb Alexander Kolbasov added a comment -

          Full snapshot is very expensive as it now requires stop the world for hive - we can't just do it willy-nilly, e.g. when Sentry crashed and decided to restart.

          The case where Sentry thinks it is in front of notifications is interesting - it most likely indicates that something is quite wrong and may be a good case for forcing a full snapshot.

          Show
          akolb Alexander Kolbasov added a comment - Full snapshot is very expensive as it now requires stop the world for hive - we can't just do it willy-nilly, e.g. when Sentry crashed and decided to restart. The case where Sentry thinks it is in front of notifications is interesting - it most likely indicates that something is quite wrong and may be a good case for forcing a full snapshot.
          Hide
          spena Sergio Peña added a comment -

          Alexander Kolbasov I finally got these decisions ideas on the HMSFollower that will require a full snapshot. One question first, should we allow forcing a snapshot while HMSFollower is running? or should this forcing state be saved on the Sentry store so we do a full snapshot on the next Sentry service restart?

          How and when to detect a full snapshot is required?

          During HMSFollower constructor:

          • No transactions have been processed yet (current situation).
          • Sentry store has a PARTIAL_SNAPSHOT state (Note: Any error while processing notifications should set PARTIAL_SNAPSHOT state on the Sentry store).

          During HMSFollower start:

          • Sentry latest notification ID > HMS latest notification ID.
          • HMS latest notification ID - Sentry latest notification ID has a gap of notifications (due to HMS notification cleanup).
            i.e. Sentry has id=40 and HMS has id=150. 150-40=110 notifications should be fetched from HMS, but only 90 are fetched.

          This is a new decision as well:

          • What if Sentry id 40 and HMS id 40 are different? Let's say HMS had a cleanup, and HMS did exactly 40 notifications after it. Sentry won't be in sync because even if it already processed 40 notifications before, the HMS has different metadata. This looks too unlikely, but should we take it into account?
          Show
          spena Sergio Peña added a comment - Alexander Kolbasov I finally got these decisions ideas on the HMSFollower that will require a full snapshot. One question first, should we allow forcing a snapshot while HMSFollower is running? or should this forcing state be saved on the Sentry store so we do a full snapshot on the next Sentry service restart? How and when to detect a full snapshot is required? During HMSFollower constructor: No transactions have been processed yet (current situation). Sentry store has a PARTIAL_SNAPSHOT state (Note: Any error while processing notifications should set PARTIAL_SNAPSHOT state on the Sentry store). During HMSFollower start: Sentry latest notification ID > HMS latest notification ID. HMS latest notification ID - Sentry latest notification ID has a gap of notifications (due to HMS notification cleanup). i.e. Sentry has id=40 and HMS has id=150. 150-40=110 notifications should be fetched from HMS, but only 90 are fetched. This is a new decision as well: What if Sentry id 40 and HMS id 40 are different? Let's say HMS had a cleanup, and HMS did exactly 40 notifications after it. Sentry won't be in sync because even if it already processed 40 notifications before, the HMS has different metadata. This looks too unlikely, but should we take it into account?
          Hide
          hadoopqa Hadoop QA added a comment -

          Here are the results of testing the latest attachment
          https://issues.apache.org/jira/secure/attachment/12867824/SENTRY-1760.1-sentry-ha-redesign.patch against sentry-ha-redesign.

          Overall: +1 all checks pass

          SUCCESS: all tests passed

          Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/2694/console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - Here are the results of testing the latest attachment https://issues.apache.org/jira/secure/attachment/12867824/SENTRY-1760.1-sentry-ha-redesign.patch against sentry-ha-redesign. Overall: +1 all checks pass SUCCESS: all tests passed Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/2694/console This message is automatically generated.
          Hide
          spena Sergio Peña added a comment -

          Added link to the review board.

          Show
          spena Sergio Peña added a comment - Added link to the review board.
          Hide
          LinaAtAustin Na Li added a comment -

          Sergio Peña Will this update include automatically detecting when the Name Node is out of sync with sentry, so sentry will send full snapshot to NN?

          Show
          LinaAtAustin Na Li added a comment - Sergio Peña Will this update include automatically detecting when the Name Node is out of sync with sentry, so sentry will send full snapshot to NN?
          Hide
          spena Sergio Peña added a comment -

          No, that part will be handled by SENTRY-1815

          Show
          spena Sergio Peña added a comment - No, that part will be handled by SENTRY-1815
          Hide
          hadoopqa Hadoop QA added a comment -

          Here are the results of testing the latest attachment
          https://issues.apache.org/jira/secure/attachment/12877685/SENTRY-1760.2-sentry-ha-redesign.patch against sentry-ha-redesign.

          Overall: -1 due to an error

          ERROR: failed to apply patch (exit code 1):
          The patch does not appear to apply with p0, p1, or p2

          Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3059/console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - Here are the results of testing the latest attachment https://issues.apache.org/jira/secure/attachment/12877685/SENTRY-1760.2-sentry-ha-redesign.patch against sentry-ha-redesign. Overall: -1 due to an error ERROR: failed to apply patch (exit code 1): The patch does not appear to apply with p0, p1, or p2 Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3059/console This message is automatically generated.
          Hide
          spena Sergio Peña added a comment -

          Rebase the patch to latest branch changes.

          Show
          spena Sergio Peña added a comment - Rebase the patch to latest branch changes.
          Hide
          hadoopqa Hadoop QA added a comment -

          Here are the results of testing the latest attachment
          https://issues.apache.org/jira/secure/attachment/12877823/SENTRY-1760.2-sentry-ha-redesign.patch against sentry-ha-redesign.

          Overall: +1 all checks pass

          SUCCESS: all tests passed

          Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3065/console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - Here are the results of testing the latest attachment https://issues.apache.org/jira/secure/attachment/12877823/SENTRY-1760.2-sentry-ha-redesign.patch against sentry-ha-redesign. Overall: +1 all checks pass SUCCESS: all tests passed Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3065/console This message is automatically generated.
          Hide
          hadoopqa Hadoop QA added a comment -

          Here are the results of testing the latest attachment
          https://issues.apache.org/jira/secure/attachment/12877862/SENTRY-1760.3-sentry-ha-redesign.patch against sentry-ha-redesign.

          Overall: -1 due to 2 errors

          ERROR: mvn test exited 1
          ERROR: Failed: org.apache.sentry.service.thrift.TestSentryHMSClient

          Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3070/console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - Here are the results of testing the latest attachment https://issues.apache.org/jira/secure/attachment/12877862/SENTRY-1760.3-sentry-ha-redesign.patch against sentry-ha-redesign. Overall: -1 due to 2 errors ERROR: mvn test exited 1 ERROR: Failed: org.apache.sentry.service.thrift.TestSentryHMSClient Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3070/console This message is automatically generated.
          Hide
          spena Sergio Peña added a comment -

          Fixed unit tests.

          Show
          spena Sergio Peña added a comment - Fixed unit tests.
          Hide
          hadoopqa Hadoop QA added a comment -

          Here are the results of testing the latest attachment
          https://issues.apache.org/jira/secure/attachment/12877881/SENTRY-1760.4-sentry-ha-redesign.patch against sentry-ha-redesign.

          Overall: +1 all checks pass

          SUCCESS: all tests passed

          Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3072/console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - Here are the results of testing the latest attachment https://issues.apache.org/jira/secure/attachment/12877881/SENTRY-1760.4-sentry-ha-redesign.patch against sentry-ha-redesign. Overall: +1 all checks pass SUCCESS: all tests passed Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3072/console This message is automatically generated.
          Hide
          akolb Alexander Kolbasov added a comment -

          I think it would be very useful to also have a counter for each distinct reason encountered for taking a full snapshot.

          Show
          akolb Alexander Kolbasov added a comment - I think it would be very useful to also have a counter for each distinct reason encountered for taking a full snapshot.
          Hide
          LinaAtAustin Na Li added a comment -

          I agree. I mentioned the similar idea to Sergio.

          Show
          LinaAtAustin Na Li added a comment - I agree. I mentioned the similar idea to Sergio.
          Hide
          spena Sergio Peña added a comment -

          Alexander Kolbasov Shouldn't be better to have a followup jira with the counter? or is the counter smaller enough to add it here?

          Show
          spena Sergio Peña added a comment - Alexander Kolbasov Shouldn't be better to have a followup jira with the counter? or is the counter smaller enough to add it here?
          Hide
          hadoopqa Hadoop QA added a comment -

          Here are the results of testing the latest attachment
          https://issues.apache.org/jira/secure/attachment/12878033/SENTRY-1760.5-sentry-ha-redesign.patch against sentry-ha-redesign.

          Overall: -1 due to 6 errors

          ERROR: mvn test exited 1
          ERROR: Failed: org.apache.sentry.tests.e2e.solr.TestUpdateOperations
          ERROR: Failed: org.apache.sentry.tests.e2e.solr.db.integration.TestSolrUpdateOperations
          ERROR: Failed: org.apache.sentry.tests.e2e.solr.db.integration.TestSolrUpdateOperations
          ERROR: Failed: org.apache.sentry.tests.e2e.solr.TestQueryOperations
          ERROR: Failed: org.apache.sentry.tests.e2e.solr.TestDocLevelOperations

          Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3075/console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - Here are the results of testing the latest attachment https://issues.apache.org/jira/secure/attachment/12878033/SENTRY-1760.5-sentry-ha-redesign.patch against sentry-ha-redesign. Overall: -1 due to 6 errors ERROR: mvn test exited 1 ERROR: Failed: org.apache.sentry.tests.e2e.solr.TestUpdateOperations ERROR: Failed: org.apache.sentry.tests.e2e.solr.db.integration.TestSolrUpdateOperations ERROR: Failed: org.apache.sentry.tests.e2e.solr.db.integration.TestSolrUpdateOperations ERROR: Failed: org.apache.sentry.tests.e2e.solr.TestQueryOperations ERROR: Failed: org.apache.sentry.tests.e2e.solr.TestDocLevelOperations Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3075/console This message is automatically generated.
          Hide
          spena Sergio Peña added a comment -

          The tests are Solr tests, and they are not related to the changes on this patch. I am resubmitting the patch to see if they fail again.

          Show
          spena Sergio Peña added a comment - The tests are Solr tests, and they are not related to the changes on this patch. I am resubmitting the patch to see if they fail again.
          Hide
          akolb Alexander Kolbasov added a comment -

          Doing this as a follow-up JIRA is fine. The code should be trivial but it can be done later once we have the actual change stabilized.

          Show
          akolb Alexander Kolbasov added a comment - Doing this as a follow-up JIRA is fine. The code should be trivial but it can be done later once we have the actual change stabilized.
          Hide
          hadoopqa Hadoop QA added a comment -

          Here are the results of testing the latest attachment
          https://issues.apache.org/jira/secure/attachment/12878043/SENTRY-1760.5-sentry-ha-redesign.patch against sentry-ha-redesign.

          Overall: -1 due to an error

          ERROR: mvn test exited 1

          Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3076/console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - Here are the results of testing the latest attachment https://issues.apache.org/jira/secure/attachment/12878043/SENTRY-1760.5-sentry-ha-redesign.patch against sentry-ha-redesign. Overall: -1 due to an error ERROR: mvn test exited 1 Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3076/console This message is automatically generated.
          Hide
          spena Sergio Peña added a comment -

          Now the tests passed, but the MVN failed for an unknown reason. Re-attaching to re-test again.

          Show
          spena Sergio Peña added a comment - Now the tests passed, but the MVN failed for an unknown reason. Re-attaching to re-test again.
          Hide
          hadoopqa Hadoop QA added a comment -

          Here are the results of testing the latest attachment
          https://issues.apache.org/jira/secure/attachment/12878051/SENTRY-1760.5-sentry-ha-redesign.patch against sentry-ha-redesign.

          Overall: +1 all checks pass

          SUCCESS: all tests passed

          Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3077/console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - Here are the results of testing the latest attachment https://issues.apache.org/jira/secure/attachment/12878051/SENTRY-1760.5-sentry-ha-redesign.patch against sentry-ha-redesign. Overall: +1 all checks pass SUCCESS: all tests passed Console output: https://builds.apache.org/job/PreCommit-SENTRY-Build/3077/console This message is automatically generated.
          Hide
          spena Sergio Peña added a comment -

          Alexander Kolbasov Tests run successfully. Could you commit now it is accepted on the RB?

          Show
          spena Sergio Peña added a comment - Alexander Kolbasov Tests run successfully. Could you commit now it is accepted on the RB?
          Hide
          akolb Alexander Kolbasov added a comment -

          Sergio Peña Thank you for your contribution!

          Show
          akolb Alexander Kolbasov added a comment - Sergio Peña Thank you for your contribution!

            People

            • Assignee:
              spena Sergio Peña
              Reporter:
              akolb Alexander Kolbasov
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development