Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.4, 8.0
    • Component/s: AutoScaling
    • Labels: None

    Description

      Some high-impact cluster changes (such as split shard) leave behind the original data and state, which are no longer actively used. This makes sense for safety reasons and makes it easier to roll back the changes.

      However, this unused data will accumulate over time, especially when actions like split shard are invoked automatically by the autoscaling framework. We need a periodic task that would clean up this kind of data after a certain period.

      Attachments

        1. SOLR-11670.patch
          26 kB
          Andrzej Bialecki
        2. SOLR-11670.patch
          54 kB
          Andrzej Bialecki
        3. SOLR-11670.patch
          85 kB
          Andrzej Bialecki

        Activity

          erickerickson Erick Erickson added a comment -

          Here's a way to generate dead nodes in 6x:

          Set up a single solr instance, starting it like this:
          bin/solr start -z localhost:2181 -p 8981 -s example/cloud/node1/solr

          Create a collection (one shard and one replica will do).

          Stop that instance and start another changing the port but keeping the same SOLR_HOME:

          bin/solr start -z localhost:2181 -p 8982 -s example/cloud/node1/solr

          (note, the port has changed, but the -s points to where the old core.properties file is located).

          legacyCloud needs to be set to true for this case.

          7x doesn't exhibit the behavior at all. I think the issue is that coreNodeName gets defined in core.properties and is used to update an existing znode if things like the base_url or node_name change.

          ab Andrzej Bialecki added a comment -

          Another candidate for periodic cleanup is async status ids. As of now, REQUESTSTATUS does not automatically clean up the tracking data structures, meaning the status of completed or failed tasks stays stored in ZooKeeper unless cleared manually.

          ab Andrzej Bialecki added a comment -

          Patch containing the following:

          • a MaintenanceTask API that components may use for registering tasks with Overseer. Tasks are initialized using key / value pairs from /clusterprops.json
          • a MaintenanceTasks component, which is a registry and runner of tasks. This component also monitors changes to /clusterprops.json and re-initializes tasks and their schedule as needed.
          • an implementation of InactiveSliceCleanupTask, which deletes inactive slices that have exceeded a configured TTL. This task is registered by SplitShardCmd.
          • changes to SplitShardTest to exercise this code.

          This code is also available on branch jira/solr-11670.
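A minimal Java sketch of the registry-and-runner idea described in the comment above. All names here (`MaintenanceTaskSketch`, `MaintenanceTaskRegistrySketch`, `CountingTaskSketch`, the `ttlSeconds` property) are hypothetical and chosen for illustration only; the actual classes in the patch are not reproduced here and may look quite different.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical task contract; the MaintenanceTask API in the patch may differ.
interface MaintenanceTaskSketch {
  String name();
  void init(Map<String, Object> props); // key/value pairs, e.g. taken from cluster properties
  void run();
}

// Hypothetical registry: holds tasks and re-initializes them when properties change.
class MaintenanceTaskRegistrySketch {
  private final Map<String, MaintenanceTaskSketch> tasks = new ConcurrentHashMap<>();

  void register(MaintenanceTaskSketch task, Map<String, Object> props) {
    task.init(props);
    tasks.put(task.name(), task);
  }

  // Would be called by whatever watches the properties for changes.
  void onPropsChanged(Map<String, Object> props) {
    for (MaintenanceTaskSketch task : tasks.values()) {
      task.init(props);
    }
  }

  // One pass over all tasks; a real runner would invoke this from a scheduler.
  // A failing task does not prevent the remaining tasks from running.
  int runAll() {
    int succeeded = 0;
    for (MaintenanceTaskSketch task : tasks.values()) {
      try {
        task.run();
        succeeded++;
      } catch (RuntimeException e) {
        // a real implementation would log the failure here
      }
    }
    return succeeded;
  }
}

// Trivial example task that reads a TTL from its properties and counts its runs.
class CountingTaskSketch implements MaintenanceTaskSketch {
  long ttlSeconds;
  int runs;

  public String name() { return "counting"; }

  public void init(Map<String, Object> props) {
    Object value = props.get("ttlSeconds"); // hypothetical property name
    ttlSeconds = value == null ? 3600L : ((Number) value).longValue();
  }

  public void run() { runs++; }
}
```

Re-running `init` on a property change keeps the sketch simple; a real component would also need to reschedule tasks whose interval changed.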
          ab Andrzej Bialecki added a comment - - edited

          shalin suggested another approach (thanks!) and this patch implements it - instead of a separate component for managing the tasks it reuses the ScheduledTrigger and custom trigger actions that implement maintenance tasks.

          This indeed simplifies the execution and management of these tasks, and reuses familiar concepts. The (small) downside is that it's less convenient to pre-register tasks that we know the cluster should run by default, but that's the same situation as with any default trigger (e.g. the autoAddReplicas trigger).

          ScheduledMaintenanceTriggerTest illustrates the registration and how the InactiveShardCleanupAction works.

          (The patch seems larger than before but that's due to some refactoring of common utility methods for waiting on collection state.)

          shalin Shalin Shekhar Mangar added a comment -

          Thanks Andrzej! A few comments:

          • Autoscaling.java – the constant AUTO_ADD_REPLICAS_LISTENER_NAME is not used anywhere
          • ExecutePlanAction – the counter appended to the asyncId has been removed. I agree it is not useful today, but it will be needed again when SOLR-11605 is implemented
          • InactiveShardCleanupAction – It is not safe to compare nanoTime values if the overseer leader changed between the shard split and the cleanup task. Perhaps we stick to currentTimeMillis() here?
          • Perhaps we should let InactiveShardCleanupAction only produce operations that are later executed by ExecutePlanAction? The advantage is that those operations (and others created by cleanup tasks in future) can be performed in parallel once SOLR-11605 comes in. But more importantly, an exception thrown in one cleanup action will not cause the action processing to be aborted. This point is moot, however, if you envision each future cleanup action having its own scheduled trigger (in which case we should increase the schedule interval for this one to something very large, e.g. once a day). It is also possible that future cleanup tasks do not produce solrj requests at all. I'm curious to know what you think about this.
          • Exceptions thrown from the delete shard API should also be added to context properties so that they can be sent to listeners
          • SimSolrCloudTestCase – many unused imports
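The nanoTime concern raised in the review can be illustrated with a short Java sketch. `System.nanoTime()` values are only meaningful within a single JVM, so a timestamp recorded by one overseer cannot be compared with a reading taken by another. The class and method names below (`InactiveShardTtlSketch`, `isExpired`) are hypothetical and are not taken from the patch; the sketch assumes the split time was stored as epoch milliseconds.

```java
import java.util.concurrent.TimeUnit;

class InactiveShardTtlSketch {
  // Returns true when the shard has been inactive for at least ttlSeconds.
  // Both timestamps are epoch milliseconds (as returned by System.currentTimeMillis()),
  // which stay comparable across JVMs, unlike System.nanoTime() readings.
  static boolean isExpired(long inactiveSinceEpochMs, long nowEpochMs, long ttlSeconds) {
    long ageMs = nowEpochMs - inactiveSinceEpochMs;
    return ageMs >= TimeUnit.SECONDS.toMillis(ttlSeconds);
  }
}
```

Wall-clock time can still drift between machines, so this trades nanosecond precision for cross-node comparability, which seems acceptable for a TTL measured in hours or days.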

          ab Andrzej Bialecki added a comment -

          Updated patch addressing the issues from review. Using ExecutePlanAction is a good idea - I changed InactiveShardCleanupAction to InactiveShardPlanAction, which only produces "operations" to be executed by ExecutePlanAction.
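A rough Java sketch of the plan-then-execute split discussed above: one step only decides which operations are needed, and a separate step runs them, recording failures in a context map instead of aborting. The types `OperationSketch` and `PlanAndExecuteSketch`, and the `failures` context key, are hypothetical; this does not reproduce how InactiveShardPlanAction or ExecutePlanAction are actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a request such as "delete this shard".
interface OperationSketch {
  String describe();
  void execute() throws Exception;
}

class PlanAndExecuteSketch {
  // Planning step: decides what should be done, but changes nothing yet.
  static List<OperationSketch> plan(List<String> expiredShards, Set<String> cluster) {
    List<OperationSketch> ops = new ArrayList<>();
    for (String shard : expiredShards) {
      ops.add(new OperationSketch() {
        public String describe() { return "DELETESHARD " + shard; }
        public void execute() throws Exception {
          if (!cluster.remove(shard)) {
            throw new IllegalStateException("no such shard: " + shard);
          }
        }
      });
    }
    return ops;
  }

  // Execution step: runs every operation. A failure is recorded in the context
  // (so it could be reported to listeners) and the remaining operations still run.
  static void execute(List<OperationSketch> ops, Map<String, Object> context) {
    List<String> failures = new ArrayList<>();
    for (OperationSketch op : ops) {
      try {
        op.execute();
      } catch (Exception e) {
        failures.add(op.describe() + ": " + e.getMessage());
      }
    }
    context.put("failures", failures);
  }
}
```

Because the plan is just a list of independent operations, an executor could later run them in parallel without changing the planning code.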

          shalin Shalin Shekhar Mangar added a comment -

          Looks great Andrzej! The only thing I don't understand is why we make a copy of triggerListeners in the processor. +1 to commit regardless of that.

          ab Andrzej Bialecki added a comment -

          why do we make a copy of triggerListeners in the processor

          If we change AutoScalingConfig while an event is being processed then we may get inconsistent listener notifications. I noticed this situation in tests where old events (from a previous test) would continue to be processed during the next test method, but they produced only some of the notifications one would normally expect (because the listeners had been cleared between e.g. BEFORE_ACTION and AFTER_ACTION).
          Under normal circumstances this shouldn't be a big deal, but in tests this is much more likely to happen.
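The reasoning for copying the listeners can be sketched in Java as follows. The class `ListenerSnapshotSketch` is hypothetical and only models the idea: the listener set is copied once per event, so every stage of that event notifies the same listeners even if the live configuration is cleared in between.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

class ListenerSnapshotSketch {
  // Live listener configuration; it may be changed or cleared at any time.
  final List<Consumer<String>> liveListeners = new CopyOnWriteArrayList<>();

  // Processes one event. The midEvent callback simulates a configuration
  // change that happens between the two notification stages.
  void process(Runnable midEvent) {
    // Copy taken once, at the start of event processing.
    List<Consumer<String>> snapshot = new ArrayList<>(liveListeners);
    for (Consumer<String> listener : snapshot) {
      listener.accept("BEFORE_ACTION");
    }
    midEvent.run();
    for (Consumer<String> listener : snapshot) {
      listener.accept("AFTER_ACTION");
    }
  }
}
```

Without the snapshot, a listener cleared between the two stages would see BEFORE_ACTION but never AFTER_ACTION, which matches the partial notifications described in the comment.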

          shalin Shalin Shekhar Mangar added a comment -

          Thanks for explaining!

          jira-bot ASF subversion and git services added a comment -

          Commit b17052e8520bb57bcfe126aa2f8e6bd0b9aa76c5 in lucene-solr's branch refs/heads/master from ab
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b17052e ]

          SOLR-11670: Implement a periodic house-keeping task.

          jira-bot ASF subversion and git services added a comment -

          Commit ed6feded6de7f1c268986df6de6a5dc9db6a3f34 in lucene-solr's branch refs/heads/master from ab
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ed6feded ]

          SOLR-11670: Use TimeSource's value of NOW consistently when parsing date math.
          Add a unit test for TimeSource's epochTime.
          dsmiley David Smiley added a comment -

          ab I think it's confusing that TimeSource.getTime & getEpochTime return nanoseconds. I think the methods should be renamed so that it's clear what unit it is, otherwise the conditions are ripe for continuing bugs in unit conversions. Perhaps rename getTime to getTimeNs.

          ab Andrzej Bialecki added a comment -

          it's confusing that TimeSource.getTime & getEpochTime return nanoseconds

          System.nanoTime returns time in ns, too ... but I see your point, we can rename these two methods to getTimeNs and getEpochTimeNs.
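As an illustration of the proposed naming, here is a small Java sketch in which the unit is part of each method name. `TimeSourceSketch` and `SystemTimeSourceSketch` are hypothetical types written for this example; they are not the actual Solr TimeSource class, and deriving the epoch value from `System.currentTimeMillis()` is an assumption made here.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical interface: the "Ns" suffix makes the unit visible at every call site.
interface TimeSourceSketch {
  long getTimeNs();      // monotonic reading, for measuring elapsed time within one JVM
  long getEpochTimeNs(); // wall-clock time since the Unix epoch, in nanoseconds
}

class SystemTimeSourceSketch implements TimeSourceSketch {
  public long getTimeNs() {
    return System.nanoTime();
  }

  public long getEpochTimeNs() {
    // Millisecond resolution scaled to nanoseconds (an assumption of this sketch).
    return TimeUnit.MILLISECONDS.toNanos(System.currentTimeMillis());
  }
}
```

A caller that needs milliseconds then has to convert explicitly, for example with `TimeUnit.NANOSECONDS.toMillis(source.getEpochTimeNs())`, which makes unit mix-ups easier to spot in review.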

          ab Andrzej Bialecki added a comment -

          I created SOLR-12091 to track the method renaming.

          jira-bot ASF subversion and git services added a comment -

          Commit 17cfd87a28275b58deeec95d1172ed8cec2d1304 in lucene-solr's branch refs/heads/branch_7x from ab
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=17cfd87 ]

          SOLR-11670: Implement a periodic house-keeping task.

          jira-bot ASF subversion and git services added a comment -

          Commit e1b0f796af9363b5496cc4ca8f17755f3c10e59b in lucene-solr's branch refs/heads/branch_7x from ab
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e1b0f79 ]

          SOLR-11670: Use TimeSource's value of NOW consistently when parsing date math.
          Add a unit test for TimeSource's epochTime.

          jira-bot ASF subversion and git services added a comment -

          Commit 25ec5cda0b2f4ce36366c56edc20e62e33040188 in lucene-solr's branch refs/heads/master from ab
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=25ec5cd ]

          SOLR-11670: Allow for larger diff in simulated time.

          jira-bot ASF subversion and git services added a comment -

          Commit 21e2915f0d7b64cec2e02280fb4035cf687165ec in lucene-solr's branch refs/heads/branch_7x from ab
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=21e2915 ]

          SOLR-11670: Allow for larger diff in simulated time.
          dsmiley David Smiley added a comment -

          ab you put this into the 7.3 section of CHANGES.txt but I think it belongs in 7.4?

          jira-bot ASF subversion and git services added a comment -

          Commit fa03a3843c7e81046398d03bb5d4f1eb78e43fcb in lucene-solr's branch refs/heads/master from ab
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fa03a38 ]

          SOLR-11670 This functionality was added in 7.4 and not in 7.3.0.

          jira-bot ASF subversion and git services added a comment -

          Commit 80d7b2ada369d3caac8f6c6d94ccf8b2683ab5d6 in lucene-solr's branch refs/heads/branch_7x from ab
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=80d7b2a ]

          SOLR-11670 This functionality was added in 7.4 and not in 7.3.0.

          ab Andrzej Bialecki added a comment -

          You're right dsmiley - fixed, thanks.

          jira-bot ASF subversion and git services added a comment -

          Commit ed2d3583300263fa6aff4ad41b262bb2c32ae01c in lucene-solr's branch refs/heads/master from ab
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ed2d358 ]

          SOLR-11670: Make sure defaults are applied in simulated cluster.

          jira-bot ASF subversion and git services added a comment -

          Commit f6319d6d0a80e5f82b26f6b340ad250618f6b565 in lucene-solr's branch refs/heads/branch_7x from ab
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f6319d6 ]

          SOLR-11670: Make sure defaults are applied in simulated cluster.

          People

            Assignee: ab Andrzej Bialecki
            Reporter: ab Andrzej Bialecki
            Votes: 0
            Watchers: 7