Uploaded image for project: 'Felix'
  1. Felix
  2. FELIX-961

100% CPU looping inside uses calculation

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: framework-1.4.1
    • Fix Version/s: framework-1.6.0
    • Component/s: Framework
    • Labels:
      None

      Description

      While investigating a problem report against pax-runner (http://article.gmane.org/gmane.comp.java.ops4j.general/6778) I found it was actually caused by a 100% CPU loop inside the "uses" calculation code. In Felix 1.4.1 this was stopping the shell bundle from activating, hence the lack of console. Using the trunk build I can get a console, but the looping still occurs with the testcase.

      1. USES_TESTCASE2.zip
        7.84 MB
        Stuart McCulloch
      2. USES_TESTCASE.zip
        5.08 MB
        Stuart McCulloch

        Activity

        Hide
        mcculls Stuart McCulloch added a comment -

        The attached testcase (Felix + config + bundles) reproduces the uses calculation problem while reducing the amount of bundles from the original testcase (which also included Spring-DM). Note the problem is a bit intermittent, but isn't rare.

        Adding more bundles to the testcase increases the chance of the problem occurring.

        Show
        mcculls Stuart McCulloch added a comment - The attached testcase (Felix + config + bundles) reproduces the uses calculation problem while reducing the amount of bundles from the original testcase (which also included Spring-DM). Note the problem is a bit intermittent, but isn't rare. Adding more bundles to the testcase increases the chance of the problem occurring.
        Hide
        mcculls Stuart McCulloch added a comment -

        Just to clarify, this is not technically a hang because the framework is still doing work - it can just take a long time to complete.
        For example, the attached testcase can take a few minutes before everything resolves and if more bundles are added then it
        takes much longer (perhaps exponential, but I don't have the evidence yet).

        Show
        mcculls Stuart McCulloch added a comment - Just to clarify, this is not technically a hang because the framework is still doing work - it can just take a long time to complete. For example, the attached testcase can take a few minutes before everything resolves and if more bundles are added then it takes much longer (perhaps exponential, but I don't have the evidence yet).
        Hide
        mcculls Stuart McCulloch added a comment -

        Larger testcase that takes a very long time to resolve.

        Show
        mcculls Stuart McCulloch added a comment - Larger testcase that takes a very long time to resolve.
        Hide
        bdelacretaz Bertrand Delacretaz added a comment -

        I have seen similar problems when working on the Sling jcrinstall module, framework seems to hang but is really only spending lots of time in the R4SearchPolicyCore.findConsistentClassSpace() method.

        Looks like my problem is more likely to happen with JDK 1.6 than with JDK 1.5: the test scenario below always passes with 1.5 but "hangs" almost every time with 1.6 (macosx, java versions "1.6.0-dp" and "1.5.0_16").

        Note that I'm still running V1.0.4 of the Felix framework for those tests, so not sure if it's the same problem.

        Here's my test scenario FWIW.

        Uses the Felix webconsole mounted at /system/console.

        1. Install about 80 bundles (can't share those, sorry) via the console, using a loop of CURL commands like

        curl -F action=install -Fbundlefile=@$f http://admin:admin@localhost:4502/system/console/bundles

        2. Try to start all bundles using a loop of CURL commands like

        curl -d action=start http://admin:admin@localhost:4502/system/console/bundles/N

        with N ranging from 20 (the number of bundles installed before 1.) to 150

        With JDK 1.6, step 2. very often hangs when the first bundle is started (HTTP response does not come), and jconsole shows the SCR Actor thread spending all its time inside R4SearchPolicyCore.findConsistentClassSpace()

        Show
        bdelacretaz Bertrand Delacretaz added a comment - I have seen similar problems when working on the Sling jcrinstall module, framework seems to hang but is really only spending lots of time in the R4SearchPolicyCore.findConsistentClassSpace() method. Looks like my problem is more likely to happen with JDK 1.6 than with JDK 1.5: the test scenario below always passes with 1.5 but "hangs" almost every time with 1.6 (macosx, java versions "1.6.0-dp" and "1.5.0_16"). Note that I'm still running V1.0.4 of the Felix framework for those tests, so not sure if it's the same problem. Here's my test scenario FWIW. Uses the Felix webconsole mounted at /system/console. 1. Install about 80 bundles (can't share those, sorry) via the console, using a loop of CURL commands like curl -F action=install -Fbundlefile=@$f http://admin:admin@localhost:4502/system/console/bundles 2. Try to start all bundles using a loop of CURL commands like curl -d action=start http://admin:admin@localhost:4502/system/console/bundles/N with N ranging from 20 (the number of bundles installed before 1.) to 150 With JDK 1.6, step 2. very often hangs when the first bundle is started (HTTP response does not come), and jconsole shows the SCR Actor thread spending all its time inside R4SearchPolicyCore.findConsistentClassSpace()
        Hide
        rickhall Richard S. Hall added a comment -

        It is certainly a known issue that finding a consistent class space may take a long time, since the algorithm is very simplistic and the search space is potentially very big. It should not hang or loop indefinitely, however. If it is, that sounds like a bug.

        Show
        rickhall Richard S. Hall added a comment - It is certainly a known issue that finding a consistent class space may take a long time, since the algorithm is very simplistic and the search space is potentially very big. It should not hang or loop indefinitely, however. If it is, that sounds like a bug.
        Hide
        rickhall Richard S. Hall added a comment -

        Looking at TESTCASE2, it does indeed lead to long delays, although it does appear to be making progress in all cases. Interestingly, some executions do finish quickly, which is dependent upon the ordering we get when we iterate over the entries in a hash map. In short, the algorithm is working as expected, which can be slow in some scenarios. I am looking into implementing some heuristics that may help us out, but worst case will always be bad unless we come up with a completely different algorithm. Hopefully, I can improve this case.

        Show
        rickhall Richard S. Hall added a comment - Looking at TESTCASE2, it does indeed lead to long delays, although it does appear to be making progress in all cases. Interestingly, some executions do finish quickly, which is dependent upon the ordering we get when we iterate over the entries in a hash map. In short, the algorithm is working as expected, which can be slow in some scenarios. I am looking into implementing some heuristics that may help us out, but worst case will always be bad unless we come up with a completely different algorithm. Hopefully, I can improve this case.
        Hide
        mcculls Stuart McCulloch added a comment -

        FWIW, I believe Equinox now attempts the correct algorithm for a certain period and then switches to a less accurate "best-effort" approach if it takes too long.

        Show
        mcculls Stuart McCulloch added a comment - FWIW, I believe Equinox now attempts the correct algorithm for a certain period and then switches to a less accurate "best-effort" approach if it takes too long.
        Hide
        rickhall Richard S. Hall added a comment -

        As a side note, from my understanding of how Equinox works, if it detects a resolve is taking too long, it just takes the best result it has so far and excludes any bundles that could not resolve and resolves the rest. This approach may make sense for Equinox because it resolves all unresolved bundles every time it resolves any, but Felix' resolver is incremental, meaning it only resolves the bundles it needs to resolve.

        Thus, in Felix if you resolve bundle foo, only bundles related to foo through dependencies will be resolved. There are no other byproducts. If foo fails to resolve, then there is no change to the resolved state. This is not true for Equinox. However, we could potentially modify Felix' resolver to simply fail if it starts to take too long, rather than run until it finds a solution or fails (which could take a long time).

        Keep in mind that Equinox' "best effort" approach is just a way to end a long resolve, it doesn't necessarily result in a solution for a particular bundle, it just results in whatever it could accomplish, which may or may not include the bundle you wanted to resolve.

        Show
        rickhall Richard S. Hall added a comment - As a side note, from my understanding of how Equinox works, if it detects a resolve is taking too long, it just takes the best result it has so far and excludes any bundles that could not resolve and resolves the rest. This approach may make sense for Equinox because it resolves all unresolved bundles every time it resolves any, but Felix' resolver is incremental, meaning it only resolves the bundles it needs to resolve. Thus, in Felix if you resolve bundle foo, only bundles related to foo through dependencies will be resolved. There are no other byproducts. If foo fails to resolve, then there is no change to the resolved state. This is not true for Equinox. However, we could potentially modify Felix' resolver to simply fail if it starts to take too long, rather than run until it finds a solution or fails (which could take a long time). Keep in mind that Equinox' "best effort" approach is just a way to end a long resolve, it doesn't necessarily result in a solution for a particular bundle, it just results in whatever it could accomplish, which may or may not include the bundle you wanted to resolve.
        Hide
        rickhall Richard S. Hall added a comment -

        I think TESTCASE2 was interesting, since it demonstrated that it is fairly easy to get into fairly long running "uses" constraint violations without a really complicated use case. In this particular case, the issue arose because bundles were overriding packages provided by the system bundle. This resulted in lots of imports that had two candidates, one from the system bundle and one from another bundle.

        This ended up being a worst case scenario since the system bundle export was chosen first, since it was resolved already and resolved packages are given preference. I noticed that some bundles eliminated the system bundle by including a version range on their imports, but others did not. If all bundles imported with an appropriate version range, then this issue would not have appeared, since there wouldn't have been so many imports with multiple candidates.

        As a result, I implemented two "fixes" for this:

        1. I noticed that some runs wouldn't take long depending on the order of the bundles when calculating uses constraints, which changed since they were pulled out of a map. If bundles with lots of potential candidates for their imports were handled first, it seemed to be quicker. Thus, I modified the resolver to first sort the bundles based the number of potential candidates they had. This change got the resolver completing consistently in about 15 seconds with testcase2.

        2. The algorithm for permutating from one set of potential candidates to another when a constraint violation is detected is exhaustive, but not very smart. In an effort to make it a little smarter, I thought of a way to make it permutate the candidates for a specific bundle when a constraint is detected without losing the ability to be exhaustive; now the resolver rotates the potential candidates for the bundle where the constraint violation was detected and retests. This change got the resolver completing consistently in about 1 second with testcase2.

        It is not clear if these fixes are general or specific to testcase2, but my intuition says they should provide general improvements. However, they have not fixed the worst case situation and it is still possible some set of bundles could cause the resolver to go on for a very long time trying to find a solution. The only solution here is to fail after a certain amount of time or to completely rewrite the resolver.

        Show
        rickhall Richard S. Hall added a comment - I think TESTCASE2 was interesting, since it demonstrated that it is fairly easy to get into fairly long running "uses" constraint violations without a really complicated use case. In this particular case, the issue arose because bundles were overriding packages provided by the system bundle. This resulted in lots of imports that had two candidates, one from the system bundle and one from another bundle. This ended up being a worst case scenario since the system bundle export was chosen first, since it was resolved already and resolved packages are given preference. I noticed that some bundles eliminated the system bundle by including a version range on their imports, but others did not. If all bundles imported with an appropriate version range, then this issue would not have appeared, since there wouldn't have been so many imports with multiple candidates. As a result, I implemented two "fixes" for this: 1. I noticed that some runs wouldn't take long depending on the order of the bundles when calculating uses constraints, which changed since they were pulled out of a map. If bundles with lots of potential candidates for their imports were handled first, it seemed to be quicker. Thus, I modified the resolver to first sort the bundles based the number of potential candidates they had. This change got the resolver completing consistently in about 15 seconds with testcase2. 2. The algorithm for permutating from one set of potential candidates to another when a constraint violation is detected is exhaustive, but not very smart. In an effort to make it a little smarter, I thought of a way to make it permutate the candidates for a specific bundle when a constraint is detected without losing the ability to be exhaustive; now the resolver rotates the potential candidates for the bundle where the constraint violation was detected and retests. This change got the resolver completing consistently in about 1 second with testcase2. It is not clear if these fixes are general or specific to testcase2, but my intuition says they should provide general improvements. However, they have not fixed the worst case situation and it is still possible some set of bundles could cause the resolver to go on for a very long time trying to find a solution. The only solution here is to fail after a certain amount of time or to completely rewrite the resolver.
        Hide
        rickhall Richard S. Hall added a comment -

        I forgot to add, I'd be interested in others reporting any results they have and for this issue to be closed if the original issue is resolved.

        Show
        rickhall Richard S. Hall added a comment - I forgot to add, I'd be interested in others reporting any results they have and for this issue to be closed if the original issue is resolved.
        Hide
        mcculls Stuart McCulloch added a comment -

        Just tried the latest framework compiled from trunk with the original testcase (~40 bundles)
        While it still takes some time resolve (around 1 minute) I think this is reasonable given this
        particular selection of imports, uses constraints, and version ranges.

        So I'll close this issue, but as Richard says - if anyone has more scenarios let us know

        Show
        mcculls Stuart McCulloch added a comment - Just tried the latest framework compiled from trunk with the original testcase (~40 bundles) While it still takes some time resolve (around 1 minute) I think this is reasonable given this particular selection of imports, uses constraints, and version ranges. So I'll close this issue, but as Richard says - if anyone has more scenarios let us know

          People

          • Assignee:
            rickhall Richard S. Hall
            Reporter:
            mcculls Stuart McCulloch
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development