Uploaded image for project: 'Commons Collections'
  1. Commons Collections
  2. COLLECTIONS-427

performance problem in SetUniqueList.retainAll()

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.1
    • Fix Version/s: 4.0-alpha1, 4.0, 4.1
    • Component/s: None
    • Labels:
      None
    • Environment:

      java 1.6.0_24
      Ubuntu 11.10

      Description

      I am encountering a performance problem in SetUniqueList.retainAll().
      It appears in version 3.2.1 and also in revision 1365132. I attached
      a test that exposes this problem and a patch that fixes it. On my
      machine, for this test, the patch provides a 621X speedup.

      To run the test, just do:

      $ java Test

      The output for the un-patched version is:
      Time is 6215

      The output for the patched version is:
      Time is 10

      There are two problems here. First, "SetUniqueList.retainAll()"
      should have similar implementation with the current implementation of
      "ListOrderedSet.retainAll()", which is more optimized. Second, even
      "ListOrderedSet.retainAll()" has a performance problem, which was
      reported and explained in detail in COLLECTIONS-426.

      The attached patch has two parts. The first part (the first loop) is
      inspired from COLLECTIONS-426. The second part (everything after the
      first loop) is in fact the current implementation of
      "ListOrderedSet.retainAll()", with some minor changes to adapt it for
      the current code. Overall, the attached patch is very similar to the
      COLLECTIONS-426 patch.

      I will rehash some of the information from COLLECTIONS-426 (which
      describes "ListOrderedSet.retainAll()") for the current
      "SetUniqueList.retainAll()".

      The current code for "SetUniqueList.retainAll()" is:

      public boolean retainAll(Collection<?> coll) {
          boolean result = super.retainAll(coll);
          set.retainAll(coll);
          return result;
      }
      

      where both "super.retainAll(coll)" and "set.retainAll(coll)" can have
      quadratic complexity, e.g., if "coll" is a List. Both these calls to
      "retainAll" are in fact calls to
      "java.util.AbstractCollection.retainAll()", which has the code:

      public boolean retainAll(Collection<?> c) {
          boolean modified = false;
          Iterator<E> e = iterator();
          while (e.hasNext()) {
              if (!c.contains(e.next())) {
                  e.remove();
                  modified = true;
              }
          }
          return modified;
      }
      

      which iterates over "this" and calls "contains()" on "c". Mapping
      this code back to "SetUniqueList.retainAll()" means that the code
      iterates over "this" and "set" and calls "contains()" on "coll". If
      "coll" has slow "contains()" (e.g., if "coll" is a list), then
      "SetUniqueList.retainAll()" has quadratic complexity.

      The patch iterates over "coll" and calls "contains()" on "set", which
      we know is fast, because "set" is a Set. For a more detailed
      discussion of the patch and the problem, see the current
      implementation of "ListOrderedSet.retainAll()", the discussion for
      COLLECTIONS-426, and the patch for COLLECTIONS-426.

      Is this a bug, or am I misunderstanding the intended behavior? If so,
      can you please confirm if the patch is correct?

      1. patch.diff
        1 kB
        Mert Guldur
      2. Test.java
        0.7 kB
        Mert Guldur

        Activity

        Hide
        tn Thomas Neidhart added a comment -

        This fix is wrong imho and should be reverted in 4.1

        Show
        tn Thomas Neidhart added a comment - This fix is wrong imho and should be reverted in 4.1
        Hide
        adriannistor Adrian Nistor added a comment -

        Hi Thomas,

        Oh, ok. Why is the fix wrong? What is the bug in this fix?

        Best,

        Adrian

        Show
        adriannistor Adrian Nistor added a comment - Hi Thomas, Oh, ok. Why is the fix wrong? What is the bug in this fix? Best, Adrian
        Hide
        tn Thomas Neidhart added a comment -

        The fix is technically correct but conceptually wrong.

        It tries to improve the worst-case scenario for a well-known method at the expense of additional space complexity.
        It many cases the additional temporary set is unneeded, especially when the provided collection has a fast contains method or the collection is small.

        A user can always solve the performance problem him/herself by putting the elements in a set and provide this as parameter to retainAll.

        The described problem also applies to almost all collection types as the default implementation of retainAll is provided in AbstractCollection and is unchanged for all the collections I know of. If you want to press forward your fix, I suggest to move it to the openjdk project and discuss it there, actually I would be interested on their opinions on this issue.

        Show
        tn Thomas Neidhart added a comment - The fix is technically correct but conceptually wrong. It tries to improve the worst-case scenario for a well-known method at the expense of additional space complexity. It many cases the additional temporary set is unneeded, especially when the provided collection has a fast contains method or the collection is small. A user can always solve the performance problem him/herself by putting the elements in a set and provide this as parameter to retainAll. The described problem also applies to almost all collection types as the default implementation of retainAll is provided in AbstractCollection and is unchanged for all the collections I know of. If you want to press forward your fix, I suggest to move it to the openjdk project and discuss it there, actually I would be interested on their opinions on this issue.
        Hide
        adriannistor Adrian Nistor added a comment -

        Hi Thomas,

        > A user can always solve the performance problem him/herself by
        > putting the elements in a set and provide this as parameter to
        > retainAll.

        Yes, but:

        (1) typically users spend (a lot) of time making the code work
        correctly. Users don't want to spend time with optimizations if they
        can avoid it.

        (2) users needs to identify this call as a potential optimization
        point, which is not easy if the buggy case is not triggered during
        testing. Furthermore, if the program-wide slowdown is small but
        non-negligible (e.g., 10%), users may view it as unfortunate but
        legitimate, i.e., users may not realize it can be improved.

        (3) the user needs figure out there is a easy and fast way to optimize
        this code. This requires the user to look inside the method
        implementation.

        No user will do the above, so it is better to do it automatically if
        we can. And in this case we can do it easily and transparently in the
        library. If we can optimize something, we should do it.

        > at the expense of additional space complexity.

        Yes. But typically we have memory available, speed is more difficult
        to get. And the additional space is linear in the size of c, whereas
        the time improvement is huge. And the added memory is less than half
        the memory used by the SetUniqueList and is very short lived. So the
        added memory is not a problem.

        > The described problem also applies to almost all collection types

        Yes, but "others do it too" is not a reason for us to not improve if
        we can.

        Overall, it is your decision. I just feel that if we can help the
        developers, we should do it. And if others don't do it, that's their
        problem, not ours.

        Best,

        Adrian

        Show
        adriannistor Adrian Nistor added a comment - Hi Thomas, > A user can always solve the performance problem him/herself by > putting the elements in a set and provide this as parameter to > retainAll. Yes, but: (1) typically users spend (a lot) of time making the code work correctly. Users don't want to spend time with optimizations if they can avoid it. (2) users needs to identify this call as a potential optimization point, which is not easy if the buggy case is not triggered during testing. Furthermore, if the program-wide slowdown is small but non-negligible (e.g., 10%), users may view it as unfortunate but legitimate, i.e., users may not realize it can be improved. (3) the user needs figure out there is a easy and fast way to optimize this code. This requires the user to look inside the method implementation. No user will do the above, so it is better to do it automatically if we can. And in this case we can do it easily and transparently in the library. If we can optimize something, we should do it. > at the expense of additional space complexity. Yes. But typically we have memory available, speed is more difficult to get. And the additional space is linear in the size of c, whereas the time improvement is huge. And the added memory is less than half the memory used by the SetUniqueList and is very short lived. So the added memory is not a problem. > The described problem also applies to almost all collection types Yes, but "others do it too" is not a reason for us to not improve if we can. Overall, it is your decision. I just feel that if we can help the developers, we should do it. And if others don't do it, that's their problem, not ours. Best, Adrian
        Hide
        sebb@apache.org Sebb added a comment -

        The problem with unconditionally using extra memory in the hope that it will improve performance is that users who don't want or need it have no choice.
        Commons code may be used in devices with limited memory; we should not assume that memory is not an issue.

        Whereas if such copying is not done automatically, at least users who want to trade memory for speed can do so.

        Show
        sebb@apache.org Sebb added a comment - The problem with unconditionally using extra memory in the hope that it will improve performance is that users who don't want or need it have no choice. Commons code may be used in devices with limited memory; we should not assume that memory is not an issue. Whereas if such copying is not done automatically, at least users who want to trade memory for speed can do so.
        Hide
        adriannistor Adrian Nistor added a comment -

        Hi Sebb,

        > at least users who want to trade memory for speed can do so.

        Ok. Let's at least write this down in the method Javadoc, like in
        COLLECTIONS-415 and COLLECTIONS-417?

        Best,

        Adrian

        Show
        adriannistor Adrian Nistor added a comment - Hi Sebb, > at least users who want to trade memory for speed can do so. Ok. Let's at least write this down in the method Javadoc, like in COLLECTIONS-415 and COLLECTIONS-417 ? Best, Adrian
        Hide
        tn Thomas Neidhart added a comment -

        Again, users need to know what they are doing and be aware of the performance constraints of the collection types they are using.
        I agree that the implemented algorithm of retainAll is not explained in detail (the original Collection#retainAll method has just a reference to contains(), but does not describe the performance implications), and we will add a note as in other cases, but this is not a problem limited to commons-collections but to the collection framework of java in general.

        That's why I suggested to move the discussion to openjdk. In fact, if your proposed fix would be implemented in AbstractCollection the problem would go away for almost all collection types and users would be safe (or is that so?).

        On a personal note: I strongly doubt that the described problem is a "real" one. A quick search on stackoverflow with "retainAll slow" lists two posts with both suggesting to call it with a set.

        Show
        tn Thomas Neidhart added a comment - Again, users need to know what they are doing and be aware of the performance constraints of the collection types they are using. I agree that the implemented algorithm of retainAll is not explained in detail (the original Collection#retainAll method has just a reference to contains(), but does not describe the performance implications), and we will add a note as in other cases, but this is not a problem limited to commons-collections but to the collection framework of java in general. That's why I suggested to move the discussion to openjdk. In fact, if your proposed fix would be implemented in AbstractCollection the problem would go away for almost all collection types and users would be safe (or is that so?). On a personal note: I strongly doubt that the described problem is a "real" one. A quick search on stackoverflow with "retainAll slow" lists two posts with both suggesting to call it with a set.
        Hide
        adriannistor Adrian Nistor added a comment -

        Hi Thomas,

        > users need to know what they are doing and be aware of the
        > performance constraints

        true, and how often does that happen?

        > this is not a problem limited to commons-collections

        Ok, I will try this "others do it too" argument next time when I get a
        speeding ticket. I will let you know how that works out

        > devices with limited memory

        Is Apache Collections optimized for such devices? How many
        optimizations for such devices do we have? 1, 2, 5? It seems quite a
        unique instance here that we are all of the sudden worrying about some
        memory (less than half of this collection, and short lived).

        Overall, I still think that if we can optimize something without
        disadvantages (I would not count this memory thing, except the case we
        are optimizing memory--small quantities of memory--in more than 5
        places in the entire Apache Collections), there is no reason to not do
        it. But all the pros and cons are in this thread, so in the end do
        what you think best for this library.

        Best,

        Adrian

        Show
        adriannistor Adrian Nistor added a comment - Hi Thomas, > users need to know what they are doing and be aware of the > performance constraints true, and how often does that happen? > this is not a problem limited to commons-collections Ok, I will try this "others do it too" argument next time when I get a speeding ticket. I will let you know how that works out > devices with limited memory Is Apache Collections optimized for such devices? How many optimizations for such devices do we have? 1, 2, 5? It seems quite a unique instance here that we are all of the sudden worrying about some memory (less than half of this collection, and short lived). Overall, I still think that if we can optimize something without disadvantages (I would not count this memory thing, except the case we are optimizing memory-- small quantities of memory --in more than 5 places in the entire Apache Collections), there is no reason to not do it. But all the pros and cons are in this thread, so in the end do what you think best for this library. Best, Adrian
        Hide
        tn Thomas Neidhart added a comment -

        > users need to know what they are doing and be aware of the
        > performance constraints

        true, and how often does that happen?

        I do not think it should be the goal of a general-purpose library to pre-optimize every possible use-case.

        > this is not a problem limited to commons-collections

        Ok, I will try this "others do it too" argument next time when I get a
        speeding ticket. I will let you know how that works out

        I did not say that. I said that the retainAll() method suffers from the "problem" in general.
        Changing it in collections, especially for some rarely used classes does not safe users from the performance problems you are talking about. I am pretty sure that ArrayList.retainAll is much, much more often used than SetUniqueList.retainAll. So why changing it here and not for ArrayList?

        You are looking at the problem from a purely theoretical POV. From an engineering POV it is much more important that users get what they expect. And the retainAll method as implemented is well-known in the java community.

        And again, a user can get the expected performance by putting the argument in a set himself. So where is the problem (apart from documenting it properly)?

        Show
        tn Thomas Neidhart added a comment - > users need to know what they are doing and be aware of the > performance constraints true, and how often does that happen? I do not think it should be the goal of a general-purpose library to pre-optimize every possible use-case. > this is not a problem limited to commons-collections Ok, I will try this "others do it too" argument next time when I get a speeding ticket. I will let you know how that works out I did not say that. I said that the retainAll() method suffers from the "problem" in general. Changing it in collections, especially for some rarely used classes does not safe users from the performance problems you are talking about. I am pretty sure that ArrayList.retainAll is much, much more often used than SetUniqueList.retainAll. So why changing it here and not for ArrayList? You are looking at the problem from a purely theoretical POV. From an engineering POV it is much more important that users get what they expect. And the retainAll method as implemented is well-known in the java community. And again, a user can get the expected performance by putting the argument in a set himself. So where is the problem (apart from documenting it properly)?
        Hide
        tn Thomas Neidhart added a comment -

        There is a similar ticket for openjdk from 2004, with a comment from Josh Bloch: https://bugs.openjdk.java.net/browse/JDK-5028425?page=com.atlassian.streams.streams-jira-plugin:activity-stream-issue-tab

        Show
        tn Thomas Neidhart added a comment - There is a similar ticket for openjdk from 2004, with a comment from Josh Bloch: https://bugs.openjdk.java.net/browse/JDK-5028425?page=com.atlassian.streams.streams-jira-plugin:activity-stream-issue-tab
        Hide
        tn Thomas Neidhart added a comment -

        Reverted performance improvement, added clarifying javadoc wrt runtime complexity and did a slight improvement over the original code in r1655062.

        Show
        tn Thomas Neidhart added a comment - Reverted performance improvement, added clarifying javadoc wrt runtime complexity and did a slight improvement over the original code in r1655062.

          People

          • Assignee:
            Unassigned
            Reporter:
            mertguldur Mert Guldur
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development