Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
3.2.1
-
None
-
None
-
java 1.6.0_24
Ubuntu 11.10
Description
Hi,
ListUtils.subtract is very slow when subtracting two large lists. The
root cause of this problem is similar to the root cause of the
previously fixed COLLECTIONS-328 in ListUtils, i.e., quadratic
complexity instead of linear complexity.
I am encountering this problem in version 3.2.1 and also in revision
1342815 (May 25th 2012). I have attached a test that exposes this
problem and a simple patch. On my machine, for the attached test,
this patch provides a 95X speedup.
To run the test, just do:
$ java Test
Currently, the code for ListUtils.subtract is:
final ArrayList<E> result = new ArrayList<E>(list1);
for (E e : list2) {
result.remove(e);
}
return result;
which is quadratic, because result.remove(e) has linear complexity.
The attached patch makes the remove() be constant complexity by
removing from an org.apache.commons.collections.bag.HashBag. I use
HashBag and not HashSet because ListUtils.subtract needs to respect
the cardinality when there are repeated objects in the list.
As mentioned in COLLECTIONS-328 discussion, for small lists, there is
some overhead for creating the HashBag. This can be fixed with a
threshold, but I did not implement it in my patch because the
COLLECTIONS-328 patch does not implement it.
Unlike the patch for COLLECTIONS-328, my patch does not choose the
list to iterate over based on size, because of the cardinality
requirement in subtract. This means the code could be made even
faster if we could use something like a LinkedHashBag but neither
Apache Collections nor standard Java libraries have such a class.
Even so, this patch is still a lot faster than the original version.
Is this truly a bug, or am I missing something here? If so, can you
please confirm if the patch is correct?
Thanks,
Adrian