Issue Details (XML | Word | Printable)

Key: COLLECTIONS-302
Type: Improvement Improvement
Status: Closed Closed
Resolution: Won't Fix
Priority: Minor Minor
Assignee: Unassigned
Reporter: Joachim Rudolph
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Commons Collections

CollectionUtils.subtract() should not use ArrayList to improve speed

Created: 13/Aug/08 11:00 AM   Updated: 03/Nov/08 04:36 PM
Return to search
Component/s: Collection
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Original Estimate: 2h
Original Estimate - 2h
Remaining Estimate: 2h
Remaining Estimate - 2h
Time Spent: Not Specified
Remaining Estimate - 2h

Resolution Date: 03/Nov/08 04:36 PM


 Description  « Hide
The implementation of version 3.2.1 is
public static Collection subtract(final Collection a, final Collection b) {
ArrayList list = new ArrayList( a );
for (Iterator it = b.iterator(); it.hasNext() { list.remove(it.next()); }
return list;
}
when a and b are large and similar the subtract implementation will call ArrayList.remove() frequently which copies a potentially large part of the list using system.arraycopy.

Suggestion : use LinkedList ( at least for large lists )



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Henri Yandell added a comment - 24/Oct/08 03:58 AM
Ran a basic test and interestingly ArrayList came out better.
import java.util.*;
import org.apache.commons.collections.*;

public class Bob {

    public static void main(String[] args) throws Exception {
        test(Integer.parseInt(args[0]));
    }
    public static void test(int n) {
        Collection a = new ArrayList();
        for(int i=0; i<n; i++) {
            a.add("bob"+i);
        }
        Collection b = new ArrayList();
        for(int i=0; i<n; i+=2) {
            b.add("bob"+i);
        }
        long t1 = System.currentTimeMillis();
        CollectionUtils.subtract(a, b);
        long t2 = System.currentTimeMillis();
        System.err.println("T" + n + ": " + (t2-t1));
    }

}

For an input of 10,000, both were around 550 msec. For 100,000 the ArrayList was 58000, while the LinkedList was 84000. Hardly scientific as I'm not repeating the test in the same run so could be missing out on JIT improving a second run, not running multiple times etc. My suspicion is that the ArrayList constructor checks to see if things are ArrayLists and does quick arraycopies, while the LinkedList constructor just sits and plods along. I retested by changing the input to LinkedLists from ArrayLists and the time doubled up to 102000. Of course when I try LinkedList passing in to the LinkedList variant, it goes up to 125000. Ah well.

Point of all that - apart from implying that more testing is needed - is that the collection type used might want to depend on the type of the 'a' variable.


Joachim Rudolph added a comment - 24/Oct/08 11:48 AM
I did some tests myself, I just underestimated the speed of System.arraycopy()

In my problem the resulting collection would be almost empty.
For this case the following code models the algorithm :

Bar.java
private static final int REPS = 50;
    @SuppressWarnings("unchecked")
    public static void test(int n) throws InterruptedException {
        Collection a = new ArrayList(n);
        for(int i=0; i<n; i++) {
            a.add("bob"+i);
        }
        
        for (int r = 0; r < REPS; ++r) {
            ArrayList c = new ArrayList(a);
            long t1 = System.currentTimeMillis();
            while( c.size() > 0 ){
                c.remove(0);
            }
            long t2 = System.currentTimeMillis();
            System.err.println("T" + n + ": " + (t2-t1)+ " ms");
        }
    }

In this testcase for n = 200000, System.arraycopy() is called 200000 times with an average of 100000 references to be moved.
The code runs on my machine within 12.3 seconds/iteration which is about 6.5 GBytes/sec, much better than I expected.
So I will use Arraylists on many other problems where I was worried about the O(N^2) performance before.

I should have done my profiling before... sorry.


Henri Yandell added a comment - 03/Nov/08 04:36 PM
No worries - closing the issue as there's nothing major here to work on.

Henri Yandell made changes - 03/Nov/08 04:36 PM
Field Original Value New Value
Status Open [ 1 ] Closed [ 6 ]
Resolution Won't Fix [ 2 ]