Uploaded image for project: 'Commons Lang'
  1. Commons Lang
  2. LANG-839

ArrayUtils removeElements methods use unnecessary HashSet

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 3.2
    • Component/s: lang.*
    • Labels:
      None

      Description

      The removeElements() methods use a HashSet to collect the indexes that need removing.

      This requires creating Integer objects for each index, and the HashSet then has to be converted into an int[] array.

      It would be more efficient to store the entries in an actual int[] array.

      The maximum size of this is the length of the values array (or the length of the input array if that is shorter).

      The array must be truncated before calling the private removeAll() method; this can be done with Arrays.copyOf(x[], length).

      However, if the arrays are very large, and most of the values do not appear in the input, this might result in using more memory than the HashSet implementation.

        Activity

        Hide
        sebb@apache.org Sebb added a comment -

        Note that the current implementation creates a HashMap of size values.length to hold the numbers of each value found, so a large values array will cause problems anyway.

        Show
        sebb@apache.org Sebb added a comment - Note that the current implementation creates a HashMap of size values.length to hold the numbers of each value found, so a large values array will cause problems anyway.
        Hide
        sebb@apache.org Sebb added a comment -

        An alternative implementation might be to use BitSet, as indexes are never negative.

        This requires much reduced storage as the 32-bit ints are stored as 1 bit each.

        If this was also used in the private removeAll() method, it would remove the need to check for duplicates, and there would be no need to sort the int array.

        BitSet offers nextSetBit and nextClearBit methods but does not offer the reverse, so either the copy algorithm would need to be reversed, or the indexes could be subtracted from the array length before storage.

        Conversion to using BitSet would also avoid the need to clone external arrays as they would no longer need sorting.

        Show
        sebb@apache.org Sebb added a comment - An alternative implementation might be to use BitSet, as indexes are never negative. This requires much reduced storage as the 32-bit ints are stored as 1 bit each. If this was also used in the private removeAll() method, it would remove the need to check for duplicates, and there would be no need to sort the int array. BitSet offers nextSetBit and nextClearBit methods but does not offer the reverse, so either the copy algorithm would need to be reversed, or the indexes could be subtracted from the array length before storage. Conversion to using BitSet would also avoid the need to clone external arrays as they would no longer need sorting.
        Hide
        sebb@apache.org Sebb added a comment -

        Implementation of removeAll private method using BitSet, plus unit test, plus sample fix to show its use in removeElements(boolean[] array, boolean... values)

        Show
        sebb@apache.org Sebb added a comment - Implementation of removeAll private method using BitSet, plus unit test, plus sample fix to show its use in removeElements(boolean[] array, boolean... values)
        Hide
        sebb@apache.org Sebb added a comment -

        Using BitSet will clearly use less memory than HashSet or even an int[] array, so would also reduce memory for the methods that currently clone the index array.

        It remains to be seen whether the overhead of creating and using the BitSet is more or less than cloning and sorting the array.

        Show
        sebb@apache.org Sebb added a comment - Using BitSet will clearly use less memory than HashSet or even an int[] array, so would also reduce memory for the methods that currently clone the index array. It remains to be seen whether the overhead of creating and using the BitSet is more or less than cloning and sorting the array.
        Hide
        sebb@apache.org Sebb added a comment -

        URL: http://svn.apache.org/viewvc?rev=1395834&view=rev
        Log:
        LANG-839 ArrayUtils removeElements methods use unnecessary HashSet
        Add benchmark test to show whether Bitset is faster

        Added:
        commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/HashSetvBitSetTest.java (with props)

        Show
        sebb@apache.org Sebb added a comment - URL: http://svn.apache.org/viewvc?rev=1395834&view=rev Log: LANG-839 ArrayUtils removeElements methods use unnecessary HashSet Add benchmark test to show whether Bitset is faster Added: commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/HashSetvBitSetTest.java (with props)
        Hide
        sebb@apache.org Sebb added a comment -

        BitSet is considerably faster on Win XP:

        Ratio=38% count=0 hash=610134 bits=234388
        Ratio=46% count=5 hash=1809448 bits=837536
        Ratio=54% count=10 hash=2840584 bits=1536229
        Ratio=38% count=200 hash=72772936 bits=28216994
        Ratio=37% count=50 hash=17909539 bits=6729347
        Ratio=39% count=100 hash=35617096 bits=13972166
        Ratio=40% count=1000 hash=339097567 bits=138882176
        Ratio=42% count=2000 hash=650113632 bits=278152949

        and Continuum:

        Ratio=10% count=0 hash=1164164 bits=126956
        Ratio=15% count=5 hash=1433866 bits=228518
        Ratio=18% count=10 hash=1911315 bits=355922
        Ratio=17% count=200 hash=31370106 bits=5439748
        Ratio=18% count=50 hash=6947508 bits=1271146
        Ratio=18% count=100 hash=13671526 bits=2555063
        Ratio=15% count=1000 hash=154243712 bits=24577725
        Ratio=10% count=2000 hash=411835139 bits=43056221

        Show
        sebb@apache.org Sebb added a comment - BitSet is considerably faster on Win XP: Ratio=38% count=0 hash=610134 bits=234388 Ratio=46% count=5 hash=1809448 bits=837536 Ratio=54% count=10 hash=2840584 bits=1536229 Ratio=38% count=200 hash=72772936 bits=28216994 Ratio=37% count=50 hash=17909539 bits=6729347 Ratio=39% count=100 hash=35617096 bits=13972166 Ratio=40% count=1000 hash=339097567 bits=138882176 Ratio=42% count=2000 hash=650113632 bits=278152949 and Continuum: Ratio=10% count=0 hash=1164164 bits=126956 Ratio=15% count=5 hash=1433866 bits=228518 Ratio=18% count=10 hash=1911315 bits=355922 Ratio=17% count=200 hash=31370106 bits=5439748 Ratio=18% count=50 hash=6947508 bits=1271146 Ratio=18% count=100 hash=13671526 bits=2555063 Ratio=15% count=1000 hash=154243712 bits=24577725 Ratio=10% count=2000 hash=411835139 bits=43056221
        Hide
        sebb@apache.org Sebb added a comment -

        Applied slightly different fix which uses BitSet but converts to int[] for use with original removeAll() method.

        URL: http://svn.apache.org/viewvc?rev=1395837&view=rev
        Log:
        LANG-839 ArrayUtils removeElements methods use unnecessary HashSet
        Replace HashSet with BitSet (less memory; faster)

        Modified:
        commons/proper/lang/trunk/src/changes/changes.xml
        commons/proper/lang/trunk/src/main/java/org/apache/commons/lang3/ArrayUtils.java

        Show
        sebb@apache.org Sebb added a comment - Applied slightly different fix which uses BitSet but converts to int[] for use with original removeAll() method. URL: http://svn.apache.org/viewvc?rev=1395837&view=rev Log: LANG-839 ArrayUtils removeElements methods use unnecessary HashSet Replace HashSet with BitSet (less memory; faster) Modified: commons/proper/lang/trunk/src/changes/changes.xml commons/proper/lang/trunk/src/main/java/org/apache/commons/lang3/ArrayUtils.java
        Hide
        mbenson Matt Benson added a comment -

        Nice!

        Show
        mbenson Matt Benson added a comment - Nice!
        Hide
        sebb@apache.org Sebb added a comment -

        Thanks!

        Note: this started because of the boxing warnings (adding int to HashSet).
        I was looking for a way to eliminate them, and it occurred to me that the process seemed quite inefficient.

        Show
        sebb@apache.org Sebb added a comment - Thanks! Note: this started because of the boxing warnings (adding int to HashSet). I was looking for a way to eliminate them, and it occurred to me that the process seemed quite inefficient.
        Hide
        mbenson Matt Benson added a comment -

        I don't know why BitSet never occurred to me; I guess because I haven't used them much, if at all. It took me awhile to wrap my head around solving the problem at all (only creating one new array), so I'm perfectly happy someone else found a way to make it (much) more efficient!

        Show
        mbenson Matt Benson added a comment - I don't know why BitSet never occurred to me; I guess because I haven't used them much, if at all. It took me awhile to wrap my head around solving the problem at all (only creating one new array), so I'm perfectly happy someone else found a way to make it (much) more efficient!
        Hide
        sebb@apache.org Sebb added a comment -

        Further testing shows that the approach used in the original patch - i.e. using a version of removeAll() that processes the BitSet directly - is generally faster than using the original removeAll() method after converting the BitSet to int[]

        Win XP:

        Ratio=92% array=100 count=1 extract=10314998 bitset=9571607
        Ratio=68% array=100 count=10 extract=14912510 bitset=10283430
        Ratio=15% array=100 count=50 extract=50206660 bitset=8017779
        Ratio=8% array=100 count=100 extract=92868228 bitset=7494807
        Ratio=86% array=1000 count=10 extract=42377453 bitset=36508272
        Ratio=28% array=1000 count=100 extract=124472803 bitset=35606481
        Ratio=4% array=1000 count=500 extract=570030828 bitset=24349463
        Ratio=1% array=1000 count=1000 extract=1099601765 bitset=12346262

        Continuum:

        Ratio=76% array=100 count=1 extract=2948847 bitset=2257111
        Ratio=32% array=100 count=10 extract=4860676 bitset=1589708
        Ratio=6% array=100 count=50 extract=17143953 bitset=1160451
        Ratio=1% array=100 count=100 extract=29390021 bitset=449595
        Ratio=87% array=1000 count=10 extract=16487025 bitset=14461313
        Ratio=30% array=1000 count=100 extract=42920962 bitset=13228312
        Ratio=4% array=1000 count=500 extract=199373015 bitset=9112329
        Ratio=0% array=1000 count=1000 extract=387091985 bitset=1126133

        Show
        sebb@apache.org Sebb added a comment - Further testing shows that the approach used in the original patch - i.e. using a version of removeAll() that processes the BitSet directly - is generally faster than using the original removeAll() method after converting the BitSet to int[] Win XP: Ratio=92% array=100 count=1 extract=10314998 bitset=9571607 Ratio=68% array=100 count=10 extract=14912510 bitset=10283430 Ratio=15% array=100 count=50 extract=50206660 bitset=8017779 Ratio=8% array=100 count=100 extract=92868228 bitset=7494807 Ratio=86% array=1000 count=10 extract=42377453 bitset=36508272 Ratio=28% array=1000 count=100 extract=124472803 bitset=35606481 Ratio=4% array=1000 count=500 extract=570030828 bitset=24349463 Ratio=1% array=1000 count=1000 extract=1099601765 bitset=12346262 Continuum: Ratio=76% array=100 count=1 extract=2948847 bitset=2257111 Ratio=32% array=100 count=10 extract=4860676 bitset=1589708 Ratio=6% array=100 count=50 extract=17143953 bitset=1160451 Ratio=1% array=100 count=100 extract=29390021 bitset=449595 Ratio=87% array=1000 count=10 extract=16487025 bitset=14461313 Ratio=30% array=1000 count=100 extract=42920962 bitset=13228312 Ratio=4% array=1000 count=500 extract=199373015 bitset=9112329 Ratio=0% array=1000 count=1000 extract=387091985 bitset=1126133
        Hide
        sebb@apache.org Sebb added a comment -

        URL: http://svn.apache.org/viewvc?rev=1395943&view=rev
        Log:
        LANG-839 Add tests to compare extractIndices+removeAll with direct use of BitSet in new version of removeAll

        Modified:
        commons/proper/lang/trunk/src/main/java/org/apache/commons/lang3/ArrayUtils.java
        commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/HashSetvBitSetTest.java

        Show
        sebb@apache.org Sebb added a comment - URL: http://svn.apache.org/viewvc?rev=1395943&view=rev Log: LANG-839 Add tests to compare extractIndices+removeAll with direct use of BitSet in new version of removeAll Modified: commons/proper/lang/trunk/src/main/java/org/apache/commons/lang3/ArrayUtils.java commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/HashSetvBitSetTest.java
        Hide
        sebb@apache.org Sebb added a comment -

        URL: http://svn.apache.org/viewvc?rev=1396075&view=rev
        Log:
        LANG-839 ArrayUtils removeElements methods use unnecessary HashSet
        Eliminate conversion of BitSet to int[]

        Modified:
        commons/proper/lang/trunk/src/main/java/org/apache/commons/lang3/ArrayUtils.java
        commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/HashSetvBitSetTest.java

        Show
        sebb@apache.org Sebb added a comment - URL: http://svn.apache.org/viewvc?rev=1396075&view=rev Log: LANG-839 ArrayUtils removeElements methods use unnecessary HashSet Eliminate conversion of BitSet to int[] Modified: commons/proper/lang/trunk/src/main/java/org/apache/commons/lang3/ArrayUtils.java commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/HashSetvBitSetTest.java

          People

          • Assignee:
            Unassigned
            Reporter:
            sebb@apache.org Sebb
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development