Lucene - Core
  1. Lucene - Core
  2. LUCENE-2953

PriorityQueue is inheriently broken if subclass attempts to use "heap" w/generic T bound to anything other then "Object"

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      as discovered in SOLR-2410 the fact that the protected "heap" variable in PriorityQueue is initialized using an Object[] makes it impossible for subclasses of PriorityQueue to exist and access the "heap" array unless they bind the generic to Object.

      1. BenchmarkArrayAccess.java
        3 kB
        Dawid Weiss
      2. LUCENE-2953.patch
        18 kB
        Uwe Schindler
      3. LUCENE-2953.patch
        1 kB
        Hoss Man

        Issue Links

          Activity

          Robert Muir made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hide
          Robert Muir added a comment -

          Bulk closing for 3.2

          Show
          Robert Muir added a comment - Bulk closing for 3.2
          Uwe Schindler made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Lucene Fields [New] [New, Patch Available]
          Resolution Fixed [ 1 ]
          Hide
          Uwe Schindler added a comment -

          Committed 3.x revision: 1079711 (only private heap and getter, initialize unchanged for backwards compatibility).

          If we want to backport also the initialize() changes to 3.x (as PQ is a @lucene.internal class), we should reopen this issue.

          Show
          Uwe Schindler added a comment - Committed 3.x revision: 1079711 (only private heap and getter, initialize unchanged for backwards compatibility). If we want to backport also the initialize() changes to 3.x (as PQ is a @lucene.internal class), we should reopen this issue.
          Hide
          Uwe Schindler added a comment -

          Committed trunk revision: 1079707

          I will backport the getHeapArray() method to 3.2, but for now this is not a problem, as Solr's class is not yet generified in 3.x.

          Show
          Uwe Schindler added a comment - Committed trunk revision: 1079707 I will backport the getHeapArray() method to 3.2, but for now this is not a problem, as Solr's class is not yet generified in 3.x.
          uschindler committed 1079707 (22 files)
          Reviews: none

          LUCENE-2953: PriorityQueue's internal heap was made private final

          Lucene trunk
          Uwe Schindler made changes -
          Assignee Uwe Schindler [ thetaphi ]
          Fix Version/s 3.2 [ 12316070 ]
          Fix Version/s 4.0 [ 12314025 ]
          Hide
          Uwe Schindler added a comment - - edited

          I remember using pq at some point for other things and hating that initialize method, so I'm all for it.

          The inititalize method was only there for one single reason in Lucene (a hack): The getSentinelObject() method was used in HitQueue like that: it should return null for some special corner case. To enable this special case, a boolean field was used. But the ctor had to populate that field before the prepoulating in super() was done, and thats impossible. I changed that by adding a boolean ctor to the PQ base class to enable/disable pre-populating like HitQueue did before.

          Show
          Uwe Schindler added a comment - - edited I remember using pq at some point for other things and hating that initialize method, so I'm all for it. The inititalize method was only there for one single reason in Lucene (a hack): The getSentinelObject() method was used in HitQueue like that: it should return null for some special corner case. To enable this special case, a boolean field was used. But the ctor had to populate that field before the prepoulating in super() was done, and thats impossible. I changed that by adding a boolean ctor to the PQ base class to enable/disable pre-populating like HitQueue did before.
          Uwe Schindler made changes -
          Attachment LUCENE-2953.patch [ 12472927 ]
          Hide
          Uwe Schindler added a comment -

          Here my patch, that also removes initialize(int), which is bad design.

          For 3.x we can simply leave out this change and only make the heap variable private and expose as Object[] using a getter method.

          Show
          Uwe Schindler added a comment - Here my patch, that also removes initialize(int), which is bad design. For 3.x we can simply leave out this change and only make the heap variable private and expose as Object[] using a getter method.
          Hide
          Dawid Weiss added a comment -

          I remember using pq at some point for other things and hating that initialize method, so I'm all for it.

          Show
          Dawid Weiss added a comment - I remember using pq at some point for other things and hating that initialize method, so I'm all for it.
          Hide
          Uwe Schindler added a comment -

          Or a protected getter method that would do the cast (why bother with having two fields)

          Good idea, I am currently fixing the whole stuff (I was the one who added the generics in Lucene 3.0). But I am now also removing initialize(int), this construct is very broken. In trunk we can break backwards for this.

          Show
          Uwe Schindler added a comment - Or a protected getter method that would do the cast (why bother with having two fields) Good idea, I am currently fixing the whole stuff (I was the one who added the generics in Lucene 3.0). But I am now also removing initialize(int), this construct is very broken. In trunk we can break backwards for this.
          Hide
          Dawid Weiss added a comment -

          > The problem with then always casting from Object to T is thousands of unchecked warnings in PriorityQueue.

          You could erase the type in internal methods of PriorityQueue and use Object instead of T then.

          > So my proposal is to internally use the T[] as a private field and simply use another Object[] thats protected (pointing to the same array).

          Or a protected getter method that would do the cast (why bother with having two fields):

          protected Object[] getStorageArray() { return (Object[]) heap; }
          

          If Yonik wants access to that array I'm sure he copies it to a local var. prior to doing any intensive loops...

          Show
          Dawid Weiss added a comment - > The problem with then always casting from Object to T is thousands of unchecked warnings in PriorityQueue. You could erase the type in internal methods of PriorityQueue and use Object instead of T then. > So my proposal is to internally use the T[] as a private field and simply use another Object[] thats protected (pointing to the same array). Or a protected getter method that would do the cast (why bother with having two fields): protected Object[] getStorageArray() { return (Object[]) heap; } If Yonik wants access to that array I'm sure he copies it to a local var. prior to doing any intensive loops...
          Hide
          Uwe Schindler added a comment -

          The easy and most simply way to handle this is osing Object[] like in ArrayList.

          The problem with then always casting from Object to T is thousands of unchecked warnings in PriorityQueue. I would propose the following:
          In general the final T[] heap variable should be private to the PQ and used only there. For performance yonik wanted the heap[] protected and that caused the issue. As long as the heap[] array is private it can never be accessed incorrectly.

          So my proposal is to internally use the T[] as a private field and simply use another Object[] thats protected (pointing to the same array). This would fix the problem. The most correct idea would be to add a setHeapSlot(int, T o) and T getHeapSlot(int) method and hiding the T[] heap completely, but I know, Yonik will disagree

          There is some other problem: the heap array should be final, but it cannot, because of the stupid initialize() method. I would like to remove this method and simply move the code to PQ's ctor. I don't understand why the initialize() method is there, which is a problem: Every guide on Java programming tells you to never call protected overrideable methods from ctors, as this can break easily. If the heap[] is final, the problem of having two references to the same object is not a problem anymore.

          Show
          Uwe Schindler added a comment - The easy and most simply way to handle this is osing Object[] like in ArrayList. The problem with then always casting from Object to T is thousands of unchecked warnings in PriorityQueue. I would propose the following: In general the final T[] heap variable should be private to the PQ and used only there. For performance yonik wanted the heap[] protected and that caused the issue. As long as the heap[] array is private it can never be accessed incorrectly. So my proposal is to internally use the T[] as a private field and simply use another Object[] thats protected (pointing to the same array). This would fix the problem. The most correct idea would be to add a setHeapSlot(int, T o) and T getHeapSlot(int) method and hiding the T[] heap completely, but I know, Yonik will disagree There is some other problem: the heap array should be final, but it cannot, because of the stupid initialize() method. I would like to remove this method and simply move the code to PQ's ctor. I don't understand why the initialize() method is there, which is a problem: Every guide on Java programming tells you to never call protected overrideable methods from ctors, as this can break easily. If the heap[] is final, the problem of having two references to the same object is not a problem anymore.
          Dawid Weiss made changes -
          Attachment BenchmarkArrayAccess.java [ 12472925 ]
          Hide
          Dawid Weiss added a comment -

          A Google Caliper benchmark comparing iteration over a class with a generic T[] (real T[] type), its concrete-type subclass and a class using Object[] and (T) casts for accessing array elements.

          Show
          Dawid Weiss added a comment - A Google Caliper benchmark comparing iteration over a class with a generic T[] (real T[] type), its concrete-type subclass and a class using Object[] and (T) casts for accessing array elements.
          Hide
          Dawid Weiss added a comment -

          There seems to be no consensus on how to deal with generic arrays. Even the JDK has two different implementations – one in ArrayDeque (uses T[]), the other in ArrayList (uses Object[]). Creating an array of a given component type is (can be?) more costly than keeping an array Object[] because it needs to be done via call to Array.newArray (haven't checked though). Theoretically having a concrete-type array should speed up iterators (because no additional casts are needed), but I don't think this is the case.

          In fact, I just wrote a simple Caliper benchmark that compares these (attached), my results show the runtime times is nearly identical (probably within stddev).:

           0% Scenario{vm=java, trial=0, benchmark=Generic, size=1000000} 8985430.93 ns; σ=257329.28 ns @ 10 trials
          33% Scenario{vm=java, trial=0, benchmark=GenericSubclass, size=1000000} 8989486.27 ns; σ=207151.20 ns @ 10 trials
          67% Scenario{vm=java, trial=0, benchmark=Object, size=1000000} 8767324.34 ns; σ=218235.97 ns @ 10 trials
          
                benchmark   ms linear runtime
                  Generic 8.99 =============================
          GenericSubclass 8.99 ==============================
                   Object 8.77 =============================
          
          vm: java
          trial: 0
          size: 1000000
          
          Show
          Dawid Weiss added a comment - There seems to be no consensus on how to deal with generic arrays. Even the JDK has two different implementations – one in ArrayDeque (uses T[]), the other in ArrayList (uses Object[]). Creating an array of a given component type is (can be?) more costly than keeping an array Object[] because it needs to be done via call to Array.newArray (haven't checked though). Theoretically having a concrete-type array should speed up iterators (because no additional casts are needed), but I don't think this is the case. In fact, I just wrote a simple Caliper benchmark that compares these (attached), my results show the runtime times is nearly identical (probably within stddev).: 0% Scenario{vm=java, trial=0, benchmark=Generic, size=1000000} 8985430.93 ns; σ=257329.28 ns @ 10 trials 33% Scenario{vm=java, trial=0, benchmark=GenericSubclass, size=1000000} 8989486.27 ns; σ=207151.20 ns @ 10 trials 67% Scenario{vm=java, trial=0, benchmark=Object, size=1000000} 8767324.34 ns; σ=218235.97 ns @ 10 trials benchmark ms linear runtime Generic 8.99 ============================= GenericSubclass 8.99 ============================== Object 8.77 ============================= vm: java trial: 0 size: 1000000
          Hoss Man made changes -
          Link This issue breaks SOLR-2410 [ SOLR-2410 ]
          Hoss Man made changes -
          Field Original Value New Value
          Attachment LUCENE-2953.patch [ 12472894 ]
          Hide
          Hoss Man added a comment -

          patch to TestPriorityQueue demonstrating bug

          Show
          Hoss Man added a comment - patch to TestPriorityQueue demonstrating bug
          Hoss Man created issue -

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Hoss Man
            • Votes:
              1 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development