Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Robert pointed me to http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/ yesterday which explains how most implementations of TimSort are broken. We should check our TimSorter.

        Activity

        Hide
        Adrien Grand added a comment -

        Given the description of the problem, I think we suffer from it too. However when I tried the test code with Lucene's ArrayUtil.timSort instead of Java's Arrays.sort I could not reproduce the problem. I suspect this is due to some differences in implementation details... I'll try to dig more.

        Show
        Adrien Grand added a comment - Given the description of the problem, I think we suffer from it too. However when I tried the test code with Lucene's ArrayUtil.timSort instead of Java's Arrays.sort I could not reproduce the problem. I suspect this is due to some differences in implementation details... I'll try to dig more.
        Hide
        Adrien Grand added a comment -

        OK I found the reason: Our TimSorter picks minRunLength based on the description at http://svn.python.org/projects/python/trunk/Objects/listsort.txt (look for
        "Computing minrun") which gives a value between 32 and 65 while Java picks a value which is always less than or equal to 32. If I hack the test code to use the same formula as our TimSorter, I can reproduce the bug.

        Show
        Adrien Grand added a comment - OK I found the reason: Our TimSorter picks minRunLength based on the description at http://svn.python.org/projects/python/trunk/Objects/listsort.txt (look for "Computing minrun") which gives a value between 32 and 65 while Java picks a value which is always less than or equal to 32. If I hack the test code to use the same formula as our TimSorter, I can reproduce the bug.
        Hide
        Uwe Schindler added a comment -

        Once we fixed the bug, should we put Collections.sort/Arrays.sort on the forbidden list? We currently sometimes also use the Java version - especially the one from Collections, although this one cannot do in-place sorting and clones the whole List into an array for sorting.

        Show
        Uwe Schindler added a comment - Once we fixed the bug, should we put Collections.sort/Arrays.sort on the forbidden list? We currently sometimes also use the Java version - especially the one from Collections, although this one cannot do in-place sorting and clones the whole List into an array for sorting.
        Hide
        Adrien Grand added a comment -

        I am not sure this is necessary. This bug cannot corrupt data, it can just raise an ArrayIndexOutOfBoundsException if you happen to have a large array which has a very specific pattern. So it is rather unlikely to happen in practice, and even if it happens it would not corrupt data and would be easy to identify as "TimSort" would be in the stack trace?

        Show
        Adrien Grand added a comment - I am not sure this is necessary. This bug cannot corrupt data, it can just raise an ArrayIndexOutOfBoundsException if you happen to have a large array which has a very specific pattern. So it is rather unlikely to happen in practice, and even if it happens it would not corrupt data and would be easy to identify as "TimSort" would be in the stack trace?
        Hide
        Adrien Grand added a comment -

        There are two suggested ways to fix the issue, either change the max stack size or change the condition upon which consecutive runs are collapsed. I opted for the first one which is easier to implement.

        Since the code that demonstrates the bug is under ASL 2.0, I reused it for the test case.

        Show
        Adrien Grand added a comment - There are two suggested ways to fix the issue, either change the max stack size or change the condition upon which consecutive runs are collapsed. I opted for the first one which is easier to implement. Since the code that demonstrates the bug is under ASL 2.0, I reused it for the test case.
        Hide
        Michael McCandless added a comment -

        The fix looks low-risk here? And given how we rely on this sorting impl in Lucene, I think we should fix this in 4.10.4?

        Show
        Michael McCandless added a comment - The fix looks low-risk here? And given how we rely on this sorting impl in Lucene, I think we should fix this in 4.10.4?
        Hide
        Adrien Grand added a comment -

        The fix is low-risk but I think the bug is low-risk too: you need very special patterns in your data and it cannot trigger corruptions (it can make sorting fail with an ArrayIndexOutOfBoundsException but if you don't hit it then the output is sorted). So I would rather like to wait a bit before backporting.

        Show
        Adrien Grand added a comment - The fix is low-risk but I think the bug is low-risk too: you need very special patterns in your data and it cannot trigger corruptions (it can make sorting fail with an ArrayIndexOutOfBoundsException but if you don't hit it then the output is sorted). So I would rather like to wait a bit before backporting.
        Hide
        Michael McCandless added a comment -

        OK that makes sense Adrien Grand let's not backport.

        Show
        Michael McCandless added a comment - OK that makes sense Adrien Grand let's not backport.
        Hide
        Hoss Man added a comment -

        There are two suggested ways to fix the issue, either change the max stack size or change the condition upon which consecutive runs are collapsed. I opted for the first one which is easier to implement.

        Doesn't that cause excessive/unneccessary array allocation? .... it sounds like the same fix the JVM team applied that the original bug reporter lamented was inefficient...

        The reaction of the Java developer community to our report is somewhat disappointing: instead of using our fixed (and verified!) version of mergeCollapse(), they opted to increase the allocated runLen “sufficiently”. As we showed, this is not necessary. In consequence, whoever uses java.utils.Collection.sort() is forced to over allocate space. Given the astronomical number of program runs that such a central routine is used in, this leads to a considerable waste of energy.

        Show
        Hoss Man added a comment - There are two suggested ways to fix the issue, either change the max stack size or change the condition upon which consecutive runs are collapsed. I opted for the first one which is easier to implement. Doesn't that cause excessive/unneccessary array allocation? .... it sounds like the same fix the JVM team applied that the original bug reporter lamented was inefficient... The reaction of the Java developer community to our report is somewhat disappointing: instead of using our fixed (and verified!) version of mergeCollapse(), they opted to increase the allocated runLen “sufficiently”. As we showed, this is not necessary. In consequence, whoever uses java.utils.Collection.sort() is forced to over allocate space. Given the astronomical number of program runs that such a central routine is used in, this leads to a considerable waste of energy.
        Hide
        Robert Muir added a comment -

        Lets fix the bug first and optimize later.

        From a software engineering perspective, this is the safest fix.

        Please, no high-risk fixes for low-risk bugs.

        Show
        Robert Muir added a comment - Lets fix the bug first and optimize later. From a software engineering perspective, this is the safest fix. Please, no high-risk fixes for low-risk bugs.
        Hide
        Adrien Grand added a comment -

        For the record, the over-allocation which is mentionned here translates in our case to allocating an int[] array of length 50 instead of 41 per TimSort run.

        Show
        Adrien Grand added a comment - For the record, the over-allocation which is mentionned here translates in our case to allocating an int[] array of length 50 instead of 41 per TimSort run.
        Hide
        ASF subversion and git services added a comment -

        Commit 1662410 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1662410 ]

        LUCENE-6293: Fixed TimSorter bug.

        Show
        ASF subversion and git services added a comment - Commit 1662410 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1662410 ] LUCENE-6293 : Fixed TimSorter bug.
        Hide
        ASF subversion and git services added a comment -

        Commit 1662411 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1662411 ]

        LUCENE-6293: Fixed TimSorter bug.

        Show
        ASF subversion and git services added a comment - Commit 1662411 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1662411 ] LUCENE-6293 : Fixed TimSorter bug.
        Hide
        Timothy Potter added a comment -

        Bulk close after 5.1 release

        Show
        Timothy Potter added a comment - Bulk close after 5.1 release

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development