Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      sorting can be much slower on trunk than branch_3x

      Attachments

      1. LUCENE-2504_SortMissingLast.patch
        11 kB
        Yonik Seeley
      2. LUCENE-2504.patch
        4 kB
        Yonik Seeley
      3. LUCENE-2504.patch
        17 kB
        Yonik Seeley
      4. LUCENE-2504.patch
        26 kB
        Michael McCandless
      5. LUCENE-2504.zip
        4 kB
        Michael McCandless

        Activity

        Robert Muir added a comment -

        rmuir20120906-bulk-40-change

        Hoss Man added a comment -

        Bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. All bulk-edited issues have hoss20120711-bulk-40-change in a comment.

        Simon Willnauer added a comment -

        Yonik, I see a bunch of commits on this issue; can we resolve this?

        Yonik Seeley added a comment -

        This was a simple attempt to simplify the comparators: static classes are used instead of inner classes. Unfortunately, it didn't keep the JVMs from getting stuck in badly optimized code (that was a long shot anyway), but it does result in a consistent 4% speedup.

        It looks as simple as the previous version to my eye, so I'll commit if there are no objections.
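
        As a rough sketch of the shape of that change (names here are hypothetical, not the actual patch): an inner class carries a hidden reference to its enclosing instance, so every access to the values array goes through an extra dereference, while a static nested class holds the array directly.

          class IntDocValuesSorter {
            int[] values;

            // Before: an inner class captures the enclosing instance (this$0),
            // so values[] is reached through an implicit extra indirection.
            class InnerComparator {
              int compare(int slot1, int slot2) {
                int a = values[slot1], b = values[slot2];
                return a < b ? -1 : (a == b ? 0 : 1);
              }
            }

            // After: a static nested class takes the array it needs as a field;
            // no hidden enclosing-instance reference, one less hop per compare.
            static class StaticComparator {
              private final int[] values;
              StaticComparator(int[] values) { this.values = values; }
              int compare(int slot1, int slot2) {
                int a = values[slot1], b = values[slot2];
                return a < b ? -1 : (a == b ? 0 : 1);
              }
            }
          }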

        Yonik Seeley added a comment -

        OK, I've committed the fix to always use the latest generation field comparator.
        Not sure if this is the best way to handle it, but at least it's correct now and we can improve it more later.

        Yonik Seeley added a comment - edited

        The open question is whether this hotspot fickleness is particular to Oracle's Java impl, or is somehow endemic to bytecode VMs (.NET included).

        I tried IBM's latest Java 6 (SR8 FP1, 20100624).
        It seems to have some of the same pitfalls as Oracle's JVM, just different ones.
        The first run does not differ from the second run in the same JVM as it does with Oracle, but the first run itself has much more variation. The worst case is worse, and just like the Oracle JVM, it gets stuck in its worst case.

        Each run (of the complete set of fields) was done in a separate JVM, since two runs in the same JVM didn't really differ as they did in the Oracle JVM.

        branch_3x:

        Median sort time of 100 sorts (ms); each column after the first is a separate run:

        unique terms in field   run 1   run 2   run 3   run 4   run 5   run 6   run 7
        100000                    129     128     130     109      98     128     135
        10000                     128     123     127     127      98     128     135
        1000                      129     130     130     128      98     130     136
        100                       128     133     133     130     100     132     139
        10                        150     153     153     154     122     153     159

        trunk:

        unique terms in field   run 1   run 2   run 3   run 4   run 5   run 6   run 7
        100000                    217      81     383      99      79      78     215
        10000                     254      73     346     101     106     108     267
        1000                      253      74     347      99     107     108     258
        100                       253     107     394      98     107     102     255
        10                        251     107     388      99     106      98     257

        The second way of testing is to completely mix fields (no serial correlation between what field is sorted on). This is the test that is very predictable with the Oracle JVM, but I still see wide variability with the IBM JVM. Here is the list of different runs for the IBM JVM (ms):

        branch_3x

        128 129 123 120 128 100 95 74 130 91 120

        trunk

        106 89 168 116 155 119 108 118 112 169 165

        To my eye, it looks like we have more variability in trunk, due to the increased use of abstractions?

        edit: corrected the table description - all times in this message are for the IBM JVM.

        Michael McCandless added a comment -

        Yes, but FieldValueHitQueue has its own list of comparators that never get updated.

        Ugh, yes.

        Michael McCandless added a comment -

        I think we all owe it to ourselves to stop equating Java with Oracle; if Java
        stays with Oracle it's pretty obvious the language will die anyway.

        Yeah I agree.

        The open question is whether this hotspot fickleness is particular to
        Oracle's Java impl, or is somehow endemic to bytecode VMs (.NET
        included). It's really a hard, complex problem (JIT compilation from
        bytecode based on runtime data), so it wouldn't surprise me if it's
        the latter, to varying degrees.

        .NET is not a choice but generating C/C++ code is?

        As far as I know it's much easier to invoke C/C++ from Java than .NET
        from Java. C/C++ is also more portable than .NET, I think? (There is
        Mono – how mature is it by now?)

        I don't think we should jump the gun and make real design/architectural
        choices based on Oracle bugs.

        I expect source-code specialization will also buy sizable perf gains
        irrespective of hotspot fickleness, and in non-Oracle Java impls.
        Generating a dedicated class, with one method doing all searching and
        collecting, removes all kinds of barriers to the JIT compiler. It
        makes its job far easier.

        I agree with Robert that we should stop comparing against Sun JVMs all the time and turn everything upside-down, specializing code here and there, or go one step further and generate C++ code. Dude, who is gonna maintain the compatibility to Java-only environments?

        If we manage to pursue specialized code gen, it'll be a loooong time
        coming! My point about C/C++ is that if we do somehow manage to get a
        working code-gen framework online (for Java), the added cost to make
        it also target C/C++ will be "relatively" small. I.e., it's nearly "for
        free".

        If we were to do this, that would not mean we'd abandon Java, of
        course – the framework would fully support "pure Java" as well.

        I think that code specializations of very "hot" parts of Lucene are OK and we should follow that way like we did in some places, but it already makes things very complicated to follow.

        You mean manual specialization, right (like this issue)?

        Yes, I think we will have to keep manually specializing, going
        forward, until we have a code generator that
        does it more cleanly...

        Would it make way more sense to push OSS JVMs than to spend lots of time investigating .NET as an alternative or a C/C++ code generator?

        I think we should do both.

        Before I would go the C++ path I'd rather use Java to host a C core like Lucy, which brings you as close as it gets to the machine.

        I think this (a Java wrapper for Lucy) is a great idea – we should explore that, too.

        interesting papers - seems we are touching the limits of Java though.

        Well, that's the big question – limits of Java, or limits of Sun/Oracle's impl.

        It looks like Harmony has a ways to go on absolute performance: I just
        ran a very quick benchmark (TermQuery search on 10 M multi-segment
        wiki index w/ a 50% random filter) and Oracle Java 1.6.0_21 gets 15.6
        QPS while Harmony 1.5.0-r946978 gets 9.5 QPS (Harmony 1.6.0-r946981
        also gets 9.5 QPS). I just ran java -server -Xms2g -Xmx2g; it's
        possible that by tuning Harmony (it has many awesome-looking command-line
        args!) it'd get faster...

        Yonik Seeley added a comment -

        Attaching a draft patch that seems to fix the issues (the ones I can find, at least).

        Hmm I don't see the problem - eg OneComparatorNonScoringCollector saves the returned comparator from comparator.setNextReader.

        Yes, but FieldValueHitQueue has its own list of comparators that never get updated.

        Michael McCandless added a comment -

        I'm now getting many UnsupportedOperationExceptions (i.e. the search process is using older comparators after calling setNextReader())

        That's no good!

        One culprit is OneComparatorNonScoringCollector, and another is OneComparatorFieldValueHitQueue I think.

        Hmm I don't see the problem – eg OneComparatorNonScoringCollector saves the returned comparator from comparator.setNextReader.

        Can you post the full exception?

        Yonik Seeley added a comment -

        Looks like we're not using the correct comparators everywhere.
        I was trying a slightly different way to implement sort-missing-last, and my first comparator only implements setNextReader(), but I'm now getting many UnsupportedOperationExceptions (i.e., the search process is using older comparators after calling setNextReader()).

        One culprit is OneComparatorNonScoringCollector, and another is OneComparatorFieldValueHitQueue I think.

        Simon Willnauer added a comment -

        I think we all owe it to ourselves to stop equating Java with Sun/Oracle; if Java stays with Oracle it's pretty obvious the language will die anyway.

        I agree with Robert that we should stop comparing against Sun JVMs all the time and turn everything upside-down, specializing code here and there, or go one step further and generate C++ code. Dude, who is gonna maintain the compatibility to Java-only environments? I could imagine that we have something which is super special-purpose, like Mike did with DirectNIOFSDirectory to work around unexposed methods like fadvise.

        I think that code specializations of very "hot" parts of Lucene are OK and we should follow that way like we did in some places, but it already makes things very complicated to follow. Without the knowledge of a committer or a person actively following that development, it is extremely difficult to comprehend design decisions.

        I would rather put effort into stuff like Harmony and make code we can control perform better than introduce a preprocessor which generates code for a JVM owned by a company. Would it make way more sense to push OSS JVMs than to spend lots of time investigating .NET as an alternative or a C/C++ code generator? Before I would go the C++ path I'd rather use Java to host a C core like Lucy, which brings you as close as it gets to the machine.

        EG, see my post here:

        interesting papers - seems we are touching the limits of Java though.

        Robert Muir added a comment -

        Java (Oracle) really needs to do something to address this.

        I think we all owe it to ourselves to stop equating Java with Sun/Oracle; if Java
        stays with Oracle it's pretty obvious the language will die anyway.

        I think this is a severe and growing problem for Lucene going forward

        • our search performance is crucial and we can't risk hotspot
          randomly, substantially slowing things down by a lot.

        While I agree at the moment we should make efforts to work around issues like this,
        I don't think we should jump the gun and make real design/architectural
        choices based on Oracle bugs.

        Especially for trunk, by the time we release Lucene 4.0 some other company
        will probably "own" Java anyway.

        Not that we have a choice here... but I've often wondered whether .NET
        has this same hotspot fickleness problem

        .NET is not a choice but generating C/C++ code is?

        Michael McCandless added a comment -

        The fickleness of the hotspot compiler is just awful and, frankly,
        unacceptable. Java (Oracle) really needs to do something to address
        this.

        EG, see my post here:

        http://forums.sun.com/thread.jspa?threadID=5420631

        In that standalone test, I can get drastically different search
        performance depending on what code runs first. Hotspot gets itself
        into a state where it's "stuck" and is not able to re-optimize for the
        code that's running. When I disassembled the methods hotspot had
        compiled, one thing I found was that readVInt (the hottest of hot in
        Lucene today) was compiled very differently depending on what code ran
        first!

        The changes we've had to make to Lucene/Solr in this issue to
        work around hotspot are horrible – we've introduced ugly code-dup
        specializations so that hotspot properly detects that given method
        calls are in fact just an array lookup. We've made similar
        specializations elsewhere in Lucene...

        Weirdly, I've found that running java with -Xbatch gives far more
        repeatable results. This is bizarre because that option forces
        compilation to run in the foreground; it's not supposed to alter which
        methods hotspot chooses to optimize, and how much (I think?). Though
        maybe because threads are paused awaiting compilation, it alters
        hotspot's targets? However, -Xbatch doesn't always give the fastest
        results.

        Not that we have a choice here... but I've often wondered whether .NET
        has this same hotspot fickleness problem.

        I think this is a severe and growing problem for Lucene going forward
        – our search performance is crucial and we can't risk hotspot
        randomly, substantially slowing things down by a lot. We're unable
        to do true performance tuning when hotspot "noise" easily dwarfs the
        effects we're trying to measure.

        I think the only viable option going forward is to create a search
        framework that's able to generate its own specialized Java code. We'd
        use this, statically, to generate pieces of the search execution
        path that we think are common enough to warrant up-front specialization,
        but also expose it dynamically so apps can "optimize" for their query
        paths, either statically (pre-built/compiled in their apps) or
        dynamically (like how a JSP rewrites to Java code and is then
        compiled). Of course we'd still retain the non-specialized code, as a
        fallback to handle those cases the specializer can't yet cover, or
        for apps where the net bytecode must be kept smallish.

        In theory such a search autogen framework could also generate into
        C/C++, enabling us to choose a good point to wrap the result with JNI
        (eg, TopDocsCollector.topDocs), which'd be wonderful as it'd fully
        sidestep the hotspot fickleness.

        Yonik Seeley added a comment -

        Yonik, just curious, how do you know what HotSpot is doing? Empirically based on performance numbers?

        Yeah - it's a best guess based on what I see when performance testing, and matching that up with what I've read in the past.
        As far as deoptimization is concerned, it's mentioned here: http://java.sun.com/products/hotspot/whitepaper.html, but I haven't read much elsewhere.

        Specific to this issue, the whole optimization/deoptimization issue is extremely complex.
        Recall that I reported this: "Median response time in my tests drops from 160 to 102 ms."

        For simplicity, there are some details I left out:
        those numbers were for randomly sorting on different fields (hopefully the most realistic scenario).
        If you test differently, the results are far different.

        The first and second test runs measured median time sorting on a single field 100 times in a row, then moving to the next field.

        Trunk before patch:

        unique terms in field   first run (ms)   second run (ms)
        100000                             105              168
        10000                              105              169
        1000                               106              164
        100                                127              163
        10                                 165              197

        Trunk after patch:

        unique terms in field   first run (ms)   second run (ms)
        100000                              85              130
        10000                               92              129
        1000                                92              126
        100                                116              127
        10                                 117              128

        branch_3x

        unique terms in field   first run (ms)   second run (ms)
        100000                             102              102
        10000                              102              103
        1000                               101              103
        100                                103              103
        10                                 118              118

        So, it seems that by running in batches (sorting on the same field over and over), we cause hotspot to overspecialize somehow, and then when we switch things up, the resulting deoptimization puts us in a permanently worse condition. branch_3x does not suffer from that, but trunk still does, due to the increased amount of indirection. I imagine the differences are also due to the boundaries at which the compiler tries to inline/specialize for a certain class.

        It certainly complicates performance testing, and we need to keep a sharp eye on how we actually test potential improvements.

        David Smiley added a comment -

        Yonik, just curious, how do you know what HotSpot is doing? Empirically based on performance numbers? HotSpot code or documentation that spells out exactly when it inlines? Or is there some tool or diagnostic capability to learn this information?

        Robert Muir added a comment -

        Cool, thanks (unfortunately I ran a bunch of collators and encountered what look like 0xff bytes).
        I think this will help.

        Yonik Seeley added a comment -

        OK, I've changed bigString to bigTerm and used 10 0xff bytes (to account for possible binary encoding of 8-byte numerics, plus other stuff like the tags that trie encoding uses).
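
        For illustration, a minimal sketch of such a sentinel (the class and constant names here are hypothetical, not the ones in the patch):

          import java.util.Arrays;

          class BigTermSketch {
            // Ten 0xff bytes: sorts after binary-encoded 8-byte numerics plus
            // any leading tag bytes that trie encoding adds.
            static final byte[] BIG_TERM_BYTES = new byte[10];
            static {
              Arrays.fill(BIG_TERM_BYTES, (byte) 0xff);
            }
          }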

        Robert Muir added a comment -

        Yep, should be fine. Is there an upper bound on how long collated terms can be?

        There isn't, but...

        I can't promise (but I'll verify); I think actually a single 0xff might do, for the major encodings:

        • it's invalid in UTF-8
        • it's technically valid, but unused (the reset byte), in BOCU-1
        • collation keys, I understand, are a modified BOCU; it's likely unused there too.

        So it's like a NaN sentinel: if someone is doing something very weird, maybe it won't work,
        but in general I think it will work. I'll check.
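
        As a quick sanity check of the UTF-8 point (a standalone snippet, not from the patch): even the largest code point, U+10FFFF, encodes to bytes no higher than 0xf4/0xbf, so a 0xff byte compares greater than any byte of well-formed UTF-8.

          import java.nio.charset.StandardCharsets;

          class Utf8SentinelCheck {
            public static void main(String[] args) {
              // U+10FFFF encodes as f4 8f bf bf in UTF-8; 0xff never occurs.
              byte[] utf8 = "\uDBFF\uDFFF".getBytes(StandardCharsets.UTF_8);
              for (byte b : utf8) {
                System.out.printf("%02x ", b & 0xff);   // prints: f4 8f bf bf
              }
            }
          }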

        Yonik Seeley added a comment -

        We can always safely use bytes of 0xff, I think?

        Yep, should be fine. Is there an upper bound on how long collated terms can be?

        Robert Muir added a comment -

        Maybe... if it is supposed to be just a string (I know that's the name, but maybe it should be called bigTerm, I guess). All of our terms are currently UTF-8 – but I don't know if that will last?

        Well, you are right; for example, collated terms for locale-sensitive sort will hopefully use the full byte range soon...

        We can always safely use bytes of 0xff, I think?

        Yonik Seeley added a comment -

        silly question, what does the bigString do?

        It's actually not currently used by Solr... but it's basically for use as a proxy for a null, if you want the Comparables returned by value() to match the sort order the Comparator actually used.

        (just wondering if it should be U+10FFFF,U+10FFFF,... now that we use UTF-8 order, depending on what it does)

        Maybe... if it is supposed to be just a string (I know that's the name, but maybe it should be called bigTerm, I guess). All of our terms are currently UTF-8 – but I don't know if that will last?

        Robert Muir added a comment -

        silly question, what does the bigString do?

        (just wondering if it should be U+10FFFF,U+10FFFF,... now that we use UTF-8 order, depending on what it does)

        Yonik Seeley added a comment -

        OK, here's a patch for Solr's sort-missing-last.
        Median response time in my tests drops from 160 to 102 ms.

        Yonik Seeley added a comment -

        Hmmm, turns out the sorting bug I just fixed (r996332) has been around a bit longer than I thought – since the field cache was converted to bytes on 6/3/2010 (LUCENE-2380).

        So anyone using trunk since then and sorting on string fields with null values may want to upgrade.

        Yonik Seeley added a comment -

        I'm still seeing bad degradations in Solr – I think it's because the default way for Solr to sort strings is with MissingLastOrdComparator, which isn't specialized. I'll try and work up a patch based on Mike's work.

        Yonik Seeley added a comment -

        This is all quite silly: we are only doing this to "game" hotspot into properly inlining/compiling what is in fact an array lookup, just currently hidden behind method calls in the packed ints impls. We really "shouldn't have to" do this custom source code specialization.

        Yeah, but this is the way hotspot currently works, and I don't know if there are any plans to change it.
        Hotspot can be pretty aggressive at inlining, but then it deoptimizes when it turns out that the inline is no longer valid (because of a different implementation).

        It's something worth keeping in mind for the rest of Lucene too – both in benchmarking and design. Multiple implementations used from a single spot will not be inlined (if multiple implementations are actually used in the same run).
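
        A self-contained toy (not Lucene code) of the effect being described here: while only one implementation of the accessor has been seen at the call site, hotspot can inline ord() down to a bare array read; once a second implementation is mixed in at the same call site, the call goes megamorphic and stays slower.

          interface OrdLookup { int ord(int doc); }

          final class DirectLookup implements OrdLookup {
            private final int[] ords;
            DirectLookup(int[] ords) { this.ords = ords; }
            public int ord(int doc) { return ords[doc]; }  // can inline to a bare array read
          }

          final class PagedLookup implements OrdLookup {
            private final int[][] pages;                   // two-level, PagedBytes-like
            PagedLookup(int[][] pages) { this.pages = pages; }
            public int ord(int doc) { return pages[doc >>> 12][doc & 0xfff]; }
          }

          class HotLoop {
            // Once both implementations have been observed at this call site,
            // it can no longer be inlined to a single concrete method, and
            // every iteration pays for the virtual dispatch.
            static long sum(OrdLookup lookup, int maxDoc) {
              long sum = 0;
              for (int doc = 0; doc < maxDoc; doc++) sum += lookup.ord(doc);
              return sum;
            }
          }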

        Michael McCandless added a comment -

        I just committed fixes for 2 places where the newly returned comparator was ignored.

        Urgh – thanks!

        Yonik Seeley added a comment -

        Great! Hopefully I'll get a chance to test it out shortly on my system too.
        I just committed fixes for 2 places where the newly returned comparator was ignored.

        Michael McCandless added a comment -

        OK I implemented Yonik's suggestion here: the comparator may now
        return a new segment-specific FieldComparator on each call to
        .setNextReader. I fixed all FieldComparators to simply "return this",
        except for the TermOrdValComparator which returns a comparator
        specialized to the bit-width of the packed ints doc->ord mapping for
        the fixed-array (8, 16, 32) cases.
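
        A simplified sketch of the pattern (all names hypothetical, not the actual Lucene classes): the contract is that callers must switch to whatever comparator setNextReader returns, which may be a new width-specialized instance whose compare() is a bare array read.

          interface DocToOrd { int get(int doc); }            // packed-ints style accessor

          abstract class OrdComparator {
            // May return a different, specialized instance for the new segment.
            abstract OrdComparator setNextReader(Object docToOrd);
            abstract int compare(int doc1, int doc2);

            static OrdComparator specialize(Object docToOrd) {
              return docToOrd instanceof byte[]
                  ? new ByteOrdComparator((byte[]) docToOrd)      // 8-bit fixed array
                  : new PackedOrdComparator((DocToOrd) docToOrd); // generic fallback
            }
          }

          final class ByteOrdComparator extends OrdComparator {
            private final byte[] ords;
            ByteOrdComparator(byte[] ords) { this.ords = ords; }
            OrdComparator setNextReader(Object next) { return specialize(next); }
            int compare(int doc1, int doc2) {
              // a plain array lookup -- exactly what we want hotspot to inline
              return (ords[doc1] & 0xff) - (ords[doc2] & 0xff);
            }
          }

          final class PackedOrdComparator extends OrdComparator {
            private final DocToOrd ords;
            PackedOrdComparator(DocToOrd ords) { this.ords = ords; }
            OrdComparator setNextReader(Object next) { return specialize(next); }
            int compare(int doc1, int doc2) {
              int a = ords.get(doc1), b = ords.get(doc2);
              return a < b ? -1 : (a == b ? 0 : 1);
            }
          }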

        This is all quite silly: we are only doing this to "game" hotspot into
        properly inlining/compiling what is in fact an array lookup, just
        currently hidden behind method calls in the packed ints impls. We
        really "shouldn't have to" do this custom source code specialization.

        And, I think a more general framework for source-code specialization
        is a cleaner way to minimize hotspot unpredictability (LUCENE-1594)
        in the future. Maybe once we cut over to that, we can remove these
        cases of custom specialization in Lucene's core (the 12 private
        inner Collector impls in TopFieldCollector are another example).

        Here are the results, comparing 3.x perf to trunk w/ the attached
        patch – all runs include the pending [separate] fix on LUCENE-2631:

        Optimized index:

        Query             country  unique10  unique100  unique1K  unique10K  unique100K  unique1M  score
        <all>                8.5%      8.5%       8.4%      8.7%       8.7%        8.4%      9.4%  10.7%
        +united +states      1.8%      0.6%       0.3%      0.4%       0.9%        0.7%      2.1%   2.9%
        "united states"      5.2%      5.5%       5.7%      5.2%       5.2%        4.8%      6.9%   7.1%
        states               4.6%      4.8%       4.1%      5.2%       5.1%        7.0%      3.8%   1.8%
        unite*               2.0%      1.7%       3.0%      2.6%       2.4%        5.7%      6.0%   3.0%
        united states        0.5%      0.4%       2.8%      2.6%       3.1%        2.1%      1.1%   2.0%

        Multi-segment index (5% deletions):

        Query             country  unique10  unique100  unique1K  unique10K  unique100K  unique1M  score
        <all>               10.0%     10.2%      10.1%      9.4%       9.4%       10.1%     10.0%   5.1%
        +united +states      7.2%      7.5%       7.7%      8.5%       8.4%        7.1%      5.4%   1.9%
        "united states"      4.5%      4.2%       4.0%      3.8%       4.5%        4.3%      3.7%   4.2%
        states               6.5%      8.6%       7.3%      6.9%       7.5%        9.4%      9.9%   1.3%
        unite*               4.5%      5.3%       4.3%      3.9%       4.5%        4.7%      4.7%   0.4%
        united states        4.6%      2.4%       3.2%      3.4%       1.9%        4.8%      3.3%   1.9%

        So... this fix does make up much of the difference; we still seem to
        be a bit (single digits) slower, but, I think this is acceptable given
        the massive reduction in RAM required for the FieldCache entry.

        Michael McCandless added a comment -

        Digging into this, finally...

        To try to make a somewhat more realistic search test, I created a
        standalone test (attached zip file), which runs different query types
        (term, phrase, OR of 2 terms, AND of 2 terms, prefix) sorting
        by score or by a string field (with increasing numbers of unique
        values: country (~250 values I think), and then
        unique10/100/1K/10K/100K/1M). I derive the unique fields by taking the
        first N unique titles from wikipedia; the country field comes from the
        SortableSingleDocSource in contrib/benchmark.

        It runs with 2 threads (the machine has 2 cores), and each thread first
        shuffles the queries privately but deterministically, so that each
        matching thread in the trunk & 3x tests is running query+sort in the
        same order.

        I then created a Wikipedia index with the first 5M docs, one optimized
        and one not optimized (13 segments, with 5% of docs deleted), on trunk
        and 3x.

        I sweep through all query+sorts 23 times (getting the top 10 hits for
        each), using 2 threads, measuring wall-clock time each time. I
        discard the first 3 results for each query+sort, and then take the
        fastest time of the remaining 20.

        Java is 1.6.0_17; I run with -server -Xmx1g -Xms1g (machine has 3G
        RAM); OS is Linux CentOS 5.5.

        NOTE: these results include the patch from LUCENE-2504, for both
        trunk & 3.x!

        Results (percentage change in query time, going from 3x -> trunk) on the
        optimized index:

        Query             country  unique10  unique100  unique1K  unique10K  unique100K  unique1M  score
        <all>               40.5%     40.6%      41.0%     40.5%      40.7%        1.6%      1.8%   2.8%
        +united +states      6.1%      6.0%       6.0%      6.6%       6.3%        0.4%      1.4%   1.7%
        "united states"      8.4%      8.5%       8.2%      8.1%       8.1%        9.2%      9.3%   8.7%
        states              20.3%     20.4%      20.9%     22.5%      22.5%        8.0%      8.1%   0.1%
        unite*               8.1%      8.3%       8.3%      8.6%       9.0%        2.8%      0.8%   1.2%
        united states        1.3%      1.9%       2.5%      1.8%       2.2%        2.3%      1.3%   2.2%

        Results on unoptimized index (w/ 5% deletions):

        Query             country  unique10  unique100  unique1K  unique10K  unique100K  unique1M  score
        <all>               25.1%     25.8%      24.9%     27.2%      26.3%       27.4%     27.3%   1.4%
        +united +states      7.8%      7.6%       7.5%      7.8%       7.6%        8.6%      8.9%   6.5%
        "united states"     13.4%     13.7%      13.6%     13.8%      13.4%       14.1%     13.6%  14.8%
        states              13.6%     14.3%      14.2%     15.5%      15.5%       18.6%     18.8%   1.7%
        unite*               5.8%      5.3%       5.0%      5.7%       5.3%        6.9%      6.9%   2.4%
        united states        2.3%      2.6%       1.4%      1.9%       2.5%        4.9%      6.6%   0.1%

        Unfortunately, the tests have highish variance (up to maybe +/- 10%),
        I think thanks to hotspot's unpredictability ("java ghosts"). E.g., if I
        change the order in which the queries are run, the results change
        quite a bit. If I run the exact same test, results change a lot. This
        of course makes conclusions nearly impossible... but still, some rough
        observations:

        • Trunk is definitely slower when sorting by field; sorting by
          score is roughly the same perf.
        • For some reason, the unoptimized index generally takes less perf
          hit than the optimized index... odd.
        • Curious that the phrase query is faster across the board... not sure
          why. Maybe my recent optimizations to PhraseQuery somehow favor flex?
        • Perf loss is in proportion to how "easy" the query is
          (AllDocsQuery is the worst; TermQuery next), which makes sense
          since the slowdown is in collection.

        Even though the results are noisy... I still think we should try to
        specialize direct access to the native array for doc->ord lookup.
        I'll work on that next...

        Shai Erera added a comment -

        Ahh, OK. That makes sense then.

        Yonik Seeley added a comment -

        Do you mean Collector's setNextReader()?

        No, FieldComparator.setNextReader()

        Shai Erera added a comment -

        One possibility: modify setNextReader to return a FieldComparator?

        Do you mean Collector's setNextReader()? That doesn't make sense. Most Collectors don't deal w/ FieldComparators at all ...

        Yonik Seeley added a comment -

        Hmmm, the way FieldComparator / FieldComparatorSource work now, it doesn't seem possible to specialize based on the underlying native array type. In order to do this, a new FieldComparator would need to be returned for each segment.

        One possibility: modify setNextReader to return a FieldComparator?

        Yonik Seeley added a comment -

        More numbers: Windows 7:
        java version "1.6.0_17"
        Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
        Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode)

        f100000_s sort only: 115 ms
        sort against random field: 162 ms

        Yonik Seeley added a comment -

        More numbers: Ubuntu, Java 1.7.0-ea-b98 (64 bit):
        f100000_s sort only: 126 ms
        sort against random field: 175 ms

        Yonik Seeley added a comment -

        My guess is that this is caused by LUCENE-2380, but I opened a separate issue since I'm not sure.
        This is the same type of JVM performance issue reported by Mike in LUCENE-2143 and by myself in LUCENE-2380.

        Setup:
        Same test index I used to test faceting: 10M doc index with 5 fields:

        • f100000_s: a single valued string field with 100,000 unique values
        • f10000_s: a single valued field with 10,000 unique values
        • f1000_s: a single valued field with 1000 unique values
        • f100_s: a single valued field with 100 unique values
        • f10_s: a single valued field with 10 unique values

        URLs I tested against Solr are of the form:
        http://localhost:8983/solr/select?q=*:*&rows=1&sort=f100000_s+asc

        branch_3x
        ----------------------------------------------------------
        f100000_s sort only: 101 ms
        sort against random field: 101 ms

        trunk:
        ----------------------------------------------------------
        f100000_s sort only: 111 ms
        sort against random field: 158 ms

        This is not due to garbage collection or cache effects. After you sort against a mix of fields, the performance is worse forever... you can go back to sorting against f100000_s only, and the performance never recovers.

        System: Ubuntu on Phenom II 4x3.0GHz, Java 1.6_20

        So my guess is that this is caused by the ord lookup going through PagedBytes, and the JVM not optimizing away the indirection when there is a mix of implementations.

        Michael McCandless added a comment -

        I'll dig.


          People

          • Assignee: Unassigned
          • Reporter: Yonik Seeley
          • Votes: 0
          • Watchers: 2
