Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: jackrabbit-core
    • Labels:
      None

      Description

      Currently all SQL2 queries with "order by" expression are sorted with Collections.sort - which is slow and very bad if there are lots of hits. Lucene should be used for sorting hits instead.

      1. JCR-2959.patch
        35 kB
        Alex Parvulescu

        Issue Links

          Activity

          Alex Parvulescu added a comment -

          I've added a new test suite that includes the changes to the 2.3 performance tests in revision #1161481.

          This lets us compare the two sort implementations (it will also help track the evolution of the new implementation).

          Alex Parvulescu added a comment -

          I pushed the first iteration based on the patch in revision #1161475.

          It does not currently use native Lucene sorts (we're missing the APIs to figure out whether a given field is indexed or not).
          It is also untested against joins of any kind.

          The code is disabled by default, to keep things working. You can enable it by setting the system property "useNativeSort" to true.
          This property will be removed once the implementation is stable enough.
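          As a sketch of how one might flip the switch (the property name "useNativeSort" comes from the comment above; the surrounding class is hypothetical, not Jackrabbit code), the flag is just a plain JVM system property that has to be set before the query handler reads it:

```java
public class NativeSortFlag {
    public static void main(String[] args) {
        // Enable the experimental Lucene-based sorting. "useNativeSort"
        // is the temporary system property mentioned above, slated for
        // removal once the implementation is stable.
        System.setProperty("useNativeSort", "true");

        // Query code can then check the flag like this:
        boolean useNativeSort = Boolean.getBoolean("useNativeSort");
        System.out.println(useNativeSort); // prints "true"
    }
}
```

          The same effect can be had from the command line with -DuseNativeSort=true.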

          Jukka Zitting added a comment -

          BTW, +1 on applying the patch as a work in progress. It's then much easier for us to propose improvements or alternatives.

          Alex Parvulescu added a comment -

          I've updated JCR-3047, so for the patch to still work, please replace 'String p = evaluator.getAffectedPropertyName' with 'String p = o.toString()'.

          Yes, currently the sort is still done in Jackrabbit, but it is extremely easy to fall back on native Lucene fields (I just have to figure out how to know when that is available).

          I believe that a native sort would bring more benefit in the long term, as it will allow for other performance improvements (see also JCR-2830).
          I'd aim at having equivalent run times in worst-case scenarios, and at doing well on some of the worst-performing scenarios, so that SQL2 stops being the underdog in the query language wars.
          And I find it hard to believe that the improvement is just a fluke.

          I agree that we need to test some more; that is why the patch is a working prototype.
          I'll go for the test scenario you proposed, so we can see what turns up.

          Jukka Zitting added a comment -

          If I understand correctly, the implementation in the patch still doesn't use the actual fields in Lucene for sorting; it just loads the node using Session.getNodeByIdentifier() and uses the OperandEvaluator to get the values by which documents are to be compared. It seems to me that this is pretty much equivalent, in terms of disk accesses, to what the existing sorting mechanism does, and the speedup you're seeing is probably simply caused by the new sorting mechanism memoizing the values returned by the OperandEvaluator. I would expect that difference to fade away once the test set becomes larger than what fits into memory.
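          The memoization hypothesis can be sketched as a decorate-sort-undecorate pattern: the expensive key function (standing in for the OperandEvaluator lookup) is evaluated once per item instead of once per comparison. The names below are illustrative, not Jackrabbit API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class MemoizedSort {
    /**
     * Sorts items by an expensive key function, evaluating the key
     * once per item (n evaluations) rather than once per comparison
     * (up to n log n evaluations inside Collections.sort).
     */
    static <T, K extends Comparable<K>> List<T> sortByKey(
            List<T> items, Function<T, K> expensiveKey) {
        Map<T, K> cache = new HashMap<>();
        for (T item : items) {
            cache.put(item, expensiveKey.apply(item)); // e.g. a node load
        }
        List<T> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparing(cache::get)); // compares cached keys only
        return sorted;
    }

    public static void main(String[] args) {
        List<String> nodeNames = Arrays.asList("c", "a", "b");
        System.out.println(sortByKey(nodeNames, s -> s)); // [a, b, c]
    }
}
```

          If the observed speedup comes from this effect alone, it would persist only while the cached keys fit in memory, which matches the concern above.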

          Some math to give us more background: the execution of a single-selector query currently consists of the following main steps:

          1. Constructing the Lucene query
          2. Executing the Lucene query
          3. Fetching all matching nodes to memory
          4. Sorting the nodes

          Step 1 is just an in-memory mapping, so not too time-consuming. Steps 2 and 3 are probably the most time-consuming bits here, but both are outside the scope of this issue. Finally, step 4 consists of O(n log n) in-memory operations, where n is the number of returned nodes. That's some time, but as these are in-memory operations, the overhead should be pretty small compared to the previous steps, whose worst-case behavior is O(n) disk accesses.

          The proposed change here is to switch steps 3 and 4 around and use Lucene for the sorting. The time required to sort all the results is still O(n log n) and may well require some extra disk accesses to fetch the sort fields from the Lucene index. However, the time taken in this step should still be significantly lower than in fetching all the matching nodes to memory. Thus the big benefit that we can gain from this optimization comes in cases where we don't want to fetch all the matching nodes to memory.

          A good test case for such a scenario could be a data set of 1m nodes, of which a query targets a subset of 100k nodes, and only the first 100 nodes of the sorted results are actually read. With a proper implementation of this mechanism, the Lucene search would find all the matching 100k documents and read the sort fields (in the Lucene index) of those documents for quickly ordering the results. That should still be pretty fast compared to fetching all the 100k nodes into memory, which is what the current implementation (and the current patch) does. And then fetching only the first 100 results should take no time at all. I'd expect us to reach up to an order of magnitude of performance improvement for such a scenario.
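          The payoff described above — ordering on index-resident sort keys and then fetching only the first 100 nodes — can be illustrated with a bounded-heap top-k selection over plain sort keys. This is illustrative code, not the actual Jackrabbit/Lucene implementation; Lucene's own TopFieldCollector does something conceptually similar:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopKResults {
    /**
     * Selects the k smallest sort keys out of n matches in O(n log k),
     * touching only the keys themselves. Only the k winners would then
     * need to be resolved into full nodes.
     */
    static List<Integer> topK(int[] sortKeys, int k) {
        // Max-heap of the k smallest keys seen so far.
        PriorityQueue<Integer> heap = new PriorityQueue<>(Comparator.reverseOrder());
        for (int key : sortKeys) {
            heap.offer(key);
            if (heap.size() > k) {
                heap.poll(); // evict the current largest
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        int[] keys = {42, 7, 99, 3, 15};
        System.out.println(topK(keys, 3)); // [3, 7, 15]
    }
}
```

          With 100k matches and k = 100, this never materializes more than 100 candidates, whereas sort-after-fetch has to load all 100k nodes first.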

          Alex Parvulescu added a comment -

          Ok, I'm all for testing and profiling.

          This is with iteration included (Node.getPath() called on each entry):

          lucene sort
          select took 507 ms
          fetch took 45 ms
          select took 106 ms
          fetch took 7 ms
          select took 102 ms
          fetch took 6 ms
          select took 84 ms
          fetch took 7 ms
          select took 100 ms
          fetch took 6 ms
          select took 84 ms
          fetch took 5 ms
          select took 61 ms
          fetch took 5 ms
          select took 55 ms
          fetch took 6 ms
          select took 46 ms
          fetch took 5 ms
          select took 39 ms
          fetch took 5 ms

          collections.sort
          select took 600 ms
          fetch took 43 ms
          select took 104 ms
          fetch took 6 ms
          select took 88 ms
          fetch took 7 ms
          select took 92 ms
          fetch took 7 ms
          select took 109 ms
          fetch took 5 ms
          select took 90 ms
          fetch took 5 ms
          select took 89 ms
          fetch took 6 ms
          select took 89 ms
          fetch took 5 ms
          select took 90 ms
          fetch took 5 ms
          select took 98 ms
          fetch took 6 ms

          It still looks better.

          While I agree that this solution has a different access pattern, which could turn out not to be that good, I'd like to see a test scenario that can surface the problems.
          Also, this is a rough draft, so I'd like to validate (or invalidate) the approach using more math rather than history.

          Like you pointed out, a native approach to sorting can allow us to improve certain use cases that perform really poorly in SQL2 right now. Sorting the entire result set at the end does limit what we can do in the way of optimization and performance.

          Jukka Zitting added a comment -

          Just measuring the Query.execute() time doesn't give you very meaningful results. You'll need to include the time it takes to iterate through the returned nodes, as that's the time that an end user would be seeing.

          I'd expect the Collections.sort() overhead to be a problem only for big query results of which the client only traverses a small subset of first results (i.e. the typical Google search result use case). For such cases it would indeed be great if the SQL2 engine could leverage the result ordering features of the underlying Lucene index.

          As for the implementation, I'd rethink the rather complicated DynamicOperandFieldComparator mechanism that can make the Lucene index call back through layers of abstraction with Session.getNodeByIdentifier(). Instead we could simply divide the given Orderings to those that the Lucene index can take care of and those that need to be applied with the current Collections.sort() implementation. Any Lucene-compatible Orderings at the beginning of the list would be converted and passed to Lucene, and the remaining Orderings (if any) would be handled with the current mechanism.
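          The suggested split could look roughly like this sketch, where Ordering and its indexed flag are hypothetical stand-ins for Jackrabbit's ordering specifications and for the (currently missing) "is this property indexed" check:

```java
import java.util.List;

public class OrderingSplitter {
    /**
     * Illustrative stand-in for an ORDER BY term; the "indexed" flag
     * marks orderings the Lucene index could evaluate natively.
     */
    record Ordering(String property, boolean indexed) {}

    /**
     * Splits the orderings into a Lucene-compatible prefix (converted
     * and passed to the Lucene query) and the remainder (applied
     * afterwards with the existing Collections.sort() mechanism).
     */
    static List<List<Ordering>> split(List<Ordering> orderings) {
        int i = 0;
        while (i < orderings.size() && orderings.get(i).indexed()) {
            i++;
        }
        return List.of(
                orderings.subList(0, i),
                orderings.subList(i, orderings.size()));
    }

    public static void main(String[] args) {
        List<Ordering> orderings = List.of(
                new Ordering("jcr:created", true),
                new Ordering("jcr:title", true),
                new Ordering("computedScore", false));
        List<List<Ordering>> parts = split(orderings);
        System.out.println(parts.get(0).size() + " for Lucene, "
                + parts.get(1).size() + " for Collections.sort");
        // prints "2 for Lucene, 1 for Collections.sort"
    }
}
```

          Only a prefix can be delegated: once one ordering must be applied in memory, every less-significant ordering after it must be applied there too to keep the result stable.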

          Alex Parvulescu added a comment -

          watch out, big patch coming through!

          A first stab at a more integrated approach to sort. It doesn't cover all the bases yet, but it should be enough to get us started.

          Just to pique your interest, I've attached a perf test (SQL2OrderByTest#testBig); here are the first results I see:

          Old style (10 iterations)
          ran select, took 941 ms
          ran select, took 310 ms
          ran select, took 199 ms
          ran select, took 102 ms
          ran select, took 112 ms
          ran select, took 734 ms
          ran select, took 189 ms
          ran select, took 104 ms
          ran select, took 105 ms
          ran select, took 125 ms

          New style (10 iterations)
          ran select, took 629 ms
          ran select, took 134 ms
          ran select, took 117 ms
          ran select, took 117 ms
          ran select, took 133 ms
          ran select, took 618 ms
          ran select, took 108 ms
          ran select, took 94 ms
          ran select, took 84 ms
          ran select, took 80 ms

          It looks interesting: ~30% faster run times.

          You can switch between implementations via a global flag (I know it's not nice):
          QueryEngine.NATIVE_SORT: true for native Lucene sort, false for Collections.sort.

          It doesn't know how to sort joins yet, just simple queries.
          Also, it could do better (like falling back on native Lucene properties when they are available in the index, instead of loading the entire node).

          Feedback very welcome!

          Alex Parvulescu added a comment -

          I've changed the name; it was a bit too inflammatory.

          I agree with the idea that we should investigate native lucene search, with custom hooks into JR for fetching Node data.


            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Seidel
            • Votes:
              4
              Watchers:
              4

              Dates

              • Created:
                Updated:
