Apache Jena
  1. Apache Jena
  2. JENA-89

Avoid a total sort for ORDER BY + LIMIT queries

    Details

      Description

      In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
      As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.

      ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.

      ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.

      Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT.

      [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
      [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
      [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
      [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

      1. ARQ_JENA-89_r1156212.patch
        12 kB
        Paolo Castagna
      2. JENA-89-Top-FixHeapCollation.patch
        6 kB
        Andy Seaborne
      3. JENA-89-TopTests.patch
        33 kB
        Andy Seaborne

        Issue Links

          Activity

          Hide
          Andy Seaborne added a comment -

          Set fix version to clear up unmarked ones.

          Show
          Andy Seaborne added a comment - Set fix version to clear up unmarked ones.
          Hide
          Andy Seaborne added a comment -

          Reopen to set fix version.

          Show
          Andy Seaborne added a comment - Reopen to set fix version.
          Hide
          Paolo Castagna added a comment -

          Thanks Andy (new tests have been added and bug fixed).

          Show
          Paolo Castagna added a comment - Thanks Andy (new tests have been added and bug fixed).
          Hide
          Andy Seaborne added a comment -

          In creating the tests, a bug was discovered.

          The heap needs to keep the least N items. So when a new item to keep comes in, the greatest item so far needs to be evicted, not the least. In other words, the heap is kept in maximum order and the test to to look at the max of the N items (heap top) to see if the new binding is less than the max least N so far.

          Patch attached. heap uses a reversed comparator.

          Also (minor) avoid a copy by Arrays.asList.

          Show
          Andy Seaborne added a comment - In creating the tests, a bug was discovered. The heap needs to keep the least N items. So when a new item to keep comes in, the greatest item so far needs to be evicted, not the least. In other words, the heap is kept in maximum order and the test to to look at the max of the N items (heap top) to see if the new binding is less than the max least N so far. Patch attached. heap uses a reversed comparator. Also (minor) avoid a copy by Arrays.asList.
          Hide
          Andy Seaborne added a comment -

          The existing tests are for other things and don't gaurantee coverage. Having a specific set of tests for the feature means it's easier to see what is and is not covered.

          Patch JENA-89-TopTests attached for tests in testing/ARQ/Optimizer.

          Show
          Andy Seaborne added a comment - The existing tests are for other things and don't gaurantee coverage. Having a specific set of tests for the feature means it's easier to see what is and is not covered. Patch JENA-89 -TopTests attached for tests in testing/ARQ/Optimizer.
          Hide
          Andy Seaborne added a comment -

          TestOptimizer (which you've already found).

          Could add op->op tests.

          c.f. TestVarRename

          Show
          Andy Seaborne added a comment - TestOptimizer (which you've already found). Could add op->op tests. c.f. TestVarRename
          Hide
          Paolo Castagna added a comment -

          Thanks Andy.

          By the way, I had a look at the scripted tests: ORDER BY + LIMIT queries are present in various places. So, I did not duplicate those tests in testing/ARQ/Optimization.

          However, it would be good to have something similar to the scripted tests but to check the SPARQL algebra before optimization and after optimization. Is this something already there? I did not find it. If not, is it worth doing? I can give it a try.

          Show
          Paolo Castagna added a comment - Thanks Andy. By the way, I had a look at the scripted tests: ORDER BY + LIMIT queries are present in various places. So, I did not duplicate those tests in testing/ARQ/Optimization. However, it would be good to have something similar to the scripted tests but to check the SPARQL algebra before optimization and after optimization. Is this something already there? I did not find it. If not, is it worth doing? I can give it a try.
          Hide
          Andy Seaborne added a comment -

          To test this, we could have scripted queries in testing/ARQ/Optimization although it might incidently be partially covered elsewhere.

          Show
          Andy Seaborne added a comment - To test this, we could have scripted queries in testing/ARQ/Optimization although it might incidently be partially covered elsewhere.
          Hide
          Andy Seaborne added a comment -

          new PriorityQueue<Binding>(TOPN_LIMIT_THRESHOLD, comparator) ;

          Should this be:

          new PriorityQueue<Binding>(numItems, comparator) ;

          ?
          The PQ only needs to be the size of N.

          Show
          Andy Seaborne added a comment - new PriorityQueue<Binding>(TOPN_LIMIT_THRESHOLD, comparator) ; Should this be: new PriorityQueue<Binding>(numItems, comparator) ; ? The PQ only needs to be the size of N.
          Hide
          Andy Seaborne added a comment -

          A fixed threshold seems sensible. It can also always be made adjustable later. I think the key use case is small values (e.g. 10) so a fixed limit more than that is fine.

          BuilderOp fixes look right.

          qparse will test this out - it prefers various checks, including round-tripping through printed algebra (as you've probably already discovered )

          Show
          Andy Seaborne added a comment - A fixed threshold seems sensible. It can also always be made adjustable later. I think the key use case is small values (e.g. 10) so a fixed limit more than that is fine. BuilderOp fixes look right. qparse will test this out - it prefers various checks, including round-tripping through printed algebra (as you've probably already discovered )
          Hide
          Paolo Castagna added a comment - - edited

          The optimizer applies the transformation if and only if LIMIT is less than the threshold and no OFFSET is used.
          Are we happy with a fixed TOPN_LIMIT_THRESHOLD = 100?

          The optimization is active by default, but it can be turned off via --set http://jena.hpl.hp.com/ARQ#optTopNSorting=false

          The changes in BuilderOp are the ones I am less sure about:

          • BuilderLib.checkLength(4, list, Tags.tagTopN) ;
          • int N = BuilderNode.buildInt(list, 1, -1) ;
          • ItemList conditions = list.get(2).getList() ;
            + BuilderLib.checkLength(3, list, Tags.tagTopN) ;
            + long N = BuilderNode.buildInt(list.get(1).getList(), 0, -1) ;
            + ItemList conditions = list.get(1).getList().cdr() ;

          An example of the optimization in action follows:

          SPARQL query:

          SELECT *
          WHERE

          { ?s ?p ?o }

          ORDER BY ?p ?s
          LIMIT 10

          SPARQL algebra expression unchanged:

          (slice _ 10
          (order (?p ?s)
          (bgp (triple ?s ?p ?o))))

          SPARQL algebra expression with the optimization applied:

          (top (10 ?p ?s)
          (bgp (triple ?s ?p ?o)))

          Show
          Paolo Castagna added a comment - - edited The optimizer applies the transformation if and only if LIMIT is less than the threshold and no OFFSET is used. Are we happy with a fixed TOPN_LIMIT_THRESHOLD = 100? The optimization is active by default, but it can be turned off via --set http://jena.hpl.hp.com/ARQ#optTopNSorting=false The changes in BuilderOp are the ones I am less sure about: BuilderLib.checkLength(4, list, Tags.tagTopN) ; int N = BuilderNode.buildInt(list, 1, -1) ; ItemList conditions = list.get(2).getList() ; + BuilderLib.checkLength(3, list, Tags.tagTopN) ; + long N = BuilderNode.buildInt(list.get(1).getList(), 0, -1) ; + ItemList conditions = list.get(1).getList().cdr() ; An example of the optimization in action follows: SPARQL query: SELECT * WHERE { ?s ?p ?o } ORDER BY ?p ?s LIMIT 10 SPARQL algebra expression unchanged: (slice _ 10 (order (?p ?s) (bgp (triple ?s ?p ?o)))) SPARQL algebra expression with the optimization applied: (top (10 ?p ?s) (bgp (triple ?s ?p ?o)))
          Hide
          Paolo Castagna added a comment -

          I am attaching a patch for this to get some feedback before committing it. There are a couple of changes I am not 100% sure about.

          Show
          Paolo Castagna added a comment - I am attaching a patch for this to get some feedback before committing it. There are a couple of changes I am not 100% sure about.

            People

            • Assignee:
              Paolo Castagna
              Reporter:
              Paolo Castagna
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development