Apache Jena
  1. Apache Jena
  2. JENA-89

Avoid a total sort for ORDER BY + LIMIT queries

    Details

      Description

      In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
      As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.

      ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.

      ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.

      Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT.

      [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
      [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
      [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
      [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

      1. JENA-89-TopTests.patch
        33 kB
        Andy Seaborne
      2. JENA-89-Top-FixHeapCollation.patch
        6 kB
        Andy Seaborne
      3. ARQ_JENA-89_r1156212.patch
        12 kB
        Paolo Castagna

        Issue Links

          Activity

          Andy Seaborne made changes -
          Status Reopened [ 4 ] Closed [ 6 ]
          Resolution Done [ 11 ]
          Fix Version/s Jena 2.11.0 [ 12324437 ]
          Andy Seaborne made changes -
          Resolution Fixed [ 1 ]
          Status Closed [ 6 ] Reopened [ 4 ]
          Sara Magliacane made changes -
          Link This issue relates to JENA-111 [ JENA-111 ]
          Sara Magliacane made changes -
          Link This issue is blocked by JENA-111 [ JENA-111 ]
          Sara Magliacane made changes -
          Link This issue is blocked by JENA-111 [ JENA-111 ]
          Paolo Castagna made changes -
          Link This issue relates to JENA-109 [ JENA-109 ]
          Paolo Castagna made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Paolo Castagna made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Paolo Castagna made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Andy Seaborne made changes -
          Attachment JENA-89-TopTests.patch [ 12490344 ]
          Andy Seaborne made changes -
          Attachment JENA-89-Top-FixHeapCollation.patch [ 12490345 ]
          Andy Seaborne made changes -
          Attachment JENA-89-TopTests.patch [ 12490344 ]
          Andy Seaborne made changes -
          Attachment JENA-89-TopTests.patch [ 12490343 ]
          Paolo Castagna made changes -
          Status In Progress [ 3 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Andy Seaborne made changes -
          Comment [ OpBuilder was broken in another was as well: "Op sub = build(list, 3) ;" now fixed.
          ]
          Paolo Castagna made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Paolo Castagna made changes -
          Attachment ARQ_JENA-89_r1156212.patch [ 12490095 ]
          Paolo Castagna made changes -
          Link This issue is related to JENA-90 [ JENA-90 ]
          Paolo Castagna made changes -
          Description In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
          As an alternative, we can use a PriorityQueue [1] (which is based on a priority heap) to avoid a sort operation.

          ARQ's algebra package contains already a OpTopN [2] operator. The OpExecutor [3] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.

          ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.

          Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT.

            [1] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
            [2] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
            [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java
          In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
          As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is based on a priority heap) to avoid a sort operation.

          ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.

          ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.

          Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT.

            [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
            [2] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
            [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
            [4] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java
          Paolo Castagna made changes -
          Field Original Value New Value
          Description In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
          As an alternative, we can use a [PriorityQueue|http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html] (which is based on a priority heap) to avoid a sort operation.

          ARQ's algebra package contains already a [OpTopN|https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java] operator. The [OpExecutor|https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.

          ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.

          Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT.
          In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and then produces the first N, according to the specified LIMIT.
          As an alternative, we can use a PriorityQueue [1] (which is based on a priority heap) to avoid a sort operation.

          ARQ's algebra package contains already a OpTopN [2] operator. The OpExecutor [3] will need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also necessary.

          ORDER BY + LIMIT queries are typically used to construct the first page when results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using results from the cache. However, the improvement described by this issue is limited to the ORDER BY + LIMIT case for which a priority heap is a good enough solution.

          Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of small values specified on the LIMIT.

            [1] http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
            [2] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
            [3] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java
          Paolo Castagna created issue -

            People

            • Assignee:
              Paolo Castagna
              Reporter:
              Paolo Castagna
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development