Apache Jena
  1. Apache Jena
  2. JENA-90

Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ARQ
    • Labels:

      Description

      ARQ's optimizer could use an OpReduce instead of OpDistinct if a query is DISTINCT + ORDER BY.
      OpReduce removes adjacent duplicates and it does not require a set of already seen bindings as the current OpDistinct implementation does.

        Issue Links

          Activity

          Hide
          Paolo Castagna added a comment -

          Thanks Stephen and Andy for your comments and guidance.

          I now want to think (again!) at how the optimizations in JENA-89 and JENA-90 interact each others when we have a DISTINCT + ORDER BY + LIMIT query (i.e. people want to find the 10 most something things in their data).

          Show
          Paolo Castagna added a comment - Thanks Stephen and Andy for your comments and guidance. I now want to think (again!) at how the optimizations in JENA-89 and JENA-90 interact each others when we have a DISTINCT + ORDER BY + LIMIT query (i.e. people want to find the 10 most something things in their data).
          Hide
          Paolo Castagna added a comment -

          Stephen, thanks for your comments and suggestions. They make sense, I'll do that.

          Show
          Paolo Castagna added a comment - Stephen, thanks for your comments and suggestions. They make sense, I'll do that.
          Hide
          Stephen Allen added a comment -

          Hi Paolo,

          I think the approach you want is to use QueryIterReduced instead of the new QueryIterDistinctSort class you propose (also, an important note: [1]). Perhaps QueryIterReduced could possibly be optimized a little bit by eliminating the general purpose window array and using a single variable in this particular case of a sorted input.

          Although, in my mind, a better approach would be to modify the algebra as part of a query optimization step (replace the OpDistinct with an OpReduced) when it is known that the QueryIterator to which it is applied to is sorted (either because of an underlying OpOrder or a sorted triple/quad index). This makes it easier to determine what is going on during a query execution by examining the transformed algebra instead of having branches in the physical operators themselves.

          [1] DistinctDataBag is not guaranteed to be sorted. The in-memory bindings are stored in a HashSet, thus if the bag does not spill to disk then no attempt is made to sort the bindings in the iterator (so as not to perform extra effort). It would not be hard to create a DistinctSortedDataBag, but I'm not sure that it is necessary (and IMO limiting the number of primitive operations helps simplify the system).

          Show
          Stephen Allen added a comment - Hi Paolo, I think the approach you want is to use QueryIterReduced instead of the new QueryIterDistinctSort class you propose (also, an important note: [1] ). Perhaps QueryIterReduced could possibly be optimized a little bit by eliminating the general purpose window array and using a single variable in this particular case of a sorted input. Although, in my mind, a better approach would be to modify the algebra as part of a query optimization step (replace the OpDistinct with an OpReduced) when it is known that the QueryIterator to which it is applied to is sorted (either because of an underlying OpOrder or a sorted triple/quad index). This makes it easier to determine what is going on during a query execution by examining the transformed algebra instead of having branches in the physical operators themselves. [1] DistinctDataBag is not guaranteed to be sorted. The in-memory bindings are stored in a HashSet, thus if the bag does not spill to disk then no attempt is made to sort the bindings in the iterator (so as not to perform extra effort). It would not be hard to create a DistinctSortedDataBag, but I'm not sure that it is necessary (and IMO limiting the number of primitive operations helps simplify the system).
          Hide
          Paolo Castagna added a comment -

          This is a first attempt which I am sharing for feedback (now, more testing is necessary... after the lesson learned from JENA-89 ).

          Another option would have been to replace distinct with reduce via an algebra transformation and used the QueryIterDistinctSort in OpExecutor.execute(OpReduced, ...) method instead. But I did not see advantages in doing so, unless we want to control the optimization via a symbol.

          Both are reasonable and very simple... and the tests will be the same.

          Show
          Paolo Castagna added a comment - This is a first attempt which I am sharing for feedback (now, more testing is necessary... after the lesson learned from JENA-89 ). Another option would have been to replace distinct with reduce via an algebra transformation and used the QueryIterDistinctSort in OpExecutor.execute(OpReduced, ...) method instead. But I did not see advantages in doing so, unless we want to control the optimization via a symbol. Both are reasonable and very simple... and the tests will be the same.
          Hide
          Paolo Castagna added a comment -

          A DistinctDataBag can be used to provide a new QueryIterDistinctSort implementation to be used in OpExecutor.execute(OpDistinct opDistinct, QueryIterator input) method instead of QueryIterDistinct when a distinct ( order (...)) pattern is spotted.

          Show
          Paolo Castagna added a comment - A DistinctDataBag can be used to provide a new QueryIterDistinctSort implementation to be used in OpExecutor.execute(OpDistinct opDistinct, QueryIterator input) method instead of QueryIterDistinct when a distinct ( order (...)) pattern is spotted.
          Hide
          Paolo Castagna added a comment -

          > I think 3 is neater.

          Ok.

          However, I still have a doubt.

          With this algebra expression:

          (slice _ 10
          (distinct
          (order (?p)
          ...)))

          The optimization in JENA-89 will not trigger. Therefore there will be no (top ...)

          Could we have the distinct before the order?

          (slice _ 10
          (order (?p)
          (distinct
          ...)))

          Or, should we do it in OpExecutor(OpSlice) checking if it sees the algebra patter (slice ... (distinct ... (order (...)))?

          Show
          Paolo Castagna added a comment - > I think 3 is neater. Ok. However, I still have a doubt. With this algebra expression: (slice _ 10 (distinct (order (?p) ...))) The optimization in JENA-89 will not trigger. Therefore there will be no (top ...) Could we have the distinct before the order? (slice _ 10 (order (?p) (distinct ...))) Or, should we do it in OpExecutor(OpSlice) checking if it sees the algebra patter (slice ... (distinct ... (order (...)))?
          Hide
          Andy Seaborne added a comment -

          There are a number of ways to approach this:

          1/ Add a parameter to the "top" operator e.g.(top distinct 10 (sort) (sub))
          2/ Add a new operator e.g. top_distinct
          3/ Decide when choosing QueryIterator based on (sub)

          In (3), if the OpExecutor(OpTop) sees the algebra pattern:

          (top N (?x)
          (distinct
          ....))

          by

          opTop.getSubOp() instanceof OpDistinct

          then it chooses QueryIterTopNDistinct over the sub-expression of OpDistinct
          else choose QueryIterTopN.

          This is treating it as a lower-level optimization.

          I think 3 is neater.

          Show
          Andy Seaborne added a comment - There are a number of ways to approach this: 1/ Add a parameter to the "top" operator e.g.(top distinct 10 (sort) (sub)) 2/ Add a new operator e.g. top_distinct 3/ Decide when choosing QueryIterator based on (sub) In (3), if the OpExecutor(OpTop) sees the algebra pattern: (top N (?x) (distinct ....)) by opTop.getSubOp() instanceof OpDistinct then it chooses QueryIterTopNDistinct over the sub-expression of OpDistinct else choose QueryIterTopN. This is treating it as a lower-level optimization. I think 3 is neater.
          Hide
          Andreas Schultz added a comment -

          About not needing a set of already seen bindings: This would only be true if all variables in the Select clause were also part of the Order By. In general a set would still be needed. The optimization in the general case would be to clear this set every time the variable bindings used in the Order By change.

          Show
          Andreas Schultz added a comment - About not needing a set of already seen bindings: This would only be true if all variables in the Select clause were also part of the Order By. In general a set would still be needed. The optimization in the general case would be to clear this set every time the variable bindings used in the Order By change.
          Hide
          Paolo Castagna added a comment -

          Let's take, for example, this SPARQL query:

          SELECT DISTINCT *
          WHERE

          { ?s ?p ?o }

          ORDER BY ?p
          LIMIT 10

          The correspondent algebra expression is:

          (slice _ 10
          (distinct
          (order (?p)
          (bgp (triple ?s ?p ?o)))))

          Which is equivalent to:

          (slice _ 10
          (reduced
          (order (?p)
          (bgp (triple ?s ?p ?o)))))

          However, the distinct or reduced operators forbid the optimization described in JENA-89. Maybe we can modify the 'top' operator to yields only distinct bindings or add a new 'top_distinct' operator for that:

          (top_distinct (10 ?p ?s)
          (bgp (triple ?s ?p ?o)))

          SPARQL queries of the type SELECT DISTINCT ... WHERE

          {...}

          ORDER BY ... LIMIT 10 are common when people want to display the 10 most 'something' things in their dataset.

          The implementation of a QueryIterTopNDistinct is almost the same as QueryIterTopN (see: JENA-89) but we add bindings to the PriorityQueue if and only if they are not already there (using .contains() to check).

          Is it worth adding a top_distinct operator or it just pollutes the algebra?

          Show
          Paolo Castagna added a comment - Let's take, for example, this SPARQL query: SELECT DISTINCT * WHERE { ?s ?p ?o } ORDER BY ?p LIMIT 10 The correspondent algebra expression is: (slice _ 10 (distinct (order (?p) (bgp (triple ?s ?p ?o))))) Which is equivalent to: (slice _ 10 (reduced (order (?p) (bgp (triple ?s ?p ?o))))) However, the distinct or reduced operators forbid the optimization described in JENA-89 . Maybe we can modify the 'top' operator to yields only distinct bindings or add a new 'top_distinct' operator for that: (top_distinct (10 ?p ?s) (bgp (triple ?s ?p ?o))) SPARQL queries of the type SELECT DISTINCT ... WHERE {...} ORDER BY ... LIMIT 10 are common when people want to display the 10 most 'something' things in their dataset. The implementation of a QueryIterTopNDistinct is almost the same as QueryIterTopN (see: JENA-89 ) but we add bindings to the PriorityQueue if and only if they are not already there (using .contains() to check). Is it worth adding a top_distinct operator or it just pollutes the algebra?

            People

            • Assignee:
              Paolo Castagna
              Reporter:
              Paolo Castagna
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development