Apache Jena
  1. Apache Jena
  2. JENA-111

Improving TopN optimization in case of an intermediate OpModifier

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ARQ
    • Labels:

      Description

      In the TopN optimization (Jena-89) it would be useful to handle also the case in which there are some other OpModifiers (I think they are the only category of Ops that can be in that position in the tree) between Slice and Order By, for example OpProject:

      (slice _1
      (project ?s ...
      (order by <condition>

      ->
      (project ?s ...
      (top 1 <condition>

      1. JENA-111_ARQ_r1165795.patch
        12 kB
        Paolo Castagna
      2. topk_project.patch
        2 kB
        Sara Magliacane

        Issue Links

          Activity

          Hide
          Sara Magliacane added a comment -

          I tried to fix the problem in the most simple way possible, but it won't compile, because it fails two tests in the DAWG-FInal/solution-seq/ folder ( slice_10.rq and slice_11.rq).

          I'm not sure whether this is a problem of this particular code, because usually in this test the TopN optimization is not triggered since it contains an OpProject. Moreover, I tried to change the queries from a "SELECT ?v" -> "SELECT * " and the original code does not pass the test (but maybe my understanding of what should be returned by a SELECT * in case of a blank node is wrong.... )

          Show
          Sara Magliacane added a comment - I tried to fix the problem in the most simple way possible, but it won't compile, because it fails two tests in the DAWG-FInal/solution-seq/ folder ( slice_10.rq and slice_11.rq). I'm not sure whether this is a problem of this particular code, because usually in this test the TopN optimization is not triggered since it contains an OpProject. Moreover, I tried to change the queries from a "SELECT ?v" -> "SELECT * " and the original code does not pass the test (but maybe my understanding of what should be returned by a SELECT * in case of a blank node is wrong.... )
          Hide
          Andy Seaborne added a comment -

          For reference, the full possible stack of modifiers possible in a single level of SPARQL query is:

          Limit/Offset
          Distinct/reduce
          project
          OrderBy
          Bindings
          having
          select expressions
          group

          but, by using subSELECTs, you can write any sequence you want.

          SELECT DISTINCT ?x

          { ?x :p 123 . }

          ORDER BY ?x LIMIT 2

          (slice _ 2
          (distinct
          (project (?x)
          (order (?x)
          (bgp (triple ?x 123))))))

          Show
          Andy Seaborne added a comment - For reference, the full possible stack of modifiers possible in a single level of SPARQL query is: Limit/Offset Distinct/reduce project OrderBy Bindings having select expressions group but, by using subSELECTs, you can write any sequence you want. SELECT DISTINCT ?x { ?x :p 123 . } ORDER BY ?x LIMIT 2 (slice _ 2 (distinct (project (?x) (order (?x) (bgp (triple ?x 123))))))
          Hide
          Paolo Castagna added a comment - - edited

          Sara's proposal seems a good idea to me and I did not spotted initially.
          Are there other situations where an algebra operator between slice and order will inhibit the optimizations in JENA-89 and JENA-109?

          Given:

          (slice X N
          (distinct or reduced
          (project (vars)
          (order (cond)
          PATTERN ))))

          There are four possibilities:

          1) slice order [DONE]
          2) slice distinct|reduced order [DONE]
          3) slice distinct project order [NEW]
          4) slice project order [NEW]

          1)

          (slice X N
          (order (cond)
          PATTERN ))
          ==>
          (slice X _
          (top (X+N cond)
          PATTERN ))

          2)

          (slice X N
          (distinct or reduced
          (order (cond)
          PATTERN )))
          ==>
          (slice X _
          (top (X+N cond)
          (distinct
          PATTERN )))

          3)

          (slice X N
          (distinct or reduced
          (project (vars)
          (order (cond)
          PATTERN ))))
          ==>
          (slice X _
          (distinct
          (project (vars)
          (top (X+N cond)
          PATTERN ))))

          4)

          (slice X N
          (project (vars)
          (order (cond)
          PATTERN )))
          ==>
          (slice X _
          (project (vars)
          (top (X+N cond)
          PATTERN )))

          Although, I am not 100% sure about 3) and 4).

          Show
          Paolo Castagna added a comment - - edited Sara's proposal seems a good idea to me and I did not spotted initially. Are there other situations where an algebra operator between slice and order will inhibit the optimizations in JENA-89 and JENA-109 ? Given: (slice X N (distinct or reduced (project (vars) (order (cond) PATTERN )))) There are four possibilities: 1) slice order [DONE] 2) slice distinct|reduced order [DONE] 3) slice distinct project order [NEW] 4) slice project order [NEW] 1) (slice X N (order (cond) PATTERN )) ==> (slice X _ (top (X+N cond) PATTERN )) 2) (slice X N (distinct or reduced (order (cond) PATTERN ))) ==> (slice X _ (top (X+N cond) (distinct PATTERN ))) 3) (slice X N (distinct or reduced (project (vars) (order (cond) PATTERN )))) ==> (slice X _ (distinct (project (vars) (top (X+N cond) PATTERN )))) 4) (slice X N (project (vars) (order (cond) PATTERN ))) ==> (slice X _ (project (vars) (top (X+N cond) PATTERN ))) Although, I am not 100% sure about 3) and 4).
          Hide
          Andy Seaborne added a comment -

          (project) does change the number of rows and (slice) does not depend on the columns/variables, so the optimization which is the connection of slice and order by, is unaffected by pushing projection around. In ARQ project is a mask, it does not matter is it is inside or outside the slice but it's better to keep it in the right place.

          4 looks right.

          3 is harder because DISTINCT interacts with projection. The transformation above isn't right because distinct changes the number rows, so topN may not generate enough unique rows for the slice. It sin't possible statically, without reference to the data to know the right value for N in X+N. This one needs some further thinking.

          Show
          Andy Seaborne added a comment - (project) does change the number of rows and (slice) does not depend on the columns/variables, so the optimization which is the connection of slice and order by, is unaffected by pushing projection around. In ARQ project is a mask, it does not matter is it is inside or outside the slice but it's better to keep it in the right place. 4 looks right. 3 is harder because DISTINCT interacts with projection. The transformation above isn't right because distinct changes the number rows, so topN may not generate enough unique rows for the slice. It sin't possible statically, without reference to the data to know the right value for N in X+N. This one needs some further thinking.
          Hide
          Paolo Castagna added a comment -

          > 4 looks right.

          There must be something strange going on here. I added tests to TestOptimizer.java to ensure 4) is applied correctly. The tests pass, however there are failures elsewhere in the ARQ test suite (as Sara pointed out). I am going to share the patch.

          > 3 is harder because DISTINCT interacts with projection.

          Yep, I learned that by doing it... and I don't have a good idea at the moment.

          Show
          Paolo Castagna added a comment - > 4 looks right. There must be something strange going on here. I added tests to TestOptimizer.java to ensure 4) is applied correctly. The tests pass, however there are failures elsewhere in the ARQ test suite (as Sara pointed out). I am going to share the patch. > 3 is harder because DISTINCT interacts with projection. Yep, I learned that by doing it... and I don't have a good idea at the moment.
          Hide
          Paolo Castagna added a comment -

          I am sharing this patch (which is Sara's idea + a few tests) to let others see the failing tests:

          • ARQ - Scripts > Solution Sequence > Modifier OFFSET
          • TS_DAWG > Approved > SPARQL Query Evaluation tests > Solution Sequence > Offset 1
          • TS_DAWG > Approved > SPARQL Query Evaluation tests > Solution Sequence > Offset 2

          (Ignore the 4 failures in TestOptimizer.java for now).

          The three failures above are all very similar, for example:

          Failure: Test 389 :: Modifier OFFSET
          Query:
          PREFIX : <http://example.org/ns#>

          SELECT ?v
          WHERE

          { _:b0 :str ?v }

          ORDER BY ?v
          OFFSET 2

          Got: 0 --------------------------------


          v

          =====


          Expected: 3 -----------------------------
          ---------

          v

          =========

          "1"
          "AAA"
          "aaa"

          ---------

          I don't understand what's going wrong here. :-/

          Show
          Paolo Castagna added a comment - I am sharing this patch (which is Sara's idea + a few tests) to let others see the failing tests: ARQ - Scripts > Solution Sequence > Modifier OFFSET TS_DAWG > Approved > SPARQL Query Evaluation tests > Solution Sequence > Offset 1 TS_DAWG > Approved > SPARQL Query Evaluation tests > Solution Sequence > Offset 2 (Ignore the 4 failures in TestOptimizer.java for now). The three failures above are all very similar, for example: Failure: Test 389 :: Modifier OFFSET Query: PREFIX : < http://example.org/ns# > SELECT ?v WHERE { _:b0 :str ?v } ORDER BY ?v OFFSET 2 Got: 0 -------------------------------- v ===== Expected: 3 ----------------------------- --------- v ========= "1" "AAA" "aaa" --------- I don't understand what's going wrong here. :-/
          Hide
          Andy Seaborne added a comment -

          There is a bug in the transformation (it fails to check there is a LIMIT at all). fixed.

          Then adding code for case 4 (which is case 3 in the code) seems to work. Please review the current TransformTopN

          Test from the patch applied to the source code except slice_order_to_topn_0

          {5,6,7}

          a() which are false tests (they are for case 3 above which isn't a valid optimization).

          Note that the case of plain project over order is commented out in TransformTopN - if uncommented. I'm getting warnings about unclosed iterators - which suggests something is not passing down ,close() calls. Warnings are from "Solution Sequence" / "Limit 3" (recorded as JENA-114).

          TestOptimizer.slice_order_to_topn_01a() is comments out because of this.

          Show
          Andy Seaborne added a comment - There is a bug in the transformation (it fails to check there is a LIMIT at all). fixed. Then adding code for case 4 (which is case 3 in the code) seems to work. Please review the current TransformTopN Test from the patch applied to the source code except slice_order_to_topn_0 {5,6,7} a() which are false tests (they are for case 3 above which isn't a valid optimization). Note that the case of plain project over order is commented out in TransformTopN - if uncommented. I'm getting warnings about unclosed iterators - which suggests something is not passing down ,close() calls. Warnings are from "Solution Sequence" / "Limit 3" (recorded as JENA-114 ). TestOptimizer.slice_order_to_topn_01a() is comments out because of this.
          Hide
          Paolo Castagna added a comment -

          > There is a bug in the transformation (it fails to check there is a LIMIT at all). fixed.

          Thanks Andy, good catch.

          > TestOptimizer.slice_order_to_topn_01a() is comments out because of this.

          Put it back.

          Show
          Paolo Castagna added a comment - > There is a bug in the transformation (it fails to check there is a LIMIT at all). fixed. Thanks Andy, good catch. > TestOptimizer.slice_order_to_topn_01a() is comments out because of this. Put it back.
          Hide
          Paolo Castagna added a comment -

          Thank you Sara for the idea and the patch.
          Kudos to Andy for finding the bug in TransformTopN.

          Show
          Paolo Castagna added a comment - Thank you Sara for the idea and the patch. Kudos to Andy for finding the bug in TransformTopN.

            People

            • Assignee:
              Paolo Castagna
              Reporter:
              Sara Magliacane
            • Votes:
              2 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development