Uploaded image for project: 'Apache Jena'
  1. Apache Jena
  2. JENA-441

In some cases it may be useful to apply DISTINCT before applying ORDER BY

VotersWatch issueWatchersLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

    Details

      Description

      One of our internal users highlighted an interesting query where changing the plan makes a big difference in performance.

      The query is essentially the following:

      SELECT DISTINCT ?p
      WHERE
      {
      ?s ?p ?o
      } ORDER BY ?p

      Leaving the fact that it is a fundamentally dumb query to write the user had an interesting suggestion about the query plan, currently this generates the following:

      (distinct
      (project (?predicate)
      (order (?predicate)
      (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?s ?predicate ?o)))))

      For cases like this it may actually be much more performant to do the distinct first, because of the associated semantics of the various operators you can't just simply put distinct before the order but if you rewrite the query as follows:

      SELECT ?p
      WHERE
      {
      { SELECT DISTINCT ?p WHERE

      { ?s ?p ?o }

      }
      } ORDER BY ?p

      You get the likely much more performant plan:

      (project (?predicate)
      (order (?predicate)
      (distinct
      (project (?predicate)
      (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?/s ?predicate ?/o))))))

      Clearly this optimization does not apply in the general case, I think it only applies in the case where you have a DISTINCT and all ORDER BY conditions are simple variables and only those variables are projected in which case I think you could produce a plan like the following:

      (order (?predicate)
      (project (?predicate)
      (distinct
      (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?s ?predicate ?o)))))

      I am pretty certain this applies in only a few cases and I haven't reproduced this with TDB to see if the performance difference is noticeable yet.

      Andy - I will try and gather some more information and experiment with this so don't feel you have to look at this one for the time being.

        Attachments

        1. order-distinct.csv
          3 kB
          Rob Vesse
        2. order-distinct.txt
          16 kB
          Rob Vesse
        3. order-distinct.xml
          11 kB
          Rob Vesse
        4. order-distinct-opt.csv
          3 kB
          Rob Vesse
        5. order-distinct-opt.txt
          16 kB
          Rob Vesse
        6. order-distinct-opt.xml
          11 kB
          Rob Vesse
        7. order-distinct-opt-2.csv
          3 kB
          Rob Vesse
        8. order-distinct-opt-2.txt
          16 kB
          Rob Vesse
        9. order-distinct-opt-2.xml
          11 kB
          Rob Vesse

        Issue Links

          Activity

            People

            • Assignee:
              rvesse Rob Vesse
              Reporter:
              rvesse Rob Vesse

              Dates

              • Created:
                Updated:
                Resolved:

                Issue deployment