Let's take, for example, this SPARQL query:

SELECT DISTINCT *
WHERE { ?s ?p ?o }
ORDER BY ?p
LIMIT 10

The corresponding algebra expression is:

(slice _ 10
  (distinct
    (order (?p)
      (bgp (triple ?s ?p ?o)))))

Which is equivalent to:

(slice _ 10
  (reduced
    (order (?p)
      (bgp (triple ?s ?p ?o)))))

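However, the distinct or reduced operator sits between slice and order, which prevents the optimization described in JENA-89: keeping only the first 10 ordered bindings is not enough if some of them are duplicates. Maybe we can modify the 'top' operator to yield only distinct bindings, or add a new 'top_distinct' operator for that: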

(top_distinct (10 ?p ?s)
  (bgp (triple ?s ?p ?o)))

SPARQL queries of the type SELECT DISTINCT ... WHERE {...} ORDER BY ... LIMIT 10 are common when people want to display the 10 most 'something' things in their dataset.

The implementation of a QueryIterTopNDistinct is almost the same as QueryIterTopN (see: JENA-89) but we add bindings to the PriorityQueue if and only if they are not already there (using .contains() to check).
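To make the idea concrete, here is a minimal sketch of the bounded-heap approach. It is not the actual QueryIterTopN or QueryIterTopNDistinct code: the class TopNDistinctSketch and the method topNDistinct are hypothetical names, and a generic type T stands in for a binding. It assumes equals() on T defines what "distinct" means and that the comparator encodes the ORDER BY.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not Jena code: T stands in for a query binding.
class TopNDistinctSketch {

    /** Returns up to n distinct items, the smallest according to cmp, in ascending order. */
    static <T> List<T> topNDistinct(Iterable<T> input, int n, Comparator<T> cmp) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Reversed comparator: the head of the queue is the worst candidate kept so far.
        PriorityQueue<T> heap = new PriorityQueue<>(n, cmp.reversed());
        for (T item : input) {
            // Linear scan of at most n elements, based on equals().
            if (heap.contains(item)) {
                continue;
            }
            if (heap.size() < n) {
                heap.add(item);
            } else if (cmp.compare(item, heap.peek()) < 0) {
                // The new item beats the current worst candidate: swap them.
                heap.poll();
                heap.add(item);
            }
        }
        List<T> result = new ArrayList<>(heap);
        result.sort(cmp);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(3, 1, 3, 2, 1, 5, 4);
        // Prints [1, 2, 3]
        System.out.println(topNDistinct(input, 3, Comparator.<Integer>naturalOrder()));
    }
}
```

Checking only the queue is enough in this sketch: an item that was evicted or rejected earlier can never beat the current head later, because the head only gets better over time. The cost is that .contains() scans the queue linearly, which should be acceptable for small limits such as 10.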

Is it worth adding a top_distinct operator, or does it just pollute the algebra?

Thanks Stephen and Andy for your comments and guidance.

I now want to think (again!) about how the optimizations in JENA-89 and JENA-90 interact with each other when we have a DISTINCT + ORDER BY + LIMIT query (i.e. people want to find the 10 most 'something' things in their data).