Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-5611

When documents are uniformly distributed over shards, enable returning approximated results in distributed query




      Query with rows=1000, which sent to a collection of 100 shards (shard key behaviour is default - based on hash of the unique key), will generate 100 requests of rows=1000, on each shard.
      This results to total number of rows*numShards unique keys to be retrieved. This behaviour is getting worst as numShards grows.

      If the documents are uniformly distributed over the shards, the expected number of document should be ~ rows/numShards. Obviously, there might be extreme cases, when all of the top X documents are in a specific shard.

      I suggest adding an optional parameter, say approxResults=true, which decides whether we should limit the rows in the shard requests to rows/numShardsor not. Moreover, we can add a numeric parameter which increases the limit, to be more accurate.
      For example, the query approxResults=true&approxResults.factor=1.5 will retrieve 1.5*rows/numShards from each shard. In the case of 100 shards and rows=1000, each shard will return 15 documents.

      Furthermore, this can reduce the problem of deep paging, because the same thing can be applied there. when requested start=100000, Solr creating shard request with start=0 and rows=START+ROWS. In the approximated approach, start parameter (in the shard requests) can be set to 100000/numShards. The idea of the approxResults.factor creates some difficulties here, though.


        1. lec5-distributedIndexing.pdf
          196 kB
          Manuel Lenormand



            • Assignee:
              isaachebsh Isaac Hebsh
            • Votes:
              2 Vote for this issue
              8 Start watching this issue


              • Created: