Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-5611

When documents are uniformly distributed over shards, enable returning approximated results in distributed query

Attach filesAttach ScreenshotAdd voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      Query with rows=1000, which sent to a collection of 100 shards (shard key behaviour is default - based on hash of the unique key), will generate 100 requests of rows=1000, on each shard.
      This results to total number of rows*numShards unique keys to be retrieved. This behaviour is getting worst as numShards grows.

      If the documents are uniformly distributed over the shards, the expected number of document should be ~ rows/numShards. Obviously, there might be extreme cases, when all of the top X documents are in a specific shard.

      I suggest adding an optional parameter, say approxResults=true, which decides whether we should limit the rows in the shard requests to rows/numShardsor not. Moreover, we can add a numeric parameter which increases the limit, to be more accurate.
      For example, the query approxResults=true&approxResults.factor=1.5 will retrieve 1.5*rows/numShards from each shard. In the case of 100 shards and rows=1000, each shard will return 15 documents.

      Furthermore, this can reduce the problem of deep paging, because the same thing can be applied there. when requested start=100000, Solr creating shard request with start=0 and rows=START+ROWS. In the approximated approach, start parameter (in the shard requests) can be set to 100000/numShards. The idea of the approxResults.factor creates some difficulties here, though.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            isaachebsh Isaac Hebsh

            Dates

              Created:
              Updated:

              Slack

                Issue deployment