Solr / SOLR-1880

Performance: Distributed Search should skip GET_FIELDS stage if EXECUTE_QUERY stage gets all fields

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 4.8, 6.0
    • Component/s: search
    • Labels:
      None

      Description

      Right now, a typical distributed search using QueryComponent makes two HTTP requests to each shard:

      1. STAGE_EXECUTE_QUERY executes one HTTP request to each shard to get top N ids and sort keys, merges the results to produce a final list of document IDs (PURPOSE_GET_TOP_IDS).
      2. STAGE_GET_FIELDS executes a second HTTP request to each shard to get the document field values for the final list of document IDs (PURPOSE_GET_FIELDS).

      If the "fl" param is just "id" or just "id,score", all document data to return is already fetched by STAGE_EXECUTE_QUERY. The second STAGE_GET_FIELDS query is completely unnecessary. Eliminating that 2nd HTTP request can make a big difference in overall performance.

      Also, if the "fl" param only gets id, score and sort columns, it would probably be cheaper to fetch the final sort column data in STAGE_EXECUTE_QUERY, which has to read the sort column data anyway, and skip STAGE_GET_FIELDS.
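
      The check described above can be sketched in a few lines. This is only an illustration of the idea, not the attached patch: the class OnePassCheck, the method canSkipGetFields and its parameters are hypothetical names, and the sketch assumes "fl" is a plain comma- or space-separated list of field names with no globs, functions or aliases.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class OnePassCheck {

    /**
     * Returns true when the requested field list names the unique key and
     * nothing else except, optionally, "score". In that case everything the
     * client asked for is already known after STAGE_EXECUTE_QUERY.
     * Both parameters are hypothetical inputs for this sketch.
     */
    static boolean canSkipGetFields(String flParam, String uniqueKeyField) {
        if (flParam == null || flParam.trim().isEmpty()) {
            // No explicit fl: stored fields are wanted, so the second stage is needed.
            return false;
        }
        Set<String> fields = new LinkedHashSet<>();
        for (String f : flParam.split("[,\\s]+")) {
            if (!f.isEmpty()) {
                fields.add(f);
            }
        }
        if (!fields.contains(uniqueKeyField)) {
            // Only "id" and "id,score" are treated as single-pass here.
            return false;
        }
        fields.remove(uniqueKeyField);
        fields.remove("score");
        // Anything left over (e.g. "price" or "*") needs stored field values.
        return fields.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(canSkipGetFields("id", "id"));
        System.out.println(canSkipGetFields("id,score", "id"));
        System.out.println(canSkipGetFields("id,score,price", "id"));
    }
}
```

      The sketch is deliberately conservative: any field it does not recognise forces the second request, so a wrong guess costs performance rather than correctness.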

      1. ASF.LICENSE.NOT.GRANTED--one-pass-query.patch
        4 kB
        Shawn Smith
      2. ASF.LICENSE.NOT.GRANTED--one-pass-query-v1.4.0.patch
        4 kB
        Shawn Smith
      3. SOLR-1880.patch
        10 kB
        Shalin Shekhar Mangar
      4. SOLR-1880.patch
        10 kB
        Vitaliy Zhovtyuk

          Activity

          Shawn Smith added a comment -

          We use Solr mainly to fetch just document IDs, then look up those IDs in a database. So this would make a big difference for us.

          In particular, we have a few reports that fetch the IDs of the top ~50,000 documents (rows=50000). With so many IDs to return, the GET_TOP_IDS requests execute in a couple of hundred milliseconds but the GET_FIELDS requests take 5-10 seconds. So on those queries we'd get more than a 10x speedup by skipping the 2nd request.

          Shawn Smith added a comment -

          Attached a trunk patch that skips STAGE_GET_FIELDS if the "fl" param is just "id" or "id,score".

          Shawn Smith added a comment -

          Attached a version of the patch that can be applied to the v1.4.0 source. The trunk patch above assumes a couple of fixes made since v1.4.0.

          Erick Erickson added a comment -

          2013 Old JIRA cleanup

          Vitaliy Zhovtyuk added a comment -

          Updated to latest trunk.
          Added a functional distributed test, org.apache.solr.handler.component.DistributedQueryComponentOptimizationTest, for the single-pass case.
          Added a trace that returns the error reason in org.apache.solr.client.solrj.impl.HttpSolrServer; otherwise runtime errors are hard to detect.

          Shalin Shekhar Mangar added a comment -

          Also, the "fl" param only gets id, score and sort columns, it would probably be cheaper to fetch the final sort column data in STAGE_EXECUTE_QUERY which has to read the sort column data anyway, and skip STAGE_GET_FIELDS.

          Thanks Vitaliy. When fl=id,score,sortField then the STAGE_GET_FIELDS is still executed, right? In other words, the only case which is optimized is when fl=id,score. That alone is also a nice improvement but since the issue description as well as your test has the above comment, I thought I should ask.

          Vitaliy Zhovtyuk added a comment -

          Yes, this optimization applies only when fl=id,score.

          Shalin Shekhar Mangar added a comment -

          There was some code duplication in QueryComponent.returnFields:

          for (ShardResponse srsp : sreq.responses) {
            SolrDocumentList docs = (SolrDocumentList) srsp.getSolrResponse().getResponse().get("response");

            for (SolrDocument doc : docs) {
              Object id = doc.getFieldValue(keyFieldName);
              ShardDoc sdoc = rb.resultIds.get(id.toString());
              if (sdoc != null) {
                if (returnScores && sdoc.score != null) {
                  doc.setField("score", sdoc.score);
                }
                rb._responseDocs.set(sdoc.positionInResponse, doc);
              }
              if (sdoc != null) {
                if (returnScores && sdoc.score != null) {
                  doc.setField("score", sdoc.score);
                }
                if (removeKeyField) {
                  doc.removeFields(keyFieldName);
                }
                rb._responseDocs.set(sdoc.positionInResponse, doc);
              }
            }
          }
          

          I changed that to:

          for (ShardResponse srsp : sreq.responses) {
            SolrDocumentList docs = (SolrDocumentList) srsp.getSolrResponse().getResponse().get("response");

            for (SolrDocument doc : docs) {
              // Match each returned document to its slot in the merged result list.
              Object id = doc.getFieldValue(keyFieldName);
              ShardDoc sdoc = rb.resultIds.get(id.toString());
              if (sdoc != null) {
                // Copy the score computed in the first pass onto the document.
                if (returnScores && sdoc.score != null) {
                  doc.setField("score", sdoc.score);
                }
                // Drop the key field if the client did not ask for it.
                if (removeKeyField) {
                  doc.removeFields(keyFieldName);
                }
                rb._responseDocs.set(sdoc.positionInResponse, doc);
              }
            }
          }
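
          For readers without the Solr source at hand, the merge step above can be tried out in isolation. The following is a rough standalone sketch, not Solr code: MergeSketch, Slot, merge and demo are hypothetical names, plain java.util maps stand in for SolrDocument, and Slot stands in for ShardDoc.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeSketch {

    /** Hypothetical stand-in for ShardDoc: merged position plus optional score. */
    static final class Slot {
        final int positionInResponse;
        final Float score;

        Slot(int positionInResponse, Float score) {
            this.positionInResponse = positionInResponse;
            this.score = score;
        }
    }

    /** Places each shard document at the position chosen during the first pass. */
    static List<Map<String, Object>> merge(List<List<Map<String, Object>>> shardResponses,
                                           Map<String, Slot> resultIds,
                                           String keyFieldName,
                                           boolean returnScores,
                                           boolean removeKeyField) {
        List<Map<String, Object>> merged = new ArrayList<>();
        for (int i = 0; i < resultIds.size(); i++) {
            merged.add(null); // pre-size so set(position, doc) is valid
        }
        for (List<Map<String, Object>> docs : shardResponses) {
            for (Map<String, Object> doc : docs) {
                Object id = doc.get(keyFieldName);
                Slot slot = (id == null) ? null : resultIds.get(id.toString());
                if (slot == null) {
                    continue; // document did not make the merged top-N
                }
                if (returnScores && slot.score != null) {
                    doc.put("score", slot.score);
                }
                if (removeKeyField) {
                    doc.remove(keyFieldName);
                }
                merged.set(slot.positionInResponse, doc);
            }
        }
        return merged;
    }

    /** Two single-document shards whose results are interleaved by score. */
    static List<Map<String, Object>> demo() {
        Map<String, Slot> resultIds = new HashMap<>();
        resultIds.put("a", new Slot(1, 0.5f));
        resultIds.put("b", new Slot(0, 0.9f));

        Map<String, Object> docA = new HashMap<>();
        docA.put("id", "a");
        Map<String, Object> docB = new HashMap<>();
        docB.put("id", "b");

        List<List<Map<String, Object>>> shards =
                Arrays.asList(Arrays.asList(docA), Arrays.asList(docB));
        return merge(shards, resultIds, "id", true, false);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```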
          

          I also removed the comment about fl=id,score,sortField in the DistributedQueryComponentOptimizationTest

          This is ready to go.

          ASF subversion and git services added a comment -

          Commit 1571152 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1571152 ]

          SOLR-1880: Distributed Search skips GET_FIELDS stage if EXECUTE_QUERY stage gets all fields. Requests with fl=id or fl=id,score are now single-pass.

          ASF subversion and git services added a comment -

          Commit 1571153 from shalin@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1571153 ]

          SOLR-1880: Distributed Search skips GET_FIELDS stage if EXECUTE_QUERY stage gets all fields. Requests with fl=id or fl=id,score are now single-pass.

          Shalin Shekhar Mangar added a comment -

          Thanks Shawn and Vitaliy!

          I opened SOLR-5768 for another related improvement suggested by Yonik on the solr-user list.

          Uwe Schindler added a comment -

          Close issue after release of 4.8.0


            People

            • Assignee: Shalin Shekhar Mangar
            • Reporter: Shawn Smith
            • Votes: 0
            • Watchers: 10
