Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-8858

SolrIndexSearcher#doc() Completely Ignores Field Filters Unless Lazy Field Loading is Enabled

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.6, 4.10, 5.5
    • Fix Version/s: 6.2, 7.0
    • Component/s: None
    • Labels:

      Description

      If enableLazyFieldLoading=false, a perfectly valid fields filter will be ignored, and we'll create a DocumentStoredFieldVisitor without it.

      1. SOLR-8858.patch
        5 kB
        Shalin Shekhar Mangar

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user maedhroz opened a pull request:

          https://github.com/apache/lucene-solr/pull/21

          SOLR-8858 SolrIndexSearcher#doc() Completely Ignores Field Filters Unless Lazy Field Loading is Enabled

          Instead of just discarding fields if lazy loading is not enabled, SolrIndexSearcher now passes them through to IndexReader. This means IndexReader creates a DocumentStoredFieldVisitor that we can use to later determine which fields need to be read.

          https://issues.apache.org/jira/browse/SOLR-8858

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/maedhroz/lucene-solr SOLR-8858

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/lucene-solr/pull/21.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #21


          commit fa8075c7861dbc331588dfb5c9e28576e2eb31f2
          Author: Caleb Rackliffe <caleb.rackliffe@gmail.com>
          Date: 2016-03-16T18:15:20Z

          SOLR-8858 SolrIndexSearcher#doc() Completely Ignores Field Filters Unless Lazy Field Loading is Enabled

          Instead of just discarding fields if lazy loading is not enabled, SolrIndexSearcher now passes them through to IndexReader. This means IndexReader creates a DocumentStoredFieldVisitor that we can use to later determine which fields need to be read.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user maedhroz opened a pull request: https://github.com/apache/lucene-solr/pull/21 SOLR-8858 SolrIndexSearcher#doc() Completely Ignores Field Filters Unless Lazy Field Loading is Enabled Instead of just discarding fields if lazy loading is not enabled, SolrIndexSearcher now passes them through to IndexReader. This means IndexReader creates a DocumentStoredFieldVisitor that we can use to later determine which fields need to be read. https://issues.apache.org/jira/browse/SOLR-8858 You can merge this pull request into a Git repository by running: $ git pull https://github.com/maedhroz/lucene-solr SOLR-8858 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/lucene-solr/pull/21.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21 commit fa8075c7861dbc331588dfb5c9e28576e2eb31f2 Author: Caleb Rackliffe <caleb.rackliffe@gmail.com> Date: 2016-03-16T18:15:20Z SOLR-8858 SolrIndexSearcher#doc() Completely Ignores Field Filters Unless Lazy Field Loading is Enabled Instead of just discarding fields if lazy loading is not enabled, SolrIndexSearcher now passes them through to IndexReader. This means IndexReader creates a DocumentStoredFieldVisitor that we can use to later determine which fields need to be read.
          Hide
          maedhroz Caleb Rackliffe added a comment -

          I've posted a PR that fixes this in what I'm hoping is a reasonable way. I imagine the impact will mostly fall on custom StoredFieldsReader implementations.

          Show
          maedhroz Caleb Rackliffe added a comment - I've posted a PR that fixes this in what I'm hoping is a reasonable way. I imagine the impact will mostly fall on custom StoredFieldsReader implementations.
          Hide
          stephlag Stephan Lagraulet added a comment -

          Can you remove this issue from version 5.5.1 as this version is packaged without a fix for this issue?

          Show
          stephlag Stephan Lagraulet added a comment - Can you remove this issue from version 5.5.1 as this version is packaged without a fix for this issue?
          Hide
          mkhludnev Mikhail Khludnev added a comment -

          I skim through the patch, didn't find a test. It's hard to understand what does it changes .

          Show
          mkhludnev Mikhail Khludnev added a comment - I skim through the patch, didn't find a test. It's hard to understand what does it changes .
          Hide
          mkhludnev Mikhail Khludnev added a comment -

          and also, it seems like a feature to me. In my understanding the difference between two and only one field is neglectable. and it's enabled only when you really know why do you need it.

          Show
          mkhludnev Mikhail Khludnev added a comment - and also, it seems like a feature to me. In my understanding the difference between two and only one field is neglectable. and it's enabled only when you really know why do you need it.
          Hide
          maedhroz Caleb Rackliffe added a comment -

          We've subclassed StoredFieldsReader, and we rely on the StoredFieldVisitor passed to visitDocument() consistently indicating which fields have been requested, even if lazy field loading is disabled. The change to SolrIndexSearcher just makes sure that field information, if it's available, always makes it into the StoredFieldVisitor.

          Show
          maedhroz Caleb Rackliffe added a comment - We've subclassed StoredFieldsReader , and we rely on the StoredFieldVisitor passed to visitDocument() consistently indicating which fields have been requested, even if lazy field loading is disabled. The change to SolrIndexSearcher just makes sure that field information, if it's available, always makes it into the StoredFieldVisitor .
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          There are many tests that fail after applying this patch on master. I don't have the time to dig in to this right now. Caleb, if you can fix the failures, I'd be happy to commit this patch.

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - There are many tests that fail after applying this patch on master. I don't have the time to dig in to this right now. Caleb, if you can fix the failures, I'd be happy to commit this patch.
          Hide
          maedhroz Caleb Rackliffe added a comment -

          After applying myself on master, the doc method looks like this:

            @Override
            public Document doc(int i, Set<String> fields) throws IOException {
              Document d;
              if (documentCache != null) {
                d = documentCache.get(i);
                if (d != null) return d;
              }
          
              final DirectoryReader reader = getIndexReader();
              if (fields != null) {
                if (enableLazyFieldLoading) {
                  final SetNonLazyFieldSelector visitor = new SetNonLazyFieldSelector(fields, reader, i);
                  reader.document(i, visitor);
                  d = visitor.doc;
                } else {
                  d = reader.document(i, fields);
                }
              } else {
                d = reader.document(i);
              }
          
              if (documentCache != null) {
                documentCache.put(i, d);
              }
          
              return d;
            }
          

          Running the tests (i.e. ant test -Dtests.slow=false), I get:

          [junit4] Tests with failures [seed: BB0B954A8C44DF29]:
             [junit4]   - org.apache.solr.response.transform.TestSubQueryTransformer.testTwoSubQueriesAndByNumberWithTwoFields
             [junit4]   - org.apache.solr.response.transform.TestSubQueryTransformer.testJustJohnJavabin
             [junit4]   - org.apache.solr.response.transform.TestSubQueryTransformer.testJustJohnJson
             [junit4]   - org.apache.solr.response.transform.TestSubQueryTransformer.testJohnOrNancySingleField
             [junit4]   - org.apache.solr.response.transform.TestSubQueryTransformer.testThreeLevel
             [junit4]   - org.apache.solr.cloud.DistribJoinFromCollectionTest.testScore
             [junit4]   - org.apache.solr.cloud.DistribJoinFromCollectionTest.testNoScore
             [junit4]   - org.apache.solr.cloud.TestCloudDeleteByQuery (suite)
             [junit4]
             [junit4]
             [junit4] JVM J0:     0.58 ..   415.88 =   415.29s
             [junit4] JVM J1:     0.58 ..   415.81 =   415.23s
             [junit4] JVM J2:     0.58 ..   415.88 =   415.30s
             [junit4] JVM J3:     0.58 ..   415.72 =   415.13s
             [junit4] Execution time total: 6 minutes 56 seconds
             [junit4] Tests summary: 616 suites (10 ignored), 2584 tests, 1 suite-level error, 4 errors, 3 failures, 279 ignored (258 assumptions)
          

          I'm going to dig into these a bit and see if using the fields set broke some assumptions somewhere...

          Show
          maedhroz Caleb Rackliffe added a comment - After applying myself on master, the doc method looks like this: @Override public Document doc(int i, Set<String> fields) throws IOException { Document d; if (documentCache != null) { d = documentCache.get(i); if (d != null) return d; } final DirectoryReader reader = getIndexReader(); if (fields != null) { if (enableLazyFieldLoading) { final SetNonLazyFieldSelector visitor = new SetNonLazyFieldSelector(fields, reader, i); reader.document(i, visitor); d = visitor.doc; } else { d = reader.document(i, fields); } } else { d = reader.document(i); } if (documentCache != null) { documentCache.put(i, d); } return d; } Running the tests (i.e. ant test -Dtests.slow=false ), I get: [junit4] Tests with failures [seed: BB0B954A8C44DF29]: [junit4] - org.apache.solr.response.transform.TestSubQueryTransformer.testTwoSubQueriesAndByNumberWithTwoFields [junit4] - org.apache.solr.response.transform.TestSubQueryTransformer.testJustJohnJavabin [junit4] - org.apache.solr.response.transform.TestSubQueryTransformer.testJustJohnJson [junit4] - org.apache.solr.response.transform.TestSubQueryTransformer.testJohnOrNancySingleField [junit4] - org.apache.solr.response.transform.TestSubQueryTransformer.testThreeLevel [junit4] - org.apache.solr.cloud.DistribJoinFromCollectionTest.testScore [junit4] - org.apache.solr.cloud.DistribJoinFromCollectionTest.testNoScore [junit4] - org.apache.solr.cloud.TestCloudDeleteByQuery (suite) [junit4] [junit4] [junit4] JVM J0: 0.58 .. 415.88 = 415.29s [junit4] JVM J1: 0.58 .. 415.81 = 415.23s [junit4] JVM J2: 0.58 .. 415.88 = 415.30s [junit4] JVM J3: 0.58 .. 415.72 = 415.13s [junit4] Execution time total: 6 minutes 56 seconds [junit4] Tests summary: 616 suites (10 ignored), 2584 tests, 1 suite-level error, 4 errors, 3 failures, 279 ignored (258 assumptions) I'm going to dig into these a bit and see if using the fields set broke some assumptions somewhere...
          Hide
          maedhroz Caleb Rackliffe added a comment -

          Shalin Shekhar Mangar The failures I've seen thus far all seem to be around how DocumentStoredFieldVisitor, SolrIndexSearcher, and SubQueryAugmenter work together. I'm only able to get the tests in TestSubQueryTransformer to pass if I prepend the fl parameters with *. If there is an actual field list, for instance...

          fl=name_s_dv,depts:[subquery]
          

          ...a DocumentStoredFieldVisitor that will only retrieve name_s_dv is created, and the result documents seem to be missing other necessary fields.

          Show
          maedhroz Caleb Rackliffe added a comment - Shalin Shekhar Mangar The failures I've seen thus far all seem to be around how DocumentStoredFieldVisitor , SolrIndexSearcher , and SubQueryAugmenter work together. I'm only able to get the tests in TestSubQueryTransformer to pass if I prepend the fl parameters with * . If there is an actual field list, for instance... fl=name_s_dv,depts:[subquery] ...a DocumentStoredFieldVisitor that will only retrieve name_s_dv is created, and the result documents seem to be missing other necessary fields.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user maedhroz closed the pull request at:

          https://github.com/apache/lucene-solr/pull/21

          Show
          githubbot ASF GitHub Bot added a comment - Github user maedhroz closed the pull request at: https://github.com/apache/lucene-solr/pull/21
          Hide
          maedhroz Caleb Rackliffe added a comment -

          I've been able to resolve all the failures in TestSubQueryTransformer by simply making sure that all the fields referenced with the rows.x syntax in the sub-queries are present in the fl list. I think that's actually a reasonable resolution.

          TestCloudDeleteByQuery and DistribJoinFromCollectionTest have been more elusive, but they smell like they're both being caused by shard queries that need to retrieve fields from the original fl, but can't with my patch in place, because only id and score are passed explicitly to SolrIndexSearcher (whereas they would have retrieved all fields before, given lazy loading isn't enabled).

          Show
          maedhroz Caleb Rackliffe added a comment - I've been able to resolve all the failures in TestSubQueryTransformer by simply making sure that all the fields referenced with the rows.x syntax in the sub-queries are present in the fl list. I think that's actually a reasonable resolution. TestCloudDeleteByQuery and DistribJoinFromCollectionTest have been more elusive, but they smell like they're both being caused by shard queries that need to retrieve fields from the original fl , but can't with my patch in place, because only id and score are passed explicitly to SolrIndexSearcher (whereas they would have retrieved all fields before, given lazy loading isn't enabled).
          Hide
          maedhroz Caleb Rackliffe added a comment -
          Show
          maedhroz Caleb Rackliffe added a comment - Work in progress is at https://github.com/maedhroz/lucene-solr/tree/SOLR-8858-trunk
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user maedhroz opened a pull request:

          https://github.com/apache/lucene-solr/pull/47

          SOLR-8858 SolrIndexSearcher#doc() Completely Ignores Field Filters Unless Lazy Field Loading is Enabled

          https://issues.apache.org/jira/browse/SOLR-8858

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/maedhroz/lucene-solr SOLR-8858-trunk

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/lucene-solr/pull/47.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #47



          Show
          githubbot ASF GitHub Bot added a comment - GitHub user maedhroz opened a pull request: https://github.com/apache/lucene-solr/pull/47 SOLR-8858 SolrIndexSearcher#doc() Completely Ignores Field Filters Unless Lazy Field Loading is Enabled https://issues.apache.org/jira/browse/SOLR-8858 You can merge this pull request into a Git repository by running: $ git pull https://github.com/maedhroz/lucene-solr SOLR-8858 -trunk Alternatively you can review and apply these changes as the patch at: https://github.com/apache/lucene-solr/pull/47.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #47
          Hide
          maedhroz Caleb Rackliffe added a comment -

          I've posted a new PR against master with some comments. I'm seeing zero failures in an ant test -Dtests.slow=false run, so I think things are in a reviewable state.

          Show
          maedhroz Caleb Rackliffe added a comment - I've posted a new PR against master with some comments. I'm seeing zero failures in an ant test -Dtests.slow=false run, so I think things are in a reviewable state.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user shalinmangar commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/47#discussion_r69263675

          — Diff: solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java —
          @@ -910,8 +910,12 @@ protected void createMainQuery(ResponseBuilder rb)

          { additionalAdded = addFL(additionalFL, "score", additionalAdded); }

          } else {

          • // reset so that only unique key is requested in shard requests
          • sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName());
            + if (rb.req.getSearcher().enableLazyFieldLoading) { + // reset so that only unique key is requested in shard requests + sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + }

            else {
            + sreq.params.set(CommonParams.FL, "*");

              • End diff –

          I don't understand this change. Why should QueryComponent know about whether lazy loading is enabled or not?

          Show
          githubbot ASF GitHub Bot added a comment - Github user shalinmangar commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/47#discussion_r69263675 — Diff: solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java — @@ -910,8 +910,12 @@ protected void createMainQuery(ResponseBuilder rb) { additionalAdded = addFL(additionalFL, "score", additionalAdded); } } else { // reset so that only unique key is requested in shard requests sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + if (rb.req.getSearcher().enableLazyFieldLoading) { + // reset so that only unique key is requested in shard requests + sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + } else { + sreq.params.set(CommonParams.FL, "*"); End diff – I don't understand this change. Why should QueryComponent know about whether lazy loading is enabled or not?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user shalinmangar commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/47#discussion_r69316868

          — Diff: solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java —
          @@ -910,8 +910,12 @@ protected void createMainQuery(ResponseBuilder rb)

          { additionalAdded = addFL(additionalFL, "score", additionalAdded); }

          } else {

          • // reset so that only unique key is requested in shard requests
          • sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName());
            + if (rb.req.getSearcher().enableLazyFieldLoading) { + // reset so that only unique key is requested in shard requests + sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + }

            else {
            + sreq.params.set(CommonParams.FL, "*");

              • End diff –

          I can see why this change was made but my point is that there is probably an assumption in the join queries or in the test (I haven't looked) which make it fail without this change.

          Show
          githubbot ASF GitHub Bot added a comment - Github user shalinmangar commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/47#discussion_r69316868 — Diff: solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java — @@ -910,8 +910,12 @@ protected void createMainQuery(ResponseBuilder rb) { additionalAdded = addFL(additionalFL, "score", additionalAdded); } } else { // reset so that only unique key is requested in shard requests sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + if (rb.req.getSearcher().enableLazyFieldLoading) { + // reset so that only unique key is requested in shard requests + sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + } else { + sreq.params.set(CommonParams.FL, "*"); End diff – I can see why this change was made but my point is that there is probably an assumption in the join queries or in the test (I haven't looked) which make it fail without this change.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user maedhroz commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/47#discussion_r69363380

          — Diff: solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java —
          @@ -910,8 +910,12 @@ protected void createMainQuery(ResponseBuilder rb)

          { additionalAdded = addFL(additionalFL, "score", additionalAdded); }

          } else {

          • // reset so that only unique key is requested in shard requests
          • sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName());
            + if (rb.req.getSearcher().enableLazyFieldLoading) { + // reset so that only unique key is requested in shard requests + sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + }

            else {
            + sreq.params.set(CommonParams.FL, "*");

              • End diff –

          I agree that what I have here is unsatisfying. I've been debugging through the fields retrieval phase of the join in `DistribJoinFromCollectionTest`, so hopefully that turns up something...

          Show
          githubbot ASF GitHub Bot added a comment - Github user maedhroz commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/47#discussion_r69363380 — Diff: solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java — @@ -910,8 +910,12 @@ protected void createMainQuery(ResponseBuilder rb) { additionalAdded = addFL(additionalFL, "score", additionalAdded); } } else { // reset so that only unique key is requested in shard requests sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + if (rb.req.getSearcher().enableLazyFieldLoading) { + // reset so that only unique key is requested in shard requests + sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + } else { + sreq.params.set(CommonParams.FL, "*"); End diff – I agree that what I have here is unsatisfying. I've been debugging through the fields retrieval phase of the join in `DistribJoinFromCollectionTest`, so hopefully that turns up something...
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user maedhroz commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/47#discussion_r69373171

          — Diff: solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java —
          @@ -910,8 +910,12 @@ protected void createMainQuery(ResponseBuilder rb)

          { additionalAdded = addFL(additionalFL, "score", additionalAdded); }

          } else {

          • // reset so that only unique key is requested in shard requests
          • sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName());
            + if (rb.req.getSearcher().enableLazyFieldLoading) { + // reset so that only unique key is requested in shard requests + sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + }

            else {
            + sreq.params.set(CommonParams.FL, "*");

              • End diff –

          In the current master (without my patch), the query stage shard request for join in `DistribJoinFromCollectionTest` will pull the document from `SolrIndexSearcher#doc()' with only `id` in the specified `fields`. It does not use lazy field loading, and so uses a `DocumentStoredFieldVisitor` with no `fields` specified to load the whole document, and then put it in the `documentCache`. If we used lazy field loading, the cached document would still have some representation of all the fields, albeit lazy ones.

          With only the `SolrIndexSearcher` piece of my patch in place, the `TestSubQueryTransformer` failures are easy to avoidl, and I was able to fix them by simply reading the JavaDoc. (See the [comment](https://github.com/apache/lucene-solr/pull/47/files/4f9e67c63ce5130795df647ef5e86ae970601cb6#r69015716) below.) `DistribJoinFromCollectionTest` (and `TestCloudDeleteByQuery`) fails, because when, as I've laid out above, `doc()` actually respects the `fields` list during the main query phase, it caches a document that only contains those fields. When the actual field retrieval stage of the query hits the shard, `doc()` spits out a document that doesn't have the all fields in `fl`. (I'm not sure `DistribJoinFromCollectionTest` or `TestCloudDeleteByQuery` are doing something wrong, and they actually pass if they enable lazy field loading.)

          The reason I raised this issue in the first place is that I have a custom `StoredFieldsVisitor` that relies on `DocumentStoredFieldVisitor` providing the fields requested by the query. The unfortunate thing is that I think the `QueryComponent` bit of this PR isn't actually compatible with that, and I think that will need to be reverted no matter what. The only other ways I can imagine fixing this are:

          a.) Always cache an entire document, regardless of what we return from `doc()`. (Seems like it adds overhead.)
          b.) Skip caching under certain conditions, like if the `fields` list only contains the unique key (or key and score). (Seems very reliant on `QueryComponent` still.)
          c.) Always use lazy loading. (Seems invasive, but most of the examples I see use it anyway.)

          I don't love any of these options, but I'd be interested to get more informed opinions.

          Show
          githubbot ASF GitHub Bot added a comment - Github user maedhroz commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/47#discussion_r69373171 — Diff: solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java — @@ -910,8 +910,12 @@ protected void createMainQuery(ResponseBuilder rb) { additionalAdded = addFL(additionalFL, "score", additionalAdded); } } else { // reset so that only unique key is requested in shard requests sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + if (rb.req.getSearcher().enableLazyFieldLoading) { + // reset so that only unique key is requested in shard requests + sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + } else { + sreq.params.set(CommonParams.FL, "*"); End diff – In the current master (without my patch), the query stage shard request for join in `DistribJoinFromCollectionTest` will pull the document from `SolrIndexSearcher#doc()' with only `id` in the specified `fields`. It does not use lazy field loading, and so uses a `DocumentStoredFieldVisitor` with no `fields` specified to load the whole document, and then put it in the `documentCache`. If we used lazy field loading, the cached document would still have some representation of all the fields, albeit lazy ones. With only the `SolrIndexSearcher` piece of my patch in place, the `TestSubQueryTransformer` failures are easy to avoidl, and I was able to fix them by simply reading the JavaDoc. (See the [comment] ( https://github.com/apache/lucene-solr/pull/47/files/4f9e67c63ce5130795df647ef5e86ae970601cb6#r69015716 ) below.) `DistribJoinFromCollectionTest` (and `TestCloudDeleteByQuery`) fails, because when, as I've laid out above, `doc()` actually respects the `fields` list during the main query phase, it caches a document that only contains those fields . When the actual field retrieval stage of the query hits the shard, `doc()` spits out a document that doesn't have the all fields in `fl`. (I'm not sure `DistribJoinFromCollectionTest` or `TestCloudDeleteByQuery` are doing something wrong, and they actually pass if they enable lazy field loading.) The reason I raised this issue in the first place is that I have a custom `StoredFieldsVisitor` that relies on `DocumentStoredFieldVisitor` providing the fields requested by the query. The unfortunate thing is that I think the `QueryComponent` bit of this PR isn't actually compatible with that, and I think that will need to be reverted no matter what. The only other ways I can imagine fixing this are: a.) Always cache an entire document, regardless of what we return from `doc()`. (Seems like it adds overhead.) b.) Skip caching under certain conditions, like if the `fields` list only contains the unique key (or key and score). (Seems very reliant on `QueryComponent` still.) c.) Always use lazy loading. (Seems invasive, but most of the examples I see use it anyway.) I don't love any of these options, but I'd be interested to get more informed opinions.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user shalinmangar commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/47#discussion_r69419263

          — Diff: solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java —
          @@ -910,8 +910,12 @@ protected void createMainQuery(ResponseBuilder rb)

          { additionalAdded = addFL(additionalFL, "score", additionalAdded); }

          } else {

          • // reset so that only unique key is requested in shard requests
          • sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName());
            + if (rb.req.getSearcher().enableLazyFieldLoading) { + // reset so that only unique key is requested in shard requests + sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + }

            else {
            + sreq.params.set(CommonParams.FL, "*");

              • End diff –

          Thanks for investigating and the detailed note.

          I don't think DistribJoinFromCollectionTest or TestCloudDeleteByQuery are doing anything wrong. The difference, as you said, is that your patch changes `doc()` to actually respect the fields list and when lazy loading is disabled, proceeds to cache an incomplete document returned by Lucene. So this patch changes Solr from always caching either complete documents (when lazy loading is disabled) or lazy documents (when lazy loading is enabled) to caching potentially incomplete documents which have no idea about the other (non-requested) fields. So if we want to go back to the old behavior then option 'a' is the way to go when lazy loading is disabled.

          Show
          githubbot ASF GitHub Bot added a comment - Github user shalinmangar commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/47#discussion_r69419263 — Diff: solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java — @@ -910,8 +910,12 @@ protected void createMainQuery(ResponseBuilder rb) { additionalAdded = addFL(additionalFL, "score", additionalAdded); } } else { // reset so that only unique key is requested in shard requests sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + if (rb.req.getSearcher().enableLazyFieldLoading) { + // reset so that only unique key is requested in shard requests + sreq.params.set(CommonParams.FL, rb.req.getSchema().getUniqueKeyField().getName()); + } else { + sreq.params.set(CommonParams.FL, "*"); End diff – Thanks for investigating and the detailed note. I don't think DistribJoinFromCollectionTest or TestCloudDeleteByQuery are doing anything wrong. The difference, as you said, is that your patch changes `doc()` to actually respect the fields list and when lazy loading is disabled, proceeds to cache an incomplete document returned by Lucene. So this patch changes Solr from always caching either complete documents (when lazy loading is disabled) or lazy documents (when lazy loading is enabled) to caching potentially incomplete documents which have no idea about the other (non-requested) fields. So if we want to go back to the old behavior then option 'a' is the way to go when lazy loading is disabled.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user maedhroz commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/47#discussion_r69510618

          — Diff: solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java —
          @@ -766,16 +766,29 @@ public Document doc(int i, Set<String> fields) throws IOException {
          }

          final DirectoryReader reader = getIndexReader();

          • if (!enableLazyFieldLoading || fields == null) {
          • d = reader.document;
            + boolean containsAllFields = true;
            +
            + if (fields != null)
            Unknown macro: { + if (enableLazyFieldLoading) { + final SetNonLazyFieldSelector visitor = new SetNonLazyFieldSelector(fields, reader, i); + reader.document(i, visitor); + d = visitor.doc; + } else { + d = reader.document(i, fields); + containsAllFields = false; + } }

            else

            { - final SetNonLazyFieldSelector visitor = new SetNonLazyFieldSelector(fields, reader, i); - reader.document(i, visitor); - d = visitor.doc; + d = reader.document(i); }

          if (documentCache != null) {

          • documentCache.put(i, d);
            + // Only cache the already retrieved document if it is complete...
            + if (containsAllFields) { + documentCache.put(i, d); + }

            else

            { + // ...and retrieve a complete document for caching otherwise. + documentCache.put(i, reader.document(i)); + }
              • End diff –

          @shalinmangar I've made a pass at restoring the previous caching behavior (with some comments around the rationale). It feels like we might have a complete solution at this point.

          Show
          githubbot ASF GitHub Bot added a comment - Github user maedhroz commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/47#discussion_r69510618 — Diff: solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java — @@ -766,16 +766,29 @@ public Document doc(int i, Set<String> fields) throws IOException { } final DirectoryReader reader = getIndexReader(); if (!enableLazyFieldLoading || fields == null) { d = reader.document ; + boolean containsAllFields = true; + + if (fields != null) Unknown macro: { + if (enableLazyFieldLoading) { + final SetNonLazyFieldSelector visitor = new SetNonLazyFieldSelector(fields, reader, i); + reader.document(i, visitor); + d = visitor.doc; + } else { + d = reader.document(i, fields); + containsAllFields = false; + } } else { - final SetNonLazyFieldSelector visitor = new SetNonLazyFieldSelector(fields, reader, i); - reader.document(i, visitor); - d = visitor.doc; + d = reader.document(i); } if (documentCache != null) { documentCache.put(i, d); + // Only cache the already retrieved document if it is complete... + if (containsAllFields) { + documentCache.put(i, d); + } else { + // ...and retrieve a complete document for caching otherwise. + documentCache.put(i, reader.document(i)); + } End diff – @shalinmangar I've made a pass at restoring the previous caching behavior (with some comments around the rationale). It feels like we might have a complete solution at this point.
          Hide
          dsmiley David Smiley added a comment -

          a.) Always cache an entire document, regardless of what we return from `doc()`. (Seems like it adds overhead.)

          I think caching the full doc is expected and what I thought it did. There isn't necessarily overhead; maybe some other component (e.g. highlighting) wants a field even though 'fl' doesn't refer to it. "b" seems fine, too, not sure, but I think "c" (always lazy-load) is bad.

          Show
          dsmiley David Smiley added a comment - a.) Always cache an entire document, regardless of what we return from `doc()`. (Seems like it adds overhead.) I think caching the full doc is expected and what I thought it did. There isn't necessarily overhead; maybe some other component (e.g. highlighting) wants a field even though 'fl' doesn't refer to it. "b" seems fine, too, not sure, but I think "c" (always lazy-load) is bad.
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          I think caching the full doc is expected and what I thought it did.

          That is only when lazy loading is disabled. If lazy loading is enabled then a lazy document is cached which may have only a few fields' data and the rest are loaded on access.

          @shalinmangar I've made a pass at restoring the previous caching behavior (with some comments around the rationale). It feels like we might have a complete solution at this point.

          I am not very happy with this solution because from a Solr user's perspective, this feature adds no benefit but causes stored fields to be read twice for an uncached read? I must also admit that I do not have a good suggestion on how to avoid that.

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - I think caching the full doc is expected and what I thought it did. That is only when lazy loading is disabled. If lazy loading is enabled then a lazy document is cached which may have only a few fields' data and the rest are loaded on access. @shalinmangar I've made a pass at restoring the previous caching behavior (with some comments around the rationale). It feels like we might have a complete solution at this point. I am not very happy with this solution because from a Solr user's perspective, this feature adds no benefit but causes stored fields to be read twice for an uncached read? I must also admit that I do not have a good suggestion on how to avoid that.
          Hide
          dsmiley David Smiley added a comment -

          That is only when lazy loading is disabled. If lazy loading is enabled then a lazy document is cached which may have only a few fields' data and the rest are loaded on access.

          Yeah I know that but wasn't clear in what I said I knew

          Caleb Rackliffe the current behavior seems reasonable – if no lazy load then grab everything, and if there's a doc cache stick it there. In your app with a custom stored field codec, why did you set enableLazyFieldLoading=false? What do you want to achieve?

          Show
          dsmiley David Smiley added a comment - That is only when lazy loading is disabled. If lazy loading is enabled then a lazy document is cached which may have only a few fields' data and the rest are loaded on access. Yeah I know that but wasn't clear in what I said I knew Caleb Rackliffe the current behavior seems reasonable – if no lazy load then grab everything, and if there's a doc cache stick it there. In your app with a custom stored field codec, why did you set enableLazyFieldLoading=false? What do you want to achieve?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dsmiley commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/47#discussion_r69763447

          — Diff: solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java —
          @@ -766,12 +766,16 @@ public Document doc(int i, Set<String> fields) throws IOException {
          }

          final DirectoryReader reader = getIndexReader();

          • if (!enableLazyFieldLoading || fields == null) {
          • d = reader.document;
            + if (fields != null) {
            + if (enableLazyFieldLoading) { + final SetNonLazyFieldSelector visitor = new SetNonLazyFieldSelector(fields, reader, i); + reader.document(i, visitor); + d = visitor.doc; + }

            else {
            + d = reader.document(i, fields);

              • End diff –

          This could lead to a bug when there is a document cache, since we'd cache a partial document without lazy loading. Then imagine a subsequent doc(i,otherFields), is called and then a document is returned without those fields even if the doc on disk might actually has those fields.

          On line 770 if (enableLazyFieldLoading) could become: if (enableLazyFieldLoading || documentCache != null). In this sense, "enableLazyFieldLoading" would have no effect unless there is no doc cache... I'm not sure what to think of that.

          Show
          githubbot ASF GitHub Bot added a comment - Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/47#discussion_r69763447 — Diff: solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java — @@ -766,12 +766,16 @@ public Document doc(int i, Set<String> fields) throws IOException { } final DirectoryReader reader = getIndexReader(); if (!enableLazyFieldLoading || fields == null) { d = reader.document ; + if (fields != null) { + if (enableLazyFieldLoading) { + final SetNonLazyFieldSelector visitor = new SetNonLazyFieldSelector(fields, reader, i); + reader.document(i, visitor); + d = visitor.doc; + } else { + d = reader.document(i, fields); End diff – This could lead to a bug when there is a document cache, since we'd cache a partial document without lazy loading. Then imagine a subsequent doc(i,otherFields), is called and then a document is returned without those fields even if the doc on disk might actually has those fields. On line 770 if (enableLazyFieldLoading) could become: if (enableLazyFieldLoading || documentCache != null). In this sense, "enableLazyFieldLoading" would have no effect unless there is no doc cache... I'm not sure what to think of that.
          Hide
          maedhroz Caleb Rackliffe added a comment -

          I am not very happy with this solution because from a Solr user's perspective, this feature adds no benefit but causes stored fields to be read twice for an uncached read? I must also admit that I do not have a good suggestion on how to avoid that.

          Shalin Shekhar Mangar Unless there's a less invasive solution I'm overlooking, I think it might be best to abandon this issue as something to handle in Solr proper.

          David Smiley Our fork actually reads most stored fields from an embedded database and relies on the visitor's fields information to make decisions about when (and when not) to read stored fields from Solr itself. We don't actually use documentCache at all, so the fixes I made around the initial patch to get the unit tests passing won't even be necessary.

          Let me know if there are any objections to closing this...

          Show
          maedhroz Caleb Rackliffe added a comment - I am not very happy with this solution because from a Solr user's perspective, this feature adds no benefit but causes stored fields to be read twice for an uncached read? I must also admit that I do not have a good suggestion on how to avoid that. Shalin Shekhar Mangar Unless there's a less invasive solution I'm overlooking, I think it might be best to abandon this issue as something to handle in Solr proper. David Smiley Our fork actually reads most stored fields from an embedded database and relies on the visitor's fields information to make decisions about when (and when not) to read stored fields from Solr itself. We don't actually use documentCache at all, so the fixes I made around the initial patch to get the unit tests passing won't even be necessary. Let me know if there are any objections to closing this...
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user maedhroz commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/47#discussion_r69787760

          — Diff: solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java —
          @@ -766,12 +766,16 @@ public Document doc(int i, Set<String> fields) throws IOException {
          }

          final DirectoryReader reader = getIndexReader();

          • if (!enableLazyFieldLoading || fields == null) {
          • d = reader.document;
            + if (fields != null) {
            + if (enableLazyFieldLoading) { + final SetNonLazyFieldSelector visitor = new SetNonLazyFieldSelector(fields, reader, i); + reader.document(i, visitor); + d = visitor.doc; + }

            else {
            + d = reader.document(i, fields);

              • End diff –

          Right. I think I addressed that in the last commit, since only full documents are cached now. The problem is that the overhead of doing this might be unacceptable.

          Show
          githubbot ASF GitHub Bot added a comment - Github user maedhroz commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/47#discussion_r69787760 — Diff: solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java — @@ -766,12 +766,16 @@ public Document doc(int i, Set<String> fields) throws IOException { } final DirectoryReader reader = getIndexReader(); if (!enableLazyFieldLoading || fields == null) { d = reader.document ; + if (fields != null) { + if (enableLazyFieldLoading) { + final SetNonLazyFieldSelector visitor = new SetNonLazyFieldSelector(fields, reader, i); + reader.document(i, visitor); + d = visitor.doc; + } else { + d = reader.document(i, fields); End diff – Right. I think I addressed that in the last commit, since only full documents are cached now. The problem is that the overhead of doing this might be unacceptable.
          Hide
          dsmiley David Smiley added a comment -

          If there is no document cache and lazy field loading is disabled, then we can pass through the fields requested to the codec instead of getting them all right? That wouldn't break anything nor add inefficiencies that aren't inherent with a user opting out of these 2 optimizations.

          Show
          dsmiley David Smiley added a comment - If there is no document cache and lazy field loading is disabled, then we can pass through the fields requested to the codec instead of getting them all right? That wouldn't break anything nor add inefficiencies that aren't inherent with a user opting out of these 2 optimizations.
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          If there is no document cache and lazy field loading is disabled, then we can pass through the fields requested to the codec instead of getting them all right? That wouldn't break anything nor add inefficiencies that aren't inherent with a user opting out of these 2 optimizations.

          +1 to that. That should work for Caleb's use-case as well.

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - If there is no document cache and lazy field loading is disabled, then we can pass through the fields requested to the codec instead of getting them all right? That wouldn't break anything nor add inefficiencies that aren't inherent with a user opting out of these 2 optimizations. +1 to that. That should work for Caleb's use-case as well.
          Hide
          maedhroz Caleb Rackliffe added a comment -

          If there is no document cache and lazy field loading is disabled, then we can pass through the fields requested to the codec instead of getting them all right?

          That works for me.

          Show
          maedhroz Caleb Rackliffe added a comment - If there is no document cache and lazy field loading is disabled, then we can pass through the fields requested to the codec instead of getting them all right? That works for me.
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          Condensed patch which applies to master. I added the following comment for the case where document cache is enabled:

          // we do not pass the fields in this case because that would return an incomplete document which would
          // be eventually cached. The alternative would be to read the stored fields twice; once with the fields
          // and then without for caching leading to a performance hit
          // see SOLR-8858 for related discussion

          All tests and precommit passes. I'll commit this now.

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - Condensed patch which applies to master. I added the following comment for the case where document cache is enabled: // we do not pass the fields in this case because that would return an incomplete document which would // be eventually cached. The alternative would be to read the stored fields twice; once with the fields // and then without for caching leading to a performance hit // see SOLR-8858 for related discussion All tests and precommit passes. I'll commit this now.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 24d6b7846995542c5ccbb4ddcdaa844f78555205 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=24d6b78 ]

          SOLR-8858: SolrIndexSearcher#doc() completely ignores field filters unless lazy field loading is enabled.

          This closes #47.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 24d6b7846995542c5ccbb4ddcdaa844f78555205 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=24d6b78 ] SOLR-8858 : SolrIndexSearcher#doc() completely ignores field filters unless lazy field loading is enabled. This closes #47.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/lucene-solr/pull/47

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/lucene-solr/pull/47
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4921dcd80cb7b8adbb35b62d753b2c965ec92f28 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4921dcd ]

          SOLR-8858: SolrIndexSearcher#doc() completely ignores field filters unless lazy field loading is enabled.

          This closes #47.
          (cherry picked from commit 24d6b78)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4921dcd80cb7b8adbb35b62d753b2c965ec92f28 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4921dcd ] SOLR-8858 : SolrIndexSearcher#doc() completely ignores field filters unless lazy field loading is enabled. This closes #47. (cherry picked from commit 24d6b78)
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks Caleb and David!

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - Thanks Caleb and David!
          Hide
          mikemccand Michael McCandless added a comment -

          Bulk close resolved issues after 6.2.0 release.

          Show
          mikemccand Michael McCandless added a comment - Bulk close resolved issues after 6.2.0 release.

            People

            • Assignee:
              shalinmangar Shalin Shekhar Mangar
              Reporter:
              maedhroz Caleb Rackliffe
            • Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development