Lucene - Core
  1. Lucene - Core
  2. LUCENE-3638

IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index, core/search
    • Labels:
    • Environment:

      64bit linux java 1.6

    • Lucene Fields:
      New, Patch Available

      Description

      when generating digest for some documents with huge fields, it should be unnecessary to load the field but just interesting part of the field with the offset information. but indexreader always return the whole field content. afterward, the customized storedfieldsreader will got a repeated loading

      1. doc.fields.patch
        5 kB
        peter chang
      2. LUCENE-3638.patch
        6 kB
        Michael McCandless

        Activity

        Hide
        Shai Erera added a comment -

        IndexReader and IndexSearcher already offer a doc/document method which takes StoredFieldVisitor, so why adding another version to them?

        Also, I don't think that DocumentStoredFieldVisitor should change. I find it very intuitive that I need to specify that fields that I want to load, rather than the fields that I don't want to. I.e., in my apps, there are many fields that are stored, but not loaded for results display.

        However, I do see the convenience of specifying just 1-2 fields that you don't want to load, rather than 20 that you do. So how about you create a new StoredFieldVisitor, which takes the list of fields 'not to load'? It can extend DocumentStoredFieldVisitor by overriding needsField?

        Show
        Shai Erera added a comment - IndexReader and IndexSearcher already offer a doc/document method which takes StoredFieldVisitor, so why adding another version to them? Also, I don't think that DocumentStoredFieldVisitor should change. I find it very intuitive that I need to specify that fields that I want to load, rather than the fields that I don't want to. I.e., in my apps, there are many fields that are stored, but not loaded for results display. However, I do see the convenience of specifying just 1-2 fields that you don't want to load, rather than 20 that you do. So how about you create a new StoredFieldVisitor, which takes the list of fields 'not to load'? It can extend DocumentStoredFieldVisitor by overriding needsField?
        Hide
        Robert Muir added a comment -

        However, I do see the convenience of specifying just 1-2 fields that you don't want to load, rather than 20 that you do. So how about you create a new StoredFieldVisitor, which takes the list of fields 'not to load'? It can extend DocumentStoredFieldVisitor by overriding needsField?

        DocumentStoredFieldsVisitor already supports this in its ctors.

        So i would recommend we consider adding some sugar to indexreader:

        public final Document document(int docID, Set<String> fields) {
          return document(docid, new DocumentStoredFieldsVisitor(fields));
        }
        

        sure there are cases where you want more complicated logic like to return STOP after certain fields, for that write your own logic.
        But this is probably pretty common.

        Show
        Robert Muir added a comment - However, I do see the convenience of specifying just 1-2 fields that you don't want to load, rather than 20 that you do. So how about you create a new StoredFieldVisitor, which takes the list of fields 'not to load'? It can extend DocumentStoredFieldVisitor by overriding needsField? DocumentStoredFieldsVisitor already supports this in its ctors. So i would recommend we consider adding some sugar to indexreader: public final Document document(int docID, Set<String> fields) { return document(docid, new DocumentStoredFieldsVisitor(fields)); } sure there are cases where you want more complicated logic like to return STOP after certain fields, for that write your own logic. But this is probably pretty common.
        Hide
        Uwe Schindler added a comment -

        I think the issue is more about something else:

        when generating digest for some documents with huge fields, it should be unnecessary to load the field but just interesting part of the field with the offset information. but indexreader always return the whole field content. afterward, the customized storedfieldsreader will got a repeated loading

        He wants to specify an offset/length into a very large field and only retrieve the subslice.

        Show
        Uwe Schindler added a comment - I think the issue is more about something else: when generating digest for some documents with huge fields, it should be unnecessary to load the field but just interesting part of the field with the offset information. but indexreader always return the whole field content. afterward, the customized storedfieldsreader will got a repeated loading He wants to specify an offset/length into a very large field and only retrieve the subslice.
        Hide
        Uwe Schindler added a comment -

        So i would recommend we consider adding some sugar to indexreader:

        Thats unrelated, BUT: yes please! And you already noted in the signature, one is very important: This method must be FINAL in IR!

        Show
        Uwe Schindler added a comment - So i would recommend we consider adding some sugar to indexreader: Thats unrelated, BUT: yes please! And you already noted in the signature, one is very important: This method must be FINAL in IR!
        Hide
        Robert Muir added a comment -

        He wants to specify an offset/length into a very large field and only retrieve the subslice.

        I really think the solution here is just to put the interesting part in its own field...

        Otherwise we have to add lots of complexity to the codec apis to support this (for instance what are the offsets/lengths? bytes? utf-16?)

        Show
        Robert Muir added a comment - He wants to specify an offset/length into a very large field and only retrieve the subslice. I really think the solution here is just to put the interesting part in its own field... Otherwise we have to add lots of complexity to the codec apis to support this (for instance what are the offsets/lengths? bytes? utf-16?)
        Hide
        Shai Erera added a comment -

        He wants to specify an offset/length into a very large field and only retrieve the subslice.

        That's indeed what the issue's description says, but there's no evidence to it in the patch. And I agree with Robert that in that case, one should store the interesting part in a special field.

        DocumentStoredFieldsVisitor already supports this in its ctors.

        Where? What am I missing? DSFV only takes a list of fieldsToAdd, not fieldsToFilter. If you have 20 fields in your index, and you want to load all but 2 fields, it may be more convenient to specify these two, and I proposed that it can be done in a DSFV extension.

        So i would recommend we consider adding some sugar to indexreader:

        I think this method is redundant, because one can easily call new DSFV(fields) and use the SFV document version. And why do we favor Set<String> over String...?

        Show
        Shai Erera added a comment - He wants to specify an offset/length into a very large field and only retrieve the subslice. That's indeed what the issue's description says, but there's no evidence to it in the patch. And I agree with Robert that in that case, one should store the interesting part in a special field. DocumentStoredFieldsVisitor already supports this in its ctors. Where? What am I missing? DSFV only takes a list of fieldsToAdd, not fieldsToFilter. If you have 20 fields in your index, and you want to load all but 2 fields, it may be more convenient to specify these two, and I proposed that it can be done in a DSFV extension. So i would recommend we consider adding some sugar to indexreader: I think this method is redundant, because one can easily call new DSFV(fields) and use the SFV document version. And why do we favor Set<String> over String...?
        Hide
        peter chang added a comment -

        1. i agree with robert, fieldsToAdd, fieldsToFilter something like this can be added for IR and IS.doc
        2. yes, the offset info is specified topic related. it can be process in app level when process multi-bytes encoded languages such as Zh_CN. in this situation, the offset is just an estimation.

        Show
        peter chang added a comment - 1. i agree with robert, fieldsToAdd, fieldsToFilter something like this can be added for IR and IS.doc 2. yes, the offset info is specified topic related. it can be process in app level when process multi-bytes encoded languages such as Zh_CN. in this situation, the offset is just an estimation.
        Hide
        peter chang added a comment -

        3. i do not think i can store only the interesting part because i do not know which is interesting part at index time. For example, the digest part of the search results is generated according to the query of somebody's.

        Show
        peter chang added a comment - 3. i do not think i can store only the interesting part because i do not know which is interesting part at index time. For example, the digest part of the search results is generated according to the query of somebody's.
        Hide
        Uwe Schindler added a comment -

        I think this method is redundant, because one can easily call new DSFV(fields) and use the SFV document version. And why do we favor Set<String> over String...?

        Ahm this would favour veeeery slow code. We would need to create a Set<String> on every call

        But I agree here with Robert we should add the sugar method (final please and without maxDoc checks, that's up to the abstract impl) for easier use.

        Show
        Uwe Schindler added a comment - I think this method is redundant, because one can easily call new DSFV(fields) and use the SFV document version. And why do we favor Set<String> over String...? Ahm this would favour veeeery slow code. We would need to create a Set<String> on every call But I agree here with Robert we should add the sugar method (final please and without maxDoc checks, that's up to the abstract impl) for easier use.
        Hide
        Uwe Schindler added a comment - - edited

        3. i do not think i can store only the interesting part because i do not know which is interesting part at index time. For example, the digest part of the search results is generated according to the query of somebody's.

        Digest is the wrong word, this confused here lots of people. The use case you talk about is "highlighting". I agree for very large fields this is expensive.

        In fact your patch does not handle this case and I agree with the others as it's to heavy to implement and adds back the crazy complexity we had with lazy fields & co.

        Show
        Uwe Schindler added a comment - - edited 3. i do not think i can store only the interesting part because i do not know which is interesting part at index time. For example, the digest part of the search results is generated according to the query of somebody's. Digest is the wrong word, this confused here lots of people. The use case you talk about is "highlighting". I agree for very large fields this is expensive. In fact your patch does not handle this case and I agree with the others as it's to heavy to implement and adds back the crazy complexity we had with lazy fields & co.
        Hide
        peter chang added a comment -

        yes, i mean hightlighting or sth. else dynamic generated at search time. Thnaks for Uwe's reminding.

        Show
        peter chang added a comment - yes, i mean hightlighting or sth. else dynamic generated at search time. Thnaks for Uwe's reminding.
        Hide
        Robert Muir added a comment -

        Where? What am I missing? DSFV only takes a list of fieldsToAdd, not fieldsToFilter. If you have 20 fields in your index, and you want to load all but 2 fields, it may be more convenient to specify these two, and I proposed that it can be done in a DSFV extension.

        Well, you certainly can do this in a DSFV extension. The only question is do we need to provide one that does this? I think in general
        each app will be different and having this visitor interface is "enough" rather than us supplying tons of concrete implementations for various
        use cases.

        I think this method is redundant, because one can easily call new DSFV(fields) and use the SFV document version.

        Yes its definitely redundant. But I think this is probably very common? Doesn't matter to me either way though.

        Show
        Robert Muir added a comment - Where? What am I missing? DSFV only takes a list of fieldsToAdd, not fieldsToFilter. If you have 20 fields in your index, and you want to load all but 2 fields, it may be more convenient to specify these two, and I proposed that it can be done in a DSFV extension. Well, you certainly can do this in a DSFV extension. The only question is do we need to provide one that does this? I think in general each app will be different and having this visitor interface is "enough" rather than us supplying tons of concrete implementations for various use cases. I think this method is redundant, because one can easily call new DSFV(fields) and use the SFV document version. Yes its definitely redundant. But I think this is probably very common? Doesn't matter to me either way though.
        Hide
        Shai Erera added a comment -

        The only question is do we need to provide one that does this?

        Oh, I did not propose that we do it in Lucene, but rather that Peter do it himself (I wrote "how about you ..."). I agree we should not cater for all use cases out there.

        Yes its definitely redundant. But I think this is probably very common? Doesn't matter to me either way though.

        I don't mind much either. It's just that this sugar method suggests that you have to create a Set<String> on every call, while if we point people to DSFV, people will fins that they can pass String... too.

        I anyway think that for most apps, this object is probably constructed just once, because usually the list of fields does not change between queries, or at least you will have a handful of those, one per query type. Perhaps if we omit the sugar method, people will think that way, and indeed create the object just once. Dunno, it's your call.

        If someone ends up committing that method in the context of this issue, I suggest that its subject is renamed accordingly. Otherwise, just close it.

        Show
        Shai Erera added a comment - The only question is do we need to provide one that does this? Oh, I did not propose that we do it in Lucene, but rather that Peter do it himself (I wrote "how about you ..."). I agree we should not cater for all use cases out there. Yes its definitely redundant. But I think this is probably very common? Doesn't matter to me either way though. I don't mind much either. It's just that this sugar method suggests that you have to create a Set<String> on every call, while if we point people to DSFV, people will fins that they can pass String... too. I anyway think that for most apps, this object is probably constructed just once, because usually the list of fields does not change between queries, or at least you will have a handful of those, one per query type. Perhaps if we omit the sugar method, people will think that way, and indeed create the object just once. Dunno, it's your call. If someone ends up committing that method in the context of this issue, I suggest that its subject is renamed accordingly. Otherwise, just close it.
        Hide
        Robert Muir added a comment -

        I don't mind much either. It's just that this sugar method suggests that you have to create a Set<String> on every call, while if we point people to DSFV, people will fins that they can pass String... too.

        True, but thats just because DSFV creates the hashset on the fly

        Perhaps if we omit the sugar method, people will think that way, and indeed create the object just once. Dunno, it's your call.

        Thats true too, because if you reuse the DSFV then the String... method is not harmful since you are only doing it once.
        So I think the String... method is ok on DSFV for this reason.

        However on indexreader, i think its also ok to have a sugar method with Set, because it just creates a DSFV around that hashset,
        so its hardly wasteful.

        In other words: Create a Set<String> and reuse your own Set via the proposed sugar method, and I think its fine,
        and a lot friendlier. Its not hashing anything. Sure its creating a DSFV each time, but like using a DSFV in any way,
        its also creating a Document object each time. If you are really worried about this stuff, implement your own visitor
        and don't use Document at all Don't forget we are talking about stored fields too!

        And I say keep the String... on DSFV only, but don't add to IR, so we don't encourage lots of wasteful rehashing.

        Show
        Robert Muir added a comment - I don't mind much either. It's just that this sugar method suggests that you have to create a Set<String> on every call, while if we point people to DSFV, people will fins that they can pass String... too. True, but thats just because DSFV creates the hashset on the fly Perhaps if we omit the sugar method, people will think that way, and indeed create the object just once. Dunno, it's your call. Thats true too, because if you reuse the DSFV then the String... method is not harmful since you are only doing it once. So I think the String... method is ok on DSFV for this reason. However on indexreader, i think its also ok to have a sugar method with Set, because it just creates a DSFV around that hashset, so its hardly wasteful. In other words: Create a Set<String> and reuse your own Set via the proposed sugar method, and I think its fine, and a lot friendlier. Its not hashing anything. Sure its creating a DSFV each time, but like using a DSFV in any way, its also creating a Document object each time. If you are really worried about this stuff, implement your own visitor and don't use Document at all Don't forget we are talking about stored fields too! And I say keep the String... on DSFV only, but don't add to IR, so we don't encourage lots of wasteful rehashing.
        Hide
        Shai Erera added a comment -

        Ok. One last comment (b/c I really don't mind if it's added or not) – I meant that if we'll put a Set method on IR, users might falsely create a Set on every document() call, b/c it's there and it's convenient. Maybe javadocs can warn people against doing this ...

        Show
        Shai Erera added a comment - Ok. One last comment (b/c I really don't mind if it's added or not) – I meant that if we'll put a Set method on IR, users might falsely create a Set on every document() call, b/c it's there and it's convenient. Maybe javadocs can warn people against doing this ...
        Hide
        Michael McCandless added a comment -

        +1 to adding simple sugar method to IR to only load the specified fields (Set<String>) of the document.

        It's just sugar to forward to DSFV.

        Show
        Michael McCandless added a comment - +1 to adding simple sugar method to IR to only load the specified fields (Set<String>) of the document. It's just sugar to forward to DSFV.
        Hide
        peter chang added a comment -

        i upload this patch just for convenience

        IndexSearcher.java
          /* Sugar for <code>.getIndexReader().document(docID)</code> */
          /** see {@link IndexReader#document(int, Set, Set)} for detail*/
          public Document doc(int docID, Set<String> fieldsToAdd, Set<String> fieldsToFilter) throws CorruptIndexException, IOException {
        	return reader.document(docID, fieldsToAdd, fieldsToFilter);
          }
        

        here, you see the IS also has the access to document fetch. so in this case, IS will look like powerless if IR can not supply such method or interface to the external.

        Show
        peter chang added a comment - i upload this patch just for convenience IndexSearcher.java /* Sugar for <code>.getIndexReader().document(docID)</code> */ /** see {@link IndexReader#document( int , Set, Set)} for detail*/ public Document doc( int docID, Set< String > fieldsToAdd, Set< String > fieldsToFilter) throws CorruptIndexException, IOException { return reader.document(docID, fieldsToAdd, fieldsToFilter); } here, you see the IS also has the access to document fetch. so in this case, IS will look like powerless if IR can not supply such method or interface to the external.
        Hide
        Michael McCandless added a comment -

        I was thinking just simple sugar, like the attached patch...

        Show
        Michael McCandless added a comment - I was thinking just simple sugar, like the attached patch...
        Hide
        Michael McCandless added a comment -

        Thanks Peter!

        Show
        Michael McCandless added a comment - Thanks Peter!

          People

          • Assignee:
            Unassigned
            Reporter:
            peter chang
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development