Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-762

[PATCH] Efficiently retrieve sizes of field values

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: None
    • Component/s: core/store
    • Labels:
      None

      Description

      Sometimes an application would like to know how large a document is before retrieving it. This can be important for memory management or choosing between algorithms, especially in cases where documents might be very large.

      This patch extends the existing FieldSelector mechanism with two new FieldSelectorResults: SIZE and SIZE_AND_BREAK. SIZE creates fields on the retrieved document that store field sizes instead of actual values. SIZE_AND_BREAK is especially efficient if one field comprises the bulk of the document size (e.g., the body field) and can thus be used as a reasonable size approximation.

        Activity

        Hide
        gsingers Grant Ingersoll added a comment -

        I have committed the original patch. All tests pass. In the end, I could not find a way I felt comfortable with for getting rid of the if-then-else clause in FieldsReader. I did add a TODO item there to remind us to go back and take a look at it again later.

        Since the if clauses are ordered according to their most common usages (I think), I don't think there will be much of a performance issue w/ the current approach.

        Show
        gsingers Grant Ingersoll added a comment - I have committed the original patch. All tests pass. In the end, I could not find a way I felt comfortable with for getting rid of the if-then-else clause in FieldsReader. I did add a TODO item there to remind us to go back and take a look at it again later. Since the if clauses are ordered according to their most common usages (I think), I don't think there will be much of a performance issue w/ the current approach.
        Hide
        gsingers Grant Ingersoll added a comment -

        I agree, Chuck, I sometimes wish you could have JAR level access or something like that. Off topic, but doesn't C++ have a friend operator (not that I'm wishing for the C++ days) that lets you do these kind of things.

        I originally put FieldSelector in the Document package b/c I felt it was more closely associated with Documents and Fields than the Index package. Now, I guess, since the FieldsReader is in Index, I can see a strong argument for that location.

        Anyway, I thought of an alternative that should keep everyone happy:
        Create a new Interface named FieldInfoAccessor which provides get methods for the properties on FieldInfo, which will implement the interface. Pass that to FieldSelectorResult.readField method instead of FieldInfo and proceed as stated above.

        The only sticking point I am now having is what to do with the Lazy Fields, since they require the ThreadLocal in FieldsReader. I guess I would have to pass in the cloneable Fields Reader and the threadlocal, but that doesn't seem quite right either. Any thoughts? Moving the FieldSelector stuff into index package wouldn't address this, I don't think. Maybe a brilliant thought will occur to me in the morning.

        Show
        gsingers Grant Ingersoll added a comment - I agree, Chuck, I sometimes wish you could have JAR level access or something like that. Off topic, but doesn't C++ have a friend operator (not that I'm wishing for the C++ days) that lets you do these kind of things. I originally put FieldSelector in the Document package b/c I felt it was more closely associated with Documents and Fields than the Index package. Now, I guess, since the FieldsReader is in Index, I can see a strong argument for that location. Anyway, I thought of an alternative that should keep everyone happy: Create a new Interface named FieldInfoAccessor which provides get methods for the properties on FieldInfo, which will implement the interface. Pass that to FieldSelectorResult.readField method instead of FieldInfo and proceed as stated above. The only sticking point I am now having is what to do with the Lazy Fields, since they require the ThreadLocal in FieldsReader. I guess I would have to pass in the cloneable Fields Reader and the threadlocal, but that doesn't seem quite right either. Any thoughts? Moving the FieldSelector stuff into index package wouldn't address this, I don't think. Maybe a brilliant thought will occur to me in the morning.
        Hide
        manawiz Chuck Williams added a comment -

        I use FieldInfo heavily and many other package-level API's, but put my classes into Lucene packages to do this. To bad Java doesn't have intermediate access levels between public and package, e.g., open to this package and it's subpackages.

        Maybe move FIeldSelector and FieldSelectorResult out of org.apache.lucene.document and into org.apache.lucene.index as they are deeply integrated with FieldsReader? They don't have any package-level API's at present.

        Show
        manawiz Chuck Williams added a comment - I use FieldInfo heavily and many other package-level API's, but put my classes into Lucene packages to do this. To bad Java doesn't have intermediate access levels between public and package, e.g., open to this package and it's subpackages. Maybe move FIeldSelector and FieldSelectorResult out of org.apache.lucene.document and into org.apache.lucene.index as they are deeply integrated with FieldsReader? They don't have any package-level API's at present.
        Hide
        gsingers Grant Ingersoll added a comment -

        Hmm, I forgot to mention that FieldInfo is package local. This complicates things a little bit, but I could expand the proposed method to pass in the appropriate FieldInfo properties, which seems to be name and omitNorms, but that doesn't seem quite right b/c we may want other FieldInfo later. I suppose I could copy the info into a holding structure, but that seems like a waste. Another option is to make FieldInfo public and label it as Expert Use/Internal Use only. It exposes some of the implementation details, but not all and wouldn't be meaningful to most people, I guess.

        Any strong opinions on making FieldInfo public?

        Show
        gsingers Grant Ingersoll added a comment - Hmm, I forgot to mention that FieldInfo is package local. This complicates things a little bit, but I could expand the proposed method to pass in the appropriate FieldInfo properties, which seems to be name and omitNorms, but that doesn't seem quite right b/c we may want other FieldInfo later. I suppose I could copy the info into a holding structure, but that seems like a waste. Another option is to make FieldInfo public and label it as Expert Use/Internal Use only. It exposes some of the implementation details, but not all and wouldn't be meaningful to most people, I guess. Any strong opinions on making FieldInfo public?
        Hide
        gsingers Grant Ingersoll added a comment -

        This would also eliminate the need to change FieldsReader for every new FieldSelectorResult that we want. The downside is that it would most likely involve some reworking of FieldSelectorResult. This probably isn't a big deal, since the FieldSelector stuff hasn't been released yet and has been noted to be experimental, but it has been on trunk for a while, so it is in use, I'm sure. I think Solr has incorporated it, but I am not sure.

        Patch that incorporates your existing patch and this new approach to follow soon.

        Show
        gsingers Grant Ingersoll added a comment - This would also eliminate the need to change FieldsReader for every new FieldSelectorResult that we want. The downside is that it would most likely involve some reworking of FieldSelectorResult. This probably isn't a big deal, since the FieldSelector stuff hasn't been released yet and has been noted to be experimental, but it has been on trunk for a while, so it is in use, I'm sure. I think Solr has incorporated it, but I am not sure. Patch that incorporates your existing patch and this new approach to follow soon.
        Hide
        manawiz Chuck Williams added a comment -

        Hi Grant,

        Maybe even better would be to have an appropriate method on FieldSelectorResult. E.g.:

        FieldSelectorResult.readField(doc, fieldsStream, fi, binary, compressed, tokenized)

        This would eliminate the tests or map lookup in performance-critical code.

        Show
        manawiz Chuck Williams added a comment - Hi Grant, Maybe even better would be to have an appropriate method on FieldSelectorResult. E.g.: FieldSelectorResult.readField(doc, fieldsStream, fi, binary, compressed, tokenized) This would eliminate the tests or map lookup in performance-critical code.
        Hide
        gsingers Grant Ingersoll added a comment -

        Hi Chuck,

        I haven't applied this yet, but it looks good. My only hesitation, and this is no reflection on how you have implemented it, but the if-then-else structure around line 104 is starting to get ugly. I can imagine it growing and growing as more cases are handled. I was wondering if maybe we should convert to a map lookup approach. In Java 1.5 the map would be something like:
        Map<FieldSelectorResult, FieldSelectorFunctor>

        where the Functor does the work of what is in the clause of each of the different cases
        then, the if-else structure could be replaced by
        FieldSelectorFunctor functor = map.get(FieldSelectorResult);
        if (functor != null)
        {
        functor.apply(doc, fi, binary, compressed, tokenize);
        }
        else
        {
        skipField(binary, compressed)
        }

        The constructor/static would be responsible for instantiating the Map. The Functors could be implemented as inner classes (although this can lead to a bunch of inner classes too)

        What do you think?

        Show
        gsingers Grant Ingersoll added a comment - Hi Chuck, I haven't applied this yet, but it looks good. My only hesitation, and this is no reflection on how you have implemented it, but the if-then-else structure around line 104 is starting to get ugly. I can imagine it growing and growing as more cases are handled. I was wondering if maybe we should convert to a map lookup approach. In Java 1.5 the map would be something like: Map<FieldSelectorResult, FieldSelectorFunctor> where the Functor does the work of what is in the clause of each of the different cases then, the if-else structure could be replaced by FieldSelectorFunctor functor = map.get(FieldSelectorResult); if (functor != null) { functor.apply(doc, fi, binary, compressed, tokenize); } else { skipField(binary, compressed) } The constructor/static would be responsible for instantiating the Map. The Functors could be implemented as inner classes (although this can lead to a bunch of inner classes too) What do you think?

          People

          • Assignee:
            gsingers Grant Ingersoll
            Reporter:
            manawiz Chuck Williams
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development