Lucene - Core
  1. Lucene - Core
  2. LUCENE-3593

Add a filter returning all document without a value in a field

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      In some situations it would be useful to have a Filter that simply returns all document that either have at least one or no value in a certain field. We don't have something like that out of the box and adding it seems straight forward.

      1. LUCENE-3593.patch
        10 kB
        Simon Willnauer
      2. LUCENE-3593.patch
        10 kB
        Simon Willnauer
      3. LUCENE-3593.patch
        8 kB
        Simon Willnauer

        Issue Links

          Activity

          Simon Willnauer created issue -
          Hide
          Simon Willnauer added a comment -

          here is an initial patch. maybe somebody more familiar with our new filter hierarchy etc. can review this. There is still a nocommit in there since I am not sure if that is the best way of doing this. Uwe had some ideas I think..

          Show
          Simon Willnauer added a comment - here is an initial patch. maybe somebody more familiar with our new filter hierarchy etc. can review this. There is still a nocommit in there since I am not sure if that is the best way of doing this. Uwe had some ideas I think..
          Simon Willnauer made changes -
          Field Original Value New Value
          Attachment LUCENE-3593.patch [ 12505040 ]
          Hide
          Uwe Schindler added a comment -

          Hi Simon,

          the first patch looks exactly like I said in our first ideas-exchange!

          There are smaller problems (but solveable) and one optimization... (a trick one...): FieldCacheDocIdSet has some special cases which work with the implementation here, but are unclean and should violate some assertions - and should be fixed...:

          • FieldCacheDocIdSet excepts that the match() method throws ArrayIndexOutOfBoundsException when the FieldCacheArray is out of bounds. With the FixedBitSet behind that implementation of the FieldCache this basically works, but should violate some code assertions added by MikeMcCandless (not sure why the testcase does not hit this - doesn't it - I assume it does not because the trunk bits() on DocIdSet will intercept this as our filter is not sparse -> it switches to random access)
          • The FieldCacheDocIdSet should maybe made un-private and refactored out of the FieldCacheRangeFilter.
          • The positive case could be optimized: A instanceof check in the getDocIdSet() method could check for the positive case that the FieldCacheImpl itsself returns a FixedBitSet/DocIdSet already and return this directly:
          final Bits docsWithField = FieldCache.DEFAULT.getDocsWithField(context.reader, field);
          if (negate && docsWithField instanceof DocIdSet) // this is always the case for our current impl - but who knows :-)
            return (DocIdSet) docsWithField;
          

          In general the other cases can be easily done by the default stupid (stupid in the case that its slowly iterating by doc++ and in trunk directly uses the Bits) impl like you did, but once factoring out the FieldCacheRangeFilter.FieldCacheDocIdSet we could optimize this and maybe have a better negation.

          In all cases I dont like double negation of this Filter.

          I'll work on the problems and make this filter work better. Should I take this issue and solve the problems first? I also want to backport the FieldCacheTermsFilter code-duplication removal in trunk to 3.x, so some cleanup is really needed!

          I will come with a patch adressing those problems later or tomorrow.

          Show
          Uwe Schindler added a comment - Hi Simon, the first patch looks exactly like I said in our first ideas-exchange! There are smaller problems (but solveable) and one optimization... (a trick one...): FieldCacheDocIdSet has some special cases which work with the implementation here, but are unclean and should violate some assertions - and should be fixed...: FieldCacheDocIdSet excepts that the match() method throws ArrayIndexOutOfBoundsException when the FieldCacheArray is out of bounds. With the FixedBitSet behind that implementation of the FieldCache this basically works, but should violate some code assertions added by MikeMcCandless (not sure why the testcase does not hit this - doesn't it - I assume it does not because the trunk bits() on DocIdSet will intercept this as our filter is not sparse -> it switches to random access) The FieldCacheDocIdSet should maybe made un-private and refactored out of the FieldCacheRangeFilter. The positive case could be optimized: A instanceof check in the getDocIdSet() method could check for the positive case that the FieldCacheImpl itsself returns a FixedBitSet/DocIdSet already and return this directly: final Bits docsWithField = FieldCache.DEFAULT.getDocsWithField(context.reader, field); if (negate && docsWithField instanceof DocIdSet) // this is always the case for our current impl - but who knows :-) return (DocIdSet) docsWithField; In general the other cases can be easily done by the default stupid (stupid in the case that its slowly iterating by doc++ and in trunk directly uses the Bits) impl like you did, but once factoring out the FieldCacheRangeFilter.FieldCacheDocIdSet we could optimize this and maybe have a better negation. In all cases I dont like double negation of this Filter. I'll work on the problems and make this filter work better. Should I take this issue and solve the problems first? I also want to backport the FieldCacheTermsFilter code-duplication removal in trunk to 3.x, so some cleanup is really needed! I will come with a patch adressing those problems later or tomorrow.
          Hide
          Robert Muir added a comment -

          Hello,

          I think rather than having NoFieldValueFilter with boolean negate, we should instead use a positive name for the filter, to avoid requiring double negations.

          So the name would something positive like FieldValueFilter instead.

          Show
          Robert Muir added a comment - Hello, I think rather than having NoFieldValueFilter with boolean negate, we should instead use a positive name for the filter, to avoid requiring double negations. So the name would something positive like FieldValueFilter instead.
          Hide
          Uwe Schindler added a comment -

          Robert: I already said that. The Filter could also live of the shortcuts, if getDocsWithField returns MatchAllBits it could shortcut to MatchAllDocsQuery/Filter or return null (depending on negation or not). In all cases this Filter could in all cases do some shortcuts

          I will provide a patch later adressing also the stupid "rely on ArrayIndexOutOfBoundsException". I am not even sure that FieldCacheTermsFilter works correct in trunk... I would prefer to add the "< maxDoc" check in the iterator impl beyond FieldCacheDocIdSet.

          Show
          Uwe Schindler added a comment - Robert: I already said that. The Filter could also live of the shortcuts, if getDocsWithField returns MatchAllBits it could shortcut to MatchAllDocsQuery/Filter or return null (depending on negation or not). In all cases this Filter could in all cases do some shortcuts I will provide a patch later adressing also the stupid "rely on ArrayIndexOutOfBoundsException". I am not even sure that FieldCacheTermsFilter works correct in trunk... I would prefer to add the "< maxDoc" check in the iterator impl beyond FieldCacheDocIdSet.
          Hide
          Simon Willnauer added a comment -

          guys thanks for the comments! I incorporated uwes shortcut and fixed the name to FieldValueFilter. I extened the testcase to also use deletions. Uwe, whatever you want to do feel free to upload a new patch. I already added a changes.txt entry for this too...

          Show
          Simon Willnauer added a comment - guys thanks for the comments! I incorporated uwes shortcut and fixed the name to FieldValueFilter. I extened the testcase to also use deletions. Uwe, whatever you want to do feel free to upload a new patch. I already added a changes.txt entry for this too...
          Simon Willnauer made changes -
          Attachment LUCENE-3593.patch [ 12505042 ]
          Simon Willnauer made changes -
          Assignee Simon Willnauer [ simonw ]
          Hide
          Uwe Schindler added a comment -

          Patch looks good, except the AIOOBE requirement - which is a broken API. I will fix this now by always doing (doc < maxDoc) check in the iterator and also factor out the abstract class from FCRF. Also 3.x has a bug in FCRF with respecting deletions (I already discussed with Mike several weeks ago).

          Show
          Uwe Schindler added a comment - Patch looks good, except the AIOOBE requirement - which is a broken API. I will fix this now by always doing (doc < maxDoc) check in the iterator and also factor out the abstract class from FCRF. Also 3.x has a bug in FCRF with respecting deletions (I already discussed with Mike several weeks ago).
          Hide
          Simon Willnauer added a comment -

          next iteration. I added a AIOOBE to all the match methods to make the interface happy. I think we should fix this as uwe mentioned. Uwe, can you create a subtask / issue to fix this.. I think you already on it no? I added some more optimizations if the Bits is instanceof MatchNoBits / MatchAllBits respectively I return null instead of the DocIdSetIterator when appropriate. all test pass..

          I think it's ready..

          Show
          Simon Willnauer added a comment - next iteration. I added a AIOOBE to all the match methods to make the interface happy. I think we should fix this as uwe mentioned. Uwe, can you create a subtask / issue to fix this.. I think you already on it no? I added some more optimizations if the Bits is instanceof MatchNoBits / MatchAllBits respectively I return null instead of the DocIdSetIterator when appropriate. all test pass.. I think it's ready..
          Simon Willnauer made changes -
          Attachment LUCENE-3593.patch [ 12505043 ]
          Hide
          Uwe Schindler added a comment -

          +1 looks fine. Cool heavy fast ping-pong

          I will open new issue and fix the TODO, especially as I dont know if trunk really likes e.g. on FieldCacheTermsFilter that getOrd() is called with doc id out of range... Maybe it throws AIOOBE, but thats ugly.

          Show
          Uwe Schindler added a comment - +1 looks fine. Cool heavy fast ping-pong I will open new issue and fix the TODO, especially as I dont know if trunk really likes e.g. on FieldCacheTermsFilter that getOrd() is called with doc id out of range... Maybe it throws AIOOBE, but thats ugly.
          Hide
          Uwe Schindler added a comment -

          Opened LUCENE-3595.

          Show
          Uwe Schindler added a comment - Opened LUCENE-3595 .
          Uwe Schindler made changes -
          Link This issue relates to LUCENE-3595 [ LUCENE-3595 ]
          Hide
          Simon Willnauer added a comment -

          Opened LUCENE-3595.

          thanks! I committed this to trunk and backported to 3.x

          nice iterations.. heavy committing

          Show
          Simon Willnauer added a comment - Opened LUCENE-3595 . thanks! I committed this to trunk and backported to 3.x nice iterations.. heavy committing
          Simon Willnauer made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          Uwe Schindler added a comment - - edited

          Hi Simon, I fixed the TODO for the DocIdSet special case (@UweSays) to resepect deletions (with fixes from LUCENE-3595). Code needed for Lucene 3.x was copied from CachingWrapperFilter.

          Show
          Uwe Schindler added a comment - - edited Hi Simon, I fixed the TODO for the DocIdSet special case (@UweSays) to resepect deletions (with fixes from LUCENE-3595 ). Code needed for Lucene 3.x was copied from CachingWrapperFilter.
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Simon Willnauer
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development