Lucene - Core / LUCENE-1613

TermEnum.docFreq() is not updated when there are deletes

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.4
    • Fix Version/s: None
    • Component/s: core/search
    • Labels: None
    • Lucene Fields: New

      Description

      TermEnum.docFreq is used in many places, especially scoring. However, if the index contains deletes and the affected segments have not yet been merged, this value is not updated: it still counts the deleted documents.

      Attached is a test case.
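
To make the reported behavior concrete, here is a minimal toy model of a single segment. It is not Lucene code and is not the attached test case; `ToySegment` and its methods are hypothetical names, and the model assumes only what the description states: per-term counts are fixed when documents are indexed, and a delete merely marks a document.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Toy model of one index segment. NOT Lucene code: all names here are
// hypothetical and exist only to illustrate the behavior in this issue.
final class ToySegment {
    // term -> number of documents containing it, recorded at index time
    private final Map<String, Integer> storedDocFreq = new HashMap<>();
    // one bit per document; a set bit marks the document as deleted
    private final BitSet deleted = new BitSet();
    private int maxDoc = 0;

    // Index one document made of the given distinct terms; returns its id.
    int addDocument(String... distinctTerms) {
        for (String term : distinctTerms) {
            storedDocFreq.merge(term, 1, Integer::sum);
        }
        return maxDoc++;
    }

    // Deleting only flips a bit; the stored per-term counts are untouched.
    void deleteDocument(int docId) {
        deleted.set(docId);
    }

    // Returns the count recorded at index time, ignoring the deleted bits.
    int docFreq(String term) {
        return storedDocFreq.getOrDefault(term, 0);
    }
}
```

In this model, indexing two documents that contain "lucene" and then deleting one of them still leaves `docFreq("lucene")` at 2, which is the stale value the issue describes.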

        Activity

        John Wang added a comment -

        Test showing docFreq not updated when there are deletes.

        John Wang added a comment -

        I understand this is a rather difficult problem to fix. I thought keeping a Jira ticket would still be good for tracking purposes. I will let the committers decide on the urgency of this issue.

        Michael McCandless added a comment -

        John, do you have cases in practice where this is causing problems?

        I understand the problem, and it's certainly real, and is not easy to fix "automatically", but I'm wondering in practice whether the difference in the resulting scores is ever significant.

        I suppose we could make a "fixTermCounts()" method, which takes a looong time as it iterates through the postings for each term to compute the actual count, and then writes a new terms dict. The app would have to manually call this method.
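
As a rough sketch of the per-term work such a method would have to do: walk the term's postings and count only the documents that are not marked deleted. This is not a Lucene API; `LiveDocFreq.count` is a hypothetical helper, and representing postings as an `int[]` and deletions as a `BitSet` is an assumption made for illustration.

```java
import java.util.BitSet;

// Hypothetical helper, NOT part of Lucene: recomputes one term's document
// frequency by scanning its postings and skipping deleted documents.
final class LiveDocFreq {
    // postings: ids of the documents containing the term
    // deleted:  a set bit marks the document with that id as deleted
    static int count(int[] postings, BitSet deleted) {
        int live = 0;
        for (int docId : postings) {
            if (!deleted.get(docId)) {
                live++;
            }
        }
        return live;
    }
}
```

The scan is linear in the length of the postings list, so repeating it for every term touches every posting in the index, which is why the full pass would be slow.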

        John Wang added a comment -

        Michael: We actually ran into this in facet search. When there is a null search, instead of counting the results of a MatchAllDocsQuery, we were just using the docFreq() method to avoid facet counting. The problem came when there were updates. We did get around it, but the workaround was rather cumbersome.

        I agree the fix is non-trivial; I just wanted to open an issue for tracking purposes in case we think of something.

        Matt Chaput added a comment - edited

        Given how fundamental the issue is w.r.t. how Lucene stores the index, it's unlikely to ever be fixed. (A clean, performant fix other than simply merging the segments would be a pretty incredible revelation.) As an outside observer I would argue against keeping the bug open forever for correctness' sake.

        Mark Miller added a comment -

        This is a dupe I believe, but for the life of me, I cannot find the original to link them.

        Mark Miller added a comment -

        > As an outside observer I would argue against keeping the bug open forever for correctness' sake.

        I agree - it's not really a bug. It's by design.

        > I suppose we could make a "fixTermCounts()" method, which takes a looong time as it iterates through the postings for each term to compute the actual count,

        Just call expungeDeletes?

        +1 on closing.
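
For readers wondering why merging makes the counts accurate again, here is a simplified sketch of the idea: rewriting a segment drops deleted documents from every postings list, so the per-term counts written afterwards reflect only live documents. This is a toy model and not Lucene's merge or expungeDeletes implementation; `ToyMerge.expunge` and its data layout are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model, NOT Lucene's implementation: rewrites postings without the
// deleted documents and renumbers the surviving documents compactly.
final class ToyMerge {
    static Map<String, List<Integer>> expunge(
            Map<String, List<Integer>> postings, BitSet deleted, int maxDoc) {
        // Map each old doc id to its new id, or -1 if the doc was deleted.
        int[] newId = new int[maxDoc];
        int next = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            newId[doc] = deleted.get(doc) ? -1 : next++;
        }
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            List<Integer> live = new ArrayList<>();
            for (int doc : entry.getValue()) {
                if (newId[doc] >= 0) {
                    live.add(newId[doc]);
                }
            }
            // A term whose documents were all deleted disappears entirely.
            if (!live.isEmpty()) {
                merged.put(entry.getKey(), live);
            }
        }
        return merged;
    }
}
```

After the rewrite, the size of each postings list is the true document frequency. The cost is that every posting is read and written again, so this is a write-side operation rather than something a reader can do cheaply.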

        John Wang added a comment -

        Maybe just add a Javadoc comment on the method to explain the behavior in this case?

        docFreq is often called in a read-only context, and calling expungeDeletes in that context is not a good idea.

        I agree it is not trivial to fix while keeping the performance. I don't mind closing the bug either.

        Michael McCandless added a comment -

        Just changing resolution to wontfix ...


          People

          • Assignee: Unassigned
          • Reporter: John Wang
          • Votes: 0
          • Watchers: 1
