Lucene - Core / LUCENE-5200

HighFreqTerms has confusing behavior with -t option

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, 6.0
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New

      Description

       * <code>HighFreqTerms</code> class extracts the top n most frequent terms
       * (by document frequency) from an existing Lucene index and reports their
       * document frequency.
       * <p>
       * If the -t flag is given, both document frequency and total tf (total
       * number of occurrences) are reported, ordered by descending total tf.
      

      Problem #1:
      What happens with -t is tricky: if you ask for the top-100 terms, it requests the top-100 terms (by docFreq), then re-sorts those top-N by totalTermFreq.

      So it's not really the top 100 most frequently occurring terms.
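
A minimal sketch of Problem #1 in plain Java, with no Lucene dependency. The `Stat` class, the sample numbers, and the method names are hypothetical and only illustrate the ordering issue; this is not the HighFreqTerms code. It contrasts "select top-N by docFreq, then re-sort by totalTermFreq" with "select top-N by totalTermFreq directly".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TopTermsDemo {
    /** Hypothetical holder for per-term statistics; not a Lucene class. */
    static final class Stat {
        final String term;
        final int docFreq;
        final long totalTermFreq;

        Stat(String term, int docFreq, long totalTermFreq) {
            this.term = term;
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    // Both comparators order from highest to lowest.
    static final Comparator<Stat> BY_DOC_FREQ =
        Comparator.comparingInt((Stat s) -> s.docFreq).reversed();
    static final Comparator<Stat> BY_TOTAL_TERM_FREQ =
        Comparator.comparingLong((Stat s) -> s.totalTermFreq).reversed();

    /** Returns the first n stats under the given ordering. */
    static List<Stat> topN(List<Stat> stats, int n, Comparator<Stat> order) {
        List<Stat> sorted = new ArrayList<>(stats);
        sorted.sort(order);
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    /** The behavior described above: select by docFreq, then re-sort the survivors. */
    static List<String> selectThenResort(List<Stat> stats, int n) {
        List<Stat> top = topN(stats, n, BY_DOC_FREQ);
        top.sort(BY_TOTAL_TERM_FREQ);
        return names(top);
    }

    /** What a user asking for "top n by total tf" probably expects. */
    static List<String> selectByTotalTermFreq(List<Stat> stats, int n) {
        return names(topN(stats, n, BY_TOTAL_TERM_FREQ));
    }

    static List<String> names(List<Stat> stats) {
        List<String> out = new ArrayList<>();
        for (Stat s : stats) {
            out.add(s.term);
        }
        return out;
    }

    /** Made-up sample: "lucene" is in few documents but occurs very often. */
    static List<Stat> sample() {
        return Arrays.asList(
            new Stat("the", 100, 500L),
            new Stat("a", 90, 300L),
            new Stat("lucene", 10, 1000L));
    }

    public static void main(String[] args) {
        System.out.println(selectThenResort(sample(), 2));      // [the, a]
        System.out.println(selectByTotalTermFreq(sample(), 2)); // [lucene, the]
    }
}
```

In this made-up data the term with the highest total tf never appears in the first result, because it was filtered out by docFreq before the re-sort.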

      Problem #2:
      Using the -t option can be confusing and slow: the reported docFreq includes deletions, but totalTermFreq does not (it actually walks postings lists if there is even one deletion).

      I think this is a relic from 3.x days when Lucene did not support this statistic. I think we should just always output both TermsEnum.docFreq() and TermsEnum.totalTermFreq(), and have -t only determine the comparator of the PQ.
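
The proposal can be sketched roughly as follows, using `java.util.PriorityQueue` instead of Lucene's own priority queue. The `TermStats` holder, the `comparatorFor` and `topN` helpers, and the boolean flag are assumptions made for illustration; the committed fix may be structured differently. Both statistics are always carried, and the flag only selects the comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class PqComparatorSketch {
    /** Hypothetical holder; always carries both statistics. */
    static final class TermStats {
        final String term;
        final int docFreq;
        final long totalTermFreq;

        TermStats(String term, int docFreq, long totalTermFreq) {
            this.term = term;
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }

        @Override
        public String toString() {
            return term + " docFreq=" + docFreq + " totalTermFreq=" + totalTermFreq;
        }
    }

    /** The flag (standing in for -t) only chooses the ordering; ascending for a min-heap. */
    static Comparator<TermStats> comparatorFor(boolean byTotalTermFreq) {
        if (byTotalTermFreq) {
            return Comparator.comparingLong((TermStats s) -> s.totalTermFreq);
        }
        return Comparator.comparingInt((TermStats s) -> s.docFreq);
    }

    /** Keeps the n largest entries under the chosen comparator, highest first. */
    static List<TermStats> topN(Iterable<TermStats> all, int n, boolean byTotalTermFreq) {
        Comparator<TermStats> cmp = comparatorFor(byTotalTermFreq);
        PriorityQueue<TermStats> pq = new PriorityQueue<>(cmp); // head is the smallest
        for (TermStats s : all) {
            pq.add(s);
            if (pq.size() > n) {
                pq.poll(); // drop the current smallest so only n entries remain
            }
        }
        List<TermStats> out = new ArrayList<>(pq);
        out.sort(cmp.reversed());
        return out;
    }

    /** Made-up sample data. */
    static List<TermStats> sample() {
        return Arrays.asList(
            new TermStats("the", 100, 500L),
            new TermStats("a", 90, 300L),
            new TermStats("lucene", 10, 1000L));
    }

    public static void main(String[] args) {
        for (TermStats s : topN(sample(), 2, true)) {
            System.out.println(s); // both statistics are printed regardless of the flag
        }
    }
}
```

Because selection and ordering use the same comparator, the result for the flag-on case is the true top n by totalTermFreq rather than a re-sorted docFreq selection.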

        Activity

        ASF subversion and git services added a comment -

        Commit 1520615 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1520615 ]

        LUCENE-5200: HighFreqTerms has confusing behavior with -t option

        ASF subversion and git services added a comment -

        Commit 1520616 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1520616 ]

        LUCENE-5200: HighFreqTerms has confusing behavior with -t option

        Adrien Grand added a comment -

        4.5 release -> bulk close


          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0
            Watchers:
            3
