Lucene - Core: LUCENE-2393

Utility to output total term frequency and df from a lucene index

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This is a pair of command line utilities that provide information on the total number of occurrences of a term in a Lucene index. The first takes a field name, term, and index directory and outputs the document frequency for the term and the total number of occurrences of the term in the index (i.e. the sum of the tf of the term for each document). The second reads the index to determine the top N most frequent terms (by document frequency) and then outputs a list of those terms along with the document frequency and the total number of occurrences of the term. Both utilities are useful for estimating the size of the term's entry in the *prx files and consequent Disk I/O demands.
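
      As a rough illustration of the two statistics these utilities report (toy data, not the patch code itself), the sketch below treats a postings list as a docId -> tf map: document frequency is the number of entries, and total term frequency is the sum of the per-document tf values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: the map stands in for a Lucene postings list,
// mapping docId -> term frequency within that document.
public class TermFreqSketch {
    public static void main(String[] args) {
        Map<Integer, Integer> postings = new LinkedHashMap<>();
        postings.put(3, 2);   // doc 3 contains the term twice
        postings.put(7, 5);   // doc 7 contains it five times
        postings.put(12, 1);  // doc 12 contains it once

        int docFreq = postings.size();  // number of docs containing the term
        long totalTermFreq = 0;         // sum of tf over those docs
        for (int tf : postings.values()) {
            totalTermFreq += tf;
        }
        System.out.println("docFreq=" + docFreq
            + " totalTermFreq=" + totalTermFreq); // docFreq=3 totalTermFreq=8
    }
}
```

      The total term frequency is the number that tracks *prx size, since (roughly) each occurrence of the term contributes a position entry.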

      1. ASF.LICENSE.NOT.GRANTED--LUCENE-2393.patch
        12 kB
        Tom Burton-West
      2. ASF.LICENSE.NOT.GRANTED--LUCENE-2393.patch
        11 kB
        Tom Burton-West
      3. ASF.LICENSE.NOT.GRANTED--LUCENE-2393.patch
        4 kB
        Tom Burton-West
      4. LUCENE-2393.patch
        22 kB
        Tom Burton-West
      5. LUCENE-2393.patch
        17 kB
        Michael McCandless
      6. LUCENE-2393.patch
        21 kB
        Tom Burton-West
      7. LUCENE-2393.patch
        12 kB
        Tom Burton-West
      8. LUCENE-2393-3x.patch
        23 kB
        Michael McCandless
      9. LUCENE-2393-3xbranch.patch
        22 kB
        Tom Burton-West

        Activity

        Tom Burton-West added a comment - edited

        Patch against recent trunk. Can someone please suggest an appropriate existing unit test to use as a model for creating a unit test for this? Would it be appropriate to include a small index file for testing, or is it better to programmatically create the index file?

        Tom Burton-West added a comment -

        For an example of how this utility can be used, please see: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1

        Otis Gospodnetic added a comment -

        I think creating a small index with a couple of docs would be the way to go.

        Michael McCandless added a comment -

        Programmatically indexing those docs is fine – most tests make a MockRAMDir, index a few docs into it, and test against that.

        This tool looks useful, thanks Tom!

        Note that with flex scoring (LUCENE-2392) we are planning on storing this statistic (sum of tf for the term across all docs) in the terms dict, for fields that enable statistics. So when that lands, this tool can pull from that, or regenerate it if the field didn't store stats.

        Mark Miller added a comment -

        Perhaps this should be combined with the high freq terms tool ... we could make a ton of these little guys, so it's probably best to consolidate them.

        Tom Burton-West added a comment -

        New patch includes a (pre-flex) version of HighFreqTerms that finds the top N terms with the highest docFreq, looks up the total term frequency, and outputs the list of terms sorted by highest term frequency (which approximates the largest entries in the *prx files). I'm not sure how to combine the GetTermInfo program with either version of HighFreqTerms in a way that leads to sane command line arguments and argument processing. I suppose that HighFreqTerms could have a flag that turns on or off the inclusion of total term frequency.

        Tom Burton-West added a comment -

        Updated HighFreqTermsWithTF to use the flex API.

        I don't understand the flex API well enough yet to determine whether I should have used DocsEnum.read()/DocsEnum.getBulkResult() to do a bulk read instead of DocsEnum.nextDoc() and DocsEnum.freq().

        Michael McCandless added a comment -

        Patch looks good Tom – thanks for cutting over to flex. You could in fact use the bulk read API here; it'd be faster. But performance isn't a big deal here.

        Maybe you should require a field instead of defaulting to "ocr"?

        Why does GetTermInfo.getTermInfo take a String[] fields (it's not used I think)?

        Probably we should cutover to BytesRef here too, eg TermInfoWithTotalTF?

        Maybe you could share the code between HighFreqTermsWithTF.getTermFreqOrdered & GetTermInfo.getTermInfo? (They both loop, summing up the .freq() of each doc to get the total term freq).

        Small typo in javadoc thier -> their.

        Tom Burton-West added a comment -

        Revised patch updates everything to flex:
        • Replaced all references to Term with BytesRef and field.
        • GetTermInfo now requires a field instead of defaulting to "ocr".
        • Removed unused String[] fields argument.
        • GetTermInfo now uses shared code HighFreqTermsWithTF.getTotalTF() to get total tf.
        • Removed GetTermInfo dependency on TermInfoWithTotalTF[] and inlined it into HighFreqTermsWithTF.

        Still don't understand the bulk read API, but given that I have indexes with *frq files of 60GB I'd like to use it. Is there some documentation, code, or a test case I might look at ?

        Michael McCandless added a comment -

        Still don't understand the bulk read API, but given that I have indexes with *frq files of 60GB I'd like to use it. Is there some documentation, code, or a test case I might look at ?

        I just committed some small improvements to the javadocs for this – can you look and see if it's understandable now?

        Also, have a look at oal.search.TermScorer – it consumes the bulk API.

        Michael McCandless added a comment -

        Thanks for the updated patch Tom... feedback:

        • Maybe do away with the "hack to allow tokens with whitespace"?
          One should use quotes with their shell for this. (And, e.g., the hack
          doesn't work with tokens that have two spaces.)
        • Can you rename things like total_tf --> totalTF (consistent w/
          Lucene's standard code style)?
        • Maybe rename TermInfoWithTotalTF -> TermStats? (It also has
          .docFreq.)
        • Maybe rename TermInfoWithTotalTF.termFreq -> .totalTermFreq?
        • Maybe rename .getTermFreqOrdered -> .sortByTotalTermFreq?
        • You don't really need a priority queue for the
          getTermFreqOrdered case. Instead, just fill in the
          .totalTermFreq and then do a normal sort (make a
          Comparator<TermStats> that sorts by the .totalTermFreq).
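
        The last two suggestions above (rename to TermStats, sort with a Comparator instead of a priority queue) can be sketched in isolation. The class and field names below follow the suggested renames and are illustrative, not the committed Lucene code.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of a TermStats holder sorted by totalTermFreq with a plain
// Comparator sort, as suggested, rather than maintaining a priority queue.
public class TermStatsSortSketch {
    static class TermStats {
        final String term;
        final int docFreq;
        long totalTermFreq;
        TermStats(String term, int docFreq, long totalTermFreq) {
            this.term = term;
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    // Once every entry's totalTermFreq is filled in, a normal descending
    // sort is all that's needed.
    static void sortByTotalTermFreq(TermStats[] stats) {
        Arrays.sort(stats, new Comparator<TermStats>() {
            public int compare(TermStats a, TermStats b) {
                return Long.compare(b.totalTermFreq, a.totalTermFreq);
            }
        });
    }

    public static void main(String[] args) {
        TermStats[] stats = {
            new TermStats("of", 120, 900L),
            new TermStats("the", 150, 2000L),
            new TermStats("ocr", 10, 50L),
        };
        sortByTotalTermFreq(stats);
        for (TermStats s : stats) {
            System.out.println(s.term + " docFreq=" + s.docFreq
                + " totalTF=" + s.totalTermFreq);
        }
    }
}
```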
        Tom Burton-West added a comment -

        Added unit tests. Made changes outlined by Mike. Still working on BulkRead.

        Tom Burton-West added a comment -

        Patch that includes unit tests and the changes outlined in Mike's comment.

        Tom Burton-West added a comment -

        Updated to use the BulkResult API.

        Michael McCandless added a comment -

        Patch looks good Tom!

        I cleaned things up a bit – eg, you don't need to use the class members when interacting w/ the bulk DocsEnum API.

        I think it's ready to go in!

        Michael McCandless added a comment -

        I think we should just replace the current HighFreqTerms with the HighFreqTermsWithTF?

        Tom Burton-West added a comment -

        Hi Mike,

        Thanks for all your help.

        If we replace the current HighFreqTerms with the HighFreqTermsWithTF should there be a command line switch so that you could ask for the default behavior of the current HighFreqTerms? Or perhaps the default should be the current behavior and the switch should turn on the additional step of gathering and reporting on the totalTF for the terms.

        I haven't benchmarked it, but I'm wondering if getting the totalTF could take a significant additional amount of time for large indexes. When I ask for the top 10,000 terms using HighFreqTermsWithTF for our 500,000-document indexes, it takes about 40 minutes to an hour. I'm guessing that most of that time is taken in the first step of getting the top 10,000 terms by docFreq, but it still seems that reading the data and calculating the totalTF for 10,000 terms might be a significant enough fraction of the total time that an option to skip that step would be useful.

        Tom

        Michael McCandless added a comment -

        Tom, I agree, we should make it optional to compute the totalTF, and probably default it to off? Can you tweak the latest patch to do this?

        Tom Burton-West added a comment -

        I tweaked the latest patch to mimic the current HighFreqTerms unless you give it a -t argument. However, while testing the argument parsing I found a bug I suspect I introduced into the patch a few versions ago. I am in the process of writing a unit test to exercise the bug, and then will fix the bug and post both tests and code.

        Tom Burton-West added a comment -

        Rewrote argument processing so the default behavior is that of HighFreqTerms. The field and number of terms are now both optional, with the defaults being all fields and 100 terms (the same defaults as the current HighFreqTerms). If the -t flag is used, the totalTermFreq stats will be read, calculated, and displayed.

        The bug surfaced when not specifying a field. Added test data with multiple fields and tests to check that correct results are returned with and without a field being specified. Fixed bug and new tests pass.

        With the increasing number of options, I started thinking about more robust command line argument processing. I'm used to languages where there is a commonly used Getopt(s) library. There appear to be several for Java with different features, different levels of active development and different licenses. Is it worth the overhead of using one, and if so which one would be the best to use?

        Tom

        Michael McCandless added a comment -

        Patch looks good Tom! I'll re-merge my small changes from the prior patch, add a CHANGES, and commit.

        I don't think we need to upgrade to a CL processing lib...

        Michael McCandless added a comment -

        Thanks Tom!

        Tom Burton-West added a comment -

        Since many people will want to use branch 3.x instead of trunk, I back-ported the flex version to 3x (patched against http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene : 955141).
        Mike, can this be committed to branch_3x?

        Tom

        Michael McCandless added a comment -

        Thanks Tom!

        Reopening for backport to 3x....

        Michael McCandless added a comment -

        New patch, just cleans up a few minor things...

        Grant Ingersoll added a comment -

        Bulk close for 3.1


          People

          • Assignee: Michael McCandless
          • Reporter: Tom Burton-West
          • Votes: 0
          • Watchers: 2