Lucene - Core / LUCENE-474

High Frequency Terms/Phrases at the Index level

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: modules/other
    • Labels:
      None

      Description

      We should be able to find all the high-frequency terms/phrases (where frequency is the search criterion / benchmark).

      1. colloc.zip
        5 kB
        Mark Harwood
      2. collocations.zip
        39 kB
        Ivan Provalov

        Activity

        Erick Erickson added a comment -

        It's been about 2-1/2 years since anyone touched this, and I suspect that much of the underlying terms data is now available so I'll close this. We can re-open if there's interest. SPRING_CLEANING_2013

        Ivan Provalov added a comment -

        Included the scoring in the CollocationsSearcher which now will return the LinkedHashMap of Collocated Terms and their scores relative to the query term. Did some minor refactoring and changed the test.

        Ivan Provalov added a comment -

        I saw some activity on the term collocations in the lucene user forum recently and decided to make a few changes to the colloc.zip package which Mark worked on. I used it before and it worked well for my project.

        I ended up doing some fixes and refactoring and adding a couple of unit tests, as well as a new class which searches for the collocated terms when given a term. This version works with Lucene 3.0.2. I also changed the package names, added the license verbiage, and added Maven and Ant builds for contrib packaging.

        If Mark is OK with these changes, it could be published as a contrib.

        Otis Gospodnetic added a comment -

        Mark,
        Can we:

        • change package names to o.a.l.index.collocations
        • slap ASL to all source code
        • reformat to fit Lucene
        • add contrib-style build.xml
        • svn add + svn diff to get a patch
        • pray for unit tests but commit to contrib/collocations even if you don't have time for them
        • anything else?
        Grant Ingersoll added a comment -

        Hi Mark,

        I looked at this zip, and it seems useful, but are you intending to donate it? If so, can we get a patch?

        Mark Harwood added a comment -

        It looks like you will need a later version. Try checking out the latest code from Subversion.

        Mark

        Suri Babu B added a comment -

        Hi Mark,

        I tried running your classes but could not see any output
        because a ClassCastException was thrown at the following line:

        //get TermPositions for matching doc
        TermPositionVector tpv = (TermPositionVector) reader.getTermFreqVector(docId, fieldName);

        While indexing, I added the contents field as follows:

        Field.Text("contents", fileInfo.getReader(),true); // isStoreTermVector to true

        I also found some mismatches between the Field class that I have and the Field class that you refer to in the CollocationIndexer class.

        I am using Lucene 1.4.3 and noticed that 1.4.3 does not have an implementation of TermPositionVector.

        Please let me know whether I am using an old version or need to apply some patches in my environment.

        Thanks
        Suri

        Mark Harwood added a comment -

        Here's some code that I've used before to find phrases in an index - see CollocationFinder.java.
        If your index has term vector support enabled you can run it to mine the collocated terms. This is typically a long operation that you don't want to do too often.
        The CollocationIndexer can be used to store the mined collocations in an index.

        Possible uses for collocations are:

        • automatically identifying candidate terms in a query that can be turned into a phrase query
        • better spelling correction by using all terms in query as context to pick the most likely spelling variation

        Haven't done too much with this code but I've added it here because it sounds like it could be relevant

        Cheers
        Mark
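
The first use Mark lists (turning adjacent query terms into a phrase query) could look roughly like the sketch below. It is only an illustration: `QueryPhraseSketch` and `quoteCollocations` are hypothetical names that do not come from colloc.zip or from Lucene, and the set of mined collocations is assumed to be available as plain strings of two space-separated terms.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene or the attached colloc.zip.
class QueryPhraseSketch {

    /**
     * Wraps each adjacent pair of query terms that is a known collocation in
     * double quotes, leaving all other terms untouched.
     */
    static String quoteCollocations(List<String> queryTerms, Set<String> collocations) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < queryTerms.size()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            String current = queryTerms.get(i);
            if (i + 1 < queryTerms.size()) {
                String pair = current + " " + queryTerms.get(i + 1);
                if (collocations.contains(pair)) {
                    out.append('"').append(pair).append('"');
                    i += 2; // both terms were consumed by the phrase
                    continue;
                }
            }
            out.append(current);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed input: one collocation mined earlier from the index.
        Set<String> mined = Set.of("session bean");
        System.out.println(
            quoteCollocations(List.of("stateless", "session", "bean", "tutorial"), mined));
    }
}
```

The resulting string would still have to be handed to a query parser; the sketch only decides which terms to group.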

        Otis Gospodnetic added a comment -

        Using JIRA for discussion? Why, when you can use java-user@lucene mailing list for that?
        You can figure out common/frequent phrases using the existing Lucene API by keeping track of terms and their positions. The naive way may be slow and memory intensive.
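
A minimal sketch of the naive approach described here, written in plain Java rather than against the Lucene API: it assumes that each document's terms and positions have already been read into a `Map<String, int[]>` (for example from stored term vectors), and it counts two-term phrases only. `AdjacentPairSketch` and `addDocument` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: counts adjacent term pairs per index.
class AdjacentPairSketch {

    /**
     * Adds one document's adjacent term pairs to the running index-level counts.
     * positionsByTerm maps each term to the positions where it occurs in the document.
     */
    static void addDocument(Map<String, int[]> positionsByTerm, Map<String, Integer> pairCounts) {
        // Invert term -> positions into position -> term for this document.
        TreeMap<Integer, String> termAt = new TreeMap<>();
        for (Map.Entry<String, int[]> e : positionsByTerm.entrySet()) {
            for (int position : e.getValue()) {
                termAt.put(position, e.getKey());
            }
        }
        // A pair counts only when the second term sits at the very next position.
        for (Map.Entry<Integer, String> e : termAt.entrySet()) {
            String next = termAt.get(e.getKey() + 1);
            if (next != null) {
                pairCounts.merge(e.getValue() + " " + next, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> pairCounts = new HashMap<>();
        // Assumed document text: "session bean container session bean"
        addDocument(
            Map.of("session", new int[] {0, 3}, "bean", new int[] {1, 4}, "container", new int[] {2}),
            pairCounts);
        System.out.println(pairCounts);
    }
}
```

The `pairCounts` map holds one entry per distinct pair seen anywhere in the index, which is where the memory cost of the naive approach comes from.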

        Suri Babu B added a comment -

        High-frequency phrases are like high-frequency terms, but they consist of multiple terms that are repeated together in the index.

        Let's say
        the X document has the phrase "Session Bean" 12 times
        the Y document has the phrase "Session Bean" 2 times
        the Y document has the term "Bean" 3 times
        the Z document has the term "Bean" 5 times

        so I should get output like the following:

        Phrase/Term     Frequency
        ------------    ---------
        Session Bean    14
        Bean            8
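
The aggregation in this example can be sketched as follows. This is plain Java, not Lucene code: it assumes the per-document counts are already known, and `IndexLevelTotalsSketch` and `totals` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: sums per-document counts into index-level totals.
class IndexLevelTotalsSketch {

    /**
     * countsByDocument maps a document name to its phrase/term counts.
     * Returns index-level totals, ordered from most to least frequent.
     */
    static Map<String, Integer> totals(Map<String, Map<String, Integer>> countsByDocument) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map<String, Integer> docCounts : countsByDocument.values()) {
            for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(sums.entrySet());
        sorted.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : sorted) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // The X, Y and Z documents from the example above.
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        counts.put("X", Map.of("Session Bean", 12));
        counts.put("Y", Map.of("Session Bean", 2, "Bean", 3));
        counts.put("Z", Map.of("Bean", 5));
        totals(counts).forEach((phrase, total) -> System.out.println(phrase + "\t" + total));
    }
}
```

As in the example, the total for "Bean" covers only the occurrences reported for the standalone term; occurrences inside "Session Bean" are not added to it.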

        Pasha Bizhan added a comment -

        I understand what high-frequency terms are. But what are high-frequency phrases?
        Could you please explain your index structure?

        Suri Babu B added a comment -

        HighFreqTerms.java, available in the misc package, is about terms that have a high document frequency.
        My actual requirement is this:

        I have a set of documents that are indexed.
        I need to find the high-frequency terms as well as phrases at the index level, not the document level.

        I am able to find the high-frequency terms by iterating through the termDocs.

        But how do I find the high-frequency phrases that occur in the index?

        Pasha Bizhan added a comment -

        Look for the HighFreqTerms package in the contrib area: http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/miscellaneous/src/java/org/apache/lucene/misc/HighFreqTerms.java?rev=164963&view=log

          People

          • Assignee:
            Otis Gospodnetic
            Reporter:
            Suri Babu B
          • Votes:
            0
            Watchers:
            4
