Lucene - Core / LUCENE-1910

Extension to MoreLikeThis to use tag information

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/other
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      I would like to contribute a class based on the MoreLikeThis class in
      contrib/queries that generates a query based on the tags associated
      with a document. The class assumes that documents are tagged with a
      set of tags (which are stored in the index in a separate Field). The
      class determines the top document terms associated with a given tag
      using the information gain metric.

      While generating a MoreLikeThis query for a document, the tags
      associated with the document are used to determine the terms in the
      query. This class is useful for finding documents similar to one that
      does not have many relevant terms but has been tagged.
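
      As an illustration of the metric mentioned above, here is a minimal
      sketch of information gain for one term with respect to one tag,
      computed from document counts. It is not taken from the attached patch:
      the class name, method names, and the choice of a binary (tagged / not
      tagged) formulation are assumptions made for this example.

```java
/** Sketch only: information gain of a term for a tag, from document counts. */
final class InformationGainSketch {

  /** Base-2 logarithm. */
  static double log2(double x) {
    return Math.log(x) / Math.log(2.0);
  }

  /** Entropy in bits of a yes/no event with probability p (0 when p is 0 or 1). */
  static double entropy(double p) {
    if (p <= 0.0 || p >= 1.0) {
      return 0.0;
    }
    return -(p * log2(p) + (1.0 - p) * log2(1.0 - p));
  }

  /**
   * IG(tag; term) = H(tag) - [P(term) * H(tag | term) + P(no term) * H(tag | no term)].
   *
   * @param numDocs      total number of documents
   * @param docsWithTag  documents carrying the tag
   * @param docsWithTerm documents containing the term
   * @param docsWithBoth documents carrying the tag and containing the term
   */
  static double informationGain(long numDocs, long docsWithTag,
                                long docsWithTerm, long docsWithBoth) {
    if (numDocs <= 0) {
      throw new IllegalArgumentException("numDocs must be positive");
    }
    long docsWithoutTerm = numDocs - docsWithTerm;
    double pTerm = (double) docsWithTerm / numDocs;

    // Uncertainty about the tag before looking at the term.
    double priorEntropy = entropy((double) docsWithTag / numDocs);

    // Uncertainty about the tag once we know whether the term occurs.
    double entropyGivenTerm = docsWithTerm == 0
        ? 0.0
        : entropy((double) docsWithBoth / docsWithTerm);
    double entropyGivenNoTerm = docsWithoutTerm == 0
        ? 0.0
        : entropy((double) (docsWithTag - docsWithBoth) / docsWithoutTerm);

    return priorEntropy
        - (pTerm * entropyGivenTerm + (1.0 - pTerm) * entropyGivenNoTerm);
  }

  public static void main(String[] args) {
    // A term that occurs in exactly the tagged half of the collection.
    System.out.println(informationGain(100, 50, 50, 50));
    // A term that occurs equally often in tagged and untagged documents.
    System.out.println(informationGain(100, 50, 50, 25));
  }
}
```

      Under this formulation, a term that occurs in exactly the tagged
      documents removes all uncertainty about the tag, while a term
      distributed independently of the tag has zero gain. Ranking terms by
      this score and keeping the highest ones would give the "top document
      terms" for a tag.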

      1. LUCENE-1910.patch
        40 kB
        Thomas D'Silva

        Activity

        Thomas D'Silva created issue -
        Thomas D'Silva made changes -
        Field Original Value New Value
        Attachment LUCENE-1910.patch [ 12419443 ]
        Thomas D'Silva made changes -
        Attachment LUCENE-1910.patch [ 12419444 ]
        Thomas D'Silva made changes -
        Attachment LUCENE-1910.patch [ 12419443 ]
        Thomas D'Silva made changes -
        Priority Major [ 3 ] Minor [ 4 ]
        Mark Harwood added a comment -

        Hi Thomas,
        Following your request for feedback, some initial thoughts from a very quick look.

        • The "Information Gain" algo could use a little more explanation e.g. using variable names other than "num1" and "num2" and could perhaps be extracted into a utility class
        • Is this scalable? It looks like in initialize it is loading this:
          MoreLikeThisUsingTags.java
            /**
             * All terms in the index
             */
            protected HashSet docTerms = new HashSet();

          ..that seems a little scary!
          It's also doing a separate BooleanQuery for every item in this list (and repeated for >1 tag?). That looks like a lot of searches.

        I need to spend a little more time looking at it before I understand it in more detail.
        Before then - have you tested this on a big (millions of docs/terms) index? Some performance figures would be useful to accompany this.

        Cheers,
        Mark

        Thomas D'Silva added a comment - - edited

        Mark,

        I refactored the class to include more descriptive variable names. I also modified the code so that, while calculating information gain, only terms belonging to documents that have been tagged with the given tag are used (and not all the terms in the index).
        I tested this class on a test index containing one million documents. The documents were tagged with five tags (tag_0...tag_4). tag_0 was assigned to approximately 10% of the documents, tag_1 to 20% and so on.

        tag name, number of documents, time in ms
        tag_0, 10134, 137314
        tag_1, 19996, 219527
        tag_2, 30010, 315336
        tag_3, 39907, 413615
        tag_4, 50148, 507350

        The time taken to generate the query for a tag depends on the number of documents in the index containing the tag and scales linearly with the number of documents.
        The top document terms for a given tag are cached in a hashmap once they have been generated, in order to speed up subsequent lookups.

        Thanks,
        Thomas
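
        The caching step described in this comment could look roughly like
        the sketch below. It is an illustration only, not the patch's code:
        TopTermsCache, topTermsFor, and the injected computeTopTerms function
        are hypothetical names, and the sketch uses Java 8 features
        (Map.computeIfAbsent, lambdas) that the original patch would not have
        had available.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch only: remembers the top terms per tag so they are computed once. */
final class TopTermsCache {
  private final Map<String, List<String>> cache = new HashMap<>();
  private final Function<String, List<String>> computeTopTerms;
  private int computations = 0;

  /** @param computeTopTerms the expensive per-tag computation (hypothetical) */
  TopTermsCache(Function<String, List<String>> computeTopTerms) {
    this.computeTopTerms = computeTopTerms;
  }

  /** Returns the cached terms for the tag, computing them on first request. */
  List<String> topTermsFor(String tag) {
    return cache.computeIfAbsent(tag, t -> {
      computations++;
      return computeTopTerms.apply(t);
    });
  }

  /** Number of times the expensive computation actually ran. */
  int computations() {
    return computations;
  }
}
```

        Only the first lookup for a tag pays the full cost; later lookups are
        a hash map read. The sketch is not thread-safe, and the cache would
        need to be cleared whenever the underlying index changes.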

        Thomas D'Silva made changes -
        Attachment LUCENE-1910.patch [ 12421221 ]
        Thomas D'Silva made changes -
        Attachment LUCENE-1910.patch [ 12419444 ]
        Mark Harwood added a comment -

        > 2 minutes to create a query based on 10,000 documents?

        Unfortunately, I can't see this being generally useful until the performance is improved dramatically.

        Thomas D'Silva added a comment -

        Mark,

        I refactored the code so that the tag and document probabilities are computed during the index creation phase and used to find the most important document terms corresponding to a given tag term. These terms (ranked by information gain) are stored as meta information in the index when the index is created. I added a class, TagIndexWriter, which extends IndexWriter and creates an index that can be used to run MoreLikeThisUsingTags queries.

        I recreated a test index with one million documents and assigned the tags (tag_0, ..., tag_4) to 10%, 20%, and so on of the documents.

        The time taken to generate a query on an index created using TagIndexWriter:
        tag name, number of documents, time in ms
        tag_0, 10134, 22
        tag_1, 19996, 29
        tag_2, 30010, 6
        tag_3, 39907, 6
        tag_4, 50148, 9

        Since the document terms corresponding to a tag term are computed during the indexing phase, the time taken to generate a MoreLikeThisUsingTags query is constant: it no longer depends on the number of documents carrying the tag.

        Thanks,
        Thomas
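
        To make the index-time approach in this comment concrete, here is a
        minimal sketch of precomputing the best terms per tag and reading
        them back at query time. It does not use the Lucene API and is not
        the TagIndexWriter from the patch: TagTermTable, addTag, and termsFor
        are hypothetical names, and the per-term scores are assumed to be
        information gain values computed elsewhere.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: a tag -> top terms table filled once while indexing. */
final class TagTermTable {
  private final int maxTermsPerTag;
  private final Map<String, List<String>> termsByTag = new HashMap<>();

  TagTermTable(int maxTermsPerTag) {
    this.maxTermsPerTag = maxTermsPerTag;
  }

  /** Index time: keep the highest-scoring terms for the tag, best first. */
  void addTag(String tag, Map<String, Double> scoreByTerm) {
    List<Map.Entry<String, Double>> entries =
        new ArrayList<>(scoreByTerm.entrySet());
    // Sort by score, descending.
    entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));

    List<String> top = new ArrayList<>();
    for (int i = 0; i < Math.min(maxTermsPerTag, entries.size()); i++) {
      top.add(entries.get(i).getKey());
    }
    termsByTag.put(tag, Collections.unmodifiableList(top));
  }

  /** Query time: a single map lookup, independent of how many docs have the tag. */
  List<String> termsFor(String tag) {
    return termsByTag.getOrDefault(tag, Collections.emptyList());
  }
}
```

        The sorting cost is paid once per tag during indexing. Query
        generation then reduces to a lookup plus building one clause per
        stored term, which is consistent with the flat timings reported
        above.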

        Thomas D'Silva made changes -
        Attachment LUCENE-1910.patch [ 12421221 ]
        Thomas D'Silva made changes -
        Attachment LUCENE-1910.patch [ 12426261 ]
        Otis Gospodnetic added a comment -
        • I'll second Mark's suggestion to extract the Information Gain piece of the patch into separate class(es), so we can reuse it in other places. It looks like it's currently an integral part of MoreLikeThisUsingTags class. Would that be possible?
        • I noticed the code needs ASL (the Apache Software License) added.
        • Also, could you please use the Lucene code format? (Eclipse/IntelliJ templates are at the bottom of http://wiki.apache.org/lucene-java/HowToContribute )
        Thomas D'Silva added a comment -

        I extracted the Information Gain code into a separate class. I also added the Apache Software License and reformatted the code to use the Lucene code format. Are there any other changes or modifications that are needed?

        Thanks,
        Thomas

        Thomas D'Silva made changes -
        Attachment LUCENE-1910.patch [ 12426261 ]
        Thomas D'Silva made changes -
        Attachment LUCENE-1910.patch [ 12451015 ]
        Mark Thomas made changes -
        Workflow jira [ 12476847 ] Default workflow, editable Closed status [ 12562630 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12562630 ] jira [ 12583577 ]

          People

          • Assignee: Unassigned
          • Reporter: Thomas D'Silva
          • Votes: 0
          • Watchers: 2
