Lucene - Core / LUCENE-1016

TermVectorAccessor, transparent vector space access

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: 2.4
    • Component/s: core/termvectors
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This class visits a TermVectorMapper and populates it with information transparently, either by passing it down to the default terms cache (documents indexed with Field.TermVector) or by resolving the inverted index.
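
A minimal sketch of the two access paths described above, assuming a toy in-memory index rather than the real Lucene classes. `ToyIndex`, `Mapper`, `add`, and `accept` are hypothetical names made up for illustration; they are not the API of the attached patch.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory index for illustration only; this is NOT the Lucene API. */
final class ToyIndex {

    /** Hypothetical mapper callback: receives one term and its frequency at a time. */
    interface Mapper {
        void map(String term, int frequency);
    }

    // docId -> (term -> frequency), kept only for documents added with storeVector = true
    private final Map<Integer, Map<String, Integer>> storedVectors = new TreeMap<>();
    // term -> (docId -> frequency), kept for every document
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    void add(int docId, boolean storeVector, String... tokens) {
        Map<String, Integer> vector = new TreeMap<>();
        for (String token : tokens) {
            vector.merge(token, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : vector.entrySet()) {
            postings.computeIfAbsent(e.getKey(), k -> new TreeMap<>()).put(docId, e.getValue());
        }
        if (storeVector) {
            storedVectors.put(docId, vector);
        }
    }

    /** Feeds the mapper; returns true if the stored vector was used, false if the inverted index was scanned. */
    boolean accept(int docId, Mapper mapper) {
        Map<String, Integer> stored = storedVectors.get(docId);
        if (stored != null) {
            for (Map.Entry<String, Integer> e : stored.entrySet()) {
                mapper.map(e.getKey(), e.getValue());
            }
            return true;
        }
        // Fallback: visit every term in the index and keep those whose postings contain docId.
        for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
            Integer frequency = e.getValue().get(docId);
            if (frequency != null) {
                mapper.map(e.getKey(), frequency);
            }
        }
        return false;
    }
}
```

Both paths hand the mapper the same (term, frequency) pairs, so the caller does not need to know which one was taken. The fallback costs a pass over every term in the index.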

      1. LUCENE-1016.txt
        11 kB
        Karl Wettin
      2. LUCENE-1016.txt
        10 kB
        Karl Wettin

        Activity

        Karl Wettin added a comment -

        Oops, the prior patch contained some other stuff too by mistake.

        Karl Wettin added a comment -

        TanimotoDocumentSimilarity, depends on TermVectorAccessor, used to calculate the distance between the vector space of two documents.

        My math skills are pretty lame, but I think I got it right.
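
For reference, the Tanimoto coefficient between two term frequency vectors A and B is commonly defined as A·B / (|A|² + |B|² − A·B), and 1 minus that value can serve as a distance. The sketch below computes it over plain maps. `TanimotoSketch` and its method names are hypothetical and are not taken from the attached TanimotoDocumentSimilarity, which may weight terms differently.

```java
import java.util.Map;

/** Illustrative Tanimoto coefficient over term -> frequency maps; not the patch's implementation. */
final class TanimotoSketch {

    private TanimotoSketch() {}

    /** Returns A.B / (|A|^2 + |B|^2 - A.B), a value in [0, 1] for non-negative frequencies. */
    static double coefficient(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int fa = e.getValue();
            normA += (double) fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += (double) fa * fb; // only shared terms contribute to the dot product
            }
        }
        for (int fb : b.values()) {
            normB += (double) fb * fb;
        }
        double denominator = normA + normB - dot;
        // Assumption: two empty vectors are treated as identical.
        return denominator == 0.0 ? 1.0 : dot / denominator;
    }

    /** Distance derived from the coefficient: 0 for identical vectors, 1 for disjoint ones. */
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        return 1.0 - coefficient(a, b);
    }
}
```

Two documents with the same term frequencies get a distance of 0, and two documents sharing no terms get a distance of 1.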

        Karl Wettin added a comment -

        I think a kD-tree will be the next step here. Does that fit in this project, or is that something I should post in UIMA or so?

        Grant Ingersoll added a comment -

        Java 1.5 -> Java 1.4

        Soon, very soon (in Lucene terms), we will have 1.5

        Karl Wettin added a comment -

        Grant Ingersoll - 02/Oct/07 05:40 PM
        > Java 1.5 -> Java 1.4
        >
        > Soon, very soon (in Lucene terms), we will have 1.5

        This is why I placed it in contrib/misc; I was under the impression that contrib allowed 1.5?

        Karl Wettin added a comment -

        Also, don't pay too much attention to the quite ugly code in TanimotoDocumentSimilarity. I'll post something nice and refactored soon. I was just really thrilled that I managed to figure out all those Greek characters in the white paper.

        Karl Wettin added a comment -

        Sorry for flooding. This JIRA issue is drifting further off topic with each post. I hope you don't mind.

        LUCENE-1016-clusterer.txt now contains a refactor of the Tanimoto similarity; it does the same thing, but with less messy code.

        And as the filename hints, I thought it would be fun to demonstrate the similarity by adding a very simple two-dimensional decision tree clusterer.

        For the test I fed it 17 news articles representing 3 news stories I got from Google News. Attached is also a Graphviz diagram that shows the tree with the news stories clustered together. I did not look at how to draw the line between the clusters yet, but I could probably come up with something simple enough. Legend: floating-point numbers represent the distance between two children. The leaves are the actual articles, prefixed with the news story identity and suffixed with the news article identity.

        (The clusterer sure needs optimization; use Carrot instead. This is just me fooling around.)

        Have fun!

        Karl Wettin added a comment -

        TermVectorMapper should probably also be able to extract the term vector from a document prior to it being indexed. That was the original reason for me to introduce tokenStreamValue(). However, I suppose there could be problems with token streams and readers being exhausted.
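
The exhaustion concern can be illustrated with a plain `java.io.Reader`: once it has been read to the end, a second consumer sees no content. This is a generic Java sketch. `ReaderExhaustionSketch` and `drain` are hypothetical names, and nothing here uses the Lucene API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Illustrates that a Reader can only be consumed once; not Lucene code. */
final class ReaderExhaustionSketch {

    private ReaderExhaustionSketch() {}

    /** Reads everything that is left in the reader and returns it as a string. */
    static String drain(Reader reader) {
        StringBuilder content = new StringBuilder();
        char[] buffer = new char[64];
        try {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                content.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }
}
```

Calling `drain` twice on the same `StringReader` returns the text the first time and an empty string the second time, so extracting a term vector before indexing would leave nothing for the indexer unless the content is cached or re-created.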

        Karl Wettin added a comment -

        Karl Wettin - 03/Oct/07 12:52 PM
        > TermVectorMapper should probably also be able..

        TermVectorAccessor, that is.

        Karl Wettin added a comment -

        This patch:

        • All Java 1.4
        • Bugfix: it could previously throw a NullPointerException in some cases

        This patch is TermVectorAccessor code only, nothing else.

        Karl Wettin added a comment -

        In this patch:

        • Java 1.4 for real

        And then I removed everything that had nothing to do with this patch.

        Grant Ingersoll added a comment -

        oops, yes. My bad. I missed that part.

        Karl Wettin added a comment -

        I would not touch this issue until LUCENE-1038 has been accepted or declined.

        Karl Wettin added a comment -

        Now with support for mapper.setDocumentNumber as defined in LUCENE-1038

        Karl Wettin added a comment -

        I think this is interesting:

        http://www.nabble.com/How-to-generate-TermFreqVector-from-an-existing-index-tf4756257.html#a13601345

        I'll have to look into the file format and see if it is possible to persist a term vector retrieved from the inverted index. That could be a nice addition to this issue.

        Grant Ingersoll added a comment - - edited

        I'm curious if the build part of this would be faster than reanalyzing a document. Just thinking out loud, but I have been wondering about a Highlighter that uses the new TermVectorMapper; using that doesn't account for non-TermVector based Documents that need to be analyzed. I was thinking this might account for both cases, all through the TermVectorMapper mechanism. It just doesn't seem like it would be very fast.

        Karl Wettin added a comment -

        > I'm curious if the build part of this would be faster than reanalyzing a document.

        It is a slow process on an index with many terms. Each one has to be iterated and matched against the document number.

        > Just thinking out loud, but I have been wondering about a Highlighter that uses the new TermVectorMapper, but using that doesn't account for non-TermVector based Documents that need to be analyzed. Was thinking this might account for both cases, all through the TermVectorMapper mechanism. Just doesn't seem like it would be very fast.

        This patch is mostly about when you don't have access to the source data. It was used together with a TermVectorMappingCachedTokenStreamFactory to extract re-indexable documents from any directory.

        If you think of this piece of code and the highlighter together, I would consider something else, perhaps a tool that could add the term vector to all documents missing one in a single iteration sweep of the index. I know very little about the file format and the highlighter though.
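
The single-sweep idea can be sketched on a toy inverted index: walk every term once and add it to the vector of each document in its postings, instead of rescanning all terms once per document. `SweepSketch`, `buildVectors`, and the map layout are hypothetical. Writing the result back into a real index is the part the comment leaves open, and it is not attempted here.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Builds term vectors for many documents in one pass over a toy inverted index; not Lucene code. */
final class SweepSketch {

    private SweepSketch() {}

    /**
     * @param postings          term -> (docId -> frequency)
     * @param docsMissingVector ids of the documents that have no stored term vector
     * @return docId -> (term -> frequency) for the requested documents that occur in the postings
     */
    static Map<Integer, Map<String, Integer>> buildVectors(
            Map<String, Map<Integer, Integer>> postings, Set<Integer> docsMissingVector) {
        Map<Integer, Map<String, Integer>> vectors = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> termEntry : postings.entrySet()) {
            String term = termEntry.getKey();
            for (Map.Entry<Integer, Integer> posting : termEntry.getValue().entrySet()) {
                int docId = posting.getKey();
                if (docsMissingVector.contains(docId)) {
                    // Each term is visited once, no matter how many documents need a vector.
                    vectors.computeIfAbsent(docId, k -> new TreeMap<>()).put(term, posting.getValue());
                }
            }
        }
        return vectors;
    }
}
```

The trade-off in this sketch is memory: all reconstructed vectors are held at once, whereas the per-document scan holds only one.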

        Karl Wettin added a comment -

        Documentation

        Karl Wettin added a comment -

        I'll commit this soon.

        Michael McCandless added a comment -

        Looks like you have this one Karl...thanks!


          People

          • Assignee:
            Karl Wettin
            Reporter:
            Karl Wettin
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:
