
Key: LUCENE-1016
Type: New Feature
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Karl Wettin
Reporter: Karl Wettin
Votes: 0
Watchers: 0
Lucene - Java

TermVectorAccessor, transparent vector space access

Created: 02/Oct/07 09:34 PM   Updated: 25/Aug/08 03:03 PM
Component/s: Term Vectors
Affects Version/s: 2.2
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
LUCENE-1016.txt (11 kB), Karl Wettin, 2008-08-24 12:17 PM, licensed for inclusion in ASF works
LUCENE-1016.txt (10 kB), Karl Wettin, 2007-11-04 01:42 AM, licensed for inclusion in ASF works

Lucene Fields: Patch Available, New
Resolution Date: 25/Aug/08 03:03 PM


Description:
This class visits a TermVectorMapper and populates it with information transparently, either by passing the mapper down to the default terms cache (for documents indexed with Field.TermVector) or by resolving the inverted index.
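
The two access paths in the description can be illustrated with a minimal sketch. It is plain Java and deliberately does not use the Lucene API: the class `TransparentVectorSketch`, the `Mapper` interface, and the map-based `storedVectors` and `postings` structures are all hypothetical stand-ins, not the code in the attached patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the idea only; none of these names are Lucene API.
class TransparentVectorSketch {

    /** Receives (term, frequency) pairs, loosely playing the role of a mapper. */
    interface Mapper {
        void map(String term, int frequency);
    }

    // Stored per-document vectors: docId -> (term -> frequency). Some documents have none.
    final Map<Integer, Map<String, Integer>> storedVectors = new HashMap<>();

    // Inverted index: term -> (docId -> frequency).
    final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    /** Populates the mapper for docId, whichever way the data happens to be available. */
    void accept(int docId, Mapper mapper) {
        Map<String, Integer> stored = storedVectors.get(docId);
        if (stored != null) {
            // Fast path: the document was indexed with a stored term vector.
            for (Map.Entry<String, Integer> e : stored.entrySet()) {
                mapper.map(e.getKey(), e.getValue());
            }
            return;
        }
        // Fallback: visit every term and check whether its postings mention docId.
        for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
            Integer frequency = e.getValue().get(docId);
            if (frequency != null) {
                mapper.map(e.getKey(), frequency);
            }
        }
    }

    public static void main(String[] args) {
        TransparentVectorSketch sketch = new TransparentVectorSketch();
        sketch.postings.computeIfAbsent("lucene", k -> new HashMap<>()).put(7, 2);
        // Document 7 has no stored vector, so the fallback path is taken.
        sketch.accept(7, (term, frequency) -> System.out.println(term + " x" + frequency));
    }
}
```

The caller sees the same callback sequence in both cases, which is the sense in which the access is transparent; only the cost differs.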

Comments:
Karl Wettin added a comment - 02/Oct/07 09:42 PM
Oops, the prior patch contained some other stuff too by mistake.

Karl Wettin added a comment - 02/Oct/07 11:29 PM
TanimotoDocumentSimilarity, depends on TermVectorAccessor, used to calculate the distance between the vector space of two documents.

My math skills are pretty lame, but I think I got it right.
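
For readers unfamiliar with the measure, here is a minimal sketch of the commonly used Tanimoto coefficient over term-frequency vectors, T(A, B) = A·B / (|A|² + |B|² − A·B), with distance taken as 1 − T. The thread does not show the formula used in the patch, so this is an assumption about the general technique, not a copy of TanimotoDocumentSimilarity; the class name `TanimotoSketch` and the map-based vectors are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the TanimotoDocumentSimilarity class from the patch.
class TanimotoSketch {

    /** Tanimoto coefficient of two sparse term-frequency vectors (term -> frequency). */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other; // only shared terms contribute
            }
        }
        for (int frequency : b.values()) {
            normB += (double) frequency * frequency;
        }
        double denominator = normA + normB - dot;
        // Both vectors empty: the coefficient is undefined; 0.0 is an arbitrary choice here.
        return denominator == 0 ? 0.0 : dot / denominator;
    }

    /** Distance derived from the coefficient: 0 for identical vectors, 1 for disjoint ones. */
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        return 1.0 - similarity(a, b);
    }

    public static void main(String[] args) {
        Map<String, Integer> doc1 = new HashMap<>();
        doc1.put("term", 1);
        doc1.put("vector", 1);
        Map<String, Integer> doc2 = new HashMap<>();
        doc2.put("vector", 1);
        doc2.put("index", 1);
        System.out.println(distance(doc1, doc2));
    }
}
```

The term-frequency maps would come from whatever populates the vectors, for example a mapper filled by TermVectorAccessor.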


Karl Wettin added a comment - 02/Oct/07 11:59 PM
I think a kD-tree will be the next step here. Does that fit in this project, or is that something I should post in UIMA or so?

Grant Ingersoll added a comment - 03/Oct/07 12:40 AM
Java 1.5 -> Java 1.4

Soon, very soon (in Lucene terms), we will have 1.5


Karl Wettin added a comment - 03/Oct/07 12:44 AM
Grant Ingersoll - 02/Oct/07 05:40 PM
> Java 1.5 -> Java 1.4
>
> Soon, very soon (in Lucene terms), we will have 1.5

This is why I placed it in contrib/misc, I was under the impression contrib allowed 1.5?


Karl Wettin added a comment - 03/Oct/07 05:53 AM
Also, don't pay too much attention to the quite ugly code in TanimotoDocumentSimilarity. I'll post something nice and refactored soon. I was just really thrilled that I managed to figure out all those Greek characters in the white paper.

Karl Wettin added a comment - 03/Oct/07 07:49 AM
Sorry for flooding. This JIRA issue is getting more off topic with each post. I hope you don't mind.

LUCENE-1016-clusterer.txt now contains a refactoring of the Tanimoto similarity; it does the same thing, but with less messy code.

And as the filename hints, I thought it would be fun to demonstrate the similarity by adding a very simple two-dimensional decision tree clusterer.

For the test I fed it 17 news articles representing 3 news stories I got from Google News. Attached is also a Graphviz diagram that shows the tree with the news stories clustered together. I have not yet looked at how to draw the line between the clusters, but I could probably come up with something simple enough. Legend: floating-point numbers represent the distance between two children. The leaves are the actual articles, prefixed with the news story identity and suffixed with the news article identity.

(The clusterer sure needs optimization; use Carrot instead. This is just me fooling around.)

Have fun!


Karl Wettin added a comment - 03/Oct/07 07:52 PM
TermVectorMapper should probably also be able to extract the term vector from a document prior to it being indexed. That was the original reason for me to introduce tokenStreamValue(). However, I suppose there could be problems with token streams and readers being exhausted.

Karl Wettin added a comment - 03/Oct/07 07:57 PM
Karl Wettin - 03/Oct/07 12:52 PM
> TermVectorMapper should probably also be able..

TermVectorAccessor, that is.


Karl Wettin added a comment - 08/Oct/07 12:03 PM
This patch:
  • All Java 1.4
  • Bug fix: could previously throw a NullPointerException in some cases

This patch is TermVectorAccessor code only, nothing else.


Karl Wettin added a comment - 30/Oct/07 07:28 AM
In this patch:
  • Java 1.4 for real

And then I removed everything that had nothing to do with this patch.


Grant Ingersoll added a comment - 03/Nov/07 05:42 AM
Oops, yes. My bad. I missed that part.

Karl Wettin added a comment - 03/Nov/07 12:37 PM
I would not touch this issue until LUCENE-1038 has been accepted or declined.

Karl Wettin added a comment - 04/Nov/07 01:42 AM
Now with support for mapper.setDocumentNumber as defined in LUCENE-1038

Karl Wettin added a comment - 06/Nov/07 06:21 PM
I think this is interesting:

http://www.nabble.com/How-to-generate-TermFreqVector-from-an-existing-index-tf4756257.html#a13601345

I'll have to look into the file format and see if it is possible to persist a term vector retrieved from the inverted index. That could be a nice addition to this issue.


Grant Ingersoll added a comment - 14/Jan/08 10:51 PM - edited
I'm curious whether the build part of this would be faster than reanalyzing a document. Just thinking out loud, but I have been wondering about a Highlighter that uses the new TermVectorMapper; using that alone doesn't account for non-TermVector based Documents that need to be analyzed. I was thinking this might account for both cases, all through the TermVectorMapper mechanism. It just doesn't seem like it would be very fast.

Karl Wettin added a comment - 15/Jan/08 06:28 AM

> I'm curious if the build part of this would be faster than reanalyzing a document.

It is a slow process on an index with many terms. Each one has to be iterated and matched against the document number.

> Just thinking out loud, but I have been wondering about a Highlighter that uses the new TermVectorMapper, but using that doesn't account for non-TermVector based Documents that need to be analyzed. Was thinking this might account for both cases, all through the TermVectorMapper mechanism. Just doesn't seem like it would be very fast.

This patch is mostly about when you don't have access to the source data. It was used together with a TermVectorMappingCachedTokenStreamFactory to extract re-indexable documents from any directory.

If you think of this piece of code and the highlighter together, I would consider something else, perhaps a tool that could add the term vector to all documents missing one in a single iteration sweep of the index. I know very little about the file format and the highlighter though.
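
A rough sketch of what such a single-sweep tool might look like, assuming a simplified in-memory model of the index rather than the real Lucene file format. The class `SingleSweepSketch`, the method `rebuildMissing`, and the map-based postings are hypothetical; persisting the rebuilt vectors is not covered.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model of a single-sweep rebuild; not Lucene API and not part of the patch.
class SingleSweepSketch {

    /**
     * Walks the inverted index (term -> (docId -> frequency)) exactly once and
     * rebuilds term vectors for every document that does not already have one.
     */
    static Map<Integer, Map<String, Integer>> rebuildMissing(
            Map<String, Map<Integer, Integer>> postings, Set<Integer> docsWithVectors) {
        Map<Integer, Map<String, Integer>> rebuilt = new HashMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> termEntry : postings.entrySet()) {
            for (Map.Entry<Integer, Integer> posting : termEntry.getValue().entrySet()) {
                int docId = posting.getKey();
                if (docsWithVectors.contains(docId)) {
                    continue; // this document already has a stored vector
                }
                rebuilt.computeIfAbsent(docId, k -> new TreeMap<>())
                       .put(termEntry.getKey(), posting.getValue());
            }
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
        postings.computeIfAbsent("term", k -> new HashMap<>()).put(1, 2);
        postings.computeIfAbsent("vector", k -> new HashMap<>()).put(1, 1);
        Set<Integer> docsWithVectors = new HashSet<>();
        System.out.println(rebuildMissing(postings, docsWithVectors));
    }
}
```

In this model the work is proportional to the total number of postings for the whole index, instead of one pass over every term per document.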


Karl Wettin added a comment - 24/Aug/08 12:17 PM
Documentation

Karl Wettin added a comment - 24/Aug/08 12:17 PM
I'll commit this soon.

Michael McCandless added a comment - 25/Aug/08 09:50 AM
Looks like you have this one Karl...thanks!