Karl Wettin added a comment - 02/Oct/07 09:42 PM
Oops, the prior patch contained some other stuff too by mistake.
TanimotoDocumentSimilarity depends on TermVectorAccessor and is used to calculate the distance between two documents in vector space.
My math skills are pretty lame, but I think I got it right. I think a kD-tree will be the next step here. Does that fit in this project, or is that something I should post in UIMA or so?
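For readers unfamiliar with the measure, here is a minimal sketch of a Tanimoto coefficient over two term frequency vectors, dot(a, b) / (|a|^2 + |b|^2 - dot(a, b)). It is not the code from the attached patch: the class name `TanimotoSketch`, the plain `Map<String, Integer>` representation of a term vector, and the convention that two empty vectors have similarity 0 are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the TanimotoDocumentSimilarity from the patch.
class TanimotoSketch {

    /** Tanimoto coefficient of two term frequency vectors (term -> frequency). */
    static double tanimoto(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int fa = e.getValue();
            normA += (double) fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += (double) fa * fb; // only shared terms contribute to the dot product
            }
        }
        for (int fb : b.values()) {
            normB += (double) fb * fb;
        }
        double denominator = normA + normB - dot;
        // Assumption: two empty vectors are treated as having similarity 0.
        return denominator == 0 ? 0 : dot / denominator;
    }

    /** Distance derived from the similarity: 0 for identical vectors, 1 for disjoint ones. */
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        return 1 - tanimoto(a, b);
    }

    public static void main(String[] args) {
        Map<String, Integer> doc1 = new HashMap<>();
        doc1.put("lucene", 1);
        doc1.put("index", 1);
        Map<String, Integer> doc2 = new HashMap<>();
        doc2.put("lucene", 1);
        doc2.put("search", 1);
        System.out.println("distance = " + distance(doc1, doc2));
    }
}
```

The distance form (1 minus the coefficient) is what a tree-based clusterer or a kD-tree style structure would typically consume.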
Java 1.5 -> Java 1.4
Soon, very soon (in Lucene terms), we will have 1.5 Grant Ingersoll - 02/Oct/07 05:40 PM
> Java 1.5 -> Java 1.4
>
> Soon, very soon (in Lucene terms), we will have 1.5
This is why I placed it in contrib/misc; I was under the impression contrib allowed 1.5? Also, don't pay too much attention to the quite ugly code in TanimotoDocumentSimilarity. I'll post something nice and refactored soon. I was just really thrilled that I managed to figure out all those Greek characters in the whitepaper.
Sorry for flooding. This JIRA issue is sort of turning more off topic with each post. I hope you don't mind.
And as the filename hints, I thought it would be fun to demonstrate the similarity by adding a very simple two-dimensional decision tree clusterer. For the test I feed it 17 news articles representing 3 news stories I got from Google News. Attached is also a Graphviz diagram that shows the tree with the news stories clustered together. I did not look at how to draw the line between the clusters yet, but I could probably come up with something simple enough.
Legend: the floating-point numbers represent the distance between two children. The leaves are the actual articles, prefixed with the news story identity and suffixed with the news article identity.
(The clusterer sure needs optimization; use carrot instead. This is just me fooling around.)
Have fun!
TermVectorMapper should probably also be able to extract the term vector from a document prior to it being indexed. That was the original reason for me to introduce tokenStreamValue(). However, I suppose there could be problems with token streams and readers being exhausted.
Karl Wettin - 03/Oct/07 12:52 PM
> TermVectorMapper should probably also be able..
TermVectorAccessor, that is.
This patch:
This patch is TermVectorAccessor code only, nothing else. In this patch:
And then I removed everything that had nothing to do with this patch.
Oops, yes. My bad. I missed that part.
I would not touch this issue until
Now with support for mapper.setDocumentNumber as defined in
I think this is interesting:
http://www.nabble.com/How-to-generate-TermFreqVector-from-an-existing-index-tf4756257.html#a13601345
I'll have to look into the file format and see if it is possible to persist a term vector retrieved from the inverted index. That could be a nice addition to this issue.
I'm curious whether the build part of this would be faster than reanalyzing a document. Just thinking out loud, but I have been wondering about a Highlighter that uses the new TermVectorMapper; using that alone doesn't account for non-TermVector based Documents that need to be analyzed. I was thinking this might account for both cases, all through the TermVectorMapper mechanism. It just doesn't seem like it would be very fast.
It is a slow process on an index with many terms. Each one has to be iterated and matched against the document number.
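To make the cost concrete, here is a rough sketch of rebuilding one document's term vector from an inverted index. It uses a toy in-memory model (a map from term to per-document frequencies), not Lucene's actual API and not the code in the patch; the class name `TermVectorRebuildSketch` and the data layout are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of an inverted index; not Lucene's API.
class TermVectorRebuildSketch {

    /**
     * Rebuilds the term vector (term -> frequency) of one document.
     * postings maps each term to (document number -> term frequency).
     */
    static Map<String, Integer> rebuildTermVector(
            Map<String, Map<Integer, Integer>> postings, int docNumber) {
        Map<String, Integer> vector = new TreeMap<>();
        // Every term in the index is visited, including terms the document
        // does not contain, so the work grows with the size of the term
        // dictionary rather than with the size of the document.
        for (Map.Entry<String, Map<Integer, Integer>> entry : postings.entrySet()) {
            Integer frequency = entry.getValue().get(docNumber);
            if (frequency != null) {
                vector.put(entry.getKey(), frequency);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        postings.computeIfAbsent("lucene", t -> new HashMap<>()).put(0, 2);
        postings.computeIfAbsent("lucene", t -> new HashMap<>()).put(1, 1);
        postings.computeIfAbsent("index", t -> new HashMap<>()).put(1, 3);
        postings.computeIfAbsent("search", t -> new HashMap<>()).put(0, 1);
        System.out.println(rebuildTermVector(postings, 0));
    }
}
```

A stored term vector avoids this sweep because the per-document term list is read directly, which is why persisting the rebuilt vector could pay off when the same document is accessed repeatedly.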
This patch is mostly about when you don't have access to the source data. It was used together with a TermVectorMappingCachedTokenStreamFactory to extract re-indexable documents from any directory. If you think of this piece of code and the highlighter together, I would consider something else, perhaps a tool that could add the term vector to all documents missing one in a single iteration sweep of the index. I know very little about the file format and the highlighter though.
Looks like you have this one Karl...thanks!