Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/store
    • Labels:
      None

      Description

      One of the big issues with term vector usage is that the information is loaded into parallel arrays as it is loaded, which are then often times manipulated again to use in the application (for instance, they are sorted by frequency).

      Adding a callback mechanism that allows the vector loading to be handled by the application would make this a lot more efficient.

      I propose to add to IndexReader:
      abstract public void getTermFreqVector(int docNumber, String field, TermVectorMapper mapper) throws IOException;
      and a similar one for the all fields version

      Where TermVectorMapper is an interface with a single method:
      void map(String term, int frequency, int offset, int position);

      The TermVectorReader will be modified to just call the TermVectorMapper. The existing getTermFreqVectors will be reimplemented to use an implementation of TermVectorMapper that creates the parallel arrays. Additionally, some simple implementations that automatically sort vectors will also be created.

      This is my first draft of this API and is subject to change. I hope to have a patch soon.

      See http://www.gossamer-threads.com/lists/lucene/java-user/48003?search_string=get%20the%20total%20term%20frequency;#48003 for related information.

      1. LUCENE-868-v2.patch
        56 kB
        Grant Ingersoll
      2. LUCENE-868-v3.patch
        63 kB
        Grant Ingersoll
      3. LUCENE-868-v4.patch
        67 kB
        Grant Ingersoll

        Activity

        Hide
        Grant Ingersoll added a comment -

        TermVectorMapper should have been:
        void map(String field, String term, int frequency, TermVectorOffsetInfo [] offsets, int [] positions);

        Show
        Grant Ingersoll added a comment - TermVectorMapper should have been: void map(String field, String term, int frequency, TermVectorOffsetInfo [] offsets, int [] positions);
        Hide
        Grant Ingersoll added a comment -

        First pass at a patch on providing a callback mechanism for term vectors. The big benefit to this is the ability to control how to store the TV info, instead of being stuck with parallel arrays. Two sample implementations are provided, as FieldSortedTermVectorMapper and SortedTermVectorMapper.

        One thing I am not completely sure on and would appreciate a review of, is the interface definition of TermVectorMapper, in particular the interplay of the setExpectations method with the actual map method (see the TermVectorsReader). It is not thread-safe (although I am not so sure the current way is either)

        The existing access methods for TVs still works just the same, even though the underlying implementation is different.

        I don't know if this is any faster in the Lucene core, but it should speed up those applications that have to reformat the TV info for their needs.

        Show
        Grant Ingersoll added a comment - First pass at a patch on providing a callback mechanism for term vectors. The big benefit to this is the ability to control how to store the TV info, instead of being stuck with parallel arrays. Two sample implementations are provided, as FieldSortedTermVectorMapper and SortedTermVectorMapper. One thing I am not completely sure on and would appreciate a review of, is the interface definition of TermVectorMapper, in particular the interplay of the setExpectations method with the actual map method (see the TermVectorsReader). It is not thread-safe (although I am not so sure the current way is either) The existing access methods for TVs still works just the same, even though the underlying implementation is different. I don't know if this is any faster in the Lucene core, but it should speed up those applications that have to reformat the TV info for their needs.
        Hide
        Grant Ingersoll added a comment -

        Anyone have any comments on this approach for Term Vectors?

        I'm not sure if the patch still applies to trunk, but I will update it and commit on Wednesday or Thursday unless I hear other comments.

        Show
        Grant Ingersoll added a comment - Anyone have any comments on this approach for Term Vectors? I'm not sure if the patch still applies to trunk, but I will update it and commit on Wednesday or Thursday unless I hear other comments.
        Hide
        Yonik Seeley added a comment -

        I haven't really used the term vector APIs, but I like the goal of allowing the app to handle things.
        What about dropping down a level lower, and not constructing the arrays or TermVectorOffsetInfo either?
        Perhaps something like:

        public interface TermVectorMapper

        { void setExpectations(String field, int numTerms, boolean hasOffsets, boolean hasPositions); void mapTerm(String term, int frequency) void mapTermPos(int startOffset, int endOffset, int position) }

        One could have an implementation of TermVectorMapper that collected the offsets and positions into an array as your patch does now. I'm not sure if there would be a noticable performance impact to a method call per term instance or not.

        Oh, wait... I just went and looked at the readTermVector() code, and positions and offsets aren't stored interleaved, so one would have to do a sequence of mapTermPos() followed by a sequence of mapTerm Offset(), which makes less sense than what you have now.

        Might also consider using an abstract class instead of an interface in case you want to make backward-compatible tweaks later.

        Show
        Yonik Seeley added a comment - I haven't really used the term vector APIs, but I like the goal of allowing the app to handle things. What about dropping down a level lower, and not constructing the arrays or TermVectorOffsetInfo either? Perhaps something like: public interface TermVectorMapper { void setExpectations(String field, int numTerms, boolean hasOffsets, boolean hasPositions); void mapTerm(String term, int frequency) void mapTermPos(int startOffset, int endOffset, int position) } One could have an implementation of TermVectorMapper that collected the offsets and positions into an array as your patch does now. I'm not sure if there would be a noticable performance impact to a method call per term instance or not. Oh, wait... I just went and looked at the readTermVector() code, and positions and offsets aren't stored interleaved, so one would have to do a sequence of mapTermPos() followed by a sequence of mapTerm Offset(), which makes less sense than what you have now. Might also consider using an abstract class instead of an interface in case you want to make backward-compatible tweaks later.
        Hide
        Karl Wettin added a comment -

        Grant Ingersoll - [09/Jul/07 02:05 PM ]
        > Anyone have any comments on this approach for Term Vectors?
        >
        > I'm not sure if the patch still applies to trunk, but I will update it
        > and commit on Wednesday or Thursday unless I hear other comments.

        I can give the code an overview in the weekend if you want. I'll defintely be using this stuff when I get back from vacation.

        Show
        Karl Wettin added a comment - Grant Ingersoll - [09/Jul/07 02:05 PM ] > Anyone have any comments on this approach for Term Vectors? > > I'm not sure if the patch still applies to trunk, but I will update it > and commit on Wednesday or Thursday unless I hear other comments. I can give the code an overview in the weekend if you want. I'll defintely be using this stuff when I get back from vacation.
        Hide
        Grant Ingersoll added a comment -

        I will have to add a new patch, it MemoryIndexReader does not implement the new methods

        Show
        Grant Ingersoll added a comment - I will have to add a new patch, it MemoryIndexReader does not implement the new methods
        Hide
        Grant Ingersoll added a comment -

        New patch that passes all tests (and compiles against the memory contrib)

        Show
        Grant Ingersoll added a comment - New patch that passes all tests (and compiles against the memory contrib)
        Hide
        Grant Ingersoll added a comment -

        I also switched TermVectorMapper to be an abstract class per Yonik's suggestion.

        Show
        Grant Ingersoll added a comment - I also switched TermVectorMapper to be an abstract class per Yonik's suggestion.
        Hide
        Grant Ingersoll added a comment -

        Added the start of a Position based Mapper. This would allow indexing directly (almost) into the vector by position. Still needs a little more testing, but wanted to put it out there for others to see.

        Show
        Grant Ingersoll added a comment - Added the start of a Position based Mapper. This would allow indexing directly (almost) into the vector by position. Still needs a little more testing, but wanted to put it out there for others to see.
        Hide
        Karl Wettin added a comment -

        Sorry for the delay, vacation time.

        In short I think this is a really nice improvment to the API. I also agree with Yonik about the array[]s constructed and passed down to the mapper. Perhaps your current implementation could be moved one layer further up? Another thought is to reuse array(s) and pass on the data length, but that might just complicate things.

        I'll try to introduce these things next week and see how well it works.

        I use the term vectors for text classification. For each new classifier introduced (occurs quite a lot) I iterate the corpus and classify the documents. Potentially it could save me quite a bit of ticks and bits to not create all them array[]s, however my gut tells me there might be some JVM settings that does the same trick. I'll have to look in to that.

        Show
        Karl Wettin added a comment - Sorry for the delay, vacation time. In short I think this is a really nice improvment to the API. I also agree with Yonik about the array[]s constructed and passed down to the mapper. Perhaps your current implementation could be moved one layer further up? Another thought is to reuse array(s) and pass on the data length, but that might just complicate things. I'll try to introduce these things next week and see how well it works. I use the term vectors for text classification. For each new classifier introduced (occurs quite a lot) I iterate the corpus and classify the documents. Potentially it could save me quite a bit of ticks and bits to not create all them array[]s, however my gut tells me there might be some JVM settings that does the same trick. I'll have to look in to that.
        Hide
        Grant Ingersoll added a comment -

        The TermVectorOffsetInfo and Position arrays are only created if storeOffsets and storePositions are turned on. But, we could also add mapperMethods like:
        boolean isIgnoringOffsets()
        and
        boolean isIgnoringPositions()

        Then, in TermVectorsReader, it could become:

        if (storePositions && mapper.isIgnoringPositions() == false)

        and likewise for isIgnoringOffsets. This way a mapper could express whether it wants these arrays to be constructed even if they are turned on. Then we just need to skip ahead by the appropriate amount.

        Show
        Grant Ingersoll added a comment - The TermVectorOffsetInfo and Position arrays are only created if storeOffsets and storePositions are turned on. But, we could also add mapperMethods like: boolean isIgnoringOffsets() and boolean isIgnoringPositions() Then, in TermVectorsReader, it could become: if (storePositions && mapper.isIgnoringPositions() == false) and likewise for isIgnoringOffsets. This way a mapper could express whether it wants these arrays to be constructed even if they are turned on. Then we just need to skip ahead by the appropriate amount.
        Hide
        Grant Ingersoll added a comment -

        Based on Yonik's and Karl's comments on avoiding loading the offset and position arrays, this patch has two new methods on the TermVectorMapper which tell the TermVectorsReader whether the Mapper is interested in positions or not, regardless of whether they are stored or not.

        Show
        Grant Ingersoll added a comment - Based on Yonik's and Karl's comments on avoiding loading the offset and position arrays, this patch has two new methods on the TermVectorMapper which tell the TermVectorsReader whether the Mapper is interested in positions or not, regardless of whether they are stored or not.
        Hide
        Grant Ingersoll added a comment -

        Committed. Did not include the PositionBasedTermVectorMapper stuff as it was not ready to go.

        Show
        Grant Ingersoll added a comment - Committed. Did not include the PositionBasedTermVectorMapper stuff as it was not ready to go.
        Hide
        Paul Elschot added a comment -

        I just got this warning from ant javadocs-internal:
        src/java/org/apache/lucene/index/TermVectorsReader.java:448: @return tag has no arguments.
        This could well be from the materializeVector method of the v4 patch.

        Show
        Paul Elschot added a comment - I just got this warning from ant javadocs-internal: src/java/org/apache/lucene/index/TermVectorsReader.java:448: @return tag has no arguments. This could well be from the materializeVector method of the v4 patch.
        Hide
        Grant Ingersoll added a comment -

        Thanks, Paul, I have added a comment.

        Show
        Grant Ingersoll added a comment - Thanks, Paul, I have added a comment.
        Hide
        Paul Elschot added a comment -

        [ https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515004 ]

        My pleasure,

        Paul Elschot

        LUCENE-868-v4.patch
        loaded into parallel arrays as it is loaded, which are then often times
        manipulated again to use in the application (for instance, they are sorted by
        frequency).
        by the application would make this a lot more efficient.
        TermVectorMapper mapper) throws IOException;
        The existing getTermFreqVectors will be reimplemented to use an
        implementation of TermVectorMapper that creates the parallel arrays.
        Additionally, some simple implementations that automatically sort vectors
        will also be created.
        have a patch soon.
        http://www.gossamer-threads.com/lists/lucene/java-user/48003?search_string=get%20the%20total%20term%20frequency;#48003
        for related information.

        Show
        Paul Elschot added a comment - [ https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515004 ] My pleasure, Paul Elschot LUCENE-868 -v4.patch loaded into parallel arrays as it is loaded, which are then often times manipulated again to use in the application (for instance, they are sorted by frequency). by the application would make this a lot more efficient. TermVectorMapper mapper) throws IOException; The existing getTermFreqVectors will be reimplemented to use an implementation of TermVectorMapper that creates the parallel arrays. Additionally, some simple implementations that automatically sort vectors will also be created. have a patch soon. http://www.gossamer-threads.com/lists/lucene/java-user/48003?search_string=get%20the%20total%20term%20frequency;#48003 for related information.

          People

          • Assignee:
            Grant Ingersoll
            Reporter:
            Grant Ingersoll
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development