Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: general/website
    • Labels:
      None
    • Environment:

      Operating System: other
      Platform: All

      Description

      Moved from todo.xml:
      http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgNo=273
      http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgNo=272

      I don't know enough about the lucene internals to know if this was implemented
      in 1.3 so I'm adding it here as an open enhancement.

      1. queryTermPositionVector.patch.tar.gz
        4 kB
        Grant Ingersoll
      2. ASF.LICENSE.NOT.GRANTED--vector.patch.gz
        34 kB
        cutting@apache.org
      3. ASF.LICENSE.NOT.GRANTED--termVectorPatch-1.3-2.zip
        71 kB
        Grant Ingersoll
      4. ASF.LICENSE.NOT.GRANTED--termVectorPatch1.3.zip
        61 kB
        Grant Ingersoll
      5. ASF.LICENSE.NOT.GRANTED--TermFreqVector.patch
        1 kB
        Bruce Ritchie
      6. ASF.LICENSE.NOT.GRANTED--src.zip
        3 kB
        Grant Ingersoll
      7. ASF.LICENSE.NOT.GRANTED--patch-TermVectorPosOffset.txt
        144 kB
        Grant Ingersoll
      8. ASF.LICENSE.NOT.GRANTED--patch-TermVectorPosOffset.txt
        109 kB
        Grant Ingersoll
      9. ASF.LICENSE.NOT.GRANTED--IndexReader.patch
        1 kB
        Bruce Ritchie

        Activity

        cutting@apache.org added a comment -

        No, this has not yet been implemented. Dmitry posted a nearly-complete
        implementation over a year ago, but Lucene has changed a lot since then, so it
        would need some work to be updated.

        Otis Gospodnetic added a comment -

        A few more relevant messages from Dmitry:
        http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114748
        http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114861
        http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114862
        http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=433778

        Grant Ingersoll added a comment -

        Created an attachment (id=10254)
        Contains the new files, the patch and implementation notes

        Grant Ingersoll added a comment -

        Attached is Dmitry's code updated for 1.3. Here are my notes on the
        implementation (which are also included in the attachment)

        The patch is in the zip, is named termVector1.3Patch.txt, and
        was generated using cvs diff -Nu at the root of the tree.

        If there are any questions, I would be more than happy to help via the mailing
        list.

        -----------------------------------------------
        Notes on the re-implementation of Dmitry's Term Vector enhancements for Lucene
        1.3.

        Please see
        http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114748
        for the original patch.

        General Notes
        -----------------------

        I used Dmitry's code as a template by getting it working against 1.2 and then
        going through by hand
        and applying it against the HEAD. Thanks to Dmitry's great notes, it was
        relatively painless. All of
        the tests against HEAD pass.

        Differences from 1.2 Version
        ----------------------------

        The most significant change I had to make is that in the TermFreqVector
        interface the getTermNumbers() method has been replaced by a getTerms()
        method, which returns an array of Strings. These strings are the equivalent
        of Term.text() and store the unique string that has been indexed. While the
        numbering scheme worked to save space, it presented a problem in 1.3 when it
        comes to merging, because the 1.3 code could support up to Long.MAX_VALUE
        positions (see TermEnum and SegmentTermEnum) versus Integer.MAX_VALUE in 1.2
        (at least in my understanding). This prevented me from using the termMaps
        array technique used in 1.2 for remapping the term numbers from the old
        segment to the new segment. To solve this, we needed some globally unique
        identifier for a term. For this, I use the term text plus the number of the
        field that the terms came from (which is why there are new accessor methods
        on TermFreqVector called get/setFieldNum).

        The side benefit of this is that merging is much simpler: we can just
        iterate over the readers and their vectors and add the terms from the old
        TermVector to the new TermVectorsWriter, without doing any remapping. The
        downside is that the term vector files are going to take up more space on
        disk.
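
        For illustration only (this is not the code in the patch): a minimal Java
        sketch of why keying vectors by term text removes the remapping step. The
        classes SimpleTermVector and VectorMergeSketch are hypothetical stand-ins;
        only the accessor names getTerms() and getFieldNum() echo the notes above,
        and getTermFrequencies() is an assumed name.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one field's term vector, keyed by term text rather than term number. */
final class SimpleTermVector {
    private final int fieldNum;
    private final String[] terms; // term text, sorted
    private final int[] freqs;    // freqs[i] is the frequency of terms[i]

    SimpleTermVector(int fieldNum, String[] terms, int[] freqs) {
        if (terms.length != freqs.length) {
            throw new IllegalArgumentException("terms and freqs must be parallel arrays");
        }
        this.fieldNum = fieldNum;
        this.terms = terms.clone();
        this.freqs = freqs.clone();
    }

    int getFieldNum() { return fieldNum; }
    String[] getTerms() { return terms.clone(); }
    int[] getTermFrequencies() { return freqs.clone(); }
}

final class VectorMergeSketch {
    /**
     * Merges the vectors of several source segments into one list, in segment order.
     * Term text means the same thing in every segment, so each vector is carried
     * over unchanged; with per-segment term numbers, every entry would first have
     * to be translated through a remapping table.
     */
    static List<SimpleTermVector> merge(List<List<SimpleTermVector>> segments) {
        List<SimpleTermVector> merged = new ArrayList<>();
        for (List<SimpleTermVector> segment : segments) {
            for (SimpleTermVector vector : segment) {
                merged.add(vector);
            }
        }
        return merged;
    }
}
```

        The cost is the one noted above: every vector repeats its term strings, so
        the files on disk grow.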

        I believe I have overcome the limitation that you can only retrieve term
        vectors on optimized indices. The SegmentsReader, which previously threw
        runtime exceptions for the getTermVector methods, now properly implements
        them.

        Compatibility
        ----------------------
        Similar to Dmitry's, I believe the index files should be backward compatible.

        Performance
        ----------------------
        Have not run thorough performance tests, but I did do the following runs, one
        with term vectors and one
        without term vectors:

        Index Size: 12598 documents with 88362 terms. The documents in question are XML
        files where all of the TEXT
        was extracted and indexed.

        Without TVs:
        Drive Space Used: 42 MB
        Time to index: 5 minutes, 30 seconds

        With TVs:
        Drive Space Used: 71.3 MB
        Time to index: 6 minutes, 2 seconds

        Your mileage may vary.
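
        As a quick check of what those two runs imply, the relative overhead can
        be computed directly from the figures above. The class and method names in
        this sketch are made up for illustration.

```java
/** Hypothetical helper that turns the two reported runs into relative overheads. */
final class IndexingOverhead {
    /** Returns the relative increase from before to after, as a percentage. */
    static double percentIncrease(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Figures taken from the runs reported above.
        double diskPct = percentIncrease(42.0, 71.3);               // MB on disk
        double timePct = percentIncrease(5 * 60 + 30, 6 * 60 + 2);  // seconds to index
        System.out.printf("disk: +%.1f%%, time: +%.1f%%%n", diskPct, timePct);
    }
}
```

        For this one data set that works out to roughly 70% more disk space and
        roughly 10% more indexing time.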

        Limitations
        ------------------------
        Not sure what they are yet. I am sure there are places that could be
        optimized. The numbering scheme could probably be reinstituted by using
        some type of paging array or array-of-arrays scheme that allows you to
        store a really large number of values.
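
        A rough sketch of what such an array-of-arrays scheme could look like
        follows. Nothing like it is in the patch; the class name PagedIntArray and
        the page size are assumptions chosen for the example.

```java
/** Hypothetical int array addressable by long indices, stored as an array of fixed-size pages. */
final class PagedIntArray {
    private static final int PAGE_BITS = 16;              // 65,536 entries per page
    private static final int PAGE_SIZE = 1 << PAGE_BITS;
    private static final int PAGE_MASK = PAGE_SIZE - 1;

    private final int[][] pages; // pages[p] stays null until something is stored in it
    private final long length;

    PagedIntArray(long length) {
        if (length < 0) {
            throw new IllegalArgumentException("negative length");
        }
        long pageCount = (length + PAGE_SIZE - 1) >>> PAGE_BITS;
        if (pageCount > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("length too large for this page size");
        }
        this.length = length;
        this.pages = new int[(int) pageCount][];
    }

    long length() { return length; }

    /** Returns the stored value, or 0 if nothing was ever stored at index. */
    int get(long index) {
        checkIndex(index);
        int[] page = pages[(int) (index >>> PAGE_BITS)];
        return page == null ? 0 : page[(int) (index & PAGE_MASK)];
    }

    void set(long index, int value) {
        checkIndex(index);
        int p = (int) (index >>> PAGE_BITS);
        if (pages[p] == null) {
            pages[p] = new int[PAGE_SIZE]; // allocate the page on first write
        }
        pages[p][(int) (index & PAGE_MASK)] = value;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

        The high bits of the index select a page and the low bits select a slot
        within it, so indices beyond Integer.MAX_VALUE are addressable while
        untouched pages cost only a null reference.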

        FilterIndexReader throws an UnsupportedOperationException for the new Term
        Vector methods.

        I did not test with compound files. Do not know if they are compatible.

        Other limitations are probably those of omission. That is, are the new methods
        sufficient for doing what
        people need to do? I can think of a few:
        1. Since only terms and frequencies are stored, something to quickly calculate
        the actual weight of the term
        as it was scored for the query. I looked into this, but, frankly, I am fairly
        confused by the whole
        Scorer/Similarity interactions, especially when it comes to nested queries.

        2. Perhaps the Document object itself should have a method similar to those on
        IndexReader.

        New File Notes
        ----------------------------------
        src/java/org/apache/lucene/index/SegmentTermVector.java
        Implementation of TermFreqVector and TermPositionVector.

        src/java/org/apache/lucene/index/TermFreqVector.java
        Interface for describing a Document term vector. See notes above for what
        was changed from 1.2

        src/java/org/apache/lucene/index/TermPositionVector.java
        No change from 1.2 version.

        src/java/org/apache/lucene/index/TermVectorsReader.java
        Changed get methods to return TermFreqVector interface instead of explicit
        SegmentTermVector.
        Added getTermPositions method to retrieve TermPositionVector(s).
        Changed the reading slightly to match the writing of the term text instead
        of the term number.

        src/java/org/apache/lucene/index/TermVectorsWriter.java
        Added documentation
        Changed the writing to write the term string instead of the term number
        Would be nice if there was a way to turn on or off the writing of positional
        information.
        See the TODO comment.

        src/test/org/apache/lucene/index/DocHelper.java
        Package-local class to help set up documents for testing.

        src/test/org/apache/lucene/index/TestDocumentWriter.java
        New test class for the DocumentWriter object. Probably needs to be fleshed
        out more to fully test.

        src/test/org/apache/lucene/index/TestFieldInfos.java
        Test for the new FieldInfos return values, etc.

        src/test/org/apache/lucene/index/TestFieldsReader.java
        Basic test for FieldsReader. Needs to be expanded to fully test
        functionality.

        src/test/org/apache/lucene/index/TestSegmentMerger.java
        Sets up two segments, including term vectors, then merges them and asserts
        that items were properly
        merged.

        src/test/org/apache/lucene/index/TestSegmentReader.java
        Various tests for the SegmentReader. Tests retrieving a document, deleting a
        document,
        retrieving field names and retrieving terms. Has a placeholder for
        retrieving norms,
        but I did not implement, as I didn't fully understand how norms worked.

        src/test/org/apache/lucene/index/TestSegmentsReader.java
        Sets up a SegmentsReader made up of two Segments and does various tests on
        them. Needs
        to be filled in more completely.

        src/test/org/apache/lucene/index/TestSegmentTermDocs.java
        Has positive and negative tests for the SegmentTermDocs.

        src/test/org/apache/lucene/index/TestTermVectorsReader.java
        Writes out some term vectors and then asserts that they can be read back in

        src/test/org/apache/lucene/index/TestTermVectorsWriter.java
        Writes out some term vectors and then asserts that the proper files were
        created w/ the proper
        information in them.

        src/test/org/apache/lucene/search/TestTermVectors.java
        Searches over an indexed set of documents and then retrieves the term vectors
        for the documents.
        Also sets up a small collection of documents and maps containing term and
        frequency information
        and verifies that the term vectors are properly constructed. This is a
        fairly decent example
        of end-to-end use of the vectors.

        Existing File Changes:
        ----------------------------------
        org/apache/lucene/analysis/PorterStemmer.java:
        Made public.
        Please, please, please apply this patch! I think several people have
        submitted this one and I vote for it
        as well! I use the implementation in other parts of my code and it is
        annoying to have to change it in
        my local copy every time there is a new release.

        org/apache/lucene/document/Document.java
        Added a getNumFields() method that will return the number of fields that a
        document has.

        org/apache/lucene/document/Field.java
        Same as 1.2 patch.

        org/apache/lucene/index/DocumentWriter.java
        Same as 1.2 patch. Updated some formatting.

        org/apache/lucene/index/FieldInfo.java
        Added constructor for indicating the term vector is stored.

        org/apache/lucene/index/FieldInfos.java
        Added support for term vector storage. Similar to 1.2 patch
        The add methods now return a Map of <field name, field number> pairs.

        org/apache/lucene/index/FieldsReader.java
        Added comment. Now constructs the Field object with the termVector
        information

        org/apache/lucene/index/FilterIndexReader.java
        Formatted code. Added stubs for the Term Vector methods, but they are
        not implemented (see Limitations above).

        org/apache/lucene/index/IndexReader.java
        Same as 1.2 patch, plus added a getTermVectorReader method which returns the
        TermVectorReader
        for the IndexReader. Added new getIndexedFieldNames(boolean) methods which
        retrieve
        all indexed field names based on whether the field stores term vectors or not.
        Added a package local method named getFieldInfos which returns the field
        infos object
        for the reader. This is needed in merging.
        Formatted code.

        org/apache/lucene/index/SegmentMerger.java
        Added comments and a mergeVectors() method that merges the terms in from the
        various
        readers into the new segment. Formatted code.

        org/apache/lucene/index/SegmentReader.java
        Added new TV files to the list of segments. Implemented new IndexReader
        methods for TVs.

        org/apache/lucene/index/SegmentTermDocs.java
        Formatted. Added in the isValid() method, but is commented out, as I am not
        sure it is needed.
        It was in 1.2 version.

        org/apache/lucene/index/SegmentTermEnum.java
        Same as 1.2 patch. Formatted.

        org/apache/lucene/index/SegmentTermPositions.java
        Same as 1.2 patch.

        org/apache/lucene/index/SegmentsReader.java
        Added a fieldInfos variable that is the summation of all of the fieldInfos
        from the other segments.
        This is used to implement the getFieldInfos() method, but is probably not all
        that useful.
        Implements the new term vector methods.

        org/apache/lucene/index/TermDocs.java
        Added isValid method per 1.2, but it is commented out as I am not sure we
        need it. Formatted code.

        org/apache/lucene/index/TermEnum.java
        Same as 1.2 patch.

        org/apache/lucene/index/TermInfosWriter.java
        Same as 1.2 patch.

        org/apache/lucene/search/FilteredTermEnum.java
        Implements size() method, but throws UnsupportedOperationException.

        org/apache/lucene/search/FuzzyTermEnum.java
        Implements termNumber() and isValid() but both throw
        UnsupportedOperationException.

        org/apache/lucene/search/MultiSearcher.java
        Implements new count() methods as per 1.2 patch.

        org/apache/lucene/search/RemoteSearchable.java
        Same as MultiSearcher.

        org/apache/lucene/search/Searchable.java
        Added count() methods onto the interface.

        org/apache/lucene/search/Searcher.java
        Added count() methods support.

        org/apache/lucene/search/WildcardTermEnum.java
        Implements termNumber() and isValid() but both throw
        UnsupportedOperationException.

        org/apache/lucene/index/TestFilterIndexReader.java
        Implements the necessary TV methods

        org/apache/lucene/search/TestBasics.java
        Tests the count methods for the searcher.

        cutting@apache.org added a comment -

        Wow!

        I think the idea of removing the Term->int mapping is probably a good one, since
        it makes vectors available for all indexes, not just optimized ones, and that's
        really a requirement. It makes things bigger and slower (e.g., a vector
        dot-product will have to do string compares) but I think that's probably worth it.
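
        To make the string-compare cost concrete, here is a hedged sketch of a dot
        product over two vectors keyed by term text. It assumes each vector is a
        pair of parallel arrays with terms sorted in String.compareTo order; the
        class and method names are hypothetical and not part of the patch.

```java
/** Hypothetical dot product over two term vectors keyed by sorted term text. */
final class TermVectorDotProduct {
    /**
     * termsA and termsB must each be sorted ascending by String.compareTo,
     * and freqsA[i] / freqsB[j] must be the frequencies of termsA[i] / termsB[j].
     */
    static long dot(String[] termsA, int[] freqsA, String[] termsB, int[] freqsB) {
        long sum = 0;
        int i = 0;
        int j = 0;
        // Walk both sorted lists in step; one string comparison per iteration.
        while (i < termsA.length && j < termsB.length) {
            int cmp = termsA[i].compareTo(termsB[j]);
            if (cmp == 0) {
                sum += (long) freqsA[i] * freqsB[j];
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // termsA[i] does not occur in B
            } else {
                j++; // termsB[j] does not occur in A
            }
        }
        return sum;
    }
}
```

        With integer term numbers the comparison in the loop would be a single
        int compare; with term text it is a character-by-character comparison,
        which is the slowdown being weighed here.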

        Dmitry, others: what do you think of this approach?

        Note that, since the vectors are sorted by term text, you can write them in a
        more compact manner by sharing string prefixes. See, for example,
        SegmentTermEnum.readTerm() for an example of how this can be done.
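
        The prefix-sharing idea can be sketched as follows. This is an
        illustration only and not how SegmentTermEnum is implemented: it uses
        java.io data streams with fixed-width ints for simplicity, and the class
        name PrefixSharedTerms is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical encoder that stores each sorted term as (shared prefix length, suffix). */
final class PrefixSharedTerms {
    static byte[] encode(String[] sortedTerms) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(sortedTerms.length);
            String previous = "";
            for (String term : sortedTerms) {
                int shared = sharedPrefixLength(previous, term);
                out.writeInt(shared);                 // chars reused from the previous term
                out.writeUTF(term.substring(shared)); // only the new suffix is written
                previous = term;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static String[] decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String[] terms = new String[in.readInt()];
            String previous = "";
            for (int i = 0; i < terms.length; i++) {
                int shared = in.readInt();
                terms[i] = previous.substring(0, shared) + in.readUTF();
                previous = terms[i];
            }
            return terms;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static int sharedPrefixLength(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int n = 0;
        while (n < limit && a.charAt(n) == b.charAt(n)) {
            n++;
        }
        return n;
    }
}
```

        Because neighbouring terms in a sorted list tend to share long prefixes
        (for example "index", "indexed", "indexing"), most entries shrink to a
        small count plus a short suffix.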

        It would be best to include a format version number as the first four bytes of
        each file. I'm trying to add that as we introduce new files or change the
        format of existing files. This will make it much easier to compatibly evolve
        the file format.
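
        A minimal sketch of the version-header idea, assuming a four-byte
        big-endian int at the start of the file. The class name, the version
        constant and the in-memory byte arrays are all assumptions made for the
        example, not Lucene code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical helper that prefixes file contents with a four-byte format version. */
final class FormatVersionHeader {
    static final int CURRENT_VERSION = 1; // assumed value, for illustration

    /** Returns the payload prefixed with the format version as a big-endian int. */
    static byte[] write(byte[] payload) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(CURRENT_VERSION); // always the first four bytes
            out.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the leading version and rejects versions this reader does not know. */
    static int readVersion(byte[] file) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            int version = in.readInt();
            if (version < 1 || version > CURRENT_VERSION) {
                throw new IllegalStateException("Unsupported format version: " + version);
            }
            return version;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

        A reader that checks the version before anything else can refuse or adapt
        to a newer layout instead of misinterpreting the bytes that follow.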

        A description of the new file formats will also be required before we make a
        1.4 release. Can you draft something up about this?

        I haven't actually applied the patch or tried to run this yet. One thing I
        note, in glancing at the code, is that it looks like you read the positions even
        when they're not asked for. (Or did I miss something?) It would be best if
        this could be avoided as it adds file i/o and increases the in-memory size of
        vectors. Lots of vector-based computations don't care about positions.

        Thanks!

        Grant Ingersoll added a comment -

        Yeah, the term text was a trade-off, but the only other way I saw of doing it
        was some type of large list object that allowed you to address an array using
        longs (for merging).
        I thought about doing the prefix string thing like in the main index file, but
        wanted to keep it simple for the first go around.

        I can provide file formats.

        There is a TODO tag in the TermVectorsWriter marking where we would need to
        handle the option of writing position information. Currently positions are
        always written and always read in, per the original code. As above, I wanted
        to keep the first pass as simple as possible.

        Grant Ingersoll added a comment -

        Below is the diff produced on the File Formats XML file located in xdocs, as
        promised. I trust it will be checked for accuracy. Let me know if there are
        any mistakes and I will fix them.

        cvs diff -Nu fileformats.xml

        Index: fileformats.xml
        ===================================================================
        RCS file: /home/cvspublic/jakarta-lucene/xdocs/fileformats.xml,v
        retrieving revision 1.6
        diff -u -r1.6 fileformats.xml
        --- fileformats.xml 13 Oct 2003 13:53:08 -0000 1.6
        +++ fileformats.xml 9 Feb 2004 16:08:57 -0000
        @@ -224,7 +224,11 @@
        multiplied into the score for hits on that field.
        </p>
        </li>
        -
        + <li><p>Term Vectors. For each field in each document, the
        term vector
        + (sometimes called document vector) is stored. A term
        vector consists
        + of the term text, term frequency and term position.
        + </p>
        + </li>
        <li><p>Deleted documents.
        An optional file indicating which documents are
        deleted.
        </p>
        @@ -804,9 +808,10 @@
        </p>

        <p>
        - Currently only the low-order bit is used of FieldBits is used. It is
        - one for
        - indexed fields, and zero for non-indexed fields.
        + The low-order bit is one for
        + indexed fields, and zero for non-indexed fields. The second lowest-order
        + bit is one for fields that have term vectors stored, and zero for fields
        + without term vectors.
        </p>

        <p>
        @@ -1112,6 +1117,57 @@
        </li>
        </ol>

        + </subsection>
        + <subsection name="Term Vectors">
        + Term Vector support is an optional on a field by field basis. It consists of 4
        + files.
        + <ol>
        + <li>
        + <p>The Document Index or .tvx file.</p>
        + <p>This contains, for each document, a pointer to the document data in the Document
        + (.tvd) file.
        + </p>
        + <p>DocumentIndex (.tvx) --> <DocumentPosition><sup>NumDocs</sup></p>
        + <p>DocumentPosition --> UInt64</p>
        + <p>This is used to find the position of the Document in the .tvd file.</p>
        + </li>
        + <li>
        + <p>The Document or .tvd file.</p>
        + <p>This contains, for each document, the number of fields, a list of the fields with
        + term vector info and finally a list of pointers to the field information in the .tvf
        + (Term Vector Fields) file.</p>
        + <p>
        + Document (.tvd) --> <NumFields, FieldNums, FieldPositions,><sup>NumDocs</sup>
        + </p>
        + <p>NumFields --> VInt</p>
        + <p>FieldNums --> <FieldNumDelta><sup>NumFields</sup></p>
        + <p>FieldNumDelta --> VInt</p>
        + <p>FieldPositions --> <FieldPosition><sup>NumFields</sup></p>
        + <p>FieldPosition --> VLong</p>
        + <p>The .tvd file is used to map out the fields that have term vectors stored and
        + where the field information is in the .tvf file.</p>
        + </li>
        + <li>
        + <p>The Field or .tvf file.</p>
        + <p>This file contains, for each field that has a term vector stored, a list of
        + the terms and their frequencies.</p>
        + <p>Field (.tvf) --> <NumTerms, NumDistinct, TermFreqs, TermPositionPointerDelta><sup>NumFields</sup></p>
        + <p>NumTerms --> VInt</p>
        + <p>NumDistinct --> VInt – Future Use</p>
        + <p>TermFreqs --> <TermText, TermFreq><sup>NumTerms</sup></p>
        + <p>TermText --> String</p>
        + <p>TermFreq --> VInt</p>
        + <p>TermPositionPointerDelta --> VLong</p>
        + <p></p>
        + </li>
        + <li>
        + <p>The Positions or .tvp file.</p>
        + <p>This contains, for each term in the Field and Document, the positional information for
        + each term in the document. </p>
        + <p>Positions (.tvp) --> <PositionDelta><sup>NumPositions</sup></p>
        + <p>PositionDelta --> VInt</p>
        + </li>
        + </ol>
        + </li>
        + </ol>
        </subsection>

        <subsection name="Deleted Documents">
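
Editorial note: the .tvp entry in the diff above describes positions as a run of PositionDelta values, each a VInt. The following is a minimal sketch of how such a run could be written and read back. It is not the code from the patch. It assumes each delta is the difference from the previous position (starting from 0) and uses the usual VInt layout of seven data bits per byte, low-order group first, with the high bit marking a continuation. The class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class PositionDeltaSketch {

    // Writes a non-negative int as a VInt: seven data bits per byte,
    // low-order group first, high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encodes ascending positions as VInt deltas.
    // Assumption: each delta is relative to the previous position, starting at 0.
    static byte[] encode(int[] positions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int position : positions) {
            writeVInt(out, position - previous);
            previous = position;
        }
        return out.toByteArray();
    }

    // Reads `count` VInt deltas and rebuilds the absolute positions.
    static int[] decode(byte[] bytes, int count) {
        int[] positions = new int[count];
        int offset = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int b = bytes[offset++];
            int delta = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = bytes[offset++];
                delta |= (b & 0x7F) << shift;
            }
            previous += delta;
            positions[i] = previous;
        }
        return positions;
    }

    public static void main(String[] args) {
        int[] positions = {3, 10, 200, 1000};
        byte[] encoded = encode(positions);
        System.out.println(encoded.length + " bytes, decoded: "
                + Arrays.toString(decode(encoded, positions.length)));
    }
}
```

Small deltas fit in a single byte, which is the usual reason for storing differences rather than absolute positions.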

        Grant Ingersoll added a comment -

        Created an attachment (id=10415)
        Term Vector support, part 2. See the notes.txt file in the attachment.

        cutting@apache.org added a comment -

        Created an attachment (id=10443)
        vector patch file

        cutting@apache.org added a comment -

        Phew! This was a fair bit of work to apply! I ended up making lots of changes,
        mostly removing stuff that wasn't used and/or shouldn't be public.

        Can folks try applying this to the CVS head and seeing that everything still
        works? I'll check it in soon, if no one reports any problems.

        cutting@apache.org added a comment -

        I committed this a few days ago.

        Bruce Ritchie added a comment -

        Created an attachment (id=10522)
        Diff to fix a small documentation bug in the TermFreqVector class.

        Bruce Ritchie added a comment -

        Created an attachment (id=10523)
        Another small patch to term vector documentation.

        Grant Ingersoll added a comment -

        Term vectors can now optionally store Token.getPositionIncrement(),
        Token.startOffset(), and Token.endOffset() information. This is controlled
        through the standard Field creation methods. All options are backward
        compatible (position and offset information is not stored by default). I added
        many new test cases to demonstrate the functionality. Two new files are
        needed: SegmentTermPositionVector and TermVectorOffsetInfo. All tests pass as
        of 8/19/04 in the AM.

        Attached should be 1 patch file plus a zip containing 2 new files.

        What is this info good for?
        1. I think the highlighter could use this info (offset) instead of reparsing
        every document at runtime
        2. Many IR algorithms need character position, etc.
        3. Others??

        Remember, the stored values are whatever your Analyzer sets on each Token
        (i.e. Token.startOffset(), Token.endOffset(), and
        Token.getPositionIncrement()). These values are controlled by the application
        author and can vary by application.
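
Editorial note: to make the highlighter use case above concrete, here is a minimal sketch of highlighting driven by stored character offsets, so the text does not need to be re-analyzed at display time. It does not use the Lucene API. The Offset class is a hypothetical stand-in for one stored start/end pair, and the sketch assumes offsets are sorted, non-overlapping, start-inclusive and end-exclusive.

```java
public class OffsetHighlightSketch {

    // Hypothetical stand-in for one stored offset pair
    // (start inclusive, end exclusive, in characters).
    static final class Offset {
        final int start;
        final int end;

        Offset(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Wraps each offset range in <b>...</b>.
    // Assumes the offsets are sorted and do not overlap.
    static String highlight(String text, Offset[] offsets) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Offset o : offsets) {
            out.append(text, cursor, o.start);
            out.append("<b>").append(text, o.start, o.end).append("</b>");
            cursor = o.end;
        }
        out.append(text, cursor, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "term vectors store term offsets";
        // Offsets of the two occurrences of "term", as an analyzer might have recorded them.
        Offset[] hits = { new Offset(0, 4), new Offset(19, 23) };
        System.out.println(highlight(text, hits));
    }
}
```

Because the offsets come from whatever the Analyzer recorded at index time, the highlighted spans are only as accurate as those values.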

        Grant Ingersoll added a comment -

        Created an attachment (id=12484)
        Patch file for new TermVector options

        Grant Ingersoll added a comment -

        Created an attachment (id=12485)
        Two new files needed for new TermVector position and offset support

        Daniel Naber added a comment -

        Unfortunately your patch won't apply anymore, as the Field class has been
        modified. We now take enumerations instead of booleans, and this patch should
        probably be adapted accordingly.

        Grant Ingersoll added a comment -

        How would you recommend I fix it? Is there a preferred way of adding these?
        It seems like the Field class is going to be overwhelmed with inner classes if we
        take this approach every time we want to add some new feature to Field.

        Daniel Naber added a comment -

        On the mailing list it has been suggested to add enumerations to
        Field.TermVector, e.g. maybe Field.TermVector.WITH_POSITIONS. Do you see any
        problem with the use of static inner classes?
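
Editorial note: the suggestion above amounts to the typesafe-enum pattern, where a static nested class with a private constructor exposes a fixed set of constants. The following is a rough sketch of that shape, not the actual Field class. Only the name WITH_POSITIONS comes from this thread. The other constants, the FieldSketch class, and its methods are assumptions made for illustration.

```java
public class FieldSketch {

    // Typesafe-enum pattern: a static nested class whose only instances
    // are the public constants declared here.
    public static final class TermVector {
        // Only WITH_POSITIONS is named in the discussion; the rest are illustrative.
        public static final TermVector NO = new TermVector("NO", false, false);
        public static final TermVector YES = new TermVector("YES", true, false);
        public static final TermVector WITH_POSITIONS =
                new TermVector("WITH_POSITIONS", true, true);

        private final String name;
        private final boolean stored;
        private final boolean positions;

        // Private constructor: callers cannot create new values.
        private TermVector(String name, boolean stored, boolean positions) {
            this.name = name;
            this.stored = stored;
            this.positions = positions;
        }

        public String toString() {
            return name;
        }
    }

    private final String name;
    private final String value;
    private final TermVector termVector;

    public FieldSketch(String name, String value, TermVector termVector) {
        if (termVector == null) {
            throw new NullPointerException("termVector must not be null");
        }
        this.name = name;
        this.value = value;
        this.termVector = termVector;
    }

    public boolean isTermVectorStored() {
        return termVector.stored;
    }

    public boolean isStorePositionWithTermVector() {
        return termVector.positions;
    }

    public static void main(String[] args) {
        FieldSketch body = new FieldSketch("body", "some text", TermVector.WITH_POSITIONS);
        System.out.println(body.name + " -> " + body.termVector
                + ", positions stored: " + body.isStorePositionWithTermVector());
    }
}
```

Adding a new storage option then means adding one constant to the nested class rather than another boolean parameter to every Field constructor, which speaks to the concern about Field accumulating overloads.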

        Grant Ingersoll added a comment -

        Created an attachment (id=12671)
        New Patch that incorporates the new Field.TermVector parameters

        Grant Ingersoll added a comment -

        The latest patch provides an upgrade to the previous Term Vector patch (dated
        8/19/04 12:10) that added support for storing offset and position
        information. The attachment containing the 2 new files (dated 8/19/04 12:11)
        is still needed for this new patch.

        Cheers,
        Grant

        Christoph Goller added a comment -

        Hi Grant,

        Thank you very much for this huge patch.
        I applied it with some changes. I would highly appreciate it
        if you could review these changes briefly.

        Christoph

        Grant Ingersoll added a comment -

        Attached is an implementation of TermPositionVector for queries. It simply extends QueryTermVector and provides position and offset information for a query.

        I implemented it for the sake of completeness (there is similar functionality on the Document side) and b/c I need it for my ApacheCon Lucene talk in December.

        I also modified some items in QueryTermVector to make some private members protected and added some documentation.

        Contents:
        patch.txt – The svn diff patch to QueryTermVector
        newFiles.tar – The new files for QueryTermPositionVector

        Thanks,
        Grant


          People

          • Assignee:
            Lucene Developers
            Reporter:
            Eric Isakson