Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The problems the tag index solves are slow field cache loading, slow range queries, and having to reindex an entire document just to update fields that are not tokenized.

      The tag index holds untokenized terms with a docfreq of 1 in a term-dictionary-like index file. The file also stores the docs per term, similar to LUCENE-1278. The index also has a transaction log and an in-memory index for realtime updates to the tags. The transaction log is periodically merged into the existing tag term dictionary index file.

      The TagIndexReader extends IndexReader and is unified with a regular index by ParallelReader. There is a doc-id-to-terms skip pointer file for the IndexReader.document method; this file contains a pointer for looking up the terms for a document.
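
      For illustration, the unified view might be obtained along these lines (a sketch only; TagIndexReader is the class proposed here, and mainDir/tagDir are assumed Directory instances):

      // Sketch: expose the tag fields alongside the regular index.
      ParallelReader reader = new ParallelReader();
      reader.add(IndexReader.open(mainDir));   // the regular, mostly static index
      reader.add(new TagIndexReader(tagDir));  // the proposed tag index reader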

      There is a higher level class that encapsulates writing a document with tag fields to IndexWriter and TagIndexWriter. This requires a hook into IndexWriter to coordinate doc ids and flushing segments to disk.

      The writer class could be as simple as:

      public class TagIndexWriter {

        public void add(Term term, DocIdSetIterator iterator) {
        }

        public void delete(Term term, DocIdSetIterator iterator) {
        }
      }
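
      For illustration, tagging a set of documents might then look like this (a sketch; OpenBitSetIterator is used only as a convenient DocIdSetIterator source, and the no-arg constructor is an assumption):

      // Sketch: tag docs 0, 2 and 3 with the term tag:foo.
      OpenBitSet docs = new OpenBitSet();
      docs.set(0);
      docs.set(2);
      docs.set(3);
      TagIndexWriter writer = new TagIndexWriter();
      writer.add(new Term("tag", "foo"), new OpenBitSetIterator(docs));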
      
      1. lucene-1292.patch
        108 kB
        Jason Rutherglen

        Activity

        Transition: Open → Closed
        Time in source status: 978d 7h 9m
        Execution times: 1
        Last executer: Jason Rutherglen
        Last execution date: 24/Jan/11 21:16
        Jason Rutherglen made changes -
        Status: Open → Closed
        Resolution: Won't Fix
        Jason Rutherglen added a comment -

        Won't be working on these and they're old

        Jason Rutherglen added a comment -

        I decided to rework this to not use a transaction log. One reason is that a transaction log would need to be global rather than per segment. The other is that the current architecture used the transaction log to read from. Given the small size of the postings, I decided it would be best to keep the changed postings blocks in memory; this way there is no performance hit from the realtime update feature. When RAM usage hits a specified limit, a new term + postings file can be written.

        One issue is how to implement skipTo over the blocks. The postings blocks use the SkipListWriter implementation, but what is the best way to skip over the blocks themselves? One approach is to use the modified binary search from InstantiatedIndex described at http://ochafik.free.fr/blog/?p=106. The other is to implement a skip list over the blocks; however, I am not sure SkipListWriter will work for this, given it is tied to using file pointers. If SkipListWriter were used, would it be best to have it skip over the blocks using only the maxDoc for each block? This is the approach the binary search method would use.

        At the beginning of each term's list of posting blocks, there is a set of BlockInfos describing the blocks. Some blocks are missing because they don't have postings. Blocks are predetermined to span a document number range rather than a length in bytes range. This is to not have to deal with document number overlap issues between blocks when an update occurs. Right now I am thinking the default block size should be 4000 docs. This should yield a block size in bytes of a little over 4000 at the maximum. It is unclear how the size of the block will affect the performance of the skip process.
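
        As a sketch of the binary search option (BlockInfo with startDoc/endDoc fields follows the description above; the method itself is illustrative, not from the patch):

        // Sketch: find the index of the first block whose doc range could
        // contain a doc >= target. Blocks are sorted by startDoc and their
        // ranges do not overlap; returns blocks.length if target is past
        // the last block.
        static int findBlock(BlockInfo[] blocks, int target) {
          int lo = 0, hi = blocks.length - 1;
          while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (target > blocks[mid].endDoc) {
              lo = mid + 1;       // target lies after this block
            } else if (target < blocks[mid].startDoc) {
              hi = mid - 1;       // target lies before this block
            } else {
              return mid;         // target falls inside this block's range
            }
          }
          return lo;              // first block starting after target
        }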

        Christopher Morris added a comment -

        Not sure I understand the question. There is no database or in-memory data.

        There are three indexes: the static index (s), the pseudo index (p), and the tag index (t). The index p doesn't exist - it is spoofed from t. The user view is of a ParallelReader wrapping s and p.

        Example:

        Populate index s with four documents. It is a requirement that the documents have a primary key field (a field known to contain a unique integer). The PKs for these documents are 1, 2, 3, and 4, in ascending order starting from the document with docID 0.

        The user tags the documents with PKs 1, 3, and 4 with the Term("tag","foo").

        The index p looks like:

        Document 0
        ==========
        tag : foo

        Document 2
        ==========
        tag : foo

        Document 3
        ==========
        tag : foo

        The index t looks like the following (where {n} means the term is stored as occurring at term position 'n'; the term position is (ab)used here to store the primary key for use by p.termDocs()):

        Document: 0
        ==========
        tag : foo_ADD{1}, foo_ADD{3}, foo_ADD{4}
        PK_tag : 1_foo,3_foo,4_foo

        The user now deletes the Term("tag", "foo") from the document with PK 3.

        The index p looks like:

        Document 0
        ==========
        tag : foo

        Document 3
        ==========
        tag : foo

        The index t looks like the following:

        Document: 0
        ==========
        tag : foo_ADD{1}, foo_ADD{3}, foo_ADD{4}
        PK_tag : 1_foo,3_foo,4_foo

        Document: 1
        ==========
        tag : foo_DEL{3}
        PK_tag : 3_foo

        The method p.docFreq(new Term("tag", "foo")) runs the following block of code (some fail-fast tests and tidy-up code removed for clarity):

        int docFreq = 0;

        TermDocs additions = t.termDocs(new Term("tag","foo_ADD")); // a TermDocs with one doc; the doc has a freq() of 3
        while (additions.next()) { // true; once
          docFreq += additions.freq(); // +3
        }

        TermDocs deletions = t.termDocs(new Term("tag","foo_DEL")); // a TermDocs with one doc; the doc has a freq() of 1
        while (deletions.next()) { // true; once
          docFreq -= deletions.freq(); // -1
        }

        return docFreq; // 2

        I'd hoped to have test results for very dynamic terms, but I forgot to check that the server had any disk space available before I started the test.

        Jason Rutherglen made changes -
        Attachment lucene-1292.patch [ 12383623 ]
        Jason Rutherglen added a comment -

        lucene-1292.patch

        Basic class structure and file formats. Stub for LRU cache. Added DocState parameter to IndexWriter.addDocument to obtain docid. Needs TagSegmentMerger class. Package should possibly be moved to org.apache.lucene.index.tag.

        Syncing with IndexWriter's merging process would currently be too complex because of the way IndexWriter sometimes handles deletes and updated documents in RAM before flushing. The system I will use the tag index with manages the segment merging process and the deletion of documents outside of IndexWriter.

        Jason Rutherglen added a comment -

        RDBMS or in memory?

        Christopher Morris added a comment -

        The dynamic index is an ordinary Lucene index, wrapped so that it appears dynamic.

        Each modification to a dynamic term creates a document. The document has two fields: one is the dynamic term field; the other is named "PK" followed by the dynamic term field name. The "PK" field contains the primary key followed by the term. The dynamic term field contains the dynamic term text followed by either ADD or DEL, with the term position representing the primary key. There can be multiple additions and deletions in the same document.

        The indexReader.docFreq() for a dynamic term is the sum of the TermDocs freq() values for the dynamic term's ADD entries minus the sum for its DEL entries. terms() is the underlying terms() for all fields not starting with "PK", filtered by whether the dynamic term still exists (docFreq() > 0). Retrieving the terms for a primary key/field combination involves the TermEnum for all terms with field ("PK" + field) whose text starts with the primary key. Terms with an odd docFreq() still exist (they have been added more times than deleted). TermDocs involves using TermPositions for ADD and DEL to seek through the index, toggling the primary keys as exist/not exist.
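
        A rough sketch of that retrieval step (field naming and the pk_term layout follow the example above; reader, field and pk are assumed inputs):

        // Sketch: collect the tags still attached to primary key pk. A term
        // in field "PK"+field whose docFreq() is odd has been added more
        // times than deleted, so the tag still exists.
        List tags = new ArrayList();
        TermEnum terms = reader.terms(new Term("PK" + field, pk + "_"));
        try {
          do {
            Term t = terms.term();
            if (t == null || !t.field().equals("PK" + field)
                || !t.text().startsWith(pk + "_")) {
              break; // past the terms for this primary key
            }
            if (reader.docFreq(t) % 2 == 1) {
              tags.add(t.text().substring(pk.length() + 1)); // strip "pk_"
            }
          } while (terms.next());
        } finally {
          terms.close();
        }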

        To test performance I used the Enron corpus (~500,000 docs), which has a folder structure (3503 nodes, max depth ~6). I ran queries for each level in the hierarchy (PrefixQuery) and saved the results as a dynamic term.

        The results for a TermQuery search on the dynamic term, compared to the original query, varied from identical to four times slower, in a shark's-tooth pattern with a frequency of 125 queries. The shark's-tooth pattern does not match folder depth (its cause is currently unknown).

        I am currently running a similar test for dynamic terms that have been modified repeatedly. As above, but all nodes are set to the results for the first node, then all but the first are set to the value of the second, and so on. The last node will have been modified 3503 times. Modifying this amount of data is slow.

        I should be able to release the code if you wanted a direct comparison. The external APIs are similar: startBulkLoad(), addTerm(term, primary key), deleteTerm(term,primary key), acceptBulkLoad().

        Jason Rutherglen added a comment -

        What is the dynamic index structure? How is the performance?

        Christopher Morris added a comment -

        I've developed something similar with core Lucene (term changes add a document with add terms and delete terms, where the term positions represent the docID). The dynamic index is synchronised to the semi-static index with a primary key field (in the static index). I was going to submit it back, but it's redundant.

        Our use cases are user tagging (user x has read document y) and query tagging (tag all documents that match a query). The dynamic terms appear in the TermEnum, TermDocs, and TermFreqVector.

        I think you've covered all that, but I wanted to check.

        Jason Rutherglen added a comment -

        Terms that have many docs will store the docs + skip list in blocks. This is to avoid having to rewrite a docs + skip list that is many kilobytes long for an update that only alters some of the docs. Only the blocks that are changing will be updated: they will be appended to the transaction log and the in-memory file pointers updated. When the transaction log reaches a certain percentage of the size of the existing tag.tii file, the whole tag.tii file will be rewritten.

        When a TermEnum iteration is being performed, the in-memory alterations are consulted. If a term, for example, no longer has any docs, the term is skipped. The TermDocs iteration does the same by checking whether it should be reading the current block from the tag.tii or the tag.tlg file. The block skipTo and iteration code functions the same as MultiTermDocs.

        The concern is the optimal number of blocks per term and the effect on skipTo performance. Because only two files are involved, the file switching that can be an issue for MultiTermDocs skipTo over many segments should not be a problem here; seeks within the same file are faster than seeks across multiple files.

        tag.tii

        TermInfos --> <TagTermInfo>^TermCount
        TagTermInfo --> <Term, DocFreq, NumBlocks>
        Term --> <PrefixLength, Suffix, FieldNum, TermNumber>
        BlockInfo --> <DocsBytesLength, SkipBytesLength, StartDoc, EndDoc>
        Block --> <DocDeltas, SkipData>

        tag.tlg

        Term --> <TermString>
        BlockInfo --> <DocsBytesLength, SkipBytesLength, StartDoc, EndDoc>
        Block --> <DocDeltas, SkipData>
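
        As an illustration of the BlockInfo record above (a sketch assuming plain VInt encoding through Lucene's IndexOutput; the actual patch may encode these differently):

        // Sketch: serialize one BlockInfo in the field order listed above.
        void writeBlockInfo(IndexOutput out, BlockInfo info) throws IOException {
          out.writeVInt(info.docsBytesLength); // length of the DocDeltas bytes
          out.writeVInt(info.skipBytesLength); // length of the SkipData bytes
          out.writeVInt(info.startDoc);        // first doc number the block spans
          out.writeVInt(info.endDoc);          // last doc number (could be delta-encoded)
        }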

        Jason Rutherglen added a comment -

        Tag in the sense you described. In Lucene, an untokenized field. Yes, a separate index that is dynamic while the main one is static except deletes. The main problem, though not impossible, is syncing the segments. The tag index needs the new segment docmap and a callback for each merge in SegmentMerger. The write to the tag index needs the doc id for each IndexWriter.addDocument call and a callback for each subsequent flush to disk.

        These event listeners can be used for other customized indexes.
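
        A hypothetical shape for those listeners (every name here is illustrative, not part of the patch):

        // Sketch: the callbacks a tag index would need from IndexWriter/SegmentMerger.
        public interface IndexEventListener {
          // fired by IndexWriter.addDocument with the doc id it assigned
          void documentAdded(int docId);
          // fired after buffered docs are flushed to a new segment on disk
          void segmentFlushed(String segmentName);
          // fired after a merge, with the map from old doc ids to new doc ids
          void segmentsMerged(int[] docMap);
        }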

        Otis Gospodnetic added a comment -

        Making sure I understand:

        What you are calling "Tag Index"... is that "Tag" in the "Web 2.0 sense" of a tag? Something like the "ocean" tag here: http://www.simpy.com/user/otis/search/ocean ?

        If so, is the main idea and the reason behind this the maintenance of two parallel indices, one mostly static (fields tend to be big and do not change often) and one more dynamic (e.g. containing a "tags" field that clients add or remove tags from)?

        Thanks.

        Jason Rutherglen added a comment -

        Looks like the hook into IndexWriter needs to get the doc id from DocumentsWriterThreadState in DocumentsWriter.updateDocument(Document doc, Analyzer analyzer, Term delTerm).

        The flush segment hook looks like it needs a callback from DocumentsWriter.flush(boolean closeDocStore).

        Jason Rutherglen created issue -

          People

          • Assignee: Unassigned
          • Reporter: Jason Rutherglen
          • Votes: 0
          • Watchers: 4
