Lucene - Core
  1. Lucene - Core
  2. LUCENE-5156

CompressingTermVectors termsEnum should probably not support seek-by-ord

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Just like term vectors before it, it has a O seek-by-term.

      But this one also advertises a seek-by-ord, only this is also O.

      This could cause e.g. checkindex to be very slow, because if termsenum supports ord it does a bunch of seeking tests. (Another solution would be to leave it, and add a boolean so checkindex never does seeking tests for term vectors, only real fields).

      However, I think its also kinda a trap, in my opinion if seek-by-ord is supported anywhere, you kinda expect it to be faster than linear time...?

        Activity

        Hide
        Adrien Grand added a comment -

        +1 this is trappy

        Show
        Adrien Grand added a comment - +1 this is trappy
        Hide
        Michael McCandless added a comment -

        +1

        Show
        Michael McCandless added a comment - +1
        Hide
        Adrien Grand added a comment -

        Patch looks good, +1

        Show
        Adrien Grand added a comment - Patch looks good, +1
        Hide
        ASF subversion and git services added a comment -

        Commit 1511009 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1511009 ]

        LUCENE-5156: remove seek-by-ord from CompressingTermVectors

        Show
        ASF subversion and git services added a comment - Commit 1511009 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1511009 ] LUCENE-5156 : remove seek-by-ord from CompressingTermVectors
        Hide
        ASF subversion and git services added a comment -

        Commit 1511014 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1511014 ]

        LUCENE-5156: remove seek-by-ord from CompressingTermVectors

        Show
        ASF subversion and git services added a comment - Commit 1511014 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1511014 ] LUCENE-5156 : remove seek-by-ord from CompressingTermVectors
        Hide
        Adrien Grand added a comment -

        4.5 release -> bulk close

        Show
        Adrien Grand added a comment - 4.5 release -> bulk close
        Hide
        David Smiley added a comment -

        I can understand why this change was done – better to not support it than support something optional that should be implemented fast yet not do it fast. What if it were to be made fast, along with seekCeil() which is also implemented slowly right now too? For example, say the first time either seekCeil is called or an ord method is called, then build up an array of term start positions by ordinal, which otherwise wouldn't be done. Then you could do a binary search for seekCeil and a direct lookup for seekExact. The lazy-created array could also then be shared across repeated invocations to get Terms for the current document.

        Why bother, you might ask? I'm working on a means of having the Terms from term vectors be directly searched against by the default highlighter instead of re-inverting to MemoryIndex. I'll post a separate issue for that with code, of course, which "works" but isn't as efficient as it could be thanks to the O(N) of seekCeil on term vectors' Terms.

        Show
        David Smiley added a comment - I can understand why this change was done – better to not support it than support something optional that should be implemented fast yet not do it fast. What if it were to be made fast, along with seekCeil() which is also implemented slowly right now too? For example, say the first time either seekCeil is called or an ord method is called, then build up an array of term start positions by ordinal, which otherwise wouldn't be done. Then you could do a binary search for seekCeil and a direct lookup for seekExact. The lazy-created array could also then be shared across repeated invocations to get Terms for the current document. Why bother, you might ask? I'm working on a means of having the Terms from term vectors be directly searched against by the default highlighter instead of re-inverting to MemoryIndex. I'll post a separate issue for that with code, of course, which "works" but isn't as efficient as it could be thanks to the O(N) of seekCeil on term vectors' Terms.
        Hide
        Robert Muir added a comment -

        Thats unrelated to term vectors. We shouldnt have such caching in the default codec, it can easily blow up on a large document.

        Show
        Robert Muir added a comment - Thats unrelated to term vectors. We shouldnt have such caching in the default codec, it can easily blow up on a large document.
        Hide
        Robert Muir added a comment -

        Personally i would do such a thing with a FilterTerms + FilterReader. you just check if docid == lastDocID and you have your cache thing.

        But i dont think it should be in the default codec. I also happen to think term vectors arent a good datastructure for highlighting anyway.

        Show
        Robert Muir added a comment - Personally i would do such a thing with a FilterTerms + FilterReader. you just check if docid == lastDocID and you have your cache thing. But i dont think it should be in the default codec. I also happen to think term vectors arent a good datastructure for highlighting anyway.
        Hide
        David Smiley added a comment -

        I agree on the caching thing – that is, what I said in which you ask for Terms for the same document again. Never-mind that part – as I thought about it I realized I didn't need that after all.

        But i dont think it should be in the default codec. I also happen to think term vectors arent a good datastructure for highlighting anyway.

        The default highlighter fully respects the positions and other aspects of the user's query, unlike the other highlighters. Some applications demand that a highlight is accurate to the query, even if the query uses custom span queries that do tricks with payloads, etc. It would be nice if the other highlighters supported accurate highlights for such queries but they don't, so today, this is the applicable one for accurate highlights for complex queries. The default highlighter requires a Terms instance reflecting the current document – it currently gets it via a re-inverting into a MemoryIndex but it can be hacked to accept a Terms directly from term vectors.

        So you don't like the idea of enhancing performance of term vector seekCeil in the default codec? Is that a -1 or -0? This change I propose seems harmless – the code would not create & build up the new offset array if consuming code doesn't call seekCeil or the ord methods.

        Show
        David Smiley added a comment - I agree on the caching thing – that is, what I said in which you ask for Terms for the same document again. Never-mind that part – as I thought about it I realized I didn't need that after all. But i dont think it should be in the default codec. I also happen to think term vectors arent a good datastructure for highlighting anyway. The default highlighter fully respects the positions and other aspects of the user's query, unlike the other highlighters. Some applications demand that a highlight is accurate to the query, even if the query uses custom span queries that do tricks with payloads, etc. It would be nice if the other highlighters supported accurate highlights for such queries but they don't, so today, this is the applicable one for accurate highlights for complex queries. The default highlighter requires a Terms instance reflecting the current document – it currently gets it via a re-inverting into a MemoryIndex but it can be hacked to accept a Terms directly from term vectors. So you don't like the idea of enhancing performance of term vector seekCeil in the default codec? Is that a -1 or -0? This change I propose seems harmless – the code would not create & build up the new offset array if consuming code doesn't call seekCeil or the ord methods.
        Hide
        Robert Muir added a comment -

        I also think its ok if we fix the codec to have a faster seekExact (not by copying stuff into a large array on the first call though, just by fixing datastructure / how it access data).

        That would solve the actual problem here you have in a clean way.

        Show
        Robert Muir added a comment - I also think its ok if we fix the codec to have a faster seekExact (not by copying stuff into a large array on the first call though, just by fixing datastructure / how it access data). That would solve the actual problem here you have in a clean way.
        Hide
        Robert Muir added a comment -

        Sorry David, its not about being against speeding something up, its about how you propose implementing it.

        Copying all the data from the entire document into another array on the first read for the doc, that's a really trashy thing to do here. Instead, we should just fix it correctly, so that seekCeil() is not linear time.

        Show
        Robert Muir added a comment - Sorry David, its not about being against speeding something up, its about how you propose implementing it. Copying all the data from the entire document into another array on the first read for the doc, that's a really trashy thing to do here. Instead, we should just fix it correctly, so that seekCeil() is not linear time.
        Hide
        Robert Muir added a comment -

        Also, this should be discussed somewhere else than on an unrelated, closed, year-old issue, like on its own issue. (Sorry, its not really related to seek-by-ord, your problem is a more general one, and it wasnt created by this issue, nor even by compressing term vectors but is older than that... this issue is closed)

        Show
        Robert Muir added a comment - Also, this should be discussed somewhere else than on an unrelated, closed, year-old issue, like on its own issue. (Sorry, its not really related to seek-by-ord, your problem is a more general one, and it wasnt created by this issue, nor even by compressing term vectors but is older than that... this issue is closed)

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development