Lucene - Core
  1. Lucene - Core
  2. LUCENE-4299

No way to find term vectors options at read time

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The problem is simple:

      1. term vectors can be configured "per-field-per-document", meaning for the "body" field, document 0 can have them, document 1 maybe doesnt at all, document 2 maybe has offsets (no positions), and so on. To me this is not a useful feature at all, no one has ever mentioned a single use case for this, and it just makes our code more complicated. but it is what it is (for this issue)
      2. there is no way to discover these options for a field of a document, you have to do things like 'peek ahead' to see the first position of the first term is -1, or same for offsets (except worse, we used to allow anything in offsets so -1 might be an actual value). This makes the merging code really hairy, and tough on end consumers.

      So I propose that instead of returning Terms for Vectors, we return VectorTerms (extends Terms), which just adds hasOffsets() and hasPositions(). e.g. lucene40 already knows this from the bits for the field/doc pair and just returns what it knows.

      1. LUCENE-4299.patch
        22 kB
        Robert Muir
      2. LUCENE-4299.patch
        19 kB
        Robert Muir
      3. LUCENE-4299.patch
        18 kB
        Robert Muir
      4. LUCENE-4299.patch
        14 kB
        Robert Muir
      5. LUCENE-4299.patch
        15 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        here's a prototype patch: all tests pass.

        If we are ok with the idea, i can clean up the rest.

        Show
        Robert Muir added a comment - here's a prototype patch: all tests pass. If we are ok with the idea, i can clean up the rest.
        Hide
        Robert Muir added a comment -

        I would also clean up the merging and checkindex code too... thats the worst and it would become a lot simpler here.

        Show
        Robert Muir added a comment - I would also clean up the merging and checkindex code too... thats the worst and it would become a lot simpler here.
        Hide
        Robert Muir added a comment -

        an alternative is to add this information just to Terms, but then for postings its redundant with FieldInfos. So I don't know if thats any better.

        Show
        Robert Muir added a comment - an alternative is to add this information just to Terms, but then for postings its redundant with FieldInfos. So I don't know if thats any better.
        Hide
        Robert Muir added a comment -

        ok second idea seems simpler, just adding these to Terms: here's a patch.

        I didn't improve tv merging or checkindex yet.

        Show
        Robert Muir added a comment - ok second idea seems simpler, just adding these to Terms: here's a patch. I didn't improve tv merging or checkindex yet.
        Hide
        Robert Muir added a comment -

        Updated patch: really simplifies the default TermVectorsWriter.merge impl.

        Show
        Robert Muir added a comment - Updated patch: really simplifies the default TermVectorsWriter.merge impl.
        Hide
        Robert Muir added a comment -

        updated patch fixing a pretty big inefficiency in highlighter, because its hasPositions(termvectors) was inefficient before, it had to actually clone an indexinput, read term bytes, freqs, positions, offsets, just to see if the first pos was -1.

        Show
        Robert Muir added a comment - updated patch fixing a pretty big inefficiency in highlighter, because its hasPositions(termvectors) was inefficient before, it had to actually clone an indexinput, read term bytes, freqs, positions, offsets, just to see if the first pos was -1.
        Hide
        Robert Muir added a comment -

        added comparisons of these options in TestDuelingCodecs, and tried to simplify CheckIndex (only slightly) since we know these values up front.

        I think this is ready.

        Show
        Robert Muir added a comment - added comparisons of these options in TestDuelingCodecs, and tried to simplify CheckIndex (only slightly) since we know these values up front. I think this is ready.
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development