Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The general idea is a docvalues parallel to FieldCache.getDocTermOrds/UnInvertedField.

      Currently this functionality is used in, e.g., grouping and join for multi-valued fields, and in Solr for faceting.
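
For readers unfamiliar with per-document term ordinals, the following is a minimal in-memory sketch of the data model under discussion: a sorted dictionary of unique terms, plus an ascending list of ordinals for each document. The class and member names (`SortedSetSketch`, `dictionary`, `docOrds`, `lookup`) are hypothetical and exist only for illustration; this is not the API from the patch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; not Lucene's API or on-disk format.
final class SortedSetSketch {
    final String[] dictionary; // sorted unique terms; array index == ordinal
    final int[][] docOrds;     // per document: ascending, de-duplicated ordinals

    SortedSetSketch(List<List<String>> docs) {
        // Build the shared dictionary from every value of every document.
        TreeSet<String> unique = new TreeSet<>();
        for (List<String> values : docs) {
            unique.addAll(values);
        }
        dictionary = unique.toArray(new String[0]);

        docOrds = new int[docs.size()][];
        for (int doc = 0; doc < docs.size(); doc++) {
            // Sort and de-duplicate this document's values, then map each to its ordinal.
            TreeSet<String> sorted = new TreeSet<>(docs.get(doc));
            int[] ords = new int[sorted.size()];
            int i = 0;
            for (String value : sorted) {
                ords[i++] = Arrays.binarySearch(dictionary, value);
            }
            docOrds[doc] = ords;
        }
    }

    // Resolve an ordinal back to its term.
    String lookup(int ord) {
        return dictionary[ord];
    }

    public static void main(String[] args) {
        SortedSetSketch dv = new SortedSetSketch(List.of(
            List.of("red", "blue"),
            List.<String>of(),
            List.of("green", "red", "red")));
        for (int doc = 0; doc < dv.docOrds.length; doc++) {
            System.out.println("doc " + doc + " -> " + Arrays.toString(dv.docOrds[doc]));
        }
    }
}
```

Because ordinals follow term sort order, consumers such as grouping or faceting can work on small integers and resolve terms only when they need to display them.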

      1. LUCENE-4765.patch
        227 kB
        Robert Muir
      2. LUCENE-4765.patch
        97 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        Here's a prototype patch just to explore the idea.

        I only implemented this for Lucene42Codec (just uses FST + vint-encoded byte[] as a quick hack).

        We should try to think about what would be the best API and so on if we want to add this.

        Shai Erera added a comment -

        I briefly looked at the patch; it looks good. Perhaps instead of returning an iterator, we should return an IntsRef? While optimizing facets code, Mike and I observed that the "iterator-based API" is not so good for "hot code", and, e.g., if we want to use this for faceting, a bulk API would be better, I think.

        Same question for indexing: in facets, we get all the ordinals at once, so is it possible to have a field which takes a list of ordinals instead of adding many instances of SortedSetDVField?
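
The two access styles being compared can be sketched as follows. This is a hedged illustration, not code from the patch: `OrdAccessSketch`, `IntSlice`, `getOrds`, and the counting helpers are hypothetical names, and `IntSlice` is only a stand-in for an IntsRef-like (array, offset, length) holder. The iterator style costs one method call per ordinal; the bulk style fills a reusable buffer once per document and lets the caller loop over a plain array.

```java
import java.util.Arrays;

// Hypothetical illustration of iterator-style vs. bulk-style ordinal access.
final class OrdAccessSketch {
    static final long NO_MORE_ORDS = -1;

    private final int[][] docOrds; // per-document ascending ordinals
    private int[] current = new int[0];
    private int pos;

    OrdAccessSketch(int[][] docOrds) {
        this.docOrds = docOrds;
    }

    // Iterator style: position on a document, then pull ordinals one at a time.
    void setDocument(int doc) {
        current = docOrds[doc];
        pos = 0;
    }

    long nextOrd() {
        return pos < current.length ? current[pos++] : NO_MORE_ORDS;
    }

    // Bulk style: a reusable slice, loosely modeled on an IntsRef-like holder.
    static final class IntSlice {
        int[] ints = new int[0];
        int offset;
        int length;
    }

    void getOrds(int doc, IntSlice reuse) {
        int[] src = docOrds[doc];
        if (reuse.ints.length < src.length) {
            reuse.ints = new int[src.length]; // grow the buffer only when needed
        }
        System.arraycopy(src, 0, reuse.ints, 0, src.length);
        reuse.offset = 0;
        reuse.length = src.length;
    }

    // Count how many documents contain each ordinal, iterator style.
    static int[] countWithIterator(OrdAccessSketch dv, int numDocs, int numOrds) {
        int[] counts = new int[numOrds];
        for (int doc = 0; doc < numDocs; doc++) {
            dv.setDocument(doc);
            long ord;
            while ((ord = dv.nextOrd()) != NO_MORE_ORDS) {
                counts[(int) ord]++;
            }
        }
        return counts;
    }

    // Same counting, bulk style: one call per document, then a tight array loop.
    static int[] countWithBulk(OrdAccessSketch dv, int numDocs, int numOrds) {
        int[] counts = new int[numOrds];
        IntSlice slice = new IntSlice();
        for (int doc = 0; doc < numDocs; doc++) {
            dv.getOrds(doc, slice);
            for (int i = slice.offset; i < slice.offset + slice.length; i++) {
                counts[slice.ints[i]]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        OrdAccessSketch dv = new OrdAccessSketch(new int[][] {{0, 2}, {}, {1, 2}});
        System.out.println(Arrays.toString(countWithIterator(dv, 3, 3)));
        System.out.println(Arrays.toString(countWithBulk(dv, 3, 3)));
    }
}
```

Whether the bulk form is actually faster depends on the codec and the JIT; the sketch only shows the shape of the two APIs, not a benchmark.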

        Robert Muir added a comment -

        We can consider the IntsRef (I generally try to avoid these *Ref APIs).

        As far as this field for lucene/facets: I don't think it should be used by lucene/facets. That one should continue to use a single-valued byte[] because it separately tracks ordinals in a different structure.

        Shai Erera added a comment -

        Oh, you're right, I notice now that it adds strings. Will I be able to use this format by adding a different Field which has all the ordinals as-is? Or is that not the intention of this issue at all?

        Robert Muir added a comment -

        That's not the intention of this issue at all.

        Robert Muir added a comment -

        Some progress:

        • 4.2, AssertingCodec, CheckIndex, etc. are working
        • FieldCache.getDocTermOrds returns SortedSetDocValues (taken from the IndexReader if you indexed them)
        • join/ and grouping/ are ported over to the new API

        Next step:

        • SimpleText and Disk codecs.
        Robert Muir added a comment -

        For the disk codec I did a simple solution of writing the entire ordinal stream as block-packed on disk.
        The indexes to this stream are loaded into memory (MonotonicBlockPacked). I think this is a pretty good tradeoff
        that might work reasonably well for this codec: we can always think of alternative encodings.
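
The layout described above can be sketched roughly as follows, assuming one concatenated ordinal stream plus a monotonically increasing offsets array with one extra trailing entry. This is an illustration only: plain `long[]` arrays stand in for the on-disk block-packed stream and the in-memory MonotonicBlockPacked index, and the names `OrdStreamSketch`, `ordStream`, `offsets`, and `ordsFor` are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of "one ordinal stream + in-memory offsets".
final class OrdStreamSketch {
    final long[] ordStream; // every document's ordinals, concatenated (would live on disk)
    final long[] offsets;   // offsets[doc] .. offsets[doc + 1] delimits a document (kept in RAM)

    OrdStreamSketch(long[][] docOrds) {
        offsets = new long[docOrds.length + 1];
        int total = 0;
        for (int doc = 0; doc < docOrds.length; doc++) {
            offsets[doc] = total; // start of this document's ordinals
            total += docOrds[doc].length;
        }
        offsets[docOrds.length] = total; // end marker, so the last document has an upper bound

        ordStream = new long[total];
        for (int doc = 0; doc < docOrds.length; doc++) {
            System.arraycopy(docOrds[doc], 0, ordStream, (int) offsets[doc], docOrds[doc].length);
        }
    }

    // Read back one document's ordinals using only two offset lookups.
    long[] ordsFor(int doc) {
        return Arrays.copyOfRange(ordStream, (int) offsets[doc], (int) offsets[doc + 1]);
    }

    public static void main(String[] args) {
        OrdStreamSketch dv = new OrdStreamSketch(new long[][] {{0, 2}, {}, {1, 2, 5}});
        System.out.println("offsets = " + Arrays.toString(dv.offsets));
        System.out.println("doc 2   = " + Arrays.toString(dv.ordsFor(2)));
    }
}
```

Since the offsets never decrease, they compress well with a monotonic encoding, which is presumably why only that index needs to be held in memory.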

        Robert Muir added a comment -

        Updated patch showing differences between trunk and branch.

        I actually think this is ready:

        • It's a docvalues field where you can add multiple instances to a document.
        • These are dereferenced (like SORTED), except that for each document you get an ordered list of ordinals instead of a single one.
        • Transparent pass-thru to FieldCache.getDocTermOrds: this "completes" docvalues in that we have an index-time equivalent to what FieldCache provides.
        • If you ask for FieldCache.getDocTermOrds, instead of insanity for a single-valued field indexed by SORTED, you get a bridge API: so, e.g., if we wanted, we could start with a per-segment facet API for Solr that handles both single- and multi-valued fields and specialize only if it increases perf.
        • All APIs are cut over, including join/ and grouping/, though while doing this I noticed an opportunity to separately make join/ more efficient (LUCENE-4771).
        • Refactored the DocValues default merge to be simpler (also for the existing SORTED case); this also benefits from the RAM improvements Adrien committed in LUCENE-4780.
        • The Lucene42 implementation uses an FST for the ord/term "dictionary", and the per-doc ordinal list is essentially a BINARY entry (vint+dgap encoded, as this seems to be the most efficient from the tests Shai et al. have been doing with lucene/facets).
        • SimpleText, Disk, Asserting, and CheapBastard codecs are implemented.
        • I added random tests that basically index and delete lots of things and verify the contents against stored fields, and against DocTermOrds built in RAM from the indexed contents.

        Just wanted to get the patch up for review for a while. In the meantime I'll continue to make some commits: for example I want to add this type to IndexWriter's diskfull/exception/thread interrupt/etc tests and the usual rounding out of things.
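
The "vint+dgap" encoding mentioned in the list above can be illustrated with a small sketch: each document's ascending ordinals are stored as gaps from the previous ordinal, and each gap is written as a variable-length integer (7 data bits per byte, high bit set when more bytes follow). This is an assumption-laden illustration, not the Lucene42 codec code; `VIntDeltaSketch`, `encode`, and `decode` are hypothetical names, and the input is assumed to be non-negative and sorted ascending.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration of delta-gap + variable-length int encoding.
final class VIntDeltaSketch {

    // Encode ascending, non-negative ordinals as vint-encoded gaps.
    static byte[] encode(long[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (long ord : sortedOrds) {
            long gap = ord - prev; // small when ordinals are close together
            while ((gap & ~0x7FL) != 0) {
                out.write((int) ((gap & 0x7F) | 0x80)); // low 7 bits, continuation bit set
                gap >>>= 7;
            }
            out.write((int) gap); // final byte, continuation bit clear
            prev = ord;
        }
        return out.toByteArray();
    }

    // Decode the byte[] back into absolute ordinals.
    static long[] decode(byte[] bytes) {
        long[] ords = new long[bytes.length]; // upper bound: every value uses at least one byte
        int count = 0;
        long prev = 0;
        int i = 0;
        while (i < bytes.length) {
            long gap = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[i++];
                gap |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += gap; // undo the delta
            ords[count++] = prev;
        }
        return Arrays.copyOf(ords, count);
    }

    public static void main(String[] args) {
        long[] ords = {3, 5, 130, 100000};
        byte[] encoded = encode(ords);
        System.out.println(encoded.length + " bytes for " + ords.length + " ordinals");
        System.out.println(Arrays.toString(decode(encoded)));
    }
}
```

Sorting the ordinals first is what makes the gaps small, so most values fit in a single byte when a document's terms are near each other in the dictionary.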

        Adrien Grand added a comment -

        +1

        Commit Tag Bot added a comment -

        [branch_4x commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1448085

        LUCENE-4765: Multi-valued docvalues field

        Commit Tag Bot added a comment -

        [trunk commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1447999

        LUCENE-4765: Multi-valued docvalues field

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            1
            Watchers:
            4
