Lucene - Core / LUCENE-5300

SORTED_SET could use SORTED encoding when the field is actually single-valued

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      It would be nice to detect when a SORTED_SET field is single-valued in order to optimize storage.
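As a rough illustration of the idea, the check can be done at write time by scanning the per-document ordinals once: if no document has more than one value, the field can be stored with the SORTED encoding. This is a minimal sketch under stated assumptions, not the code from the attached patch. The class `SingleValuedDetector`, the enum `Encoding`, and the `List<long[]>` representation of per-document ordinals are all hypothetical.

```java
import java.util.List;

/** Hypothetical helper for illustration; not Lucene's actual writer code. */
final class SingleValuedDetector {

  /** Assumed labels for the two on-disk layouts discussed in this issue. */
  enum Encoding { SORTED, SORTED_SET }

  /**
   * Returns SORTED when every document holds zero or one ordinal,
   * and SORTED_SET as soon as one document holds several.
   */
  static Encoding chooseEncoding(List<long[]> ordsPerDoc) {
    for (long[] ords : ordsPerDoc) {
      if (ords.length > 1) {
        return Encoding.SORTED_SET; // genuinely multi-valued field
      }
    }
    return Encoding.SORTED; // single-valued (or empty) in every document
  }
}
```

A document with no value does not prevent the optimization in this sketch; only a document with two or more ordinals does.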

      1. LUCENE-5300.patch
        18 kB
        Adrien Grand
      2. LUCENE-5300.patch
        8 kB
        Adrien Grand

        Activity

        Adrien Grand added a comment -

        Here is a patch.

        Michael McCandless added a comment -

        +1

        I wonder if we could somehow do this "generically" so that any DVFormat (not just Lucene45) would get it ... but that can be later.

        Robert Muir added a comment -

        I'm not so happy about this:

            @Override
            public SortedSetDocValues getSortedSet(FieldInfo field) throws IOException {
              if (!ordIndexes.containsKey(field.number)) {
                // if (entry is missing.... look in another place)

        Can we just explicitly write the way the field is encoded instead of the fallback? The fallback could be confusing in the case of real bugs.

        Adrien Grand added a comment -

        It was tempting to check for ordIndexes for simplicity, but I agree it is safer to explicitly write the format. Here is a patch that fixes that.
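The difference between the two approaches can be sketched as follows: rather than inferring the encoding from a missing map entry, the writer records a format tag in the field's metadata and the reader rejects any tag it does not recognize. This is only a sketch of the general technique. The class `SortedSetFormatTag`, the constants `WITH_ADDRESSES` and `SINGLE_VALUED_SORTED`, and the five-byte entry layout are assumptions made for illustration and are not taken from the patch.

```java
import java.nio.ByteBuffer;

/** Hypothetical metadata entry: a field number followed by a format tag. */
final class SortedSetFormatTag {

  // Assumed tag values, chosen for this example only.
  static final byte WITH_ADDRESSES = 0;       // regular multi-valued layout
  static final byte SINGLE_VALUED_SORTED = 1; // stored like a SORTED field

  /** Writes the field number and the explicit format tag. */
  static byte[] writeEntry(int fieldNumber, boolean singleValued) {
    ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + 1);
    buf.putInt(fieldNumber);
    buf.put(singleValued ? SINGLE_VALUED_SORTED : WITH_ADDRESSES);
    return buf.array();
  }

  /** Reads the tag back and fails loudly on anything unexpected. */
  static byte readFormat(byte[] entry) {
    ByteBuffer buf = ByteBuffer.wrap(entry);
    buf.getInt(); // skip the field number
    byte format = buf.get();
    switch (format) {
      case WITH_ADDRESSES:
      case SINGLE_VALUED_SORTED:
        return format;
      default:
        // A corrupt or buggy entry surfaces here instead of silently
        // falling through to another code path.
        throw new IllegalStateException("Unknown SORTED_SET format: " + format);
    }
  }
}
```

With an explicit tag, a missing or corrupt entry shows up as an error at read time, which addresses the concern that a fallback could hide real bugs.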

        Robert Muir added a comment -

        +1.

        Somewhat related: SingletonSortedSetDocValues is public, and I think it's used by a few codecs (maybe also FieldCacheImpl). Maybe it's fair to add a getter here to access the wrapped SortedDocValues?

        It sounds ugly/stupid, but it might help some low-level code (like DocValuesFaceting in Solr) that already has two specializations for Sorted and SortedSet anyway. But I'm not hung up on this and am totally happy for it to stay all hidden too.
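To make the suggestion concrete, here is a minimal sketch of how a getter on a singleton wrapper could let calling code pick a single-valued fast path. The types `SimpleSorted`, `SimpleSortedSet`, `SingletonSet`, and `FacetCounter` are simplified stand-ins invented for this example. They do not mirror the real Lucene or Solr signatures, and the `unwrap()` getter is an assumed name.

```java
/** Simplified stand-in for a single-valued view; not Lucene's interface. */
interface SimpleSorted {
  int ord(int docId); // -1 when the document has no value
}

/** Simplified stand-in for a multi-valued view; not Lucene's interface. */
interface SimpleSortedSet {
  long[] ords(int docId);
}

/** Exposes a single-valued source through the multi-valued view. */
final class SingletonSet implements SimpleSortedSet {
  private final SimpleSorted wrapped;

  SingletonSet(SimpleSorted wrapped) {
    this.wrapped = wrapped;
  }

  @Override
  public long[] ords(int docId) {
    int ord = wrapped.ord(docId);
    return ord < 0 ? new long[0] : new long[] {ord};
  }

  /** The proposed getter: gives callers the wrapped single-valued view. */
  SimpleSorted unwrap() {
    return wrapped;
  }
}

/** Hypothetical caller that already has two specializations. */
final class FacetCounter {
  static int[] count(SimpleSortedSet values, int maxDoc, int valueCount) {
    int[] counts = new int[valueCount];
    if (values instanceof SingletonSet) {
      // Fast path: one ordinal lookup per document, no array allocation.
      SimpleSorted single = ((SingletonSet) values).unwrap();
      for (int doc = 0; doc < maxDoc; doc++) {
        int ord = single.ord(doc);
        if (ord >= 0) {
          counts[ord]++;
        }
      }
    } else {
      // Generic path: iterate over every ordinal of every document.
      for (int doc = 0; doc < maxDoc; doc++) {
        for (long ord : values.ords(doc)) {
          counts[(int) ord]++;
        }
      }
    }
    return counts;
  }
}
```

Both paths produce the same counts; the getter only lets the caller skip the multi-valued iteration when the data is known to be single-valued.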

        ASF subversion and git services added a comment -

        Commit 1535296 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1535296 ]

        LUCENE-5300: Optimized SORTED_SET storage for fields which are single-valued.

        ASF subversion and git services added a comment -

        Commit 1535298 from Adrien Grand in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1535298 ]

        LUCENE-5300: Optimized SORTED_SET storage for fields which are single-valued.

        Adrien Grand added a comment -

        > Somewhat related: SingletonSortedSetDocValues is public, and I think it's used by a few codecs (maybe also FieldCacheImpl). Maybe it's fair to add a getter here to access the wrapped SortedDocValues?
        >
        > It sounds ugly/stupid, but it might help some low-level code (like DocValuesFaceting in Solr) that already has two specializations for Sorted and SortedSet anyway. But I'm not hung up on this and am totally happy for it to stay all hidden too.

        I think it is fair. I opened LUCENE-5304 for this.

        Adrien Grand added a comment -

        Thanks Mike and Robert for the reviews!


          People

          • Assignee: Adrien Grand
          • Reporter: Adrien Grand
          • Votes: 0
          • Watchers: 3