Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      OrdinalMap is very useful when computing top terms on a multi-index segment. However I've seen it being occasionally slow to build, which was either making facets (when the ordinals map is computed lazily) or reopen (when computed eagerly) slow. So out of curiosity, I tried to profile ordinal map building on a simple index: 10M random strings of length between 0 and 20 stored as a SORTED doc values field. The index has 19 segments. The bottleneck was MultiTermsEnum.next() (by far) due to lots of BytesRef comparisons (UTF8SortedAsUnicodeComparator).

      MultiTermsEnum stores sub enums in two different places:

      • top: a simple array containing all enums on the current term
      • queue: a queue for enums that are not exhausted yet but beyond the current term.

      A non-exhausted enum is in exactly one of these data-structures. When moving to the next term, MultiTermsEnum advances all enums in top, then adds them to queue and finally, pops all enum that are on the same term back into top.

      We could save reorderings of the priority queue by not removing entries from the priority queue and then calling updateTop to advance enums which are on the current term. This is already what we do for disjunctions of doc IDs in DISIPriorityQueue.

      On the index described above and current trunk, building an OrdinalMap has to call UTF8SortedAsUnicodeComparator.compare 80114820 times and runs in 1.9 s. With the change, it calls UTF8SortedAsUnicodeComparator.compare 36900694 times, BytesRef.equals 16297638 times and runs in 1.4s (~26% faster).

      1. LUCENE-6690.patch
        4 kB
        Adrien Grand
      2. OrdinalMapBuildBench.java
        2 kB
        Adrien Grand

        Activity

        Hide
        Adrien Grand added a comment -

        Here is the benchmark I've been using. It's certainly not great but I don't think it's too bad either.

        Show
        Adrien Grand added a comment - Here is the benchmark I've been using. It's certainly not great but I don't think it's too bad either.
        Hide
        Adrien Grand added a comment -

        And here is the patch.

        Show
        Adrien Grand added a comment - And here is the patch.
        Hide
        Uwe Schindler added a comment -

        Good idea!

        Show
        Uwe Schindler added a comment - Good idea!
        Hide
        Martijn van Groningen added a comment -

        +1 this looks great. I test it out on an index with a single doc values field with 100M unique values and 42 segments and the rebuilding time dropped from ~17000 ms to ~10800ms.

        Show
        Martijn van Groningen added a comment - +1 this looks great. I test it out on an index with a single doc values field with 100M unique values and 42 segments and the rebuilding time dropped from ~17000 ms to ~10800ms.
        Hide
        ASF subversion and git services added a comment -

        Commit 1692253 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1692253 ]

        LUCENE-6690: Speed up MultiTermsEnum.next().

        Show
        ASF subversion and git services added a comment - Commit 1692253 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1692253 ] LUCENE-6690 : Speed up MultiTermsEnum.next().
        Hide
        ASF subversion and git services added a comment -

        Commit 1692255 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1692255 ]

        LUCENE-6690: Speed up MultiTermsEnum.next().

        Show
        ASF subversion and git services added a comment - Commit 1692255 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1692255 ] LUCENE-6690 : Speed up MultiTermsEnum.next().
        Hide
        Adrien Grand added a comment -

        Thanks Martijn for confirming it sped up global ordinals loading in your case too! I just committed to trunk and 5.x.

        Show
        Adrien Grand added a comment - Thanks Martijn for confirming it sped up global ordinals loading in your case too! I just committed to trunk and 5.x.
        Hide
        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release

        Show
        Shalin Shekhar Mangar added a comment - Bulk close for 5.3.0 release

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development