Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7101

OfflineSorter's merging is O(N^2) cost for large sorts

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0, 6.1, 7.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Our OfflineSorter acts just like Lucene, writing small initial
      segments of sorted values (from what it was able to sort at once in
      heap), periodically merging them when there are too many, and doing a
      forceMerge(1) in the end.

      But the merge logic is too simplistic today, resulting in O(N^2)
      cost. Smallish sorts usually won't hit it, because the default 128
      merge factor is so high, but e.g. the new 2B points tests do hit the
      N^2 behavior. I suspect the high merge factor hurts performance (OS
      struggles to use what free RAM it has to read-ahead on 128 files), and
      also risks file descriptor exhaustion.

      I think we should implement a simple log merge policy for it, and drop
      its default merge factor to 10.

      1. LUCENE-7101.patch
        10 kB
        Michael McCandless

        Activity

        Hide
        dweiss Dawid Weiss added a comment -

        Sorry for being dim, what's the scenario of hitting N^2 complexity here?

        Show
        dweiss Dawid Weiss added a comment - Sorry for being dim, what's the scenario of hitting N^2 complexity here?
        Hide
        mikemccand Michael McCandless added a comment -

        To get log(N) behavior you need to merge N segments that are "close" to the same size on each merge...

        E.g., say you are sorting "plenty" of data, with only 16 MB heap buffer to work with.

        Today, we will write 128 (default merge factor) 16 MB segments, and then merge those down to a single 2 GB (= 128 * 16 MB) segment. Next we write another 127 16 MB segments, and merge the 2 GB segment, plus 127 16 MB segments, into a 3.98 GB segment. Then another 127 16 MB segments, and merge that into a 5.97 GB segment, etc.

        So the net number of MB copied by merging will be something like:

        (128 + 0*127)*16 + (128 + 1*127)*16 + (128 + 2*127)*16 + (128 + 3*127)*16 + ...

        The equation has a linear number of terms vs your input size (input data divided by 1.98 GB, or so?).

        But this equation asymptotes towards O(N^2):

        It rewrites to:

        16 * 128 + 16 * (0*127 + 1*127 + 2*127 + 3*127 + ...)

        and rewrites to:

        16 * 128 + 16 * 127 * (0 + 1 + 2 + 3 + ...)

        where the number of terms in that last part is linear vs your input data size.

        So this bug only happens if you sort enough data, vs. your allowed heap usage, to require many merges, so the too-large 128 merge factor "hides" this problem for most use cases.

        Show
        mikemccand Michael McCandless added a comment - To get log(N) behavior you need to merge N segments that are "close" to the same size on each merge... E.g., say you are sorting "plenty" of data, with only 16 MB heap buffer to work with. Today, we will write 128 (default merge factor) 16 MB segments, and then merge those down to a single 2 GB (= 128 * 16 MB) segment. Next we write another 127 16 MB segments, and merge the 2 GB segment, plus 127 16 MB segments, into a 3.98 GB segment. Then another 127 16 MB segments, and merge that into a 5.97 GB segment, etc. So the net number of MB copied by merging will be something like: (128 + 0*127)*16 + (128 + 1*127)*16 + (128 + 2*127)*16 + (128 + 3*127)*16 + ... The equation has a linear number of terms vs your input size (input data divided by 1.98 GB, or so?). But this equation asymptotes towards O(N^2): It rewrites to: 16 * 128 + 16 * (0*127 + 1*127 + 2*127 + 3*127 + ...) and rewrites to: 16 * 128 + 16 * 127 * (0 + 1 + 2 + 3 + ...) where the number of terms in that last part is linear vs your input data size. So this bug only happens if you sort enough data, vs. your allowed heap usage, to require many merges, so the too-large 128 merge factor "hides" this problem for most use cases.
        Hide
        mikemccand Michael McCandless added a comment -

        Patch, adding a trivial "log merge policy" to OfflineSorter, and changing its default merge factor from 128 to 10.

        I also fixed the 2B points tests to not "cheat" by giving BKD more heap than the defaults, and improved BKDWriter's temp file naming

        I'm running Test2BBKDPoints.test2D now ...

        Show
        mikemccand Michael McCandless added a comment - Patch, adding a trivial "log merge policy" to OfflineSorter , and changing its default merge factor from 128 to 10. I also fixed the 2B points tests to not "cheat" by giving BKD more heap than the defaults, and improved BKDWriter 's temp file naming I'm running Test2BBKDPoints.test2D now ...
        Hide
        dweiss Dawid Weiss added a comment -

        Thanks Mike, this was detailed. Patch looks good at first glance.

        Show
        dweiss Dawid Weiss added a comment - Thanks Mike, this was detailed. Patch looks good at first glance.
        Hide
        mikemccand Michael McCandless added a comment -

        I'm running Test2BBKDPoints.test2D now ...

        It passed (after 12.2 hours, on spinning disk)!

        Show
        mikemccand Michael McCandless added a comment - I'm running Test2BBKDPoints.test2D now ... It passed (after 12.2 hours, on spinning disk)!
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 82c06190a35a8159288c2fb48d8d38d6d81dbbf2 in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=82c0619 ]

        LUCENE-7101: OfflineSorter had O(N^2) merge cost, and used too many temporary file descriptors, for large sorts

        Show
        jira-bot ASF subversion and git services added a comment - Commit 82c06190a35a8159288c2fb48d8d38d6d81dbbf2 in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=82c0619 ] LUCENE-7101 : OfflineSorter had O(N^2) merge cost, and used too many temporary file descriptors, for large sorts
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 287d6c8be6a75f43ff298483ef79e8226f60d138 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=287d6c8 ]

        LUCENE-7101: OfflineSorter had O(N^2) merge cost, and used too many temporary file descriptors, for large sorts

        Show
        jira-bot ASF subversion and git services added a comment - Commit 287d6c8be6a75f43ff298483ef79e8226f60d138 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=287d6c8 ] LUCENE-7101 : OfflineSorter had O(N^2) merge cost, and used too many temporary file descriptors, for large sorts
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 6168fe1afb40286a7515b5909c3eb41db1ab6d00 in lucene-solr's branch refs/heads/branch_6_0 from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6168fe1 ]

        LUCENE-7101: OfflineSorter had O(N^2) merge cost, and used too many temporary file descriptors, for large sorts

        Conflicts:
        lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java
        lucene/core/src/test/org/apache/lucene/util/bkd/Test2BBKDPoints.java

        Show
        jira-bot ASF subversion and git services added a comment - Commit 6168fe1afb40286a7515b5909c3eb41db1ab6d00 in lucene-solr's branch refs/heads/branch_6_0 from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6168fe1 ] LUCENE-7101 : OfflineSorter had O(N^2) merge cost, and used too many temporary file descriptors, for large sorts Conflicts: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java lucene/core/src/test/org/apache/lucene/util/bkd/Test2BBKDPoints.java
        Hide
        hossman Hoss Man added a comment -

        Manually correcting fixVersion per Step #S6 of LUCENE-7271

        Show
        hossman Hoss Man added a comment - Manually correcting fixVersion per Step #S6 of LUCENE-7271

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development