Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      One of the goals of the new iterator-based API is to better handle sparse data. However, the current doc values writers still use a dense representation, and some of them perform naive linear scans in the nextDoc implementation.

      1. LUCENE-7474.patch
        32 kB
        Adrien Grand

          Activity

          mikemccand Michael McCandless added a comment -

          A sparse set in the nightly benchmarks is an interesting idea. Do you have a data set in mind?

          At some point I'll write up a blog post summarizing the change and I can also try to do a before (6.x) / after (upcoming 7.0) one-time performance test for that.

          otis Otis Gospodnetic added a comment -

          I was wondering how one could compare Lucene indexing (and searching) performance before and after this change. Is there a way to add a sparse dataset for the nightly benchmark and use it for both trunk and 6.x branch, so one can see the performance difference of Lucene 6.x with sparse data vs. Lucene 7.x with sparse data?

          jpountz Adrien Grand added a comment -

          I think all our benchmarks use dense data. The good news is that these changes did not seem to slow down indexing in the dense case, judging by http://people.apache.org/~mikemccand/geobench.html#index-times or http://people.apache.org/~mikemccand/lucenebench/indexing.html, or at least the slowdown is small enough that nothing is noticeable if points or terms are indexed too. Regarding search, however, this change is almost certainly going to make things slower (see e.g. http://people.apache.org/~mikemccand/lucenebench/Term.html), so I think we need to be careful about keeping the slowdown contained.

          otis Otis Gospodnetic added a comment -

          yoooohooo!
          Do the nightly builds have any tests that will exercise these new writers, the new 7.0 Codec, etc., so one can see how much speed this change gains?

          jira-bot ASF subversion and git services added a comment -

          Commit d50cf97617c88ec75fd8f4482003623db08e625e in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d50cf97 ]

          LUCENE-7474: Doc values writers should have a sparse encoding.

          mikemccand Michael McCandless added a comment -

          +1, wonderful.

          jpountz Adrien Grand added a comment -

          Here is a patch. Writers now only store actual values (not placeholders for documents that do not have a value) and documents that have a value for the field are encoded using a FixedBitSet. While this is still technically linear, this should be significantly faster in the sparse case since many documents can be skipped at once.
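
As a rough illustration of the encoding described in this comment, the sketch below buffers only the values that were actually added and tracks which documents have a value in a bit set. It is not the code from LUCENE-7474.patch: the class and method names are hypothetical, and java.util.BitSet stands in for Lucene's FixedBitSet so that the example has no dependencies. The nextDoc step relies on nextSetBit, which examines the bit set one 64-bit word at a time, so long runs of documents without a value are skipped quickly even though the scan is still linear in the worst case.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical sketch of a sparse buffer for numeric doc values (not Lucene's writer). */
final class SparseLongValuesSketch {
    /** Sentinel returned once iteration is exhausted (an assumption of this sketch). */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // One bit per document that has a value; java.util.BitSet is a stand-in for FixedBitSet.
    private final BitSet docsWithValue = new BitSet();
    // Only real values are stored, in doc ID order; no placeholders for missing documents.
    private long[] values = new long[8];
    private int count = 0;
    private int lastDoc = -1;

    /** Records a value for docId; doc IDs must be added in strictly increasing order. */
    void add(int docId, long value) {
        if (docId <= lastDoc) {
            throw new IllegalArgumentException("doc IDs must be strictly increasing");
        }
        if (count == values.length) {
            values = Arrays.copyOf(values, count * 2);
        }
        values[count++] = value;
        docsWithValue.set(docId);
        lastDoc = docId;
    }

    Cursor iterator() {
        return new Cursor();
    }

    /** Iterates over the documents that have a value, in doc ID order. */
    final class Cursor {
        private int doc = -1;
        private int ord = -1; // index of the current document's value in the values array

        int nextDoc() {
            if (doc == NO_MORE_DOCS) {
                return doc;
            }
            // nextSetBit scans whole 64-bit words, so empty stretches are skipped in bulk.
            int next = docsWithValue.nextSetBit(doc + 1);
            if (next < 0) {
                doc = NO_MORE_DOCS;
            } else {
                doc = next;
                ord++;
            }
            return doc;
        }

        /** Value of the current document; only meaningful after nextDoc() returned a real doc ID. */
        long longValue() {
            return values[ord];
        }
    }
}
```

With two values added for documents 3 and 100000, the values array holds two entries rather than 100001, and the second call to nextDoc jumps from 3 to 100000 without visiting each intermediate document individually.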


            People

            • Assignee: jpountz Adrien Grand
            • Reporter: jpountz Adrien Grand
            • Votes: 0
            • Watchers: 5