Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7489

Improve sparsity support of Lucene70DocValuesFormat

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Like Lucene70NormsFormat, it should be able to only encode actual values.

      1. LUCENE-7489.patch
        167 kB
        Adrien Grand
      2. LUCENE-7489.patch
        146 kB
        Adrien Grand

        Issue Links

          Activity

          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 643429de6e162fd85d5100137d01ee29e4bb614a in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=643429d ]

          LUCENE-7489: Remove one layer of abstraction in binary doc values and single-valued numerics.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 643429de6e162fd85d5100137d01ee29e4bb614a in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=643429d ] LUCENE-7489 : Remove one layer of abstraction in binary doc values and single-valued numerics.
          Hide
          mikemccand Michael McCandless added a comment -

          That's awesome progress on sort performance! Thanks Adrien Grand.

          Show
          mikemccand Michael McCandless added a comment - That's awesome progress on sort performance! Thanks Adrien Grand .
          Hide
          jpountz Adrien Grand added a comment -

          It looks like it worked, sorting by date time now seems a bit faster than it was before this change was pushed, but still slower than before we switched to an iterator API: http://people.apache.org/~mikemccand/lucenebench/TermDTSort.html

          The surprise to me is more that sorting by title looks faster than it was before we switched to an iterator API: http://people.apache.org/~mikemccand/lucenebench/TermTitleSort.html, which is good news.

          Show
          jpountz Adrien Grand added a comment - It looks like it worked, sorting by date time now seems a bit faster than it was before this change was pushed, but still slower than before we switched to an iterator API: http://people.apache.org/~mikemccand/lucenebench/TermDTSort.html The surprise to me is more that sorting by title looks faster than it was before we switched to an iterator API: http://people.apache.org/~mikemccand/lucenebench/TermTitleSort.html , which is good news.
          Hide
          jpountz Adrien Grand added a comment -

          The only difference that I could find is that we now wrap twice instead of only once before when gcd compression is enabled. I changed it, which yielded a ~2% improvement on wikimedium1m. This is far from the ~8% that the nightly benchmarks report, but it could be that the differences in the dataset explain it. I'll keep watching this benchmark over the next days.

          Show
          jpountz Adrien Grand added a comment - The only difference that I could find is that we now wrap twice instead of only once before when gcd compression is enabled. I changed it, which yielded a ~2% improvement on wikimedium1m. This is far from the ~8% that the nightly benchmarks report, but it could be that the differences in the dataset explain it. I'll keep watching this benchmark over the next days.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit a17e92006f087a0601d9329bf9b9c946ca72478b in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a17e920 ]

          LUCENE-7489: Wrap only once in case GCD compression is used.

          Show
          jira-bot ASF subversion and git services added a comment - Commit a17e92006f087a0601d9329bf9b9c946ca72478b in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a17e920 ] LUCENE-7489 : Wrap only once in case GCD compression is used.
          Hide
          jpountz Adrien Grand added a comment -
          Show
          jpountz Adrien Grand added a comment - It looks like it helped sorting by title ( http://people.apache.org/~mikemccand/lucenebench/TermTitleSort.html ) but not by date ( http://people.apache.org/~mikemccand/lucenebench/TermDTSort.html ). I'll look into it.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 927fd51d64a6e72843018786daea855847416487 in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=927fd51 ]

          LUCENE-7489: Better sparsity support for Lucene70DocValuesFormat.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 927fd51d64a6e72843018786daea855847416487 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=927fd51 ] LUCENE-7489 : Better sparsity support for Lucene70DocValuesFormat.
          Hide
          jpountz Adrien Grand added a comment -

          I just ran the reproduction line again now that LUCENE-7495 fixed and the test passed. Things should be good now.

          Show
          jpountz Adrien Grand added a comment - I just ran the reproduction line again now that LUCENE-7495 fixed and the test passed. Things should be good now.
          Hide
          jpountz Adrien Grand added a comment -

          Phew, the bug was not in the new format but in nested sorting: LUCENE-7495.

          Show
          jpountz Adrien Grand added a comment - Phew, the bug was not in the new format but in nested sorting: LUCENE-7495 .
          Hide
          jpountz Adrien Grand added a comment -

          Thanks for reporting this seed, I'll look into this bug.

          It looks like it uses the same compression techniques for the values as the 6.x codec, but then for "which docIDs have a value" it has three different approaches, for the very sparse, mostly dense, and 100% dense cases.

          This is correct.

          Show
          jpountz Adrien Grand added a comment - Thanks for reporting this seed, I'll look into this bug. It looks like it uses the same compression techniques for the values as the 6.x codec, but then for "which docIDs have a value" it has three different approaches, for the very sparse, mostly dense, and 100% dense cases. This is correct.
          Hide
          mikemccand Michael McCandless added a comment -

          +1, this patch looks wonderful!

          It looks like it uses the same compression techniques for the values as the 6.x codec, but then for "which docIDs have a value" it has three different approaches, for the very sparse, mostly dense, and 100% dense cases.

          I hit this test failure, but doesn't repro on trunk (though it could still be a pre-existing issue, if e.g. this patch shifted seeds):

             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestBlockJoinSorting -Dtests.method=testNestedSorting -Dtests.seed=A0B8F022A1A8B661 -Dtests.locale=en-CA -Dtests.timezone=Etc/GMT+4 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
             [junit4] FAILURE 0.20s | TestBlockJoinSorting.testNestedSorting <<<
             [junit4]    > Throwable #1: org.junit.ComparisonFailure: expected:<[e]> but was:<[f]>
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([A0B8F022A1A8B661:A8511D63E101BB0F]:0)
             [junit4]    > 	at org.apache.lucene.search.join.TestBlockJoinSorting.testNestedSorting(TestBlockJoinSorting.java:233)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
             [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {field1=FST50, __type=Lucene50(blocksize=128), filter_1=Lucene50(blocksize=128), field2=Lucene50(blocksize=128)}, docValues:{field2=DocValuesFormat(name=Asserting)}, maxPointsInLeafNode=972, maxMBSortInHeap=5.645435808865713, sim=RandomSimilarity(queryNorm=false): {}, locale=en-CA, timezone=Etc/GMT+4
             [junit4]   2> NOTE: Linux 4.4.0-38-generic amd64/Oracle Corporation 1.8.0_101 (64-bit)/cpus=8,threads=1,free=420118024,total=514850816
             [junit4]   2> NOTE: All tests run in this JVM: [TestBlockJoinSorting]
             [junit4] Completed [1/1 (1!)] in 0.37s, 1 test, 1 failure <<< FAILURES!
          
          Show
          mikemccand Michael McCandless added a comment - +1, this patch looks wonderful! It looks like it uses the same compression techniques for the values as the 6.x codec, but then for "which docIDs have a value" it has three different approaches, for the very sparse, mostly dense, and 100% dense cases. I hit this test failure, but doesn't repro on trunk (though it could still be a pre-existing issue, if e.g. this patch shifted seeds): [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestBlockJoinSorting -Dtests.method=testNestedSorting -Dtests.seed=A0B8F022A1A8B661 -Dtests.locale=en-CA -Dtests.timezone=Etc/GMT+4 -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] FAILURE 0.20s | TestBlockJoinSorting.testNestedSorting <<< [junit4] > Throwable #1: org.junit.ComparisonFailure: expected:<[e]> but was:<[f]> [junit4] > at __randomizedtesting.SeedInfo.seed([A0B8F022A1A8B661:A8511D63E101BB0F]:0) [junit4] > at org.apache.lucene.search.join.TestBlockJoinSorting.testNestedSorting(TestBlockJoinSorting.java:233) [junit4] > at java.lang.Thread.run(Thread.java:745) [junit4] 2> NOTE: test params are: codec=Asserting(Lucene70): {field1=FST50, __type=Lucene50(blocksize=128), filter_1=Lucene50(blocksize=128), field2=Lucene50(blocksize=128)}, docValues:{field2=DocValuesFormat(name=Asserting)}, maxPointsInLeafNode=972, maxMBSortInHeap=5.645435808865713, sim=RandomSimilarity(queryNorm=false): {}, locale=en-CA, timezone=Etc/GMT+4 [junit4] 2> NOTE: Linux 4.4.0-38-generic amd64/Oracle Corporation 1.8.0_101 (64-bit)/cpus=8,threads=1,free=420118024,total=514850816 [junit4] 2> NOTE: All tests run in this JVM: [TestBlockJoinSorting] [junit4] Completed [1/1 (1!)] in 0.37s, 1 test, 1 failure <<< FAILURES!
          Hide
          jpountz Adrien Grand added a comment -

          Here is a new patch, I think it's ready.

          Compared to the current Lucene54DocValuesFormat, the only missing feature is table compression for the multi-valued case. I did not implement it because I was thinking that maybe we want to spend our complexity budget on other types of compression now that we have an iterator API. But if we decide otherwise, there is nothing that prevents us from adding it back in the future.

          Another difference is that nothing uses the old packedints APIs anymore (PackedInts.Reader, MonotonicBlockPackedReader, etc.), the reverse terms index is off heap, and the terms dictionary implementation always uses the compressed impl rather than trying to figure out whether storing fixed-length terms would be more efficient.

          Something that is still left to do is to see whether we can improve things by making the packedints API return iterators when only sequential access is needed. But that belongs to a different issue IMO, this change is huge already.

          Show
          jpountz Adrien Grand added a comment - Here is a new patch, I think it's ready. Compared to the current Lucene54DocValuesFormat, the only missing feature is table compression for the multi-valued case. I did not implement it because I was thinking that maybe we want to spend our complexity budget on other types of compression now that we have an iterator API. But if we decide otherwise, there is nothing that prevents us from adding it back in the future. Another difference is that nothing uses the old packedints APIs anymore (PackedInts.Reader, MonotonicBlockPackedReader, etc.), the reverse terms index is off heap, and the terms dictionary implementation always uses the compressed impl rather than trying to figure out whether storing fixed-length terms would be more efficient. Something that is still left to do is to see whether we can improve things by making the packedints API return iterators when only sequential access is needed. But that belongs to a different issue IMO, this change is huge already.
          Hide
          jpountz Adrien Grand added a comment -

          Here is a prototype that passes tests. It uses the same sparse DISI as norms in order to be able to only store actual values. Other than that it is mostly the same as the old format, it doesn't leverage the fact that we have an iterator in order to do RLE for instance (this should be explored in a different issue I think). I still need to review it a bit more carefully and work on the format docs.

          Show
          jpountz Adrien Grand added a comment - Here is a prototype that passes tests. It uses the same sparse DISI as norms in order to be able to only store actual values. Other than that it is mostly the same as the old format, it doesn't leverage the fact that we have an iterator in order to do RLE for instance (this should be explored in a different issue I think). I still need to review it a bit more carefully and work on the format docs.

            People

            • Assignee:
              jpountz Adrien Grand
              Reporter:
              jpountz Adrien Grand
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development