Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5
    • Component/s: core/other
    • Labels:
      None
    • Lucene Fields:
      New
    1. LUCENE-5098.patch
      12 kB
      Paul Elschot
    2. LUCENE-5098.patch
      12 kB
      Paul Elschot

      Activity

      Hide
      Paul Elschot added a comment -

      This has some methods and constants inspired by the article
      "Broadword Implementation of Rank/Select Queries" by Sebastiano Vigna, January 30, 2012,
      currently retrievable from http://vigna.di.unimi.it/ftp/papers/Broadword.pdf .

      I'd expect this to become really useful in the Elias-Fano sequence of LUCENE-5084 after an index is added to that.

      Show
      Paul Elschot added a comment - This has some methods and constants inspired by the article "Broadword Implementation of Rank/Select Queries" by Sebastiano Vigna, January 30, 2012, currently retrievable from http://vigna.di.unimi.it/ftp/papers/Broadword.pdf . I'd expect this to become really useful in the Elias-Fano sequence of LUCENE-5084 after an index is added to that.
      Hide
      Adrien Grand added a comment -

      The patch looks good to me. I'm looking forward to seeing the impact these methods will have on the EliasFanoDocIdSet performance!

      Some minor comments about the patch:

      • Maybe these utility methods could be moved to oal.util.BitUtil where we already have some utility methods to count or find bits?
      • I ran quick benchmarks for select and select9 seemed to quickly outperform selectNaive, even for non-dense longs. Do you have the same experience with selectNaive vs. select9? If yes, maybe selectNaive could be moved to the test case.
      • Since the main goal of rank9 is code documentation, maybe it should be made package-private?
      Show
      Adrien Grand added a comment - The patch looks good to me. I'm looking forward to seeing the impact these methods will have on the EliasFanoDocIdSet performance! Some minor comments about the patch: Maybe these utility methods could be moved to oal.util.BitUtil where we already have some utility methods to count or find bits? I ran quick benchmarks for select and select9 seemed to quickly outperform selectNaive, even for non-dense longs. Do you have the same experience with selectNaive vs. select9? If yes, maybe selectNaive could be moved to the test case. Since the main goal of rank9 is code documentation, maybe it should be made package-private?
      Hide
      Dawid Weiss added a comment -
      TestBroadWord extends TestCase
      

      Could we extend LuceneTestCase for tests, please? I know it doesn't have any impact here but if it's good to have a single superclass for all tests in case we need to tweak/ alter something.

      As for longHex – I always liked the "assembly" version better:

          final static char [] HEX = "0123456789abcdef".toCharArray();
          private static String longHex(long x)
          {
              char [] asHex = new char [16];
              for (int i = 16; --i >= 0; x >>>= 4) {
                  asHex[i] = HEX[(int) x & 0x0F];
              }
              return "0x" + new String(asHex);
          }
      

      but if you strive for simplicity this should do too:

          private static String longHex(long x)
          {
              return String.format(Locale.ENGLISH, "%016x", x);
          }
      
      Show
      Dawid Weiss added a comment - TestBroadWord extends TestCase Could we extend LuceneTestCase for tests, please? I know it doesn't have any impact here but if it's good to have a single superclass for all tests in case we need to tweak/ alter something. As for longHex – I always liked the "assembly" version better: final static char [] HEX = "0123456789abcdef" .toCharArray(); private static String longHex( long x) { char [] asHex = new char [16]; for ( int i = 16; --i >= 0; x >>>= 4) { asHex[i] = HEX[( int ) x & 0x0F]; } return "0x" + new String (asHex); } but if you strive for simplicity this should do too: private static String longHex( long x) { return String .format(Locale.ENGLISH, "%016x" , x); }
      Hide
      Uwe Schindler added a comment -

      Could we extend LuceneTestCase for tests, please?

      I think forbidden checker should catch this

      Show
      Uwe Schindler added a comment - Could we extend LuceneTestCase for tests, please? I think forbidden checker should catch this
      Hide
      Adrien Grand added a comment -

      It does.

      Show
      Adrien Grand added a comment - It does.
      Hide
      Paul Elschot added a comment -

      Maybe these utility methods could be moved to oal.util.BitUtil where we already have some utility methods to count or find bits?

      I kept them separate to have a clear reference to the Vigna broadword article.

      The class javadoc already refers to all methods and constants from there, so that could be easily merged into BitUtil's class javadoc when merging the two classes.
      Also both classes have only static attributes.

      So merging is no problem at all, but I'm still in two minds about this.
      Any opinions either way, i.e. merging class BroadWord here into the existing BitUtil or not?

      I ran quick benchmarks for select and select9 seemed to quickly outperform selectNaive, even for non-dense longs. Do you have the same experience with selectNaive vs. select9? If yes, maybe selectNaive could be moved to the test case.

      In the Quasi Succinct Indices article (EliasFanoDocIdSet, LUCENE-5084), in Section 9 under Reading Unary Codes, Vigna mentions that within the right longword:

      Our experiments show that broadword bit search is extremely effective, unless the number of reads is very small, as in that case computing iteratively the least significant bit becomes competitive. Indeed, when skipping a very small number of position (e.g., less then eight) we simply resort to iterating through the list.

      In the test code there is a comment that select9 is about 7 times faster than selectNaive, 45ns against 295 ns for a very dense case on my old 32 bit home machine. Since the machine is 32 bits, that is no more than a rough indication that the selectNaive here (that iterates) might indeed preferable over the select9 here when selecting before the 8th set bit.
      A safe conclusion is that moving selectNaive to the test cases now would be premature.

      Since the main goal of rank9 is code documentation, maybe it should be made package-private?

      I have not actually benchmarked rank9 it against Long.bitCount, but I think we should do that just to be sure that rank9 is slower, and than it can be made package-private.

      As for longHex – I always liked the "assembly" version better.

      I have used that mostly during debugging, and it ended up also in toString in EliasFanoEncoder iirc.
      How about putting the assembly version in BitUtil?

      Could we extend LuceneTestCase for tests, please?

      I don't really like it, but my next patch here will have that.
      Should LuceneTestCase also be mentioned in the wiki at "How to contribute"?

      Show
      Paul Elschot added a comment - Maybe these utility methods could be moved to oal.util.BitUtil where we already have some utility methods to count or find bits? I kept them separate to have a clear reference to the Vigna broadword article. The class javadoc already refers to all methods and constants from there, so that could be easily merged into BitUtil's class javadoc when merging the two classes. Also both classes have only static attributes. So merging is no problem at all, but I'm still in two minds about this. Any opinions either way, i.e. merging class BroadWord here into the existing BitUtil or not? I ran quick benchmarks for select and select9 seemed to quickly outperform selectNaive, even for non-dense longs. Do you have the same experience with selectNaive vs. select9? If yes, maybe selectNaive could be moved to the test case. In the Quasi Succinct Indices article (EliasFanoDocIdSet, LUCENE-5084 ), in Section 9 under Reading Unary Codes, Vigna mentions that within the right longword: Our experiments show that broadword bit search is extremely effective, unless the number of reads is very small, as in that case computing iteratively the least significant bit becomes competitive. Indeed, when skipping a very small number of position (e.g., less then eight) we simply resort to iterating through the list. In the test code there is a comment that select9 is about 7 times faster than selectNaive, 45ns against 295 ns for a very dense case on my old 32 bit home machine. Since the machine is 32 bits, that is no more than a rough indication that the selectNaive here (that iterates) might indeed preferable over the select9 here when selecting before the 8th set bit. A safe conclusion is that moving selectNaive to the test cases now would be premature. Since the main goal of rank9 is code documentation, maybe it should be made package-private? I have not actually benchmarked rank9 it against Long.bitCount, but I think we should do that just to be sure that rank9 is slower, and than it can be made package-private. As for longHex – I always liked the "assembly" version better. I have used that mostly during debugging, and it ended up also in toString in EliasFanoEncoder iirc. How about putting the assembly version in BitUtil? Could we extend LuceneTestCase for tests, please? I don't really like it, but my next patch here will have that. Should LuceneTestCase also be mentioned in the wiki at "How to contribute"?
      Hide
      Dawid Weiss added a comment -

      I don't really like it, but my next patch here will have that.

      Is there any reason why you don't like it? Voices of criticism are always welcome, especially those constructive ones

      The only change for you should be that it'll require JVM assertions to be enabled and that the method order will be randomized (which you can fix by putting a Seed annotation on the class).

      There are a few test classes that in fact do not extend LuceneTestCase but they do it for a reason (they "test the tester" and LuceneTestCase is not really reentrant).

      Show
      Dawid Weiss added a comment - I don't really like it, but my next patch here will have that. Is there any reason why you don't like it? Voices of criticism are always welcome, especially those constructive ones The only change for you should be that it'll require JVM assertions to be enabled and that the method order will be randomized (which you can fix by putting a Seed annotation on the class). There are a few test classes that in fact do not extend LuceneTestCase but they do it for a reason (they "test the tester" and LuceneTestCase is not really reentrant).
      Hide
      Paul Elschot added a comment -

      I don't really like using LuceneTestCase here because the tests here pass when extending TestCase and they do not use anything that LuceneTestCase provides, at least not in the way that I use them:

       ant -Dtestcase=TestBroadWord test

      .
      Also I prefer to try and avoid having that discussion here.

      Is the forbidden checker a precommit check?

      Show
      Paul Elschot added a comment - I don't really like using LuceneTestCase here because the tests here pass when extending TestCase and they do not use anything that LuceneTestCase provides, at least not in the way that I use them: ant -Dtestcase=TestBroadWord test . Also I prefer to try and avoid having that discussion here. Is the forbidden checker a precommit check?
      Hide
      Uwe Schindler added a comment -

      You can call ant check-forbidden-apis from the lucene or solr root directory. And yes, it's also a precommit check.

      LuceneTestCase does more checks on your test case, so we require every test in Lucene extends this class.

      Show
      Uwe Schindler added a comment - You can call ant check-forbidden-apis from the lucene or solr root directory. And yes, it's also a precommit check. LuceneTestCase does more checks on your test case, so we require every test in Lucene extends this class.
      Hide
      Adrien Grand added a comment -

      A safe conclusion is that moving selectNaive to the test cases now would be premature.

      OK.

      I have not actually benchmarked rank9 it against Long.bitCount, but I think we should do that just to be sure that rank9 is slower, and than it can be made package-private.

      I played a bit with it and rank9 was always between 15% and 20% slower than bitCount no matter what the input was (which is still impressing since bitCount is supposed to be an intrinsic). We used to have a utility method in BitUtil to compute pop counts on longs but we removed it in LUCENE-2221 in favor of Long.bitCount.

      How about putting the assembly version in BitUtil?

      Or ToStringUtils?

      Should LuceneTestCase also be mentioned in the wiki at "How to contribute"?

      We try to keep this page as concise as possible so I added a mention to it in https://wiki.apache.org/lucene-java/DeveloperTips.

      Show
      Adrien Grand added a comment - A safe conclusion is that moving selectNaive to the test cases now would be premature. OK. I have not actually benchmarked rank9 it against Long.bitCount, but I think we should do that just to be sure that rank9 is slower, and than it can be made package-private. I played a bit with it and rank9 was always between 15% and 20% slower than bitCount no matter what the input was (which is still impressing since bitCount is supposed to be an intrinsic). We used to have a utility method in BitUtil to compute pop counts on longs but we removed it in LUCENE-2221 in favor of Long.bitCount. How about putting the assembly version in BitUtil? Or ToStringUtils? Should LuceneTestCase also be mentioned in the wiki at "How to contribute"? We try to keep this page as concise as possible so I added a mention to it in https://wiki.apache.org/lucene-java/DeveloperTips .
      Hide
      Dawid Weiss added a comment - - edited

      I don't really like using LuceneTestCase here because the tests here pass when extending TestCase and they do not use anything that LuceneTestCase provides

      If you prefer not to have that discussion here I will shut up. I just wanted to clarify that LTC does checks on your test that you bypass when you extend from TestCase. Randomization of method order (catches order dependencies that shouldn't be there), ensuring your tests run with assertions enabled, checking if your tests don't spawn extra threads – all these (and more) may not matter to you because you're a seasoned engineer but they do matter in general when contributions come from various sources (or are refactored later by people other than the original author). Randomization isn't the only goal of LTC.

      In short: it's not a remark to you personally, it's a remark to everyone in general that extending LTC should be a priority because it will catch faulty tests faster.

      Show
      Dawid Weiss added a comment - - edited I don't really like using LuceneTestCase here because the tests here pass when extending TestCase and they do not use anything that LuceneTestCase provides If you prefer not to have that discussion here I will shut up. I just wanted to clarify that LTC does checks on your test that you bypass when you extend from TestCase. Randomization of method order (catches order dependencies that shouldn't be there), ensuring your tests run with assertions enabled, checking if your tests don't spawn extra threads – all these (and more) may not matter to you because you're a seasoned engineer but they do matter in general when contributions come from various sources (or are refactored later by people other than the original author). Randomization isn't the only goal of LTC. In short: it's not a remark to you personally, it's a remark to everyone in general that extending LTC should be a priority because it will catch faulty tests faster.
      Hide
      Dawid Weiss added a comment -

      I played a bit with it and rank9 was always between 15% and 20% slower than bitCount no matter what the input was (which is still impressing since bitCount is supposed to be an intrinsic)

      I also toyed with this a bit. Interesting because rank9 is essentially Hacker's delight implementation, but with a multiplication at the end rather than shifts/ additions (which I think originated from the fact that multiplication used to be much slower than additions/shifts back in the day).

      I ran a few caliper bechmarks on a million longs, different distributions (Intel I7, JDK17) just to see what the output will be.

         benchmark distribution     us linear runtime
          HdPopCnd        ZEROS   2333 =
          HdPopCnd         FULL   2334 =
          HdPopCnd       RANDOM   2329 =
          HdPopCnd       ONEBIT   2334 =
             Rank9        ZEROS   1651 =
             Rank9         FULL   1652 =
             Rank9       RANDOM   1651 =
             Rank9       ONEBIT   1651 =
      LongBitCount        ZEROS    411 =
      LongBitCount         FULL    394 =
      LongBitCount       RANDOM    391 =
      LongBitCount       ONEBIT    404 =
       NaivePopCnt        ZEROS    585 =
       NaivePopCnt         FULL  39019 ======
       NaivePopCnt       RANDOM 171365 ==============================
       NaivePopCnt       ONEBIT  28155 ====
      

      The naive loop was:

              int cnt = 0;
              while (x != 0) {
                  if (((x >>>= 1) & 1) != 0L) {
                      cnt++;
                  }
              }
              return cnt;
      

      and you can see that even for all zeros (when in fact there is no counting at all) it's still slower than the intrinsified popcnt. Note full-ones is not the worst case (I believe due to constant branch misprediction in a random input).

      A zoomed-in benchmark without the naive impl.:

         benchmark distribution   us linear runtime
          HdPopCnd        ZEROS 2331 =============================
          HdPopCnd         FULL 2329 =============================
          HdPopCnd       RANDOM 2333 =============================
          HdPopCnd       ONEBIT 2333 =============================
             Rank9        ZEROS 1650 =====================
             Rank9         FULL 1650 =====================
             Rank9       RANDOM 1652 =====================
             Rank9       ONEBIT 1650 =====================
      LongBitCount        ZEROS  400 =====
      LongBitCount         FULL  402 =====
      LongBitCount       RANDOM  401 =====
      LongBitCount       ONEBIT  391 =====
      

      You can see when popcnt isn't intrinsified by running with IBM's J9, for example:

         benchmark distribution   ms linear runtime
      LongBitCount        ZEROS 2.25 =============================
      LongBitCount         FULL 2.22 =============================
      LongBitCount       RANDOM 2.25 =============================
      LongBitCount       ONEBIT 2.25 =============================
          HdPopCnd        ZEROS 2.25 =============================
          HdPopCnd         FULL 2.25 =============================
          HdPopCnd       RANDOM 2.22 =============================
          HdPopCnd       ONEBIT 2.22 =============================
             Rank9        ZEROS 1.62 =====================
             Rank9         FULL 1.62 =====================
             Rank9       RANDOM 1.62 =====================
             Rank9       ONEBIT 1.62 =====================
      

      But I think they'll eventually catch up with modern cpus too so I'd stick with Long.bitCount.

      Show
      Dawid Weiss added a comment - I played a bit with it and rank9 was always between 15% and 20% slower than bitCount no matter what the input was (which is still impressing since bitCount is supposed to be an intrinsic) I also toyed with this a bit. Interesting because rank9 is essentially Hacker's delight implementation, but with a multiplication at the end rather than shifts/ additions (which I think originated from the fact that multiplication used to be much slower than additions/shifts back in the day). I ran a few caliper bechmarks on a million longs, different distributions (Intel I7, JDK17) just to see what the output will be. benchmark distribution us linear runtime HdPopCnd ZEROS 2333 = HdPopCnd FULL 2334 = HdPopCnd RANDOM 2329 = HdPopCnd ONEBIT 2334 = Rank9 ZEROS 1651 = Rank9 FULL 1652 = Rank9 RANDOM 1651 = Rank9 ONEBIT 1651 = LongBitCount ZEROS 411 = LongBitCount FULL 394 = LongBitCount RANDOM 391 = LongBitCount ONEBIT 404 = NaivePopCnt ZEROS 585 = NaivePopCnt FULL 39019 ====== NaivePopCnt RANDOM 171365 ============================== NaivePopCnt ONEBIT 28155 ==== The naive loop was: int cnt = 0; while (x != 0) { if (((x >>>= 1) & 1) != 0L) { cnt++; } } return cnt; and you can see that even for all zeros (when in fact there is no counting at all) it's still slower than the intrinsified popcnt. Note full-ones is not the worst case (I believe due to constant branch misprediction in a random input). A zoomed-in benchmark without the naive impl.: benchmark distribution us linear runtime HdPopCnd ZEROS 2331 ============================= HdPopCnd FULL 2329 ============================= HdPopCnd RANDOM 2333 ============================= HdPopCnd ONEBIT 2333 ============================= Rank9 ZEROS 1650 ===================== Rank9 FULL 1650 ===================== Rank9 RANDOM 1652 ===================== Rank9 ONEBIT 1650 ===================== LongBitCount ZEROS 400 ===== LongBitCount FULL 402 ===== LongBitCount RANDOM 401 ===== LongBitCount ONEBIT 391 ===== You can see when popcnt isn't intrinsified by running with IBM's J9, for example: benchmark distribution ms linear runtime LongBitCount ZEROS 2.25 ============================= LongBitCount FULL 2.22 ============================= LongBitCount RANDOM 2.25 ============================= LongBitCount ONEBIT 2.25 ============================= HdPopCnd ZEROS 2.25 ============================= HdPopCnd FULL 2.25 ============================= HdPopCnd RANDOM 2.22 ============================= HdPopCnd ONEBIT 2.22 ============================= Rank9 ZEROS 1.62 ===================== Rank9 FULL 1.62 ===================== Rank9 RANDOM 1.62 ===================== Rank9 ONEBIT 1.62 ===================== But I think they'll eventually catch up with modern cpus too so I'd stick with Long.bitCount.
      Hide
      Paul Elschot added a comment -

      In the naive implementation above the first shift is before the first masked test. Does this miss the lowest bit? Not that it matters much...

      I'll provide a new patch with:

      • rank9 package private, users of IBM's J9 will know what to do,
      • longHex in ToStringUtils, and
      • extends LuceneTestCase.
      Show
      Paul Elschot added a comment - In the naive implementation above the first shift is before the first masked test. Does this miss the lowest bit? Not that it matters much... I'll provide a new patch with: rank9 package private, users of IBM's J9 will know what to do, longHex in ToStringUtils, and extends LuceneTestCase.
      Hide
      Dawid Weiss added a comment -

      Indeed, it's a regression from an earlier implementation where I used Long.rotateRight. It won't play a difference but well spotted.

      Show
      Dawid Weiss added a comment - Indeed, it's a regression from an earlier implementation where I used Long.rotateRight. It won't play a difference but well spotted.
      Hide
      Dawid Weiss added a comment -

      Looks good to me (there's a likely unused import but it can be fixed later). Very cool, btw.

      Show
      Dawid Weiss added a comment - Looks good to me (there's a likely unused import but it can be fixed later). Very cool, btw.
      Hide
      ASF subversion and git services added a comment -

      Commit 1502690 from Adrien Grand
      [ https://svn.apache.org/r1502690 ]

      LUCENE-5098: Broadword utility methods.

      Show
      ASF subversion and git services added a comment - Commit 1502690 from Adrien Grand [ https://svn.apache.org/r1502690 ] LUCENE-5098 : Broadword utility methods.
      Hide
      ASF subversion and git services added a comment -

      Commit 1502691 from Adrien Grand
      [ https://svn.apache.org/r1502691 ]

      LUCENE-5098: Broadword utility methods (merged from r1502690).

      Show
      ASF subversion and git services added a comment - Commit 1502691 from Adrien Grand [ https://svn.apache.org/r1502691 ] LUCENE-5098 : Broadword utility methods (merged from r1502690).
      Hide
      Adrien Grand added a comment -

      Committed. Thanks Paul and Dawid!

      Show
      Adrien Grand added a comment - Committed. Thanks Paul and Dawid!
      Hide
      Adrien Grand added a comment -

      4.5 release -> bulk close

      Show
      Adrien Grand added a comment - 4.5 release -> bulk close
      Hide
      Paul Elschot added a comment -

      On reviewing the article I found that the method names I chose for algorithms 1 and 2 in the VIgna article were wrong. I used rank9 and select9 because of the titles of sections 3 and 5 in the article, but the method names do less than what is described in these sections.
      The method names should have been bitCount and select.
      I'll add a correction for this at LUCENE-5236, there is no need to reopen this issue.

      Show
      Paul Elschot added a comment - On reviewing the article I found that the method names I chose for algorithms 1 and 2 in the VIgna article were wrong. I used rank9 and select9 because of the titles of sections 3 and 5 in the article, but the method names do less than what is described in these sections. The method names should have been bitCount and select. I'll add a correction for this at LUCENE-5236 , there is no need to reopen this issue.

        People

        • Assignee:
          Adrien Grand
          Reporter:
          Paul Elschot
        • Votes:
          2 Vote for this issue
          Watchers:
          6 Start watching this issue

          Dates

          • Created:
            Updated:
            Resolved:

            Development