Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, Trunk
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      DocIdSet in Elias-Fano encoding

      Attachments

      1. LUCENE-5084.patch (45 kB), Paul Elschot
      2. LUCENE-5084.patch (46 kB), Paul Elschot

          Activity

          Adrien Grand added a comment -

          4.5 release -> bulk close

          Adrien Grand added a comment -

          Committed, thanks Paul!

          ASF subversion and git services added a comment -

          Commit 1501578 from Adrien Grand
          [ https://svn.apache.org/r1501578 ]

          LUCENE-5084: EliasFanoDocIdSet (merged from r1501576 and r1501577).

          ASF subversion and git services added a comment -

          Commit 1501577 from Adrien Grand
          [ https://svn.apache.org/r1501577 ]

          LUCENE-5084: Move Elias-Fano encoding from Lucene 4.4 to 4.5...

          ASF subversion and git services added a comment -

          Commit 1501576 from Adrien Grand
          [ https://svn.apache.org/r1501576 ]

          LUCENE-5084: EliasFanoDocIdSet.

          Paul Elschot added a comment -

          Broadword bit selection is at LUCENE-5098 .

          Paul Elschot added a comment -

          For now the indexed version is in subclasses of the encoder and decoder.
          In these I only expect some changes in method signatures from private to protected, so I don't mind either way.

          David Smiley added a comment -

          Otherwise I would like to commit your patch, I think it is already a very good improvement and we can work on the index on another issue

          +1 definitely. One step at a time. Thanks Paul & Adrien.

          Adrien Grand added a comment -

          for now I'll concentrate on adding an index.

          Have you started working on it? Otherwise I would like to commit your patch, I think it is already a very good improvement and we can work on the index on another issue, what do you think?

          Paul Elschot added a comment -

          At the moment I don't know how to do better random testing than the code that produced the results posted above, so for now I'll concentrate on adding an index.

          Dawid Weiss added a comment -

          Randomized test cases with uniform data distributions are not likely to test exceptional situations in the high bits such as long high bit words with all zeros or all ones.

          The exceptional situations you can test separately. I am constantly surprised at how many exceptional conditions in pretty much regular code one can overlook. Like Adrien said – it doesn't hurt to be in there and if it catches something, it's even better.

          Adrien Grand added a comment -

          I tend to like testing different scenarios every time tests are run (and tests, especially lucene-core tests, are run very often by the CI servers); this has helped find many unsuspected bugs in the past. For example, the random variable can be used to compute slight variations of the exceptional situations that are interesting to test.
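          That idea can be sketched as follows. The class and method names below are made up for illustration and are not part of the Lucene test framework; the point is only how a test's Random can generate slight variations of a handcrafted boundary case:

```java
import java.util.Random;

public class RandomizedVariation {
    /**
     * Build a monotonically increasing doc ID run whose start and gap are
     * drawn from the test's Random, so each CI run exercises a slightly
     * different neighbor of a handcrafted boundary case.
     */
    public static long[] perturbedRun(Random random, int length) {
        long base = random.nextInt(100);  // random starting doc ID
        int gap = 1 + random.nextInt(3);  // small random gap between doc IDs
        long[] docs = new long[length];
        for (int i = 0; i < length; i++) {
            docs[i] = base + (long) i * gap;
        }
        return docs;
    }

    public static void main(String[] args) {
        for (long doc : perturbedRun(new Random(42), 5)) {
            System.out.println(doc);
        }
    }
}
```

          A fixed exceptional case (all-zero or all-one high bit words) would be tested separately; the randomized variant only perturbs its neighborhood.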

          Paul Elschot added a comment -

          Would you have any specific purpose for randomized testing?

          Randomized test cases with uniform data distributions are not likely to test exceptional situations in the high bits such as long high bit words with all zeros or all ones.

          Adrien Grand added a comment -

          The patch looks ready. I think we should just add a bit more randomization to the tests before committing.

          Paul Elschot added a comment -

          As announced on 4 July 2013.

          Paul Elschot added a comment -

          Ok, I'll post a new patch that

          • is in package oal.util.packed,
          • uses numberOfLeadingZeros instead of a loop,
          • has a correct getCost, and
          • has a static method sufficientlySmallerThanBitSet, deliberately imprecise...
          Adrien Grand added a comment -

          We could, but in which class? For example, in CachingWrapperFilter it might be good to save memory, so it could be there.

          This new doc id set might be used for other use-cases in the future, so maybe we should have this method on the EliasFanoDocIdSet class?

          Also, would the expected size be the only thing to check for? When decoding speed is also important, other DocIdSets might be preferable.

          Sure, this is something we need to give users control over. For filter caches, it is already possible to override CachingWrapperFilter.docIdSetToCache to decide whether speed or memory usage is more important. The decision can even depend on the cardinality of the set to cache or on its implementation. So we just need to provide users with good defaults, I think?

          I haven't run performance benchmarks on this set implementation yet, but if it is faster than the DocIdSet iterators of our default postings format, then it is not going to be a bottleneck and I think it makes sense to use the implementation that saves the most memory. If it is slower, or not enough faster, then maybe other implementations such as kamikaze's p-for-delta-based doc ID sets (LUCENE-2750) would make more sense as a default.

          Can PackedInts.getMutable also be used in a codec?

          The PackedInts API can return readers that read directly from an IndexInput, if that is the question, but if we want to be able to store the high and low bits contiguously then they are not going to be a good fit.

          I considered a decoder that returns ints but that would require a lot more casting in the decoder.

          OK. I just wanted to have your opinion on this, we can keep everything as a long.

          I'll open another issue for broadword bit selection later.

          Sounds good! I think backwards iteration and efficient skipping should be done in separate issues as well, even without them this new doc ID set would be a very nice addition.

          Paul Elschot added a comment - edited

          maybe we should have a static utility method to check that so that consumers of this API can opt for a FixedBitSet if their doc set is going to be dense?

          We could, but in which class? For example, in CachingWrapperFilter it might be good to save memory, so it could be there.
          Also, would the expected size be the only thing to check for? When decoding speed is also important, other DocIdSets might be preferable.
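          Such a check might look like the sketch below; the method name is hypothetical (in the spirit of the deliberately imprecise sufficientlySmallerThanBitSet method mentioned in this issue), and the threshold comes from the numValues >= upperBound/3 remark in the javadoc:

```java
public class DocIdSetChoice {
    /**
     * Deliberately imprecise size heuristic: per the javadoc discussed in
     * this issue, a FixedBitSet becomes more storage efficient once
     * numValues >= upperBound / 3, so prefer Elias-Fano only below that.
     */
    public static boolean preferEliasFano(long numValues, long upperBound) {
        return numValues < upperBound / 3;
    }

    public static void main(String[] args) {
        // 1M docs out of 10M: sparse, Elias-Fano is smaller.
        System.out.println(preferEliasFano(1_000_000, 10_000_000)); // true
        // 5M docs out of 10M: dense, a FixedBitSet is smaller.
        System.out.println(preferEliasFano(5_000_000, 10_000_000)); // false
    }
}
```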

          the ceil of the log in base 2 is computed through a loop

          numberOfLeadingZeros is indeed better than a loop. We need the Long variant here.
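          As an illustration, the loop can be replaced by something like the following (the method name ceilLog2 is made up for this sketch, not taken from the patch):

```java
public class CeilLog2 {
    /**
     * Ceiling of log base 2 for x >= 1, without a loop:
     * for x > 1, ceil(log2(x)) == 64 - numberOfLeadingZeros(x - 1).
     */
    public static int ceilLog2(long x) {
        if (x <= 0) {
            throw new IllegalArgumentException("x must be positive: " + x);
        }
        return (x == 1) ? 0 : 64 - Long.numberOfLeadingZeros(x - 1);
    }

    public static void main(String[] args) {
        System.out.println(ceilLog2(10));   // prints 4
        System.out.println(ceilLog2(1024)); // prints 10
    }
}
```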

          use PackedInts.getMutable to store the low-order bits instead of a raw long[]

          Can PackedInts.getMutable also be used in a codec? Longs are needed for the high bits, see below, and the high and low bits can be conveniently stored next to each other in an index.

          shouldn't the iterator's getCost method return efDecoder.numValues instead of efEncoder.numValues?

          Yes.

          Maybe we could just support the encoding of monotonically increasing sequences of ints to make things simpler?

          I considered a decoder that returns ints but that would require a lot more casting in the decoder.
          Decoding the unary encoded high bits is best done on longs, so mixing longs and ints in the encoder is not really an option.
          We could pass the actual NO_MORE_VALUES to be used as an argument to the decoder, would that help?

          As to why decoding the unary encoded high bits is best done on longs, see Algorithm 2 in "Broadword Implementation of Rank/Select Queries", Sebastiano Vigna, January 30, 2012, http://vigna.di.unimi.it/ftp/papers/Broadword.pdf .
          I also have an initial java implementation of that, but it is not used here yet, there are only a few comments in the code here that it might be used. I'll open another issue for broadword bit selection later.
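          For context, "bit selection" here means finding the position of the r-th set bit in a word. A naive version with Long.numberOfTrailingZeros is sketched below; Algorithm 2 of the Vigna paper computes the same answer without the loop, using broadword arithmetic (the class name is illustrative):

```java
public class NaiveSelect {
    /**
     * Position (0..63) of the r-th (0-based) set bit of x, or -1 if x has
     * fewer than r + 1 set bits. The broadword version replaces this loop
     * with constant-time word operations.
     */
    public static int select(long x, int r) {
        for (int i = 0; i < r; i++) {
            x &= x - 1; // clear the lowest set bit
        }
        return (x == 0) ? -1 : Long.numberOfTrailingZeros(x);
    }

    public static void main(String[] args) {
        System.out.println(select(0b10110L, 0)); // prints 1
        System.out.println(select(0b10110L, 2)); // prints 4
    }
}
```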

          Adrien Grand added a comment - edited

          I had a deeper look at the patch, here are some thoughts about it:

          • the patch looks great, really, I especially like all the integrity checks!
          • after having read more of it, I think moving it to oal.util.packed makes perfect sense
          • the documentation mentions the fact that a FixedBitSet will be more storage efficient if numValues >= (upperBound/3), maybe we should have a static utility method to check that so that consumers of this API can opt for a FixedBitSet if their doc set is going to be dense?
          • in EliasFanoEncoder constructor, the ceil of the log in base 2 is computed through a loop, can we use Integer.numberOfLeadingZeros instead (as explained in http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#numberOfLeadingZeros(int))
          • I think it would make sense to use PackedInts.getMutable to store the low-order bits instead of a raw long[] (this abstraction will hide the complexity of the bit packing or even use a more appropriate structure, eg. if the number of bits required is 8, it will use a byte[])
          • I'm not sure but shouldn't the iterator's getCost method return efDecoder.numValues instead of efEncoder.numValues?

          Even though storing longs is more general, it forces us to use a different value to encode NO_MORE_DOCS which adds a little complexity to the DocIdSet implementation. Maybe we could just support the encoding of monotonically increasing sequences of ints to make things simpler? (just thinking out loud).

          Paul Elschot added a comment - edited

          ... what is the CPU tradeoff for all these compressions and how does query speed compare across all of them?

          We don't really know.
          One conclusion from the Vigna paper is that a tuned implementation of an Elias-Fano decoder is faster than a tuned PForDelta implementation for highly selective phrase queries.
          I would guess that that is because Elias-Fano uses random access to the low bits where PForDelta only uses bulk decompression of the low bits, which compensates for Elias-Fano decoding its high bits somewhat slower than PForDelta can decode its exceptions.

          One reason to use Elias-Fano for a DocIdSet here is that its high bits are encoded in unary coding which can easily be decoded in two directions, and that makes it useful for block joins. The other reason is that its compression is quite good, which makes it a nice candidate for in memory filters.

          Paul Elschot added a comment - edited

          Have you considered creating a PostingsFormat with this? I was thinking of something like DirectPostingsFormat, but instead of using an array of ints for storing the docIds, using an Elias-Fano compressed bit stream.

          The Vigna paper is all about posting formats.
          Because of this I first implemented an encoder and a decoder in a long format, and then used these here for a DocIdSet that works on int.

          For a postings format, the encoder would need an additional constructor from index data. That might involve merging the currently separate long arrays for high bits and low bits into a single array.

          Michael McCandless added a comment -

          I think LUCENE-5052 will explore that (using a bitset to represent DOCS_ONLY postings)...

          Emmanuel Espina added a comment -

          Have you considered creating a PostingsFormat with this? I was thinking of something like DirectPostingsFormat, but instead of using an array of ints for storing the docIds, using an Elias-Fano compressed bit stream.
          Facebook is using an in-memory index for its graph search and is compressing the posting lists with Elias-Fano (https://www.facebook.com/download/138915572976390/UnicornVLDB-final.pdf)

          Otis Gospodnetic added a comment - edited

          Sorry if this sounds uninformed, but what is the CPU tradeoff for all these compressions and how does query speed compare across all of them? What is the logic behind never looking at that when making memory comparisons? I skimmed the paper Paul references and saw the main comparisons were really more about query speed and less about memory savings. Thanks.

          Paul Elschot added a comment -

          The number of index entries may actually be 2N/256 instead of N/256.
          That would be about 4% of the size in the case above.

          I am about to implement such an index, but my tempo is going to be slow. So in case anyone can do that faster...

          Adrien Grand added a comment -

          What's up with the kamikaze bitset, also an option?

          I'll take some time to evaluate all the options we have for compressed doc ID sets, and the kamikaze implementation is one of them.

          Adrien Grand added a comment -

          Please note that there is no index yet. The index will be relatively small, for example for the 10% case above with 641 kB I expect an index size of about 12 kB, adding about 2% to the size. This index will consist of N/256 entries of a single number with max value 3N, i.e. ceil(log2(3N)) bits per index entry.

          This sounds good!

          The code posted here is still young. So even though it has some test cases, I'd like to be reassured that in the code that produced the posted results, there is at least a basic test that verifies that all input docs are available after compression. Is that the case?

          Yes, I checked that all doc ID sets have the same content.

          Paul Elschot added a comment -

          Uwe,

          The EliasFanoDocIdSet is not a bitset, but it is definitely cool, please have a look at the Vigna article I mentioned at LUCENE-5081.

          The upper bound above on the bits per encoded number is valid for every sorted permutation of non-negative whole numbers bounded by upperBound, and this size bound is no more than half a bit larger than the smallest possible representation.

          Paul Elschot added a comment -

          That was fast
          The results posted above seem to be in line with the formula below, and that is quite nice to see.

          The upper bound for the size in bits per encoded number in an EliasFanoDocIdSet is:

          2 + ceil(log2(upperBound/numValues))

          and a few constant size objects also have to be added in there.

          Please note that there is no index yet. The index will be relatively small, for example for the 10% case above with 641 kB I expect an index size of about 12 kB, adding about 2% to the size.
          This index will consist of N/256 entries of a single number with max value 3N, i.e. ceil(log2(3N)) bits per index entry.
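          As a sanity check of these formulas, the 10% case from the size table (numValues = 1M, upperBound = 10M) can be recomputed; the class below is an illustrative sketch, not code from the patch:

```java
public class EfSizeEstimate {
    public static int ceilLog2(long x) {
        return (x <= 1) ? 0 : 64 - Long.numberOfLeadingZeros(x - 1);
    }

    /** Upper bound in bits: numValues * (2 + ceil(log2(upperBound/numValues))). */
    public static long setBitsUpperBound(long numValues, long upperBound) {
        return numValues * (2 + ceilLog2(upperBound / numValues));
    }

    /** Index size in bits: ceil(N/256) entries of ceil(log2(3N)) bits each. */
    public static long indexBits(long numValues) {
        long entries = (numValues + 255) / 256;
        return entries * ceilLog2(3 * numValues);
    }

    public static void main(String[] args) {
        long numValues = 1_000_000, upperBound = 10_000_000;
        // 2 + ceil(log2(10)) = 6 bits/value -> 6,000,000 bits, about 732 kB,
        // an upper bound consistent with the measured 641 kB.
        System.out.println(setBitsUpperBound(numValues, upperBound) + " bits");
        // 3907 entries * 22 bits = 85,954 bits, roughly 10.5 kB,
        // close to the "about 12 kB" estimate above.
        System.out.println(indexBits(numValues) + " bits of index");
    }
}
```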

          The code posted here is still young. So even though it has some test cases, I'd like to be reassured that in the code that produced the posted results, there is at least a basic test that verifies that all input docs are available after compression. Is that the case?

          Uwe Schindler added a comment -

          Cool bitset

          What's up with the kamikaze bitset, also an option?

          Adrien Grand added a comment -

          I have not dug much through the code but I tested it against various randomly-generated sets with numDocs=10M, and the compression looks great:

          Load     FixedBitSet   WAH8DocIdSet    EliasFanoDocIdSet   PForDeltaDocIdSet
                                 (LUCENE-5081)   (this issue)        (kamikaze, LUCENE-2750)
          0.001%   1.2 MB        424 bytes       344 bytes           9 KB
          0.01%    1.2 MB        3.4 KB          2 KB                10.6 KB
          0.1%     1.2 MB        28.4 KB         14.7 KB             25.1 KB
          1%       1.2 MB        223.2 KB        104.6 KB            132.3 KB
          10%      1.2 MB        1 MB            641 KB              860.5 KB
          30%      1.2 MB        1.2 MB          1.3 MB              1.9 MB
          50%      1.2 MB        1.2 MB          1.8 MB              2.7 MB
          70%      1.2 MB        1.2 MB          2 MB                3 MB
          90%      1.2 MB        1.2 MB          2.3 MB              3.1 MB

          I especially like the fact that it saves almost half the memory even for pretty large sets that contain 1/10th of all doc IDs.

          I have used package o.a.l.util.eliasfano, this could be changed to o.a.l.util.packed for example.

          Indeed maybe we don't need a dedicated package for this DocIdSet. oal.util.packed would be fine I think.

          There is a NOCOMMIT for a static longHex method that dumps a long in fixed width hex format, is there a better place for this method?

          I think it is OK to leave it here.

          I'll try to dig more thoroughly into the patch in the next few days...

          Paul Elschot added a comment -

          For the patch of 30 June 2013:

          This consists of the class EliasFanoDocIdSet and the two classes that it uses, EliasFanoEncoder and EliasFanoDecoder.
          The last two are implemented on long, EliasFanoDocIdSet does the casting to and from int for DocIdSet.

          There are various ways in which the decoding speed could still be improved:

          • Addition of an index on the high bits, see the Vigna paper.
          • Use of broadword bit searching (better done in a separate issue); for now the code uses the Long methods bitCount and numberOfTrailingZeros.

          I have not yet profiled for performance bottlenecks.

          The decoder is not really complete, it has an advanceToIndex method but no backToIndex method yet.

          Nevertheless this is usable now, because the compression works and the linear searches that are done (because of the lack of an index) will access no more than roughly 3N bits, where N is the number of doc ids in the set, while FixedBitSet.nextDoc() can (theoretically) access a number of bits equal to the number of docs in a segment.
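          To illustrate the split that EliasFanoEncoder performs, here is a stripped-down sketch of how each value is divided into a unary-coded high part and a packed low part. The method names are made up for this sketch, and it assumes the number of low bits is floor(log2(upperBound/numValues)); the actual encoder may compute this slightly differently:

```java
public class EfSplitDemo {
    /** Number of low bits per value: floor(log2(upperBound/numValues)), or 0. */
    public static int numLowBits(long upperBound, long numValues) {
        long ratio = upperBound / numValues;
        return (ratio <= 1) ? 0 : 63 - Long.numberOfLeadingZeros(ratio);
    }

    /** High part: encoded in unary in the stream of high bits. */
    public static long highPart(long value, int lowBits) {
        return value >>> lowBits;
    }

    /** Low part: packed in lowBits bits per value. */
    public static long lowPart(long value, int lowBits) {
        return value & ((1L << lowBits) - 1);
    }

    public static void main(String[] args) {
        long[] values = {5, 8, 8, 15, 32}; // monotonically non-decreasing doc IDs
        int lowBits = numLowBits(33, values.length); // floor(log2(33/5)) = 2
        for (long v : values) {
            System.out.println(v + " -> high=" + highPart(v, lowBits)
                    + " low=" + lowPart(v, lowBits));
        }
    }
}
```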

          TestEliasFanoSequence tests EliasFanoEncoder and EliasFanoDecoder.

          TestEliasFanoDocIdSet tests EliasFanoDocIdSet.

          I have used package o.a.l.util.eliasfano, this could be changed to o.a.l.util.packed for example.

          There is a NOCOMMIT for a static longHex method that dumps a long in fixed width hex format, is there a better place for this method?


            People

            • Assignee: Adrien Grand
            • Reporter: Paul Elschot
            • Votes: 2
            • Watchers: 10
