Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Our filters use bit sets a lot to store document IDs. However, it is likely that most of them are sparse hence easily compressible. Having efficient compressed sets would allow for caching more data.

      1. LUCENE-5081.patch
        47 kB
        Adrien Grand
      2. LUCENE-5081.patch
        32 kB
        Adrien Grand

        Activity

        Hide
        Adrien Grand added a comment -

        Here is an implementation of a compressed doc ID set. Although it is immutable, only supports sequential access and requires doc IDs to be provided in order at building time, it supports fast iteration (nextDoc), skipping (advance), union and intersection. The detailed format is a bit complex (see javadocs), but the rough idea is to store large sequences of null bytes as a VInt and sequences of non-null bytes as-is. This implementation can compress data as soon as it can find more than 2 consecutive null bytes. Moreover, even incompressible sets only require a few bytes more than FixedBitSet.

        I ran a quick benchmark to measure the size (as reported by RamUsageEstimator) depending on the percentage of bits set on a bit set containing 2<sup>23</sup> elements (FixedBitSet requires 1MB) as well as the time required to iterate over all document IDs compared to FixedBitSet:

        Percentage of bits set Size Iteration time/FixedBitSet iteration time
        0.001% 360 bytes 0.01
        0.01% 2.8KB 0.10
        0.1% 23.8 KB 0.38
        1% 187.7 KB 0.80
        10% 864 KB 1.3
        50% 1 MB 2.5
        90% 1 MB 2.3
        100% 1 MB 1.7

        Even in the worst case, memory usage exceeds the memory usage of FixedBitSet by a few bytes, and iteration is 2.5 times slower.

        The patch includes the set implementation but it is not used anywhere yet. I was thinking about using it automatically instead of FixedBitSet in CachingWrapperFilter but it looks like ToParentBlockJoinQuery expects to get a FixedBitSet from the cache.

        Show
        Adrien Grand added a comment - Here is an implementation of a compressed doc ID set. Although it is immutable, only supports sequential access and requires doc IDs to be provided in order at building time, it supports fast iteration (nextDoc), skipping (advance), union and intersection. The detailed format is a bit complex (see javadocs), but the rough idea is to store large sequences of null bytes as a VInt and sequences of non-null bytes as-is. This implementation can compress data as soon as it can find more than 2 consecutive null bytes. Moreover, even incompressible sets only require a few bytes more than FixedBitSet. I ran a quick benchmark to measure the size (as reported by RamUsageEstimator) depending on the percentage of bits set on a bit set containing 2<sup>23</sup> elements (FixedBitSet requires 1MB) as well as the time required to iterate over all document IDs compared to FixedBitSet: Percentage of bits set Size Iteration time/FixedBitSet iteration time 0.001% 360 bytes 0.01 0.01% 2.8KB 0.10 0.1% 23.8 KB 0.38 1% 187.7 KB 0.80 10% 864 KB 1.3 50% 1 MB 2.5 90% 1 MB 2.3 100% 1 MB 1.7 Even in the worst case, memory usage exceeds the memory usage of FixedBitSet by a few bytes, and iteration is 2.5 times slower. The patch includes the set implementation but it is not used anywhere yet. I was thinking about using it automatically instead of FixedBitSet in CachingWrapperFilter but it looks like ToParentBlockJoinQuery expects to get a FixedBitSet from the cache.
        Hide
        Robert Muir added a comment -

        Thanks for opening this Adrien!

        The patch includes the set implementation but it is not used anywhere yet. I was thinking about using it automatically instead of FixedBitSet in CachingWrapperFilter but it looks like ToParentBlockJoinQuery expects to get a FixedBitSet from the cache.

        Maybe we should anticipate more implementations in the future? I know there are a lot of algorithms and "Compressed" is pretty general.

        As far as ToParentBlockJoinQuery, i dont think it should make these assumptions... that seems wrong for it to do that.

        Show
        Robert Muir added a comment - Thanks for opening this Adrien! The patch includes the set implementation but it is not used anywhere yet. I was thinking about using it automatically instead of FixedBitSet in CachingWrapperFilter but it looks like ToParentBlockJoinQuery expects to get a FixedBitSet from the cache. Maybe we should anticipate more implementations in the future? I know there are a lot of algorithms and "Compressed" is pretty general. As far as ToParentBlockJoinQuery, i dont think it should make these assumptions... that seems wrong for it to do that.
        Hide
        Uwe Schindler added a comment -

        I dont think it should use CachingWrapperFilter as a "workaround" and assuming its a FixedBitSet.

        If it needs a FixedBitSet, it can use the filter, get its DocIdSetIterator and pass it to FixedBitSet to create a copy. The use case in BlockJoinQuery has nothing to do with filters, it needs a special implementation to seek backwards through the result set. So it has to clone the bitset by itsself by explicitely creating a FixedBitSet internally (if it detects that the DocIdSet returned by the filter is not a FixedBitSet).

        Show
        Uwe Schindler added a comment - I dont think it should use CachingWrapperFilter as a "workaround" and assuming its a FixedBitSet. If it needs a FixedBitSet, it can use the filter, get its DocIdSetIterator and pass it to FixedBitSet to create a copy. The use case in BlockJoinQuery has nothing to do with filters, it needs a special implementation to seek backwards through the result set. So it has to clone the bitset by itsself by explicitely creating a FixedBitSet internally (if it detects that the DocIdSet returned by the filter is not a FixedBitSet).
        Hide
        Michael McCandless added a comment -

        Unfortunately ToParentBlockJoinQuery uses .nextSetBit and .prevSetBit from FixedBitSet, which is why it currently requires a FixedBitSet.

        I guess we could relax it and just have it do the obvious dumb and slow approach when the incoming bitset isn't a FixedBitSet?

        LUCENE-1969 is another effort to add compressed bit sets to Lucene ...

        Show
        Michael McCandless added a comment - Unfortunately ToParentBlockJoinQuery uses .nextSetBit and .prevSetBit from FixedBitSet, which is why it currently requires a FixedBitSet. I guess we could relax it and just have it do the obvious dumb and slow approach when the incoming bitset isn't a FixedBitSet? LUCENE-1969 is another effort to add compressed bit sets to Lucene ...
        Hide
        Robert Muir added a comment -

        yeah couldnt it use instanceof+fall back? how much slower is the obvious dumb and slow approach?

        Show
        Robert Muir added a comment - yeah couldnt it use instanceof+fall back? how much slower is the obvious dumb and slow approach?
        Hide
        Michael McCandless added a comment -

        Let's open a separate issue for how to fix BJQ; it shouldn't hold up this issue. EG for BJQ's tests, can't they override CWF.docIdSetToCache to return a FixedBitSet? Then we can decouple the two issues?

        Show
        Michael McCandless added a comment - Let's open a separate issue for how to fix BJQ; it shouldn't hold up this issue. EG for BJQ's tests, can't they override CWF.docIdSetToCache to return a FixedBitSet? Then we can decouple the two issues?
        Hide
        Uwe Schindler added a comment -

        EG for BJQ's tests, can't they override CWF.docIdSetToCache to return a FixedBitSet? Then we can decouple the two issues?

        Good idea, this patch should just fix the test for now.

        One thing: The Bitset does not implement public Bits bits() so it's not random access. This means that a cached CompressedDocIdSet cannot be applied down-low. Is it possible to make it random access at all? The default method returned by the superclass DocIdSet returns null, so FilteredQuery would fall back to the iterator.

        Show
        Uwe Schindler added a comment - EG for BJQ's tests, can't they override CWF.docIdSetToCache to return a FixedBitSet? Then we can decouple the two issues? Good idea, this patch should just fix the test for now. One thing: The Bitset does not implement public Bits bits() so it's not random access. This means that a cached CompressedDocIdSet cannot be applied down-low. Is it possible to make it random access at all? The default method returned by the superclass DocIdSet returns null, so FilteredQuery would fall back to the iterator.
        Hide
        Paul Elschot added a comment - - edited

        I'm (slowly) working on an implemention of Elias Fano compression, basically as described in in sections 3 and 4 of this article:

        Sebastiano Vigna, "Quasi Succinct Indices", June 19, 2012, http://arxiv.org/pdf/1206.4300
        The article is quite interesting, not in the least because it compares MG4J directly to Lucene 3.6.

        The implementation I am working on also does backward iteration, but it has no index yet, so it is still somewhat slow for advance()'ing to distant targets. Since there is interest in compression, I'll open an issue for this soon.

        For backward iteration I think it would be good to extend DocIdSetIterator with backwards iterating method(s), possibly in a subclass, and use an implementation of that in the block joins.

        Show
        Paul Elschot added a comment - - edited I'm (slowly) working on an implemention of Elias Fano compression, basically as described in in sections 3 and 4 of this article: Sebastiano Vigna, "Quasi Succinct Indices", June 19, 2012, http://arxiv.org/pdf/1206.4300 The article is quite interesting, not in the least because it compares MG4J directly to Lucene 3.6. The implementation I am working on also does backward iteration, but it has no index yet, so it is still somewhat slow for advance()'ing to distant targets. Since there is interest in compression, I'll open an issue for this soon. For backward iteration I think it would be good to extend DocIdSetIterator with backwards iterating method(s), possibly in a subclass, and use an implementation of that in the block joins.
        Hide
        Adrien Grand added a comment -

        Maybe we should anticipate more implementations in the future?

        Good point. I'll rename to WAH8DocIdSet (word-aligned hybrid compression on words of 8 bits).

        LUCENE-1969 is another effort to add compressed bit sets to Lucene ...

        Thanks for the pointer, I'll look into it.

        Is it possible to make it random access at all?

        Unfortunately it is not possible. Although it is easy to save space when having to either support random access or iterate in order, requiring both makes compression much harder especially if the bit set is not very sparse.

        I'm (slowly) working on an implemention of Elias Fano compression, basically as described in in sections 3 and 4 of this article. [...] I'll open an issue for this soon.

        This looks very interesting! I'm looking forward to seeing how this DocIdSet would behave!

        Show
        Adrien Grand added a comment - Maybe we should anticipate more implementations in the future? Good point. I'll rename to WAH8DocIdSet (word-aligned hybrid compression on words of 8 bits). LUCENE-1969 is another effort to add compressed bit sets to Lucene ... Thanks for the pointer, I'll look into it. Is it possible to make it random access at all? Unfortunately it is not possible. Although it is easy to save space when having to either support random access or iterate in order, requiring both makes compression much harder especially if the bit set is not very sparse. I'm (slowly) working on an implemention of Elias Fano compression, basically as described in in sections 3 and 4 of this article. [...] I'll open an issue for this soon. This looks very interesting! I'm looking forward to seeing how this DocIdSet would behave!
        Hide
        Paul Elschot added a comment -

        The issue for EliasFanoDocIdSet is LUCENE-5084

        Show
        Paul Elschot added a comment - The issue for EliasFanoDocIdSet is LUCENE-5084
        Hide
        Adrien Grand added a comment -

        The use case in BlockJoinQuery has nothing to do with filters, it needs a special implementation to seek backwards

        For backward iteration I think it would be good to extend DocIdSetIterator with backwards iterating method(s), possibly in a subclass, and use an implementation of that in the block joins.

        I opened LUCENE-5092.

        Show
        Adrien Grand added a comment - The use case in BlockJoinQuery has nothing to do with filters, it needs a special implementation to seek backwards For backward iteration I think it would be good to extend DocIdSetIterator with backwards iterating method(s), possibly in a subclass, and use an implementation of that in the block joins. I opened LUCENE-5092 .
        Hide
        Adrien Grand added a comment -

        New patch:

        • renamed implementation to WAH8DocIdSet
        • added an index in order to be able to advance() in logarithmic time, this works pretty much like the old terms index impl by storing the position and doc ID encoded at every n-th sequence and then using binary search to find somewhere before the target and close to it,
        • even with the index, WAH8DocIdSet is never larger than FixedBitSet by more than 2% (even when the index interval is 8, which is the lowest accepted value in the current impl),
        • factored some code out of BitVector and OpenBitSetIterator into BitUtil.

        I haven't wired this set implementation anywhere yet but I think always being less than 2% smaller than FixedBitSet and being able to advance in logarithmic time are nice properties so I'm pretty sure some people will be interested in using it for their caches. I'm waiting for the other implementations to get in/improve (eg. when EliasFanoDocIdSet will have an index) to write more detailed benchmarks to compare speed and memory efficiency of the impls we have for our caches (Elias-Fano, WAH8, FixedBitSet so far, maybe something based on PFOR-delta soon too).

        Please let me know if you would like to review this patch. Otherwise I will commit it soon.

        Show
        Adrien Grand added a comment - New patch: renamed implementation to WAH8DocIdSet added an index in order to be able to advance() in logarithmic time, this works pretty much like the old terms index impl by storing the position and doc ID encoded at every n-th sequence and then using binary search to find somewhere before the target and close to it, even with the index, WAH8DocIdSet is never larger than FixedBitSet by more than 2% (even when the index interval is 8, which is the lowest accepted value in the current impl), factored some code out of BitVector and OpenBitSetIterator into BitUtil. I haven't wired this set implementation anywhere yet but I think always being less than 2% smaller than FixedBitSet and being able to advance in logarithmic time are nice properties so I'm pretty sure some people will be interested in using it for their caches. I'm waiting for the other implementations to get in/improve (eg. when EliasFanoDocIdSet will have an index) to write more detailed benchmarks to compare speed and memory efficiency of the impls we have for our caches (Elias-Fano, WAH8, FixedBitSet so far, maybe something based on PFOR-delta soon too). Please let me know if you would like to review this patch. Otherwise I will commit it soon.
        Hide
        Robert Muir added a comment -

        I looked thru it: thanks Adrien! +1

        Perhaps it would be good to add a few extreme tests (like all bits empty/all set).
        Should we implement hashcode/equals? The other impls (fixed/open/etc) have this.
        In general we should maybe open a followup issue to give these docidset classes a base test superclass test.
        TestOpenBitset and TestFixedBitset are almost complete duplicates of each other for example. Any stuff that
        can test via docidset/iterator apis should probably do so, and other impl stuff like intersect() can stay
        in each test (thats fine, at least we improve things).

        Show
        Robert Muir added a comment - I looked thru it: thanks Adrien! +1 Perhaps it would be good to add a few extreme tests (like all bits empty/all set). Should we implement hashcode/equals? The other impls (fixed/open/etc) have this. In general we should maybe open a followup issue to give these docidset classes a base test superclass test. TestOpenBitset and TestFixedBitset are almost complete duplicates of each other for example. Any stuff that can test via docidset/iterator apis should probably do so, and other impl stuff like intersect() can stay in each test (thats fine, at least we improve things).
        Hide
        Adrien Grand added a comment -

        Should we implement hashcode/equals? The other impls (fixed/open/etc) have this.

        If we do this, I think the contract for equals and hashCode should be the same as java.util.List, ie. two DocIdSet are equal if their content is the same (no matter what the implementation is).

        In general we should maybe open a followup issue to give these docidset classes a base test superclass test.
        TestOpenBitset and TestFixedBitset are almost complete duplicates of each other for example. Any stuff that
        can test via docidset/iterator apis should probably do so, and other impl stuff like intersect() can stay
        in each test (thats fine, at least we improve things).

        Good point, I will do that.

        Show
        Adrien Grand added a comment - Should we implement hashcode/equals? The other impls (fixed/open/etc) have this. If we do this, I think the contract for equals and hashCode should be the same as java.util.List, ie. two DocIdSet are equal if their content is the same (no matter what the implementation is). In general we should maybe open a followup issue to give these docidset classes a base test superclass test. TestOpenBitset and TestFixedBitset are almost complete duplicates of each other for example. Any stuff that can test via docidset/iterator apis should probably do so, and other impl stuff like intersect() can stay in each test (thats fine, at least we improve things). Good point, I will do that.
        Hide
        Robert Muir added a comment -

        If we do this, I think the contract for equals and hashCode should be the same as java.util.List, ie. two DocIdSet are equal if their content is the same (no matter what the implementation is).

        I am not sure how these are used today. I was just asking in case there is something relying upon this. If they are just added as a "bonus" and not really needed by our codebase, then I am not sure we should worry... but its up to you.

        Show
        Robert Muir added a comment - If we do this, I think the contract for equals and hashCode should be the same as java.util.List, ie. two DocIdSet are equal if their content is the same (no matter what the implementation is). I am not sure how these are used today. I was just asking in case there is something relying upon this. If they are just added as a "bonus" and not really needed by our codebase, then I am not sure we should worry... but its up to you.
        Hide
        ASF subversion and git services added a comment -

        Commit 1501925 from Adrien Grand
        [ https://svn.apache.org/r1501925 ]

        LUCENE-5081: WAH8DocIdSet.

        Show
        ASF subversion and git services added a comment - Commit 1501925 from Adrien Grand [ https://svn.apache.org/r1501925 ] LUCENE-5081 : WAH8DocIdSet.
        Hide
        ASF subversion and git services added a comment -

        Commit 1501928 from Adrien Grand
        [ https://svn.apache.org/r1501928 ]

        LUCENE-5081: WAH8DocIdSet (merged from r1501925).

        Show
        ASF subversion and git services added a comment - Commit 1501928 from Adrien Grand [ https://svn.apache.org/r1501928 ] LUCENE-5081 : WAH8DocIdSet (merged from r1501925).
        Hide
        Adrien Grand added a comment -

        In general we should maybe open a followup issue to give these docidset classes a base test superclass test.

        LUCENE-5100

        Show
        Adrien Grand added a comment - In general we should maybe open a followup issue to give these docidset classes a base test superclass test. LUCENE-5100
        Hide
        Uwe Schindler added a comment -

        I am not sure how these are used today. I was just asking in case there is something relying upon this. If they are just added as a "bonus" and not really needed by our codebase, then I am not sure we should worry... but its up to you.

        +1, I have no use case at the moment to compare DocIdSets. Especially not cross-implementation.

        Show
        Uwe Schindler added a comment - I am not sure how these are used today. I was just asking in case there is something relying upon this. If they are just added as a "bonus" and not really needed by our codebase, then I am not sure we should worry... but its up to you. +1, I have no use case at the moment to compare DocIdSets. Especially not cross-implementation.
        Hide
        Adrien Grand added a comment -

        4.5 release -> bulk close

        Show
        Adrien Grand added a comment - 4.5 release -> bulk close

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            1 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development