Lucene - Core: LUCENE-7253

Make sparse doc values and segments merging more efficient

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 5.5, 6.0
    • Fix Version/s: 7.0
    • Component/s: None
    • Labels:
    • Lucene Fields:
      New

      Description

      Doc Values were recently optimized to store sparse data efficiently. Unfortunately, there is still a big problem with Doc Values merges for sparse fields. Consider an index of 1 billion documents: it makes no difference whether all documents have a value for a field or only a single document does – segment merge time is the same in both cases. In most cases this is not a problem, but there are several scenarios in which one can expect to have many fields with sparse doc values.

      Let me describe an example. During performance tests of a system with a large number of sparse fields, I realized that Doc Values merges were a bottleneck. I had hundreds of different numeric fields, and each document contained only a small subset of them – on average 5-7 different numeric values per document. As you can see, the data in these fields was very sparse. It turned out that the ingestion process was CPU-bound, and most of the CPU time was spent in DocValues-related methods (SingletonSortedNumericDocValues#setDocument, DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, DocValuesConsumer$4$1#setNext, ...) – mostly during segment merging.

      Adrien Grand suggested reducing the number of sparse fields and replacing them with a smaller number of denser fields. This helped a lot but complicated field naming.

      I am not very familiar with the Doc Values source code, but I have a small suggestion for how to improve Doc Values merges for sparse fields. I noticed that Doc Values producers and consumers use Iterators. Take numeric Doc Values as an example: would it be possible to replace the Iterator that "travels" through all documents with an Iterator over the collection of non-empty values only? Of course this would require storing an object (instead of a bare numeric) that contains both the value and the document ID. Such an iterator could significantly improve merge time for sparse Doc Values fields. IMHO this won't cause big overhead for dense structures, but it can be a game changer for sparse structures.
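      As a rough illustration of the idea (this is not Lucene code; the class names SparseNumericIterator and DocValue below are hypothetical), such an iterator would walk parallel arrays of doc IDs and values and skip missing documents entirely:

          import java.util.Iterator;
          import java.util.NoSuchElementException;

          // A (docID, value) pair for a document that actually has a value.
          final class DocValue {
            final int docID;
            final long value;
            DocValue(int docID, long value) { this.docID = docID; this.value = value; }
          }

          // Iterates only over documents that have a value, instead of all of [0, maxDoc).
          final class SparseNumericIterator implements Iterator<DocValue> {
            private final int[] docIDs;   // sorted doc IDs that have a value
            private final long[] values;  // parallel array of values
            private int upto;

            SparseNumericIterator(int[] docIDs, long[] values) {
              this.docIDs = docIDs;
              this.values = values;
            }

            @Override
            public boolean hasNext() {
              return upto < docIDs.length;
            }

            @Override
            public DocValue next() {
              if (!hasNext()) {
                throw new NoSuchElementException();
              }
              DocValue next = new DocValue(docIDs[upto], values[upto]);
              upto++;
              return next;
            }
          }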

      This is what happens in NumericDocValuesWriter on flush:

    // The returned Iterator (NumericIterator) visits every docID in [0, maxDoc),
    // yielding null for documents that have no value for this field.
    dvConsumer.addNumericField(fieldInfo,
                                     new Iterable<Number>() {
                                       @Override
                                       public Iterator<Number> iterator() {
                                         return new NumericIterator(maxDoc, values, docsWithField);
                                       }
                                     });
      

      Before that, during addValue, this loop is executed to fill holes:

          // Fill in any holes:
          for (int i = (int)pending.size(); i < docID; ++i) {
            pending.add(MISSING);
          }
      

      It turns out that the variable called pending is used only internally in NumericDocValuesWriter. I know pending is a PackedLongValues, and it wouldn't be good to replace it with a different class (some kind of list) because that might hurt DV performance for dense fields. I hope someone can suggest interesting solutions for this problem.
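      Purely as a sketch of the alternative direction (the class name is made up, and plain arrays stand in for packed structures), a writer-side buffer could record only the documents that actually have a value, so no holes ever need to be filled:

          import java.util.Arrays;

          // Hypothetical sketch: buffer only (docID, value) pairs instead of filling holes.
          final class SparseNumericBuffer {
            private int[] docs = new int[16];
            private long[] values = new long[16];
            private int size;

            void addValue(int docID, long value) {
              if (size > 0 && docID <= docs[size - 1]) {
                throw new IllegalArgumentException("docIDs must be added in increasing order");
              }
              if (size == docs.length) {
                docs = Arrays.copyOf(docs, size * 2);
                values = Arrays.copyOf(values, size * 2);
              }
              docs[size] = docID;
              values[size] = value;
              size++;             // missing documents are simply absent – no MISSING placeholders
            }

            int count() {
              return size;        // number of documents that actually have a value
            }
          }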

      It would be great if a discussion about sparse Doc Values merge performance could start here.

        Issue Links

          Activity

          Robert Muir added a comment -

          The correct solution is to have more of a next/advance-style API geared at forward iteration rather than one that mimics an array. Then nulls can be handled in typical ways in various situations (e.g. RLE). It should be possible, especially since scoring is in order.

          But otherwise, I don't think we should do something messy to optimize sparse cases at all. Clean solutions that are simpler and faster are good, though.
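          To illustrate what such a next/advance-style, forward-only doc values API could look like (the interface and names below are hypothetical, modeled loosely on DocIdSetIterator):

          // A merge over such an API touches only documents that have values, e.g.
          //   for (int doc = in.nextDoc(); doc != ForwardNumericValues.NO_MORE_DOCS; doc = in.nextDoc()) {
          //     // write in.longValue() for the remapped doc
          //   }
          interface ForwardNumericValues {
            int NO_MORE_DOCS = Integer.MAX_VALUE;

            /** Advances to the next document that has a value, or returns NO_MORE_DOCS. */
            int nextDoc();

            /** Advances to the first document at or beyond target that has a value, or NO_MORE_DOCS. */
            int advance(int target);

            /** Returns the value for the document the iterator is currently positioned on. */
            long longValue();
          }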

          Pawel Rog added a comment -

          Thank you for the response, Robert Muir. I understand that it may be a little complex to introduce a fix for sparse cases, but why do you think this would be "messy"? I think we can find a lot of cases where sparse data makes sense. IMHO some people deal with this problem even today by adding more hardware for indexing data. It is always possible for the user to reduce sparsity, but it often means a little "mess" in the user's data structures. I know one could say "nobody uses hundreds of columns in a DB because it is bad practice", but in my opinion it is still possible to find many use cases. An alternative (making the data denser) is to populate more documents (one document per value + label), but this leads to a large number of documents in the index, which is also not good in my opinion.

          Robert Muir added a comment -

          The problem is that catering to sparse/abuse cases quickly becomes the "norm" and the real use cases fall by the wayside.

          See LUCENE-7254, where range query performance across all of Lucene is neutered in this way, and nobody cares (they are only concerned about their abuse cases).

          So I think it's important to keep any additional sparse optimizations out of Lucene: they are pure evil.

          David Smiley added a comment -

          So I think it's important to keep any additional sparse optimizations out of Lucene: they are pure evil.

          LOL, gimme a break! Yet you actually seem serious. Sparse fields have always been a feature of Lucene, not an abuse of Lucene. Rob, I can understand if you simply don't want to help out on this issue – no reason needed. But it seems you're -1 to seeing this improved, which I find a bit shocking. Am I right? Granted, a specific patch has not been proposed, but if one were, and if non-sparse docValues held steady in benchmarks, would you still be against this in principle? Maybe one of the benchmarks that Adrien Grand did for reducing the size of sparse docValues could be used to assess the indexing speed (which it probably does already).

          See LUCENE-7254, where range query performance across all of Lucene is neutered in this way, and nobody cares (they are only concerned about their abuse cases).

          FWIW, I was reading along and found most of what you said compelling. I'm surprised you ended it the way you did instead of improving PointInSetQuery. You needn't give up; it's just a temporary setback.

          Robert Muir added a comment -

          I am 100% serious.

          David Smiley added a comment -

          Rob, in your first comment you indicated favoring an RLE-oriented API. Yet you also view sparse docValues as an abuse case you would -1. For the benefit of the contributor (Pawel), or anyone else who might want to tackle this (I'd imagine Adrien might), would you -1 any RLE-oriented API contribution on the principle that it caters to what you feel is an abuse case? This would help either encourage them or save them from wasting time in vain.

          Robert Muir added a comment -

          I already stated my position clearly. I don't think we should optimize for abuse cases.

          This means:

          if (abuseCase) {
            ...
          }
          

          But if you changed the docvalues API to be next/advance-like, then sparse cases would be handled "naturally" and automatically. For example, the postings list bulk compression code could be used here as a start; it is like the mathematical inverse of that.

          Pawel Rog added a comment -

          Robert Muir, does this mean you are not -1 on making the improvement? You are just -1 on making an infinite list of conditions (if this, then if, then if, ...) only to handle corner cases. Am I right?

          Robert Muir added a comment -

          I am against optimizing for the sparse case. I don't understand why this is difficult.

          Robert Muir added a comment -

          If postings list compression is used for example, then see here: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene50/ForUtil.java#L190-L194

          This is really easy to understand: it optimizes for runs of the same value. If that value is 500 or zero, it does not care. This was added to the postings list code because it helps real use cases.

          If similar compression is used, sparse cases will work better without being explicitly optimized for. This is what I am asking for: that we optimize for real use cases only.
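          For illustration only (this is a sketch of the general idea, not the actual ForUtil code), that kind of block encoding checks whether all values in a block are equal and, if so, writes the block as a single value plus a marker:

          // Sketch of run-aware block encoding; the class and interface names are made up.
          final class RunAwareBlockWriter {

            // Minimal output abstraction so the sketch is self-contained.
            interface Output {
              void writeByte(byte b) throws java.io.IOException;
              void writeVLong(long v) throws java.io.IOException;
            }

            static void writeBlock(long[] block, Output out) throws java.io.IOException {
              boolean allEqual = true;
              for (int i = 1; i < block.length; i++) {
                if (block[i] != block[0]) {
                  allEqual = false;
                  break;
                }
              }
              if (allEqual) {
                out.writeByte((byte) 0);   // marker: the whole block is a run of one value
                out.writeVLong(block[0]);  // the repeated value; 0 and 500 cost the same
              } else {
                out.writeByte((byte) 1);   // marker: plain block
                for (long v : block) {     // a real encoder would bit-pack at the block's required width
                  out.writeVLong(v);
                }
              }
            }
          }

          With an encoding like this, a sparse field whose "holes" show up as long runs of the missing-value placeholder compresses and decodes cheaply without any sparse-specific branch in the code.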

          Yonik Seeley added a comment -

          I've seen tons of sparse use cases over the years; they are both real and important.
          RLE (in the API) may or may not be the best approach... we don't yet know the cost that would have for cases without significant runs.
          If we have a patch for the sparse case, it can be evaluated on its merits (benefits vs. complexity and performance impact on other cases).

          Robert Muir added a comment -

          I stand by what I said. The current "sparse" optimizations make the index smaller, but make access slower (log N). We shouldn't do things like this. We should make things faster.

          As long as the API decompresses one integer at a time and looks like an array, trying to adjust for some sparse case will only be ugly and hackish and not work well.

          David Smiley added a comment -

          The current "sparse" optimizations make the index smaller, but make access slower (log N). We shouldn't do things like this. We should make things faster.

          This issue is not about access time or index size – indeed, access-time speed is something sacred to uphold here. LUCENE-6863, which you refer to, was a delicate balancing act.

          I'm trying to understand your point of view better... which is hard because I'm being told something that I think is preposterous. Maybe I'm misunderstanding what you even mean by the term "abuse case". Let's try to communicate with words we hopefully both understand...

          Do you believe that it's very rare to populate fields sparsely, even those flagged as DocValues? Without much thought, I think probably half the search apps I know have at least one docValues field that isn't fully dense. Yonik is basically saying the same. It isn't rare; I think it's dubious to claim doing this is an abuse if it's popular to do it. So I don't think you mean that.

          Maybe you simply mean that the DocValues API itself shouldn't/doesn't cater to sparsity, even though sparsity is allowed and you understand sparsity is popular and useful for some use cases nonetheless? After all, DocValues.docsWithValue is part of our API, along with other methods that return -1 when there's no value – it's supported. I agree with this; do you agree with that statement too? Pawel proposes a change to an internal class that, assuming benchmarks bear it out, will have an upside for a supported capability of DocValues. If it has no technical downsides (these are hypotheticals to be proven out first), isn't vetoing now premature?

          This means: if (abusecase)

          What is the abusecase condition? A condition that is true if dense and false if sparse? NumericDocValuesWriter already has such conditions – see the hole-filling loop, which is a condition that must be evaluated for every sparse hole. Also see the FixedBitSet docsWithField, which only exists due to the sparsity notion. Are these now set in stone, not to be changed because you say so? But what of it anyway – what's wrong? I know that, if you had the time/inclination or could somehow tell other contributors what to do, it would be solved in another way (RLE), but why stop someone willing to donate their time to solve it a different way? It would not prevent an RLE API from appearing later, assuming RLE turns out to be better, right?

          Shai Erera added a comment -

          To add to the sparsity discussion: when I did the numeric DV updates, I already wrote (somewhere) that if we could cater to sparse DV fields better, it might also improve the numeric DV updates case. Today, when you update a numeric DV field, we rewrite the entire DV for that field in the "stacked" DV. This works well if you perform many updates before you flush/commit, but if you only update the value of one document, that's costly. If we could write just that one update to a stack, we could collapse the stacks at read time.

          Of course, that collapsing might slow searches down, so the whole idea of writing just the updated values needs to be benchmarked before we actually do it, so I'm not proposing that here. Just wanted to give another (potential) use case for sparse DV fields.
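          As a rough sketch of what read-time collapsing of such update stacks could look like (the types and names below are illustrative only, not Lucene APIs):

          // Hypothetical reader that collapses per-field update "stacks" on the fly:
          // the newest stack holding a value for the doc wins, otherwise fall back to the base values.
          final class StackedNumericReader {
            interface NumericSource {
              boolean hasValue(int docID);
              long get(int docID);
            }

            private final NumericSource base;
            private final java.util.List<NumericSource> updateStacks;  // oldest first

            StackedNumericReader(NumericSource base, java.util.List<NumericSource> updateStacks) {
              this.base = base;
              this.updateStacks = updateStacks;
            }

            long get(int docID) {
              for (int i = updateStacks.size() - 1; i >= 0; i--) {     // newest stack first
                NumericSource stack = updateStacks.get(i);
                if (stack.hasValue(docID)) {
                  return stack.get(docID);
                }
              }
              return base.get(docID);
            }
          }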

          And FWIW, I do agree with Yonik Seeley and David Smiley about sparse DV not being an abuse case, as I'm seeing them very often too. That's of course unless you mean something else by abuse case...

          Robert Muir added a comment -

          I never said sparse was rare. Sparse is unfortunately common because of bad design decisions and trappy features, offered especially by Solr and Elasticsearch.

          I'm not going to argue with you guys here, because your argument is pathetic. Read some actual literature on column store databases, see how these situations are handled. That is all I want, for things to be done "normal", so we can build off of what has been done before. After you have educated yourselves, you will look less silly.

          Shai Erera added a comment -

          Read some actual literature on column store databases, see how these situations are handled.

          Would be great if you can recommend some particular references.

          I'm not going to argue with you guys here, because your argument is pathetic ... After you have educated yourselves, you will look less silly.

          I don't get the patronizing tone, really.

          ------

          What if the numeric DV consumer encoded the data differently based on the cardinality of the field? Dense fields would be encoded as today, and low-cardinality ones would encode two arrays of docs and values (oversimplifying, I know). We could then decide what 'dense' means (50%/10%/100 docs) based on benchmark results.
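          A minimal sketch of that writer-side decision (the threshold, class name, and on-disk layout below are made up purely for illustration):

          // Illustrative only: choose an encoding based on how many documents actually have a value.
          final class DensityAwareNumericWriter {
            static final double DENSE_THRESHOLD = 0.50;    // made-up cutoff; to be tuned by benchmarks

            // docsWithValue is assumed sorted ascending; values is the parallel array.
            static void writeField(int maxDoc, int[] docsWithValue, long[] values,
                                   java.io.DataOutput out) throws java.io.IOException {
              double density = (double) docsWithValue.length / maxDoc;
              if (density >= DENSE_THRESHOLD) {
                out.writeByte(0);                          // dense layout: one slot per document
                int upto = 0;
                for (int doc = 0; doc < maxDoc; doc++) {
                  boolean hasValue = upto < docsWithValue.length && docsWithValue[upto] == doc;
                  out.writeLong(hasValue ? values[upto++] : 0L);   // 0L stands in for MISSING
                }
              } else {
                out.writeByte(1);                          // sparse layout: parallel doc/value arrays
                out.writeInt(docsWithValue.length);
                for (int doc : docsWithValue) {
                  out.writeInt(doc);
                }
                for (int i = 0; i < docsWithValue.length; i++) {
                  out.writeLong(values[i]);
                }
              }
            }
          }

          The reader would branch on the same marker byte; whether the sparse layout hurts read time is exactly what benchmarks would have to show.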

          It's hard to overrule an idea without (a) an implementation that we can refer to and (b) proof that it does help in some cases while not making other cases worse.

          Pawel Rog: maybe you should start playing with the idea, upload some patches, perform some benchmarks etc. Then we'll have more data to discuss and decide if this is worth pursuing or not. What do you think?

          Michael McCandless added a comment -

          I think Rob's point is that we should not build sparse DV impls underneath our random-access array-like API that we expose for doc values today: that just leads to poor performance. It's an impedance mismatch of the API to the implementation, a leaky abstraction.

          Rather, we need an iterator-like API (like postings), and then sparse can be both compact in storage and fast at search time.

          Shai Erera added a comment -

          I thought so, but that still needs to be benchmarked. Maybe Pawel Rog has an idea for an implementation that will keep both efficient? Maybe read time isn't affected, but merge time is? Maybe if we choose to sparsely encode only fields that are very sparse, read time isn't affected? As a first step that can work. The point is we shouldn't shoot down an idea before we have code/results to back the shooting.

          And I agree that if we had an iterator-like API it would make a stronger case for sparse DV. Maybe, though, the two need not be coupled, and one can be done before the other.

          Otis Gospodnetic added a comment -

          My take on this:

          1. Sparse fields are indeed not an abuse case.
          2. My understanding of what Robert is saying is that he agrees with 1), but that the current implementation is not geared for 1), and if the existing DV code were just modified slightly to improve performance, it would not be the right implementation.
          3. Robert didn't actually mention -1 explicitly until David brought it up, although we all know that Robert could always throw in his -1 at the end, after a contributor has already spent hours or days making changes, just to have them rejected... (but this is a general Lucene project problem that, I think, nobody has actually tried solving directly because it'd be painful).
          4. Robert actually proposed "The correct solution is to have more of a next/advance-style API geared at forward iteration rather than one that mimics an array. Then nulls can be handled in typical ways in various situations (e.g. RLE). It should be possible, especially since scoring is in order.", so my take is that if a contributor did exactly what Robert wants, this could potentially be accepted.
          5. I assume the "correct approach" involves more changes, more coding, and more time. I assume it would be useful to make a simpler, maybe not acceptable change first in order to get some numbers and see if it's even worth investing time in the "correct approach".
          6. If the numbers look good, then, because of a potential -1 from Robert, whoever takes on this challenge would have to be very clear, before any additional dev work, about what Robert wants, what he would -1, and what he would let in.
          Otis Gospodnetic added a comment -

          LUCENE-7457 and many others actually took care of the issue reported here.


            People

            • Assignee: Michael McCandless
            • Reporter: Pawel Rog
            • Votes: 1
            • Watchers: 6

            Dates

            • Created:
            • Updated:
            • Resolved: