LUCENE-4583 (Lucene - Core)

StraightBytesDocValuesField fails if bytes > 32k

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 4.0, 4.1, 6.0
    • Fix Version/s: 4.5, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I didn't observe any limitation on the size of a bytes-based DocValues field value in the docs. It appears that the limit is 32k, although I didn't get any friendly error telling me that was the limit. 32k is kind of small IMO; I suspect this limit is unintended and as such is a bug. The following test fails:

        public void testBigDocValue() throws IOException {
          Directory dir = newDirectory();
          IndexWriter writer = new IndexWriter(dir, writerConfig(false));
      
          Document doc = new Document();
          BytesRef bytes = new BytesRef((4+4)*4097);//4096 works
          bytes.length = bytes.bytes.length;//byte data doesn't matter
          doc.add(new StraightBytesDocValuesField("dvField", bytes));
          writer.addDocument(doc);
          writer.commit();
          writer.close();
      
          DirectoryReader reader = DirectoryReader.open(dir);
          DocValues docValues = MultiDocValues.getDocValues(reader, "dvField");
          //FAILS IF BYTES IS BIG!
          docValues.getSource().getBytes(0, bytes);
      
          reader.close();
          dir.close();
        }
      
      1. LUCENE-4583.patch
        39 kB
        Michael McCandless
      2. LUCENE-4583.patch
        39 kB
        Michael McCandless
      3. LUCENE-4583.patch
        41 kB
        Michael McCandless
      4. LUCENE-4583.patch
        31 kB
        Michael McCandless
      5. LUCENE-4583.patch
        23 kB
        Michael McCandless
      6. LUCENE-4583.patch
        9 kB
        Michael McCandless
      7. LUCENE-4583.patch
        3 kB
        Barakat Barakat
      8. LUCENE-4583.patch
        5 kB
        Barakat Barakat

        Activity

        Yonik Seeley added a comment -

        I'm guessing that limit is due to the implementation with PagedBytes?
        These limits were sensible when applied to indexed values, but obviously not so much to stored values (unless we decide that DocValues are only meant for smallish values and document the limit).

        Robert Muir added a comment -

        The most important thing: if this implementation (or if we decide dv itself) should be limited,
        then it should check this at index-time and throw a useful exception.

        David Smiley added a comment -

        FWIW the app that triggered this has a document requiring ~68k, but there's a long tail down such that most documents only need 8 bytes. As a hack, I could use multiple fields to break out of the 32k limit and concatenate them together (yuck). It'd be great if this 32k limit wasn't there.

        Robert Muir added a comment -

        I'm not concerned if the limit is 10 bytes.

        If it is, then it is what it is.

        It's just important that IW throw an exception at index time when any such limit is exceeded.

        Robert Muir added a comment -

        The only correct bugfix here is adding the missing check.

        Any discussion about extending the limits on the supported lengths belongs in another issue.

        Yonik Seeley added a comment -

        Looking at the issue name and the problem that David ran into, this issue is certainly about more than a missing check during indexing.
        Small hidden limits can still cause stuff to blow up in production - the user may not have thought to test anything above 32K. Small limits need to be documented.

        Like David, I also suspect that the limit was unintended and represents a bug.
        The question is on a practical level, how easy is it to raise the limit, and are there any negative consequences of doing so? If it's not easy (or there are negative consequences), I think it's OK to leave it at 32K and document it as a current limitation. Off the top of my head, I can't really think of use cases that would require more, but perhaps others might?

        Of course we should also fail early if someone tries to add a value above that limit.

        Robert Muir added a comment -

        Again, I really think adding the check and documenting the current limits is what should happen here.

        Just as the length of indexed terms is limited to 32k, it's a bigger issue to try to deal with this (especially in the current DV situation).

        I also think that if you are putting very large values in DV, it's perfectly acceptable to require a custom codec for this kind of situation; the one we provide can be general purpose.

        I don't think we should try to cram in a huge change to the limits masquerading as a bug... the bug is not recognizing the limit at index time (and of course documenting it in the codec or IndexWriter, depending on where it is; currently I think it's the codec).

        Yonik Seeley added a comment -

        I dont think we should try to cram in a huge change to the limits masqueraded as a bug...

        If the limit was not intentional then it was certainly a bug (not just a missing check). Now we have to figure out what to do about it.

        This issue is about deciding what the limit should be (and 32K may be fine, depending on the trade-offs, as I said), and then documenting and enforcing that limit.
        For example, your previous "I'm not concerned if the limit is 10 bytes." would get a -1 from me as "clearly not big enough... let's fix it".

        Robert Muir added a comment -

        Just like changing the length here gets a -1 from me. Pretty simple.

        Yonik Seeley added a comment -

        Just like changing the length here gets a -1 from me. Pretty simple.

        Regardless of how easy or hard it might be, regardless of what use cases are brought up (I was waiting to hear from David at least, since he hit it), regardless of what the trade-offs might be involved with changing an unintended/accidental limit? That's just silly.

        It also doesn't make sense to -1 something "here" vs somewhere else. +1s/-1s are for code changes, regardless of where they are discussed.

        Barakat Barakat added a comment - edited

        The limitation comes from PagedBytes. When PagedBytes is created it is given a number of bits to use per block, and the blockSize is set to (1 << blockBits). From what I've seen, classes that use PagedBytes usually pass in 15 as the blockBits. This leads to the 32768-byte limit.

        The fillSlice function of PagedBytes.Reader will return a slice of bytes that is either inside one block or overlaps two blocks. If you give it a length that is over the block size, it hits an out-of-bounds exception. For the project I am working on, we need more than 32k bytes for our DocValues. We need that much rarely, but we still need that much to keep the search functioning. I fixed this for our project by changing fillSlice to this:

        http://pastebin.com/raw.php?i=TCY8zjAi

        Test unit:
        http://pastebin.com/raw.php?i=Uy29BGGJ

        After placing this in our Solr instance, the search no longer crashes and returns the correct values when a document has a DocValues field of more than 32k bytes. As far as I know there is no limit now. I haven't noticed a performance hit. It shouldn't really affect performance unless you have many of these large DocValues fields. Thank you to David for his help with this.

        Edit: This only works when start == 0. Seeing if I can fix it.
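
        A minimal sketch of the block-size arithmetic described above; the values are taken from this comment and from the test in the description, not from the Lucene source:

            // blockBits = 15 is what PagedBytes users typically pass in, per the comment above
            int blockBits = 15;
            int blockSize = 1 << blockBits;     // 32768 bytes per block
            int valueLength = (4 + 4) * 4097;   // the 32776-byte value from the test in the description
            // fillSlice can return a slice contained in one block or spanning two adjacent
            // blocks, so a single value larger than blockSize cannot be read back in one call
            System.out.println(valueLength > blockSize);  // true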

        Yonik Seeley added a comment -

        For the project I am working on, we need more than 32k bytes for our DocValues. We need that much rarely, but we still need that much to keep the search functioning.

        Thanks! I suspected one generic use case would be "normally small, but hard to put an upper bound on". That's great if the only issue really is PagedBytes.fillSlice()! That definitely shouldn't have any performance impact since the first "if" will always be true in the common case.

        Edit: This only works when start == 0. Seeing if I can fix it.

        A simple loop might be easiest to understand, rather than calculating with DIV and MOD?
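
        A rough illustration of that loop-based approach, not the actual patch; it assumes a hypothetical caller-side helper and relies on the behavior described above, where a single fillSlice call may touch at most two adjacent blocks:

            import org.apache.lucene.util.BytesRef;
            import org.apache.lucene.util.PagedBytes;

            class LargeSliceReader {
              // Hypothetical helper: read `length` bytes starting at `start`, asking
              // fillSlice for at most blockSize bytes at a time so each call never
              // spans more than two adjacent blocks.
              static byte[] readLargeSlice(PagedBytes.Reader reader, long start, int length, int blockSize) {
                byte[] dest = new byte[length];
                BytesRef scratch = new BytesRef();
                int copied = 0;
                while (copied < length) {
                  int chunk = Math.min(blockSize, length - copied);
                  reader.fillSlice(scratch, start + copied, chunk);
                  System.arraycopy(scratch.bytes, scratch.offset, dest, copied, chunk);
                  copied += chunk;
                }
                return dest;
              }
            }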

        Barakat Barakat added a comment -

        I updated the code to work when start isn't zero. The code can still crash if you ask for a length that goes beyond the total size of the paged bytes, but I'm not sure how you guys like to prevent things like that. The code seems to be working fine with our Solr core so far. I am new to posting patches and writing test units in Java so please let me know if there is anything wrong with the code.

        Barakat Barakat added a comment -

        I recently switched from 4.1 to 4.3, and my patch needed to be updated because of the changes to DocValues. The problem was almost fixed for BinaryDocValues, but it just needed one little change. I've attached a patch that removes the BinaryDocValues exception when the length is over BYTE_BLOCK_SIZE (32k), fixes ByteBlockPool#readBytes:348, and changes the TestDocValuesIndexing#testTooLargeBytes test to check for accuracy.

        Robert Muir added a comment -

        I don't think we should just bump the limit like this.

        The patch is not safe: some codecs rely upon this limit (e.g. they use 64k-size PagedBytes and other things).

        But in general anyway, I'm not sure what the real use cases are for storing > 32kb inside a single document value.

        Shai Erera added a comment -

        A user hit this limitation on the user list a couple weeks ago while trying to index a very large number of facets: http://markmail.org/message/dfn4mk3qe7advzcd. So this is one use case, and also it appears there is an exception thrown at indexing time?

        Robert Muir added a comment -

        Over 6,000 facets per doc?

        Yes there is an exception thrown at index time. I added this intentionally and added a test that ensures you get it.
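
        A rough sketch of the kind of index-time rejection being described; this is hypothetical test code (reusing the writerConfig helper from the description), not the actual test that was added:

            public void testTooLargeBinaryValue() throws IOException {
              Directory dir = newDirectory();
              IndexWriter writer = new IndexWriter(dir, writerConfig(false)); // same helper as the test in the description

              Document doc = new Document();
              doc.add(new BinaryDocValuesField("dvField", new BytesRef(new byte[1 << 16]))); // well over the ~32k limit
              try {
                writer.addDocument(doc);
                fail("expected the index-time check to reject the over-limit value");
              } catch (IllegalArgumentException expected) {
                // the index-time exception described above (an IAE, per later comments on this issue)
              }
              writer.close();
              dir.close();
            }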

        Shai Erera added a comment -

        Over 6,000 facets per doc?

        Right. Not sure how realistic his case is (something about proteins, he explains it here: http://markmail.org/message/2wxavktzyxjtijqe), and whether he should enable facet partitions or not, but he agreed these are extreme cases and expects only a few docs like that. At any rate, you asked for a use case. Not saying we should support it, but it is one. If it can be supported without too much pain, then perhaps we should. Otherwise, I think we can live w/ the limitation and leave it for the app to worry about a workaround / write its own Codec.

        Robert Muir added a comment -

        For the faceting use case, we have SortedSetDocValuesField, which can hold 2B facets per-field and you can have.... maybe 2B fields per doc.

        So the limitation here is not docvalues.

        BINARY datatype isn't designed for faceting.
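
        For reference, a minimal sketch of the multi-valued SORTED_SET approach being suggested; the field name and values are made up, and an open IndexWriter named writer is assumed:

            Document doc = new Document();
            // one SortedSetDocValuesField instance per facet value; the field as a whole is
            // multi-valued, so no single value ever needs to be anywhere near 32k
            doc.add(new SortedSetDocValuesField("facets", new BytesRef("category/electronics")));
            doc.add(new SortedSetDocValuesField("facets", new BytesRef("brand/acme")));
            doc.add(new SortedSetDocValuesField("facets", new BytesRef("color/red")));
            writer.addDocument(doc);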

        Michael McCandless added a comment -

        I think we should fix the limit in core, and then existing codecs should enforce their own limits, at indexing time?

        This way users that sometimes need to store > 32 KB binary doc value can do so, with the right DVFormat.

        Shai Erera added a comment -

        For the faceting use case, we have SortedSetDocValuesField

        No, this is for one type of faceting, no hierarchies etc. It's also slower than the BINARY DV method ...

        BINARY datatype isnt designed for faceting.

        Maybe, but that's the best we have for now.

        Robert Muir added a comment -

        No, this is for one type of faceting, no hierarchies etc. It's also slower than the BINARY DV method ...

        It's not inherently slower. I just didn't spend a month inlining vint codes or writing custom codecs like you did for the other faceting method.
        Instead its the simplest thing that can possibly work right now.

        I will call you guys out on this every single time you bring it up.

        I'm -1 to bumping the limit

        David Smiley added a comment -

        I looked over the patch carefully and stepped through the relevant code in a debugging session and I think it's good.

        I don't see why this arbitrary limit was here. The use case in which Barakat hit this is related to storing values per document so that a custom ValueSource I wrote can examine them. It's for spatial multi-value fields, and some businesses (docs) have a ton of locations out there (e.g. McDonalds). FWIW very few docs have such large values, and if it were to become slow then I have ways to more cleverly examine a subset of the returned bytes. I think it's silly to force the app to write a Codec (even if it's trivial) to get out of this arbitrary limit.

        Michael McCandless added a comment -

        I iterated from Barakat's patch: improved the test, and added
        enforcement of the limit in the appropriate DocValuesFormat impls.

        Disk, SimpleText, CheapBastard and (I think ... still need a test
        here) Facet42DVFormat don't have a limit, but Lucene40/42 still do.

        I'm -1 to bumping the limit

        Are you also against just fixing the limit in the core code
        (IndexWriter/BinaryDocValuesWriter) and leaving the limit enforced in
        the existing DVFormats (my patch)?

        I thought that was a good compromise ...

        This way at least users can still build their own / use DVFormats that
        don't have the limit.

        David Smiley added a comment -

        the patch is not safe: Some codecs rely upon this limit (e.g. they use 64k-size pagedbytes and other things).

        Rob, can you please elaborate?

        Robert Muir added a comment -

        Are you also against just fixing the limit in the core code
        (IndexWriter/BinaryDocValuesWriter) and leaving the limit enforced in
        the existing DVFormats (my patch)?

        I thought that was a good compromise ...

        This way at least users can still build their own / use DVFormats that
        don't have the limit.

        I'm worried about a few things:

        • I think the limit is ok, because in my eyes it's the limit of a single term. I feel that anyone arguing for increasing the limit only has abuse cases (not use cases) in mind. I'm worried about making dv more complicated for no good reason.
        • I'm worried about opening up the possibility of bugs and index corruption (e.g. clearly MULTIPLE people on this issue don't understand why you cannot just remove IndexWriter's limit without causing corruption).
        • I'm really worried about the precedent: once these abuse-case-fans have their way and increase this limit, they will next argue that we should do the same for SORTED, maybe SORTED_SET, maybe even inverted terms. They will make arguments that it's the same as binary, just with sorting, and ask why sorting should bring in additional limits. I can easily see this all spinning out of control.
        • I think that most people hitting the limit are abusing docvalues as stored fields, so the limit is providing a really useful thing today actually, and telling them they are doing something wrong.

        The only argument I have for removing the limit is that by expanding BINARY's possible abuse cases (in my opinion, that's pretty much all it's useful for), we might prevent additional complexity from being added elsewhere to DV in the long term.

        Jack Krupansky added a comment -

        abusing docvalues as stored fields

        Great point. I have to admit that I still don't have a 100% handle on the use case(s) for docvalues vs. stored fields, even though I've asked on the list. I mean, sometimes the chatter seems to suggest that dv is the successor to stored values. Hmmm... in that case, I should be able to store the full text of a 24 MB PDF file in a dv. Now, I know that isn't true.

        Maybe we just need to start with some common use cases, based on size: tiny (16 bytes or less), small (256 or 1024 bytes or less), medium (up to 32K), and large (upwards of 1MB, and larger.) It sounds like large implies stored field.

        A related "concern" is dv or stored fields that need a bias towards being in memory and in the heap, vs. a bias towards being "off heap". Maybe the size category is the hint: tiny and small bias towards on-heap, medium and certainly large bias towards off-heap. If people are only going towards DV because they think they get off-heap, then maybe we need to reconsider the model of what DV vs. stored is really all about. But then that leads back to DV somehow morphing out of column-stride fields.

        Michael McCandless added a comment -

        I'm worried about a few things:
        I think the limit is ok, because in my eyes it's the limit of a single term. I feel that anyone arguing for increasing the limit only has abuse cases (not use cases) in mind. I'm worried about making dv more complicated for no good reason.

        I guess I see DV binary as more like a stored field, just stored
        column stride for faster access. Faceting (and I guess spatial)
        encode many things inside one DV binary field.

        I'm worried about opening up the possibility of bugs and index corruption (e.g. clearly MULTIPLE people on this issue don't understand why you cannot just remove IndexWriter's limit without causing corruption).

        I agree this is a concern and we need to take it slow, add good
        test coverage.

        I'm really worried about the precedent: once these abuse-case-fans have their way and increase this limit, they will next argue that we should do the same for SORTED, maybe SORTED_SET, maybe even inverted terms. They will make arguments that it's the same as binary, just with sorting, and ask why sorting should bring in additional limits. I can easily see this all spinning out of control.
        I think that most people hitting the limit are abusing docvalues as stored fields, so the limit is providing a really useful thing today actually, and telling them they are doing something wrong.

        I don't think we should change the limit for sorted/set nor terms: I
        think we should raise the limit ONLY for BINARY, and declare that DV
        BINARY is for these "abuse" cases. So if you really really want
        sorted set with a higher limit then you will have to encode yourself
        into DV BINARY.

        The only argument I have for removing the limit is that by expanding BINARY's possible abuse cases (in my opinion, that's pretty much all it's useful for), we might prevent additional complexity from being added elsewhere to DV in the long term.

        +1

        Michael McCandless added a comment -

        I have to admit that I still don't have a 100% handle on the use case(s) for docvalues vs. stored fields, even though I've asked on the list. I mean, sometimes the chatter seems to suggest that dv is the successor to stored values. Hmmm... in that case, I should be able to store the full text of a 24 MB PDF file in a dv. Now, I know that isn't true.

        The big difference is that DV fields are stored column stride, so you
        can decide on a field-by-field basis whether it will be in RAM or on
        disk etc., and you get faster access if you know you only need to work
        with one or two fields.

        Vs stored fields where all fields for one document are stored
        "together".

        Each has different tradeoffs so it's really up to the app to decide
        which is best... if you know you need 12 fields loaded for each
        document you are presenting on the current page, stored fields is
        probably best.

        But if you need one field to use as a scoring factor (eg maybe you are
        boosting by recency) then column-stride is better.
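
        A small illustration of the two options being contrasted; the field names and values are made up:

            Document doc = new Document();
            // column-stride (doc values): fast per-field access, e.g. for a recency boost at query time
            long lastModifiedMillis = System.currentTimeMillis();
            doc.add(new NumericDocValuesField("lastModified", lastModifiedMillis));
            // stored fields: all of a document's display fields are retrieved together
            doc.add(new StoredField("title", "Some title to display"));
            doc.add(new StoredField("url", "http://example.com/doc"));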

        David Smiley added a comment -

        Quoting Mike:

        I don't think we should change the limit for sorted/set nor terms: I

        think we should raise the limit ONLY for BINARY, and declare that DV
        BINARY is for these "abuse" cases. So if you really really want
        sorted set with a higher limit then you will have to encode yourself
        into DV BINARY.

        +1. DV Binary is generic for applications to use as they see fit. There is no use case to abuse. If this issue passes, I'm not going to then ask for terms > 32k or something silly like that.

        Robert Muir added a comment -

        I can compromise with this.

        However, I don't like the current patch.

        I don't think we should modify ByteBlockPool. Instead, I think BinaryDocValuesWriter should do the following:

        • use PagedBytes to append the bytes (which has append-only writing, via getDataOutput)
        • implement the iterator with PagedBytes.getDataInput (it's just an iterator so this is simple)
        • store the lengths instead as absolute offsets with MonotonicAppendingLongBuffer (this should be more efficient)

        The only thing ByteBlockPool gives is some "automagic" RAM accounting, but this is not as good as it looks anyway.

        • today, it seems to me RAM accounting is broken for this dv type already (the lengths are not considered, or am I missing something!?)
        • since we need to fix that bug anyway (by just adding updateBytesUsed like the other consumers), the magic accounting buys us nothing really anyway.

        Anyway I can help with this, tomorrow.
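
        A very rough sketch of the buffering approach outlined above, assuming the 4.x PagedBytes.getDataOutput and MonotonicAppendingLongBuffer APIs mentioned in this thread; it glosses over the real writer's RAM accounting and flush logic:

            import java.io.IOException;

            import org.apache.lucene.store.DataOutput;
            import org.apache.lucene.util.BytesRef;
            import org.apache.lucene.util.PagedBytes;
            import org.apache.lucene.util.packed.MonotonicAppendingLongBuffer;

            class BinaryValuesBuffer {
              private final PagedBytes bytes = new PagedBytes(16);        // 64 KB blocks
              private final DataOutput out = bytes.getDataOutput();       // append-only writing
              private final MonotonicAppendingLongBuffer endOffsets = new MonotonicAppendingLongBuffer();
              private long totalBytes;

              void add(BytesRef value) throws IOException {
                out.writeBytes(value.bytes, value.offset, value.length);
                totalBytes += value.length;
                endOffsets.add(totalBytes);  // absolute end offset of this value, per the list above
              }
            }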

        Michael McCandless added a comment -

        +1 to cutover to PagedBytes!

        Michael McCandless added a comment -

        New patch, cutting over to PagedBytes. I also revived .getDataInput/Output (from 4.x).

        I also added javadocs about these limits in DocValuesType.

        There are still nocommits to resolve ... the biggest one is whether we should fix default DV impl to accept values > 32 KB ... right now it throws IAE.

        Robert Muir added a comment -

        There are still nocommits to resolve ... the biggest one is whether we should fix default DV impl to accept values > 32 KB ... right now it throws IAE.

        I do not think this should change. This disagrees with your earlier comments: it's already spinning out of control, just as I mentioned it might.

        Adrien Grand added a comment -

        store the lengths instead as absolute offsets with MonotonicAppendingLongBuffer (this should be more efficient)

        I just had a quick discussion about this with Robert, and since AppendingLongBuffer stores deltas from the minimum value of the block (and not 0), AppendingLongBuffer is better (ie. faster and more compact) than MonotonicAppendingLongBuffer to store lengths. This means that if all lengths are 7, 8 or 9 in a block, it will only require 2 bits per value instead of 4.
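
        A tiny worked example of the packing arithmetic described here; PackedInts.bitsRequired is a real utility, and the lengths are just the ones from the example:

            import org.apache.lucene.util.packed.PackedInts;

            // If every length in a block is 7, 8 or 9, a delta-from-minimum encoding only has
            // to distinguish 0, 1 and 2, while encoding the raw values needs 4 bits each.
            long min = 7, max = 9;
            int bitsPerDelta = PackedInts.bitsRequired(max - min);  // bitsRequired(2) == 2
            int bitsPerRawValue = PackedInts.bitsRequired(max);     // bitsRequired(9) == 4
            System.out.println(bitsPerDelta + " bits vs " + bitsPerRawValue + " bits per value");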

        Michael McCandless added a comment -

        I do not think this should change.

        OK I just removed that nocommit.

        I just had a quick discussion about this with Robert, and since AppendingLongBuffer stores deltas from the minimum value of the block (and not 0), AppendingLongBuffer is better (ie. faster and more compact) than MonotonicAppendingLongBuffer to store lengths. This means that if all lengths are 7, 8 or 9 in a block, it will only require 2 bits per value instead of 4.

        Ahh, OK: I switched back to AppendingLongBuffer.

        New patch. I factored testHugeBinaryValues up into the
        BaseDocValuesFormatTestCase base class, and added a protected method so
        the codecs that don't accept huge binary values can say so. I also
        added a test case for Facet42DVFormat, and cut back to
        AppendingLongBuffer.

        I downgraded the nocommit about the spooky unused PagedBytes.blockEnds
        to a TODO ... this class is somewhat dangerous because e.g. you can
        use the copyUsingLengthPrefix method, then use getDataOutput, and get
        corrupted bytes out.

        Robert Muir added a comment -

        This one is looking pretty good, but needs a few fixes. At the least, we should fix the tests:

        NOTE: reproduce with: ant test -Dtestcase=TestDocValuesFormat -Dtests.method=testHugeBinaryValues -Dtests.seed=5F68849875ABAC05 -Dtests.slow=true -Dtests.locale=es_PY -Dtests.timezone=Canada/Mountain -Dtests.file.encoding=ISO-8859-1

        David Smiley added a comment -

        I like the new test, Mike – in particular it doesn't mandate a failure if the codec accepts > 32k.

        I want to make sure it's clear what the logic is behind the decisions being made by Mike & Rob on this thread regarding the limits for binary doc values (not other things). Firstly there is no intrinsic technical limitation that the Lucene42Consumer has on these values to perhaps 2GB (not sure but "big"). Yet it is being decided to artificially neuter it to 32k. I don't see anything in this thread establishing a particular use of binary DocValues that establishes its intended use; I see it as being as general purpose as stored values, with different performance characteristics (clearly it's column-stride, for example). The particular use I established earlier would totally suck if it had to use stored values. And the reason for this limit... I'm struggling to find the arguments in this thread, but it appears to be that hypothetically in the future, there might evolve newer clever encodings that simply can't handle more than 32k. If that's it, then wouldn't such a new implementation simply have this different limit, and leave both as reasonable choices by the application? If that isn't it, then what is the reason?

        Robert Muir added a comment -

        Firstly there is no intrinsic technical limitation that the Lucene42Consumer has on these values to perhaps 2GB (not sure but "big"). Yet it is being decided to artificially neuter it to 32k.

        This is absolutely not correct. The fact that this limitation exists (based on the PagedBytes block size) and that nobody sees it makes me really want to rethink what we are doing here: I don't want to allow index corruption, sorry.

        David Smiley added a comment -

        I don't follow. From the javadocs, I get the impression that PagedBytes can handle basically any data size. blockBits appears to tune the block size, but I don't see how that limits the total capacity in any significant way. The only method I see that appears to have a limit is copy(BytesRef bytes, BytesRef out), which is only used when uninverting doc terms, which doesn't apply to binary DocValues.

        Robert Muir added a comment -

        Just look at the javadocs for the only PagedBytes method that Lucene42's binary DV producer actually uses:

             * ...
             * Slices spanning more than one block are not supported.
             * ...
             **/
            public void fillSlice(BytesRef b, long start, int length) {
        
        David Smiley added a comment -

        Aha; thanks for the clarification. I see it now. And I see that after I commented out the limit check, the assertion was hit. I didn't hit this assertion with Barakat's patch when I last ran it; weird, but whatever.

        BTW ByteBlockPool doesn't really have this limit, notwithstanding the bug that Barakat fixed in his patch. It's not a hard limit, as BBP.append() and readBytes() will conveniently loop for you, whereas with PagedBytes you would have to loop on fillSlice() yourself to support big values. That is a bona-fide bug in ByteBlockPool, that it didn't implement that loop correctly, and it should be fixed if not in this issue then in another.

        So a DocValues codec that supports large binary values could be nearly identical to the current codec but call fillSlice() in a loop, and only for variable-sized binary values (just like BBP's algorithm), and that would basically be the only change. Do you support such a change? If not then why not (a technical reason please)? If you can't support such a change, then would you also object to the addition of a new codec that simply lifted this limit as I proposed? Note that would include potentially a bunch of duplicated code just to call fillSlice() in a loop; I propose it would be simpler and more maintainable to not limit binary docvalues to 32k.

        Barakat Barakat added a comment -

        Just to clarify: before 4.3 I was fixing the "bug" by changing PagedBytes#fillSlice as in the first patch I posted in this issue.

        Robert Muir added a comment -

        No, I don't support changing this codec. It's an all-in-memory one (which is an unfortunate default, but must be until various algorithms in the grouping/join/etc packages are improved such that we can safely use something more like DiskDV as a default). Other all-memory implementations like DirectPostingsFormat/MemoryPostings have similar limitations, even the specialized faceting one (e.g. an entire segment cannot have more than 2GB total bytes).

        I don't want to add a bunch of stuff in a loop here or any of that, because it only causes additional complexity for the normal case, and I think it's unreasonable to use a RAM docvalues impl if you have more than 32KB of per-document cost anyway. Sorry, that's just crazy, and I don't think we should add any additional trappy codec to support it.

        So if you want ridiculously huge per-document values, just use DiskDV, which supports that. These abuse cases are extreme: if you really really want that all in RAM, then use it with FileSwitchDirectory.

        I mentioned before I was worried about this issue spinning out of control, and it appears this has taken place. Given these developments, I'd rather we not change the current limit at all.
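
        A hedged sketch of the kind of per-field wiring being pointed at, using the 4.x class names discussed in this thread (Lucene42Codec, DiskDocValuesFormat); package names and defaults may differ across versions:

            import org.apache.lucene.codecs.DocValuesFormat;
            import org.apache.lucene.codecs.diskdv.DiskDocValuesFormat;
            import org.apache.lucene.codecs.lucene42.Lucene42Codec;

            // Route only the field that can exceed 32 KB to the disk-backed format,
            // keeping the default RAM-resident format for every other field.
            public class HugeBinaryCodec extends Lucene42Codec {
              private final DocValuesFormat disk = new DiskDocValuesFormat();

              @Override
              public DocValuesFormat getDocValuesFormatForField(String field) {
                return "hugeBinaryField".equals(field) ? disk : super.getDocValuesFormatForField(field);
              }
            }
            // usage (sketch): indexWriterConfig.setCodec(new HugeBinaryCodec());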

        Michael McCandless added a comment -

        David, can you open a separate issue about changing the limit for existing codecs? The default DVFormat today is all-in-RAM, and I think it's OK for it to have limits; if/when we change the default to DiskDV, for example, it has no limit.

        I think this issue should focus solely on fixing core/indexer to not enforce the limit (i.e., moving the limit enforcement into those DVFormats that have it).

        David Smiley added a comment -

        I can understand that an all-in-RAM codec has size sensitivities. In that light, I can also understand that 32KB per document is a lot. The average per-document variable bytes length for Barakat's index is a measly 10 bytes; the maximum is around 69k. Likewise for the user Shai referenced on the list, who was using it for faceting: only the worst-case document(s) exceeded 32KB.

        Might the "new PagedBytes(16)" in Lucene42DocValuesProducer.loadBinary() be made configurable, i.e., make the 16 configurable? And/or perhaps make loadBinary() protected, so another codec extending this one can keep its changes somewhat minimal.

        Mike, in your latest patch, one improvement that could be made: instead of Lucene42DocValuesConsumer assuming the limit is "ByteBlockPool.BYTE_BLOCK_SIZE - 2" (which it technically is, but only by coincidence), you could reference a calculated constant shared with the code that actually has this limit, namely Lucene42DocValuesProducer.loadBinary(). For example, set the constant to 2^16-2 and then add an assert in loadBinary() that the constant is consistent with the PagedBytes instance's configuration. Or something like that.

        David can you open a separate issue about changing the limit for existing codecs?

        Uh... all the discussion has been here, so that seems too late to me. And I'm probably done making my arguments; I can't be more convincing than pointing out the 10-byte average figure for my use case.
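        To make the shared-constant suggestion above concrete, a hypothetical sketch (the class and method names are invented for illustration; the real patch may differ):

          import org.apache.lucene.util.PagedBytes;

          // Hypothetical sketch: one constant drives both the write-time check and
          // the read-time PagedBytes configuration, with an assert tying them together.
          final class BinaryLimitSketch {
            static final int BLOCK_BITS = 16; // page size the producer passes to PagedBytes
            static final int MAX_BINARY_FIELD_LENGTH = (1 << BLOCK_BITS) - 2;

            // Consumer side: reject too-large values up front with a clear message.
            static void checkLength(String fieldName, int length) {
              if (length > MAX_BINARY_FIELD_LENGTH) {
                throw new IllegalArgumentException("DocValues field \"" + fieldName
                    + "\" is too large, must be <= " + MAX_BINARY_FIELD_LENGTH);
              }
            }

            // Producer side: build the PagedBytes and assert the limit is consistent
            // with the page size actually used.
            static PagedBytes newPagedBytes(int blockBits) {
              assert MAX_BINARY_FIELD_LENGTH <= (1 << blockBits) - 2
                  : "limit must fit what fillSlice can return for this page size";
              return new PagedBytes(blockBits);
            }
          }

        loadBinary() would then call newPagedBytes(BLOCK_BITS), so changing the page size without revisiting the limit trips the assert rather than going unnoticed.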

        Robert Muir added a comment -

        You convinced me, don't worry: you convinced me we shouldn't do anything on this whole issue at all, because the stuff you outlined here is absolutely the wrong path for us to be going down.

        Michael McCandless added a comment -

        Mike, in your latest patch, one improvement that could be made is instead of Lucene42DocValuesConsumer assuming the limit is "ByteBlockPool.BYTE_BLOCK_SIZE - 2" (which it technically is but only by coincidence), you could instead reference a calculated constant shared with the actual code that has this limit which is Lucene42DocValuesProducer.loadBinary(). For example, set the constant to 2^16-2 but then add an assert in loadBinary that the constant is consistent with the PagedBytes instance's config. Or something like that.

        +1

        But, again, let's keep this issue focused on not enforcing a limit in the core indexing code.

        Per-codec limits are separate issues.

        David Smiley added a comment -

        Ok. I have nothing further to say in this issue; I welcome your improvements, Mike.

        selckin added a comment -

        A few comments up someone asked for a use case; shouldn't something like http://www.elasticsearch.org/guide/reference/mapping/source-field/ be a perfect thing to use BinaryDocValues for?

        I was trying to store something similar using DiskDocValuesFormat and hit the 32k limit.

        Robert Muir added a comment -

        good god no.

        DocValues are not stored fields...

        This reinforces the value of the limit!

        selckin added a comment -

        OK, from the talks I watched on them and other info I gathered, it seemed like this would be a good fit; I guess I really missed the point somewhere. I can't find much info in the javadocs either, but I guess this is for the user list and I shouldn't pollute this issue.

        David Smiley added a comment -

        Should the "closed" status and resolution change to "not a problem" mean that Michael McCandless improvement's in his patch here (that don't change the limit) won't get applied? They looked good to me. And you?

        Michael McCandless added a comment -

        I still think we should fix the limitation in core; this way apps that want to store large binary fields per-doc are able to use a custom DVFormat.

        Yonik Seeley added a comment -

        I still think we should fix the limitation in core; this way apps that want to store large binary fields per-doc are able to use a custom DVFormat.

        +1
        Arbitrary limits are not a feature.

        Michael McCandless added a comment -

        Another iteration on the patch:

        • I added constants MAX_BINARY_FIELD_LENGTH to
          Lucene4{0,2}DocValuesFormat, and then reference that in the
          Writer/Consumer to catch too-big values.
        • I added another test to BaseDocValuesFormatTestCase, to test the
          exact maximum length value.
        • I fixed that test failure, by passing the String field to the
          codecAcceptsHugeBinaryValues method, and adding a _TestUtil helper
          method to check this.

        An alternative to the protected method would be to have two separate
        tests in the base class: one test verifying a clean IllegalArgumentException
        is thrown when the value is too big, and another verifying huge binary
        values can be indexed successfully. Then I'd fix each DVFormat's
        test to subclass and @Ignore whichever base test is not appropriate.

        But I don't think this would simplify things much? I.e.,
        TestDocValuesFormat would still need logic to check depending on the
        default codec.
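        As a rough illustration of the protected-method approach (assuming the codecAcceptsHugeBinaryValues hook added by this patch and the 4.x test-framework helper _TestUtil.alwaysDocValuesFormat; the subclass shown is a hypothetical example, not part of the patch):

          import org.apache.lucene.codecs.Codec;
          import org.apache.lucene.codecs.diskdv.DiskDocValuesFormat;
          import org.apache.lucene.index.BaseDocValuesFormatTestCase;
          import org.apache.lucene.util._TestUtil;

          // Hypothetical per-format test: the base class consults
          // codecAcceptsHugeBinaryValues(field) and either expects huge values to
          // index cleanly or expects a clean IllegalArgumentException.
          public class TestDiskDocValuesFormatSketch extends BaseDocValuesFormatTestCase {
            private final Codec codec = _TestUtil.alwaysDocValuesFormat(new DiskDocValuesFormat());

            @Override
            protected Codec getCodec() {
              return codec;
            }

            @Override
            protected boolean codecAcceptsHugeBinaryValues(String field) {
              return true; // this format has no 32k per-value limit
            }
          }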

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Shai Erera added a comment -

        Patch looks good. I prefer the current way of the test (the 'protected' method).

        Also, you have a printout in Lucene40DocValuesWriter after the "if (b.length > MAX_BINARY)" - remove/comment?

        +1 to commit.

        David Smiley added a comment -

        Cool; I didn't know of the Facet42 codec with its support for large doc values. Looks like I can use it without faceting. I'll have to try that.

        +1 to commit.

        Michael McCandless added a comment -

        New patch.

        Thanks for the review, Shai; I removed that leftover print and sync'd the
        patch to trunk. I think it's ready ... I'll wait a few days.

        Yonik Seeley added a comment -

        I'm confused by the following comment:

        +  /** Maximum length for each binary doc values field,
        +   *  because we use PagedBytes with page size of 16 bits. */
        +  public static final int MAX_BINARY_FIELD_LENGTH = (1 << 16) + 1;
        

        But this patch removes the PagedBytes limitation, right?

        After this patch, are there any remaining code limitations that prevent raising the limit, or is it really just self-imposed via

        +      if (v.length > Lucene42DocValuesFormat.MAX_BINARY_FIELD_LENGTH) {
        +        throw new IllegalArgumentException("DocValuesField \"" + field.name + "\" is too large, must be <= " + Lucene42DocValuesFormat.MAX_BINARY_FIELD_LENGTH);
        +      }
        
        Yonik Seeley added a comment -

        Another user has hit this arbitrary limit: http://markmail.org/message/sotbq6xpib4xwozz
        If it is arbitrary at this point, we should simply remove it.

        Michael McCandless added a comment -

        I'm confused by the following comment:

        I fixed the comment; it's because those DVFormats use PagedBytes.fillSlice, which cannot handle more than 2 pages.

        New patch w/ that fix ...

        Michael McCandless added a comment -

        Actually, I'm going to roll the limit for 40/42 DVFormats back to what IndexWriter currently enforces ... this way we don't get into a situation where different 40/42 indices out there were built with different limits enforced.

        ASF subversion and git services added a comment -

        Commit 1514669 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1514669 ]

        LUCENE-4583: IndexWriter no longer places a limit on length of DV binary fields (individual codecs still have their limits, including the default codec)

        ASF subversion and git services added a comment -

        Commit 1514848 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1514848 ]

        LUCENE-4583: IndexWriter no longer places a limit on length of DV binary fields (individual codecs still have their limits, including the default codec)

        Adrien Grand added a comment -

        4.5 release -> bulk close


          People

          • Assignee:
            Michael McCandless
          • Reporter:
            David Smiley
          • Votes:
            1
          • Watchers:
            12
