Lucene - Core
  1. Lucene - Core
  2. LUCENE-5178

doc values should expose missing values (or allow configurable defaults)

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, Trunk
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      DocValues should somehow allow a configurable default per-field.
      Possible implementations include setting it on the field in the document or registration of an IndexWriter callback.

      If we don't make the default configurable, then another option is to have DocValues fields keep track of whether a value was indexed for that document or not.

      1. LUCENE-5178_reintegrate.patch
        217 kB
        Robert Muir
      2. LUCENE-5178.patch
        395 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          I dont think lucene needs to do anything here. As i explained on LUCENE-5177 (but you refused to listen), you just have a separate numericdocvalues field just like fieldcache does (it has a separate bitset).

          So you write a 1 there, when the document has a value for the field. Otherwise it will contain a 0 (because it gets filled with that).

          This is very easy to do and will use ~ 1 bit per document just like fieldcache.

          Then its just a matter of making it easy to pass this NumericDV to FIeldComparator (currently: it always pulls a Bits directly from FC). this can be something like:

          if (ndv instanceof Bits) { // codec already specializes here
            docsWithField = (Bits) ndv;
          } else {
            docsWithField = new Bits() {
              boolean get(int index) {
                return ndv.get(index) != 0;
              }
          }
          
          Show
          Robert Muir added a comment - I dont think lucene needs to do anything here. As i explained on LUCENE-5177 (but you refused to listen), you just have a separate numericdocvalues field just like fieldcache does (it has a separate bitset). So you write a 1 there, when the document has a value for the field. Otherwise it will contain a 0 (because it gets filled with that). This is very easy to do and will use ~ 1 bit per document just like fieldcache. Then its just a matter of making it easy to pass this NumericDV to FIeldComparator (currently: it always pulls a Bits directly from FC). this can be something like: if (ndv instanceof Bits) { // codec already specializes here docsWithField = (Bits) ndv; } else { docsWithField = new Bits() { boolean get( int index) { return ndv.get(index) != 0; } }
          Hide
          Yonik Seeley added a comment -

          Then its just a matter of [...]

          Well, at least you're now acknowledging that there is a problem and that it will take work somewhere in Lucene to fix it!

          It's still not clear why you're against making the hard-coded default configurable however.

          Show
          Yonik Seeley added a comment - Then its just a matter of [...] Well, at least you're now acknowledging that there is a problem and that it will take work somewhere in Lucene to fix it! It's still not clear why you're against making the hard-coded default configurable however.
          Hide
          Yonik Seeley added a comment -

          Somewhat related, I've opened SOLR-5165 to remove Solr's mandatory default values for DocValue fields.

          Show
          Yonik Seeley added a comment - Somewhat related, I've opened SOLR-5165 to remove Solr's mandatory default values for DocValue fields.
          Hide
          Robert Muir added a comment -

          I am not acknowledging there is a problem: I'm just telling you if you have 'sparse' values in a docvalues field, and you want to emulate what fieldcache does in allowing you to optionally pull a bitset telling you when a value does/doesnt exist: you can do the same thing at index-time yourself today.

          I'm against changing the "default" of 0 because its both unnecessary and unhelpful to differentiate whether a value exists in the field (it wont work: for numeric types it could be a "real value". Thats why FieldCache does this as a bitset, thats why FieldCache has a "hardcoded default" of 0). I don't want to add unnecessary complexity that ultimately provides no benefit (because that solves nothing, sorry).

          I'm not opposed to allowing the comparators to take in a bits from somewhere other than the fieldcache (which i think always returns MatchAllBits for dv fields). This way if someone wants this: they can do it. I do have some reservations about it, because it doesnt give a 1-1 consistency with FieldCache api (so wouldnt "automatically" work for function queries without giving them special ctors too). So this would make APIs harder to use: and I don't like that... but its an option and its totally clear to the user what is happening.

          I'm significantly less opposed to supporting an equivalent to FieldCache.getDocsWithField for docvalues. The advantage is we could pass FieldCache.getDocsWithField thru to it, and the sort missing-first/last, function queries exist() and so on would automatically work. The downsides are: it adds some complexity under the hood to deal with (e.g. indexwriter consumers, codec apis need change, codecs need to optimize for the case where none are missing, etc). And is this really complexity we should be adding for what is supposed to be a column-stride type (like norms?)... I'm just not sure its the right tradeoff.

          Show
          Robert Muir added a comment - I am not acknowledging there is a problem: I'm just telling you if you have 'sparse' values in a docvalues field, and you want to emulate what fieldcache does in allowing you to optionally pull a bitset telling you when a value does/doesnt exist: you can do the same thing at index-time yourself today. I'm against changing the "default" of 0 because its both unnecessary and unhelpful to differentiate whether a value exists in the field (it wont work: for numeric types it could be a "real value". Thats why FieldCache does this as a bitset, thats why FieldCache has a "hardcoded default" of 0). I don't want to add unnecessary complexity that ultimately provides no benefit (because that solves nothing, sorry). I'm not opposed to allowing the comparators to take in a bits from somewhere other than the fieldcache (which i think always returns MatchAllBits for dv fields). This way if someone wants this: they can do it. I do have some reservations about it, because it doesnt give a 1-1 consistency with FieldCache api (so wouldnt "automatically" work for function queries without giving them special ctors too). So this would make APIs harder to use: and I don't like that... but its an option and its totally clear to the user what is happening. I'm significantly less opposed to supporting an equivalent to FieldCache.getDocsWithField for docvalues. The advantage is we could pass FieldCache.getDocsWithField thru to it, and the sort missing-first/last, function queries exist() and so on would automatically work. The downsides are: it adds some complexity under the hood to deal with (e.g. indexwriter consumers, codec apis need change, codecs need to optimize for the case where none are missing, etc). And is this really complexity we should be adding for what is supposed to be a column-stride type (like norms?)... I'm just not sure its the right tradeoff.
          Hide
          Yonik Seeley added a comment -

          This way if someone wants this: they can do it.

          Shouldn't this be handled at the lucene level?

          Anyway, this whole discussion all started from this:

          There are currently one or two disadvantages to using docvalues in solr currently. The major disadvantage is being forced to specify a default value in Solr,

          That's the major issue, to which you replied "why would have fields defined you arent using?". Although the lack of resolution on this issue would be an annoyance, the real problem is SOLR-5165. There should have never been a required default value for Solr, and it should be removed.

          Show
          Yonik Seeley added a comment - This way if someone wants this: they can do it. Shouldn't this be handled at the lucene level? Anyway, this whole discussion all started from this: There are currently one or two disadvantages to using docvalues in solr currently. The major disadvantage is being forced to specify a default value in Solr, That's the major issue, to which you replied "why would have fields defined you arent using?". Although the lack of resolution on this issue would be an annoyance, the real problem is SOLR-5165 . There should have never been a required default value for Solr, and it should be removed.
          Hide
          Robert Muir added a comment -

          Shouldn't this be handled at the lucene level?

          Well thats the question with my last 2 paragraphs, where i list the tradeoffs between two approaches. I will hack up a patch to see just how bad it is to add this 'Bits getDocWithField' support. Then we can see this more visually.

          There are currently one or two disadvantages to using docvalues in solr currently. The major disadvantage is being forced to specify a default value in Solr,

          That's the major issue, to which you replied "why would have fields defined you arent using?". Although the lack of resolution on this issue would be an annoyance, the real problem is SOLR-5165. There should have never been a required default value for Solr, and it should be removed.

          I agree the default value is "brutal", but as far as I know, Adrien did it in this way to leave no surprises (so users arent complaining when things like sort missing last/first dont work).

          Another approach would be (right now) to only fail when those options are set, another option would be to handle it at the solr fieldtype level.

          But I'm happy to write a patch for it on the lucene side so we can see how complicated it really is, its better than arguing about it.

          Show
          Robert Muir added a comment - Shouldn't this be handled at the lucene level? Well thats the question with my last 2 paragraphs, where i list the tradeoffs between two approaches. I will hack up a patch to see just how bad it is to add this 'Bits getDocWithField' support. Then we can see this more visually. There are currently one or two disadvantages to using docvalues in solr currently. The major disadvantage is being forced to specify a default value in Solr, That's the major issue, to which you replied "why would have fields defined you arent using?". Although the lack of resolution on this issue would be an annoyance, the real problem is SOLR-5165 . There should have never been a required default value for Solr, and it should be removed. I agree the default value is "brutal", but as far as I know, Adrien did it in this way to leave no surprises (so users arent complaining when things like sort missing last/first dont work). Another approach would be (right now) to only fail when those options are set, another option would be to handle it at the solr fieldtype level. But I'm happy to write a patch for it on the lucene side so we can see how complicated it really is, its better than arguing about it.
          Hide
          ASF subversion and git services added a comment -

          Commit 1514642 from Robert Muir in branch 'dev/branches/lucene5178'
          [ https://svn.apache.org/r1514642 ]

          LUCENE-5178: add 'missing' support to docvalues (simpletext only)

          Show
          ASF subversion and git services added a comment - Commit 1514642 from Robert Muir in branch 'dev/branches/lucene5178' [ https://svn.apache.org/r1514642 ] LUCENE-5178 : add 'missing' support to docvalues (simpletext only)
          Hide
          Robert Muir added a comment -

          I created a patch with getDocsWithField (and the current fieldcache.getDocsWithField passing thru to it) for docvalues so you know if a value is missing.

          It also means e.g. SortedDocValues returns -1 ord for missing like fieldcache. so its completely consistent there with FC.

          currently simpletext is the only one implementing it: the other codecs return MatchAllBits (and thats how the backcompat will work, because they never had missing values before).

          all tests are passing, but I want to think about strategies for the efficient codecs (Memory/Disk) before doing anything.

          one other thing i like is, if we do it this way, the codec has the chance to represent missing values in a more efficient way than if the users do it themselves "on top".

          Show
          Robert Muir added a comment - I created a patch with getDocsWithField (and the current fieldcache.getDocsWithField passing thru to it) for docvalues so you know if a value is missing. It also means e.g. SortedDocValues returns -1 ord for missing like fieldcache. so its completely consistent there with FC. currently simpletext is the only one implementing it: the other codecs return MatchAllBits (and thats how the backcompat will work, because they never had missing values before). all tests are passing, but I want to think about strategies for the efficient codecs (Memory/Disk) before doing anything. one other thing i like is, if we do it this way, the codec has the chance to represent missing values in a more efficient way than if the users do it themselves "on top".
          Hide
          Yonik Seeley added a comment -

          Yes, I think tracking/exposing missing values is the best option, esp for numerics where you can use the full range and still tell of there was a value or not.

          Show
          Yonik Seeley added a comment - Yes, I think tracking/exposing missing values is the best option, esp for numerics where you can use the full range and still tell of there was a value or not.
          Hide
          Robert Muir added a comment -

          OK. I can remove the solr defaultValue check here too: i have to fix the tests to test sort missing first/last / facet missing etc anyway (currently the dv tests avoid that).

          Show
          Robert Muir added a comment - OK. I can remove the solr defaultValue check here too: i have to fix the tests to test sort missing first/last / facet missing etc anyway (currently the dv tests avoid that).
          Hide
          Robert Muir added a comment -

          Here is a patch of 5178 branch:

          • Adds AtomicReader.getDocsWithField(String) that works just like FieldCache.getDocsWithField(String) and fieldcacheimpl just forwards to it: this means things like facet missing/function queries exists()/sort missing last work with docvalues already.
          • SortedSet returns -1 ord for missing value from DV (not just fieldcache). So now these apis work exactly the same.
          • Codec api is unchanged, codecs get null values for missing.
          • new Lucene45 codec (works off heap: LUCENE-5178) that supports this, older codecs do not (they return MatchAllBits, essentially they never had missing values, only "filled" ones).
          • Lucene42 becomes read-only and the writers in test-framework for Lucene40 and Lucene42 "simulates" what happened before by filling themselves and always returning MatchAllBits
          • new Memory codec (works on heap) works like Lucene42 did, except supports missing values too, as does simpletext.
          • Facet42 dv is unchanged (for its use cases, the distinction between "missing" and "empty ordinal list" is not useful). Facet42Codec (which was just sugar to use it) is renamed to Facet45Codec, but of course old segments can be read since the dv name is the same.
          • Removed solr restrictions for single-valued docvalues fields (they dont need to have a default value or be required anymore).
          • Improved testing for solr docvalues faceting (single valued fields, numericfacets, and so on). This was previously not possible because only multi-valued types could support missing.
          • I reviewed all consumers and ensure they are optimized (e.g. check that value == 0 before wasting their time and so on, this was already this way in lucene's sort comparators)
          Show
          Robert Muir added a comment - Here is a patch of 5178 branch: Adds AtomicReader.getDocsWithField(String) that works just like FieldCache.getDocsWithField(String) and fieldcacheimpl just forwards to it: this means things like facet missing/function queries exists()/sort missing last work with docvalues already. SortedSet returns -1 ord for missing value from DV (not just fieldcache). So now these apis work exactly the same. Codec api is unchanged, codecs get null values for missing. new Lucene45 codec (works off heap: LUCENE-5178 ) that supports this, older codecs do not (they return MatchAllBits, essentially they never had missing values, only "filled" ones). Lucene42 becomes read-only and the writers in test-framework for Lucene40 and Lucene42 "simulates" what happened before by filling themselves and always returning MatchAllBits new Memory codec (works on heap) works like Lucene42 did, except supports missing values too, as does simpletext. Facet42 dv is unchanged (for its use cases, the distinction between "missing" and "empty ordinal list" is not useful). Facet42Codec (which was just sugar to use it) is renamed to Facet45Codec, but of course old segments can be read since the dv name is the same. Removed solr restrictions for single-valued docvalues fields (they dont need to have a default value or be required anymore). Improved testing for solr docvalues faceting (single valued fields, numericfacets, and so on). This was previously not possible because only multi-valued types could support missing. I reviewed all consumers and ensure they are optimized (e.g. check that value == 0 before wasting their time and so on, this was already this way in lucene's sort comparators)
          Hide
          Robert Muir added a comment -

          Here is the result of svn merge --reintegrate, i dont think its necessarily that easier to work with unfortunately

          Show
          Robert Muir added a comment - Here is the result of svn merge --reintegrate, i dont think its necessarily that easier to work with unfortunately
          Hide
          Michael McCandless added a comment -

          +1, I reviewed the first patch.

          We'll need to fix facet module's dynamic range faceting to skip missing values; I can do this after you commit this patch...

          Show
          Michael McCandless added a comment - +1, I reviewed the first patch. We'll need to fix facet module's dynamic range faceting to skip missing values; I can do this after you commit this patch...
          Hide
          ASF subversion and git services added a comment -

          Commit 1515977 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1515977 ]

          LUCENE-5178: add missing support for docvalues

          Show
          ASF subversion and git services added a comment - Commit 1515977 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1515977 ] LUCENE-5178 : add missing support for docvalues
          Hide
          Robert Muir added a comment -

          Thanks Mike: I added additional tests and ensured the 'missing' stuff in solr is fully functional.

          I'll let jenkins chomp on it for a while in trunk.

          Show
          Robert Muir added a comment - Thanks Mike: I added additional tests and ensured the 'missing' stuff in solr is fully functional. I'll let jenkins chomp on it for a while in trunk.
          Hide
          ASF subversion and git services added a comment -

          Commit 1515999 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1515999 ]

          LUCENE-5178: handle missing values in range facets

          Show
          ASF subversion and git services added a comment - Commit 1515999 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1515999 ] LUCENE-5178 : handle missing values in range facets
          Hide
          ASF subversion and git services added a comment -

          Commit 1516003 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1516003 ]

          LUCENE-5178: test requires codec that supports docsWithField

          Show
          ASF subversion and git services added a comment - Commit 1516003 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1516003 ] LUCENE-5178 : test requires codec that supports docsWithField
          Hide
          ASF subversion and git services added a comment -

          Commit 1516012 from Robert Muir in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1516012 ]

          LUCENE-5178: add missing support for docvalues

          Show
          ASF subversion and git services added a comment - Commit 1516012 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1516012 ] LUCENE-5178 : add missing support for docvalues
          Hide
          Han Jiang added a comment -

          During test I somehow hit a failure:

             [junit4] FAILURE 0.27s | TestRangeAccumulator.testMissingValues <<<
             [junit4]    > Throwable #1: org.junit.ComparisonFailure: expected:<...(0)
             [junit4]    >   less than 10 ([8)
             [junit4]    >   less than or equal to 10 (]8)
             [junit4]    >   over 90 (8)
             [junit4]    >   9...> but was:<...(0)
             [junit4]    >   less than 10 ([28)
             [junit4]    >   less than or equal to 10 (2]8)
             [junit4]    >   over 90 (8)
             [junit4]    >   9...>
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([815B6AA86D05329C:EBC638EE498F066D]:0)
             [junit4]    > 	at org.apache.lucene.facet.range.TestRangeAccumulator.testMissingValues(TestRangeAccumulator.java:670)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:722)
          

          Seed:

          ant test  -Dtestcase=TestRangeAccumulator -Dtests.method=testMissingValues -Dtests.seed=815B6AA86D05329C -Dtests.slow=true -Dtests.postingsformat=Lucene41 -Dtests.locale=ca -Dtests.timezone=Australia/Currie -Dtests.file.encoding=UTF-8
          
          Show
          Han Jiang added a comment - During test I somehow hit a failure: [junit4] FAILURE 0.27s | TestRangeAccumulator.testMissingValues <<< [junit4] > Throwable #1: org.junit.ComparisonFailure: expected:<...(0) [junit4] > less than 10 ([8) [junit4] > less than or equal to 10 (]8) [junit4] > over 90 (8) [junit4] > 9...> but was:<...(0) [junit4] > less than 10 ([28) [junit4] > less than or equal to 10 (2]8) [junit4] > over 90 (8) [junit4] > 9...> [junit4] > at __randomizedtesting.SeedInfo.seed([815B6AA86D05329C:EBC638EE498F066D]:0) [junit4] > at org.apache.lucene.facet.range.TestRangeAccumulator.testMissingValues(TestRangeAccumulator.java:670) [junit4] > at java.lang.Thread.run(Thread.java:722) Seed: ant test -Dtestcase=TestRangeAccumulator -Dtests.method=testMissingValues -Dtests.seed=815B6AA86D05329C -Dtests.slow=true -Dtests.postingsformat=Lucene41 -Dtests.locale=ca -Dtests.timezone=Australia/Currie -Dtests.file.encoding=UTF-8
          Hide
          Shai Erera added a comment -

          I manage to reproduce it only if I set both tests.seed and tests.postingsformat. What I see is that Lucene42DVProducer is used, which returns MatchAllBits (it supports only SortedSet?), which leads to the incorrect counts. The initialized Codec is Lucene45Codec, which returns Lucene42DVF. The test does use defaultCodecSupportsDocsWithField(), but the latter only asserts that the name of the Codec is not "40", "41", "42", yet it does not check if the DVF supports it. That seems wrong?

          Show
          Shai Erera added a comment - I manage to reproduce it only if I set both tests.seed and tests.postingsformat. What I see is that Lucene42DVProducer is used, which returns MatchAllBits (it supports only SortedSet?), which leads to the incorrect counts. The initialized Codec is Lucene45Codec, which returns Lucene42DVF. The test does use defaultCodecSupportsDocsWithField(), but the latter only asserts that the name of the Codec is not "40", "41", "42", yet it does not check if the DVF supports it. That seems wrong?
          Hide
          Michael McCandless added a comment -

          It sounds like we need to check the actual DVFormat for that field (_TestUtil.getDocValuesFormat("field")) and then test whether that format supports missing values.

          I think this failure can only happen if you explicitly set -Dtests.postingsformat, because then we make an anon subclass of Lucene45 (TestRuleSetupAndRestoreClassEnv.java at line 194) ... so it sounds like in general we should not be using defaultCodecSupportsDocsWithField() but rather something like defaultDVFormatSupportsDocsWithField(String field) ...

          Show
          Michael McCandless added a comment - It sounds like we need to check the actual DVFormat for that field (_TestUtil.getDocValuesFormat("field")) and then test whether that format supports missing values. I think this failure can only happen if you explicitly set -Dtests.postingsformat, because then we make an anon subclass of Lucene45 (TestRuleSetupAndRestoreClassEnv.java at line 194) ... so it sounds like in general we should not be using defaultCodecSupportsDocsWithField() but rather something like defaultDVFormatSupportsDocsWithField(String field) ...
          Hide
          Shai Erera added a comment -

          I see. I think this can also happen if you use RandomCodec and it draws Lucene42DVF? So in this case, with this seed, it trips if you set postingsformat, but I'm not sure that in general this assume() is correct.

          The ugly part of having a test calling _TestUtil.geDVF(field) (or we wrap it in a nice method) is that the test will need to decide up front on all the fields it uses, and if there's a mistake, the error may happen in the future and harder to debug (i.e. spot that the test uses a different field than what it passed to assume()). But I don't think that asserting the Codec is the right test here, so this has to change.

          Show
          Shai Erera added a comment - I see. I think this can also happen if you use RandomCodec and it draws Lucene42DVF? So in this case, with this seed, it trips if you set postingsformat, but I'm not sure that in general this assume() is correct. The ugly part of having a test calling _TestUtil.geDVF(field) (or we wrap it in a nice method) is that the test will need to decide up front on all the fields it uses, and if there's a mistake, the error may happen in the future and harder to debug (i.e. spot that the test uses a different field than what it passed to assume()). But I don't think that asserting the Codec is the right test here, so this has to change.
          Hide
          Shai Erera added a comment -

          Opened LUCENE-5199.

          Show
          Shai Erera added a comment - Opened LUCENE-5199 .
          Hide
          Robert Muir added a comment -

          Can this commit please be reverted?

          The change makes the test API so complicated for something that cannot happen:
          You cannot have "unsupported fields" its all or none.

          This is a bug in LuceneTestCase, it should not do this when someone uses -Dtests.postingsformat.

          Show
          Robert Muir added a comment - Can this commit please be reverted? The change makes the test API so complicated for something that cannot happen: You cannot have "unsupported fields" its all or none. This is a bug in LuceneTestCase, it should not do this when someone uses -Dtests.postingsformat.
          Hide
          Adrien Grand added a comment -

          4.5 release -> bulk close

          Show
          Adrien Grand added a comment - 4.5 release -> bulk close

            People

            • Assignee:
              Unassigned
              Reporter:
              Yonik Seeley
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development