Lucene - Core
LUCENE-4547

DocValues field broken on large indexes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2, Trunk
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      I tried to write a test to sanity check LUCENE-4536 (first running against svn revision 1406416, before the change).

      But I found that docvalues is already broken for large indexes that have a PackedLongDocValues field:

      final int numDocs = 500000000;
      for (int i = 0; i < numDocs; ++i) {
        if (i == 0) {
          field.setLongValue(0L); // force > 32bit deltas
        } else {
          field.setLongValue(1<<33L); 
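          // NOTE: 1<<33L is an int shift in Java (the L suffix is on the shift
          // amount, not the operand), so this actually stores 2, not 2^33; see
          // the "out-of-coffee bug" comment below.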
        }
        w.addDocument(doc);
      }
      w.forceMerge(1);
      w.close();
      dir.close(); // checkindex
      
      [junit4:junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread #0,6,TGRP-Test2GBDocValues]
      [junit4:junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArrayIndexOutOfBoundsException: -65536
      [junit4:junit4]   2> 	at __randomizedtesting.SeedInfo.seed([5DC54DB14FA5979]:0)
      [junit4:junit4]   2> 	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:535)
      [junit4:junit4]   2> 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:508)
      [junit4:junit4]   2> Caused by: java.lang.ArrayIndexOutOfBoundsException: -65536
      [junit4:junit4]   2> 	at org.apache.lucene.util.ByteBlockPool.deref(ByteBlockPool.java:305)
      [junit4:junit4]   2> 	at org.apache.lucene.codecs.lucene40.values.FixedStraightBytesImpl$FixedBytesWriterBase.set(FixedStraightBytesImpl.java:115)
      [junit4:junit4]   2> 	at org.apache.lucene.codecs.lucene40.values.PackedIntValues$PackedIntsWriter.writePackedInts(PackedIntValues.java:109)
      [junit4:junit4]   2> 	at org.apache.lucene.codecs.lucene40.values.PackedIntValues$PackedIntsWriter.finish(PackedIntValues.java:80)
      [junit4:junit4]   2> 	at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:130)
      [junit4:junit4]   2> 	at org.apache.lucene.codecs.PerDocConsumer.merge(PerDocConsumer.java:65)
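
      For context: a negative index like this is consistent with 32-bit offset arithmetic overflowing once the pool holds more than 2^31 bytes. A minimal sketch of the arithmetic (the exact expression inside ByteBlockPool is an assumption here, not a quote from its code):

      // 500M docs times 8 bytes per value is ~4 GB, past Integer.MAX_VALUE,
      // so an int-typed byte offset wraps negative and deref() ends up
      // indexing backwards into its block array.
      class OverflowSketch {
        public static void main(String[] args) {
          long totalBytes = 500000000L * 8; // 4,000,000,000
          int offset = (int) totalBytes;    // wraps to -294,967,296
          System.out.println(offset);       // prints a negative, bogus offset
        }
      }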
      
      Attachments

      1. test.patch (3 kB) - Robert Muir
      2. LUCENE-4547.patch (1.86 MB) - Robert Muir

          Activity

          Robert Muir added a comment -

          Here was my initial test, just screwing around.

          I ran with 'ant test -Dtestcase=Test2GBDocValues -Dtests.nightly=true -Dtests.heapsize=5G'

          Robert Muir added a comment -

          There is even an out-of-coffee bug in the test: it's only using about 2 bits per value.
          So this is really even worse.

          I'm not sure we should be using ByteBlockPool etc. here. I think it shouldn't be used outside of the indexer.

          Robert Muir added a comment -

          Editing description: I think it actually affects more than just PackedIntValues.

          I think the bug is in how FixedStraightBytesImpl uses ByteBlockPool.

          So the problem should be way more widespread: e.g. if you have lots of documents in general, I think you are fucked (norms should trip it too).

          Robert Muir added a comment -

          Another bug is that I had to pass tests.heapsize at all.

          I think it's bad that docvalues gobbles up so much RAM when merging.
          Can't we merge this stuff from disk?

          Adrien Grand added a comment -

          I'm having a look at the branch, and it looks great! I like the fact that there are fewer types, and that values are buffered into memory so that the doc values format can make decisions depending on the number of distinct values, etc. Still, I have some questions about what you plan to do with this branch:

          • Do you plan to use this branch to:
            • fix other issues such as LUCENE-3862?
            • merge the FieldCache / FunctionValues / DocValues.Source APIs?
          • Are you going to remove DocValues.Type.FLOAT_*?
          • Are SimpleDVConsumer and SimpleDocValuesFormat going to replace PerDocConsumer and DocValuesFormat?
          • Are you going to remove hasArray/getArray?
          • Will there still be a direct=true|false option at load time, or will it depend on the format impl (potentially with a PerFieldPerDocProducer, similarly to the postings formats)?
          Robert Muir added a comment -

          These are hard questions. My personal goals for this prototype (currently SimpleText only!) were to:

          1. Make merging use (significantly) less RAM, to fix this bug.
          2. Make it easier to write docvalues codecs, to encourage innovation (e.g. FST impls, etc.).
          3. Simplify the types to make it easier on the user.

          The consumer API is simpler, I think (part of #2), but I would like to simplify the producer API too in the future.
          I'm not sure we should do it here, though. Anyway, we can think about the issues you raised one by one and handle them separately on their own issues.

          fix other issues such as LUCENE-3862?

          It's my opinion we should do this sooner rather than later.

          merge the FieldCache / FunctionValues / DocValues.Source APIs?

          This really needs to be addressed, but I think not here. It's horrific that algorithms like grouping, sorting, and maybe faceting have to be duplicated for two different things (FieldCache and docvalues).

          are you going to remove DocValues.Type.FLOAT_*?

          I think the 3 types we have here are enough. Someone can do a float or double type "on top of" the "number" type we have.
          Lucene is already doing this today: look at norms. I think Lucene should just have a number type that stores bits.
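
          For instance, the float-on-top-of-number mapping can be just the JDK's bit-conversion calls (a minimal illustration, not code from the branch):

          // at index time, store the raw IEEE-754 bits in the long "number" field
          long bits = Double.doubleToLongBits(3.14);
          // at read time, reinterpret the stored bits
          double value = Double.longBitsToDouble(bits);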

          are SimpleDVConsumer and SimpleDocValuesFormat going to replace PerDocConsumer and DocValuesFormat?

          That is the idea: once we are happy with the APIs, we would implement the 4.0 ones with them.

          are you going to remove hasArray/getArray?

          I don't care about this. I am unsure similarity impls should be calling it, though; at the very least
          it would be better for them to fall back. I just can't bring myself to fix it until LUCENE-3862 is fixed.

          will there still be a direct=true|false option at load time or will it depend on the format impl (potentially with a PerFieldPerDocProducer, similarly to the postings formats)?

          I don't want to change this in the branch. Personally, I feel that a codec/SegmentReader/etc. should generally only manage
          the direct case, with the producer exposing the same "stats" (minimum, maximum, fixed, whatever) that the consumer APIs get (which will also make merging more efficient!). The default Source impl can be something nice: read the direct impl into packed ints,
          and so on. A codec could override that to, e.g., just slurp its on-disk packed ints in directly. So the codec still has control
          of the in-memory RAM representation, and I think this is important. But I think the codec and SegmentReader should somehow not
          be in control of caching: that should live elsewhere (FieldCache.DOCVALUES.xxx?)...

          Adrien Grand added a comment -

          About the current SimpleDVConsumer API: NumericDocValuesConsumer addNumericField(FieldInfo field, long minValue, long maxValue) only allows for compression based on the width of the range, but there are many other ways to compress values (maybe there are very few unique values, maybe the last 3 bits are always the same, etc.). Now that we buffer values into memory, what would you think of changing the API to pass an iterable of longs instead, so that the doc values format can make better decisions?
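
          To make the idea concrete, here is a hedged sketch of what a format could do with such an iterable. All the names here (StatsSketchingConsumer, writeTable, writeDeltas) are made up for illustration; only the two-pass pattern is the point:

          import java.io.IOException;
          import java.util.HashSet;
          import java.util.Set;

          abstract class StatsSketchingConsumer {
            void addNumericField(Iterable<Long> values) throws IOException {
              long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
              Set<Long> distinct = new HashSet<Long>();
              for (long v : values) {            // pass 1: gather stats
                min = Math.min(min, v);
                max = Math.max(max, v);
                if (distinct.size() <= 256) {    // stop tracking once a table is pointless
                  distinct.add(v);
                }
              }
              if (distinct.size() <= 256) {
                writeTable(distinct, values);    // pass 2a: tiny dictionary + small ords
              } else {
                writeDeltas(min, max, values);   // pass 2b: bit-packed deltas from min
              }
            }

            abstract void writeTable(Set<Long> table, Iterable<Long> values) throws IOException;

            abstract void writeDeltas(long min, long max, Iterable<Long> values) throws IOException;
          }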

          Michael McCandless added a comment -

          I like that idea! So codec could iterate once, gathering whatever stats it needs, and then iterate again to do the writing. Should we not include the long min/maxValues then...?

          Adrien Grand added a comment -

          So codec could iterate once, gathering whatever stats it needs, and then iterate again to do the writing.

          Yep.

          Should we not include the long min/maxValues then...?

          I think so? And we could do something similar with addBinaryField and addSortedField. I think there are many possible optimizations based on how much lengths vary, whether BytesRefs share prefixes or not, etc.

          Simon Willnauer added a comment -

          FYI - I just added simple numeric & binary impls for Lucene41; the test demo passes for Lucene41 except for the sorted test (no impl yet).

          Robert Muir added a comment -

          What would the flush/merge API look like?
          Would it get simpler or more complicated?
          Could we still require certain stats from the producer, so that we can have a default, efficient in-RAM Source impl?

          I think there are many possible optimizations based on how much lengths vary, whether BytesRefs share prefixes or not, etc.

          Maybe, but arguably we should do the simplest possible thing that can work given the codecs we have today. When
          designing these APIs, to me these are the only ones that exist...

          Adrien Grand added a comment -

          The API could look like:

          abstract class DocValuesConsumer {

            // add all values in a single call
            abstract void addNumericField(FieldInfo field, Collection<Long> values);     // values.size() == numDocs
            abstract void addBinaryField(FieldInfo field, Collection<BytesRef> values);  // values.size() == numDocs
            abstract void addSortedField(FieldInfo field, Collection<BytesRef> ordToValue,
                                         Collection<Long> docToOrd);                     // docToOrd.size() == numDocs, ordToValue.size() == valueCount

            // same merge API

          }

          I don't see why merging would be more complicated, but maybe I'm missing something? The default merge impls would need to use lazy collection impls in order to remain memory-efficient.
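
          (A hedged sketch of such a lazy view, with hypothetical names; a real merge impl would also have to remap around deleted docs.)

          import java.util.AbstractCollection;
          import java.util.Collection;
          import java.util.Iterator;

          // Concatenates two per-segment value views without materializing them.
          final class ConcatValues extends AbstractCollection<Long> {
            private final Collection<Long> a, b;

            ConcatValues(Collection<Long> a, Collection<Long> b) { this.a = a; this.b = b; }

            @Override public int size() { return a.size() + b.size(); }

            @Override public Iterator<Long> iterator() {
              final Iterator<Long> ia = a.iterator(), ib = b.iterator();
              return new Iterator<Long>() {
                public boolean hasNext() { return ia.hasNext() || ib.hasNext(); }
                public Long next() { return ia.hasNext() ? ia.next() : ib.next(); }
                public void remove() { throw new UnsupportedOperationException(); }
              };
            }
          }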

          Could we still require certain stats from the Producer, so that we can have a default, efficient in-RAM Source impl?

          Why would we need stats to make a Source impl efficient?

          maybe, but arguably we should do the simplest possible thing that can work given the codecs we have today

          To me it looks as simple as the current API.

          Robert Muir added a comment -

          1. We don't need to be passing numDocs today: it should be removed because it duplicates SegmentWriteState.

          2. Merging becomes more complex because the API requires this collection view: today it does not.

          3. We need stats to make the Source impl efficient so we can e.g. use packed integers for the number type. Bye bye arrays.

          I don't think we should make anything more flexible than necessary until things like the producer API are sorted out.
          Otherwise it's difficult to see the tradeoffs.

          Adrien Grand added a comment -

          I don't think we should make anything more flexible than necessary until things like the producer api are sorted out.

          I can wait for the producer API. I just wanted to point out that by exposing all the values at once, the DocValuesFormat could make more clever choices than by just exposing the width of the range of values.

          Michael McCandless added a comment -

          I committed an initial attempt at a simpler producer API ... it's rough!! But at least TestDemoDocValue passes w/ SimpleText.

          I moved SimpleText's "loaded into RAM" wrappers up into SimpleDVProducer; this way a codec only has to impl the direct source, and can impl an in-RAM source if it wants to.

          SegmentCoreReaders now does the caching of in-RAM sources.

          Simon, I temporarily disabled the Lucene41DV producer ... I'll go get it working too ...

          Simon Willnauer added a comment -

          Hey folks,

          I looked at the branch, and I want to suggest we move a little slower here; we are doing too many things at once. For example, I really don't like the trend of making FieldCache the single source for caching. FieldCache has many problems in my opinion: it uses the DEFAULT singleton, and it has a single way of caching things per reader, while some users might want different access to DV; in ES we don't use FieldCache at all, for many reasons. I think we are going in the right direction here, and I do see why we should merge the interfaces and expose un-inverted fields via the new DV interface - nice! But exposing everything through FC is a no-go IMO; hiding it behind FC is no good.
          I also don't like the way "in-RAM" DV are exposed. I don't think we should have newRAMInstance() on the interface; let's keep the interface clean and not mix in how the values are represented. I'd rather vote for dedicated producers, or SimpleDocValuesProducer#getNumericDocValues(boolean inMemory); then we can still do caching on top. The producer should really be simple and shouldn't do caching. We could also separate the default in-memory impls into a simple helper class with methods like static NumericDocValues load(NumericDocValues directDocValues).
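
          A minimal sketch of that helper, under stated assumptions: the NumericDocValues interface is stubbed here (the branch API was still in flux), values are assumed non-negative, and only PackedInts is real Lucene API:

          import org.apache.lucene.util.packed.PackedInts;

          final class DocValuesLoader {
            // stand-in for whatever per-document accessor the branch settles on
            interface NumericDocValues { long get(int docID); }

            static NumericDocValues load(NumericDocValues direct, int maxDoc) {
              long max = 0;
              for (int i = 0; i < maxDoc; i++) {   // pass 1: find the widest value
                max = Math.max(max, direct.get(i));
              }
              final PackedInts.Mutable packed = PackedInts.getMutable(
                  maxDoc, PackedInts.bitsRequired(max), PackedInts.COMPACT);
              for (int i = 0; i < maxDoc; i++) {   // pass 2: copy into RAM
                packed.set(i, direct.get(i));
              }
              return new NumericDocValues() {
                public long get(int docID) { return packed.get(docID); }
              };
            }
          }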

          Robert Muir added a comment -

          If you don't like FieldCache in ES, then use something else.

          It's really broken that things like grouping and faceting are coded to a separate API. This makes DV unsuccessful. If we aren't going to fix the DocValues and faceting APIs, then perhaps we should consider removing DocValues completely from Lucene.

          Please try to give the branch some time. I committed some work very late last night, got tired, and went to sleep. It's not "ready" and I'm not threatening to commit to trunk.

          I created this branch to develop publicly. I don't have to do that: I can develop privately instead and not deal with complaints about my every step before I even say things are ready. I just thought it would make it easier for other people to help.

          Simon Willnauer added a comment -

          It's really broken that things like grouping and faceting are coded to a separate API. This makes DV unsuccessful.

          Don't get me wrong, I think this adds a lot of value. BUT if the only way to get a cached DV instance is FC, then this entire thing is inconsistent. We don't ask people to cache a TermsEnum, which can be heavy if you use MemoryPostings, for instance. Neither do we for stored fields, nor do we have a caching layer that is not used by default. All I am arguing for is that if somebody wants to use DV, it should be simple to do so. The distinction between in-memory and on-disk should not live in FC.

          If we aren't going to fix the DocValues and faceting APIs, then perhaps we should consider removing DocValues completely from Lucene.

          You mean remove FieldCache?

          Simon Willnauer added a comment -

          Just an idea, though... while we are on it, should we maybe add a 4th type that allows multiple values? That way we can just pull DV from any field and uninvert if needed.

          Shai Erera added a comment -

          while we are on it should we maybe add a 4th type that allows multiple values

          +1. That might allow using DVs for faceted search.

          Robert Muir added a comment -

          I don't think this is necessary. Someone can do this "on top" of a binary impl themselves.

          Simon Willnauer added a comment -

          Hey folks,

          I thought about the in-RAM vs. on-disk distinction we have in DV at this point and how we distinguish it API-wise. IMO, calling $TypeDocValues#newRAMInstance() with different behavior when the instance is already a RAM instance is kind of ugly API-wise, and a binary distinction here might not be sufficient anyway. From my point of view, it would logically make the most sense to let the codec decide whether values live in RAM, or whether only part of them are in memory, like in the sorted case where you might want to use an FST holding a subset of the values. Now, giving the control entirely to the codec might not be practical: think about merging, where you really don't want to load into memory; you should be able to say "don't pull into memory". We can already do this today if we pass in IOContext. Yet IOContext is the wrong level, since it is a reader-wide setting and might not hold for all fields when we open a reader for handling searches. Still, the idea of IOContext is basically to pass information about the access pattern, where "merge" means sequential access. We might want something similar for docvalues that leaves most of the decisions to the codec, but if a user decides he really needs stuff in memory, he can still pass in something like AccessPattern.SEQUENTIAL and load the values into an auxiliary data structure himself. This would let the codec optimize under the hood while making no promises about RAM vs. disk when AccessPattern.DEFAULT is passed.
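
          A hypothetical sketch of that hint (none of these names exist in Lucene; NumericDocValues is stubbed for illustration):

          // The caller states its access intent; the producer decides representation.
          enum AccessPattern { DEFAULT, SEQUENTIAL /* e.g. merging */, RANDOM /* e.g. sorting */ }

          interface NumericDocValues { long get(int docID); }

          interface HintedDocValuesProducer {
            // With DEFAULT the codec promises nothing about RAM vs. disk; with
            // SEQUENTIAL it can stream straight off disk and skip any caching.
            NumericDocValues getNumeric(String field, AccessPattern hint);
          }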

          Michael McCandless added a comment -

          I think the multi-valued case could be compelling, but we should probably do it later / outside this branch. E.g. FieldCache already supports this (DocTermOrds). It's true that an app could do this on top of binary DV, but I think it's useful enough that a real impl would be worthwhile (e.g. for facets).

          Michael McCandless added a comment -

          I think letting the codec control in-RAM vs on-disk is a great idea!

          Why not let merging load values into RAM if your DVFormat is a RAM-backed impl? The codec can always override merging if it wants to ...

          Simon Willnauer added a comment -

          I think letting the codec control in-RAM vs on-disk is a great idea!

          Actually, that is not what I was saying, and I strongly discourage requiring people to make RAM vs. on-disk decisions ahead of time. Most of those decisions need to be made dynamically, based on RAM availability and growth.

          What I was saying is that the user should communicate its intent so the codec can optimize.

          Michael McCandless added a comment -

          I think letting the codec control in-RAM vs on-disk is a great idea!

          actually that is not what I was saying and I strongly discourage that we require people to make ram vs. on disk decisions ahead of time.

          I think this is actually a clean way to do it, and it matches what we
          do with other codec parts. E.g. with postings you pick MemoryPF if you
          have the free RAM and want fast lookups for that field; otherwise you pick
          an on-disk postings format.

          Most of those decisions need to be made dynamically based on ram availability and growth.

          I think making dynamic decisions based on RAM availability and growth
          is a more expert use case; e.g. in Lucene today we don't give you that:
          deleted docs, norms, field cache entries, doc values (if you sort by
          them), and the terms index are all loaded into RAM. So the only control users
          have now is which fields they index/sort on...

          If we give control to the codec over whether the DV format is in RAM
          or on disk or something in between (like the terms index), and we make
          a PerFieldDVFormat so you can easily switch impls by field, then users
          can make the decisions themselves, field by field.

          If a given field will be used for sorting or faceting, they can use
          the fast RAM-based format, but if they are tight on RAM and have lots
          of scoring factors, maybe they use the disk-based impl for those fields.

          If an expert app really needs to pick and choose RAM vs. disk dynamically,
          depending on how many other indices are open, how much RAM they are
          using, etc., it can always make a custom DV format ...
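
          For reference, a hedged sketch of that per-field hook as it eventually shipped in the 4.2 codec; the field name and the "Disk" format choice are illustrative, and format names should be checked against what your version actually registers:

          import org.apache.lucene.codecs.Codec;
          import org.apache.lucene.codecs.DocValuesFormat;
          import org.apache.lucene.codecs.lucene42.Lucene42Codec;

          final class PerFieldDVExample {
            static Codec perFieldCodec() {
              return new Lucene42Codec() {
                @Override
                public DocValuesFormat getDocValuesFormatForField(String field) {
                  // hypothetical: keep a rarely-read scoring factor on disk,
                  // and use the codec default for everything else
                  return "rarely_used_factor".equals(field)
                      ? DocValuesFormat.forName("Disk")
                      : super.getDocValuesFormatForField(field);
                }
              };
            }
          }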

          Simon Willnauer added a comment -

          If an expert app really needs to pick and choose RAM vs. disk dynamically, depending on how many other indices are open, how much RAM they are using, etc., it can always make a custom DV format ...

          What I am worried about is the lack of communication between the app and the codec; something like this is going to be a major hassle. All I am asking for is to pass "hints" to the codec about what I need at a certain point, per field. We can't do this today, and I don't think we should accept that. It's an encoding/decoding layer and it should be simple. Pushing what you call "experts" to write their own codecs is a major trap, I think: writing a codec is a last resort and causes major trouble for non-Lucene devs IMO. This is expert^expert.

          I really like the idea of per-field DV and I think we should do it. I am just not a big fan of making up-front decisions for this stuff when it comes to on-disk vs. RAM. PostingsFormat is a different story: the on-disk (low RAM usage) formats have such good performance characteristics that you very unlikely need something else that uses lots of RAM. For sorting, grouping, or scoring you will certainly need that.

          Simon Willnauer added a comment -

          One way of merging the two approaches would be a simple boolean that forces on-disk access. That way we can solve the following problems:

          • default merge impls: a format doesn't need to override merge if it is an in-RAM-by-default impl
          • people can build auxiliary structures on top without worrying about RAM
          • decisions about which in-memory or on-disk structure is used by default stay with the codec

          This way we can keep the API clean and still support expert users. We could even make this package-private so that "normal" users won't see it?

          Robert Muir added a comment -

          This is sounding too complicated.
          I think it sounds OK to remove the RAM/on-disk distinction and just have the codec decide.
          If someone wants to do expert stuff in their codec that's fine, but we don't need it in our impls or abstract APIs.

          Let's start with just getting the basics working here and keeping the integration with the rest of Lucene simple.

          Today (in trunk) on-disk access is a pipe dream (thus, this issue) because the codec API is responsible for too much.

          Expert users and flexibility should be our last priority.

          Simon Willnauer added a comment - edited

          This is sounding too complicated.

          A single boolean is too complicated? All I ask for is a way to prevent loading into RAM when it's not necessary. We had this in 4.0 and I think we should make it work in 4.1 too. Remember, this is a different use-case than postings. I really don't think I'm asking for much here.

          Robert Muir added a comment -

          a single boolean is too complicated?

          I think it is; I feel like it really confuses the API and makes writing codecs harder.

          I think it would be better if the codec impl determined this, just like MemoryPostings and so on.
          So I'd rather have a per-field DV wrapper that configures this.

          For example, someone would use a different implementation for their Solr _version_ field than they
          would use for a scoring factor, and maybe a different implementation for a sort field than a faceting one.

          I don't think there is a use case for accessing a single field's values both from RAM and on disk,
          with the codec having to deal with that. It makes things very complicated currently.

          We had this in 4.0 and I think we should make it work in 4.1 too.

          I don't think that's necessarily true. In 4.0 the one DV impl we had could do a lot, but the codec API is
          very difficult. I actually contributed to a lot of the codec APIs in Lucene, and as a committer I was unable
          to figure out how to write a working DV impl against this API. I think this says a lot.

          I'd rather have a simpler codec API that enables innovation, so that we can see cool shit in the future,
          like implementations geared at sorting and faceting that use less RAM, and so on.

          If someone really needs more fine-grained control than the per-field codec API, then there are other ways to achieve
          that: FileSwitchDirectory, adding such APIs to their own codec, etc. But I'm not sure it's mainstream enough that it should
          be required of all codecs.

          David Smiley added a comment -

          I just got an error at search time fetching DocValues in 4.0.0:

          SEVERE: null:java.lang.ArrayIndexOutOfBoundsException
                  at org.apache.lucene.util.PagedBytes$Reader.fillSlice(PagedBytes.java:97)
                  at org.apache.lucene.codecs.lucene40.values.VarStraightBytesImpl$VarStraightSource.getBytes(VarStraightBytesImpl.java:273)
          

          I pass in a scratch BytesRef, and give a docId local to the segment.

          Could it be related to this issue? This bug is freaking me out a bit as I may be forced to abandon DocValues.

          Robert Muir added a comment -

          David: I'm not sure; the problem I opened this issue for actually relates to things like merging.

          Can you do a manual bounds check? Docvalues at read time doesn't have explicit checks (an IndexOutOfBoundsException is expected/best-effort if you pass a wrong docid):

              if (docID < 0 || docID >= reader.maxDoc()) {
                // out of bounds: the docID passed in is wrong
              }

          Otherwise, if that's not the problem, do you have a test or something you could upload to an issue?

          Michael McCandless added a comment -

          Just a quick recap on where things stand on the branch:

          • We have the DV 2.0 API, shadowing the DV 1.0 API.
          • We have one codec (SimpleText) that implements it and passes tests.
          • CheckIndex does basic tests of DV 2.0, and we also have
            TestDemoDocValue, but nothing else is cut over yet.
          • The Lucene41 codec's impl is close, I think, but was failing some tests
            (not sure why yet).
          • We have a MemoryDV but it's very RAM-inefficient for now.
          • We have a Norms 2.0 API too, shadowing the current norms; only
            SimpleText implements it (but it should be easy to get Lucene41 to
            impl it too).
          • We need to cut over all uses/tests of DV 1.0 / norms 1.0 and then
            remove the DV/norms 1.0 shadow code.
          • There are still tons and tons of nocommits ...
          Robert Muir added a comment -

          Patch that applies to trunk r1442822.

          I think this is close: jenkins-blasted, benchmarked, beefed-up existing tests and added many new ones, several codecs with different tradeoffs, per-field configuration, 4.0 file-format compatibility, a more efficient 4.2 format, and so on.

          Adrien Grand added a comment -

          Big +1!

          I especially like the fact that doc values

          • now use the same interface as FieldCache,
          • can be configured on a per-field basis thanks to a SPI,
          • are first buffered in memory (with RAM accounting) so that the codec has more opportunities to perform optimizations.
          Commit Tag Bot added a comment -

          [trunk commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1443717

          LUCENE-4547: DocValues improvements

          Commit Tag Bot added a comment -

          [branch_4x commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1443834

          LUCENE-4547: DocValues improvements

          Markus Jelsma added a comment -

          Revision 1443717 breaks custom collector code such as:

          int[] ints = FieldCache.DEFAULT.getInts(context.reader(), this.fieldName, false);
          

          Do you suggest we keep the returned Ints instance? But where is the concrete Ints class? I can't seem to find it.

          Thanks

          Adrien Grand added a comment -

          where is the concrete Ints class?

          This class is an inner class of FieldCache: oal.search.FieldCache.Ints. You should then be able to fix your code by replacing ints[i] with ints.get(i).
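
          A minimal before/after sketch of the cutover (the collector context and field name are illustrative; only the getInts call and Ints.get match the new API):

          import java.io.IOException;
          import org.apache.lucene.index.AtomicReaderContext;
          import org.apache.lucene.search.FieldCache;

          final class IntsCutover {
            // before: int[] ints = FieldCache.DEFAULT.getInts(...); ... ints[docID]
            // after:
            static int valueFor(AtomicReaderContext context, String field, int docID) throws IOException {
              FieldCache.Ints ints = FieldCache.DEFAULT.getInts(context.reader(), field, false);
              return ints.get(docID);
            }
          }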

          Markus Jelsma added a comment -

          Oh, I see: FieldCacheImpl returns an Ints whose get(int docID) is overridden.

          Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee: Unassigned
            • Reporter: Robert Muir
            • Votes: 0
            • Watchers: 9